Forum Stats

  • 3,874,151 Users
  • 2,266,674 Discussions
  • 7,911,744 Comments

Discussions

How to determine Selectivity when using character arithmetic: Academic question

perfdba
perfdba Member Posts: 150
edited Apr 10, 2017 5:56AM in SQL & PL/SQL

Hi all,

i am brushing through selectivity basics by going through various blogs & presentation on selectivity and have this question ,academic only,about range arithmetic with characters.

(i didnt find any other blog on exactly same subject so posting here)

DDL: i created the table and collected statistics without histograms.

create table t1 as
with v1 as (select rownum n from dual connect by level <= 1e4)
select
rownum id,
mod(rownum-1,200) mod_200,
trunc(dbms_random.value(0,300)) rand_300,
mod(rownum-1,10000) mod_10000,
trunc(sqrt(dbms_random.value(0,75000))) sqrt,
trunc(sysdate) +
trunc(dbms_random.value(0,1000)) date_1000,
dbms_random.string('l',6) alpha_06,
dbms_random.string('l',20) alpha_20
from v1, v1
where
rownum <= 1e6;

SQL> explain plan for select * from t1 where alpha_06 like ‘mm%’;

Explained.

SQL> @pln

PLAN_TABLE_OUTPUT
————————————————————————————————————————————————————————————————————————————————————————————————————
Plan hash value: 3617692013

————————————————————————–
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
————————————————————————–
| 0 | SELECT STATEMENT | | 157 | 8792 | 1669 (1)| 00:00:21 |
|* 1 | TABLE ACCESS FULL| T1 | 157 | 8792 | 1669 (1)| 00:00:21 |
————————————————————————–

Predicate Information (identified by operation id):
—————————————————

  1 – filter(“ALPHA_06” LIKE ‘mm%’)

Questino. How is number 157 derived exactly. When i do the actual count it returns 1529.

  COUNT(*)
———-
1529

I tried various calculations but none seem to get closer to 157. Just one that 26(number of letters) *6 =156 but it is not making sense w.r.t. explain plan

DB version:11.2.0.4.0

Regds,
Sachin

Mustafa_KALAYCI

Best Answer

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,116 Blue Diamond
    edited Apr 8, 2017 4:08AM Answer ✓

    The DDL looks like one of mine - was it from one of my blogs, or from one of the presentations that I've given on selectivity, and somewhere online I'm fairly sure I've described the method.

    The selectivity calculation is the standard "range you want" divided by "total range".

    If you did a dbms_random.seed(0) that usually goes into the scripts I publish (and depending on which version you're running on) then the low and high values are probably very close to 'aaaa??' and 'zzz???' (which the '?' means unknown, not literally a question mark).

    As Andrew said, the calculation is the same as that for:  column >= 'mm' and column < 'mn' - i.e. using the next possible value for the 2nd character.

    For a very good approximation calculate the numberic values that would represent the min, the max, 'mm' and 'mn' if they were the endpoint values stored for a histogram - the algorithm is here: https://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/   (I've assumed you've done a 100% sample so the min and max would be stored as the low and high in user_tab_columns - if not then translate the low and high back to their ascii equivalents).

    For example, if min = 'aaaaud' and max = 'zzzypj' then you get:

    select

            (568191422636698000000000000000000000 - 568171140227094000000000000000000000)  /

            (635944374132308000000000000000000000-505627904303159000000000000000000000)     sel0

    from dual

    ;

      SEL0

    ----------

    .00015564

    Add 1/num_distinct  (998848 in my case) because one end of the range is bounded, giving: 1.5664115332863458704427500480554e-4

    Multiply by num_rows in table (1,000,000) and round (up) -> 157 Q.E.D.

    Regards

    Jonathan Lewis

«1

Answers

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Apr 7, 2017 2:00PM

    Never tried to calculate it but you could...

    Consider that your predicate can be rewritten

    alpha_06 like 'mm%'

    is equiv to

    alpha_06 >= 'mm' And alpha_06 < 'mn'

    Remember that any valid character could be used in this range (a-z,A-Z,0-9 all other characters..)

    You have a range of values that could fit in that gap, you also have a range of low and high values of your table's column stat.

    That might be enough to get you started, I dont have anything in front of me to work with so I can't follow through with some actual numbers, but you should get an idea of where to start.

    -edit

    Converting the range into raw (like what the low and high value stats would use) will probably be a good idea. You can convert the hex to decimal and do math with that.

  • Mustafa_KALAYCI
    Mustafa_KALAYCI Member Posts: 3,405 Bronze Crown
    edited Apr 7, 2017 2:07PM

    could it be dynamic sampling? was there a "Note - dynamic sampling used for this statement (level=2)" kind of statement in execution plan?

    when I run your code, my execution plan indicates 1774 rows and the actual one was 1465.

  • perfdba
    perfdba Member Posts: 150
    edited Apr 7, 2017 2:22PM

    hi Andrew,

    Thanks for the reply . i did the steps you mentioned already , unfortunately numbers dont match up with what i see from the explain plan.

  • perfdba
    perfdba Member Posts: 150
    edited Apr 7, 2017 2:32PM

    hi Mustafa,

    It is not dynamic sampling. There is no note.

    optimizer_dynamic_sampling is set to 2 already in the DB.

    when you run the code it is in which version and do you see dynamic sampling the notes section?

    Just one note about the dynamic sampling in the DB where i am testing. Only when i set the optimizer_dynamic_sampling does the sampling kick in and change the estimated row to something different.

    SQL> alter session set optimizer_dynamic_sampling=11;

    Session altered.

    SQL> explain plan for select *FROM t1 where ALPHA_06 like 'mm%';

    Explained.

    SQL> @pln

    PLAN_TABLE_OUTPUT

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    Plan hash value: 3617692013

    --------------------------------------------------------------------------

    | Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |

    --------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |      |  1750 | 98000 |  1669   (1)| 00:00:21 |

    |*  1 |  TABLE ACCESS FULL| T1   |  1750 | 98000 |  1669   (1)| 00:00:21 |

    --------------------------------------------------------------------------

    Predicate Information (identified by operation id):

    ---------------------------------------------------

       1 - filter("ALPHA_06" LIKE 'mm%')

    Note

    -----

       - dynamic sampling used for this statement (level=11)

  • Mustafa_KALAYCI
    Mustafa_KALAYCI Member Posts: 3,405 Bronze Crown
    edited Apr 7, 2017 3:35PM

    hi,

    my db is 11.2.0.1

    show parameter dynamic;NAME                       TYPE    VALUE -------------------------- ------- ----- optimizer_dynamic_sampling integer 2 create table t1 aswith v1 as (select rownum n from dual connect by level <= 1e4)selectrownum id,mod(rownum-1,200) mod_200,trunc(dbms_random.value(0,300)) rand_300,mod(rownum-1,10000) mod_10000,trunc(sqrt(dbms_random.value(0,75000))) sqrt,trunc(sysdate) +trunc(dbms_random.value(0,1000)) date_1000,dbms_random.string('l',6) alpha_06,dbms_random.string('l',20) alpha_20from v1, v1whererownum <= 1e6;explain plan for select * from t1 where alpha_06 like 'mm%';Plan FOR succeeded.Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |  1501 |  5977K|  2357   (2)| 00:00:29 ||*  1 |  TABLE ACCESS FULL| T1   |  1501 |  5977K|  2357   (2)| 00:00:29 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ALPHA_06" LIKE 'mm%')Note-----   - dynamic sampling used for this statement (level=2)select count(8) from t1 where alpha_06 like 'mm%';1476

    I did not change any default settings in that db and by default level 2 dynamic sampling, quite accurate results.

  • perfdba
    perfdba Member Posts: 150
    edited Apr 7, 2017 3:44PM

    What do you see after collecting statistics without histograms?

  • Mustafa_KALAYCI
    Mustafa_KALAYCI Member Posts: 3,405 Bronze Crown
    edited Apr 7, 2017 3:49PM

    hi,

    here:

    exec dbms_stats.gather_table_stats(user, 'T1');PL/SQL procedure successfully completed.explain plan for select * from t1 where alpha_06 like 'mm%';Plan FOR succeeded.select * from table(dbms_xplan.display);Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |  1140 | 63840 |  2360   (2)| 00:00:29 ||*  1 |  TABLE ACCESS FULL| T1   |  1140 | 63840 |  2360   (2)| 00:00:29 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ALPHA_06" LIKE 'mm%')

    there is a change after gathering statistics by default parameter.

    so I also gather statistics by adding estimated_percent and cascade (there is no index on table) and result is more accurate again:

    exec dbms_stats.gather_table_stats(user, 'T1', estimate_percent=>100, cascade => true);PL/SQL procedure successfully completed.explain plan for select * from t1 where alpha_06 like 'mm%';Plan FOR succeeded.select * from table(dbms_xplan.display);Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |  1589 | 88984 |  2360   (2)| 00:00:29 ||*  1 |  TABLE ACCESS FULL| T1   |  1589 | 88984 |  2360   (2)| 00:00:29 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ALPHA_06" LIKE 'mm%')
  • perfdba
    perfdba Member Posts: 150
    edited Apr 7, 2017 4:16PM

    the way you are gathering statistics histograms will get gathered, by default,  for reference https://blogs.oracle.com/optimizer/entry/how_does_the_method_opt

    Please user the below command and let me know what you see for explain plan results.

    exec dbms_stats.gather_table_stats(user, 'T1',method_opt=>'FOR TABLE FOR ALL COLUMNS SIZE 1');

  • Mustafa_KALAYCI
    Mustafa_KALAYCI Member Posts: 3,405 Bronze Crown
    edited Apr 7, 2017 4:27PM

    sorry, my mistake. here it is and here 157:

    exec dbms_stats.gather_table_stats(user, 'T1', method_opt=>'FOR TABLE FOR ALL COLUMNS SIZE 1');PL/SQL procedure successfully completed.explain plan for select * from t1 where alpha_06 like 'mm%';Plan FOR succeeded.select * from table(dbms_xplan.display);Plan hash value: 3617692013--------------------------------------------------------------------------| Id  | Operation         | Name | Rows  | Bytes | Cost (%CPU)| Time     |--------------------------------------------------------------------------|   0 | SELECT STATEMENT  |      |   157 |  8792 |  2360   (2)| 00:00:29 ||*  1 |  TABLE ACCESS FULL| T1   |   157 |  8792 |  2360   (2)| 00:00:29 |--------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------   1 - filter("ALPHA_06" LIKE 'mm%')

    also same result if I use 100% estimate percent.

  • mathguy
    mathguy Member Posts: 10,895 Black Diamond
    edited Apr 8, 2017 12:12AM

    The optimizer sees the first two letters are between 'aa' and 'zz'. But the optimizer doesn't know there are letters only (unless it has histograms available); the first letter is between 'a' and 'z', but when we go to the first two characters, things like 'm3' and 'm*' and 'mG' are between 'lz' and 'mz', not just 'mc' and 'mm' and 'mr'.

    The count is 25 * 256 + 26 = 6400 + 26 = 6426  (it is NOT 26 * 256, because it is known that 'a3', for example, is not possible - it is not between 'aa' and 'zz' - and similarly 'z~' is not possible because it is greater than 'zz'). I am using the fact that digits come before letters, and ~ comes after letters, in ASCII and therefore in all or almost all character sets.

    1 million divided by 6426 is approx. 156.6. This is close enough; perhaps only 255 characters are valid (the "nul" - character 0 in ASCII - may not be valid) so you get 25*255 + 26 = 6401, and 1 million divided by 6401 is 156.23 - round it up and you get 157.

    I believe this is pretty much what Andrew said in Reply 1.

    Mustafa_KALAYCIperfdba
This discussion has been closed.