
Extended statistics issue

RobK
RobK Member Posts: 99 Blue Ribbon
edited Jan 10, 2014 9:31AM in General Database Discussions

Hello!

I have a problem with extended statistics on 11.2.0.3.

Here is the script I run:

drop table col_stats;

create table col_stats as
select 1 a, 2 b
from dual
connect by level<=100000;

insert into col_stats (
select 2, 1
from dual
connect by level<=100000);

-- check the a,b distribution
select a, b, count(1) from col_stats group by a, b;

         A          B   COUNT(1)
---------- ---------- ----------
         2          1     100000
         1          2     100000

-- extended stats DEFINITION
select dbms_stats.create_extended_stats('A','COL_STATS','(A,B)') name
from dual;

-- set estimate_percent to 100%
EXEC dbms_stats.SET_TABLE_prefs ('A','COL_STATS','ESTIMATE_PERCENT',100);

-- check the changes
select dbms_stats.get_prefs ('ESTIMATE_PERCENT','A','COL_STATS')
from dual;

-- NOW GATHER COLUMN STATS
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
    OWNNAME    => 'A',
    TABNAME    => 'COL_STATS',
    METHOD_OPT => 'FOR ALL COLUMNS' );
END;
/

set autotrace traceonly explain

select * from col_stats where a=1 and b=1;

SQL> select * from col_stats where a=1 and b=1;

Execution Plan

----------------------------------------------------------

Plan hash value: 1829175627

-------------------------------------------------------------------------------

| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

-------------------------------------------------------------------------------

|   0 | SELECT STATEMENT  |           | 50000 |   683K|   177 (2)| 00:00:03 |

|*  1 |  TABLE ACCESS FULL| COL_STATS | 50000 |   683K|   177 (2)| 00:00:03 |

-------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   1 - filter("A"=1 AND "B"=1)

How come the optimizer expects 50000 rows?

Thanks in advance.

Rob


Answers

  • Martin Preiss
    Martin Preiss Member Posts: 2,381 Gold Trophy
    Accepted Answer

    I think that's a result of Oracle's strategy for calculating cardinalities with frequency histograms when there are only a few different known values: "If the value you supply does not appear in the histogram, but is inside the low/high range of the histogram then the cardinality will be half the cardinality of the least frequently occurring value that is in the histogram." Histogram change | Oracle Scratchpad.

    In your case there are two known values, each with 100K rows - so they are also the least frequently occurring values, and the calculation therefore gives 100K/2 = 50K.
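
    You can see what the histogram on the column group actually contains by querying the dictionary. A minimal sketch (assuming, as in the original script, that the table and the extension live in schema A and that you are connected as that user):

    -- name of the column-group virtual column created by create_extended_stats
    select extension_name, extension
    from   user_stat_extensions
    where  table_name = 'COL_STATS';

    -- its frequency histogram: two endpoints, 100,000 rows each
    select column_name, endpoint_number, endpoint_value
    from   user_tab_histograms
    where  table_name = 'COL_STATS'
    and    column_name like 'SYS_STU%'
    order  by endpoint_number;

    With only two captured values of 100,000 rows each, the least frequent value accounts for 100,000 rows, so an unseen combination that falls inside the low/high range is costed at half of that: 50,000.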

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown

    Martin,

    Absolutely spot on.

    There's not a lot you can do about this situation if you want to query for a value that probably doesn't exist when the only other values have very high frequencies - you're reduced to faking the statistics programmatically.

    This example introduces an interesting problem - how do you know which combinations are inside the range (and give the "half the least common" selectivity) and which are outside the range?  If the OP changes the initial (1,2) to (1,1) and then queries for (1,2) he'll find that the cardinality is one (because of the relationships between the hashed values of the three combinations).

    The relevant function to generate values is sys_op_combined_hash()
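
    To see those relationships you can simply list the hash values for the three combinations - a small sketch (note that, as it turns out later in the thread, the values actually stored in the histogram are these hashes reduced by mod(..., 9999999999)):

    select sys_op_combined_hash(1,1) hash_1_1,
           sys_op_combined_hash(1,2) hash_1_2,
           sys_op_combined_hash(2,1) hash_2_1
    from   dual;

    Comparing them (after the mod) with the endpoint values recorded for the column group tells you whether a missing combination falls inside or outside the histogram's low/high range.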

    Regards

    Jonathan Lewis

  • RobK
    RobK Member Posts: 99 Blue Ribbon

    Thanks Martin and Jonathan!

    I am honored that you spent time on my question.

    A few related thoughts:

    - I created this test case as a demo to show how useful extended statistics are. In this case having no stats at all is better, as dynamic sampling does a much better job.

    - This example is a bit unusual because I changed the stats collection defaults by setting estimate_percent to 100%. In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table.

    - Could you please also show in more detail how sys_op_combined_hash() could be used? Should I fake statistics with it? How?

    Thanks,

    Rob.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown

    select sys_op_combined_hash(1,2) from dual;

    compare with the contents of the histogram (user_tab_histograms) for the column and you'll see how to generate values.

    Then read my blog about faking frequency histograms for the type of code you need: Frequency Histograms | Oracle Scratchpad
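
    A rough sketch of the kind of code that blog describes, adapted to the column group in this thread (the generated column name is the one shown in the trace output further down; the frequencies, the single fake row for (1,1), and the reuse of the existing density are purely illustrative, and - as noted later in the thread - the values to load are mod(sys_op_combined_hash(...), 9999999999), supplied in ascending order):

    declare
        m_srec     dbms_stats.statrec;
        m_distcnt  number;
        m_density  number;
        m_nullcnt  number;
        m_avgclen  number;
        m_numvals  dbms_stats.numarray;
        h11 number; h12 number; h21 number;
    begin
        -- the values Oracle actually stores for the column group
        select mod(sys_op_combined_hash(1,1), 9999999999),
               mod(sys_op_combined_hash(1,2), 9999999999),
               mod(sys_op_combined_hash(2,1), 9999999999)
        into   h11, h12, h21
        from   dual;

        -- start from the current stats on the virtual column
        dbms_stats.get_column_stats(
            ownname => 'A', tabname => 'COL_STATS',
            colname => 'SYS_STUNA$6DVXJXTP05EH56DTIR0X',
            distcnt => m_distcnt, density => m_density,
            nullcnt => m_nullcnt, srec => m_srec, avgclen => m_avgclen);

        -- three buckets: the two real combinations (100,000 rows each)
        -- plus a "rare" fake entry for (1,1); values in ascending order
        m_numvals     := dbms_stats.numarray(h11, h21, h12);
        m_srec.epc    := 3;
        m_srec.bkvals := dbms_stats.numarray(1, 100000, 100000);
        dbms_stats.prepare_column_values(m_srec, m_numvals);

        dbms_stats.set_column_stats(
            ownname => 'A', tabname => 'COL_STATS',
            colname => 'SYS_STUNA$6DVXJXTP05EH56DTIR0X',
            distcnt => 3, density => m_density,
            nullcnt => 0, srec => m_srec, avgclen => m_avgclen);
    end;
    /

    After this, a query for (1,1) would be costed from the fake frequency histogram rather than the "half the least popular value" fallback - the blog post covers the details, including what density to set.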

    Regards

    Jonathan Lewis

  • Hemant K Chitale
    Hemant K Chitale Member Posts: 15,755 Blue Diamond
    edited Jan 8, 2014 10:10PM

    >In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table

    The values were not in the table when statistics were gathered.  However, Oracle can't confirm or deny whether values were inserted 1 second after the gather_table_stats.  So it has to come up with an estimate for a value not present in the histogram.

    Hemant K Chitale

  • Aliyev Chinar

    RobK wrote:
    
    SQL> select * from col_stats where a=1 and b=1;
    
    
    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 1829175627
    
    
    -------------------------------------------------------------------------------
    | Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
    -------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT  |           | 50000 |   683K|   177 (2)| 00:00:03 |
    |*  1 |  TABLE ACCESS FULL| COL_STATS | 50000 |   683K|   177 (2)| 00:00:03 |
    -------------------------------------------------------------------------------
    
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
    
    
       1 - filter("A"=1 AND "B"=1)
    
    
    How come the optimizer expects 50000 rows?
    
    Thanks in advance.
    Rob
    

    This is expected behavior.

    When you create extended statistics, Oracle creates a histogram for the column group (which is implemented as a virtual column).

    The query predicate "where a=1 and b=1" is effectively an out-of-range (unseen) value for that virtual column. You might expect the optimizer to estimate the selectivity as 0 (so cardinality 1), but in such cases it uses the density of the column - in our case actually the NewDensity of the virtual column.

    Let's look at the following information (this is exactly your case):

    NUM_ROWS
    --------
      200000

    COLUMN_NAME                     NUM_DISTINCT    DENSITY  HISTOGRAM
    ------------------------------  ------------  ---------  ---------
    A                                          2  .00000250  FREQUENCY
    B                                          2  .00000250  FREQUENCY
    SYS_STUNA$6DVXJXTP05EH56DTIR0X             2  .00000250  FREQUENCY

    COLUMN_NAME                     ENDPOINT_NUMBER  ENDPOINT_VALUE
    ------------------------------  ---------------  --------------
    A                                        100000               1
    A                                        200000               2
    B                                        100000               1
    B                                        200000               2
    SYS_STUNA$6DVXJXTP05EH56DTIR0X           100000      1977102303
    SYS_STUNA$6DVXJXTP05EH56DTIR0X           200000      7894566276

    Your predicate is "where a=1 and b=1", which is equivalent to "where SYS_STUNA$6DVXJXTP05EH56DTIR0X = sys_op_combined_hash(1, 1)".

    As you know, with a frequency histogram the selectivity for an equality (=) predicate is (E_endpoint - B_endpoint)/num_rows, where the predicate value lies between the B_endpoint and E_endpoint buckets (endpoint numbers) of the histogram. But sys_op_combined_hash(1, 1) = 7026129190895635777, so how can we compare this value with the histogram endpoint values? The answer is that when creating the histogram Oracle does not store the exact sys_op_combined_hash(x, y) result; it also applies a MOD function, so you have to compare MOD(sys_op_combined_hash(1, 1), 9999999999) - which equals 1598248696 - with the endpoint values. 1598248696 does not match any endpoint, so the optimizer cannot use the endpoint information and falls back to NewDensity as the density.
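
    You can check this yourself - a small sketch, assuming you are connected as the owning schema A:

    select mod(sys_op_combined_hash(1, 1), 9999999999) missing_pair,
           mod(sys_op_combined_hash(2, 1), 9999999999) pair_2_1,
           mod(sys_op_combined_hash(1, 2), 9999999999) pair_1_2
    from   dual;

    select endpoint_number, endpoint_value
    from   user_tab_histograms
    where  table_name  = 'COL_STATS'
    and    column_name = 'SYS_STUNA$6DVXJXTP05EH56DTIR0X'
    order  by endpoint_number;

    The values for the real combinations (1977102303 and 7894566276) appear as endpoint values; 1598248696 for the missing pair (1,1) does not.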

     

    In the trace file below you can clearly see this:

    BASE STATISTICAL INFORMATION

    ***********************

    Table Stats::

      Table: COL_STATS  Alias: COL_STATS

        #Rows: 200000  #Blks:  382  AvgRowLen:  18.00

    Access path analysis for COL_STATS

    ***************************************

    SINGLE TABLE ACCESS PATH

      Single Table Cardinality Estimation for COL_STATS[COL_STATS]

      Column (#1):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      Column (#2):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      Column (#3):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X

        Col#: 1 2    CorStregth: 2.00

      ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.2500

      Table: COL_STATS  Alias: COL_STATS

        Card: Original: 200000.000000  Rounded: 50000  Computed: 50000.00  Non Adjusted: 50000.00

      Access Path: TableScan

        Cost:  107.56  Resp: 107.56  Degree: 0

          Cost_io: 105.00  Cost_cpu: 51720390

          Resp_io: 105.00  Resp_cpu: 51720390

      Best:: AccessPath: TableScan

             Cost: 107.56  Degree: 1  Resp: 107.56  Card: 50000.00  Bytes: 0

    Note that here NewDensity works out as 1/(2*num_distinct) = 1/4 = 0.25 for the frequency histogram.

    The CBO used the column group statistics, and the estimated cardinality was 200000*0.25 = 50000.

    Remember that these are permanent statistics: the RDBMS gathered them by analyzing the actual table data (including the correlation between the columns).

    But dynamic sampling can be good in your situation above, because it calculates the selectivity at run time by sampling with the real predicate.

    For other situations you can see that extended statistics are a great help for estimation - for example "where a=2 and b=1" - because this combination is actual data and the corresponding information (stats/histograms) is stored in the dictionary.

    SQL>  select * from col_stats where a=2 and b=1;

    Execution Plan

    ----------------------------------------------------------

    Plan hash value: 1829175627

    -------------------------------------------------------------------------------

    | Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

    -------------------------------------------------------------------------------

    |   0 | SELECT STATEMENT  |           |   100K|   585K|   108   (3)| 00:00:02 |

    |*  1 |  TABLE ACCESS FULL| COL_STATS |   100K|   585K|   108   (3)| 00:00:02 |

    -------------------------------------------------------------------------------

    Predicate Information (identified by operation id):

    ---------------------------------------------------

       1 - filter("A"=2 AND "B"=1)

    and from trace file

    Table Stats::

      Table: COL_STATS  Alias: COL_STATS

        #Rows: 200000  #Blks:  382  AvgRowLen:  18.00

    Access path analysis for COL_STATS

    ***************************************

    SINGLE TABLE ACCESS PATH

      Single Table Cardinality Estimation for COL_STATS[COL_STATS]

      Column (#1):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      Column (#2):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      Column (#3):

        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

      ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X

        Col#: 1 2    CorStregth: 2.00

    ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.5000

      Table: COL_STATS  Alias: COL_STATS

        Card: Original: 200000.000000  Rounded: 100000  Computed: 100000.00  Non Adjusted: 100000.00

      Access Path: TableScan

        Cost:  107.56  Resp: 107.56  Degree: 0

          Cost_io: 105.00  Cost_cpu: 51720365

          Resp_io: 105.00  Resp_cpu: 51720365

      Best:: AccessPath: TableScan

             Cost: 107.56  Degree: 1  Resp: 107.56  Card: 100000.00  Bytes: 0

    Let's calculate: MOD(sys_op_combined_hash(2, 1), 9999999999) = 1977102303, and for that endpoint (e_endpoint - b_endpoint)/num_rows = (200000 - 100000)/200000 = 0.5, so the resulting cardinality = sel * num_rows (or simply e_endpoint - b_endpoint) = 100000.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown

    Aliyev,

    Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.

    Aliyev Chinar wrote:
    
    
    Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.

    Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using.  The NewDensity is calculated to give half the selectivity of the least popular value captured in the histogram.  Here, for example, is the NewDensity of a column group that holds 4 distinct values:

        NewDensity:0.022727, OldDensity:0.000023 BktCnt:22000, PopBktCnt:22000, PopValCnt:4, NDV:4

      Column (#3): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(

        AvgLen: 12 NDV: 4 Nulls: 0 Density: 0.022727 Min: 1598248696 Max: 7894566276

        Histogram: Freq  #Bkts: 4  UncompBkts: 22000  EndPtVals: 4

    There are 22,000 rows captured by the histogram, and the least frequently occurring value accounts for 1,000 of them, so the NewDensity is (1,000/22,000)/2 = 1/44 = 0.0227272727272...

    Regards

    Jonathan Lewis

  • Aliyev Chinar

    Jonathan Lewis wrote:
    
    Aliyev,
    
    Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.
    
    
    Aliyev Chinar wrote:  
    
    
    Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
    Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using.

    Of course, it is not a generic formula. I just mentioned it for the above trace file.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown
    Aliyev Chinar wrote:
    
    
    Jonathan Lewis wrote:
    
    Aliyev,
    
    Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.
    
    
    Aliyev Chinar wrote: 
    
    
    Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
    Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using.

    Of course, it is not a generic formula. I just mentioned it for the above trace file.

    Let me rephrase the question - beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why you chose to insert num_distinct at that point?

    Regards

    Jonathan Lewis


  • Aliyev Chinar

    Jonathan Lewis wrote:

    Let me rephrase the question - beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why you chose to insert num_distinct at that point?

    Even when there are no histogram stats (in simple cases), we know that the term "density" is calculated as 1/num_distinct, so that gave me the idea about the relationship between density and NDV. I want to note again that in this case my guess was based only on the above trace file; it is not about general cases (height-balanced histograms, or even special cases of frequency histograms and Oracle-version-specific details).

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,607 Gold Crown

    Aliyev,

    Thanks.

    When I read your first comments I didn't realise it was an educated guess.

    Regards

    Jonathan Lewis

This discussion has been closed.