11 Replies Latest reply: Jan 10, 2014 8:31 AM by Jonathan Lewis

    Extended statistics issue

    RobK

      Hello!

       

      I have a problem with extended statistics on 11.2.0.3

       

      Here is the script I run

       

      drop table col_stats;

      create table col_stats as

      select  1 a, 2 b

      from dual

      connect by level<=100000;

       

      insert into col_stats (

      select  2, 1

      from dual

      connect by level<=100000);

       

      -- check the (a,b) distribution
      select a, b, count(1) from col_stats group by a, b;

               A          B   COUNT(1)
      ---------- ---------- ----------
               2          1     100000
               1          2     100000

       

       

      -- extended stats DEFINITION

      select dbms_stats.create_extended_stats('A','COL_STATS','(A,B)') name

      from dual;

       

      -- set estimate_percent to 100%

      EXEC dbms_stats.SET_TABLE_prefs ('A','COL_STATS','ESTIMATE_PERCENT',100);

       

       

      -- check the changes

      select dbms_stats.get_prefs ('ESTIMATE_PERCENT','A','COL_STATS')

      from dual;

       

       

      -- NOW GATHER COLUMN STATS

      BEGIN

        DBMS_STATS.GATHER_TABLE_STATS (

          OWNNAME    => 'A',

          TABNAME    => 'COL_STATS',

          METHOD_OPT => 'FOR ALL COLUMNS' );

      END;

      /

       

       

      set autotrace traceonly explain

      select * from col_stats where a=1 and b=1;

       

      SQL> select * from col_stats where a=1 and b=1;

       

       

      Execution Plan

      ----------------------------------------------------------

      Plan hash value: 1829175627

       

       

      -------------------------------------------------------------------------------

      | Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

      -------------------------------------------------------------------------------

      |   0 | SELECT STATEMENT  |           | 50000 |   683K|   177   (2)| 00:00:03 |

      |*  1 |  TABLE ACCESS FULL| COL_STATS | 50000 |   683K|   177   (2)| 00:00:03 |

      -------------------------------------------------------------------------------

       

       

      Predicate Information (identified by operation id):

      ---------------------------------------------------

       

       

         1 - filter("A"=1 AND "B"=1)

       

       

      How come the optimizer expects 50000 rows?

       

      Thanks in advance.

      Rob

        • 1. Re: Extended statistics issue
          Martin Preiss

          I think that's a result of Oracle's strategy to calculate cardinalities with frequency histograms when there are only a few different known values: " If the value you supply does not appear in the histogram, but is inside the low/high range of the histogram then the cardinality will be half the cardinality of the least frequently occurring value that is in the histogram." Histogram change | Oracle Scratchpad.

           

          In your case there are two known values, each with 100K rows - so they are also the least frequently occurring values, and the calculation gives 100K/2 = 50K.
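
          You can see the histogram the optimizer is working from with a query like this (just a sketch, run as the table owner):

          select column_name, endpoint_number, endpoint_value
          from   user_tab_histograms
          where  table_name = 'COL_STATS'
          order  by column_name, endpoint_number;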

          • 2. Re: Extended statistics issue
            Jonathan Lewis

            Martin,

             

            Absolutely spot on.

             

            There's not a lot you can do about this situation if you want to query for a value that probably doesn't exist when the only other values have very high frequencies - you're reduced to faking it programmatically.

             

            This example introduces an interesting problem - how do you know which combinations are inside the range (and give the "half the least common" selectivity) and which are outside the range?  If the OP changes the initial (1,2) to (1,1) and then queries for (1,2) he'll find that the cardinality is one.  (Because of the relationships between the hashed values of the three combinations.)

             

            The relevant function to generate values is sys_op_combined_hash().

             

            Regards

            Jonathan Lewis

            • 3. Re: Extended statistics issue
              RobK

              Thanks Martin and Jonathan!

               

              I am honored that you spent time on my question.

               

              A few related thoughts:

              - I created this test case as a demo to show how useful extended statistics are. In this case having no stats at all is better, as dynamic sampling does a much better job.

              - This example is a bit unusual because I changed the stats collection defaults by setting estimate_percent to 100%. In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table.

              - Could you please also show in more detail how sys_op_combined_hash() could be used? Should I fake statistics with it? How?

               

              Thanks,

              Rob.

              • 4. Re: Extended statistics issue
                Jonathan Lewis

                select sys_op_combined_hash(1,2) from dual;

                Compare the result with the contents of the histogram (user_tab_histograms) for the column and you'll see how to generate values.

                Then read my blog about faking frequency histograms for the type of code you need: Frequency Histograms | Oracle Scratchpad
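
                As a rough illustration, here is a minimal (untested) sketch along the lines of that post, using the extension name and the mod-ed hash values that appear later in this thread - treat every value as an assumption to check on your own system:

                declare
                    m_distcnt  number;
                    m_density  number;
                    m_nullcnt  number;
                    m_avgclen  number;
                    m_srec     dbms_stats.statrec;
                    m_numvals  dbms_stats.numarray;
                begin
                    -- start from the gathered stats on the column-group virtual column
                    dbms_stats.get_column_stats(
                        ownname => 'A', tabname => 'COL_STATS',
                        colname => 'SYS_STUNA$6DVXJXTP05EH56DTIR0X',
                        distcnt => m_distcnt, density => m_density,
                        nullcnt => m_nullcnt, srec => m_srec, avgclen => m_avgclen
                    );
                    -- endpoint values, ascending: mod(sys_op_combined_hash(a,b), 9999999999)
                    m_numvals := dbms_stats.numarray(
                                     1598248696,   -- (1,1): the combination faked as rare
                                     1977102303,   -- (2,1)
                                     7894566276    -- (1,2)
                                 );
                    m_srec.epc    := 3;
                    m_srec.bkvals := dbms_stats.numarray(1, 100000, 100000);  -- frequencies
                    dbms_stats.prepare_column_values(m_srec, m_numvals);
                    -- write the faked frequency histogram back
                    dbms_stats.set_column_stats(
                        ownname => 'A', tabname => 'COL_STATS',
                        colname => 'SYS_STUNA$6DVXJXTP05EH56DTIR0X',
                        distcnt => 3, density => m_density,  -- density left as gathered
                        nullcnt => 0, srec => m_srec, avgclen => m_avgclen
                    );
                end;
                /

                After that, a query for a=1 and b=1 should be costed from the faked bucket (about one row) instead of half the least common value.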

                 

                Regards

                Jonathan Lewis

                • 5. Re: Extended statistics issue
                  Hemant K Chitale

                  >In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table

                  The values were not in the table when statistics were gathered.  However, Oracle can't confirm or deny whether values were inserted one second after the gather_table_stats call, so it has to come up with an estimate for a value not present in the histogram.

                   

                   

                  Hemant K Chitale

                  • 6. Re: Extended statistics issue
                    Aliyev Chinar

                    RobK wrote:

                     

                    SQL> select * from col_stats where a=1 and b=1;

                     

                     

                    Execution Plan

                    ----------------------------------------------------------

                    Plan hash value: 1829175627

                     

                     

                    -------------------------------------------------------------------------------

                    | Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

                    -------------------------------------------------------------------------------

                    |   0 | SELECT STATEMENT  |           | 50000 |   683K|   177   (2)| 00:00:03 |

                    |*  1 |  TABLE ACCESS FULL| COL_STATS | 50000 |   683K|   177   (2)| 00:00:03 |

                    -------------------------------------------------------------------------------

                     

                     

                    Predicate Information (identified by operation id):

                    ---------------------------------------------------

                     

                     

                       1 - filter("A"=1 AND "B"=1)

                     

                     

                    How come the optimizer expects 50000 rows?

                     

                    Thanks in advance.

                    Rob

                     

                    This is expected behavior.

                    When you create extended statistics, Oracle creates a hidden virtual column for the column group and gathers statistics (including a histogram) on it.
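
                    You can see that hidden virtual column in the dictionary (a quick sketch, run as the owner A):

                    select column_name, virtual_column, hidden_column
                    from   user_tab_cols
                    where  table_name = 'COL_STATS';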

                    The predicate "where a=1 and b=1" is effectively an out-of-range predicate for that virtual column. In such cases the optimizer should arguably estimate the selectivity as 0 (so cardinality 1), but instead it uses the density of the column (here actually the NewDensity of the virtual column).

                    Let's look at the following information (this is exactly your case):

                     

                     

                    NUM_ROWS

                    --------

                      200000

                     

                     

                     

                     

                    COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
                    ------------------------------ ------------ ---------- ---------
                    A                                         2  .00000250 FREQUENCY
                    B                                         2  .00000250 FREQUENCY
                    SYS_STUNA$6DVXJXTP05EH56DTIR0X            2  .00000250 FREQUENCY

                     

                     

                     

                     

                    COLUMN_NAME                    ENDPOINT_NUMBER ENDPOINT_VALUE
                    ------------------------------ --------------- --------------
                    A                                       100000              1
                    A                                       200000              2
                    B                                       100000              1
                    B                                       200000              2
                    SYS_STUNA$6DVXJXTP05EH56DTIR0X          100000     1977102303
                    SYS_STUNA$6DVXJXTP05EH56DTIR0X          200000     7894566276

                     

                     

                    Your predicate is "where a=1 and b=1", and it is equivalent to "where SYS_STUNA$6DVXJXTP05EH56DTIR0X = sys_op_combined_hash(1, 1)".

                     

                    As you know, with a frequency histogram the selectivity for an equality (=) predicate is (e_endpoint - b_endpoint)/num_rows, where e_endpoint and b_endpoint are the matching bucket's endpoint number and the previous bucket's endpoint number. But sys_op_combined_hash(1, 1) = 7026129190895635777, so how can that value be compared with the endpoint values above? The answer is that when creating the histogram Oracle does not store the raw sys_op_combined_hash(x, y) - it also applies a MOD - so you have to compare MOD(sys_op_combined_hash(1, 1), 9999999999), which equals 1598248696, with the endpoint values. 1598248696 does not match any endpoint, so the optimizer falls back to NewDensity as the density.
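
                    You can verify the values yourself - sys_op_combined_hash() is internal and undocumented, so treat this purely as a sketch:

                    select mod(sys_op_combined_hash(1, 1), 9999999999) as h_1_1,  -- 1598248696
                           mod(sys_op_combined_hash(2, 1), 9999999999) as h_2_1,  -- 1977102303
                           mod(sys_op_combined_hash(1, 2), 9999999999) as h_1_2   -- 7894566276
                    from   dual;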

                     

                    In the trace file below you can see this clearly:

                     

                    BASE STATISTICAL INFORMATION

                    ***********************

                    Table Stats::

                      Table: COL_STATS  Alias: COL_STATS

                        #Rows: 200000  #Blks:  382  AvgRowLen:  18.00

                    Access path analysis for COL_STATS

                    ***************************************

                    SINGLE TABLE ACCESS PATH

                      Single Table Cardinality Estimation for COL_STATS[COL_STATS]

                      Column (#1):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      Column (#2):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      Column (#3):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X

                        Col#: 1 2    CorStregth: 2.00

                      ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.2500

                      Table: COL_STATS  Alias: COL_STATS

                        Card: Original: 200000.000000  Rounded: 50000  Computed: 50000.00  Non Adjusted: 50000.00

                      Access Path: TableScan

                        Cost:  107.56  Resp: 107.56  Degree: 0

                          Cost_io: 105.00  Cost_cpu: 51720390

                          Resp_io: 105.00  Resp_cpu: 51720390

                      Best:: AccessPath: TableScan

                             Cost: 107.56  Degree: 1  Resp: 107.56  Card: 50000.00  Bytes: 0

                     

                     

                    Note that NewDensity is calculated as 1/(2*num_distinct) = 1/4 = 0.25 for the frequency histogram!

                     

                    The CBO used the column group statistics, and the estimated cardinality was 200000 * 0.25 = 50000.

                     

                    Remember that these are permanent statistics: the RDBMS gathered them by analysing the actual table data (including the correlated column data).

                    But dynamic sampling can be good in your situation above, because it calculates the selectivity at run time by sampling with the real predicate.
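
                    For comparison, a sketch of forcing it for this one statement (level 4 is just an example):

                    select /*+ dynamic_sampling(col_stats 4) */ *
                    from   col_stats
                    where  a = 1 and b = 1;
                    -- sampled at run time with the real predicate, the estimate
                    -- for (1,1) should drop close to the true count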

                     

                    In other situations you can see that extended statistics are a great help for estimation - e.g. where a=2 and b=1 - because that combination is actual data and the corresponding information (stats/histograms) is stored in the dictionary.

                     

                     

                    SQL>  select * from col_stats where a=2 and b=1;

                     

                    Execution Plan

                    ----------------------------------------------------------

                    Plan hash value: 1829175627

                     

                    -------------------------------------------------------------------------------

                    | Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |

                    -------------------------------------------------------------------------------

                    |   0 | SELECT STATEMENT  |           |   100K|   585K|   108   (3)| 00:00:02 |

                    |*  1 |  TABLE ACCESS FULL| COL_STATS |   100K|   585K|   108   (3)| 00:00:02 |

                    -------------------------------------------------------------------------------

                     

                    Predicate Information (identified by operation id):

                    ---------------------------------------------------

                     

                       1 - filter("A"=2 AND "B"=1)

                     

                     

                    and from the trace file:

                     

                    Table Stats::

                      Table: COL_STATS  Alias: COL_STATS

                        #Rows: 200000  #Blks:  382  AvgRowLen:  18.00

                    Access path analysis for COL_STATS

                    ***************************************

                    SINGLE TABLE ACCESS PATH

                      Single Table Cardinality Estimation for COL_STATS[COL_STATS]

                      Column (#1):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      Column (#2):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      Column (#3):

                        NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2

                      ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X

                        Col#: 1 2    CorStregth: 2.00

                    ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial:  Sel: 0.5000

                      Table: COL_STATS  Alias: COL_STATS

                        Card: Original: 200000.000000  Rounded: 100000  Computed: 100000.00  Non Adjusted: 100000.00

                      Access Path: TableScan

                        Cost:  107.56  Resp: 107.56  Degree: 0

                          Cost_io: 105.00  Cost_cpu: 51720365

                          Resp_io: 105.00  Resp_cpu: 51720365

                      Best:: AccessPath: TableScan

                             Cost: 107.56  Degree: 1  Resp: 107.56  Card: 100000.00  Bytes: 0

                     

                    Let's calculate:

                     

                    MOD(sys_op_combined_hash(2, 1), 9999999999) = 1977102303, and for it (e_endpoint - b_endpoint)/num_rows = (200000 - 100000)/200000 = 0.5,

                    so the resulting card = sel * num_rows (or just e_endpoint - b_endpoint) = 100000.
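
                    The same arithmetic can be read straight from the dictionary (a sketch, run as the table owner):

                    -- frequency histograms store cumulative counts in endpoint_number,
                    -- so each bucket's frequency is endpoint_number minus the previous one
                    select endpoint_value,
                           endpoint_number
                             - lag(endpoint_number, 1, 0)
                               over (order by endpoint_number) as frequency
                    from   user_tab_histograms
                    where  table_name  = 'COL_STATS'
                    and    column_name = 'SYS_STUNA$6DVXJXTP05EH56DTIR0X'
                    order  by endpoint_number;
                    -- the bucket with endpoint_value 1977102303, i.e. (2,1), shows
                    -- frequency 100000, giving selectivity 100000/200000 = 0.5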

                    • 7. Re: Extended statistics issue
                      Jonathan Lewis

                      Aliyev,

                       

                      Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.

                       

                      Aliyev Chinar wrote:

                       

                       

                      Note that NewDensity is calculated as 1/(2*num_distinct) = 1/4 = 0.25 for the frequency histogram!

                       

                       

                       

                      Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using.  The NewDensity is calculated to give half the selectivity of the least popular value captured in the histogram.  Here, for example, is the NewDensity of a column group that holds 4 distinct values:

                       

                          NewDensity:0.022727, OldDensity:0.000023 BktCnt:22000, PopBktCnt:22000, PopValCnt:4, NDV:4

                        Column (#3): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(

                          AvgLen: 12 NDV: 4 Nulls: 0 Density: 0.022727 Min: 1598248696 Max: 7894566276

                          Histogram: Freq  #Bkts: 4  UncompBkts: 22000  EndPtVals: 4

                       

                      There are 22,000 rows captured by the histogram, and the least frequently occurring value accounts for 1,000 of them, so the NewDensity is (1,000/22,000)/2 = 1/44 = 0.0227272727...
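
                      If you want to check that rule against the dictionary, something like this does it (a sketch - T1 is a placeholder for whatever table you build the example on):

                      select (min(frequency) / max(total_buckets)) / 2 as new_density
                      from  (
                              select endpoint_number
                                       - lag(endpoint_number, 1, 0)
                                         over (order by endpoint_number) as frequency,
                                     max(endpoint_number) over ()        as total_buckets
                              from   user_tab_histograms
                              where  table_name  = 'T1'  -- placeholder
                              and    column_name = 'SYS_STUBZH0IHA7K$KEBJVXO5LOHAS'
                            );
                      -- here: (1000/22000)/2 = 0.0227272727...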

                       

                      Regards

                      Jonathan Lewis

                      • 8. Re: Extended statistics issue
                        Aliyev Chinar

                        Jonathan Lewis wrote:

                         

                        Aliyev,

                         

                        Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.

                         

                        Aliyev Chinar wrote: 

                         

                         

                        Note that NewDensity is calculated as 1/(2*num_distinct) = 1/4 = 0.25 for the frequency histogram!

                         

                         

                         

                        Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. ...

                        Of course, it is not a generic formula; I just mentioned it for the trace file above.

                        • 9. Re: Extended statistics issue
                          Jonathan Lewis

                          Aliyev Chinar wrote:

                           

                          Jonathan Lewis wrote:

                           

                          Aliyev,

                           

                          Good catch on the mod(,9999999999), I'd forgotten about that - which won't have helped the OP very much if he was trying to generate values for a fake histogram.

                           

                          Aliyev Chinar wrote:

                           

                           

                          Note that NewDensity is calculated as 1/(2*num_distinct) = 1/4 = 0.25 for the frequency histogram!

                           

                           

                           

                          Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. ...

                          Of course, it is not a generic formula; I just mentioned it for the trace file above.

                          Let me rephrase the question: beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why you chose to insert the num_distinct at that point?

                           

                           

                           

                          Regards

                          Jonathan Lewis

                          • 10. Re: Extended statistics issue
                            Aliyev Chinar

                            Let me rephrase the question: beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why you chose to insert the num_distinct at that point?

                             

                             

                             

                            Regards

                            Jonathan Lewis

                            Even where there are no histogram stats (in simple cases), we know that the term "density" is calculated as 1/num_distinct, so that gave me the idea about the relationship between density and NDV. I want to note again: in this case my guess was based only on the trace file above; it is not about general cases (height-balanced histograms, even frequency histograms' special cases, and Oracle-version-related information).

                            • 11. Re: Extended statistics issue
                              Jonathan Lewis

                              Aliyev,

                               

                              Thanks.

                              When I read your first comments I didn't realise it was an educated guess.

                               

                              Regards

                              Jonathan Lewis