This discussion is archived
9 Replies Latest reply: Jan 17, 2013 2:20 AM by rahulras RSS

Why estimated rows are not correct?

rahulras Explorer
Currently Being Moderated
Hi All,

Oracle 11.2.0.3, 64 bit on Linux.

We use the default (out-of-the-box) statistics collections by Oracle. I have a table with 130M records, with one unique key index. I am taking a simple count of records based on a date column (DO_DATE) which is the third column in that unique key index (hence all those INDEX FAST FULL SCAN in following plans).
There are histograms on all columns in the table (indexed and non-indexed both).
My question is, why Oracle is estimating/guessing wrong number of records in 2 out of the following 4 cases ??
In Case 2, there are 1071 records, but Oracle thought there are 35074. In Case 4, there are 171K records, but Oracle thought there are 35074.
I hope, I am interpreting the numbers in the plans correctly :)
SQL> set autotrace on explain
-- ==================== CASE 1 ===================================================
SQL> select count(*) from TABLE1 where DO_DATE=to_date('14012013','DDMMYYYY');

  COUNT(*)
----------
  19957936

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1543387839

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |     1 |     8 |   253K  (2)| 00:50:43 |
|   1 |  SORT AGGREGATE       |                 |     1 |     8 |            |          |
|*  2 |   INDEX FAST FULL SCAN| TABLE1_UK       |    20M|   156M|   253K  (2)| 00:50:43 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DO_DATE"=TO_DATE(' 2013-01-14 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

-- ==================== CASE 2 ===================================================
SQL> select count(*) from TABLE1 where DO_DATE=to_date('03052011','DDMMYYYY');

  COUNT(*)
----------
      1071

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1543387839

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |     1 |     8 |   253K  (2)| 00:50:43 |
|   1 |  SORT AGGREGATE       |                 |     1 |     8 |            |          |
|*  2 |   INDEX FAST FULL SCAN| TABLE1_UK       | 35074 |   274K|   253K  (2)| 00:50:43 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DO_DATE"=TO_DATE(' 2011-05-03 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

-- ==================== CASE 3 ===================================================
SQL> select count(*) from TABLE1 where DO_DATE=to_date('16102007','DDMMYYYY');

  COUNT(*)
----------
         4

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1543387839

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |     1 |     8 |   253K  (2)| 00:50:43 |
|   1 |  SORT AGGREGATE       |                 |     1 |     8 |            |          |
|*  2 |   INDEX FAST FULL SCAN| TABLE1_UK       |     1 |     8 |   253K  (2)| 00:50:43 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DO_DATE"=TO_DATE(' 2007-10-16 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))

-- ==================== CASE 4 ===================================================
SQL> select count(*) from TABLE1 where DO_DATE=to_date('10042012','DDMMYYYY');

  COUNT(*)
----------
    171897

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1543387839

-----------------------------------------------------------------------------------------
| Id  | Operation             | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |                 |     1 |     8 |   253K  (2)| 00:50:43 |
|   1 |  SORT AGGREGATE       |                 |     1 |     8 |            |          |
|*  2 |   INDEX FAST FULL SCAN| TABLE1_UK       | 35074 |   274K|   253K  (2)| 00:50:43 |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("DO_DATE"=TO_DATE(' 2012-04-10 00:00:00', 'syyyy-mm-dd
              hh24:mi:ss'))
Thanks in advance
  • 1. Re: Why estimated rows are not correct?
    rp0428 Guru
    Currently Being Moderated
    >
    Why estimated rows are not correct?
    >
    Ummm - because Oracle 'estimates' the number of rows by taking a sample of the data. Estimates will seldom match the actual results. They are only useful in assisting Oracle in selecting the correct plan to use.

    See this article written by Maria Colgan, a member of Oracle's Optimizer developer team
    How do I know if the cardinality estimates in a plan are accurate?

    https://blogs.oracle.com/optimizer/entry/how_do_i_know_if
  • 2. Re: Why estimated rows are not correct?
    Justin Cave Oracle ACE
    Currently Being Moderated
    You say there is a histogram

    What kind of histogram? How many buckets?
    How many rows are in the table? What are the minimum and maximum values?
    It's clear from your queries that the data is highly skewed-- can you describe how the data is skewed (i.e. is recent data uniformly more common than older data)?
    When were the statistics for the histogram gathered? Has the data changed since then?

    Let's assume for the sake of argument that you have a height-balanced histogram with 255 buckets and data from Jan 1, 2007 (first of the year of your earliest example) to the present). That would mean that there are ~2200 distinct values of the DO_DATE column. If there are 255 buckets, that means that Oracle would try to find start and end points for 255 different segments such that each segment had roughly the same number of rows. So if older dates have fewer rows, the earlier buckets would be wider than the more recent buckets.

    If the data doesn't vary uniformly-- if Jan 14 has 20 million rows, Jan 13 has 200,000 rows, Jan 12 has 2,000 rows, and Jan 11 has 2 million rows-- but all 4 dates are in a single bucket (which in our hypothetical implies that the table has ~5.6 billion rows and each of the 255 histogram buckets covers ~22 million rows), then Oracle would estimate that any of the 4 dates would return 22 million/4 = 5.5 million rows. That would be a bit low for Jan 14, a bit high for Jan 11, and way, way too high for Jan 12 and 13.

    Justin
  • 3. Re: Why estimated rows are not correct?
    rahulras Explorer
    Currently Being Moderated
    I will add some more information, if at all that helps.
    Sample size Oracle used in case of this table is 100%, hence I was thinking that the estimates will be very close to correct.
    SQL> select num_rows, blocks, sample_size from dba_tab_statistics where table_name = 'TABLE1';
    
      NUM_ROWS     BLOCKS SAMPLE_SIZE
    ---------- ---------- -----------
     127986401    1165766   127986401
    
    1 row selected.
    For the column DO_DATE there are 1458 distinct values in the table and the histogram has 127 records in dba_histograms.

    The reason I am asking this question here is, do I need to do anything to make Oracle do better estimates? any parameters? any configuration?
  • 4. Re: Why estimated rows are not correct?
    Dom Brooks Guru
    Currently Being Moderated
    My question is, why Oracle is estimating/guessing wrong number of records in 2 out of the following 4 cases ??
    Depending on how many distinct values there are and the distribution of data that might occur across the buckets, histograms can be a fairly limited way of providing accurate information.


    If you want more accurate estimates then try using dynamic sampling instead.
  • 5. Re: Why estimated rows are not correct?
    rp0428 Guru
    Currently Being Moderated
    >
    The reason I am asking this question here is, do I need to do anything to make Oracle do better estimates? any parameters? any configuration?
    >
    You haven't provided any reason to think that Oracle needs a better estimate than what it is getting.

    As I said, the stats (estimates) are used to assist Oracle in selecting the lowest cost plan for getting your result set. You haven't provided ANY information that there is a better plan that Oracle could be using.

    And so far you have only provided the estimates, not the actual number of rows accessed.

    It doesn't matter what the estimates are, or if they match the actual rows. The only thing that matters is if Oracle is not picking the lowest cost plan for some reason.

    Run a 10053 trace to see what other executions plans Oracle considered and ignored.

    Or read the article at the link I provided and try what that Oracle CBO developer recommended:
    >
    If we add the GATHER_PLAN_STATISTICS hint to our simple SQL statement we should be able to see the actual cardinality of each operation at execution time alongside the Optimizer estimates for each cardinality in the plan.
    >
    Then post the relevant 10053 trace sections, the actual execution plan and the number of ACTUAL rows that are accessed.
  • 6. Re: Why estimated rows are not correct?
    rahulras Explorer
    Currently Being Moderated
    The reason I am thinking that Oracle needs to provide better estimates is, every now and then we come across situation that Oracle picks up a wrong index on the table and SQLs take very long time. Same SQL will not always take that long. There might be something wrong with our stats in general.

    I think, I was wrong in saying that we have histograms on all columns. When I looked in dba_tab_col_statistics, I can see following result. What does that mean? does it mean that there are histograms on only 3 columns and they too are done with very small sample size?
    SQL> select column_name, num_distinct, num_nulls, num_buckets, sample_size, histogram
      2  from dba_tab_col_statistics where table_name='TABLE1';
    
    COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
    ------------------------------ ------------ ---------- ----------- ----------- ---------------
    COLUMN7                              378624          0           1   128248430 NONE
    COLUMN6                              363680          0           1   128248430 NONE
    COLUMN5                             2385664       4411           1   128244019 NONE
    COLUMN4                                   5          0           5        5429 FREQUENCY
    COLUMN3                                  19          0           1   128248430 NONE
    DO_DATE                                1458          0         255        5429 HEIGHT BALANCED
    COLUMN2                                  42          0          23        5430 FREQUENCY
    COLUMN1                            19998720          0           1   128248430 NONE
    
    8 rows selected.
  • 7. Re: Why estimated rows are not correct?
    Hemant K Chitale Oracle ACE
    Currently Being Moderated
    As Justin has pointed out, you likely have unequal distribution of DO_DATE values in the 255 buckets of the Histogram for this column.


    Hemant K Chitale
  • 8. Re: Why estimated rows are not correct?
    Jonathan Lewis Oracle ACE Director
    Currently Being Moderated
    rahulras wrote:
    My question is, why Oracle is estimating/guessing wrong number of records in 2 out of the following 4 cases ??
    In Case 2, there are 1071 records, but Oracle thought there are 35074. In Case 4, there are 171K records, but Oracle thought there are 35074.
    You have a height balanced histogram taken on a very small sample size.

    There is (at least) one very popular value that stands out in the histogram - the sample identified this popular value and held many buckets for it giving the optimizer a reasonable estimate of cardinality.

    The two values which produced an estimate of ca. 35,000 did so because they didn't appear a popular values in the histogram, so Oracle calculated (num_rows - popular_rows)/(num of distinct value - number of popular values).

    One of your values was probably a long way outside the RECORDED range of values seen in the sample, so after estimating the "non-popular" cardinality, Oracle then did its "linear decay" calculation to estimate the cardinality of a value that wasn't known to exist. I would guess that about half your data holds popular values, and that there is a long, finely scattered "tail" of extreme values for a very small fraction of the data. If you want to get a better model (i.e. histogram) of your data, then it is possible to create a hand-built one using dbms_stats.set_column_stats(): http://jonathanlewis.wordpress.com/2010/03/23/fake-histograms/

    For extreme cases in the past I have advised people to create a fake FREQUENCY histogram listing the 250 most popular values, plus the low and high values that you want Oracle to know about and (for 10.2.0.4 and above) to represent the values that are not in the histogram one entry with twice the number of rows that you want Oracle to think the "ordinary" values will have.


    Regards
    Jonathan Lewis
  • 9. Re: Why estimated rows are not correct?
    rahulras Explorer
    Currently Being Moderated
    Hi Jonathan,

    Thanks for the great explaination. You are absolutely correct in saying that, there must be one popular value. There is popular value for the DO_DATE which is normally "yesterday's" date and that will be the value for roughly 25M rows out of 130M.
    Apart from this table, there are tables in our application which has 300M-500M records (example shown below), where we have histograms based on sample size of 5700.

    I was thinking of setting higher estimate percent for some of the more used columns, something like 30% (no statistical base here for picking up the number 30).
    Can that help ?
    I am still not going ahead with manually created (or fake) histograms because if the nature/distribution of data change, somebody needs to re-create these histograms manually according to new data distribution.
    SQL> select column_name, num_distinct, num_nulls, num_buckets, sample_size, histogram
      2  from dba_tab_col_statistics where table_name='TABLE2';
    
    COLUMN_NAME                    NUM_DISTINCT  NUM_NULLS NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
    ------------------------------ ------------ ---------- ----------- ----------- ---------------
    EX_DATE                                   0  506615378           0             NONE
    END_DATE                                  1          0           1        5640 FREQUENCY
    ABCD_STATUS                               1          0           1   506615378 NONE
    WXYZ_VALUE                         46878720          7         254        5640 HEIGHT BALANCED
    COLUMN01                                 11          0           4        5641 FREQUENCY
    XXX_DATE                            6821376          0         254        5640 HEIGHT BALANCED
    VIMP_NUM                                 43          0          25        5640 FREQUENCY
    KEY_NUMER                          73457664          0           1   506615378 NONE
    
    8 rows selected.
    Here VIMP_NUM is a column which is highly skewed and also used in many important batch processes as key decision maker.
    VIMP_NUM has 100-odd distinct values. Is it the case that, when Oracle find few populare values in random sample, it stops going further and set the stats on that small collected sample?

    Thanks again

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points