
1. Re: Extended statistics issue
Martin Preiss Jan 8, 2014 8:59 AM (in response to RobK)I think that's a result of Oracle's strategy to calculate cardinalities with frequency histograms when there are only a few different known values: " If the value you supply does not appear in the histogram, but is inside the low/high range of the histogram then the cardinality will be half the cardinality of the least frequently occurring value that is in the histogram." Histogram change  Oracle Scratchpad.
In your case there are two known values each with 100K rows  so they are also the least frequently occuring values and so the calculation brings 100K/2 = 50K.

2. Re: Extended statistics issue
Jonathan Lewis Jan 8, 2014 9:36 AM (in response to Martin Preiss)Martin,
Absolutely spot on.
There's not a lot you can do about this situation if you want to query for a value that probably doesn't exist when the only other values have very high frequencies  you're reduce to faking it by program.
This example introduces an interesting problem  how do you know which combinations are inside the range (and give the "half the least common" selectivity) and which are outside the range ? If the OP changes the initial (1,2) to (1,1) and then queries for (1,2) he'll find that the cardinality is one. (Because the relationships of the hashed values of the three combinations.)
The relevant function to generate values is sys_op_combined_hash()
Regards
Jonathan Lewis

3. Re: Extended statistics issue
RobK Jan 8, 2014 1:40 PM (in response to RobK)Thanks Martin and Jonathan!
I am honored that you spent time on my question.
Few related thoughts:
 I created this testcase as a demo to show how useful extended statistics are. In this case having no stats at all is better as dynamic sampling does much better work
 This example is a bit unusual because I changed the stats collection defaults by setting estimate percent to 100%. In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table
 Could you please also show in more detail how sys_op_combined_hash() could be used. Shall I fake statistics with it? How?
Thanks,
Rob.

4. Re: Extended statistics issue
Jonathan Lewis Jan 8, 2014 5:02 PM (in response to RobK)select sys_op_combined_hash(1,2) from dual;
compare with the contents of the histogram (user_tab_histograms) for the column and you'll see how to generate values.
Then read my blog about faking frequency histograms for the type of code you need: Frequency Histograms  Oracle Scratchpad
Regards
Jonathan Lewis

5. Re: Extended statistics issue
Hemant K Chitale Jan 8, 2014 9:10 PM (in response to RobK)>In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table
The values were not in the table when statistics were gathered. However, Oracle can't confirm or deny whether values were inserted 1second after the gather_table_stats. So it has to come up with an estimate for a value not present in the histogram.
Hemant K Chitale

6. Re: Extended statistics issue
Aliyev Chinar Jan 9, 2014 6:05 AM (in response to RobK)RobK wrote:
SQL> select * from col_stats where a=1 and b=1;
Execution Plan

Plan hash value: 1829175627

 Id  Operation  Name  Rows  Bytes  Cost (%CPU) Time 

 0  SELECT STATEMENT   50000  683K 177 (2) 00:00:03 
* 1  TABLE ACCESS FULL COL_STATS  50000  683K 177 (2) 00:00:03 

Predicate Information (identified by operation id):

1  filter("A"=1 AND "B"=1)
How come the optimizer expects 50000 rows?
Thanks in advance.
Rob
This is an expected behavior.
When you create extended statistics then it creates histogram for column groups(and it is virtual column).
The query predicate "where a=1 and b=1" is actually outofrange predicate for that virtual column. In such cases optimizer should be estimate
selectivity as 0 (so cardinality 1) but in such cases optimizer uses density(in our case actually used Newdensity) of column(or your virtual columns).
Let see following information(this is exact your case)
NUM_ROWS

200000
COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM
   
A 2 .00000250 FREQUENCY
B 2 .00000250 FREQUENCY
SYS_STUNA$6DVXJXTP05EH56DTIR0X 2 .00000250 FREQUENCY
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
  
A 100000 1
A 200000 2
B 100000 1
B 200000 2
SYS_STUNA$6DVXJXTP05EH56DTIR0X 100000 1977102303
SYS_STUNA$6DVXJXTP05EH56DTIR0X 200000 7894566276
Your predicate is "where a=1 and b=1" and it is equivalent with "where SYS_STUNA$6DVXJXTP05EH56DTIR0X = sys_op_combined_hash (1, 1)"
As you know with frequency histogram selectivity for equ(=) predicate is (E_endpointB_Endpoint)/num_rows. Here predicate value has located
between E_endpoint and B_Endpoint histogram buckets(endpoint numbers). But sys_op_combined_hash (1, 1) = 7026129190895635777. So then how can
i compare this value and according histogram endpoint values?. Answer is when creating histogram oracle do not use exact sys_op_combined_hash(x,y)
but it also apply MOD function, so you have to compare MOD (sys_op_combined_hash (1, 1), 9999999999)(which is equal 1598248696) with endpoint values
. So 1598248696 this is not locate between any endpoint number. Due to optimizer use NewDensity as density(in this case can not endpoint inf)
In below trace file you clearly can see that
BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: COL_STATS Alias: COL_STATS
#Rows: 200000 #Blks: 382 AvgRowLen: 18.00
Access path analysis for COL_STATS
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for COL_STATS[COL_STATS]
Column (#1):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#2):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#3):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 2.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.2500
Table: COL_STATS Alias: COL_STATS
Card: Original: 200000.000000 Rounded: 50000 Computed: 50000.00 Non Adjusted: 50000.00
Access Path: TableScan
Cost: 107.56 Resp: 107.56 Degree: 0
Cost_io: 105.00 Cost_cpu: 51720390
Resp_io: 105.00 Resp_cpu: 51720390
Best:: AccessPath: TableScan
Cost: 107.56 Degree: 1 Resp: 107.56 Card: 50000.00 Bytes: 0
Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
CBO used column groups statistic and estimated cardinality was 200000*0.25=50000.
Remember that they are permanent statistics and RDBMS gathered they by analyzing actual table data(Even correlation columns data).
But dynamic sampling can be good in your above situation, due to it is calculate selectivity in run time using sampling method together real predicate.
For other situation you can see extends statistics is great help for estimation like where a=2 and b=1 because this is actual data and according information(stats/histograms) stored in dictionary.
SQL> select * from col_stats where a=2 and b=1;
Execution Plan

Plan hash value: 1829175627

 Id  Operation  Name  Rows  Bytes  Cost (%CPU) Time 

 0  SELECT STATEMENT   100K 585K 108 (3) 00:00:02 
* 1  TABLE ACCESS FULL COL_STATS  100K 585K 108 (3) 00:00:02 

Predicate Information (identified by operation id):

1  filter("A"=2 AND "B"=1)
and from trace file
Table Stats::
Table: COL_STATS Alias: COL_STATS
#Rows: 200000 #Blks: 382 AvgRowLen: 18.00
Access path analysis for COL_STATS
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for COL_STATS[COL_STATS]
Column (#1):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#2):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#3):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 2.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.5000
Table: COL_STATS Alias: COL_STATS
Card: Original: 200000.000000 Rounded: 100000 Computed: 100000.00 Non Adjusted: 100000.00
Access Path: TableScan
Cost: 107.56 Resp: 107.56 Degree: 0
Cost_io: 105.00 Cost_cpu: 51720365
Resp_io: 105.00 Resp_cpu: 51720365
Best:: AccessPath: TableScan
Cost: 107.56 Degree: 1 Resp: 107.56 Card: 100000.00 Bytes: 0
Lets calculate:
MOD (sys_op_combined_hash (2, 1), 9999999999)=1977102303 and for it (e_endpointb_enpoint)/num_rows=(200000100000)/200000=0.5
and result card=sel*num_rows(or just e_endpointb_enpoint)=100000.

7. Re: Extended statistics issue
Jonathan Lewis Jan 9, 2014 9:49 AM (in response to Aliyev Chinar)Aliyev,
Good catch on the mod(,9999999999), I'd forgotten about that  which won't have helped the OP very much if he was trying to generate values for a fake histogram.
Aliyev Chinar wrote:
Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
Which version of Oracle are using  that's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. The NewDensity is calculated to give half the selectivity of the least popular value captured in the histogram. Here, for example is the NewDensity of a column group that holds 4 distinct values:
NewDensity:0.022727, OldDensity:0.000023 BktCnt:22000, PopBktCnt:22000, PopValCnt:4, NDV:4
Column (#3): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
AvgLen: 12 NDV: 4 Nulls: 0 Density: 0.022727 Min: 1598248696 Max: 7894566276
Histogram: Freq #Bkts: 4 UncompBkts: 22000 EndPtVals: 4
There are 22,000 rows captured by the histogram, and the least frequently occurring value accounts for 1,000 of them, so the NewDensity is (1,000/22,000)/2 = 1/44
0.0227272727272...
Regards
Jonathan Lewis

8. Re: Extended statistics issue
Aliyev Chinar Jan 9, 2014 11:28 PM (in response to Jonathan Lewis)Jonathan Lewis wrote:
Aliyev,
Good catch on the mod(,9999999999), I'd forgotten about that  which won't have helped the OP very much if he was trying to generate values for a fake histogram.
Aliyev Chinar wrote:
Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
Which version of Oracle are using  that's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. T
Of course , it is not generic formula. I just mentioned that. for above trace file.

9. Re: Extended statistics issue
Jonathan Lewis Jan 10, 2014 1:11 AM (in response to Aliyev Chinar)Aliyev Chinar wrote:
Jonathan Lewis wrote:
Aliyev,
Good catch on the mod(,9999999999), I'd forgotten about that  which won't have helped the OP very much if he was trying to generate values for a fake histogram.
Aliyev Chinar wrote:
Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
Which version of Oracle are using  that's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. T
Of course , it is not generic formula. I just mentioned that. for above trace file.
Let me rephrase the question  beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why chose to insert the num_distinct at that point ?
Regards
Jonathan Lewis

10. Re: Extended statistics issue
Aliyev Chinar Jan 10, 2014 6:45 AM (in response to Jonathan Lewis)
Let me rephrase the question  beyond the fact that the num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why chose to insert the num_distinct at that point ?
Regards
Jonathan Lewis
Even here is not histograms stats(in simple cases) we know that the term "density" calculated as 1/num_distict. So it give me idea about density and NDV relationship. I want to note that again. In this case my guess was only above trace file and it is not about general cases(HB histogram ,even FH`s special cases and oracle version related inf.).

11. Re: Extended statistics issue
Jonathan Lewis Jan 10, 2014 8:31 AM (in response to Aliyev Chinar)Aliyev,
Thanks.
When I read your first comments I didn't realise it was an educated guess.
Regards
Jonathan Lewis