Extended statistics issue
Hello!
I have a problem with extended statistics on Oracle 11.2.0.3.
Here is the script I run:
drop table col_stats;

create table col_stats as
select 1 a, 2 b
from dual
connect by level <= 100000;

insert into col_stats (
select 2, 1
from dual
connect by level <= 100000);
-- check the a,b distribution

         A          B   COUNT(1)
---------- ---------- ----------
         2          1     100000
         1          2     100000
-- extended stats definition
select dbms_stats.create_extended_stats('A','COL_STATS','(A,B)') name
from dual;
-- set estimate_percent to 100%
EXEC dbms_stats.SET_TABLE_prefs ('A','COL_STATS','ESTIMATE_PERCENT',100);
-- check the change
select dbms_stats.get_prefs ('ESTIMATE_PERCENT','A','COL_STATS')
from dual;
-- now gather column stats
BEGIN
DBMS_STATS.GATHER_TABLE_STATS (
OWNNAME => 'A',
TABNAME => 'COL_STATS',
METHOD_OPT => 'FOR ALL COLUMNS' );
END;
/
set autotrace traceonly explain

SQL> select * from col_stats where a=1 and b=1;
Execution Plan
----------------------------------------------------------
Plan hash value: 1829175627

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           | 50000 |   683K|   177   (2)| 00:00:03 |
|*  1 |  TABLE ACCESS FULL| COL_STATS | 50000 |   683K|   177   (2)| 00:00:03 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("A"=1 AND "B"=1)
How come the optimizer expects 50000 rows?
Thanks in advance.
Rob
Best Answer

I think that's a result of Oracle's strategy for calculating cardinalities with frequency histograms when there are only a few known values: "If the value you supply does not appear in the histogram, but is inside the low/high range of the histogram, then the cardinality will be half the cardinality of the least frequently occurring value that is in the histogram." (Histogram change - Oracle Scratchpad.)
In your case there are two known values, each with 100K rows, so they are also the least frequently occurring values, and the calculation gives 100K/2 = 50K.
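The quoted rule can be sketched numerically. This is a toy Python model of the estimate, not Oracle internals, and the `missing_value_cardinality` helper is hypothetical:

```python
# Toy model of the rule quoted above: for a value that is absent from a
# frequency histogram but inside its low/high range, the optimizer
# estimates half the cardinality of the least frequent captured value.
def missing_value_cardinality(histogram):
    """histogram: dict mapping value -> row count captured for it."""
    return min(histogram.values()) / 2

# The OP's column group: two captured values, 100000 rows each, so a
# query for any other in-range combination is estimated at 50000 rows.
print(missing_value_cardinality({(1, 2): 100000, (2, 1): 100000}))  # 50000.0
```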
Answers

(Best Answer, quoted above.)

Martin,
Absolutely spot on.
There's not a lot you can do about this situation if you want to query for a value that probably doesn't exist when the only other values have very high frequencies; you're reduced to faking it programmatically.
This example introduces an interesting problem: how do you know which combinations are inside the range (and get the "half the least common" selectivity) and which are outside the range? If the OP changes the initial (1,2) to (1,1) and then queries for (1,2), he'll find that the cardinality is one (because of the relationships between the hashed values of the three combinations).
The relevant function for generating values is sys_op_combined_hash().
Regards
Jonathan Lewis

Thanks Martin and Jonathan!
I am honored that you spent time on my question.
A few related thoughts:
- I created this test case as a demo to show how useful extended statistics are. In this case having no stats at all is better, as dynamic sampling does a much better job.
- This example is a bit unusual because I changed the stats collection defaults by setting estimate_percent to 100%. In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table.
- Could you please also show in more detail how sys_op_combined_hash() could be used? Should I fake statistics with it? How?
Thanks,
Rob.

select sys_op_combined_hash(1,2) from dual;
Compare the result with the contents of the histogram (user_tab_histograms) for the column and you'll see how to generate values.
Then read my blog about faking frequency histograms for the type of code you need: Frequency Histograms - Oracle Scratchpad.
Regards
Jonathan Lewis

>In this particular case Oracle could have known that values that do not appear in the histogram are simply not in the table
The values were not in the table when statistics were gathered. However, Oracle can't confirm or deny whether values were inserted one second after gather_table_stats ran, so it has to come up with an estimate for a value not present in the histogram.
Hemant K Chitale

RobK wrote: SQL> select * from col_stats where a=1 and b=1; [execution plan above: TABLE ACCESS FULL COL_STATS, Rows 50000] How come the optimizer expects 50000 rows? Thanks in advance. Rob
This is expected behavior.
When you create extended statistics, a histogram is created for the column group (which is a virtual column).
The query predicate "where a=1 and b=1" is actually an out-of-range predicate for that virtual column. In such cases you might expect the optimizer to estimate the
selectivity as 0 (and so cardinality 1), but instead it uses the density (in our case, actually, NewDensity) of the column (or your virtual column).
Let's look at the following information (this is exactly your case):
  NUM_ROWS
----------
    200000

COLUMN_NAME                    NUM_DISTINCT    DENSITY HISTOGRAM
------------------------------ ------------ ---------- ---------
A                                         2  .00000250 FREQUENCY
B                                         2  .00000250 FREQUENCY
SYS_STUNA$6DVXJXTP05EH56DTIR0X            2  .00000250 FREQUENCY

COLUMN_NAME                    ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ --------------- --------------
A                                       100000              1
A                                       200000              2
B                                       100000              1
B                                       200000              2
SYS_STUNA$6DVXJXTP05EH56DTIR0X          100000     1977102303
SYS_STUNA$6DVXJXTP05EH56DTIR0X          200000     7894566276
Your predicate is "where a=1 and b=1", which is equivalent to "where SYS_STUNA$6DVXJXTP05EH56DTIR0X = sys_op_combined_hash(1, 1)".
As you know, with a frequency histogram the selectivity for an equality (=) predicate is (E_endpoint - B_endpoint)/num_rows, where the predicate value lies between the B_endpoint and E_endpoint histogram buckets (endpoint numbers). But sys_op_combined_hash(1, 1) = 7026129190895635777, so how can
we compare this value with the histogram endpoint values? The answer is that when creating the histogram Oracle does not use the exact sys_op_combined_hash(x, y) value:
it also applies a MOD function, so you have to compare MOD(sys_op_combined_hash(1, 1), 9999999999), which equals 1598248696, with the endpoint values.
1598248696 does not match any histogram endpoint, so the optimizer falls back to NewDensity as the density.
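As a sanity check, the lookup described above can be sketched in Python. This is a toy model, not Oracle internals; the endpoint pairs come from the USER_TAB_HISTOGRAMS output above, and the mapping of 7894566276 to the (1,2) combination is inferred:

```python
# Toy model of the frequency-histogram lookup for the column group.
# Endpoint numbers are cumulative row counts; the keys are the
# MOD-reduced column-group hashes shown above.
num_rows = 200000
new_density = 0.25   # NewDensity reported in the 10053 trace below

histogram = {                       # value -> (previous endpoint, endpoint)
    1977102303: (0, 100000),        # mod(sys_op_combined_hash(2,1), 9999999999)
    7894566276: (100000, 200000),   # presumably the (1,2) combination
}

def selectivity(hash_value):
    if hash_value in histogram:
        b_ep, e_ep = histogram[hash_value]
        return (e_ep - b_ep) / num_rows   # (E_endpoint - B_endpoint) / num_rows
    return new_density                    # value missing from the histogram

# mod(sys_op_combined_hash(1,1), 9999999999) = 1598248696 -> not captured
print(round(selectivity(1598248696) * num_rows))   # 50000
# captured value (a=2, b=1) -> full bucket cardinality
print(round(selectivity(1977102303) * num_rows))   # 100000
```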
You can see this clearly in the trace file below:
BASE STATISTICAL INFORMATION
***********************
Table Stats::
Table: COL_STATS Alias: COL_STATS
#Rows: 200000 #Blks: 382 AvgRowLen: 18.00
Access path analysis for COL_STATS
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for COL_STATS[COL_STATS]
Column (#1):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#2):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#3):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 2.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.2500
Table: COL_STATS Alias: COL_STATS
Card: Original: 200000.000000 Rounded: 50000 Computed: 50000.00 Non Adjusted: 50000.00
Access Path: TableScan
Cost: 107.56 Resp: 107.56 Degree: 0
Cost_io: 105.00 Cost_cpu: 51720390
Resp_io: 105.00 Resp_cpu: 51720390
Best:: AccessPath: TableScan
Cost: 107.56 Degree: 1 Resp: 107.56 Card: 50000.00 Bytes: 0
Note that NewDensity is calculated as 1/(2*num_distinct) = 1/4 = 0.25 for this frequency histogram!
The CBO used the column-group statistics, and the estimated cardinality was 200000*0.25 = 50000.
Remember these are permanent statistics: the RDBMS gathered them by analyzing the actual table data (including the correlated columns' data).
Dynamic sampling can do well in your situation above because it calculates selectivity at run time, sampling with the real predicate.
In other situations you can see that extended statistics are a great help for estimation, e.g. "where a=2 and b=1", because that combination is actual data and the corresponding information (stats/histograms) is stored in the dictionary.
SQL> select * from col_stats where a=2 and b=1;
Execution Plan
----------------------------------------------------------
Plan hash value: 1829175627

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   100K|   585K|   108   (3)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| COL_STATS |   100K|   585K|   108   (3)| 00:00:02 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("A"=2 AND "B"=1)
And from the trace file:
Table Stats::
Table: COL_STATS Alias: COL_STATS
#Rows: 200000 #Blks: 382 AvgRowLen: 18.00
Access path analysis for COL_STATS
***************************************
SINGLE TABLE ACCESS PATH
Single Table Cardinality Estimation for COL_STATS[COL_STATS]
Column (#1):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#2):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
Column (#3):
NewDensity:0.250000, OldDensity:0.000003 BktCnt:200000, PopBktCnt:200000, PopValCnt:2, NDV:2
ColGroup (#1, VC) SYS_STUNA$6DVXJXTP05EH56DTIR0X
Col#: 1 2 CorStregth: 2.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.5000
Table: COL_STATS Alias: COL_STATS
Card: Original: 200000.000000 Rounded: 100000 Computed: 100000.00 Non Adjusted: 100000.00
Access Path: TableScan
Cost: 107.56 Resp: 107.56 Degree: 0
Cost_io: 105.00 Cost_cpu: 51720365
Resp_io: 105.00 Resp_cpu: 51720365
Best:: AccessPath: TableScan
Cost: 107.56 Degree: 1 Resp: 107.56 Card: 100000.00 Bytes: 0
Let's calculate:
MOD(sys_op_combined_hash(2, 1), 9999999999) = 1977102303, and for it (e_endpoint - b_endpoint)/num_rows = (200000 - 100000)/200000 = 0.5,
so the resulting card = sel*num_rows (or just e_endpoint - b_endpoint) = 100000.

Aliyev,
Good catch on the mod(,9999999999), I'd forgotten about that, which won't have helped the OP very much if he was trying to generate values for a fake histogram.
Aliyev Chinar wrote: Note that NewDensity calculated as 1/(2*num_distinct)= 1/4=0.25 for frequency histogram!.
Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using. NewDensity is calculated to give half the selectivity of the least popular value captured in the histogram. Here, for example, is the NewDensity of a column group that holds 4 distinct values:
NewDensity:0.022727, OldDensity:0.000023 BktCnt:22000, PopBktCnt:22000, PopValCnt:4, NDV:4
Column (#3): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
AvgLen: 12 NDV: 4 Nulls: 0 Density: 0.022727 Min: 1598248696 Max: 7894566276
Histogram: Freq #Bkts: 4 UncompBkts: 22000 EndPtVals: 4
There are 22,000 rows captured by the histogram, and the least frequently occurring value accounts for 1,000 of them, so the NewDensity is (1,000/22,000)/2 = 1/44 = 0.0227272727272...
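A quick check of that arithmetic in Python. The split of the remaining 21,000 rows across the other three values is assumed for illustration; only the minimum matters:

```python
# Half the selectivity of the least popular value captured in the
# frequency histogram gives the NewDensity quoted above.
bucket_counts = [1000, 7000, 7000, 7000]   # assumed split: 4 values, 22000 rows
total = sum(bucket_counts)                 # 22000
new_density = (min(bucket_counts) / total) / 2
print(round(new_density, 6))   # 0.022727
```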
Regards
Jonathan Lewis

Jonathan Lewis wrote: Which version of Oracle are you using? That's not the generic calculation for NewDensity with frequency histograms in the versions I've been using.
Of course, it is not a generic formula. I just mentioned it for the above trace file.

Aliyev Chinar wrote: Of course, it is not a generic formula. I just mentioned it for the above trace file.
Let me rephrase the question: beyond the fact that num_distinct happens to be 2 in this particular trace file and you needed to generate the number 4, is there any reason why you chose to insert num_distinct at that point?
Regards
Jonathan Lewis

Jonathan Lewis wrote: is there any reason why you chose to insert the num_distinct at that point?
Even when there are no histogram stats (in simple cases), we know that the term "density" is calculated as 1/num_distinct, so that gave me the idea of a relationship between density and NDV. I want to note again: in this case my guess was based only on the above trace file; it is not about the general cases (height-balanced histograms, the special cases of frequency histograms, and version-specific behaviour).

Aliyev,
Thanks.
When I read your first comments I didn't realise it was an educated guess.
Regards
Jonathan Lewis