I was reading up on extended statistics (although I have never actually used them) and I see that they should help in estimating cardinality when a query has multiple columns in the WHERE clause. Is anyone using extended column statistics in their database? My main doubt is: will extended column statistics help in deciding the join order and join type when multiple tables are joined together?
Would it make sense to gather extended statistics on columns
table A - id,id1
table B - id,id1
table C - id,id2
Will this help in deciding the join order? Is there a way to gather statistics specifically for particular joins, if we know a particular set of columns will be joined frequently, or should we go for a cluster in that case?
Extended statistics help you when columns are related to each other, e.g. if you had 10 distinct values of col_1 and 10 distinct values of col_2, filtering on col_1 = :y and col_2 = :z could have a selectivity of 1/100 if they're unrelated or 1/10 if they're completely related.
You get the same sort of statistic when you index the two columns together, as the index then records the number of distinct keys.
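If you'd rather have the statistic without the index, a minimal sketch of creating a column group looks like this (table T1 and columns COL_1/COL_2 are hypothetical stand-ins for the example above):

```sql
-- CREATE_EXTENDED_STATS is a function: it returns the system-generated
-- name of the virtual column that represents the group.
SELECT dbms_stats.create_extended_stats(
         ownname   => user,
         tabname   => 'T1',
         extension => '(COL_1, COL_2)'
       ) AS extension_name
FROM   dual;

-- Regather statistics so the optimizer can see the column-group NDV.
BEGIN
  dbms_stats.gather_table_stats(user, 'T1',
    method_opt => 'for all columns size auto');
END;
/
```

After the regather, the combined number of distinct values for the pair drives the selectivity calculation instead of the 1/100 "multiply the single-column selectivities" guess.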
Try it and see?
Do you actually have a problem with what the optimizer has come up with?
Yes, I have used them, and they can be beneficial: they'll help the optimizer to estimate the correct cardinality where multiple related columns are involved as WHERE clause predicates (as Andrew has explained) -- check out these URLs for an overview:
Extended statistics are often of critical importance to get the correct join order. So much so that Uncle Oracle has provided a way to detect the need for them automatically; see the dbms_stats.seed_col_usage / dbms_stats.report_col_usage / dbms_stats.create_extended_stats routines.
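A sketch of that detection workflow, assuming you can run a representative workload while monitoring is active (table name A is just the example table from the question):

```sql
-- Monitor column usage for 300 seconds.
BEGIN
  dbms_stats.seed_col_usage(NULL, NULL, 300);
END;
/

-- ... run the representative workload while monitoring is active ...

-- Report which column groups Oracle thinks the workload needs.
SELECT dbms_stats.report_col_usage(user, 'A') FROM dual;

-- Create the extensions recommended for the monitored workload
-- (the two-argument overload creates all seeded extensions).
SELECT dbms_stats.create_extended_stats(user, 'A') FROM dual;
```

You still need to gather statistics afterwards for the new extensions to have values the optimizer can use.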
Anywhere you see filter or join predicates across correlated columns, you need them. Predicates on columns to do with dates or addresses are common culprits: this type of information is often denormalized. Such as
...where cust_city='Passau' and cust_state_province='Bayern'
in the sh.customers demo table.
An alternative that can get you out of trouble quickly is to set optimizer_dynamic_sampling=4
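To make the two options concrete, here is a sketch for the sh.customers example (the column group call is the standard dbms_stats function; the session setting is the quick workaround mentioned above):

```sql
-- Column group for the correlated city/province columns.
SELECT dbms_stats.create_extended_stats('SH', 'CUSTOMERS',
         '(CUST_CITY, CUST_STATE_PROVINCE)')
FROM   dual;

BEGIN
  dbms_stats.gather_table_stats('SH', 'CUSTOMERS');
END;
/

-- Quick alternative: let the optimizer sample at parse time instead.
ALTER SESSION SET optimizer_dynamic_sampling = 4;
```

Dynamic sampling at level 4 samples tables that have predicates on two or more columns, so it catches the same correlation without any stored statistics, at the cost of extra work at parse time.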
Others have already commented on the benefits of column groups for getting better cardinality estimates and plans (including join order) when multiple columns from the same table are used in the where clause - there are a few articles on limitations, surprises and defects (and benefits) on my blog.
For the example you've given you may find some benefit if you create a column group on both a(id, id1) and b(id, id1) ... though if one table has an index on just those two columns you need only create a column group on the other table to get the benefit. I am fairly sure that you won't see any change for that query if you create a column group on c(id, id2), because you're not comparing its columns with two columns from the SAME table: you're comparing with one column from a and one from b, and the "product of predicates" will probably mean that the a and b single-column selectivities override the c column group selectivity.
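A sketch of the suggested setup, assuming neither table already has an index on the pair:

```sql
-- Column groups on the two tables that share the two-column join.
SELECT dbms_stats.create_extended_stats(user, 'A', '(ID, ID1)') FROM dual;
SELECT dbms_stats.create_extended_stats(user, 'B', '(ID, ID1)') FROM dual;

-- Gather so the group NDVs are populated for the optimizer.
BEGIN
  dbms_stats.gather_table_stats(user, 'A');
  dbms_stats.gather_table_stats(user, 'B');
END;
/
```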
On the other hand - if you change the predicate "a.id = c.id" to "b.id = c.id", or change the predicate "b.id1 = c.id1" to "a.id1 = c.id1" [Updated to remove error: the b/c predicate is on id2, not id1], then (before you've created any column groups) you may find that the cardinality figures change and the join order changes. (Transitive closure that the human eye can see, but the optimizer isn't coded to handle, makes either modification legal.) If you do make the code change then a column group on c(id, id2) plus a column group on b(id, id2) [Updated to get correct column groups relating to id2] could then have an effect.
Thanks for the reply. I went through the post on your blog. Posting it here for future readers
I have another question on how the optimizer would decide the join type based on extended stats.
Let us assume that
join order is A ->B->C
a.id4=66 returns 9999 rows; b.id7=44 returns 9999 rows
Total rows in A and B = 10000
After further filtering with join
a.id=b.id and a.id1=b.id1 satisfies 3800 rows
We have extended statistics on
Considering that there is a histogram on these extended stats like
How do these stats help in deciding the join type? We do not have the filter columns included in the extended stats, i.e. a.id4 and b.id7.
The data distribution is such that a.id,a.id1 has 9999 distinct values with 1000 rows for each combination, and similarly for b.id,b.id1. With these statistics the optimizer may think that it has to return a large number of records, but in fact if we include the condition a.id4=66 it returns only 1 row from each distinct combination of id,id1, i.e. 9999 rows in total, and let's say the same holds for table B. Wouldn't adding the filter condition from the WHERE clause to the extended stats help?
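One experiment along those lines: the method_opt syntax of dbms_stats can create a column group that includes the filter column and gather its statistics in one call. This is only a sketch of what you're asking about; whether the optimizer actually uses a three-column group depends on the query's predicates matching the full group.

```sql
-- Hypothetical column group (id, id1, id4) on table A;
-- "size 254" also builds a histogram on the group.
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'A',
    method_opt => 'for columns (id, id1, id4) size 254'
  );
END;
/
```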