Forum Stats

  • 3,741,519 Users
  • 2,248,442 Discussions
  • 7,861,847 Comments

Discussions

Extended column statistics for Join order and Join type

Sekar_BLUE4EVER
Sekar_BLUE4EVER Member Posts: 324 Blue Ribbon
edited Mar 8, 2018 4:35AM in General Database Discussions

I was reading up on extended statistics(although I have never actually used them) and I see that they should be helpful in estimating the cardinality when the query contains multiple columns in WHERE caluse. Is anyone using extended column statistics in their database? Now my main doubt is will extended column statistics help in deciding the join order and join type when there are multiple tables joined together and join type ?

  select a.*,b.*,c.* from a,b,c where a.id=b.id and a.id1=b.id1 and a.id=c.id and b.id2=c.id2   and a.id4=66 and b.id7=44 and c.id88=88

Would it make sense to gather extended statistics on columns

  table A - id,id1  table B - id,id1   table C - id,id2

Will this help in deciding the join order? Is there a way to gather statistics specifically for particular joins if we know a particular set columns will be joined frequently or should we go for a cluster in that case?

Tagged:

Answers

  • AndrewSayer
    AndrewSayer Member Posts: 12,998 Gold Crown
    edited Mar 6, 2018 7:02AM
    Sekar_BLUE4EVER wrote:I was reading up on extended statistics(although I have never actually used them) and I see that they should be helpful in estimating the cardinality when the query contains multiple columns in WHERE caluse. Is anyone using extended column statistics in their database? Now my main doubt is will extended column statistics help in deciding the join order and join type when there are multiple tables joined together and join type ?
    1. selecta.*,b.*,c.*froma,b,cwherea.id=b.idanda.id1=b.id1anda.id=c.idandb.id2=c.id2
    2. anda.id4=66andb.id7=44andc.id88=88
     select a.*,b.*,c.* from a,b,c where a.id=b.id and a.id1=b.id1 and a.id=c.id and b.id2=c.id2   and a.id4=66 and b.id7=44 and c.id88=88
    Would it make sense to gather extended statistics on columnstable A - id,id1table B - id,id1table C - id,id2Will this help in deciding the join order? Is there a way to gather statistics specifically for particular joins if we know a particular set columns will be joined frequently or should we go for a cluster in that case?

    Maybe.

    Extended statistics help you when columns are related to eachother, e.g. if you had 10 values of col_1 and 10 values of col_2, filtering on col_1 =:y and col_2 = :z could be a selectivity of 1/100 if they're unrelated or 1/10 if they're completely related.

    You get the same sort of statistic when you index two columns together, as you now have distinct keys.

    Try it and see?

    Do you actually have a problem with what the optimizer has come up with?

  • SeánMacGC
    SeánMacGC Member Posts: 2,914 Gold Trophy
    edited Mar 6, 2018 7:28AM

    Yes, I have used them, and they can be beneficial: they'll help the optimizer to estimate the correct cardinality where multiple related columns are involved as WHERE clause predicates (as Andrew has explained) -- check out these URLs for an overview:

    https://blogs.oracle.com/optimizer/extended-statistics

    https://blogs.oracle.com/optimizer/how-do-i-know-what-extended-statistics-are-needed-for-a-given-workload

  • JohnWatson2
    JohnWatson2 Member Posts: 4,268 Silver Crown
    edited Mar 6, 2018 8:02AM

    Extended statistics are often of critical importance to get correct join order. So much so that Uncle Oracle has provided a way to detect the need for them automatically, see the dbms_stats.seed_col usage / dbms_stats.report_col_usage /  dbms_stats.create_extended_stats routine.

    Anywhere you see filter or join predicates across correlated columns, you need them. Predicates on columns to do with dates or addesses are common culprits: this type of information is often denormalized. Such as

    ...where cust_city='Passau' and cust_state_province='Bayern'

    in the sh.customers demo table.

    An alternative that can get you out of trouble quickly is to set optimizer_dynamic_sampling=4

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,681 Gold Crown
    edited Mar 6, 2018 10:15AM

    Others have already commented on the benefits of column groups for getting better cardinality estimates and plans (including join order) when multiple columns from the same table are used in the where clause  - there are a few articles on limitations, surprises and defects (and benefits) on my blog.

    For the example you've given you may find some benefit if you create a column group on both a (id, id1) and b (id, id1) ... though if one table has an index on just those two columns you need only create a column group on the other table to get the benefit. I am fairly sure that you won't see any change for that query if you create a column group on c(id, id2) because you're not comparing them with two columns from the SAME table, you're comparing with one column from a and one from b and the "product of predicates" will probably mean that the a and b single selectivities will override the c column group selectivity.

    On the other hand - if you change the predicate "a.id = c.id" to "b.id = c.id", or change the predicate "b.id1 = c.id1" to "a.id1 = c.id1" [Updated to remove error: the b/c predicate is on id2, not id1] then (before you've created any column groups) you may find that the cardinality figures change and the join order changes.  (Transitive closure that the human eye can see but the optimizer isn't coded to handle makes either modification legal.)  If you do make the code change then a column group on c (id, id2) plus a column group on b(id, id2) [Updated to get correct colukmn groups relating to id2] could then have an effect.

    Regards

    Jonathan Lewis

    Sekar_BLUE4EVER
  • Sekar_BLUE4EVER
    Sekar_BLUE4EVER Member Posts: 324 Blue Ribbon
    edited Mar 8, 2018 4:35AM

    Hi Jonathan,

         Thanks for the reply. I went through the post in your blog .Posting it here for future readers

    https://jonathanlewis.wordpress.com/2018/03/08/column-groups-4/

    I have another question on how the optimizer would decide the join type based on extended stats.

    Let us Assume that

    join order is A ->B->C

    a.id4=66 returns 9999 rowsb,id7=44 returns 9999 rows 

    Total rows in A and B = 10000

    After further filtering with join

    a.id=b.id and a.id1=b.id1 satisfies  3800 rows 

    We have extended statistics on

    (a.id,a.id1) (b.id,b.id1)

    Considering that  there is a histogram on these extended stats like

    How do these stats help in deciding the join type? We do not have the filter condition included in the extended stats i.e a.id4 and b.id7

    The data distribution is in such a way that a.id,a.id1 has 9999 distinct values with 1000  row for each  combination and similary for b.id,b.id1 .With this statistics optimizer may think that i have might have to return a large number of records but in fact if we include the condition a.id4=66 returns only 1 row from each distinct combination of id,id1 i.e in total  9999 rows and lets say the same case for table B. Wouldn't adding the condition in where clause to extended stats help ?

This discussion has been closed.