Forum Stats

  • 3,769,482 Users
  • 2,252,969 Discussions
  • 7,875,047 Comments

Discussions

Hash join - hashed table in memory

Sekar_BLUE4EVER
Sekar_BLUE4EVER Member Posts: 324 Blue Ribbon
edited Jan 11, 2018 1:22PM in General Database Discussions

Hi I have a complex query for which I am trying to figure out the optimal tuning order . I am proceeding with the following assumption

start with the table and condition which has the least number  of rows and keep the large tables for join at the end.

So I have decided an order for my first four tables as follows .

select /*+gather_plan_statistics leading( A D C E) */ A.bsid,C.trqty AS S_QTY,C.CR AS EX_CM from tbl1 A INNER JOIN tbl2 D ON D.EID = A.EID AND D.INID = A.bsid AND D.NUMBEME = 'QUCD' INNER JOIN tbl3 C ON C.SID = A.bsid AND C.EID = A.EID INNER JOIN tbl4 E ON  E.EID = A.EID AND E.ACNO = (CASE WHEN C.SECMOVTYPE = 'DELI' THEN C.SACC ELSE C.CTRSECACC END) AND      E.ACERID = (CASE WHEN C.STYPE = 'DELI' THEN C.ENTTRD ELSE C.ENTCTID END)

Plan

--------------------------------------------------------------------------------------------------------------------------| Id  | Operation               | Name                | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |--------------------------------------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT        |                     |      1 |        |    483K|00:00:01.35 |     520K|  25480 |       ||   1 |  NESTED LOOPS           |                     |      1 |     18 |    483K|00:00:01.35 |     520K|  25480 |       ||*  2 |   HASH JOIN             |                     |      1 |    241K|    617K|00:00:00.53 |     216K|  25449 |  1035K|  1035K| 1651K (0)||   3 |    NESTED LOOPS         |                     |      1 |   5843 |   6166 |00:00:00.02 |    5844 |    108 |       ||   4 |     INDEX FAST FULL SCAN| IDX_tbl1_02         |      1 |   9393 |   9432 |00:00:00.01 |      95 |     87 |       ||*  5 |     INDEX UNIQUE SCAN   | IDX_tbl2_01         |   9432 |      1 |   6166 |00:00:00.01 |    5749 |     21 |       ||   6 |    TABLE ACCESS FULL    | tbl3                |      1 |    120K|    121K|00:00:00.33 |     210K|  25341 |       ||*  7 |   INDEX UNIQUE SCAN     | IDX_tbl4            |    617K|      1 |    483K|00:00:00.73 |     304K|     31 |       |--------------------------------------------------------------------------------------------------------------------------

Then I tried adding a fifth table G

select /*+gather_plan_statistics leading( A D C E) */ A.bsid,C.trqty AS S_QTY,C.CR AS EX_CM from tbl1 A INNER JOIN tbl2 D ON D.EID = A.EID AND D.INID = A.bsid AND D.NUMBEME = 'QUCD' INNER JOIN tbl3 C ON C.SID = A.bsid AND C.EID = A.EID INNER JOIN tbl4 E ON  E.EID = A.EID AND E.ACNO = (CASE WHEN C.SECMOVTYPE = 'DELI' THEN C.SACC ELSE C.CTRSECACC END) AND E.ACERID = (CASE WHEN C.STYPE = 'DELI' THEN C.ENTTRD ELSE C.ENTCTID END) INNER JOIN  tbl5 G ON G.LANG_ID = 1 and G.NAME = 'd_TrdStatus' AND G.VALUE = C.CURUS and G.CODE_ID = 2000000110

Line number 07 is the fifth table G which has been newly added.

Plan

---------------------------| Id  | Operation                    | Name                   | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |-----------------------------------------------------------------------------------------------------------------------------------------------------|   0 | SELECT STATEMENT             |                        |      1 |        |    483K|00:00:01.35 |     474K|  25471 ||*  1 |  HASH JOIN                   |                        |      1 |     12 |    483K|00:00:01.35 |     474K|  25471 |    24M|  5228K|   39M (0)||   2 |   NESTED LOOPS               |                        |      1 |     18 |    483K|00:00:01.13 |     474K|  25471 ||*  3 |    HASH JOIN                 |                        |      1 |    241K|    617K|00:00:00.43 |     211K|  25448 |  1035K|  1035K| 1632K (0)||   4 |     NESTED LOOPS             |                        |      1 |   5843 |   6166 |00:00:00.02 |    5844 |    107 ||   5 |      INDEX FAST FULL SCAN    | IDX_tbl1_02            |      1 |   9393 |   9432 |00:00:00.01 |      95 |     87 ||*  6 |      INDEX UNIQUE SCAN       | IDX_tbl2_01            |   9432 |      1 |   6166 |00:00:00.01 |    5749 |     20 ||   7 |     TABLE ACCESS FULL        | tbl3                   |      1 |    120K|    121K|00:00:00.26 |     205K|  25341 ||*  8 |    INDEX UNIQUE SCAN         | IDX_tbl4               |    617K|      1 |    483K|00:00:00.60 |     263K|     23 ||*  9 |   TABLE ACCESS BY INDEX ROWID| tbl5                   |      1 |      1 |     94 |00:00:00.01 |      56 |      0 ||* 10 |    INDEX RANGE SCAN          | IDX_tbl5_01            |      1 |    111 |     94 |00:00:00.01 |      15 |      0 |--------------------------------------------------------------------------------------------------------------------------

AS you can see the plan is initially the same but a new hash join is added to join the table tbl5 with the output of four tables . The table tbl5 returns only 94 rows and so I believe hashing the table tbl5 in memory would be beneficial for the hash join .

According to oracle documentation

Because the orders table is small relative to the order_items table, which is 6 times larger, the database hashes orders. In a hash join, the data set for the build table always appears first in the list of operations (Step 2). In Step 3, the database performs a full scan of the larger order_items later, probing the hash table for each row.

https://docs.oracle.com/database/121/TGSQL/tgsql_join.htm#TGSQL245

I also tried writing the query as a subquery  in the below form

select /*+ leading (G adce) */ adce.bsid,adce.col2,adce.col3 from tbl5 G,(query joinin the tables adce) adce where G.LANG_ID = 1 and G.NAME = 'd_TrdStatus' AND G.VALUE = C.CURUS and G.CODE_ID = 2000000110

but after rewriting in the format the  varies completely and it starts joining tbl5 and tbl1 at the start.

Why is the join order not picking tbl5 to hash in memory ? How can I rewrite the query to make sute it joins tbl5 with the result of other 4 tables.

Thanks

Tagged:
John Thortonmtefft

Best Answer

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 11, 2018 4:55AM Accepted Answer

    Like everyone else who has answered so far, I encourage people to avoid hinting. Mainly because they usually don't do it well.  In your case, for example, you haven't specified any join methods so Oracle could follow your hints and still change the plan dramatically from the one you expect.  If you think you've hinted the best plan then capture an SQL Plan Baseline for it so that you can see how many more hints you need to make it a stable plan.

    To answer the detail of your question, though:  you still want G to be the last table in the join order, but you want Oracle to use it as the probe build table in a hash join, so you need to add G at the end of your leading() hint, specifiy a hash join, and specity swap_join_inputs() the last hint will put G at the top of the plan while still joining to it as the last table.

    /*+ leading(A D C E G) use_hash(G) swap_join_inputs(G) */

    Regards

    Jonathan Lewis

    Update: I wrote "probe" when I should have said "build": the thing that appears first as you read down the plan from the "hash join " operator is the build table, the second is the probe table.

Answers

  • John Brady - UK
    John Brady - UK Member Posts: 434 Bronze Badge
    edited Jan 9, 2018 9:24AM

    First, why do you think you know more about the "best join order" between your tables than Oracle does?  That is precisely what the Optimizer does automatically for you - it considers all possible joins between the tables in the query (up to a limit) and uses the execution plan with the lowest total cost.  Thus the Optimizer will automatically consider different join orders and join methods, and use those with the lowest cost.

    The only time I'm really aware of when this doesn't work as expected is when the statistics on your tables don't agree with the actual data in the tables (out of date statistics), or you have heavily skewed data distributions and the same execution plan is used with different bind variable values.

    Second, the "leading" hint tells Oracle to access those tables in that order within the execution plan.  So any execution plan produced will always start with A, then do B, then C, then D, then the other tables.  So your "leading" hint is stopping the Optimizer from considering joining table G until after the other 4 tables have been accessed in the execution plan.

    The simplest solution is to remove the "leading" hint (and any others you might be considering) and let the Optimizer consider all the possible joins between the tables in the query.  Then it will naturally pick the table access sequence and join methods that have the lowest cost.  If joining A to G early on produces a lower cost execution plan, then the Optimizer should find it and use it.

    I'd strongly recommend avoiding all such hints that force specific execution plans, unless you have very strong and overriding reasons for doing so.  And even then I'd still try and avoid hints, because you have to know more than the Optimizer does to produce the "right" plan and to make sure that your hints ALWAYS produce the plan you want in all possible circumstances.

    John Thortonmtefft
  • JohnWatson2
    JohnWatson2 Member Posts: 4,328 Silver Crown
    edited Jan 9, 2018 10:08AM

    You initial assumption is not usually correct,

    start with the table and condition which has the least number  of rows and keep the large tables for join at the end.

    When calculating a join order you need to start not with the smallest table, and not with the table that has the least number of rows after applying filters. You should start with the table that has the lowest proportion of rows remaining after applying filters. This is known as "the filtered rows percentage method", brilliantly documented here,

    https://www.amazon.com/Oracle-Performance-Tuning-Optimization-Cardinalities/dp/1501022695

    The way to look at it is, for example, that if you have a dimension table of only four rows and a constant filter which cuts it down to 1 (25%), you will still (on average) have to look at one quarter of a huge fact table. So even if the filter on the huge fact table returns a thousand rows, it will be quicker to start with that if that one thousand is a smaller percentage of the total fact table.

  • JohnWatson2
    JohnWatson2 Member Posts: 4,328 Silver Crown
    edited Jan 9, 2018 10:13AM

    Here's an example:

    orclx>orclx> set autot trace exporclx> select * from emp natural join dept where dname='SALES' and job='MANAGER';Execution Plan----------------------------------------------------------Plan hash value: 3625962092----------------------------------------------------------------------------------------| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time    |----------------------------------------------------------------------------------------|  0 | SELECT STATEMENT            |        |    1 |    58 |    6  (0)| 00:00:01 ||  1 |  NESTED LOOPS                |        |    1 |    58 |    6  (0)| 00:00:01 ||  2 |  NESTED LOOPS              |        |    3 |    58 |    6  (0)| 00:00:01 ||*  3 |    TABLE ACCESS FULL        | EMP    |    3 |  114 |    3  (0)| 00:00:01 ||*  4 |    INDEX UNIQUE SCAN        | PK_DEPT |    1 |      |    0  (0)| 00:00:01 ||*  5 |  TABLE ACCESS BY INDEX ROWID| DEPT    |    1 |    20 |    1  (0)| 00:00:01 |----------------------------------------------------------------------------------------Predicate Information (identified by operation id):---------------------------------------------------  3 - filter("EMP"."JOB"='MANAGER')  4 - access("EMP"."DEPTNO"="DEPT"."DEPTNO")  5 - filter("DEPT"."DNAME"='SALES')Note-----  - this is an adaptive planorclx>

    Uncle Oracle is starting with emp, even though emp id the bigger table and it expects to get three rows after applying the filter. This is because 3 as a proportion of 14 is lower then the 1 out of 4 proportion that it expects to get from dept.

  • mtefft
    mtefft Member Posts: 836 Gold Badge
    edited Jan 9, 2018 11:33AM

    > I'd strongly recommend avoiding all such hints that force specific execution plans, unless you have very strong and overriding reasons for doing so.  And even then I'd still try and avoid hints, because you have to know more than the Optimizer does to produce the "right" plan and to make sure that your hints ALWAYS produce the plan you want in all possible circumstances.

    Absolutely agree.

    I discourage the developers I work with from trying to hint their way to 'a better plan'.

    If you try to do this, you need to be able to imagine a better plan than Oracle is providing: this takes work. And then get Oracle to follow your ideas. And then verify that the result was really worth the effort and fragility that your hints introduced. Much better is to find an Oracle-produced plan (and ASH can help you verify its efficiency), and use a baseline or one of the other plan-stability methods to enforce it.

    Too often, a developer thinks they can just throw a single hint into the mix and goodness happens.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 9, 2018 1:22PM

    The problem with that argument is that it's implicitly assuming the join is a nested loop.

    In your case a nested loop from 1,000 rows of fact table to 4 rows of dimension may well be better than a nested loop from one row of the dimension table to 1/4 of the fact table with filter; but that means you can identify the 1,000 efficiently anyway, so a hash join that builds with the dimension and probes with the fact may be more efficient than either nested loop.

    When you consider hash joins you have to be careful with the expression "starts with" - it becomes ambiguous.

    Regards

    Jonathan Lewis

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 9, 2018 1:31PM

    I'm going to ignore the choice of order for A D C E in this question because it's aimed at the problem of adding one more table to an existing query - and there is clearly a very good reason why the OP should want to make Oracle do something other than its chosen strategy for that last step.

    The 4 table join produces 483,000 rows; the single table access to G produces 94 rows.  Although volume of data is more important than number of rows when considering a hash join I think it's safe to assume that the 94 rows acquired from 56 buffers will have a lower volume than 483,000 rows (which - according to the stats required 39MB to build in memory).

    If the plan for the first 4 tables is good, then it makes sense for the OP to ask the question about how to change Oracle's choice on the fifth.

    Regards

    Jonathan Lewis

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,786 Gold Crown
    edited Jan 11, 2018 4:55AM Accepted Answer

    Like everyone else who has answered so far, I encourage people to avoid hinting. Mainly because they usually don't do it well.  In your case, for example, you haven't specified any join methods so Oracle could follow your hints and still change the plan dramatically from the one you expect.  If you think you've hinted the best plan then capture an SQL Plan Baseline for it so that you can see how many more hints you need to make it a stable plan.

    To answer the detail of your question, though:  you still want G to be the last table in the join order, but you want Oracle to use it as the probe build table in a hash join, so you need to add G at the end of your leading() hint, specifiy a hash join, and specity swap_join_inputs() the last hint will put G at the top of the plan while still joining to it as the last table.

    /*+ leading(A D C E G) use_hash(G) swap_join_inputs(G) */

    Regards

    Jonathan Lewis

    Update: I wrote "probe" when I should have said "build": the thing that appears first as you read down the plan from the "hash join " operator is the build table, the second is the probe table.

  • Sekar_BLUE4EVER
    Sekar_BLUE4EVER Member Posts: 324 Blue Ribbon
    edited Jan 11, 2018 4:43AM

    Thanks for all the response. I get that it is better to leave the Join order and join type for the Oracle optimizer to select based on the statistics. The idea behind adding the hint to manually force the join order was that if we know the business behind the table data and what kind of data will  be there in each table then isn't it better to manually force the join order and type ? Because I have ran into issues where sometimes oracle picks up the right plan (the one that runs faster ) but sometimes it takes too long . For example I know that a table will return a lot of rows after the filter and it was used at the start of a complex join query in Nested loops join. So in that case doesn't it make sense to manually "delay" the join operation as long as possible by moving it to the end of joins?

  • Unknown
    edited Jan 11, 2018 1:22PM
    I get that it is better to leave the Join order and join type for the Oracle optimizer to select based on the statistics.

    Correct - but there is more to the process than that

    1. initially you should let Oracle do its thing and not try to solve problems that don't exist - that is, if it ain't broke don't fix it

    2. do your testing

    3. if the results don't meet the required SLA (service level agreement) then start your troubleshooting.

    For example I know that a table will return a lot of rows after the filter and it was used at the start of a complex join query in Nested loops join. So in that case doesn't it make sense to manually "delay" the join operation as long as possible by moving it to the end of joins? 

    And the 'issue' is that you do NOT know what Oracle is going to do until you let it do it and then examine the execution plan.

    The standard troubleshooting steps are:

    1. identify a problem/issue to be addressed

    2. confirm that the problem/issue REALLY exists

    3. identify potential solutions that will eliminate/mitigate the problem/issue

    4. select a small number (2 or 3) of those 'solutions' for prototyping and testing

    5. select the BEST solution from those tested.

    Having an SLA is an important part of step #1. Almost any query/process can be made 'faster' - so 'make it faster' is NOT an appropriate goal. You need to have a SPECIFIC performance target.

    That target will be based on things like: 1) number of rows expected, 2) response time needed for the FIRST rows (or maybe ALL rows) to be returned, 3) the number of times the query/process is executed (once a day? or every five minutes?), 4) the effect that 'tuning' this process might have on other queries/processes (adding an index might help this query but could cause problems with other queries that used to work ok).

    So what everyone meant is that generally you should let Oracle do its thing and only get involved if there is a CONFIRMED problem with what Oracle did. Even then the problem/issue can often be resolved without using hints.

This discussion has been closed.