Parallel Hash Join always swapping to TEMP

664822
664822 Member Posts: 6
edited Oct 16, 2008 10:44AM in General Database Discussions
Hi,

I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5m records / 1.5 GB - is swapping to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange, serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both parallel and serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (no matter how big it is).

To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I have analyzed the results properly. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet parallel execution still decides to swap (even though the inner data size for each slave is about 220 KB!).

I dug into the traces (10104 event) and found a substantial difference between serial and parallel execution. It looks like some internal flag orders the PQ slaves to always buffer the data; here is what I found in a PQ slave trace:

HASH JOIN STATISTICS (INITIALIZATION)
Original memory: 4428800
Memory after all overhead: 4283220
Memory for slots: 3809280
Calculated overhead for partitions and row/slot managers: 473940
Hash-join fanout: 8
Number of partitions: 9
Number of slots: 15
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 280
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 645
Estimated Row Length (includes overhead): 167
Immutable Flags:
BUFFER the output of the join for Parallel Query
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
kxhfSetPhase: phase=PROBE_1

The highlighted line ("BUFFER the output of the join for Parallel Query") is the part that is not present in serial mode. Unfortunately I cannot find anything that could help identify the reason or the setting that drives this behaviour :(

Best regards
Bazyli

Edited by: user10419027 on Oct 13, 2008 3:53 AM
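
For reference, a minimal sketch of how a 10104 hash-join trace like the one quoted above is typically captured (session-level event syntax; the level shown is illustrative, and trace files land in user_dump_dest, or background_dump_dest for PQ slaves):

    -- sketch: enable hash-join tracing for the current session, run the query, switch it off
    ALTER SESSION SET events '10104 trace name context forever, level 10';
    -- ... run the parallel hash join here ...
    ALTER SESSION SET events '10104 trace name context off';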

Answers

  • Pavan Kumar
    Pavan Kumar Member Posts: 11,904 Gold Crown
    Hi,

    I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5m records / 1.5 GB - is swapping to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange, serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both parallel and serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (no matter how big it is).

    The hash join is more memory intensive than a nested loop join. For it to be faster than a nested loop join, the hash_area_size must be large enough to hold the entire hash table in memory (roughly 1.6 times the total size of the rows in the build table). If the hash join overflows the hash_area_size memory, it will page into the TEMP tablespace.

    To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I have analyzed the results properly. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet parallel execution still decides to swap (even though the inner data size for each slave is about 220 KB!).

    http://dba-blog.blogspot.com/2005/08/pgamaxsize-hidden-parameter.html


    - Pavan Kumar N
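
As an illustration of the manual memory model described above - only applicable when the workarea policy is manual, which is not the original poster's configuration - a hash area could be sized per session roughly like this (the value is illustrative):

    -- sketch: manual workarea sizing; ignored when workarea_size_policy = AUTO
    ALTER SESSION SET workarea_size_policy = manual;
    ALTER SESSION SET hash_area_size = 104857600;   -- ~100 MB for the hash table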
  • 664822
    664822 Member Posts: 6
    Pavan,

    As I mentioned in my post - I have enough PGA memory assigned for hash/sort operations - PGA_AGGREGATE_TARGET is set to 6GB (combined with AUTO memory management), which makes the HASH_AREA_SIZE parameter irrelevant. Also, the _pga_max_size parameter is set to 200MB, so again - it's not a matter of PGA size limits.

    Moreover - the serial execution (which operates on four times as much data to hash) is able to do it in memory, and I'm not surprised here - it's only 1MB of data.

    Cheers
    Bazyli
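
The settings mentioned above can be cross-checked directly; a minimal sketch (the hidden _pga_max_size is not visible in v$parameter and needs the x$ views or SYS access):

    -- sketch: confirm the PGA-related settings in effect
    SELECT name, value
    FROM   v$parameter
    WHERE  name IN ('pga_aggregate_target', 'workarea_size_policy', 'hash_area_size');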
  • Pavan Kumar
    Pavan Kumar Member Posts: 11,904 Gold Crown
    Hi,

    Can you monitor v$sql_workarea_active?
    Since, as per that setting, the limit for a single workarea operation in a single process is 2GB -
    what is the actual value of this _smm_px_max_size?

    - Pavan Kumar N
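
A minimal query for the monitoring suggested above might look like this (columns as documented for v$sql_workarea_active; run it while the join is executing):

    -- sketch: watch active workareas and any spill to TEMP
    SELECT sid, operation_type, policy,
           actual_mem_used, max_mem_used,
           number_passes, tempseg_size
    FROM   v$sql_workarea_active
    WHERE  operation_type LIKE 'HASH%';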
  • 664822
    664822 Member Posts: 6
    Pavan,

    Yes, I can monitor workarea usage, however I don't expect to find anything useful there - it will only confirm that the hash join is swapping to disk (I saw it in V$SORT_USAGE as well). The _smm_px_max_size parameter is set to about 2GB, so that's not the issue. Again - my query refuses to hash 250KB of data in memory :(

    I guess there must be some internal setting that drives that "BUFFER the output of the join for Parallel Query" flag. Unfortunately I don't see any hidden parameter responsible for that...
    In some sample 10104 traces on the internet I have seen an opposite flag, which appears as "Not BUFFER(execution) output of the join for PQ" (still under the immutable flags section of the HJ output). The question is - what drives this setting?

    Cheers
    Bazyli
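
For completeness, the usual way to scan for candidate hidden parameters is the x$ksppi/x$ksppcv join below (run as SYS); this is a generic sketch, not a pointer to a specific parameter that controls the flag:

    -- sketch: list hidden parameters whose name or description mentions hash joins
    SELECT i.ksppinm   AS name,
           v.ksppstvl  AS value,
           i.ksppdesc  AS description
    FROM   x$ksppi i, x$ksppcv v
    WHERE  i.indx = v.indx
    AND    (LOWER(i.ksppinm) LIKE '%hash%' OR LOWER(i.ksppdesc) LIKE '%hash%');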
  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,953 Blue Diamond
    What is the distribution method of the join?
    Can you generate an execution plan from dbms_xplan and post it? (Use the "code" tags to get readable output in a fixed font.)
    Can you print the section of the trace that shows the row distribution across hash partitions, and an example of the section where Oracle starts phase 2 and reads a matching pair of partitions?

    Given the information that Oracle has shown in this part of the trace, it seems very surprising that it manages to spill to disc.

    Regards
    Jonathan Lewis
    http://jonathanlewis.wordpress.com
    http://www.jlcomp.demon.co.uk
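
The plan format requested above can be produced with dbms_xplan (available from 9.2 onwards); a minimal sketch, with placeholder table names rather than the ones in this thread:

    -- sketch: small_tab / big_tab / join_col are placeholders
    EXPLAIN PLAN FOR
    SELECT /*+ ordered use_hash(b) parallel(a 4) parallel(b 4) */ *
    FROM   small_tab a, big_tab b
    WHERE  a.join_col = b.join_col;

    SELECT * FROM TABLE(dbms_xplan.display);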
  • 664822
    664822 Member Posts: 6
    Jonathan,

    Distribution seems to be as expected (HASH/HASH), please have a look at the query plan:
    PLAN_TABLE_OUTPUT
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------
    
    -------------------------------------------------------------------------------------------------------------------
    | Id  | Operation            |  Name                         | Rows  | Bytes | Cost  |  TQ    |IN-OUT| PQ Distrib |
    -------------------------------------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT     |                               |   456K|    95M|   876 |        |      |            |
    |*  1 |  HASH JOIN           |                               |   456K|    95M|   876 | 43,02  | P->S | QC (RAND)  |
    |   2 |   TABLE ACCESS FULL  | SH30_8700195_9032_0_TMP_TEST  | 16555 |   468K|    16 | 43,00  | P->P | HASH       |
    |   3 |   TABLE ACCESS FULL  | SH30_8700195_9031_0_TMP_TEST  |  2778K|   503M|   860 | 43,01  | P->P | HASH       |
    -------------------------------------------------------------------------------------------------------------------
    
    Predicate Information (identified by operation id):
    ---------------------------------------------------
       1 - access(NVL("A"."PROD_ID",'NULL!')=NVL("B"."PROD_ID",'NULL!') AND
                  NVL("A"."PROD_UNIT_OF_MEASR_ID",'NULL!')=NVL("B"."PROD_UNIT_OF_MEASR_ID",'NULL!'))
    Let me also share with you the trace files from parallel and serial execution.

    First, serial execution (only 10104 event details):
    Dump file /opt/oracle/admin/cdwep4/udump/cdwep401_ora_18729.trc
    Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.5.0 - Production
    ORACLE_HOME = /opt/oracle/product/9.2.0.5
    System name:	HP-UX
    Node name:	ethp1018
    Release:	B.11.11
    Version:	U
    Machine:	9000/800
    Instance name: cdwep401
    Redo thread mounted by this instance: 1
    Oracle process number: 100
    Unix process pid: 18729, image: [email protected] (TNS V1-V3)
    
    kxhfInit(): enter
    kxhfInit(): exit
    *** HASH JOIN STATISTICS (INITIALIZATION) ***
    Original memory: 4341760
    Memory after all overhead: 4163446
    Memory for slots: 3301376
    Calculated overhead for partitions and row/slot managers: 862070
    Hash-join fanout: 8
    Number of partitions: 8
    Number of slots: 13
    Multiblock IO: 31
    Block size(KB): 8
    Cluster (slot) size(KB): 248
    Hash-join fanout (manual): 8
    Cluster/slot size(KB) (manual): 240
    Minimum number of bytes per block: 8160
    Bit vector memory allocation(KB): 128
    Per partition bit vector length(KB): 16
    Maximum possible row length: 1455
    Estimated build size (KB): 1083
    Estimated Row Length (includes overhead): 67
    # Immutable Flags:
    kxhfSetPhase: phase=BUILD
    kxhfAddChunk: add chunk 0 (sz=32) to slot table
    kxhfAddChunk: chunk 0 (lbs=800003ff6c063b20, slotTab=800003ff6c063cb8) successfuly added
    kxhfSetPhase: phase=PROBE_1
    qerhjFetch: max build row length (mbl=110)
    *** END OF HASH JOIN BUILD (PHASE 1) ***
      Revised row length: 68
      Revised row count: 16555
      Revised build size: 1089KB
    kxhfResize(enter): resize to 12 slots (numAlloc=8, max=13)
    kxhfResize(exit): resized to 12 slots (numAlloc=8, max=12)
      Slot table resized: old=13 wanted=12 got=12 unload=0
    *** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Total number of partitions: 8
    Number of partitions which could fit in memory: 8
    Number of partitions left in memory: 8
    Total number of slots in in-memory partitions: 8
    Total number of rows in in-memory partitions: 16555
       (used as preliminary number of buckets in hash table)
    Estimated max # of build rows that can fit in avail memory: 55800
    ### Partition Distribution ###
    Partition:0    rows:2131       clusters:1      slots:1      kept=1
    Partition:1    rows:1975       clusters:1      slots:1      kept=1
    Partition:2    rows:1969       clusters:1      slots:1      kept=1
    Partition:3    rows:2174       clusters:1      slots:1      kept=1
    Partition:4    rows:2041       clusters:1      slots:1      kept=1
    Partition:5    rows:2092       clusters:1      slots:1      kept=1
    Partition:6    rows:2048       clusters:1      slots:1      kept=1
    Partition:7    rows:2125       clusters:1      slots:1      kept=1
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Revised number of hash buckets (after flushing): 16555
    Allocating new hash table.
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Requested size of hash table: 4096
    Actual size of hash table: 4096
    Number of buckets: 32768
    kxhfResize(enter): resize to 14 slots (numAlloc=8, max=12)
    kxhfResize(exit): resized to 14 slots (numAlloc=8, max=14)
      freeze work area size to: 4357K (14 slots)
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Total number of rows (may have changed): 16555
    Number of in-memory partitions (may have changed): 8
    Final number of hash buckets: 32768
    Size (in bytes) of hash table: 262144
    kxhfIterate(end_iterate): numAlloc=8, maxSlots=14
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    ### Hash table ###
    # NOTE: The calculated number of rows in non-empty buckets may be smaller
    #       than the true number.
    Number of buckets with   0 rows:      21129
    Number of buckets with   1 rows:       8755
    Number of buckets with   2 rows:       2024
    Number of buckets with   3 rows:        433
    Number of buckets with   4 rows:        160
    Number of buckets with   5 rows:         85
    Number of buckets with   6 rows:         69
    Number of buckets with   7 rows:         41
    Number of buckets with   8 rows:         32
    Number of buckets with   9 rows:         18
    Number of buckets with between  10 and  19 rows:         21
    Number of buckets with between  20 and  29 rows:          1
    Number of buckets with between  30 and  39 rows:          0
    Number of buckets with between  40 and  49 rows:          0
    Number of buckets with between  50 and  59 rows:          0
    Number of buckets with between  60 and  69 rows:          0
    Number of buckets with between  70 and  79 rows:          0
    Number of buckets with between  80 and  89 rows:          0
    Number of buckets with between  90 and  99 rows:          0
    Number of buckets with 100 or more rows:          0
    ### Hash table overall statistics ###
    Total buckets: 32768 Empty buckets: 21129 Non-empty buckets: 11639
    Total number of rows: 16555
    Maximum number of rows in a bucket: 24
    Average number of rows in non-empty buckets: 1.422373
    =====================
    
    .... (lots of fetching) ....
    
    qerhjFetch: max probe row length (mpl=0)
    qerhjFreeSpace(): free hash-join memory
    kxhfRemoveChunk: remove chunk 0 from slot table
    And finally, the PQ slave output (only one trace; please note the Immutable Flag that I believe orders Oracle to buffer to TEMP):
    Dump file /opt/oracle/admin/cdwep4/bdump/cdwep401_p002_4640.trc
    Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
    With the Partitioning, OLAP and Oracle Data Mining options
    JServer Release 9.2.0.5.0 - Production
    ORACLE_HOME = /opt/oracle/product/9.2.0.5
    System name:	HP-UX
    Node name:	ethp1018
    Release:	B.11.11
    Version:	U
    Machine:	9000/800
    Instance name: cdwep401
    Redo thread mounted by this instance: 1
    Oracle process number: 86
    Unix process pid: 4640, image: [email protected] (P002)
    
    kxhfInit(): enter
    kxhfInit(): exit
    *** HASH JOIN STATISTICS (INITIALIZATION) ***
    Original memory: 4428800
    Memory after all overhead: 4283220
    Memory for slots: 3809280
    Calculated overhead for partitions and row/slot managers: 473940
    Hash-join fanout: 8
    Number of partitions: 9
    Number of slots: 15
    Multiblock IO: 31
    Block size(KB): 8
    Cluster (slot) size(KB): 248
    Hash-join fanout (manual): 8
    Cluster/slot size(KB) (manual): 280
    Minimum number of bytes per block: 8160
    Bit vector memory allocation(KB): 128
    Per partition bit vector length(KB): 16
    Maximum possible row length: 1455
    Estimated build size (KB): 645
    Estimated Row Length (includes overhead): 167
    # Immutable Flags:
      BUFFER the output of the join for Parallel Query
    kxhfSetPhase: phase=BUILD
    kxhfAddChunk: add chunk 0 (sz=32) to slot table
    kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
    kxhfSetPhase: phase=PROBE_1
    qerhjFetch: max build row length (mbl=96)
    *** END OF HASH JOIN BUILD (PHASE 1) ***
      Revised row length: 54
      Revised row count: 4203
      Revised build size: 221KB
    kxhfResize(enter): resize to 16 slots (numAlloc=8, max=15)
    kxhfResize(exit): resized to 16 slots (numAlloc=8, max=16)
      Slot table resized: old=15 wanted=16 got=16 unload=0
    *** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Total number of partitions: 8
    Number of partitions which could fit in memory: 8
    Number of partitions left in memory: 8
    Total number of slots in in-memory partitions: 8
    Total number of rows in in-memory partitions: 4203
       (used as preliminary number of buckets in hash table)
    Estimated max # of build rows that can fit in avail memory: 85312
    ### Partition Distribution ###
    Partition:0    rows:537        clusters:1      slots:1      kept=1
    Partition:1    rows:554        clusters:1      slots:1      kept=1
    Partition:2    rows:497        clusters:1      slots:1      kept=1
    Partition:3    rows:513        clusters:1      slots:1      kept=1
    Partition:4    rows:498        clusters:1      slots:1      kept=1
    Partition:5    rows:543        clusters:1      slots:1      kept=1
    Partition:6    rows:547        clusters:1      slots:1      kept=1
    Partition:7    rows:514        clusters:1      slots:1      kept=1
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Revised number of hash buckets (after flushing): 4203
    Allocating new hash table.
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Requested size of hash table: 1024
    Actual size of hash table: 1024
    Number of buckets: 8192
    kxhfResize(enter): resize to 18 slots (numAlloc=8, max=16)
    kxhfResize(exit): resized to 18 slots (numAlloc=8, max=18)
      freeze work area size to: 5812K (18 slots)
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    Total number of rows (may have changed): 4203
    Number of in-memory partitions (may have changed): 8
    Final number of hash buckets: 8192
    Size (in bytes) of hash table: 65536
    kxhfIterate(end_iterate): numAlloc=8, maxSlots=18
    *** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
    ### Hash table ###
    # NOTE: The calculated number of rows in non-empty buckets may be smaller
    #       than the true number.
    Number of buckets with   0 rows:       5284
    Number of buckets with   1 rows:       2177
    Number of buckets with   2 rows:        510
    Number of buckets with   3 rows:        104
    Number of buckets with   4 rows:         51
    Number of buckets with   5 rows:         14
    Number of buckets with   6 rows:         14
    Number of buckets with   7 rows:         13
    Number of buckets with   8 rows:         12
    Number of buckets with   9 rows:          4
    Number of buckets with between  10 and  19 rows:          9
    Number of buckets with between  20 and  29 rows:          0
    Number of buckets with between  30 and  39 rows:          0
    Number of buckets with between  40 and  49 rows:          0
    Number of buckets with between  50 and  59 rows:          0
    Number of buckets with between  60 and  69 rows:          0
    Number of buckets with between  70 and  79 rows:          0
    Number of buckets with between  80 and  89 rows:          0
    Number of buckets with between  90 and  99 rows:          0
    Number of buckets with 100 or more rows:          0
    ### Hash table overall statistics ###
    Total buckets: 8192 Empty buckets: 5284 Non-empty buckets: 2908
    Total number of rows: 4203
    Maximum number of rows in a bucket: 16
    Average number of rows in non-empty buckets: 1.445323
    kxhfWrite: hash-join is spilling to disk
    kxhfWrite: Writing dba=950281 slot=8 part=8
    kxhfWrite: Writing dba=950312 slot=9 part=8
    kxhfWrite: Writing dba=950343 slot=10 part=8
    kxhfWrite: Writing dba=950374 slot=11 part=8
    
    .... (lots of writing) ....
    
    kxhfRead(): Reading dba=950281 into slot=15
    kxhfIsDone: waiting slot=15 lbs=800003ff640ebb50
    kxhfRead(): Reading dba=950312 into slot=16
    kxhfIsDone: waiting slot=16 lbs=800003ff640ebb50
    kxhfRead(): Reading dba=950343 into slot=17
    kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
      EmptySlots:15 8 9 10 11 12 13 
      PendingSlots:
    kxhfIsDone: waiting slot=17 lbs=800003ff640ebb50
    kxhfRead(): Reading dba=950374 into slot=15
    kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
      EmptySlots:16 8 9 10 11 12 13 
      PendingSlots:
    
    .... (lots of reading) ....
    
    qerhjFetchPhase2(): building a hash table
    kxhfFreeSlots(800003ff7c068980): all=1 alloc=18 max=18
      EmptySlots:2 4 6 1 0 7 5 3 14 17 16 15 8 9 10 11 12 13 
      PendingSlots:
    qerhjFreeSpace(): free hash-join memory
    kxhfRemoveChunk: remove chunk 0 from slot table
    Why do you think it's surprising that Oracle uses TEMP? Based on the traces, Oracle seems very sure it should spill to disk. I believe the key to the answer is this immutable flag printing "BUFFER the output of the join for Parallel Query" - as I mentioned in one of my previous posts, it's the opposite of "Not BUFFER(execution) output of the join for PQ", which appears in some traces found on the internet.

    Best regards
    Bazyli
  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,953 Blue Diamond
    edited Oct 15, 2008 2:16AM
    Bazyli,

    The first part of the trace file shows that Oracle has plenty of memory to hold the hash table, hence my surprise.
    The later parts of the trace show that the hash table does indeed fit in memory (all partitions marked as kept=1).

    There is an interesting detail in the NUMBER of partitions declared in the opening part of the trace. This is 9, but Oracle builds the hash table for a hash join using a number of partitions that is a power of 2 - and we can see this in the list of partitions that are holding the data (the eight partitions 0 - 7 account for all rows recorded). Note also that you do not have a report of "reloading pair of flushed partitions", which would appear if you had really spilled the hash table to disc.

    Your writes to disc are for partition 8 - and there is no build data in partition 8, and the size of your clusters (slots) is big (248K), so if you were writing the hash table to disc it would take very few writes.

    At present I can only guess that the writes are either a bug, or the result of the hash join output being written to disc because the query coordinator cannot keep up with the slaves - so they dump to temp so that they can continue running in the background.

    What is the query - does it return very few rows, or a number similar to the size of the build table, or of the probe table?
    Can you run the query with 10046 enabled at level 8, so that you can see whether the writes alternate with reads of the probe table?

    I have a query with the same plan - and it shows the same immutable flags setting, but does not write to disc. So the "feature" may be enabled by the flag, but is not forced by the flag.

    Regards
    Jonathan Lewis
    http://jonathanlewis.wordpress.com
    http://www.jlcomp.demon.co.uk
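
Enabling the wait tracing asked for above is typically done per session before running the statement; a minimal sketch (level 8 records wait events):

    -- sketch: 10046 level 8 = SQL trace plus wait events
    ALTER SESSION SET events '10046 trace name context forever, level 8';
    -- ... run the parallel hash join ...
    ALTER SESSION SET events '10046 trace name context off';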
  • 664822
    664822 Member Posts: 6
    Jonathan,

    Thanks for noticing two important facts that I'd missed:
    - Oracle spills the ninth partition (#8), which according to the hash data distribution shouldn't be filled with any data
    - It's not a real on-disk hash join - the build (smaller) table is not written to disk, only the probe (bigger) table goes there (normally Oracle would build the hash table in several iterations, each time telling us how many blocks remain in TEMP)

    Answering your questions: the query is a typical data warehousing look-up join - I try to assign to each fact record some additional look-up data from a small table. The fact table is about 800 MB / 2.8m records, the look-up table occupies about 1MB / 16k records (unique join key values). The query produces exactly the same number of rows as the fact table (2.8m).

    Following your suggestion I've repeated the test - this time with the 10046 event turned on (level 8 - waits). The traces are properly split between the two layers of PQ slaves - the lower level reading data blocks and distributing them by hash to the upper level, which performs the HJ, spills the probe data to TEMP, then reads it back and returns it to the QC. Here is a short summary of the wait events on both levels (executed with DOP = 4):

    1) lower PQ level
    - executes the following queries:

    CURSOR #1: look-up (small) table read
    SELECT /*+ Q2836777000 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."PROD_ID" C2,A1."PROD_UNIT_OF_MEASR_ID" C3,
    A1."PUOM_RELTV_QTY" C4,A1."PUOM_STAT_VAL" C5,A1."PUOM_UNIT_OF_MEASR_ID" C6,A1."PB_DNORM_COMP_ID" C7,A1."PB_DNORM_VAL_FAC" C8,A1."PB_DNORM_COMP_QTY" C9,
    A1."PB_DNORM_SU_FAC" C10,A1."PB_DNORM_COMP_NUMBER" C11,A1."PB_DNORM_VOL_CREDT_CODE" C12 FROM "IDWE4"."SH30_8700195_9032_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC)  A1
    CURSOR #2: fact (big) table read
    SELECT /*+ Q2836777001 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."CUST_ACCT_GRP_CODE" C2,A1."AFFLT_SHIP_FLAG" C3,
    A1."GEO_ID" C4,A1."GEO_REPLC_STTUS_FLAG" C5,A1."GEO_SUB_CNTRY_ID" C6,A1."SUB_CNTRY_REPLC_STTUS_FLAG" C7,A1."TRADE_CHANL_ID" C8,A1."TRADE_CHANL_REPLC_STTUS_FLAG" C9,
    A1."CUST_ID" C10,A1."FACT_AMT_1" C11,A1."FACT_AMT_2" C12,A1."FACT_AMT_3" C13,A1."FACT_ID" C14,A1."CUST_REPLC_STTUS_FLAG" C15,A1."LEGAL_ENT_ID" C16,A1."PROD_ID" C17,
    A1."PROD_REPLC_STTUS_FLAG" C18,A1."SITE_ID" C19,A1."TIME_PERD_ID" C20,A1."FACT_TYPE_CODE" C21,A1."ISO_CRNCY_CODE_CHAR" C22,A1."SRCE_SYS_ID" C23,A1."ENGIN_LYOUT_ID" C24,
    A1."DUE_PERD" C25,A1."PRTTN_CODE" C26,A1."PRTTN_BY_FUNC_DATE" C27,A1."TIME_PERD_TYPE_CODE" C28,A1."TIME_PERD_END_DATE" C29,A1."CORP_OSB_REVNU_SHIP_FLAG" C30,
    A1."CORP_NON_OSB_REVNU_SHIP_FLAG" C31,A1."EXPTN_OSB_REVNU_SHIP_FLAG" C32,A1."FREE_SHIP_FLAG" C33,A1."NON_PG_MATL_SHIP_FLAG" C34,A1."STOCK_MVMNT_SHTTL_FLAG" C35,
    A1."STOCK_MVMNT_WHSE_FLAG" C36,A1."RET_SHIP_FLAG" C37,A1."REFS_SHIP_FLAG" C38,A1."CARRY_SALES_FLAG" C39,A1."POST_DIVST_SHIP_FLAG" C40,A1."CBD_POST_DIVST_SHIP_FLAG" C41,
    A1."PRE_ACQ_SHIP_FLAG" C42,A1."FA_DIVST_SHIP_FLAG" C43,A1."FA_ACQ_SHIP_FLAG" C44,A1."JV_SHIP_FLAG" C45,A1."PROD_UNIT_OF_MEASR_ID" C46,A1."PROFT_CTR_BUS_AREA_ID" C47,
    A1."BILLG_TYPE_ID" C48,A1."SOLD_TO_CUST_ID" C49,A1."ORIG_TRANX_ID" C50,A1."BUS_INTFC_VAL" C51,A1."ATTR_VAL_1" C52,A1."ATTR_VAL_2" C53,A1."ATTR_VAL_3" C54,A1."ATTR_VAL_4" C55,
    A1."ORD_ENTRY_DATE" C56,A1."ORD_STTUS" C57,A1."ATTR_VAL_75" C58,A1."ATTR_VAL_13" C59,A1."ATTR_VAL_16" C60,A1."ATTR_VAL_21" C61,A1."LEGAL_ENT_DFLT" C62,
    A1."LE_DERV_PROD_STRCT_CODE" C63,A1."TRADE_CHANL_DFLT" C64 FROM "IDWE4"."SH30_8700195_9031_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC)  A1
    - reports the following waits (original order, skipping PQ messaging and library cache events):
    EXEC #1
    WAIT #1: nam='db file sequential read' - reported about 20 times, which roughly corresponds to the processed size of the inner table (about 150KB = 1/4 of the table)
    EXEC #2
    WAIT #2: nam='direct path read' - reported 400 times, 64 blocks each, corresponding to the size of the processed data (about 200MB = 1/4 of the table)

    2) upper PQ level
    - executes the following query:

    CURSOR #1: hash join
    SELECT /*+ ORDERED NO_EXPAND USE_HASH(A2) */ A1.C0,A2.C0,A1.C1,A2.C1,A1.C2,A1.C3,A1.C4,A1.C5,A1.C6,A1.C7,A1.C8,A1.C9,A1.C10,A1.C11,A1.C12,A2.C2,A2.C3,A2.C4,
    A2.C5,A2.C6,A2.C7,A2.C8,A2.C9,A2.C10,A2.C11,A2.C12,A2.C13,A2.C14,A2.C15,A2.C16,A2.C17,A2.C18,A2.C19,A2.C20,A2.C21,A2.C22,A2.C23,A2.C24,A2.C25,A2.C26,A2.C27,
    A2.C28,A2.C29,A2.C30,A2.C31,A2.C32,A2.C33,A2.C34,A2.C35,A2.C36,A2.C37,A2.C38,A2.C39,A2.C40,A2.C41,A2.C42,A2.C43,A2.C44,A2.C45,A2.C46,A2.C47,A2.C48,A2.C49,
    A2.C50,A2.C51,A2.C52,A2.C53,A2.C54,A2.C55,A2.C56,A2.C57,A2.C58,A2.C59,A2.C60,A2.C61,A2.C62,A2.C63,A2.C64 FROM :Q2836777000 A1,:Q2836777001 A2 WHERE A2.C1=A1.C1 AND A2.C0=A1.C0
    - reports the following waits (original order, skipping PQ messaging and library cache events):
    EXEC #1
    WAIT #1: nam='direct path write' - reported about 850 times, corresponding to the size of the probe table (about 200MB); every write occurs for partition number 8!
    WAIT #1: nam='direct path read' - reported about 850 times, covering all the blocks that had previously been spilled to TEMP

    I have skipped the QC trace (it's only fetching data).

    After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle swaps the probe table to disk.
    The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW thanks for that!).

    So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.

    Warm regards
    Bazyli
  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,953 Blue Diamond
    edited Oct 15, 2008 6:06PM Answer ✓
    Buzzylee wrote:
    Jonathan,
    >
    After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle swaps the probe table to disk.
    The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW thanks for that!).

    So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.
    I have emulated your example - the behaviour also appears in 10g and 11g.
    This is probably not a bug, but it may be a case where a generic strategy is not appropriate.

    The extra partition is NOT holding the probe table, it is holding the result of the hash join. The entire result is built before it is forwarded to the next "slave set" (which happens to be the query coordinator in this case). Your memory allocation allowed for about 18 slots (multiblock IO batches) of 31 blocks each. You used 8 of them for the hash table, the rest are available to hold the result.

    Somewhere in your trace, around the point where you switch from writes to reads, you should see a summary about partition 8, and the number of "in-memory slots" which will tell you about the size of the result set.

    If the difference between the clusters and the in-memory slots is small, you may find that by setting the "_hash_multiblock_io_count" to a value less than the 31 that the optimizer selected you free up enough memory from the hash table to allow the result set to build in memory.

    Another option - to bypass this spill - is to switch to a (broadcast, none) distribution.

    Regards
    Jonathan Lewis
    http://jonathanlewis.wordpress.com
    http://www.jlcomp.demon.co.uk
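
Both workarounds above can be expressed per session or per statement; a hedged sketch with illustrative values and placeholder names (small_tab is the build table, big_tab the probe table, neither taken from this thread):

    -- sketch 1: cap the hash-join multiblock IO, assuming the parameter is session-settable on this release
    ALTER SESSION SET "_hash_multiblock_io_count" = 8;

    -- sketch 2: broadcast the small build table instead of hash-distributing both sides
    SELECT /*+ ordered use_hash(b) pq_distribute(b broadcast none)
               parallel(a 4) parallel(b 4) */
           a.*, b.*
    FROM   small_tab a, big_tab b
    WHERE  a.join_col = b.join_col;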
  • 664822
    664822 Member Posts: 6
    edited Oct 16, 2008 10:44AM
    Jonathan,

    You are absolutely right - it's the result set, not the PROBE table, that is spilled to TEMP.

    One more thing that I didn't mention earlier: I used to launch this query as a CTAS with parallel DDL turned on. It spilled to disk as well, but this time it wasn't the QC that was intercepting results from the PQ slave layer performing the hash join, but another layer of slaves (the ones writing to disk). Anyway, it doesn't explicitly explain Oracle's behaviour - it should normally pass the results of the HJ to the QC or to the next PQ level without trying to buffer the results, doing it in chunks instead (that's how, in my understanding, communication between PQ slaves and the QC works). I wish I knew what stands behind this strategy...

    Thanks for the workaround tips - I've chosen the data redistribution option. It's not the perfect one (the small table goes to each PQ slave, so there is some rework here) but it should be OK for my purposes - our build table should always be small enough to fit in memory.

    Regarding the other recommendation - could you please explain why you think it could help? Isn't the result set too big to be stored in memory (it's never smaller than hundreds of MB, usually bigger than 2-3 GB)? Moreover - when I changed the HJ distribution method to broadcast/none, the "Multiblock IO" setting in the trace changed to 15. So I checked the "_hash_multiblock_io_count" parameter on my DB - it's set to 0, which probably means it's chosen dynamically during execution. If Oracle can set it dynamically, then why does it have problems with it?

    I also found the line of the trace informing us about the result set size:
    qerhjFetch: PQ Partition rows:696534     clusters:907    in-memory slots      1
    Why is only 1 in-memory slot reported? The free-slot tracing shows something different:
    kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
      EmptySlots:15 8 9 10 11 12 13 
      PendingSlots:
    However, I noticed Oracle writes down and reads back the data in a round-robin manner - slot after slot - so it looks like even 1 slot is enough to handle this.

    BTW, I did some additional tests, wrapping the problematic HJ with some other PGA-related operation (like sorting). And the results were great - the HJ stopped spilling to TEMP. Unfortunately I don't need any sort/group-by operation, so I skipped this idea :)

    Thanks again for tracking down the issue, I really appreciate your help Jonathan.

    Warm regards
    Bazyli

    Edited by: Buzzylee on Oct 16, 2008 7:33 AM
This discussion has been closed.