Parallel Hash Join always swapping to TEMP

664822
Member Posts: 6
Hi,
I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash-joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5M records / 1.5 GB - spills to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange is that serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both the parallel and the serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (regardless of its size).
To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I properly analyze the results. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet the parallel execution still decides to spill (even though the inner data size for each slave is about 220 KB!).
I dug into the traces (10104 event) and found a substantial difference between the serial and the parallel execution. It looks like some internal flag orders the PQ slaves to always buffer the data; here is what I found in a PQ slave trace:
HASH JOIN STATISTICS (INITIALIZATION)
Original memory: 4428800
Memory after all overhead: 4283220
Memory for slots: 3809280
Calculated overhead for partitions and row/slot managers: 473940
Hash-join fanout: 8
Number of partitions: 9
Number of slots: 15
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 280
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 645
Estimated Row Length (includes overhead): 167
Immutable Flags:
BUFFER the output of the join for Parallel Query
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
kxhfSetPhase: phase=PROBE_1
The bolded line ("BUFFER the output of the join for Parallel Query") is the part that is not present in serial mode. Unfortunately I cannot find anything that could help identify the reason or the setting that drives this behaviour.
Best regards
Bazyli
Edited by: user10419027 on Oct 13, 2008 3:53 AM
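For reference, a minimal sketch of how this kind of test can be set up and traced. The table names (lookup_tab, fact_tab, prod_id), the hints and the trace level are illustrative assumptions, not the poster's actual objects or settings:

ALTER SESSION SET tracefile_identifier = 'pq_hj_test';
-- event 10104 dumps the hash-join work-area decisions (like the excerpt above) to the trace file
ALTER SESSION SET events '10104 trace name context forever, level 10';

-- force a parallel (DOP 4) hash join of a small look-up table to a large fact table
SELECT /*+ ORDERED USE_HASH(f) PARALLEL(l 4) PARALLEL(f 4) */
       l.*, f.*
FROM   lookup_tab l,
       fact_tab   f
WHERE  f.prod_id = l.prod_id;

ALTER SESSION SET events '10104 trace name context off';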
Answers
-
Hi,
I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash-joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5M records / 1.5 GB - spills to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange is that serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both the parallel and the serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (regardless of its size).
The hash join is more memory intensive than a nested loop join. To be faster than a nested loop join, we must set the hash_area_size large enough to hold the entire hash table in memory (about 1.6 times the total size of the rows in the table). If the hash join overflows the hash_area_size memory, the hash join will page into the TEMP tablespace.
To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I properly analyze the results. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet the parallel execution still decides to spill (even though the inner data size for each slave is about 220 KB!).
http://dba-blog.blogspot.com/2005/08/pgamaxsize-hidden-parameter.html
- Pavan Kumar N -
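As context for this exchange, a quick way to check the memory-management settings being discussed - a sketch against the standard parameter and PGA statistics views (exact statistic names may differ slightly by version):

SELECT name, value
FROM   v$parameter
WHERE  name IN ('workarea_size_policy', 'pga_aggregate_target', 'hash_area_size');

SELECT name, value
FROM   v$pgastat
WHERE  name IN ('aggregate PGA target parameter', 'total PGA allocated', 'over allocation count');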
Pavan,
As I mentioned in my post - I have enough PGA memory assigned for hash/sort operations - PGA_AGGREGATE_TARGET is set to 6GB (combined with AUTO workarea management), which makes the HASH_AREA_SIZE parameter irrelevant. Also, the _pga_max_size parameter is set to 200MB, so again - it's not a matter of PGA size limits.
Moreover - the serial execution (which has 4 times more data to hash) is able to do it in memory, and I'm not surprised - it's only 1MB of data.
Cheers
Bazyli -
Hi,
Can you monitor V$SQL_WORKAREA_ACTIVE?
Since, as per that setting, the limit for a single workarea operation in a single process is 2GB -
what is the actual value of this _smm_px_max_size?
- Pavan Kumar N -
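For reference, a monitoring query along the lines Pavan suggests might look like the sketch below; it assumes the standard V$SQL_WORKAREA_ACTIVE columns and would be run from another session while the join is executing:

SELECT sid, operation_type, policy,
       actual_mem_used, max_mem_used,
       number_passes, tempseg_size
FROM   v$sql_workarea_active
ORDER  BY sid, operation_id;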
Pavan,
Yes, I can monitor workarea usage, however I don't expect to find anything useful there - it will only confirm that the hash join is spilling to disk (I saw it in V$SORT_USAGE as well). The parameter _smm_px_max_size is set to about 2GB, so that's not the case. Again - my query refuses to hash 250 KB of data in memory.
I guess there must be some internal setting that drives that "BUFFER the output of the join for Parallel Query" flag. Unfortunately I don't see any hidden parameter responsible for that...
From some sample 10104 traces I have seen on the internet, there is an opposite flag which appears as "Not BUFFER(execution) output of the join for PQ" (still under the immutable flags section of the HJ output). The question is - what drives this setting?
Cheers
Bazyli -
What is the distribution method of the join?
Can you generate an execution plan from dbms_xplan and post it? (Use the "code" tags to get readable output in a fixed font.)
Can you print the section of the plan that shows the row distribution across hash partitions, and an example of the section where Oracle starts phase 2 and reads a matching pair of partitions?
Given the information that Oracle has shown in this part of the trace, it seems very surprising that it manages to spill to disc.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
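A sketch of how the requested plan can be produced with dbms_xplan (available from 9.2 onwards); the statement text is a placeholder for the actual join:

EXPLAIN PLAN FOR
SELECT /*+ ORDERED USE_HASH(f) PARALLEL(l 4) PARALLEL(f 4) */ l.*, f.*
FROM   lookup_tab l, fact_tab f
WHERE  f.prod_id = l.prod_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);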
Jonathan,
Distribution seems to be as expected (HASH/HASH), please have a look at the query plan:

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------
| Id | Operation           | Name                         | Rows  | Bytes | Cost | TQ    |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT    |                              |  456K |   95M |  876 |       |      |            |
|* 1 |  HASH JOIN          |                              |  456K |   95M |  876 | 43,02 | P->S | QC (RAND)  |
|  2 |   TABLE ACCESS FULL | SH30_8700195_9032_0_TMP_TEST | 16555 |  468K |   16 | 43,00 | P->P | HASH       |
|  3 |   TABLE ACCESS FULL | SH30_8700195_9031_0_TMP_TEST | 2778K |  503M |  860 | 43,01 | P->P | HASH       |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access(NVL("A"."PROD_ID",'NULL!')=NVL("B"."PROD_ID",'NULL!') AND
              NVL("A"."PROD_UNIT_OF_MEASR_ID",'NULL!')=NVL("B"."PROD_UNIT_OF_MEASR_ID",'NULL!'))
Let me also share with you trace files from parallel and serial execution.
First, serial execution (only 10104 event details):

Dump file /opt/oracle/admin/cdwep4/udump/cdwep401_ora_18729.trc
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name: HP-UX
Node name: ethp1018
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: cdwep401
Redo thread mounted by this instance: 1
Oracle process number: 100
Unix process pid: 18729, image: [email protected] (TNS V1-V3)
kxhfInit(): enter
kxhfInit(): exit
*** HASH JOIN STATISTICS (INITIALIZATION) ***
Original memory: 4341760
Memory after all overhead: 4163446
Memory for slots: 3301376
Calculated overhead for partitions and row/slot managers: 862070
Hash-join fanout: 8
Number of partitions: 8
Number of slots: 13
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 240
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 1083
Estimated Row Length (includes overhead): 67
# Immutable Flags:
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff6c063b20, slotTab=800003ff6c063cb8) successfuly added
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=110)
*** END OF HASH JOIN BUILD (PHASE 1) ***
Revised row length: 68
Revised row count: 16555
Revised build size: 1089KB
kxhfResize(enter): resize to 12 slots (numAlloc=8, max=13)
kxhfResize(exit): resized to 12 slots (numAlloc=8, max=12)
Slot table resized: old=13 wanted=12 got=12 unload=0
*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 16555
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 55800
### Partition Distribution ###
Partition:0 rows:2131 clusters:1 slots:1 kept=1
Partition:1 rows:1975 clusters:1 slots:1 kept=1
Partition:2 rows:1969 clusters:1 slots:1 kept=1
Partition:3 rows:2174 clusters:1 slots:1 kept=1
Partition:4 rows:2041 clusters:1 slots:1 kept=1
Partition:5 rows:2092 clusters:1 slots:1 kept=1
Partition:6 rows:2048 clusters:1 slots:1 kept=1
Partition:7 rows:2125 clusters:1 slots:1 kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 16555
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 4096
Actual size of hash table: 4096
Number of buckets: 32768
kxhfResize(enter): resize to 14 slots (numAlloc=8, max=12)
kxhfResize(exit): resized to 14 slots (numAlloc=8, max=14)
freeze work area size to: 4357K (14 slots)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 16555
Number of in-memory partitions (may have changed): 8
Final number of hash buckets: 32768
Size (in bytes) of hash table: 262144
kxhfIterate(end_iterate): numAlloc=8, maxSlots=14
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 21129
Number of buckets with 1 rows: 8755
Number of buckets with 2 rows: 2024
Number of buckets with 3 rows: 433
Number of buckets with 4 rows: 160
Number of buckets with 5 rows: 85
Number of buckets with 6 rows: 69
Number of buckets with 7 rows: 41
Number of buckets with 8 rows: 32
Number of buckets with 9 rows: 18
Number of buckets with between 10 and 19 rows: 21
Number of buckets with between 20 and 29 rows: 1
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 32768
Empty buckets: 21129
Non-empty buckets: 11639
Total number of rows: 16555
Maximum number of rows in a bucket: 24
Average number of rows in non-empty buckets: 1.422373
=====================
.... (lots of fetching) ....
qerhjFetch: max probe row length (mpl=0)
qerhjFreeSpace(): free hash-join memory
kxhfRemoveChunk: remove chunk 0 from slot table
And finally, PQ slave output (only one trace, please note the Immutable Flag that I believe orders Oracle to buffer to TEMP):

Dump file /opt/oracle/admin/cdwep4/bdump/cdwep401_p002_4640.trc
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name: HP-UX
Node name: ethp1018
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: cdwep401
Redo thread mounted by this instance: 1
Oracle process number: 86
Unix process pid: 4640, image: [email protected] (P002)
kxhfInit(): enter
kxhfInit(): exit
*** HASH JOIN STATISTICS (INITIALIZATION) ***
Original memory: 4428800
Memory after all overhead: 4283220
Memory for slots: 3809280
Calculated overhead for partitions and row/slot managers: 473940
Hash-join fanout: 8
Number of partitions: 9
Number of slots: 15
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 280
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 645
Estimated Row Length (includes overhead): 167
# Immutable Flags:
BUFFER the output of the join for Parallel Query
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=96)
*** END OF HASH JOIN BUILD (PHASE 1) ***
Revised row length: 54
Revised row count: 4203
Revised build size: 221KB
kxhfResize(enter): resize to 16 slots (numAlloc=8, max=15)
kxhfResize(exit): resized to 16 slots (numAlloc=8, max=16)
Slot table resized: old=15 wanted=16 got=16 unload=0
*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 4203
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 85312
### Partition Distribution ###
Partition:0 rows:537 clusters:1 slots:1 kept=1
Partition:1 rows:554 clusters:1 slots:1 kept=1
Partition:2 rows:497 clusters:1 slots:1 kept=1
Partition:3 rows:513 clusters:1 slots:1 kept=1
Partition:4 rows:498 clusters:1 slots:1 kept=1
Partition:5 rows:543 clusters:1 slots:1 kept=1
Partition:6 rows:547 clusters:1 slots:1 kept=1
Partition:7 rows:514 clusters:1 slots:1 kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 4203
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 1024
Actual size of hash table: 1024
Number of buckets: 8192
kxhfResize(enter): resize to 18 slots (numAlloc=8, max=16)
kxhfResize(exit): resized to 18 slots (numAlloc=8, max=18)
freeze work area size to: 5812K (18 slots)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 4203
Number of in-memory partitions (may have changed): 8
Final number of hash buckets: 8192
Size (in bytes) of hash table: 65536
kxhfIterate(end_iterate): numAlloc=8, maxSlots=18
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 5284
Number of buckets with 1 rows: 2177
Number of buckets with 2 rows: 510
Number of buckets with 3 rows: 104
Number of buckets with 4 rows: 51
Number of buckets with 5 rows: 14
Number of buckets with 6 rows: 14
Number of buckets with 7 rows: 13
Number of buckets with 8 rows: 12
Number of buckets with 9 rows: 4
Number of buckets with between 10 and 19 rows: 9
Number of buckets with between 20 and 29 rows: 0
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 8192
Empty buckets: 5284
Non-empty buckets: 2908
Total number of rows: 4203
Maximum number of rows in a bucket: 16
Average number of rows in non-empty buckets: 1.445323
kxhfWrite: hash-join is spilling to disk
kxhfWrite: Writing dba=950281 slot=8 part=8
kxhfWrite: Writing dba=950312 slot=9 part=8
kxhfWrite: Writing dba=950343 slot=10 part=8
kxhfWrite: Writing dba=950374 slot=11 part=8
.... (lots of writing) ....
kxhfRead(): Reading dba=950281 into slot=15
kxhfIsDone: waiting slot=15 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950312 into slot=16
kxhfIsDone: waiting slot=16 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950343 into slot=17
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
EmptySlots:15 8 9 10 11 12 13
PendingSlots:
kxhfIsDone: waiting slot=17 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950374 into slot=15
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
EmptySlots:16 8 9 10 11 12 13
PendingSlots:
.... (lots of reading) ....
qerhjFetchPhase2(): building a hash table
kxhfFreeSlots(800003ff7c068980): all=1 alloc=18 max=18
EmptySlots:2 4 6 1 0 7 5 3 14 17 16 15 8 9 10 11 12 13
PendingSlots:
qerhjFreeSpace(): free hash-join memory
kxhfRemoveChunk: remove chunk 0 from slot table
Why do you think it's surprising that Oracle uses TEMP? Based on the traces, Oracle seems to be very sure it should spill to disk. I believe the key to the answer is this immutable flag printing "BUFFER the output of the join for Parallel Query" - as I mentioned in one of my previous posts, it's the opposite of "Not BUFFER(execution) output of the join for PQ", which appears in some traces found on the internet.
Best regards
Bazyli -
Bazyli,
The first part of the trace file shows that Oracle has plenty of memory to hold the hash table, hence my surprise.
The later parts of the trace show that the hash table does indeed fit in memory (all partitions marked as kept=1).
There is an interesting detail in the NUMBER of partitions declared in the opening part of the trace. This is 9, but Oracle builds the hash table for a hash join using a number of partitions that is a power of 2 - and we can see this in the list of partitions that are holding the data (the eight partitions 0 - 7 account for all rows recorded). Note also that you do not have a report of "reloading pair of flushed partitions", which would appear if you had really spilled the hash table to disc.
Your writes to disc are for partition 8 - and there is no build data in partition 8 - and the size of your clusters (slots) is big (248K), so if you were writing the hash table to disc it would take very few writes.
At present I can only guess that the writes are either a bug, or the result of the hash join output being written to disc because the query coordinator cannot keep up with the slaves - so they dump to temp so that they can continue running in the background.
What is the query - does it return very few rows, or a number similar to the size of the build table, or of the probe table?
Can you run the query with 10046 enabled at level 8 so that you can see if the writes alternate with reads of the probe table?
I have a query with the same plan - and it shows the same immutable flags setting, but does not write to disc. So the "feature" may be enabled by the flag, but is not forced by the flag.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
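For reference, a common way to enable the 10046 trace Jonathan mentions - a sketch using the standard event syntax; level 8 adds wait events, and in a parallel query each slave writes its own trace file:

ALTER SESSION SET tracefile_identifier = 'pq_hj_10046';
-- level 8 = SQL trace plus wait events
ALTER SESSION SET events '10046 trace name context forever, level 8';

-- run the parallel hash-join query here

ALTER SESSION SET events '10046 trace name context off';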
Jonathan,
Thanks for noticing two important facts that I'd missed:
- Oracle spills the ninth partition (#8), which according to the hash data distribution shouldn't be filled with any data
- it's not a real on-disk hash join - the build (smaller) table is not written to disk, only the probe (bigger) table goes there (normally an on-disk hash join would build the hash table in several iterations, each time reporting how many blocks remain in TEMP)
Answering your questions: the query is a typical data-warehousing look-up join - I try to assign each fact record some additional look-up data from a small table. The fact table is about 800 MB / 2.8M records, the look-up table occupies about 1 MB / 16k records (unique join key values). The query produces exactly the same number of rows as the fact table (2.8M).
Following your suggestion I repeated the test - this time with the 10046 event turned on (level 8 - waits). The traces were properly split between two layers of PQ slaves - the lower level reading data blocks and distributing them in a HASH manner to the upper level, which performs the HJ, spills the probe data to TEMP, then reads it back and returns it to the QC. Here is a short summary of the wait events on both levels (executed with DOP = 4):
1) lower PQ level
- executes the following queries:
CURSOR #1: look-up (small) table read
SELECT /*+ Q2836777000 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."PROD_ID" C2,A1."PROD_UNIT_OF_MEASR_ID" C3,
A1."PUOM_RELTV_QTY" C4,A1."PUOM_STAT_VAL" C5,A1."PUOM_UNIT_OF_MEASR_ID" C6,A1."PB_DNORM_COMP_ID" C7,A1."PB_DNORM_VAL_FAC" C8,A1."PB_DNORM_COMP_QTY" C9,
A1."PB_DNORM_SU_FAC" C10,A1."PB_DNORM_COMP_NUMBER" C11,A1."PB_DNORM_VOL_CREDT_CODE" C12
FROM "IDWE4"."SH30_8700195_9032_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC) A1
CURSOR #2: fact (big) table read
SELECT /*+ Q2836777001 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."CUST_ACCT_GRP_CODE" C2,A1."AFFLT_SHIP_FLAG" C3,
A1."GEO_ID" C4,A1."GEO_REPLC_STTUS_FLAG" C5,A1."GEO_SUB_CNTRY_ID" C6,A1."SUB_CNTRY_REPLC_STTUS_FLAG" C7,A1."TRADE_CHANL_ID" C8,A1."TRADE_CHANL_REPLC_STTUS_FLAG" C9,
A1."CUST_ID" C10,A1."FACT_AMT_1" C11,A1."FACT_AMT_2" C12,A1."FACT_AMT_3" C13,A1."FACT_ID" C14,A1."CUST_REPLC_STTUS_FLAG" C15,A1."LEGAL_ENT_ID" C16,A1."PROD_ID" C17,
A1."PROD_REPLC_STTUS_FLAG" C18,A1."SITE_ID" C19,A1."TIME_PERD_ID" C20,A1."FACT_TYPE_CODE" C21,A1."ISO_CRNCY_CODE_CHAR" C22,A1."SRCE_SYS_ID" C23,A1."ENGIN_LYOUT_ID" C24,
A1."DUE_PERD" C25,A1."PRTTN_CODE" C26,A1."PRTTN_BY_FUNC_DATE" C27,A1."TIME_PERD_TYPE_CODE" C28,A1."TIME_PERD_END_DATE" C29,A1."CORP_OSB_REVNU_SHIP_FLAG" C30,
A1."CORP_NON_OSB_REVNU_SHIP_FLAG" C31,A1."EXPTN_OSB_REVNU_SHIP_FLAG" C32,A1."FREE_SHIP_FLAG" C33,A1."NON_PG_MATL_SHIP_FLAG" C34,A1."STOCK_MVMNT_SHTTL_FLAG" C35,
A1."STOCK_MVMNT_WHSE_FLAG" C36,A1."RET_SHIP_FLAG" C37,A1."REFS_SHIP_FLAG" C38,A1."CARRY_SALES_FLAG" C39,A1."POST_DIVST_SHIP_FLAG" C40,A1."CBD_POST_DIVST_SHIP_FLAG" C41,
A1."PRE_ACQ_SHIP_FLAG" C42,A1."FA_DIVST_SHIP_FLAG" C43,A1."FA_ACQ_SHIP_FLAG" C44,A1."JV_SHIP_FLAG" C45,A1."PROD_UNIT_OF_MEASR_ID" C46,A1."PROFT_CTR_BUS_AREA_ID" C47,
A1."BILLG_TYPE_ID" C48,A1."SOLD_TO_CUST_ID" C49,A1."ORIG_TRANX_ID" C50,A1."BUS_INTFC_VAL" C51,A1."ATTR_VAL_1" C52,A1."ATTR_VAL_2" C53,A1."ATTR_VAL_3" C54,A1."ATTR_VAL_4" C55,
A1."ORD_ENTRY_DATE" C56,A1."ORD_STTUS" C57,A1."ATTR_VAL_75" C58,A1."ATTR_VAL_13" C59,A1."ATTR_VAL_16" C60,A1."ATTR_VAL_21" C61,A1."LEGAL_ENT_DFLT" C62,
A1."LE_DERV_PROD_STRCT_CODE" C63,A1."TRADE_CHANL_DFLT" C64
FROM "IDWE4"."SH30_8700195_9031_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC) A1
- reports the following waits (original order, skipping PQ messaging and library cache events):
EXEC #1
WAIT #1: nam='db file sequential read' - reported about 20 times, which roughly corresponds to the processed size of the inner table (about 150 KB = 1/4 of the table)
EXEC #2
WAIT #2: nam='direct path read' - reported 400 times, 64 blocks each, corresponding to the size of the processed data (about 200 MB = 1/4 of the table)
2) upper PQ level
- executes the following query:
CURSOR #1: hash join
SELECT /*+ ORDERED NO_EXPAND USE_HASH(A2) */ A1.C0,A2.C0,A1.C1,A2.C1,A1.C2,A1.C3,A1.C4,A1.C5,A1.C6,A1.C7,A1.C8,A1.C9,A1.C10,A1.C11,A1.C12,A2.C2,A2.C3,A2.C4,
A2.C5,A2.C6,A2.C7,A2.C8,A2.C9,A2.C10,A2.C11,A2.C12,A2.C13,A2.C14,A2.C15,A2.C16,A2.C17,A2.C18,A2.C19,A2.C20,A2.C21,A2.C22,A2.C23,A2.C24,A2.C25,A2.C26,A2.C27,
A2.C28,A2.C29,A2.C30,A2.C31,A2.C32,A2.C33,A2.C34,A2.C35,A2.C36,A2.C37,A2.C38,A2.C39,A2.C40,A2.C41,A2.C42,A2.C43,A2.C44,A2.C45,A2.C46,A2.C47,A2.C48,A2.C49,
A2.C50,A2.C51,A2.C52,A2.C53,A2.C54,A2.C55,A2.C56,A2.C57,A2.C58,A2.C59,A2.C60,A2.C61,A2.C62,A2.C63,A2.C64
FROM :Q2836777000 A1,:Q2836777001 A2
WHERE A2.C1=A1.C1 AND A2.C0=A1.C0
- reports the following waits (original order, skipping PQ messaging and library cache events):
EXEC #1
WAIT #1: nam='direct path write' - reported about 850 times, corresponding to the size of the probe table (about 200 MB); every write occurs for partition number 8!
WAIT #1: nam='direct path read' - reported about 850 times, reading back all the blocks that were previously spilled to TEMP
I have skipped the QC trace (it's only fetching data).
After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle spills the probe table to disk.
The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW, thanks for that!).
So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.
Warm regards
Bazyli -
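As an aside, wait-event summaries like the one above are usually easier to produce by running the raw 10046 trace files through tkprof - a sketch of the command line, with the slave trace file name taken from the earlier post and the output file name chosen arbitrarily:

tkprof cdwep401_p002_4640.trc p002_waits.txt waits=yes sys=no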
Buzzylee wrote:
I have emulated your example - the behaviour also appears in 10g and 11g.
Jonathan,
After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle spills the probe table to disk.
The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW, thanks for that!).
So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.
This is probably not a bug, but it may be a case where a generic strategy is not appropriate.
The extra partition is NOT holding the probe table, it is holding the result of the hash join. The entire result is built before it is forwarded to the next "slave set" (which happens to be the query coordinator in this case). Your memory allocation allowed for about 18 slots (multiblock IO batches) of 31 blocks each. You used 8 of them for the hash table, the rest are available to hold the result.
Somewhere in your trace, around the point where you switch from writes to reads, you should see a summary about partition 8, and the number of "in-memory slots" which will tell you about the size of the result set.
If the difference between the clusters and the in-memory slots is small, you may find that by setting the "_hash_multiblock_io_count" to a value less than the 31 that the optimizer selected you free up enough memory from the hash table to allow the result set to build in memory.
Another option - to bypass this spill - is to switch to a (broadcast, none) distribution.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
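For reference, the two workarounds Jonathan describes could be tried along these lines. This is only a sketch: the table aliases are the hypothetical ones used earlier, the value 16 is an arbitrary example, and hidden parameters such as "_hash_multiblock_io_count" should normally only be changed after checking with Oracle Support:

-- option 1: shrink the hash-join multiblock I/O size to leave more memory for the buffered result
ALTER SESSION SET "_hash_multiblock_io_count" = 16;

-- option 2: broadcast the small build table so the large table needs no redistribution
SELECT /*+ ORDERED USE_HASH(f) PQ_DISTRIBUTE(f, BROADCAST, NONE) PARALLEL(l 4) PARALLEL(f 4) */
       l.*, f.*
FROM   lookup_tab l,
       fact_tab   f
WHERE  f.prod_id = l.prod_id;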
Jonathan,
You are absolutely right - it's the result set, not the probe table, that is spilled to TEMP.
One more thing that I didn't mention earlier: I used to launch this query as CTAS with parallel DDL enabled. It spilled to disk as well, but this time it wasn't the QC that was intercepting results from the PQ slave layer performing the hash join, but another layer of slaves (the ones writing to disk). Anyway, it doesn't really explain Oracle's behaviour - it should normally pass the results of the HJ to the QC or to the next PQ level without trying to buffer them, doing it in chunks instead (that's how, in my understanding, communication between PQ slaves and the QC works). I wish I knew what stands behind this strategy...
Thanks for the workaround tips - I've chosen the data redistribution option. It's not perfect (the small table goes to each PQ slave, so there is some rework here), but it should be OK for my purposes - our build table should always be small enough to fit in memory.
Regarding the second recommendation - could you please explain why you think it could help? Isn't the result set too big to be stored in memory (it's no smaller than hundreds of MB, usually bigger than 2-3 GB)? Moreover, when I changed the HJ distribution method to Broadcast/None, the "Multiblock IO" setting in the trace changed to 15. So I checked the "_hash_multiblock_io_count" parameter on my DB - it's set to 0, which probably means it's chosen dynamically during execution. If Oracle can set it dynamically, then why does it have problems with it?
I also found the line of the trace informing us about the result set size:
qerhjFetch: PQ Partition rows:696534 clusters:907 in-memory slots 1
Why is there only 1 in-memory slot? The free-slot tracing shows something different:
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18 EmptySlots:15 8 9 10 11 12 13 PendingSlots:
However, I noticed Oracle writes and reads the data in a round-robin manner - slot after slot - so it looks like even 1 slot is enough to handle this.
BTW, I did some additional tests wrapping the problematic HJ in some other PGA-related operation (like sorting), and the results were great - the HJ stopped spilling to TEMP. Unfortunately I don't need any sort/group-by operation, so I skipped this idea.
Thanks again for tracking down the issue, I really appreciate your help Jonathan.
Warm regards
Bazyli
Edited by: Buzzylee on Oct 16, 2008 7:33 AM
This discussion has been closed.