Parallel Hash Join always swapping to TEMP

664822
Member Posts: 6
Hi,
I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash-joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5M records / 1.5 GB - spills to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange is that serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both the parallel and the serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (regardless of its size).
To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I properly analyze the results. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet the parallel execution still decides to spill (even though the inner data size for each slave is about 220 KB!).
I dug into the traces (10104 event) and found a substantial difference between the serial and the parallel execution. It looks like some internal flag orders the PQ slaves to always buffer the data; here is what I found in a PQ slave trace:
HASH JOIN STATISTICS (INITIALIZATION)
Original memory: 4428800
Memory after all overhead: 4283220
Memory for slots: 3809280
Calculated overhead for partitions and row/slot managers: 473940
Hash-join fanout: 8
Number of partitions: 9
Number of slots: 15
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 280
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 645
Estimated Row Length (includes overhead): 167
Immutable Flags:
BUFFER the output of the join for Parallel Query
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
kxhfSetPhase: phase=PROBE_1
The bolded line ("BUFFER the output of the join for Parallel Query") is the part that is not present in serial mode. Unfortunately I cannot find anything that could help identify the reason or the setting that drives this behaviour.
Best regards
Bazyli
Edited by: user10419027 on Oct 13, 2008 3:53 AM
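For reference, a minimal sketch of how this kind of test can be set up and traced. The table names (lookup_tab, fact_tab, prod_id), the hints and the trace level are illustrative assumptions, not the poster's actual objects or settings:

ALTER SESSION SET tracefile_identifier = 'pq_hj_test';
-- event 10104 dumps the hash-join work-area decisions (like the excerpt above) to the trace file
ALTER SESSION SET events '10104 trace name context forever, level 10';

-- force a parallel (DOP 4) hash join of a small look-up table to a large fact table
SELECT /*+ ORDERED USE_HASH(f) PARALLEL(l 4) PARALLEL(f 4) */
       l.*, f.*
FROM   lookup_tab l,
       fact_tab   f
WHERE  f.prod_id = l.prod_id;

ALTER SESSION SET events '10104 trace name context off';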
Answers
-
Hi,
I've experienced some strange behaviour on Oracle 9.2.0.5 recently: a simple query hash-joining two tables - the smaller one with 16k records / 1 MB, the bigger one with 2.5M records / 1.5 GB - spills to TEMP when launched in parallel mode (4 sets of PQ slaves). What is strange is that serial execution runs as expected - an in-memory hash join occurs. It's worth adding that both the parallel and the serial execution properly select the smaller table as the inner one, but the parallel query always decides to buffer the source data (regardless of its size).
The hash join is more memory intensive than a nested loop join. To be faster than a nested loop join, we must set the hash_area_size large enough to hold the entire hash table in memory (about 1.6 times the total size of the rows in the table). If the hash join overflows the hash_area_size memory, the hash join will page into the TEMP tablespace.
To be more precise - all table stats are gathered, I have enough PGA memory assigned to queries (WORKAREA_SIZE_POLICY=AUTO, PGA_AGGREGATE_TARGET=6GB) and I properly analyze the results. Even the hidden parameter _smm_px_max_size is properly set to about 2GB, yet the parallel execution still decides to spill (even though the inner data size for each slave is about 220 KB!).
http://dba-blog.blogspot.com/2005/08/pgamaxsize-hidden-parameter.html
- Pavan Kumar N -
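As context for this exchange, a quick way to check the memory-management settings being discussed - a sketch against the standard parameter and PGA statistics views (exact statistic names may differ slightly by version):

SELECT name, value
FROM   v$parameter
WHERE  name IN ('workarea_size_policy', 'pga_aggregate_target', 'hash_area_size');

SELECT name, value
FROM   v$pgastat
WHERE  name IN ('aggregate PGA target parameter', 'total PGA allocated', 'over allocation count');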
Pavan,
As I mentioned in my post - I have enough PGA memory assigned for hash/sort operations - PGA_AGGREGATE_TARGET is set to 6GB (combined with AUTO workarea management), which makes the HASH_AREA_SIZE parameter irrelevant. Also, the _pga_max_size parameter is set to 200MB, so again - it's not a matter of PGA size limits.
Moreover - the serial execution (which has 4 times more data to hash) is able to do it in memory, and I'm not surprised - it's only 1MB of data.
Cheers
Bazyli -
Hi,
Can you monitor V$SQL_WORKAREA_ACTIVE?
Since, as per that setting, the limit for a single workarea operation in a single process is 2GB -
what is the actual value of this _smm_px_max_size?
- Pavan Kumar N -
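For reference, a monitoring query along the lines Pavan suggests might look like the sketch below; it assumes the standard V$SQL_WORKAREA_ACTIVE columns and would be run from another session while the join is executing:

SELECT sid, operation_type, policy,
       actual_mem_used, max_mem_used,
       number_passes, tempseg_size
FROM   v$sql_workarea_active
ORDER  BY sid, operation_id;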
Pavan,
Yes, I can monitor workarea usage, however I don't expect to find anything useful there - it will only confirm that the hash join is spilling to disk (I saw it in V$SORT_USAGE as well). The parameter _smm_px_max_size is set to about 2GB, so that's not the case. Again - my query refuses to hash 250 KB of data in memory.
I guess there must be some internal setting that drives that "BUFFER the output of the join for Parallel Query" flag. Unfortunately I don't see any hidden parameter responsible for that...
From some sample 10104 traces I have seen on the internet, there is an opposite flag which appears as "Not BUFFER(execution) output of the join for PQ" (still under the immutable flags section of the HJ output). The question is - what drives this setting?
Cheers
Bazyli -
What is the distribution method of the join?
Can you generate an execution plan from dbms_xplan and post it? (Use the "code" tags to get readable output in a fixed font.)
Can you print the section of the plan that shows the row distribution across hash partitions, and an example of the section where Oracle starts phase 2 and reads a matching pair of partitions?
Given the information that Oracle has shown in this part of the trace, it seems very surprising that it manages to spill to disc.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
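A sketch of how the requested plan can be produced with dbms_xplan (available from 9.2 onwards); the statement text is a placeholder for the actual join:

EXPLAIN PLAN FOR
SELECT /*+ ORDERED USE_HASH(f) PARALLEL(l 4) PARALLEL(f 4) */ l.*, f.*
FROM   lookup_tab l, fact_tab f
WHERE  f.prod_id = l.prod_id;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);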
Jonathan,
Distribution seems to be as expected (HASH/HASH), please have a look at the query plan:

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------------------------
| Id | Operation           | Name                         | Rows  | Bytes | Cost | TQ    |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT    |                              |  456K |   95M |  876 |       |      |            |
|* 1 |  HASH JOIN          |                              |  456K |   95M |  876 | 43,02 | P->S | QC (RAND)  |
|  2 |   TABLE ACCESS FULL | SH30_8700195_9032_0_TMP_TEST | 16555 |  468K |   16 | 43,00 | P->P | HASH       |
|  3 |   TABLE ACCESS FULL | SH30_8700195_9031_0_TMP_TEST | 2778K |  503M |  860 | 43,01 | P->P | HASH       |
-------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - access(NVL("A"."PROD_ID",'NULL!')=NVL("B"."PROD_ID",'NULL!') AND
              NVL("A"."PROD_UNIT_OF_MEASR_ID",'NULL!')=NVL("B"."PROD_UNIT_OF_MEASR_ID",'NULL!'))
Let me also share with you trace files from parallel and serial execution.
First, serial execution (only 10104 event details):

Dump file /opt/oracle/admin/cdwep4/udump/cdwep401_ora_18729.trc
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name: HP-UX
Node name: ethp1018
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: cdwep401
Redo thread mounted by this instance: 1
Oracle process number: 100
Unix process pid: 18729, image: [email protected] (TNS V1-V3)
kxhfInit(): enter
kxhfInit(): exit
*** HASH JOIN STATISTICS (INITIALIZATION) ***
Original memory: 4341760
Memory after all overhead: 4163446
Memory for slots: 3301376
Calculated overhead for partitions and row/slot managers: 862070
Hash-join fanout: 8
Number of partitions: 8
Number of slots: 13
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 240
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 1083
Estimated Row Length (includes overhead): 67
# Immutable Flags:
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff6c063b20, slotTab=800003ff6c063cb8) successfuly added
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=110)
*** END OF HASH JOIN BUILD (PHASE 1) ***
Revised row length: 68
Revised row count: 16555
Revised build size: 1089KB
kxhfResize(enter): resize to 12 slots (numAlloc=8, max=13)
kxhfResize(exit): resized to 12 slots (numAlloc=8, max=12)
Slot table resized: old=13 wanted=12 got=12 unload=0
*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 16555
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 55800
### Partition Distribution ###
Partition:0 rows:2131 clusters:1 slots:1 kept=1
Partition:1 rows:1975 clusters:1 slots:1 kept=1
Partition:2 rows:1969 clusters:1 slots:1 kept=1
Partition:3 rows:2174 clusters:1 slots:1 kept=1
Partition:4 rows:2041 clusters:1 slots:1 kept=1
Partition:5 rows:2092 clusters:1 slots:1 kept=1
Partition:6 rows:2048 clusters:1 slots:1 kept=1
Partition:7 rows:2125 clusters:1 slots:1 kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 16555
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 4096
Actual size of hash table: 4096
Number of buckets: 32768
kxhfResize(enter): resize to 14 slots (numAlloc=8, max=12)
kxhfResize(exit): resized to 14 slots (numAlloc=8, max=14)
freeze work area size to: 4357K (14 slots)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 16555
Number of in-memory partitions (may have changed): 8
Final number of hash buckets: 32768
Size (in bytes) of hash table: 262144
kxhfIterate(end_iterate): numAlloc=8, maxSlots=14
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 21129
Number of buckets with 1 rows: 8755
Number of buckets with 2 rows: 2024
Number of buckets with 3 rows: 433
Number of buckets with 4 rows: 160
Number of buckets with 5 rows: 85
Number of buckets with 6 rows: 69
Number of buckets with 7 rows: 41
Number of buckets with 8 rows: 32
Number of buckets with 9 rows: 18
Number of buckets with between 10 and 19 rows: 21
Number of buckets with between 20 and 29 rows: 1
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 32768
Empty buckets: 21129
Non-empty buckets: 11639
Total number of rows: 16555
Maximum number of rows in a bucket: 24
Average number of rows in non-empty buckets: 1.422373
=====================
.... (lots of fetching) ....
qerhjFetch: max probe row length (mpl=0)
qerhjFreeSpace(): free hash-join memory
kxhfRemoveChunk: remove chunk 0 from slot table
And finally, PQ slave output (only one trace, please note the Immutable Flag that I believe orders Oracle to buffer to TEMP):

Dump file /opt/oracle/admin/cdwep4/bdump/cdwep401_p002_4640.trc
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
ORACLE_HOME = /opt/oracle/product/9.2.0.5
System name: HP-UX
Node name: ethp1018
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: cdwep401
Redo thread mounted by this instance: 1
Oracle process number: 86
Unix process pid: 4640, image: [email protected] (P002)
kxhfInit(): enter
kxhfInit(): exit
*** HASH JOIN STATISTICS (INITIALIZATION) ***
Original memory: 4428800
Memory after all overhead: 4283220
Memory for slots: 3809280
Calculated overhead for partitions and row/slot managers: 473940
Hash-join fanout: 8
Number of partitions: 9
Number of slots: 15
Multiblock IO: 31
Block size(KB): 8
Cluster (slot) size(KB): 248
Hash-join fanout (manual): 8
Cluster/slot size(KB) (manual): 280
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 1455
Estimated build size (KB): 645
Estimated Row Length (includes overhead): 167
# Immutable Flags:
BUFFER the output of the join for Parallel Query
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=800003ff640ebb50, slotTab=800003ff640ebce8) successfuly added
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=96)
*** END OF HASH JOIN BUILD (PHASE 1) ***
Revised row length: 54
Revised row count: 4203
Revised build size: 221KB
kxhfResize(enter): resize to 16 slots (numAlloc=8, max=15)
kxhfResize(exit): resized to 16 slots (numAlloc=8, max=16)
Slot table resized: old=15 wanted=16 got=16 unload=0
*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 4203
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 85312
### Partition Distribution ###
Partition:0 rows:537 clusters:1 slots:1 kept=1
Partition:1 rows:554 clusters:1 slots:1 kept=1
Partition:2 rows:497 clusters:1 slots:1 kept=1
Partition:3 rows:513 clusters:1 slots:1 kept=1
Partition:4 rows:498 clusters:1 slots:1 kept=1
Partition:5 rows:543 clusters:1 slots:1 kept=1
Partition:6 rows:547 clusters:1 slots:1 kept=1
Partition:7 rows:514 clusters:1 slots:1 kept=1
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Revised number of hash buckets (after flushing): 4203
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 1024
Actual size of hash table: 1024
Number of buckets: 8192
kxhfResize(enter): resize to 18 slots (numAlloc=8, max=16)
kxhfResize(exit): resized to 18 slots (numAlloc=8, max=18)
freeze work area size to: 5812K (18 slots)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 4203
Number of in-memory partitions (may have changed): 8
Final number of hash buckets: 8192
Size (in bytes) of hash table: 65536
kxhfIterate(end_iterate): numAlloc=8, maxSlots=18
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 5284
Number of buckets with 1 rows: 2177
Number of buckets with 2 rows: 510
Number of buckets with 3 rows: 104
Number of buckets with 4 rows: 51
Number of buckets with 5 rows: 14
Number of buckets with 6 rows: 14
Number of buckets with 7 rows: 13
Number of buckets with 8 rows: 12
Number of buckets with 9 rows: 4
Number of buckets with between 10 and 19 rows: 9
Number of buckets with between 20 and 29 rows: 0
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Number of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 8192
Empty buckets: 5284
Non-empty buckets: 2908
Total number of rows: 4203
Maximum number of rows in a bucket: 16
Average number of rows in non-empty buckets: 1.445323
kxhfWrite: hash-join is spilling to disk
kxhfWrite: Writing dba=950281 slot=8 part=8
kxhfWrite: Writing dba=950312 slot=9 part=8
kxhfWrite: Writing dba=950343 slot=10 part=8
kxhfWrite: Writing dba=950374 slot=11 part=8
.... (lots of writing) ....
kxhfRead(): Reading dba=950281 into slot=15
kxhfIsDone: waiting slot=15 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950312 into slot=16
kxhfIsDone: waiting slot=16 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950343 into slot=17
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
EmptySlots:15 8 9 10 11 12 13
PendingSlots:
kxhfIsDone: waiting slot=17 lbs=800003ff640ebb50
kxhfRead(): Reading dba=950374 into slot=15
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18
EmptySlots:16 8 9 10 11 12 13
PendingSlots:
.... (lots of reading) ....
qerhjFetchPhase2(): building a hash table
kxhfFreeSlots(800003ff7c068980): all=1 alloc=18 max=18
EmptySlots:2 4 6 1 0 7 5 3 14 17 16 15 8 9 10 11 12 13
PendingSlots:
qerhjFreeSpace(): free hash-join memory
kxhfRemoveChunk: remove chunk 0 from slot table
Why do you think it's surprising that Oracle uses TEMP? Based on the traces, Oracle seems to be very sure it should spill to disk. I believe the key to the answer is this immutable flag printing "BUFFER the output of the join for Parallel Query" - as I mentioned in one of my previous posts, it's the opposite of "Not BUFFER(execution) output of the join for PQ", which appears in some traces found on the internet.
Best regards
Bazyli -
Bazyli,
The first part of the trace file shows that Oracle has plenty of memory to hold the hash table, hence my surprise.
The later parts of the trace show that the hash table does indeed fit in memory (all partitions marked as kept=1).
There is an interesting detail in the NUMBER of partitions declared in the opening part of the trace. This is 9, but Oracle builds the hash table for a hash join using a number of partitions that is a power of 2 - and we can see this in the list of partitions that are holding the data (the eight partitions 0 - 7 account for all rows recorded). Note also that you do not have a report of "reloading pair of flushed partitions", which would appear if you had really spilled the hash table to disc.
Your writes to disc are for partition 8 - and there is no build data in partition 8 - and the size of your clusters (slots) is big (248K), so if you were writing the hash table to disc it would take very few writes.
At present I can only guess that the writes are either a bug, or the result of the hash join output being written to disc because the query coordinator cannot keep up with the slaves - so they dump to temp so that they can continue running in the background.
What is the query - does it return very few rows, or a number similar to the size of the build table, or of the probe table?
Can you run the query with 10046 enabled at level 8 so that you can see if the writes alternate with reads of the probe table?
I have a query with the same plan - and it shows the same immutable flags setting, but does not write to disc. So the "feature" may be enabled by the flag, but is not forced by the flag.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
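For reference, a common way to enable the 10046 trace Jonathan mentions - a sketch using the standard event syntax; level 8 adds wait events, and in a parallel query each slave writes its own trace file:

ALTER SESSION SET tracefile_identifier = 'pq_hj_10046';
-- level 8 = SQL trace plus wait events
ALTER SESSION SET events '10046 trace name context forever, level 8';

-- run the parallel hash-join query here

ALTER SESSION SET events '10046 trace name context off';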
Jonathan,
Thanks for noticing two important facts that I'd missed:
- Oracle spills the ninth partition (#8), which according to the hash data distribution shouldn't be filled with any data
- it's not a real on-disk hash join - the build (smaller) table is not written to disk, only the probe (bigger) table goes there (normally an on-disk hash join would build the hash table in several iterations, each time reporting how many blocks remain in TEMP)
Answering your questions: the query is a typical data-warehousing look-up join - I try to assign each fact record some additional look-up data from a small table. The fact table is about 800 MB / 2.8M records, the look-up table occupies about 1 MB / 16k records (unique join key values). The query produces exactly the same number of rows as the fact table (2.8M).
Following your suggestion I repeated the test - this time with the 10046 event turned on (level 8 - waits). The traces were properly split between two layers of PQ slaves - the lower level reading data blocks and distributing them in a HASH manner to the upper level, which performs the HJ, spills the probe data to TEMP, then reads it back and returns it to the QC. Here is a short summary of the wait events on both levels (executed with DOP = 4):
1) lower PQ level
- executes the following queries:
CURSOR #1: look-up (small) table read
SELECT /*+ Q2836777000 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."PROD_ID" C2,A1."PROD_UNIT_OF_MEASR_ID" C3,
A1."PUOM_RELTV_QTY" C4,A1."PUOM_STAT_VAL" C5,A1."PUOM_UNIT_OF_MEASR_ID" C6,A1."PB_DNORM_COMP_ID" C7,A1."PB_DNORM_VAL_FAC" C8,A1."PB_DNORM_COMP_QTY" C9,
A1."PB_DNORM_SU_FAC" C10,A1."PB_DNORM_COMP_NUMBER" C11,A1."PB_DNORM_VOL_CREDT_CODE" C12
FROM "IDWE4"."SH30_8700195_9032_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC) A1
CURSOR #2: fact (big) table read
SELECT /*+ Q2836777001 NO_EXPAND ROWID(A1) */ NVL(A1."PROD_ID",'NULL!') C0,NVL(A1."PROD_UNIT_OF_MEASR_ID",'NULL!') C1,A1."CUST_ACCT_GRP_CODE" C2,A1."AFFLT_SHIP_FLAG" C3,
A1."GEO_ID" C4,A1."GEO_REPLC_STTUS_FLAG" C5,A1."GEO_SUB_CNTRY_ID" C6,A1."SUB_CNTRY_REPLC_STTUS_FLAG" C7,A1."TRADE_CHANL_ID" C8,A1."TRADE_CHANL_REPLC_STTUS_FLAG" C9,
A1."CUST_ID" C10,A1."FACT_AMT_1" C11,A1."FACT_AMT_2" C12,A1."FACT_AMT_3" C13,A1."FACT_ID" C14,A1."CUST_REPLC_STTUS_FLAG" C15,A1."LEGAL_ENT_ID" C16,A1."PROD_ID" C17,
A1."PROD_REPLC_STTUS_FLAG" C18,A1."SITE_ID" C19,A1."TIME_PERD_ID" C20,A1."FACT_TYPE_CODE" C21,A1."ISO_CRNCY_CODE_CHAR" C22,A1."SRCE_SYS_ID" C23,A1."ENGIN_LYOUT_ID" C24,
A1."DUE_PERD" C25,A1."PRTTN_CODE" C26,A1."PRTTN_BY_FUNC_DATE" C27,A1."TIME_PERD_TYPE_CODE" C28,A1."TIME_PERD_END_DATE" C29,A1."CORP_OSB_REVNU_SHIP_FLAG" C30,
A1."CORP_NON_OSB_REVNU_SHIP_FLAG" C31,A1."EXPTN_OSB_REVNU_SHIP_FLAG" C32,A1."FREE_SHIP_FLAG" C33,A1."NON_PG_MATL_SHIP_FLAG" C34,A1."STOCK_MVMNT_SHTTL_FLAG" C35,
A1."STOCK_MVMNT_WHSE_FLAG" C36,A1."RET_SHIP_FLAG" C37,A1."REFS_SHIP_FLAG" C38,A1."CARRY_SALES_FLAG" C39,A1."POST_DIVST_SHIP_FLAG" C40,A1."CBD_POST_DIVST_SHIP_FLAG" C41,
A1."PRE_ACQ_SHIP_FLAG" C42,A1."FA_DIVST_SHIP_FLAG" C43,A1."FA_ACQ_SHIP_FLAG" C44,A1."JV_SHIP_FLAG" C45,A1."PROD_UNIT_OF_MEASR_ID" C46,A1."PROFT_CTR_BUS_AREA_ID" C47,
A1."BILLG_TYPE_ID" C48,A1."SOLD_TO_CUST_ID" C49,A1."ORIG_TRANX_ID" C50,A1."BUS_INTFC_VAL" C51,A1."ATTR_VAL_1" C52,A1."ATTR_VAL_2" C53,A1."ATTR_VAL_3" C54,A1."ATTR_VAL_4" C55,
A1."ORD_ENTRY_DATE" C56,A1."ORD_STTUS" C57,A1."ATTR_VAL_75" C58,A1."ATTR_VAL_13" C59,A1."ATTR_VAL_16" C60,A1."ATTR_VAL_21" C61,A1."LEGAL_ENT_DFLT" C62,
A1."LE_DERV_PROD_STRCT_CODE" C63,A1."TRADE_CHANL_DFLT" C64
FROM "IDWE4"."SH30_8700195_9031_0_TMP_TEST" PX_GRANULE(0, BLOCK_RANGE, DYNAMIC) A1
- reports the following waits (original order, skipping PQ messaging and library cache events):
EXEC #1
WAIT #1: nam='db file sequential read' - reported about 20 times, which roughly corresponds to the processed size of the inner table (about 150 KB = 1/4 of the table)
EXEC #2
WAIT #2: nam='direct path read' - reported 400 times, 64 blocks each, corresponding to the size of the processed data (about 200 MB = 1/4 of the table)
2) upper PQ level
- executes the following query:
CURSOR #1: hash join
SELECT /*+ ORDERED NO_EXPAND USE_HASH(A2) */ A1.C0,A2.C0,A1.C1,A2.C1,A1.C2,A1.C3,A1.C4,A1.C5,A1.C6,A1.C7,A1.C8,A1.C9,A1.C10,A1.C11,A1.C12,A2.C2,A2.C3,A2.C4,
A2.C5,A2.C6,A2.C7,A2.C8,A2.C9,A2.C10,A2.C11,A2.C12,A2.C13,A2.C14,A2.C15,A2.C16,A2.C17,A2.C18,A2.C19,A2.C20,A2.C21,A2.C22,A2.C23,A2.C24,A2.C25,A2.C26,A2.C27,
A2.C28,A2.C29,A2.C30,A2.C31,A2.C32,A2.C33,A2.C34,A2.C35,A2.C36,A2.C37,A2.C38,A2.C39,A2.C40,A2.C41,A2.C42,A2.C43,A2.C44,A2.C45,A2.C46,A2.C47,A2.C48,A2.C49,
A2.C50,A2.C51,A2.C52,A2.C53,A2.C54,A2.C55,A2.C56,A2.C57,A2.C58,A2.C59,A2.C60,A2.C61,A2.C62,A2.C63,A2.C64
FROM :Q2836777000 A1,:Q2836777001 A2
WHERE A2.C1=A1.C1 AND A2.C0=A1.C0
- reports the following waits (original order, skipping PQ messaging and library cache events):
EXEC #1
WAIT #1: nam='direct path write' - reported about 850 times, corresponding to the size of the probe table (about 200 MB); every write occurs for partition number 8!
WAIT #1: nam='direct path read' - reported about 850 times, reading back all the blocks that were previously spilled to TEMP
I have skipped the QC trace (it's only fetching data).
After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle spills the probe table to disk.
The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW, thanks for that!).
So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.
Warm regards
Bazyli -
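As an aside, wait-event summaries like the one above are usually easier to produce by running the raw 10046 trace files through tkprof - a sketch of the command line, with the slave trace file name taken from the earlier post and the output file name chosen arbitrarily:

tkprof cdwep401_p002_4640.trc p002_waits.txt waits=yes sys=no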
Buzzylee wrote:
I have emulated your example - the behaviour also appears in 10g and 11g.
Jonathan,
After today's tests my understanding of the problem hasn't significantly changed - I still don't get why Oracle spills the probe table to disk.
The only new thing is that I can see it's not a typical "on-disk" hash join, because the inner table is not written to TEMP. Plus you confirmed that the immutable flag isn't forcing such behaviour (BTW, thanks for that!).
So maybe it's a bug? In the meantime I've checked it against a newer version of the DB (9.2.0.8) - still the same behaviour.
This is probably not a bug, but it may be a case where a generic strategy is not appropriate.
The extra partition is NOT holding the probe table, it is holding the result of the hash join. The entire result is built before it is forwarded to the next "slave set" (which happens to be the query coordinator in this case). Your memory allocation allowed for about 18 slots (multiblock IO batches) of 31 blocks each. You used 8 of them for the hash table, the rest are available to hold the result.
Somewhere in your trace, around the point where you switch from writes to reads, you should see a summary about partition 8, and the number of "in-memory slots" which will tell you about the size of the result set.
If the difference between the clusters and the in-memory slots is small, you may find that by setting the "_hash_multiblock_io_count" to a value less than the 31 that the optimizer selected you free up enough memory from the hash table to allow the result set to build in memory.
Another option - to bypass this spill - is to switch to a (broadcast, none) distribution.
Regards
Jonathan Lewis
http://jonathanlewis.wordpress.com
http://www.jlcomp.demon.co.uk -
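For reference, the two workarounds Jonathan describes could be tried along these lines. This is only a sketch: the table aliases are the hypothetical ones used earlier, the value 16 is an arbitrary example, and hidden parameters such as "_hash_multiblock_io_count" should normally only be changed after checking with Oracle Support:

-- option 1: shrink the hash-join multiblock I/O size to leave more memory for the buffered result
ALTER SESSION SET "_hash_multiblock_io_count" = 16;

-- option 2: broadcast the small build table so the large table needs no redistribution
SELECT /*+ ORDERED USE_HASH(f) PQ_DISTRIBUTE(f, BROADCAST, NONE) PARALLEL(l 4) PARALLEL(f 4) */
       l.*, f.*
FROM   lookup_tab l,
       fact_tab   f
WHERE  f.prod_id = l.prod_id;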
Jonathan,
You are absolutely right - it's the result set, not the probe table, that is spilled to TEMP.
One more thing that I didn't mention earlier: I used to launch this query as CTAS with parallel DDL enabled. It spilled to disk as well, but this time it wasn't the QC that was intercepting results from the PQ slave layer performing the hash join, but another layer of slaves (the ones writing to disk). Anyway, it doesn't really explain Oracle's behaviour - it should normally pass the results of the HJ to the QC or to the next PQ level without trying to buffer them, doing it in chunks instead (that's how, in my understanding, communication between PQ slaves and the QC works). I wish I knew what stands behind this strategy...
Thanks for the workaround tips - I've chosen the data redistribution option. It's not perfect (the small table goes to each PQ slave, so there is some rework here), but it should be OK for my purposes - our build table should always be small enough to fit in memory.
Regarding the second recommendation - could you please explain why you think it could help? Isn't the result set too big to be stored in memory (it's no smaller than hundreds of MB, usually bigger than 2-3 GB)? Moreover, when I changed the HJ distribution method to Broadcast/None, the "Multiblock IO" setting in the trace changed to 15. So I checked the "_hash_multiblock_io_count" parameter on my DB - it's set to 0, which probably means it's chosen dynamically during execution. If Oracle can set it dynamically, then why does it have problems with it?
I also found the line of the trace informing us about the result set size:
qerhjFetch: PQ Partition rows:696534 clusters:907 in-memory slots 1
Why is there only 1 in-memory slot? The free-slot tracing shows something different:
kxhfFreeSlots(800003ff7c068918): all=0 alloc=18 max=18 EmptySlots:15 8 9 10 11 12 13 PendingSlots:
However, I noticed Oracle writes and reads the data in a round-robin manner - slot after slot - so it looks like even 1 slot is enough to handle this.
BTW, I did some additional tests wrapping the problematic HJ in some other PGA-related operation (like sorting), and the results were great - the HJ stopped spilling to TEMP. Unfortunately I don't need any sort/group-by operation, so I skipped this idea.
Thanks again for tracking down the issue, I really appreciate your help Jonathan.
Warm regards
Bazyli
Edited by: Buzzylee on Oct 16, 2008 7:33 AM
This discussion has been closed.