help with sql monitor report

Dear community,
We are literally stuck on the MERGE SQL below in our ETL code chain.
The curious thing is that the report was obtained from a stored procedure,
yet the same MERGE statement comes back instantly with a zero merge count when we replay the SQL outside the stored procedure in SQL Developer.
The inner SQL in the MERGE statement's USING (SELECT ...) clause also returns no rows.
Similar runs with comparable data also came back instantly; at some point the performance just degraded, and now it literally hangs.
SQL Monitoring Report

SQL Text
------------------------------
MERGE INTO tmp_repair_status_note e
USING ( SELECT rsn.repair_note_id AS repair_note_id,
               rsn.repair_status_id AS repair_status_id,
               rn.repair_id, rn.type_cd, rn.created_dt,
               rs.status_cd, rs.status_dt,
               'N' new_status, 'N' new_note
        FROM   repair_status_note rsn, repair_note rn,
               repair_status rs, apl_repair_publish arp
        WHERE  rsn.repair_note_id = rn.repair_note_id
        AND    rsn.repair_status_id = rs.repair_status_id
        AND    arp.repair_id = rs.repair_id
        AND    arp.batch_id = :B3
        AND    arp.id >= :B2
        AND    arp.id <= :B1 ) h
ON (    e.repair_id = h.repair_id
    AND e.status_cd = h.status_cd
    AND CAST(e.created_dt AS DATE) = CAST(h.created_dt AS DATE)
    AND e.type_cd = h.type_cd )
WHEN MATCHED THEN UPDATE SET
    e.repair_status_id = h.repair_status_id,
    e.repair_note_id   = h.repair_note_id,
    e.new_status       = h.new_status,
    e.new_note         = h.new_note
LOG ERRORS INTO repair_status_note_errlog (TO_CHAR(:b4))
REJECT LIMIT UNLIMITED

Error: ORA-28
------------------------------
ORA-00028: your session has been killed

Global Information
------------------------------
 Status              : DONE (ERROR)
 Instance ID         : 1
 Session             : GCRM_MIG_USER (8080:59224)
 SQL ID              : 1nv5xg2n2z06r
 SQL Execution ID    : 17090972
 Execution Started   : 07/03/2019 01:12:03
 First Refresh Time  : 07/03/2019 01:12:13
 Last Refresh Time   : 07/03/2019 01:21:17
 Duration            : 554s
 Module/Action       : SQL Developer/-
 Service             : SYS$USERS
 Program             : SQL Developer
 PLSQL Entry Ids (Object/Subprogram)   : 4752687,1
 PLSQL Current Ids (Object/Subprogram) : 4752690,26

Binds
========================================
| Name | Position | Type   | Value  |
========================================
| :B3  |        1 | NUMBER | -99926 |
| :B2  |        2 | NUMBER |      1 |
| :B1  |        3 | NUMBER |    998 |
========================================

Global Stats
=========================================
| Elapsed | Cpu     | Other    | Buffer |
| Time(s) | Time(s) | Waits(s) | Gets   |
=========================================
|     553 |     326 |      227 |     6M |
=========================================

SQL Plan Monitoring Details (Plan Hash Value=4290347794)
| Id | Operation                                  | Name                   | Rows (Estim) | Cost | Time Active(s) | Start Active | Execs | Rows (Actual) | Mem (Max) | Activity (%) | Activity Detail (# samples) |
|  0 | MERGE STATEMENT                            |                        |       |      |     |     |    1 |      |     |       |           |
|  1 | MERGE                                      | TMP_REPAIR_STATUS_NOTE |       |      |     |     |    1 |      |     |       |           |
|  2 | VIEW                                       |                        |       |      |     |     |    1 |      |     |       |           |
|  3 | FILTER                                     |                        |       |      |     |     |    1 |      |     |       |           |
|  4 | NESTED LOOPS                               |                        |     1 | 128M |     |     |    1 |      |     |       |           |
|  5 | NESTED LOOPS                               |                        | 25765 | 128M | 545 | +10 |    1 |    0 |     |  1.28 | Cpu (7)   |
|  6 | NESTED LOOPS                               |                        |    1M | 125M | 545 | +10 |    1 | 217M |     |  1.47 | Cpu (8)   |
|  7 | MERGE JOIN CARTESIAN                       |                        |   31M | 265K | 549 |  +6 |    1 |  27M |     |  0.18 | Cpu (1)   |
|  8 | TABLE ACCESS BY INDEX ROWID                | APL_REPAIR_PUBLISH_TB  |   606 |  218 |  57 | +10 |    1 |  542 |     |       |           |
|  9 | INDEX RANGE SCAN                           | APL_PUB_UK1            |   606 |    5 | 545 | +10 |    1 |  542 |     |       |           |
| 10 | BUFFER SORT                                |                        | 51385 | 265K | 545 | +10 |  542 |  27M | 17M |  2.56 | Cpu (14)  |
| 11 | TABLE ACCESS FULL                          | TMP_REPAIR_STATUS_NOTE | 51385 |  437 |   1 | +10 |    1 | 49745|     |       |           |
| 12 | TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | REPAIR_STATUS          |     1 |    4 | 550 |  +5 |  27M | 217M |     | 38.46 | Cpu (210) |
| 13 | INDEX RANGE SCAN                           | RST_STSCD_RPR_IDX      |     1 |    3 | 553 |  +2 |  27M | 217M |     | 15.20 | Cpu (83)  |
| 14 | INDEX RANGE SCAN                           | RSN_PK_IDX             |     1 |    3 | 488 | +66 | 217M |    0 |     | 40.66 | Cpu (222) |
| 15 | TABLE ACCESS BY GLOBAL INDEX ROWID         | REPAIR_NOTE            |     1 |    3 |     |     |      |      |     |       |           |
| 16 | INDEX UNIQUE SCAN                          | RNO_PK_IDX             |     1 |    2 |     |     |      |      |     |       |           |
Best Answer
-
The cast need only be applied to the timestamp, not to both sides. This MIGHT make some difference.
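A sketch of that rewrite (a hypothetical ON clause, assuming, as the OP confirms later in the thread, that one CREATED_DT column is a TIMESTAMP and the other is already a DATE):

```sql
-- Cast applied only to the TIMESTAMP side, leaving the DATE column
-- E.CREATED_DT bare so the optimizer can use its column statistics
-- (and any index on it) directly:
ON (    e.repair_id = h.repair_id
    AND e.status_cd = h.status_cd
    AND e.created_dt = CAST(h.created_dt AS DATE)
    AND e.type_cd   = h.type_cd )
```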
The plan you've just labelled the GOOD plan is the plan you first showed us, with the implication that it was a bad plan that was taking unexpected minutes to run or even "hang". It's also noticeable that the new version of the plan has completely different statistics from the previous versions, by a few orders of magnitude.
The difference between the two plans you've shown us is simply in the order in which the two tables in the merge cartesian are joined, and the difference in total cost is very small - this makes me wonder whether you've made some significant adjustments to the optimizer_index_cost_adj and optimizer_index_caching parameters and thereby managed to make it hard for the optimizer to see the potential difference between two plans.
It's interesting to see that the BYTES column of your plan suggests the average row length (or sum of the column lengths that you want) from the tmp_repair_status_note table is about 25KB. This is very large; can you tell us if there's anything special about the table or its columns that might make Oracle think this, and might make the query less efficient? It's worth remembering that if the rows are very large it's a good idea to use an inline view (for both the INTO and the USING) which lists only the relevant columns - see https://jonathanlewis.wordpress.com/2016/06/06/merge-precision/
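A sketch of the inline-view approach from the linked article, applied to this statement (the column list is assumed from the posted MERGE; only the columns the statement actually references are exposed, so the very wide rows are not dragged through the plan):

```sql
MERGE INTO (SELECT repair_id, status_cd, created_dt, type_cd,
                   repair_status_id, repair_note_id,
                   new_status, new_note
            FROM   tmp_repair_status_note) e
USING ( /* the existing SELECT ... FROM repair_status_note rsn, ... */ ) h
ON (    e.repair_id = h.repair_id
    AND e.status_cd = h.status_cd
    AND CAST(e.created_dt AS DATE) = CAST(h.created_dt AS DATE)
    AND e.type_cd = h.type_cd )
WHEN MATCHED THEN UPDATE SET
    e.repair_status_id = h.repair_status_id,
    e.repair_note_id   = h.repair_note_id,
    e.new_status       = h.new_status,
    e.new_note         = h.new_note
```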
Please note, though, that in your position I would at least have tested the rewrite. If the problem is not important enough to test a suggestion that's backed by a reasonable explanation, then the problem isn't important enough for anyone here to spend time trying to help you. Your comment that "it's old data but there's a mission critical OLTP application" seems rather self-contradictory - the statement seems to be badly written and it looks like it should be changed; wanting a "trick" because it's "old data" while using the code "with a trick" in a mission critical system sounds like a dangerous strategy.
If you want a "trick", though - you think you've got a good plan - create a baseline (or SQL Patch) or faked SQL Profile to force Oracle to use that plan.
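For example, loading the known-good plan from the cursor cache as a baseline might look like the following sketch (sql_id and plan hash value taken from the posted report; this assumes the cursor is still in the shared pool and the session has the ADMINISTER SQL MANAGEMENT OBJECT privilege):

```sql
-- Capture plan 4290347794 for sql_id 1nv5xg2n2z06r as an accepted
-- SQL plan baseline so the optimizer keeps reproducing it.
DECLARE
  l_plans PLS_INTEGER;
BEGIN
  l_plans := DBMS_SPM.LOAD_PLANS_FROM_CURSOR_CACHE(
               sql_id          => '1nv5xg2n2z06r',
               plan_hash_value => 4290347794);
  DBMS_OUTPUT.PUT_LINE(l_plans || ' plan(s) loaded');
END;
/
```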
Regards
Jonathan Lewis
Answers
-
ALWAYS post Oracle version to 4 decimal places!
-
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
PL/SQL Release 12.1.0.2.0 - Production
"CORE 12.1.0.2.0 Production"
TNS for IBM/AIX RISC System/6000: Version 12.1.0.2.0 - Production
NLSRTL Version 12.1.0.2.0 - Production
-
I can spot a table with the prefix TMP; is that a temporary table that is only populated inside your stored procedure? I find it most likely that when you try to replicate the problem you are not populating this table with the same data.
It looks like the Cartesian join to that table will have the most impact on your performance.
From a brief read (hard to be 100% sure when I'm reading on my phone), it looks like you've coded a cross join in the USING subquery. Is that intentional? The optimiser has decided that it would be better to go straight from apl_repair_publish_tb to the table you're trying to merge into (as a cross join) before it can join to repair_note with some filter (presumably repair_id).
Did you intentionally use a cross join in your using subquery? If so you’d probably be best off making sure your statistics on your tmp_ table are representative of the data to help the optimizer come up with a different strategy
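For instance, a sketch of refreshing the statistics on the temporary table right after the procedure populates it (owner and method_opt value are assumptions; adjust to your environment):

```sql
BEGIN
  -- Re-gather stats so the optimizer sees the real row count of the
  -- freshly loaded staging table before the MERGE is parsed.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,
    tabname    => 'TMP_REPAIR_STATUS_NOTE',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');
END;
/
```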
-
The surprising thing about this query is that the optimizer KNOWS that it's going to be very expensive, and KNOWS that the cartesian join is going to produce a HUGE amount of data (prediction 31M rows vs. 27M reached before the query was cancelled), and the estimates for the two input tables are in the right ballpark.
The thing to note is that when you get to 27M rows the optimizer's estimate for the next nested loop drops to 1M rows while the actual climbs to 217M rows; the next nested loop drops the estimate to 26,000 and the next to 1. You spend a huge amount of time probing repair_status 27M times, and even more probing rsn_pk_idx 217M times.
You've shown us the code (which looks like a weird code generator that simply cross-joins all the tables and then puts the things a person would write in the ON clause into the WHERE clause - and that may be why the optimizer has made some bad estimates, of course) [update: ignore the previous comment, I was looking at a completely different piece of code], but we need to see how those predicates have been applied if we are to give you any more detailed suggestions. First line of attack, though - examine the predicate section (dbms_xplan.display_cursor while the query is running) and find out what it is about the nested loop joins that makes Oracle think the joins will reduce the result set dramatically at each stage.
It's also worth checking that there isn't an SQL Plan Baseline (or SQL Profile) attached to this query that is forcing the plan to appear (perhaps because it was a good plan when the query was first written and the data sets were smaller). This would appear in the Note section from dbms_xplan.display_cursor()
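Both checks can be made with one call; a sketch using the sql_id from the monitor report (NULL for the child number lists every child cursor):

```sql
-- Show the runtime plan plus the predicate and Note sections;
-- a baseline or profile forcing the plan is reported in the Note.
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(
         sql_id          => '1nv5xg2n2z06r',
         cursor_child_no => NULL,
         format          => 'TYPICAL +NOTE'));
```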
Regards
Jonathan Lewis
UPDATE 2: I see Andrew has also said something about a cross join in the subquery. Have you edited the original thread? If so, have you re-run the query with the new SQL to see what happens?
-
Jonathan Lewis wrote: UPDATE 2: I see Andrew has also said something about a cross join in the subquery. Have you edited the original thread? If so have you re-run the query with the new SQL to see what happens?
I don't think the post was edited (there's usually a note at the bottom), it's likely I was just misreading the SQL earlier.
Your idea about something forcing the plan sounds most probable. The SQL doesn't need to be executing for display_cursor to do the trick; the sql_id is known, and the plan hash value is there just in case multiple child cursors exist.
This is quite similar to the classic bind-peeking problem, where the query is originally hard parsed with out-of-range peeked bind variables, so the plan favours eliminating rows with that filter first. Later the query is executed without a further hard parse but with bind variables that are in range; we still use the same execution plan because it's still in memory, but the plan is suboptimal when that predicate returns more than 0 rows. In that scenario you would generally force a new plan by flushing the cursor from the shared pool. In the OP's scenario the plan seems to be persisted by something else and will even survive a hard parse; if it's a baseline then it can be managed with dbms_spm.
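A sketch of both clean-up routes (object names are real dictionary views; note the purge call needs the address,hash_value pair rather than the sql_id):

```sql
-- 1. Find the cursor and purge it to force a fresh hard parse.
SELECT address || ',' || hash_value AS purge_name
FROM   v$sqlarea
WHERE  sql_id = '1nv5xg2n2z06r';

-- EXEC DBMS_SHARED_POOL.PURGE('<address>,<hash_value>', 'C');

-- 2. Check whether a baseline is what keeps the plan alive.
SELECT sql_handle, plan_name, enabled, accepted, fixed
FROM   dba_sql_plan_baselines
WHERE  UPPER(sql_text) LIKE 'MERGE INTO TMP_REPAIR_STATUS_NOTE%';
```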
This is one to be aware of as plan baselines become more common - I believe they are on by default in the autonomous offerings (although there'll be no DBAs there to scratch their heads).
-
I don't think the post was edited (there's usually a note at the bottom), it's likely I was just misreading the SQL earlier.
Andrew,
I've rediscovered the SQL with the "cross join subquery" - so I think I must have been looking at my copy of that statement while writing the response to this one. Perhaps you were thinking of the same statement, since you've also responded on the other thread.
Regards
Jonathan Lewis
-
A couple of thoughts about why the optimizer may have found it hard to find a good plan.
a) Your ON clause includes: "E.REPAIR_ID = H.REPAIR_ID", and the repair_id exposed in the USING subquery is rn.repair_id, but the WHERE clause on the subquery is arp.repair_id = rs.repair_id. Is it possible that rn.repair_id will match arp.repair_id and rs.repair_id in this query - if so then exposing rs.repair_id in the view (or including a predicate between rs.repair_id and rn.repair_id) might give the optimizer the option to change the join order.
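A sketch of point a): the USING subquery's WHERE clause with the transitive predicate spelled out (added on the assumption that rn.repair_id does match the other two):

```sql
FROM   repair_status_note rsn, repair_note rn,
       repair_status rs, apl_repair_publish arp
WHERE  rsn.repair_note_id   = rn.repair_note_id
AND    rsn.repair_status_id = rs.repair_status_id
AND    arp.repair_id        = rs.repair_id
AND    rn.repair_id         = rs.repair_id  -- added: lets the optimizer consider other join orders
AND    arp.batch_id = :B3
AND    arp.id >= :B2
AND    arp.id <= :B1
```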
b) You have an odd predicate: CAST(E.CREATED_DT AS DATE) = CAST(H.CREATED_DT AS DATE). Why is that? The column names suggest that they are already DATE types, so it looks a little odd and may be doing something that changes the cardinality estimates in a very unpleasant way.
Regards
Jonathan Lewis
-
Thanks for posting your version as per the guidelines in the link that @John Thorton posted.
5) Database Version and IDE Version
Ensure you provide your database version number e.g. 11.2.0.3 so that we know what features we can use when answering. If you're not sure what it is you can do the following:
select * from v$version;
in an SQL*Plus session and paste the results.
Also read #2 from the link (which I posted below) and please update the subject title of your thread
2) Thread Subject line
Give your thread a meaningful subject, not just "help please", "Query help" or "SQL". This is the SQL and PL/SQL forum. We know your question is going to be about those things; make it meaningful to the type of question so that people with the right sort of knowledge can pick it up and those without can ignore it. Never, EVER, mark your subject as "URGENT" or "ASAP". This forum is manned by volunteers giving their own time to help, and your question is never urgent or more important than their own work or than other people's questions. It may be urgent to you, but that's not forum members' issue. By marking your question as urgent you are actually less likely to get a good response, as members will choose to ignore such questions or respond unfavorably.
-
There should not be any cross join, as the inner SQL returns no rows.
apl_repair_publish is our driving table, with fewer than 2000 rows in any given instance,
and repair_id and batch_id form the primary key of that table.
For the hanging use case, the inner query will always return no rows.
Our DBA team does see the plan change, and we on the dev team asked the DBA team to use Oracle's plan stability feature, but the DBA team says that even after pinning the plan it is still not working.
We cannot figure out why it was working before and then broke at some point in time...
SQL_ID 1nv5xg2n2z06r

MERGE INTO TMP_REPAIR_STATUS_NOTE E
USING ( SELECT RSN.REPAIR_NOTE_ID AS REPAIR_NOTE_ID,
               RSN.REPAIR_STATUS_ID AS REPAIR_STATUS_ID,
               RN.REPAIR_ID, RN.TYPE_CD, RN.CREATED_DT,
               RS.STATUS_CD, RS.STATUS_DT,
               'N' NEW_STATUS, 'N' NEW_NOTE
        FROM   REPAIR_STATUS_NOTE RSN, REPAIR_NOTE RN,
               REPAIR_STATUS RS, APL_REPAIR_PUBLISH ARP
        WHERE  RSN.REPAIR_NOTE_ID = RN.REPAIR_NOTE_ID
        AND    RSN.REPAIR_STATUS_ID = RS.REPAIR_STATUS_ID
        AND    ARP.REPAIR_ID = RS.REPAIR_ID
        AND    ARP.BATCH_ID = :B3
        AND    ARP.ID >= :B2 AND ARP.ID <= :B1 ) H
ON (    E.REPAIR_ID = H.REPAIR_ID
    AND E.STATUS_CD = H.STATUS_CD
    AND CAST(E.CREATED_DT AS DATE) = CAST(H.CREATED_DT AS DATE)
    AND E.TYPE_CD = H.TYPE_CD )
WHEN MATCHED THEN UPDATE SET
    E.REPAIR_STATUS_ID = H.REPAIR_STATUS_ID,
    E.REPAIR_NOTE_ID   = H.REPAIR_NOTE_ID,
    E.NEW_STATUS       = H.NEW_STATUS,
    E.NEW_NOTE         = H.NEW_NOTE
LOG ERRORS INTO REPAIR_STATUS_NOTE_ERRLOG ( TO_CHAR(:B4) )
REJECT LIMIT UNLIMITED

Plan hash value: 3847621052 (bad one -- current plan)
| Id  | Operation                                  | Name                   | Rows | Bytes | Cost (%CPU) | Time     | Pstart | Pstop |
|   0 | MERGE STATEMENT                            |                        |      |       | 15 (100)    |          |        |       |
|   1 | MERGE                                      | TMP_REPAIR_STATUS_NOTE |      |       |             |          |        |       |
|   2 | VIEW                                       |                        |      |       |             |          |        |       |
|*  3 | FILTER                                     |                        |      |       |             |          |        |       |
|   4 | NESTED LOOPS                               |                        |    1 | 26293 | 15 (0)      | 00:00:01 |        |       |
|   5 | NESTED LOOPS                               |                        |    1 | 26232 | 12 (0)      | 00:00:01 |        |       |
|   6 | NESTED LOOPS                               |                        |    1 | 26164 |  9 (0)      | 00:00:01 |        |       |
|   7 | MERGE JOIN CARTESIAN                       |                        |    1 | 26103 |  5 (0)      | 00:00:01 |        |       |
|   8 | TABLE ACCESS FULL                          | TMP_REPAIR_STATUS_NOTE |    1 | 26081 |  2 (0)      | 00:00:01 |        |       |
|   9 | BUFFER SORT                                |                        |    1 |    22 |  3 (0)      | 00:00:01 |        |       |
|  10 | TABLE ACCESS BY INDEX ROWID BATCHED        | APL_REPAIR_PUBLISH_TB  |    1 |    22 |  3 (0)      | 00:00:01 |        |       |
|* 11 | INDEX RANGE SCAN                           | APL_PUB_UK1            |    1 |       |  2 (0)      | 00:00:01 |        |       |
|  12 | TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | REPAIR_STATUS          |    1 |    61 |  4 (0)      | 00:00:01 | ROWID  | ROWID |
|* 13 | INDEX RANGE SCAN                           | RST_STSCD_RPR_IDX      |    1 |       |  3 (0)      | 00:00:01 |        |       |
|* 14 | INDEX RANGE SCAN                           | RSN_PK_IDX             |    1 |    68 |  3 (0)      | 00:00:01 |        |       |
|* 15 | TABLE ACCESS BY GLOBAL INDEX ROWID         | REPAIR_NOTE            |    1 |    61 |  3 (0)      | 00:00:01 | ROWID  | ROWID |
|* 16 | INDEX UNIQUE SCAN                          | RNO_PK_IDX             |    1 |       |  2 (0)      | 00:00:01 |        |       |

Plan hash value: 4290347794 (good one)
| Id  | Operation                                  | Name                   | Rows | Bytes | Cost (%CPU) | Time     | Pstart | Pstop |
|   0 | MERGE STATEMENT                            |                        |      |       | 16 (100)    |          |        |       |
|   1 | MERGE                                      | TMP_REPAIR_STATUS_NOTE |      |       |             |          |        |       |
|   2 | VIEW                                       |                        |      |       |             |          |        |       |
|*  3 | FILTER                                     |                        |      |       |             |          |        |       |
|   4 | NESTED LOOPS                               |                        |    1 | 26293 | 16 (0)      | 00:00:01 |        |       |
|   5 | NESTED LOOPS                               |                        |    1 | 26232 | 13 (0)      | 00:00:01 |        |       |
|   6 | NESTED LOOPS                               |                        |    1 | 26164 | 10 (0)      | 00:00:01 |        |       |
|   7 | MERGE JOIN CARTESIAN                       |                        |    1 | 26103 |  6 (0)      | 00:00:01 |        |       |
|   8 | TABLE ACCESS BY INDEX ROWID                | APL_REPAIR_PUBLISH_TB  |    1 |    22 |  4 (0)      | 00:00:01 |        |       |
|*  9 | INDEX RANGE SCAN                           | APL_PUB_UK1            |    1 |       |  3 (0)      | 00:00:01 |        |       |
|  10 | BUFFER SORT                                |                        |   29 |  738K |  2 (0)      | 00:00:01 |        |       |
|  11 | TABLE ACCESS FULL                          | TMP_REPAIR_STATUS_NOTE |   29 |  738K |  2 (0)      | 00:00:01 |        |       |
|  12 | TABLE ACCESS BY GLOBAL INDEX ROWID BATCHED | REPAIR_STATUS          |    1 |    61 |  4 (0)      | 00:00:01 | ROWID  | ROWID |
|* 13 | INDEX RANGE SCAN                           | RST_STSCD_RPR_IDX      |    1 |       |  3 (0)      | 00:00:01 |        |       |
|* 14 | INDEX RANGE SCAN                           | RSN_PK_IDX             |    1 |    68 |  3 (0)      | 00:00:01 |        |       |
|* 15 | TABLE ACCESS BY GLOBAL INDEX ROWID         | REPAIR_NOTE            |    1 |    61 |  3 (0)      | 00:00:01 | ROWID  | ROWID |
|* 16 | INDEX UNIQUE SCAN                          | RNO_PK_IDX             |    1 |       |  2 (0)      | 00:00:01 |        |       |
-
Is it possible that rn.repair_id will match arp.repair_id and rs.repair_id
Yes.
CAST(E.CREATED_DT AS DATE) = CAST(H.CREATED_DT AS DATE)
Because one of the columns is a TIMESTAMP datatype, so we need to cast it to DATE.
OK, so for the first suggestion we would need to re-write the query? That would not be a good idea, as this exercise is for out-dated data, but the business somehow still wants it for testing purposes.
We also have a mission critical OLTP application relying on the same code chain, so it would be better if we could figure out some trick to make Oracle do the same thing it used to do before...
We did 287 runs and the last 12 hung like that...
How do we know that changing the query will NOT affect the mission critical query plan?