
help with sql monitor report

Answers

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,949 Blue Diamond
    edited Jul 3, 2019 3:33PM Answer ✓

    The cast need only be applied to the timestamp, not to both sides. This MIGHT make some difference.
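    (A hypothetical illustration of the point - the actual predicate isn't shown in this thread, so the column and bind names below are invented. The cast goes on the timestamp side only:)

        -- illustrative names only, not from the original statement
        WHERE created_dt = CAST(:ts_bind AS DATE)    -- cast the timestamp side only
        -- rather than casting both sides, e.g.
        -- WHERE CAST(created_dt AS TIMESTAMP(6)) = CAST(:ts_bind AS TIMESTAMP(6))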

    The plan you've just labelled the GOOD plan is the plan you first showed us, with the implication that it was a bad plan that was taking unexpected minutes to run or even "hang". It's also noticeable that the new version of the plan has completely different statistics from the previous version, by a few orders of magnitude.

    The difference between the two plans you've shown us is simply the order in which the two tables in the merge cartesian join appear, and the difference in total cost is very small. This makes me wonder whether you've made some significant adjustments to the optimizer_index_cost_adj and optimizer_index_caching parameters, and thereby made it hard for the optimizer to see the potential difference between the two plans.

    It's interesting to see that the BYTES column of your plan suggests the average row length (or the sum of the column lengths that you want) from the tmp_repair_status_note table is about 25KB. This is very large; can you tell us if there's anything special about the table or its columns that might make Oracle think this, and might make the query less efficient? It's worth remembering that if the rows are very large, it's a good idea to use an inline query (for both the INTO and the USING) which lists only the relevant columns - see https://jonathanlewis.wordpress.com/2016/06/06/merge-precision/
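    (A minimal sketch of that inline-query shape; the table and column names are taken from this thread, but the actual MERGE statement isn't shown, so the join condition and SET list are illustrative:)

        MERGE INTO (SELECT repair_status_id, new_status, new_note
                    FROM   repair_status_note) t
        USING      (SELECT repair_status_id, new_status, new_note
                    FROM   tmp_repair_status_note) s
        ON (t.repair_status_id = s.repair_status_id)
        WHEN MATCHED THEN UPDATE SET
             t.new_status = s.new_status,
             t.new_note   = s.new_note;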

    Please note, though, that in your position I would at least have tested the rewrite. If the problem is not important enough to test a suggestion that's backed by a reasonable explanation, then the problem isn't important enough for anyone here to spend time trying to help you. Your comment about "it's old data but there's a mission critical OLTP application" seems rather self-contradictory - the statement seems to be badly written and it looks like it should be changed, yet wanting a "trick" because it's "old data" while using the code "with a trick" in a mission critical system sounds like a dangerous strategy.

    If you want a "trick", though, and you think you've got a good plan: create a baseline (or SQL Patch) or a faked SQL Profile to force Oracle to use that plan.
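    (A minimal sketch of the baseline route, assuming the sql_id and plan_hash_value of the good plan have already been identified from v$sql; the &placeholders must be filled in:)

        DECLARE
            n PLS_INTEGER;
        BEGIN
            -- capture the known good plan from the cursor cache as a baseline
            n := dbms_spm.load_plans_from_cursor_cache(
                     sql_id          => '&sql_id',
                     plan_hash_value => &plan_hash_value);
        END;
        /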

    Regards

    Jonathan Lewis

  • kaericn
    kaericn Member Posts: 307 Blue Ribbon
    edited Jul 3, 2019 6:04PM

      CREATE GLOBAL TEMPORARY TABLE "GCRM_MIG_OWNER"."TMP_REPAIR_STATUS_NOTE"
         ( "REPAIR_STATUS_ID"  VARCHAR2(4000 BYTE),
           "REPAIR_NOTE_ID"    VARCHAR2(4000 BYTE),
           "REPAIR_ID"         VARCHAR2(4000 BYTE),
           "STATUS_CD"         VARCHAR2(4000 BYTE),
           "STATUS_DT"         TIMESTAMP (6),
           "CHANNEL_CD"        VARCHAR2(4000 BYTE),
           "STATUS_NOTES"      CLOB,
           "REASON_CD"         VARCHAR2(4000 BYTE),
           "CREATED_BY"        VARCHAR2(4000 BYTE),
           "CREATED_DT"        TIMESTAMP (6),
           "LAST_MODIFIED_BY"  VARCHAR2(4000 BYTE),
           "LAST_MODIFIED_DT"  TIMESTAMP (6),
           "NEW_STATUS"        VARCHAR2(1 BYTE),
           "NEW_NOTE"          VARCHAR2(1 BYTE),
           "TYPE_CD"           VARCHAR2(4000 BYTE),
           "TITLE"             VARCHAR2(4000 BYTE),
           "TEXT"              CLOB,
           "CATEGORY_CD"       VARCHAR2(4000 BYTE)
         ) ON COMMIT PRESERVE ROWS;

  • kaericn
    kaericn Member Posts: 307 Blue Ribbon
    edited Jul 3, 2019 6:10PM

    It is important, as how do we know that this issue is not going to repeat in our mission critical OLTP part?

    It is the same code.

    The differences are:

    1) the data is different (old data vs new data)

    2) the volume: in the mission critical system we do 100 repairs at a time, and in this exercise we do 999 at a time

    We will do the re-write and test it, of course; however, given the business just wants to close out the outdated-data testing exercise, we are trying to resolve this as quickly as possible before the holiday.

    Honestly, we the dev team are not 100% sure which one is good and which one is bad; we just take what the DBA team says, as it does not make too much difference to us (the dev team).

    And looking at the bytes, it seems the "good" one gets more bytes and we don't get why?

    Our DBA team did not suggest "create a baseline (or SQL Patch) or faked SQL Profile to force Oracle to use that plan"

    We will pass this to them and will see.

  • kaericn
    kaericn Member Posts: 307 Blue Ribbon
    edited Jul 3, 2019 6:54PM

    According to our DBA team, the "good" plan, which has a low elapsed time, has been pinned already.

    As for optimizer_index_cost_adj and optimizer_index_caching, the dev team won't touch these as far as we know.

    Thanks for all the info.

    So even if the SQL plan gets pinned, Oracle is not necessarily using it if the SQL is already cached, until it gets flushed out?

    It seems that after the DBA pinned the plan, the hang still happened last night, but luckily, thank God, the mission critical OLTP is fine with that SQL so far.

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Jul 4, 2019 3:21AM
    kaericn wrote: CREATE GLOBAL TEMPORARY TABLE "GCRM_MIG_OWNER"."TMP_REPAIR_STATUS_NOTE" ( ... ) ON COMMIT PRESERVE ROWS; (DDL quoted in full in the post above)

    I spot two CLOBs there; you should have a look at the link Jonathan shared about column projections in a merge statement.

    -edit- actually the CLOBs are likely to just get projected as LOB locators. Your bigger problem will be the many VARCHAR2(4000) columns that look like they are there "just in case": https://jonathanlewis.wordpress.com/2014/12/23/just-in-case/
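    (Illustrative only - sensible maximum lengths for this data are unknown here - but the idea is to declare lengths that match the data instead of 4000 "just in case". A runnable contrast using two of the columns, with hypothetical lengths:)

        CREATE GLOBAL TEMPORARY TABLE tmp_repair_status_note_demo
           ( repair_status_id VARCHAR2(40),   -- hypothetical realistic length
             status_cd        VARCHAR2(30)    -- hypothetical realistic length
           ) ON COMMIT PRESERVE ROWS;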

    You and your DBA team need to communicate better; it seems this is a task for you to work on while someone else holds all the information. Which one is the good plan? It's the one that performs fine with all data, right? The original plan you shared with the SQL live monitor looked terrible but, as I mentioned in my first reply, would perform fine if the data in your temporary table was small. Did your DBA team pin the cursor or did they create a SQL plan baseline? They are two different concepts. As mentioned, it very much looks like a bad plan was used to create the baseline; maybe this plan was mistakenly seen as good?

    I suggest the following (see the sketch after this list):

    Drop the current baseline (assuming the DBAs didn’t really just pin a cursor)

    Flush the cursor from the shared pool (use dbms_shared_pool)

    Run your workload in a way that results in the actually good plan being used when it gets parsed (likely involving populating the temporary table with sufficient data first)

    Confirm the plan being used is the good one - check the join order, what predicates are being applied where.

    Create the baseline using this known good plan.
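    (A minimal sketch of those steps; the &placeholders have to be looked up first - sql_handle from dba_sql_plan_baselines, address and hash_value from v$sqlarea, and the statement's sql_id:)

        DECLARE
            n PLS_INTEGER;
        BEGIN
            -- 1. drop the current baseline
            n := dbms_spm.drop_sql_plan_baseline(sql_handle => '&sql_handle');
            -- 2. flush the cursor from the shared pool
            sys.dbms_shared_pool.purge('&address,&hash_value', 'C');
        END;
        /

        -- 3/4. after re-running the workload with the GTT populated,
        --      confirm the join order and predicates of the plan that was used
        SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id', NULL, 'ALLSTATS LAST'));

        -- 5. then capture the confirmed plan with
        --    dbms_spm.load_plans_from_cursor_cache, as sketched earlier in the thread.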

    You mentioned a lot about your subquery returning null; that doesn't really mean anything. Do you mean one of the columns will always be populated with null, so the join to the TMP table would return no rows (if the statement was executed in that order)? Or do you mean there would be no rows in the subquery? The plan you've marked as good (but shared the monitor report for) does not operate in that order - it cross joins to your TMP table from your “leading” table; as no predicates can be applied on this join, I wouldn't count it as being a driving table.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,949 Blue Diamond
    edited Jul 4, 2019 4:19AM

    If I've understood you correctly, you've got a process which is being used to set some flags on historic data, but the same process is used to set some flags on current data in near-real time; for cleaning up old data you process 999 rows at a time, but for new data you process only 100 rows at a time.

    Does this mean you believe you insert 999 rows into the GTT and then update them (then do something else with the results), then insert the next 999 rows into the GTT and so on? If so then shouldn't you be moving the rows out of the GTT after each batch is complete?

    If that's a reasonably accurate description of what you believe to be happening then I'd want to know why the first plan you showed us reported an estimate of 51,000 rows in the GTT and had got through an actual 49,745 before it was terminated.

    Maybe the progressive slow down and eventual "hang" (which is probably not a hang, just a very long processing time) is a result of data accumulating in the GTT. Since it's declared with ON COMMIT PRESERVE ROWS, you need to end the session or issue a truncate to empty it efficiently.

    Perhaps your "old data" problem exists because you keep a single session looping through the data and growing the content of the GTT until it reaches a catastrophe, while the "new data" OLTP system has a process that starts up, handles a few rows, and terminates itself because there's nothing left to process, so never manages to get a lot of data into the table.
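    (A self-contained sketch of the failure mode described above, using a hypothetical demo table - with PRESERVE ROWS, rows survive COMMIT and accumulate across batches unless the table is truncated:)

        CREATE GLOBAL TEMPORARY TABLE tmp_gtt_demo (x NUMBER)
            ON COMMIT PRESERVE ROWS;

        BEGIN
            FOR batch IN 1 .. 3 LOOP
                -- uncommenting the truncate empties the GTT between batches:
                -- EXECUTE IMMEDIATE 'truncate table tmp_gtt_demo';
                INSERT INTO tmp_gtt_demo
                SELECT level FROM dual CONNECT BY level <= 999;
                COMMIT;    -- PRESERVE ROWS: the rows stay after commit
            END LOOP;
        END;
        /

        SELECT COUNT(*) FROM tmp_gtt_demo;   -- 2997 rows, not 999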

    Regards

    Jonathan Lewis

  • kaericn
    kaericn Member Posts: 307 Blue Ribbon
    edited Jul 10, 2019 2:10AM

    The below always returns NULL in the hang case:

        SELECT
            rsn.repair_note_id   AS repair_note_id,
            rsn.repair_status_id AS repair_status_id,
            rn.repair_id,
            rn.type_cd,
            rn.created_dt,
            rs.status_cd,
            rs.status_dt,
            'N' new_status,
            'N' new_note
        FROM
            repair_status_note rsn,
            repair_note        rn,
            repair_status      rs,
            apl_repair_publish arp
        WHERE
            rsn.repair_note_id = rn.repair_note_id
            AND rsn.repair_status_id = rs.repair_status_id
            AND arp.repair_id = rs.repair_id
            AND arp.batch_id = :B3
            AND arp.id >= :B2
            AND arp.id <= :B1

    And we always do the below before loading the GTT tmp_repair_status_note:

        execute immediate 'truncate table tmp_repair_status_note';

    And in our hang case, tmp_repair_status_note has 43k+ records.

    So our questions are:

    1) Why was it working before, and why is it NOT working as of now, 2019-07-09 11:08 PM PDT?

    2) If the above SQL is NULL, it means the merge will also be ZERO, as we have observed before the hang...

    So why is Oracle spending so much time (HANG!!??) on something that is so simple?

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Jul 10, 2019 2:51AM
    kaericn wrote: The below always returns NULL in the hang case ... why is Oracle spending so much time (HANG!!??) on something that is so simple? (query and questions quoted in full in the post above)

    I’ve asked before, but what does "return NULL" mean? Your query returns many columns; are they all going to be null? Or do you mean it returns no rows?

    It's taking so much time because it's being forced to execute using a terrible plan. The plan would work fine if the TMP% table were empty, because it would find much sooner that there are no rows that can be updated - this is probably why it was mistakenly thought of as a good plan. As I explained in the first reply, the performance of that plan is very highly dependent on how many rows are in that table. I recommend you read over the plan yourself to see how bad it is.

    You need to drop the baseline, or whatever else you've done to force the current bad plan. Then you need to give the optimizer a fighting chance of getting a reasonable plan by making sure your statistics on these tables represent the data. You probably want stats on your TMP table that say there are rows - even if the table is sometimes empty, a plan that expects rows should still execute reasonably well.
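    (A minimal sketch of one way to do that with dbms_stats.set_table_stats; the numbers are placeholders, to be chosen so they represent a realistically loaded GTT:)

        BEGIN
            -- tell the optimizer the GTT typically holds a batch of rows
            dbms_stats.set_table_stats(
                ownname => 'GCRM_MIG_OWNER',
                tabname => 'TMP_REPAIR_STATUS_NOTE',
                numrows => 999,      -- placeholder: typical batch size
                numblks => 100,      -- placeholder
                avgrlen => 200);     -- placeholder
            -- optionally stop stats jobs from overwriting these values
            dbms_stats.lock_table_stats('GCRM_MIG_OWNER', 'TMP_REPAIR_STATUS_NOTE');
        END;
        /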

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,949 Blue Diamond
    edited Jul 10, 2019 4:08AM
    kaericn wrote: The below always returns NULL in the hang case ... why is Oracle spending so much time (HANG!!??) on something that is so simple? (query and questions quoted in full in the post above)

    a) It is irrelevant that joining A, B, C returns no rows if you have forced Oracle to join A, D first before joining B, C.

    b) You have shown us a plan that "hangs" - though the report indicates that it wasn't hanging, just following a hugely inefficient strategy that the optimizer had been forced to pick - and then shown us a "GOOD" plan that is exactly the same plan. You have said that the DBAs have "pinned" the plan, but you can't pin plans: you either pin cursors (which won't help when you truncate a dependent table) or create a baseline/outline to force the optimizer to follow a specific plan. Please get the DBAs to explain what they mean by "pinning" a plan.

    c) You have said "in the mission critical we do 100 repairs at a time and this exercise we do 999 at a time" - what does this actually mean? You didn't answer my question from a previous post where I asked you to correct my understanding.

    d) If you think the number of rows in the driving query you have supplied should be the thing that dictates the speed of execution, then add the hint /*+ no_merge */ to the select statement - "(SELECT /*+ no_merge */ rsn.repair_note_id AS repair_note_id, ...)" - so that the optimizer is forced to optimize and use that query block as a unit, and the run-time engine will discover that the join returns no rows.
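    (A sketch of where the hint goes. Only the hinted query block comes from the statement posted earlier; the MERGE skeleton around it - target, join condition and SET list - is illustrative, since the full statement hasn't been shown:)

        MERGE INTO tmp_repair_status_note t
        USING (
            SELECT /*+ no_merge */
                   rsn.repair_note_id    AS repair_note_id,
                   rsn.repair_status_id  AS repair_status_id,
                   rn.repair_id, rn.type_cd, rn.created_dt,
                   rs.status_cd, rs.status_dt,
                   'N' new_status, 'N' new_note
            FROM   repair_status_note  rsn,
                   repair_note         rn,
                   repair_status       rs,
                   apl_repair_publish  arp
            WHERE  rsn.repair_note_id = rn.repair_note_id
            AND    rsn.repair_status_id = rs.repair_status_id
            AND    arp.repair_id = rs.repair_id
            AND    arp.batch_id = :B3
            AND    arp.id >= :B2
            AND    arp.id <= :B1
        ) s
        ON (t.repair_status_id = s.repair_status_id)          -- illustrative
        WHEN MATCHED THEN UPDATE SET t.new_status = s.new_status,
                                     t.new_note   = s.new_note;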

    Bottom line:

    Until you can demonstrate otherwise, I have to assume from the information supplied that your DBAs have attached a stored outline, SQL Plan Baseline, SQL Patch, or SQL Profile (using the CoE hack, which means it's not a proper profile) to the SQL statement.
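    (One way to demonstrate it either way - v$sql records which of these mechanisms each child cursor used; fill in the sql_id:)

        SELECT sql_id, child_number, plan_hash_value,
               sql_plan_baseline,    -- non-null if a baseline was used
               sql_profile,          -- non-null if a SQL profile was used
               sql_patch,            -- non-null if a SQL patch was used
               outline_category      -- non-null if a stored outline was used
        FROM   v$sql
        WHERE  sql_id = '&sql_id';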

    Regards

    Jonathan Lewis

  • kaericn
    kaericn Member Posts: 307 Blue Ribbon
    edited Jul 10, 2019 3:59PM

    Thanks for your input!!

    d) sounds like a good idea! We probably won't put it in for now, as the perf is good for that SQL in our mission critical ETL chain.

    c) And in the mission critical system we load apl_repair_publish 100 records at a time.

    And every time it finishes, it deletes the rows in apl_repair_publish, so the below data set is always very small.

    And every time we join that apl_repair_publish, the result set should be small; in this case it is ZERO rows, as we are loading new data.

    And on this HANG (or whatever you call it) exercise, we are doing a data-fix and we are loading < 999 records at a time. We have finished 254 such batches, all with very good perf.

    And then all of a sudden this issue HAPPENS!!??

    So why does Oracle all of a sudden change its mind when everything was running well before?

    And more important to us: how do we know Oracle will not change its plan all of a sudden, like in this case, in our mission critical ETL chain?

    And, as you stated in b), pinning won't help.

    Our boss is conservative about any code change while it is running well.

    However, if this happens in our mission critical app, we would get called at 3am, California time...
