1 2 Previous Next 20 Replies Latest reply on Dec 5, 2019 5:13 AM by 933257

    Idle Waitevent Issue

    933257

      Hi we are using version 11.2.0.4 of Oracle. We see one query consuming ~4GB of tempspace and holding this tempspace for ~45minutes till its execution finishes up. We want to get the run time reduced and also if possible the temp space consumption/duration.

       

      But what i got to see from the sql  monitor the elapsed time i.e the real Db time is ~7minutes but the total duration is ~45minutes which means it must be all on idle wait i.e. "SQL*Net message from client". Then i checked with Dev team if this part of cursor query and we do some operation for each chunk of rows fetched so DB is just waiting for response from client side? But that is not the case here, we are fetching all the rows in one shot here. So then i checked the sql monitor and found the "fetch calls" showing as  -2086 and total number of rows fetched by the query is ~104million which means the fetch size set at client side is ~50000 which looks bit higher to me though.

       

      Then after doing the 10046 trace, the sql i see below waits sections. Which does show highest numbers for "SQL*Net more data to client" in "times waited" column and highest "SQL*Net message from client" in "Total waited".

       

      I have posted only specific section of the sql monitor to keep it simple. It does shows we are performing distinct operation on ~98million rows which spilling to temp ~4GB and i think there is no way we can reduce it further. The only way i can think of is setting the hash_area_aggregate_size to 2GB explicitly but i think that may cause memory issue(blowing out pga memory) considering this query executes many number of times in a day and sometimes from multiple sessions at same time. We have pga_aggregate_size set as 12GB.

       

      And i have also shown the starting portion of the sql monitor plan which responsible for ~90% of the activity in the query and big difference in run_duration and elapsed_time is noticed in same too.

       

      My question is

       

      1)What the symptom shows here? Is it going to help the query finish in quick time if we reduce the fetchsize? Or we have any issue with SDU size or network setup etc which needs to be fixed?

      2)If we can further reduce the tempspace consumption anyway?

       

       

      Its big UNION ALL query. I have just posting the part which consumes the ~95% of the time and the sql monitor section of that part only to keep it simple.

       

      SELECT DISTINCT TAB1.c1,  TAB2.c7,TAB2.c7,  TAB2.c8,  TAB2.c9, TAB1.c2,'X'

           FROM  TAB2, TAB1 ,  TAB3

          WHERE     TAB2.c2 = TAB1.c2

                AND TAB2.C6 = TAB1.C6

               AND TAB2.C5 = TAB3.C5

                AND TAB3.C4 = 'XXX'

                AND TAB3.C4 = TAB2.C4;

       

      *****Sections of sql monitor*************
      
      Global Information
      ------------------------------
       Status              :  DONE (ALL ROWS)                  
       Instance ID         :  1                                
        SQL Execution ID    :  16777217                         
       Execution Started   :  11/15/2019 04:00:02              
       First Refresh Time  :  11/15/2019 04:00:08              
       Last Refresh Time   :  11/15/2019 04:44:06              
       Duration            :  2644s                            
       Fetch Calls         :  2086                             
      
      Global Stats
      ==========================================================================================================
      | Elapsed |   Cpu   |    IO    | Concurrency |  Other   | Fetch | Buffer | Read  | Read  | Write | Write |
      | Time(s) | Time(s) | Waits(s) |  Waits(s)   | Waits(s) | Calls |  Gets  | Reqs  | Bytes | Reqs  | Bytes |
      ==========================================================================================================
      |     389 |     279 |       64 |        0.00 |       46 |  2086 |   155K | 15927 |   4GB | 15915 |   4GB |
      ==========================================================================================================
      
      ======================================================================================================================================================================================================================================
      | Id |               Operation               |             Name              |  Rows   | Cost  |   Time    | Start  | Execs |   Rows   | Read  | Read  | Write | Write |  Mem  | Temp  | Activity |         Activity Detail          |
      |    |                                       |                               | (Estim) |       | Active(s) | Active |       | (Actual) | Reqs  | Bytes | Reqs  | Bytes | (Max) | (Max) |   (%)    |           (# samples)            |
      ======================================================================================================================================================================================================================================
      |  0 | SELECT STATEMENT                      |                               |         |       |      2487 |   +158 |     1 |     104M |       |       |       |       |       |       |    27.03 | Cpu (58)                         |
      |    |                                       |                               |         |       |           |        |       |          |       |       |       |       |       |       |          | SQL*Net more data to client (45) |
      |  1 |   VIEW                                | A_VIEW                        |    141M |    2M |      2487 |   +158 |     1 |     104M |       |       |       |       |       |       |     1.31 | Cpu (5)                          |
      |  2 |    UNION-ALL                          |                               |         |       |      2487 |   +158 |     1 |     104M |       |       |       |       |       |       |     4.46 | Cpu (17)                         |
      |  3 |     HASH UNIQUE                       |                               |    134M |    2M |      2494 |     +1 |     1 |      98M | 15915 |   4GB | 15915 |   4GB |    1G |    4G |    60.63 | Cpu (177)                        |
      |    |                                       |                               |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path read temp (30)       |
      |    |                                       |                               |         |       |           |        |       |          |       |       |       |       |       |       |          | direct path write temp (24)      |
      |  4 |      HASH JOIN                        |                               |    134M |  6649 |       153 |     +6 |     1 |      98M |       |       |       |       |    1M |       |     3.94 | Cpu (15)                         |
      |  5 |       HASH JOIN                       |                               |    3169 |   396 |         1 |     +6 |     1 |     3276 |       |       |       |       |    1M |       |          |                                  |
      |  6 |        TABLE ACCESS FULL              | TAB2                          |    3169 |    12 |         1 |     +6 |     1 |     3279 |       |       |       |       |       |       |          |                                  |
      |  7 |        TABLE ACCESS FULL              | TAB3                          |    133K |   383 |         1 |     +6 |     1 |     134K |       |       |       |       |       |       |          |                                  |
      |  8 |       TABLE ACCESS FULL               | TAB1                          |    425K |  5212 |       153 |     +6 |     1 |     427K |       |       |       |       |       |       |          |                                  |
      |  9 |     HASH UNIQUE                       |                               |    349K | 15665 |        84 |  +2494 |     1 |       3M |       |       |       |       |  272M |       |     1.84 | Cpu (7)                          |
      | 10 |      HASH JOIN                        |                               |    349K |  8576 |         5 |  +2494
      
      
      

       

       

      **** Sections Of 10046 trace********************
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        0      0.00       0.00          0          0          0           0
      Execute      1      0.00       0.00          0          0          0           0
      Fetch     2086    365.60     493.38     493365     155270          0   104297649
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total     2087    365.60     493.38     493365     155270          0   104297649
      
      
      OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        0      0.00       0.00          0          0          0           0
      Execute      1      0.00       0.00          0          0          0           0
      Fetch     2086    365.60     493.38     493365     155270          0   104297649
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total     2087    365.60     493.38     493365     155270          0   104297649
      
      Misses in library cache during parse: 0
      
      Elapsed times include waiting on following events:
        Event waited on                             Times   Max. Wait  Total Waited
        ----------------------------------------   Waited  ----------  ------------
        Disk file operations I/O                        2        0.00          0.00
        latch: row cache objects                        1        0.00          0.00
        SQL*Net message to client                    2086        0.00          0.00
        asynch descriptor resize                        1        0.00          0.00
        direct path write temp                      15915        0.04         25.61
        latch free                                      2        0.00          0.00
        SQL*Net more data to client                320615        0.09         70.85
        SQL*Net message from client                  2085        1.50       2332.84
        direct path read temp                       15915        0.04         30.90
      
      
      OVERALL TOTALS FOR ALL RECURSIVE STATEMENTS
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        1      0.00       0.00          0          0          0           0
      Execute      1      0.00       0.00          0          0          0           0
      Fetch        1      0.00       0.00          1          2          0           1
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total        3      0.00       0.00          1          2          0           1
      
      Misses in library cache during parse: 1
      Misses in library cache during execute: 1
      
      
        • 1. Re: Idle Waitevent Issue
          AndrewSayer

          If you have a query that sends 100million+ rows to a client multiple times a day, sometimes at the same time, you are doing something silly.

           

          What is the client doing with all those rows? Why do they need a fresh set of results throughout the day?

           

          Yes, the client side is spending a lot of time processing the rows. Perhaps it is doing some aggregation (SQL will do that better!). There is zero possibility that it needs to display all these rows to an end user. You can either fix the application so it gets pre-aggregated (/preprocessed etc) results and not have to fetch 100+ million rows. You could fix the application so that it only requests differences from the last time it did the calculations. You could tune the application so that it spends less time holding the cursor open using all the temp.

           

          4GB of temp probably isn’t really a lot in the grand scheme of things, is it?

          -edit

          Definitely not an SDU problem. Purely an application problem (which might extend to the driver being used as some do process huge result sets very inefficiently)

          1 person found this helpful
          • 2. Re: Idle Waitevent Issue
            JohnWatson2

            Do you actually need the DISTINCT? The optimizer doesn't think it will remove any rows. Or if there are only a few duplicates, perhaps you could manage them some other way?

            • 3. Re: Idle Waitevent Issue
              933257

              Thank You Andrew.

              That is the question i have raised with the team and we are trying to look into the exact business logic for which this kind of implementation has been done. And if we can reduce the number of execution and/or minimize the rows which transferred to the client.

               

              Btw is it still justified to take ~35minutes to collect ~100million rows by the client here? This was not the case previously , this query used to take exact same amount of time i.e. ~7minutes but after some upgrade in client side this issue is coming to picture. Want to understand what exact setup we need to verified on client(Got to know its C++) side.

               

              Also i checked if any of the column fetched is CLOB or anything, but that is not the case here. We are fetching ~6 different columns out of which  four columns are varchar2(3 bytes), one is vachar2(15bytes) and other one si char(6 bytes). So why its taking so  much time to move those 100million to the client?

               

              Also the trace shows below statistics in which i see the number of "SQL*Net more data to client" is higher but the "total waited" is higher for "SQL*Net message from client" . Does this signify any issue with network setup?

               

              Elapsed times include waiting on following events:
                Event waited on                             Times   Max. Wait  Total Waited
                ----------------------------------------   Waited  ----------  ------------
                Disk file operations I/O                        2        0.00          0.00
                latch: row cache objects                        1        0.00          0.00
                SQL*Net message to client                    2086        0.00          0.00
                asynch descriptor resize                        1        0.00          0.00
                direct path write temp                      15915        0.04         25.61
                latch free                                      2        0.00          0.00
                SQL*Net more data to client                320615        0.09         70.85
                SQL*Net message from client                  2085        1.50       2332.84
                direct path read temp                       15915        0.04         30.90
              
              
              • 4. Re: Idle Waitevent Issue
                933257

                This is the one i verified with Dev team as it looks from sql  monitor the distinct is actually not removing any rows. As because this distinct is causing those ~4GB temp consumption and some resources to read/write from temp. But team says they need it as per business logic , there are chances of duplicates at times.

                • 5. Re: Idle Waitevent Issue
                  Jonathan Lewis

                  933257 wrote:

                   

                   

                  Also the trace shows below statistics in which i see the number of "SQL*Net more data to client" is higher but the "total waited" is higher for "SQL*Net message from client" . Does this signify any issue with network setup?

                   

                  No.

                   

                  You have your arraysize set to 50,000 (which is too big to be of any real benefit), so a client FETCH call has to populate an array of 50,000 rows; doing the arithmetic there are roughly 160 "more data to client" calls for every "message from client" call, so the following is (roughly) what happened in your case:

                   

                  1) The client issues a fetch call to the server

                  2) the server packs about 310 rows into a packet and sends it to the client with a "there's more available" tag, starts a "SQL*Net message to to client" wait

                   

                  3) the client sql*net layer copies the data into the array in a few tens of micro seconds and sends a "give me more" call to the server

                  4) the server packs about 310 rows into a packet and sends it to the client with a "there's more available" tag, starts a "SQL*Net more data to client" wait

                   

                  5) repeat 3/4 roughly 160 times

                   

                  6) the client sql*net layer copies the data into the array in a few tens of micro seconds and sends a "give me more" call to the server for the last time.

                  7) the server packs up to about 310 rows into a packet and sends it to the client with a "no more rows in fetch" tag, starts a "SQL*Net message from client" wait

                   

                  8) the sql*net layer copies the data into the array and returns control to the client program which starts to process the array.

                      this takes about 1.1 seconds

                   

                  Repeat from (1) until the packet returned by the server includes an end-of-fetch flag.

                   

                   

                  There are a couple of minor details where this description isn't quite telling the whole truth, but the big picture describes what's going on and why you see the results you do.

                   

                  Regards

                  Jonathan Lewis

                  1 person found this helpful
                  • 6. Re: Idle Waitevent Issue
                    Jonathan Lewis

                    On a side note - the "Start Active" in the UNION ALL (operation 2) that reads +158 seconds is a fairly strong clue that the hash joins and hash unique completed in roughly 158 seconds and the first rows became available for return to the client at that moment.

                     

                    Regards

                    Jonathan Lewis

                    1 person found this helpful
                    • 7. Re: Idle Waitevent Issue
                      AndrewSayer

                      933257 wrote:

                       

                      Thank You Andrew.

                      That is the question i have raised with the team and we are trying to look into the exact business logic for which this kind of implementation has been done. And if we can reduce the number of execution and/or minimize the rows which transferred to the client.

                       

                      Btw is it still justified to take ~35minutes to collect ~100million rows by the client here? This was not the case previously , this query used to take exact same amount of time i.e. ~7minutes but after some upgrade in client side this issue is coming to picture. Want to understand what exact setup we need to verified on client(Got to know its C++) side.

                       

                      Also i checked if any of the column fetched is CLOB or anything, but that is not the case here. We are fetching ~6 different columns out of which four columns are varchar2(3 bytes), one is vachar2(15bytes) and other one si char(6 bytes). So why its taking so much time to move those 100million to the client?

                       

                      Also the trace shows below statistics in which i see the number of "SQL*Net more data to client" is higher but the "total waited" is higher for "SQL*Net message from client" . Does this signify any issue with network setup?

                       

                      1. Elapsedtimesincludewaitingonfollowingevents:
                      2. EventwaitedonTimesMax.WaitTotalWaited
                      3. ----------------------------------------Waited----------------------
                      4. DiskfileoperationsI/O20.000.00
                      5. latch:rowcacheobjects10.000.00
                      6. SQL*Netmessagetoclient20860.000.00
                      7. asynchdescriptorresize10.000.00
                      8. directpathwritetemp159150.0425.61
                      9. latchfree20.000.00
                      10. SQL*Netmoredatatoclient3206150.0970.85
                      11. SQL*Netmessagefromclient20851.502332.84
                      12. directpathreadtemp159150.0430.90

                      What driver is being used? What was upgraded?

                      • 8. Re: Idle Waitevent Issue
                        933257

                        Thank you. Actually I was trying to see if I can test it someway from sqlplus client, the behaviour of arraysize 50000 Vs 200. But sqlplus has max limit of 5000 so unable to test it.

                         

                        To dig down this issue more, is there any other way to move with? Like other type of tracing Or evaluating any parameters of client sides or database etc. to get to the bottom of this issue?

                        • 9. Re: Idle Waitevent Issue
                          Jonathan Lewis

                          Which issue are you trying to get to the bottom of ?

                           

                          You can test the "SQL*Net message from client" vs. "SQL*Net more data to client" difference from SQL*Plus very easily, even though the array limit is 5000 rather than 50,000.  You need only see significant differences rather than an exact match for the difference.

                           

                          At arraysize 5,000 you'll probably get roughly 16 "more data" for each "message from client", and the difference in timing won't be quite so extreme. Just make sure you're allowing the output to scroll up the screen as this will use up a lot of extra client time.  I'd guess that with an arraysize less than about 310 you won't see any "more data".

                           

                          If you want to refine the testing you can do things like "set pagesize 10" vs. "set pagesize 5000" to vary the amount of extra client work that takes place (and time that gets lost) per array fetch.

                           

                          Regards

                          Jonathan Lewis

                          • 10. Re: Idle Waitevent Issue
                            933257

                            "Which issue are you trying to get to the bottom of ?"

                             

                            Actually as I mentioned we want the query to finish in closely same amount of time as close as to the elapsed time. And it used to finish earlier like that. But now being the query taking 45 minutes it's holding up the tempspace for whole duration causing failure of other query when multiple queries need tempspace during those moments.

                             

                            This issue was unnoticeable earlier as because this query was finishing in 10minutes and thus releasing the tempspace quickly. So want to understand what has caused these additional idle wait times to the query execution. And if we can fix this.

                            • 11. Re: Idle Waitevent Issue
                              Jonathan Lewis

                              The problem seems to appear at the client end of the system, so you need to look at what the client does and what might have changed.

                               

                              First, of course, has the query ALWAYS produced about 100M rows ?

                              Second, what does the client application code do with those 100M rows - and has the code to do it changed at all in the recent past.

                              Third - does the client do anything that might perform differently if the data appeared in a different order ?

                              • I mention this because if the hash joins changed order the hash unique might end up ordering the data differently (I can't tell from your anonymised SQL) and possibly the application has some code which says "if this value is the same as the corresponding value in the last row use a short cut (e.g. can do a running total instead of a search and increment)". It would be worth checking how much CPU the client process uses while this process is running.

                              Fourth - has any configuration changed between the client application code and the final layer that makes the call into SQL*Net

                              • e.g. Maybe someone has enabled some tracing or logging to check the traffic

                               

                              I am just a little curious about your SDU sizing. I don't think changing it is going to make much difference, but I'm wondering if your client has changed its character set and that might result in CPU disappearing at the client end as SQL*Net translates from the server character set to the client character set. Can you check the SDU_SIZE from both ends of the system (the dialogue will use the smaller of the two) and let us know what they are.

                               

                              Regards

                              Jonathan Lewis

                              1 person found this helpful
                              • 12. Re: Idle Waitevent Issue
                                933257

                                Thank you so much for the detailed suggestion.

                                 

                                 

                                I will try to get these information and revert back.

                                • 13. Re: Idle Waitevent Issue
                                  933257

                                  Below are the sections of tnamames.ora and listener.ora showing SDU side for DBserver side.

                                   

                                  tnsnames.ora:-

                                   

                                      (SDU = 32767)

                                      (SEND_BUF_SIZE = 262144)

                                      (RECV_BUF_SIZE = 262144)

                                   

                                  listener.ora:-

                                   

                                        (SEND_BUF_SIZE=131072)

                                        (RECV_BUF_SIZE=131072)

                                        (ADDRESS_LIST =

                                          (ADDRESS = (PROTOCOL = TCP)(SDU=32767)(HOST = XXXXX)(PORT = XXXX))

                                   

                                         

                                  And for client we have dynamic registration setup,i see below in sqlnet.Ora. And tnsnames.ora doesn't have any SDU setting mentioned as i verified. Which means it must be using the same as its there in server side. Please correct me if wrong. And again that means some other tracing/logging etc may be happening in client side causing the slow data transfer? I am yet to check on that part.

                                         

                                  NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

                                  ADR_BASE = /apps/oracle/NA

                                  • 14. Re: Idle Waitevent Issue
                                    Jonathan Lewis

                                    SQL*Net will negotiate an SDU size that is the smaller of the client and server settings.

                                    If you haven't got an explicit SDU_SIZE somewhere on the client side then I think the 11g default will be 8192 (or possibly a few bytes smaller).

                                     

                                    Your server's tnsnames.ora (SDU=) is irrelevant in this context, it would only be relevant for a db_link connection out to another server.

                                    Your server's listener.ora entry with SDU=32767 is the value that would be used (and currenlty rejected) in the negotiation with the client.

                                     

                                    If you can set a default_sdu_size=32767 in the client sqlnet.ora (or an SDU= in the client tnsnames.ora, or connect string if you use that technology) then you could get a larger SDU agreement between client and server.  It might make a little difference.

                                     

                                    If you set event 10079 from an SQL*Plus session at the client, then do something like:

                                     

                                    set arraysize 5000 linesize 2048 pagesize 0 pause off termout off trimspool on

                                    select * from all_objects;

                                     

                                    The resulting trace file should hold lines (after the first few exchanges) something like:

                                     

                                    11 bytes received via SQL*Net

                                    IFCDSEND: (bp=0x11d049da, bl=1, isdblink=0)

                                    IFCDSEND: (bp=0x7ffa678c1fca, bl=28, isdblink=0)

                                    (overflow) 8088 bytes sent via fast-path

                                    IFCDSEND: (bp=0x7ffc1bc43ddb, bl=1, isdblink=0)

                                    (overflow) 8085 bytes sent via fast-path

                                    IFCDSEND: (bp=0x7ffa678c213f, bl=23, isdblink=0)

                                    (overflow) 8092 bytes sent via fast-path

                                     

                                     

                                    In this case I had a  DEFAULT_SDU_SIZE at the server of 32767, and left the client with no specified SDU - as you can see the exchange was operating at close to 8192 bytes.

                                     

                                     

                                    To set the event I wanted to use the following from the current session:

                                     

                                    alter session set events '10079 trace name context forever, level 1';

                                     

                                    but found that gave me ORA-01031 insufficient privilege, I had to use

                                     

                                    alter system set events '10079 trace name context forever, level 1';

                                     

                                    Followed promptly after the test by

                                     

                                    alter system set events '10079 trace name context off';

                                     

                                    It's been a long time since I last used this event, so someone else may suggest a better was of getting the necessary information dumped.

                                     

                                     

                                    Regards

                                    Jonathan Lewis

                                    1 person found this helpful
                                    1 2 Previous Next