1 2 3 Previous Next 37 Replies Latest reply: May 4, 2012 4:39 AM by Uwehesse-Oracle RSS

    Pulling my hair out - sudden performance issue

    EdStevens
      Oracle 11.2 SEone on Oracle linux 5, with ASM

      Feeling like a beginner again ...

      Beginning monday morning, everything on this db is taking orders of magnitude longer. Queries/packages are the same as they have been for months. Fresh stats collected on key tables. Even a very simple ad-hoc query takes c. 15 seconds - even on a second execution when everything should be the buffer.
      TKPROF: Release 11.2.0.1.0 - Development on Wed May 2 14:02:23 2012
      
      Copyright (c) 1982, 2009, Oracle and/or its affiliates.  All rights reserved.
      
      Trace file: dwprd_ora_20811_STEVENS.trc
      Sort options: default
      
      ********************************************************************************
      count    = number of times OCI procedure was executed
      cpu      = cpu time in seconds executing
      elapsed  = elapsed time in seconds executing
      disk     = number of physical reads of buffers from disk
      query    = number of buffers gotten for consistent read
      current  = number of buffers gotten in current mode (usually for update)
      rows     = number of rows processed by the fetch or execute call
      ********************************************************************************
      
      sELECT * FROM dw.program_log a
      WHERE a.program_name LIKE '%CBA%' AND a.date_time > to_date('01-Apr-2012','dd-Mon-yyyy')
      
      call     count       cpu    elapsed       disk      query    current        rows
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      Parse        1      0.00       0.01          0          0          0           0
      Execute      1      0.00       0.00          0          0          0           0
      Fetch        3      3.55      53.21     374587     374589          0          29
      ------- ------  -------- ---------- ---------- ---------- ----------  ----------
      total        5      3.56      53.22     374587     374589          0          29
      
      Misses in library cache during parse: 1
      Optimizer mode: ALL_ROWS
      Parsing user id: 5
      
      Rows     Row Source Operation
      -------  ---------------------------------------------------
           29  TABLE ACCESS BY INDEX ROWID PROGRAM_LOG (cr=374589 pr=374587 pw=0 time=0 us cost=419831 size=118145736 card=1093942)
           29   INDEX RANGE SCAN SYS_C0014081 (cr=374573 pr=374571 pw=0 time=0 us cost=380860 size=0 card=1093942)(object id 77238)
      In the above, index SYS_C0014081 is a 3-column composite key with DATE_TIME at the high-order end (first column in th index def). It is the PK of the table.

      The above isn't the only one - and in fact no one cares about that query itself, but it is very simple and hopefully symptomatic of the underlying issue. Anything else I could show is far more complex. The datafile is on ASM on a SAN.

      As for the basic question, "what has changed" the answer is - nothing from the database side. Last week the SA removed some SAN disks that were no longer being used. That happened on Thursday, but it was Monday before some heavy jobs were scheduled that started causing issues.
      It's been a loong time since I've had to analyze anything like this and am definately out of practice anaylizing tkprof reports .. especially when judgement born of experience is involved. to make matters more difficult, statspack is broken and dbcontrol not even configured. I need to get those working as soon as I can get this put to bed.


      Ideas? thoughts? Any additional info I can provide.
      Ideas? Thoughts?
        • 1. Re: Pulling my hair out - sudden performance issue
          Nikolay Savvinov
          Hi,

          if re-running a statement doesn't make it faster, I'd suspect that something is wrong with data caching -- either on database level, or on OS (or hardware) level.

          Is it possible to run some AWR reports?

          Best regards,
          Nikolay
          • 2. Re: Pulling my hair out - sudden performance issue
            LKBrwn_DBA
            Fresh stats collected on key tables.
            Perhaps you forgot to collect the Histograms?
            Last week the SA removed some SAN disks
            When you "dismounted" those disks groups, the data in them was redistributed.

            Try executing ASH report(s):
            SQL>@?/rdbms/admin/ashrpt.sql
            ;)
            • 3. Re: Pulling my hair out - sudden performance issue
              Jonathan Lewis
              EdStevens wrote:
              Oracle 11.2 SEone on Oracle linux 5, with ASM

              Feeling like a beginner again ...
              call     count       cpu    elapsed       disk      query    current        rows
              ------- ------  -------- ---------- ---------- ---------- ----------  ----------
              Parse        1      0.00       0.01          0          0          0           0
              Execute      1      0.00       0.00          0          0          0           0
              Fetch        3      3.55      53.21     374587     374589          0          29
              ------- ------  -------- ---------- ---------- ---------- ----------  ----------
              total        5      3.56      53.22     374587     374589          0          29
              
              
              Rows     Row Source Operation
              -------  ---------------------------------------------------
              29  TABLE ACCESS BY INDEX ROWID PROGRAM_LOG (cr=374589 pr=374587 pw=0 time=0 us cost=419831 size=118145736 card=1093942)
              29   INDEX RANGE SCAN SYS_C0014081 (cr=374573 pr=374571 pw=0 time=0 us cost=380860 size=0 card=1093942)(object id 77238)
              The optimizer seems to be predicting the workload pretty well - cost = 380860 for an index range scan would basically be indicating the expectation of that many single block disc reads and that's consistent with the disk reads recorded. And the response time was pretty quick given it did that in 52 seconds (possibly getting a lot of the SAN cache, and using db file parallel reads - I don't suppose you traced this with waits, did you?)
              In the above, index SYS_C0014081 is a 3-column composite key with DATE_TIME at the high-order end (first column in th index def). It is the PK of the table.

              The above isn't the only one - and in fact no one cares about that query itself, but it is very simple and hopefully symptomatic of the underlying issue. Anything else I could show is far more complex. The datafile is on ASM on a SAN.
              How many rows are there with date > 1st April, what other columns are in that index, and how much data is in the table as a whole If the whole system is slow it could be that a couple of extreme queries like this are having an impact on everything else, so chasing a simple query down in detail might be a reasonable initial strategy.
              As for the basic question, "what has changed" the answer is - nothing from the database side. Last week the SA removed some SAN disks that were no longer being used. That happened on Thursday, but it was Monday before some heavy jobs were scheduled that started causing issues.
              For Oracle to believe that this index range scan is a good idea, it would have to believe that the tablescan is enormous, and (relatively) very slow. Although "nothing has changed" from the database side, I think I'd still want to check whether the system statistics (sys.aux_stats) have changed - odd values there could produce extreme execution plans that look as mad as this simple index range scan.

              Regards
              Jonathan Lewis
              • 4. Re: Pulling my hair out - sudden performance issue
                EdStevens
                Nikolay Savvinov wrote:
                Hi,

                if re-running a statement doesn't make it faster, I'd suspect that something is wrong with data caching -- either on database level, or on OS (or hardware) level.

                Is it possible to run some AWR reports?

                Best regards,
                Nikolay
                Unfortunately we are not licensed for the performance packs.
                • 5. Re: Pulling my hair out - sudden performance issue
                  user503699
                  EdStevens wrote:
                  Oracle 11.2 SEone on Oracle linux 5, with ASM

                  Feeling like a beginner again ...

                  Beginning monday morning, everything on this db is taking orders of magnitude longer. Queries/packages are the same as they have been for months. Fresh stats collected on key tables. Even a very simple ad-hoc query takes c. 15 seconds - even on a second execution when everything should be the buffer.

                  As for the basic question, "what has changed" the answer is - nothing from the database side. Last week the SA removed some SAN disks that were no longer being used. That happened on Thursday, but it was Monday before some heavy jobs were scheduled that started causing issues.
                  It's been a loong time since I've had to analyze anything like this and am definately out of practice anaylizing tkprof reports .. especially when judgement born of experience is involved. to make matters more difficult, statspack is broken and dbcontrol not even configured. I need to get those working as soon as I can get this put to bed.


                  Ideas? thoughts? Any additional info I can provide.
                  Ideas? Thoughts?
                  If you are sure nothing has changed "in database" and this has started "all of a sudden", you may want to consider looking at it from OS.
                  For e.g. when you run the above query, you may want to check at OS level if there is significantly more I/O wait. Also, a look at TOP command
                  output for e.g. may help to find out if anything else is "choking" up the server.
                  One of the advantages of this approach is you would be able to involve your SAs (they will be more interested in using OS tools rather than DB)
                  and they would be able to spot any abnormal behaviours (from OS command outputs) quickly.
                  • 6. Re: Pulling my hair out - sudden performance issue
                    EdStevens
                    >
                    <snip>
                    The optimizer seems to be predicting the workload pretty well - cost = 380860 for an index range scan would basically be indicating the expectation of that many single block disc reads and that's consistent with the disk reads recorded. And the response time was pretty quick given it did that in 52 seconds (possibly getting a lot of the SAN cache, and using db file parallel reads - I don't suppose you traced this with waits, did you?)
                    <snip>
                    How many rows are there with date > 1st April,
                    20,799,816 rows
                    what other columns are in that index,
                    a sequence number (NUMBER)
                    a 'program name' (varchar2(100)
                    and how much data is in the table as a whole
                    75,014,276 rows
                    If the whole system is slow it could be that a couple of extreme queries like this are having an impact on everything else, so chasing a simple query down in detail might be a reasonable initial strategy.
                    Well,l dont' think this query is itself causing a problem. It's just an ad-hoc we happened to use to get some information on the real problem, but since the issue is performance and this query itself exhibited some unbelieveably slow performance, we thought it would point to the same root cause and be a lot simpler to look at.

                    <snip>
                    For Oracle to believe that this index range scan is a good idea, it would have to believe that the tablescan is enormous, and (relatively) very slow. Although "nothing has changed" from the database side, I think I'd still want to check whether the system statistics (sys.aux_stats) have changed - odd values there could produce extreme execution plans that look as mad as this simple index range scan.
                    hmm ....
                    SQL> show user
                    USER is "SYS"
                    SQL> desc sys.aux_stats
                    ERROR:
                    ORA-04043: object sys.aux_stats does not exist
                    And to your earlier question ... the tkprof with waits:
                    ********************************************************************************
                    
                    sELECT * FROM dw.program_log a
                    WHERE a.program_name LIKE '%CBA%' AND a.date_time > to_date('01-Apr-2012','dd-Mon-yyyy')
                    
                    call     count       cpu    elapsed       disk      query    current        rows
                    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
                    Parse        1      0.00       0.00          0          0          0           0
                    Execute      2      0.00       0.00          0          0          0           0
                    Fetch        3      3.22      43.77     374587     374590          0          29
                    ------- ------  -------- ---------- ---------- ---------- ----------  ----------
                    total        6      3.22      43.78     374587     374590          0          29
                    
                    Misses in library cache during parse: 1
                    Optimizer mode: ALL_ROWS
                    Parsing user id: 5  (SYSTEM)
                    
                    Rows     Row Source Operation
                    -------  ---------------------------------------------------
                         29  TABLE ACCESS BY INDEX ROWID PROGRAM_LOG (cr=374590 pr=374587 pw=0 time=0 us cost=419831 size=118145736 card=1093942)
                         29   INDEX RANGE SCAN SYS_C0014081 (cr=374574 pr=374571 pw=0 time=0 us cost=380860 size=0 card=1093942)(object id 77238)
                    
                    
                    Rows     Execution Plan
                    -------  ---------------------------------------------------
                          0  SELECT STATEMENT   MODE: ALL_ROWS
                         29   TABLE ACCESS   MODE: ANALYZED (BY INDEX ROWID) OF 'PROGRAM_LOG' 
                                  (TABLE)
                         29    INDEX   MODE: ANALYZED (RANGE SCAN) OF 'SYS_C0014081' (INDEX 
                                   (UNIQUE))
                    
                    
                    Elapsed times include waiting on following events:
                      Event waited on                             Times   Max. Wait  Total Waited
                      ----------------------------------------   Waited  ----------  ------------
                      SQL*Net message to client                       3        0.00          0.00
                      Disk file operations I/O                        8        0.00          0.00
                      db file sequential read                    374574        0.05         25.71
                      SQL*Net message from client                     3        0.00          0.00
                      db file parallel read                           2        0.01          0.02
                    ********************************************************************************
                    Regards
                    Jonathan Lewis
                    • 7. Re: Pulling my hair out - sudden performance issue
                      jgarry
                      Between a quarter and a third of the table, and it picks the index? How fast if you hint a full table scan? Something seriously off with your statistics, how and when did you gather them, and would you consider deleting and regathering? (Or does dbconsole for your setup allow you to revert stats? I think it would possibly be easier to get that working than deal with the equivalent command line statistics while under pressure.)

                      Do you see any huge objects in sysaux? I know little about plan baselines, so I wonder if you are getting burnt by some strangeness there (OPTIMIZER_USE_SQL_PLAN_BASELINES is set to true by default).
                      • 8. Re: Pulling my hair out - sudden performance issue
                        Jonathan Lewis
                        EdStevens wrote:
                        SQL> show user
                        USER is "SYS"
                        SQL> desc sys.aux_stats
                        ERROR:
                        ORA-04043: object sys.aux_stats does not exist
                        Sorry, I lost a $ -- sys.aux_stats$
                        SQL> SELECT pname, pval1
                          2  FROM sys.aux_stats$
                          3  WHERE sname = 'SYSSTATS_MAIN';
                        Regards
                        Jonathan Lewis
                        • 9. Re: Pulling my hair out - sudden performance issue
                          Jonathan Lewis
                          jgarry wrote:
                          Between a quarter and a third of the table, and it picks the index? How fast if you hint a full table scan? Something seriously off with your statistics, how and when did you gather them, and would you consider deleting and regathering?
                          If you check the Rowsource Execution details -

                          pr = 380,000 which matches the cost very closely - so Oracle's estimate of single block reads needed is very good

                          card = 1,000,000, and we have 20,000,000 rows for the date range (according to Ed), and the predicate program like '%XYZ%' which will have a 5% selectivity - so the card = 1M is also very good.

                          So the object stats look good, and the optimizer seems to have chosen a mad plan - hence my hypothesis that the system stats (mreadtim / sreadtim etc. are (extremely) odd). I suppose it's possible that the table has a HWM that is vastly higher than that implied by the total 75M rows, though.

                          Regards
                          Jonathan Lewis
                          • 10. Re: Pulling my hair out - sudden performance issue
                            jgarry
                            Jonathan Lewis wrote:
                            jgarry wrote:
                            Between a quarter and a third of the table, and it picks the index? How fast if you hint a full table scan? Something seriously off with your statistics, how and when did you gather them, and would you consider deleting and regathering?
                            If you check the Rowsource Execution details -

                            pr = 380,000 which matches the cost very closely - so Oracle's estimate of single block reads needed is very good

                            card = 1,000,000, and we have 20,000,000 rows for the date range (according to Ed), and the predicate program like '%XYZ%' which will have a 5% selectivity - so the card = 1M is also very good.

                            So the object stats look good, and the optimizer seems to have chosen a mad plan - hence my hypothesis that the system stats (mreadtim / sreadtim etc. are (extremely) odd). I suppose it's possible that the table has a HWM that is vastly higher than that implied by the total 75M rows, though.
                            Well, the hardware changed Thursday, and the heavy stuff started Monday, and the problems started Monday... yes, system stats, very suspicious here. Ed, check your patch level! See http://jonathanlewis.wordpress.com/category/oracle/statistics/system-stats/

                            Naturally, on my screen, the 1M cardinality was truncated to 10939 unless I scrolled right (duhhhhh). I deluded myself into focusing on the > date anyways.
                            • 11. Re: Pulling my hair out - sudden performance issue
                              EdStevens
                              A bit of follow up after some more reading.

                              The bit about the AUX_STATS$ table proved a bit interesting. Here's what I've got
                              SQL> select * from aux_stats$;
                              
                              SNAME                    PNAMEP        VAL1         PVAL2
                              ---------------          ----------    ----------   -----------------
                              SYSSTATS_INFO            STATUS                     COMPLETED
                              SYSSTATS_INFO            DSTART                     08-15-2009 00:49
                              SYSSTATS_INFO            DSTOP                      08-15-2009 00:49
                              SYSSTATS_INFO            FLAGS                  1
                              SYSSTATS_MAIN            CPUSPEEDNW     2657.0122
                              SYSSTATS_MAIN            IOSEEKTIM             10
                              SYSSTATS_MAIN            IOTFRSPEED          4096
                              SYSSTATS_MAIN            SREADTIM
                              SYSSTATS_MAIN            MREADTIM
                              SYSSTATS_MAIN            CPUSPEED
                              SYSSTATS_MAIN            MBRC
                              SYSSTATS_MAIN            MAXTHR
                              SYSSTATS_MAIN            SLAVETHR
                              
                              13 rows selected.
                              So it looks like these stats have essentially never been updated.

                              As for the thing about the disks being unmounted last week. We have been going through a migration from one SAN to another. It is my understanding that the actual migration of data happened a few weeks ago ... possibly before I came on board and definitely without my involvement. I don't know if these had belonged to the ASM or not, but my involvement in this latest change was to simply confirm that the disks in question were not in use either by ASM or directly by the db. My thought there though is that if they had been involved, we'd have known right away. In any event, any data redistribution would have been completed well before Monday morning. I simply can't imagine how that change could have effected this issue but I'm certainly open for further "eddication". On the other hand, given the suddenness of this degradation, everyone on my end keeps pointing back to that. I'll talk to the SA in the morning and get some more info on the history of that change and what she was able to see this morning when we were running a heavy read-only job. That job is not well written, but until Monday it had always completed in about a minute, whereas now it is suddenly taking over an hour, so I don't think we can blame the code itself.

                              And keep in mind that this is Standard Edition One, and we are NOT licensed for the performance packs.
                              Connected to:
                              Oracle Database 11g Release 11.2.0.1.0 - 64bit Production
                              With the Automatic Storage Management option
                              • 12. Re: Pulling my hair out - sudden performance issue
                                Karan
                                Did you see old plans ?

                                Regards
                                Karan
                                • 13. Re: Pulling my hair out - sudden performance issue
                                  Karan
                                  Its not a new thing when you gather fresh statistics and performance degrades, did you try reverting back the same ?

                                  Regards
                                  Karan
                                  • 14. Re: Pulling my hair out - sudden performance issue
                                    Dom Brooks
                                    In summary:

                                    SE - so no AWR etc
                                    Statspack - broken
                                    Historical plans - not available

                                    Symptoms
                                    Global performance problems

                                    Changes
                                    SAN changes
                                    Database reboot presumably
                                    Stats gathered

                                    Evidence
                                    High physical IO (or was that just the first execution)?
                                    Possible strange execution plans but not known whether they've changed or not.



                                    What about optimizer related parameters?
                                    Any strange values?

                                    Presumably there was a reboot?
                                    If so, might there be parameter/settngs that were changed in memory and not persisted to spfile and that have since been reverted? Or restart done via pfile or something weird like that?
                                    1 2 3 Previous Next