
finding a performance bottleneck

Answers

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Feb 7, 2017 3:28PM

    Is caching like you suggest possible without access to application code? Such caching would need to be away from the db server to have any effect on chattiness-with-db.

    (I'm not suggesting this is the route that should be taken.)

  • GlenStrom
    GlenStrom Member Posts: 217 Bronze Badge
    edited Feb 7, 2017 3:50PM

    Just went and talked to the application admins again to get a better idea of its workload. In our case it is handling 300+ different locations, and reports run pretty well on a 24x7 basis because the application queues them up and runs them consecutively. They said at one point today the reports were "5 hours behind", but as of when I talked to them they were only 10 minutes behind. I think this is just due to the sheer volume of reports being run and how the application queues them up to run one after the other.

    The other & main issue is some locations are reporting very slow responses for day-to-day circulation activities like checking books in, checking books out, cataloging, etc.

    Between 2 and 3 today is when a lot of locations report slow response time, so I'll run another ADDM report for that time period later today for comparison.

    Given that there were over 3 million round trips to the database during the one hour this morning, I'm curious to see what the count will be for the "busy" time.

    It also appears that some of the cataloging they do connects to another external database at the vendor itself, so the potential for a lot of network-related latency is pretty high, I would think.

  • GlenStrom
    GlenStrom Member Posts: 217 Bronze Badge
    edited Feb 7, 2017 5:17PM
    Andrew Sayer wrote:

    Your report is telling you that those top SQLs are being executed millions of times, with an average execution of next-to-instant. Do you have a chatty application? Does your application do the joins rather than your SQL? Take a read of this blog post and watch the video https://blogs.oracle.com/plsql-and-ebr/entry/noplsql_versus_thickdb - does it seem familiar?

    Why does your SQL rely on index hints? Do you distrust the optimizer that much?

    If you want to see where the time is going then check your instrumentation on the things your users are complaining about, and find out where the time is taken. If it's in the DB then use extended SQL tracing to properly see what's going on. You use module and action and client identifier to tell the DB what's being done, right?

    It looks like the application does everything - there are no PLSQL procedures, packages, functions, triggers or even views at all in the database. Just tables, indexes and sequences.
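
    For anyone finding this thread later: the module/action/client-identifier instrumentation Andrew describes can be driven from the database side with DBMS_APPLICATION_INFO, DBMS_SESSION and DBMS_MONITOR. A minimal sketch (the 'circulation'/'checkout' tags and the 'CIRC-DESK-1' identifier are made-up names for illustration; check DBA_SERVICES for your real service name):

    ```sql
    -- If the application were instrumented, it would tag its own sessions:
    BEGIN
      DBMS_APPLICATION_INFO.SET_MODULE(module_name => 'circulation',
                                       action_name => 'checkout');
      DBMS_SESSION.SET_IDENTIFIER('CIRC-DESK-1');  -- hypothetical client id
    END;
    /

    -- A DBA can then enable extended SQL trace for exactly that slice
    -- of the workload, instead of tracing every session:
    BEGIN
      DBMS_MONITOR.SERV_MOD_ACT_TRACE_ENABLE(
        service_name => 'SYS$USERS',      -- substitute the real service name
        module_name  => 'circulation',
        action_name  => 'checkout',
        waits        => TRUE,             -- include wait events in the trace
        binds        => FALSE);
    END;
    /

    -- ...and turn it off again afterwards:
    BEGIN
      DBMS_MONITOR.SERV_MOD_ACT_TRACE_DISABLE(
        service_name => 'SYS$USERS',
        module_name  => 'circulation',
        action_name  => 'checkout');
    END;
    /
    ```

    The resulting trace files can then be summarized with tkprof to see exactly where the database time went.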

  • John Thorton
    John Thorton Member Posts: 14,493 Silver Crown
    edited Feb 7, 2017 5:22PM
    user571263 wrote: It looks like the application does everything - there are no PLSQL procedures, packages, functions, triggers or even views at all in the database. Just tables, indexes and sequences.

    Is the application 3-tier?

    Does the application utilize connection pooling?

    Is the application written in Java?

  • GlenStrom
    GlenStrom Member Posts: 217 Bronze Badge
    edited Feb 7, 2017 5:44PM
    John Thorton wrote: Is the application 3-tier? Does the application utilize connection pooling? Is the application written in Java?

    3-tier? - I am not sure at this point but will check. I think it possibly is.

    Connection pooling? - I'm thinking probably, given the number of connections compared to users/locations.

    Written in Java? - Yes.

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Feb 7, 2017 5:47PM
    user571263 wrote: It looks like the application does everything - there are no PLSQL procedures, packages, functions, triggers or even views at all in the database. Just tables, indexes and sequences.

    Sounds like the application could do with some work then. I recommend you take your requirements and findings to the vendor. You need to work with them to figure out exactly what sort of performance you should be able to expect; if they can't provide that then they need to address it (by seeing where the time is going and fixing it, or by managing your expectations).

    How many user actions happened (how many clicks/how many orders were processed/how many accounts were filed.. you know how best to measure this) in that hour of millions of SQL calls?
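
    To put numbers on that calls-per-user-action ratio, the V$ statistics views can be sampled at the start and end of a measurement window; a sketch, assuming SELECT privileges on the V$ views:

    ```sql
    -- Instance-wide counters since startup; sample twice and take the
    -- delta over the window you care about.
    SELECT name, value
    FROM   v$sysstat
    WHERE  name IN ('user calls',
                    'execute count',
                    'SQL*Net roundtrips to/from client');

    -- The same idea per session: which connections are chattiest?
    SELECT s.sid, s.module, st.value AS roundtrips
    FROM   v$sesstat st
           JOIN v$statname n ON n.statistic# = st.statistic#
           JOIN v$session  s ON s.sid = st.sid
    WHERE  n.name = 'SQL*Net roundtrips to/from client'
    ORDER  BY st.value DESC;
    ```

    Dividing the round-trip delta by the number of user actions in the same window gives a rough measure of how chatty the application really is.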

  • John Brady - UK
    John Brady - UK Member Posts: 434 Bronze Badge
    edited Feb 8, 2017 3:52AM
    user571263 wrote:

    Database is 11.2.0.3 running on Solaris SVR40-be-64bit 8.1.0, 32 CPUs. The database is used by many connected libraries to check books in and out, query for overdues, etc. On busy days, performance seems to be all over the map, some areas reporting 40 seconds for a form to update, some areas < 2 seconds. The transactions that seem to cause the bottlenecks seem to be queries checking if a patron has overdues, fines, etc.

    I've been tasked to try and find where the bottlenecks are, and find out if the long delays are caused by server contention, application code, network & internet issues, or within the database itself. The server is a Solaris M5000 with about 16 other Oracle databases and 132GB RAM.

    I am wondering if it is application code queuing up reports and queries that is causing the delay, or internet issues, database waits, or a combination of all - but I would like to be able to point out where the problem "most likely" is coming from. What and where else should I be looking? Next steps? Thanks in advance.

    Okay, you have a performance problem SOMEWHERE between a user's keyboard / screen and where the data they want is stored on disk.  The bottleneck could be ANYWHERE between that user's client device and the disks on the database server where the data is stored.  You CANNOT just assume that the bottleneck is on the database server.  You MUST gather evidence on what is happening on EVERYTHING involved in how the application works.  Only then will you be able to say with any level of confidence where the bottleneck MIGHT be.

    To put it another way, you are looking for a needle in a haystack.  You might get lucky if you only look at the database server, if and only if it is where the bottleneck is.  But if it is not the bottleneck then you are just wasting your time.  "Fixing" things on anything else that is not the bottleneck will have NO impact at all on the overall performance.  You MUST identify the bottleneck first before making any changes anywhere.

    How do you identify the bottleneck?  Trace a transaction through from the user screen to the backend database server, through any intermediate systems and software along the way.  With timestamps on each action that takes place you can see where most of the elapsed time is being spent.  That is the preferred way of doing it.  An instrumented application will be able to tell you where it spent its time within those transactions, i.e. it will output timestamped messages for each action it does.  Unfortunately very few, if any, applications are instrumented that well.  Oracle Database is an example of something with good instrumentation.  Otherwise, get activity data on each system involved and on other things such as the network, graph these over time together, and look for peaks in activity in one system during the busy periods.  That will be the bottleneck, and you should investigate further what it is doing.

    Be aware that anything you miss out and don't bother checking will probably turn out to be the cause of the bottleneck.  That is why you need a COMPLETE picture of what happens when a transaction executes, including all other servers, software, network and anything else.

    Others have mentioned that the workload on your database instance looks trivial (0.54 average active sessions) but it could be other database instances on that server being very active having an impact.  You also need to look at the system level, using Solaris tools such as vmstat and sar to check things like CPU utilization, free memory and memory paging, and overall disk activity.
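
    That 0.54 average active sessions figure can be cross-checked from ASH, which samples each active session roughly once per second, so the sample count divided by the elapsed seconds approximates average active sessions. A sketch (note ASH/AWR queries require the Diagnostics Pack license):

    ```sql
    -- Approximate average active sessions over the last hour:
    -- ASH samples active sessions about once per second, so
    -- samples / elapsed seconds ~= average active sessions.
    SELECT ROUND(COUNT(*) / (60 * 60), 2) AS avg_active_sessions
    FROM   v$active_session_history
    WHERE  sample_time > SYSDATE - 1/24;
    ```

    A value well below 1 on a 32-CPU box supports the view that this instance itself is not the bottleneck, though the other 16 instances on the same server still compete for the same CPUs and disks.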

    My best guess is that either one of the networks is slow or the application server is the bottleneck where the client requests get queued before being submitted to the database server.  If the database server is not busy, which it appears not to be, then the bottleneck is somewhere before the database server.

  • jgarry
    jgarry Member Posts: 13,844 Gold Crown
    edited Feb 8, 2017 5:02PM

    User571263, John Brady makes some excellent points.  On the subject of bottlenecks, I'd like to add:

    As you've seen, the bottlenecks change, and there may be more than one.

    For any change you make for a given bottleneck, there is likely going to be a shift in that bottleneck.  This could mean everything gets faster, or it could mean the issue shifts to another bottleneck, which may be worse, depending on what depends on that.  Sometimes it's like squeezing different parts of a balloon animal.

    Query performance tuning - indeed, all performance tuning - usually means getting rid of unnecessary work.  If the computer doesn't need to do as much stuff, everyone wins.  But shifting bottlenecks around can show anomalies.  For example, solving an app server issue may load the db server.  Solving an i/o issue may lead to a worse cpu issue, and so forth.  Think of onramp metering if you are in an area of highways that has it - slowing the access of additional cars to the highway can prevent an asymptotic worsening of traffic.  But the onramp traffic may back up onto the surface street...

  • Mark D Powell
    Mark D Powell Member Posts: 5,914 Blue Diamond
    edited Feb 8, 2017 2:26PM

    user571263, I may have missed the response which covered this, but if the server has a paging problem that is probably where you want to start.  Have the system administrator check the paging numbers.  Sometimes if the system has a lot of memory the administrators do not allocate much page space because it should not be used.  I have encountered significant problems because of this on a Windows box, so make sure the Oracle recommendations for how much page space should be defined were followed.  The page file needs to be defined with an adequate amount of space even if physical memory should never be exhausted.  If adequate page file space was defined and the page rate is high, then your server needs more memory.

    - -

    If the page rate is low then that just leaves the SQL, most of which appeared to have very short run times and, from the information provided, is likely indexed.  If the preceding is true you may have limited tuning options, as what you really need is more memory and/or CPU.  In that case you want to look closely at your buffer cache allocation and usage, since shifting some memory from the shared pool to the buffer cache might be an option.
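
    Before shifting memory between pools, the buffer cache advisory view gives an estimate of what a larger (or smaller) cache would buy; a sketch, assuming the default 8K block size:

    ```sql
    -- Predicted physical reads at various buffer cache sizes.
    -- SIZE_FACTOR = 1 is the current size; look for the point where
    -- ESTD_PHYSICAL_READS stops dropping meaningfully.
    SELECT size_for_estimate AS cache_mb,
           size_factor,
           estd_physical_reads
    FROM   v$db_cache_advice
    WHERE  name       = 'DEFAULT'
      AND  block_size = 8192
    ORDER  BY size_for_estimate;
    ```

    If the estimated physical reads barely change above the current size, growing the cache at the shared pool's expense is unlikely to help.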

    - -

    HTH -- Mark D Powell --

  • GlenStrom
    GlenStrom Member Posts: 217 Bronze Badge
    edited Feb 8, 2017 5:54PM

    Thanks for the replies everyone.

    I initially asked the sys admin about the excessive paging at the OS level that ADDM reports, he said there was no issue with the paging on the server.

    Not knowing my way around the Unix OS that well, I did a couple of quick searches and ran the following:

    ================================================================================================================================

    zeus:/export/home/oracle % vmstat -s
            0 swap ins
            0 swap outs
            0 pages swapped in
            0 pages swapped out
    28148065343 total address trans. faults taken
    5362267153 page ins
    11667003 page outs
    41353933665 pages paged in
    25822256 pages paged out
    47193155966 total reclaims
    47188709357 reclaims from free list
            0 micro (hat) faults
    28148065343 minor (as) faults
    3485228240 major faults
    27638841633 copy-on-write faults
    14671122667 zero fill page faults
       455676 pages examined by the clock daemon
            0 revolutions of the clock hand
    21321932 pages freed by the clock daemon
    111683205 forks
       838797 vforks
    94623222 execs
    263883768807 cpu context switches
    247120603346 device interrupts
    36772340597 traps
    390983314603 system calls
    74348713190 total name lookups (cache hits 99%)
    1607015572 user   cpu
    1720900664 system cpu
    23015020664 idle   cpu
            0 wait   cpu

    zeus:/export/home/oracle % vmstat 1 2
     kthr      memory            page            disk          faults      cpu
     r b w   swap  free  re  mf pi po fr de sr m2 m4 m6 m1   in   sy   cs us sy id
     1 3 0 189190840 33547016 5733 3419 40190 25 21 0 0 1 2 1 -0 30019 47495 32056 6 7 87
     0 8 0 169760632 23074424 7903 2188 80272 0 0 0 0 2 0 0 0 40146 116780 48027 10 9 80

    zeus:/export/home/oracle % swap -l
    swapfile             dev  swaplo blocks   free
    /dev/md/dsk/d107    85,107     16 382437472 382437472

    zeus:/export/home/oracle % swap -s
    total: 101569144k bytes allocated + 5284080k reserved = 106853224k used, 169292400k available

    ========================================================================================================================================

    I also ran an AWR comparison report, one for a "not so busy" time (1st) and one for a "busy" time (2nd).  Here is the OS statistics section of that report:

    [screenshot: pastedImage_3.png - AWR comparison, OS statistics section]

    So I'm not sure if this tells me swap is excessive, or even being used at all.

    In any event, since any of the SQL I see the application running runs almost instantaneously when I run the same statement in sqlplus, I am leaning away from this being primarily a database issue.
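
    One caveat when re-running the application's SQL by hand: a second execution usually hits a warm cache, and typed-in literals may behave differently from the application's bind variables. To see the actual work a statement does, rowsource statistics can be pulled for the last execution; a sketch (the table and bind names here are hypothetical):

    ```sql
    -- Run the statement once with rowsource statistics collection...
    SELECT /*+ gather_plan_statistics */ *
    FROM   patrons                 -- hypothetical table name
    WHERE  patron_id = :b1;

    -- ...then fetch the actual plan, row counts and timings for that run:
    SELECT *
    FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));
    ```

    Comparing the actual rows and buffer gets per step against what the application's cursors show in V$SQL can reveal whether the app really gets the same near-instant executions.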

    I've asked the supervisor in charge of the application to present the vendor with a list of questions I have, such as: what are the performance recommendations, is there built-in instrumentation, are joins done inside the application, why are they using index hints, etc.  I will wait to see what the vendor comes back with.

This discussion has been closed.