How Error Messages Found in Network Packets Revealed the Root Cause of Slow Screen Refreshes

Version 1

    by Mike Cuppett


    Digging into  performance and stability problems across complex infrastructures supporting a multitiered  application architecture never fails to surprise.


    Published October 2014


    Several teams were recently collaborating in the hunt for opportunities  to bring stability and performance enhancements to a client's primary  application, which supports nearly US$20 billion worth of transactions  annually. The user base had grown 50 percent over the past eight months, straining  the infrastructure resources, including the newly built private cloud database platform.  The primary issue: application users reporting screen refresh times exceeding  20 seconds.

    Growth always provides improvement opportunities by  revealing bottlenecks as transaction counts increase, data volume expands, and  network traffic explodes. Quickly, the teams built the "expected" list of tasks:  tune a handful of poor-performing queries; expand database memory to reduce  physical reads; shift batch job start times to decrease resource competition  during peak load, and so on.

    A Clue in the Network Packets

    The same tired story pressed onto a different day—except a  colleague of mine found an interesting error buried within network packets: ifError:11, which is shown in Figure 1.


    Figure 1. ifError:11 message

    Notice that the ifError:11  message resides within an HTTP/1.1 200  OK message, indicating a successful request-response interchange. Note  to self: "Success" information may yield problem clues. Figure 2 shows the  network frames for the request/response set.


    Figure 2. Network  frames for the request/response set

    Oracle's support site provides the  following information:


                The server is busy, re-try the request in  <n> milliseconds. This type of message will not be displayed to the user,  the VU will automatically re-try this request after <n> milliseconds.


    The ifError:11 messages mined from the network traffic showed a familiar Oracle Database design trait: spinning. Remember the database init parameter "spin_count" from years  ago? Data from current versions of the Automatic Workload Repository feature of  Oracle Database still shows spin data for several elements, but I digress. The ifError:11 messages showed incremental  time values of 100, 200, 400, 800, 1600, or more milliseconds. We determined that once the value hits 8000, it holds and repeats 8000.

    My colleague summarized one user's experience as follows:


    In  the case of this user, this happens multiple times.
                The  sequence of events is as follows:
                User sends POST /forms/lservlet.
                   Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/100
                Client waits 100 ms and then attempts to send POST /forms/lservlet to the server again.
                   Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/200
                Client waits 200 ms and then attempts to send POST /forms/lservlet to the server again.
                   Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/400
                Client waits 400 ms and then attempts to send POST /forms/lservlet to the server  again.
                   Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/800
                Client waits 800 ms and then attempts to send POST /forms/lservlet to the server  again.
                   Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/1600
                Client waits 1600 ms and then attempts to send POST /forms/lservlet to the server  again.
                   Oracle Forms Server processes requests and responds in 31 ms.     
    So the user was waiting ~8.46 seconds from a pure delay standpoint. Besides this 1 POST /form/lservlet there are a lot of other POST's that occur to fully render the screen. This 8.46 sec was pure delay for this user out of a 10–11 sec screen load.


    Look again at Figure 2; the left column shows the response  time deltas from the network perspective, showing nearly one second between the  POST request and the HTTP 200 response from the Oracle Forms server. The client  browser will automatically continue to send repeated POST requests until the  requested data is provided. The increasing number of POST requests closely  correlates to user frustration, as shown in Figure 3.


    Figure 3. Correlation  between response time and user frustration

    Let's level-set with a brief review of a typical Oracle  Forms server application architecture (shown in Figure 4).


    Figure 4. Oracle Forms server architecture

    Now, knowing that the client makes repeated POST requests until the data is served, the teams needed to determine what the ifError:11 message meant:

    • Is the Oracle Forms server too busy to handle the request?


    • Is the Oracle Forms server unable to fulfill the request  because the database has not yet delivered the needed data?

    With the help of a few power users, the teams decided to  profile key transactions using network data captures between the client and the  Oracle Forms server. Figure 5 shows the two tests of one key transaction that included  six queries to complete the screen refresh.


    Figure 5. Oracle Forms session network traces

    The above load times (9 seconds and 3 seconds) are for the  same transaction showing dramatically different results. The data has also begun to reveal the  bottleneck location.

    Simultaneously with the network packet captures, a database  administrator performed traces on the user sessions. Unfortunately, another  factor contributing to the general application slowness was discovered: long-running  recursive calls (Figure 6).


    Figure 6. Data for recursive calls

    Recursive calls should be almost invisible when tracing a  user session; unfortunately, this trace shows over 15 seconds of wall-clock  time. After further analysis and additional user tracing, high recursive times  were found to be common, leading the team down a second path to check the  database settings.

    Resolving the Recursive Calls Issue

    Since database slowness impacts all of the user sessions, the  teams decided to focus on improving overall database response time first. The  teams discovered that during the hardware replatforming, the database  configuration file (aka SPFILE, PFILE, and init.ora) was missed. Instead, a default configuration  file from the Oracle software installation was being used, with session_cached_cursors set to 20. The database  administrator also confirmed that the open_cursors  parameter was set to the default.

    After checking the configuration file still residing on the  old hardware, both cursor settings were updated, and like magic, the recursive  calls numbers began looking "normal." Figure 7 shows much improvement, with  opportunity remaining.


    Figure 7. Improved numbers  for recursive calls

    As a follow-up task, the DBA team was asked to take a second  look at the two cursor settings, and to compare all the current configuration  settings to the premigration settings. Additional tuning of the cursor  parameters to adjust for the load increase has now nearly eliminated recursive  work time.

    Mitigating the ifError Issue

    Knowing that the number of users had increased 50 percent  and that the database had been replatformed for expanded capacity, the teams  decided to expand the Oracle Forms capacity by increasing the number of servers  in the load balanced pool, intending to decrease the number of user connections  per server and the server CPU workload. Before the change, data was collected to  capture an average global response time delay for users over five-minute  increments. Table 1 shows the data for three recordings.

    Table 1. Data from the first three network traffic collections


    Total # of Users (Max)Average Delay/User/5 Minutes# of Oracle Forms ServersAverage # of Users/Server
    3,3776.08 seconds11298
    3,2255.94 seconds11293
    3,5905.254 seconds11326


    Average response times started falling, as shown in Table 2  through Table 4, when the additional capacity became available.

    Table 2. Improved response times after adding one server


    Total # of Users (Max)Average Delay/User/5 Minutes# of Oracle Forms ServersAverage # of Users/Server
    3,5134.472 seconds12293

    Table 3. Improved response times after adding four servers


    Total # of Users (Max)Average Delay/User/5 Minutes# of Oracle Forms ServersAverage # of Users/Server
    3,3764.059 seconds15225

    Table 4. Improved response times after adding seven servers


    Total # of Users (Max)Average Delay/User/5 Minutes# of Oracle Forms ServersAverage # of Users/Server
    3,4523.254 seconds18192


    Application users confirmed  the improved application responsiveness, with a caveat that the screen refreshes  were still not optimal.

    By now the teams had  concluded that the ifError:11 message was being sent  when the Oracle Forms server was not able to provide the client-requested data  because the database had not yet returned the result set. If you recall from  the "3 second load" results, each query returned in less than 1 second, meaning  that the database had already fulfilled the request, so the Oracle Forms server  could push the data to the client. Still a question remained concerning the ifError:11 events: "Why do the Oracle Forms servers wait almost  one second before responding to the client when the data is unavailable?"  Further digging led us to the parameter maxBlockTime, an Oracle  Forms listener servlet parameter. When the client checks for a response from  the Oracle Forms server, the listener forwards the request to the Oracle Forms  runtime process and waits for the amount of time specified by maxBlockTime for the result data before responding to the client  request.


    To recap, database cursor  improvements reduced SQL parsing time, additional Oracle Forms server capacity  reduced the ifError:11 total wait times, and application improvements made  several heavily hit application functions perform better.

    Understanding why something  occurs is vital to understanding where and how response times can be improved.  Changing maxBlockTime is out of scope for this article, since extensive  testing is required. However, knowing that each unfulfilled client request  costs a minimum of one second in response time motivated the team to focus on  returning data requests quickly and efficiently. Also, knowing how Oracle Forms  operates gave the teams the ability to discern between the portion of the  response time that could be impacted and portions where response time was  outside of their control.

    As you can see from the  troubleshooting process described in this article, DBAs and application  programmers can use network packets to get clues about why an application is  performing poorly. Also, during any troubleshooting process, collecting data  from many aspects of the application and infrastructure will paint a much  clearer picture of the problem.

    About the Author

    Mike  Cuppett works as a solutions engineer for a Fortune 20 healthcare  organization and has saved millions of dollars for clients as a business systems  performance consultant.


    Revision 1.0, 10/21/2014


    Follow OTN Garage On:
    Blog | Facebook | Twitter | YouTube