by Mike Cuppett
Digging into performance and stability problems across complex infrastructures supporting a multitiered application architecture never fails to surprise.
Published October 2014
Several teams were recently collaborating in the hunt for opportunities to bring stability and performance enhancements to a client's primary application, which supports nearly US$20 billion worth of transactions annually. The user base had grown 50 percent over the past eight months, straining the infrastructure resources, including the newly built private cloud database platform. The primary issue: application users reporting screen refresh times exceeding 20 seconds.
Growth always provides improvement opportunities by revealing bottlenecks as transaction counts increase, data volume expands, and network traffic explodes. Quickly, the teams built the "expected" list of tasks: tune a handful of poor-performing queries; expand database memory to reduce physical reads; shift batch job start times to decrease resource competition during peak load, and so on.
A Clue in the Network Packets
The same tired story pressed onto a different day—except a colleague of mine found an interesting error buried within network packets: ifError:11
, which is shown in Figure 1.
Figure 1. ifError:11
message
Notice that the ifError:11
message resides within an HTTP/1.1 200 OK
message, indicating a successful request-response interchange. Note to self: "Success" information may yield problem clues. Figure 2 shows the network frames for the request/response set.
Figure 2. Network frames for the request/response set
Oracle's support site provides the following information:
ifError:11/<n> The server is busy, re-try the request in <n> milliseconds. This type of message will not be displayed to the user, the VU will automatically re-try this request after <n> milliseconds. |
The ifError:11
messages mined from the network traffic showed a familiar Oracle Database design trait: spinning. Remember the database init
parameter "spin_count" from years ago? Data from current versions of the Automatic Workload Repository feature of Oracle Database still shows spin data for several elements, but I digress. The ifError:11
messages showed incremental time values of 100, 200, 400, 800, 1600, or more milliseconds. We determined that once the value hits 8000, it holds and repeats 8000.
My colleague summarized one user's experience as follows:
In the case of this user, this happens multiple times. The sequence of events is as follows: User sends POST /forms/lservlet. Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/100 Client waits 100 ms and then attempts to send POST /forms/lservlet to the server again. Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/200 Client waits 200 ms and then attempts to send POST /forms/lservlet to the server again. Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/400 Client waits 400 ms and then attempts to send POST /forms/lservlet to the server again. Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/800 Client waits 800 ms and then attempts to send POST /forms/lservlet to the server again. Oracle Forms Server takes exactly 1 sec and then sends back ifError:11/1600 Client waits 1600 ms and then attempts to send POST /forms/lservlet to the server again. Oracle Forms Server processes requests and responds in 31 ms. So the user was waiting ~8.46 seconds from a pure delay standpoint. Besides this 1 POST /form/lservlet there are a lot of other POST's that occur to fully render the screen. This 8.46 sec was pure delay for this user out of a 10–11 sec screen load. |
Look again at Figure 2; the left column shows the response time deltas from the network perspective, showing nearly one second between the POST request and the HTTP 200 response from the Oracle Forms server. The client browser will automatically continue to send repeated POST requests until the requested data is provided. The increasing number of POST requests closely correlates to user frustration, as shown in Figure 3.
Figure 3. Correlation between response time and user frustration
Let's level-set with a brief review of a typical Oracle Forms server application architecture (shown in Figure 4).
Figure 4. Oracle Forms server architecture
Now, knowing that the client makes repeated POST requests until the data is served, the teams needed to determine what the ifError:11
message meant:
- Is the Oracle Forms server too busy to handle the request?
OR
- Is the Oracle Forms server unable to fulfill the request because the database has not yet delivered the needed data?
With the help of a few power users, the teams decided to profile key transactions using network data captures between the client and the Oracle Forms server. Figure 5 shows the two tests of one key transaction that included six queries to complete the screen refresh.
Figure 5. Oracle Forms session network traces
The above load times (9 seconds and 3 seconds) are for the same transaction showing dramatically different results. The data has also begun to reveal the bottleneck location.
Simultaneously with the network packet captures, a database administrator performed traces on the user sessions. Unfortunately, another factor contributing to the general application slowness was discovered: long-running recursive calls (Figure 6).
Figure 6. Data for recursive calls
Recursive calls should be almost invisible when tracing a user session; unfortunately, this trace shows over 15 seconds of wall-clock time. After further analysis and additional user tracing, high recursive times were found to be common, leading the team down a second path to check the database settings.
Resolving the Recursive Calls Issue
Since database slowness impacts all of the user sessions, the teams decided to focus on improving overall database response time first. The teams discovered that during the hardware replatforming, the database configuration file (aka SPFILE
, PFILE
, and init.ora
) was missed. Instead, a default configuration file from the Oracle software installation was being used, with session_cached_cursors
set to 20. The database administrator also confirmed that the open_cursors
parameter was set to the default.
After checking the configuration file still residing on the old hardware, both cursor settings were updated, and like magic, the recursive calls numbers began looking "normal." Figure 7 shows much improvement, with opportunity remaining.
Figure 7. Improved numbers for recursive calls
As a follow-up task, the DBA team was asked to take a second look at the two cursor settings, and to compare all the current configuration settings to the premigration settings. Additional tuning of the cursor parameters to adjust for the load increase has now nearly eliminated recursive work time.
Mitigating the ifError
Issue
Knowing that the number of users had increased 50 percent and that the database had been replatformed for expanded capacity, the teams decided to expand the Oracle Forms capacity by increasing the number of servers in the load balanced pool, intending to decrease the number of user connections per server and the server CPU workload. Before the change, data was collected to capture an average global response time delay for users over five-minute increments. Table 1 shows the data for three recordings.
Table 1. Data from the first three network traffic collections
Total # of Users (Max) | Average Delay/User/5 Minutes | # of Oracle Forms Servers | Average # of Users/Server |
---|---|---|---|
3,377 | 6.08 seconds | 11 | 298 |
3,225 | 5.94 seconds | 11 | 293 |
3,590 | 5.254 seconds | 11 | 326 |
Average response times started falling, as shown in Table 2 through Table 4, when the additional capacity became available.
Table 2. Improved response times after adding one server
Total # of Users (Max) | Average Delay/User/5 Minutes | # of Oracle Forms Servers | Average # of Users/Server |
---|---|---|---|
3,513 | 4.472 seconds | 12 | 293 |
Table 3. Improved response times after adding four servers
Total # of Users (Max) | Average Delay/User/5 Minutes | # of Oracle Forms Servers | Average # of Users/Server |
---|---|---|---|
3,376 | 4.059 seconds | 15 | 225 |
Table 4. Improved response times after adding seven servers
Total # of Users (Max) | Average Delay/User/5 Minutes | # of Oracle Forms Servers | Average # of Users/Server |
---|---|---|---|
3,452 | 3.254 seconds | 18 | 192 |
Application users confirmed the improved application responsiveness, with a caveat that the screen refreshes were still not optimal.
By now the teams had concluded that the ifError:11
message was being sent when the Oracle Forms server was not able to provide the client-requested data because the database had not yet returned the result set. If you recall from the "3 second load" results, each query returned in less than 1 second, meaning that the database had already fulfilled the request, so the Oracle Forms server could push the data to the client. Still a question remained concerning the ifError:11
events: "Why do the Oracle Forms servers wait almost one second before responding to the client when the data is unavailable?" Further digging led us to the parameter maxBlockTime
, an Oracle Forms listener servlet parameter. When the client checks for a response from the Oracle Forms server, the listener forwards the request to the Oracle Forms runtime process and waits for the amount of time specified by maxBlockTime
for the result data before responding to the client request.
Conclusion
To recap, database cursor improvements reduced SQL parsing time, additional Oracle Forms server capacity reduced the ifError:11
total wait times, and application improvements made several heavily hit application functions perform better.
Understanding why something occurs is vital to understanding where and how response times can be improved. Changing maxBlockTime
is out of scope for this article, since extensive testing is required. However, knowing that each unfulfilled client request costs a minimum of one second in response time motivated the team to focus on returning data requests quickly and efficiently. Also, knowing how Oracle Forms operates gave the teams the ability to discern between the portion of the response time that could be impacted and portions where response time was outside of their control.
As you can see from the troubleshooting process described in this article, DBAs and application programmers can use network packets to get clues about why an application is performing poorly. Also, during any troubleshooting process, collecting data from many aspects of the application and infrastructure will paint a much clearer picture of the problem.
About the Author
Mike Cuppett works as a solutions engineer for a Fortune 20 healthcare organization and has saved millions of dollars for clients as a business systems performance consultant.
Revision 1.0, 10/21/2014 |
Comments