We are running scalability tests of a web application, ramping up user load over a period of time, and hitting some type of bottleneck before maxing out the CPU on the box. We are using WebLogic 10.3.6 with JRockit 1.6.0_29.
As we increase load we notice a sudden drop in CPU usage (from roughly 60% to 20%) with a corresponding increase in response times. It appears some resource has been exhausted, so we set up JRMC and Flight Recorder to try to identify it. We have not found the cause, but several effects are apparent:
1. The HTTPClientWorkManager is reporting that it is hitting the max limit on a regular basis, and when the CPU drops off we get a message that the max queue length of 8192 has been exceeded.
2. The queue grows from ~50 to 8192 in just 3-4 minutes, even without adding additional load.
3. The Idle thread count goes to zero and the pending requests count goes up. This in turn is what is driving up our response times.
Aside from once seeing a StreamDemultiplexor thread blocking for 20s, we have not seen a reason for the queue growing so rapidly. We did not see any contention or latency issues in our code.
Any suggestions on what else we can do to track this down?
Edited by: user722303 on Feb 27, 2013 8:33 PM
Edited by: user722303 on Feb 27, 2013 8:53 PM
On the Overload configuration tab, the Shared Capacity For Work Managers attribute can be set (the default is 65536).
It could be that you are running into an OS limit, such as the maximum number of open file descriptors (/etc/security/limits.conf).
On some systems the default value for the time wait interval is too high and needs to be adjusted. To determine the number
of sockets in TIME_WAIT we can use netstat -a | grep TIME_WAIT | wc -l. When the number approaches the maximum number
of file descriptors per process, the application’s throughput will degrade, i.e., new connections have to wait for a free space in the
application’s file descriptor table. To check the current value we can use cat /proc/sys/net/ipv4/tcp_keepalive_time and
More on OS tuning - http://middlewaremagic.com/weblogic/?p=8133
When you do a load test, you can see what individual threads are doing by using: top -H -p <PID>
Based on the output you can retrieve a thread id (and convert it to a hex value), take a thread dump,
and see (based on the hex value of the thread id) what that thread is doing (which you can see in the stack trace
that is dumped for that specific thread).
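As a rough sketch of that lookup in Java: the class below converts a decimal thread id from top into hex and picks out the matching header lines from a thread dump. It assumes HotSpot-style dumps, where each thread header carries the native id as nid=0x...; other JVMs may print the id differently (possibly in decimal), so check your dump format first. The class name, method names and sample dump lines are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ThreadDumpLookup {

    /** Converts a decimal OS thread id (as shown by top -H) to 0x-prefixed hex. */
    static String toNid(long osThreadId) {
        return "0x" + Long.toHexString(osThreadId);
    }

    /** Returns the dump lines whose nid=... token matches the given OS thread id. */
    static List<String> findThread(String threadDump, long osThreadId) {
        // Word boundaries keep nid=0x1a2b from also matching nid=0x1a2bc.
        Pattern marker = Pattern.compile("\\bnid=" + Pattern.quote(toNid(osThreadId)) + "\\b");
        List<String> matches = new ArrayList<String>();
        for (String line : threadDump.split("\\r?\\n")) {
            if (marker.matcher(line).find()) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical dump headers, assumed HotSpot-style format.
        String dump =
            "\"ExecuteThread: '1'\" daemon prio=10 tid=0x00007f01 nid=0x1a2b runnable\n"
          + "\"ExecuteThread: '2'\" daemon prio=10 tid=0x00007f02 nid=0x1a2bc waiting on condition\n";
        for (String line : findThread(dump, 6699)) {
            System.out.println(line);
        }
    }
}
```

Once the header line is found, the stack trace printed directly under it in the dump is the one to read.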
Thanks for the reply. We are using Windows, not Linux.
I should have added that our scenario is ramping up users over a 1-2 hour period and hitting this issue with less than 200 users (typically 150-175). This is a load test, but a realistic load test where users have think times up to 30-60 seconds. This is not a stress test. Our server is very powerful with 12 physical CPUs (24 virtual) and 144 GB RAM. It is a physical box and not a VM image. No one else is on this machine during our tests.
I can say that we confirmed the TCP/IP tuning is correct: the TCP starting port is 1025 with the number of ports set to 64510. I did not check the number of connections in the TIME_WAIT state during the tests, but we do limit TIME_WAIT to 30 seconds via the TcpTimedWaitDelay registry setting, and with the load we were running I would not expect to have too many connections in that state.
We also confirmed that the Java heap size and garbage collection are not a bottleneck, the disk utilization and disk queue length are very low, and context switching is not a problem.
One question I have is why would the queue length grow from ~50 to 8192 in just a couple of minutes with less than 200 users on the system?
I can provide a jfr file if that will help. I don't see a way to do that here though.
Edited by: user722303 on Feb 28, 2013 10:54 AM
"One question I have is why would the queue length grow from ~50 to 8192 in just a couple of minutes with less than 200 users on the system?"
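To put rough numbers on that question, here is a back-of-the-envelope sketch in Java using only the figures from this thread. The one-request-per-user-action model is an assumption (a single ADF page action may well issue several HTTP requests), so treat the result as an order-of-magnitude check, not a measurement.

```java
public class QueueGrowthEstimate {

    /** Net backlog growth in requests per second (arrivals minus completions). */
    static double netGrowthPerSecond(int startLength, int endLength, double minutes) {
        return (endLength - startLength) / (minutes * 60.0);
    }

    /**
     * Upper bound on the request rate offered by the users, ASSUMING one
     * request per user action and ignoring response time.
     */
    static double maxOfferedRate(int users, double thinkSeconds) {
        return users / thinkSeconds;
    }

    public static void main(String[] args) {
        // Figures quoted in the thread: ~50 -> 8192 in 3-4 minutes, < 200 users,
        // think times of 30-60 seconds.
        System.out.printf("growth over 4 min: %.1f req/s%n", netGrowthPerSecond(50, 8192, 4));
        System.out.printf("growth over 3 min: %.1f req/s%n", netGrowthPerSecond(50, 8192, 3));
        System.out.printf("offered, 30 s think time: %.1f req/s%n", maxOfferedRate(200, 30));
        System.out.printf("offered, 60 s think time: %.1f req/s%n", maxOfferedRate(200, 60));
    }
}
```

With those figures the backlog grows by roughly 34 to 45 requests per second, while 200 users with a 30 second think time offer fewer than 7 requests per second under that assumption. If the test is closed-loop (each virtual user waits for its response before sending the next request), the backlog should also not exceed the number of users, so something appears to be multiplying requests.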
Can you check if there is any recursion in the application? Can you check how certain requests are handled, for example, are EJBs (or something similar)
calling other EJBs, which in turn call the caller?
Also, are you using JMS in your application? Note that JMS (and also transaction manager) requests are so-called higher priority requests, for which overload
management is provided by JMS and the transaction manager themselves. For JMS you can set thresholds and quotas; for JTA you can set a maximum number of transactions.
When using JMS, also check what happens once a message has been received and is being processed by the message listener.
When the maximum queue length is reached, you can take a thread dump and check what the application is doing (by checking the stack traces of the threads that are runnable).
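If taking a full dump by hand is awkward, a small sketch like the following shows the same idea using the standard java.lang.management API: count threads per state and print the top frame of every RUNNABLE thread. It only inspects the JVM it runs in, so it would have to run inside the server (or be adapted to a remote JMX connection, not shown here); the class name is made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSummary {

    /** Counts the given threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts =
            new EnumMap<Thread.State, Integer>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip entries for threads that no longer exist
            }
            Integer old = counts.get(info.getThreadState());
            counts.put(info.getThreadState(), old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // false, false: skip lock details, we only want states and stacks.
        ThreadInfo[] infos = mx.dumpAllThreads(false, false);
        System.out.println("Threads by state: " + countByState(infos));

        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.RUNNABLE) {
                StackTraceElement[] stack = info.getStackTrace();
                String top = stack.length > 0 ? stack[0].toString() : "(no frames)";
                System.out.println(info.getThreadName() + " -> " + top);
            }
        }
    }
}
```

If most threads turn out to be WAITING or BLOCKED in the same frame instead of RUNNABLE, that frame is usually a good place to start looking.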
We are not using either EJBs or JMS directly. Ours is an ADF application using bean data controls which talk to our back end services via web services. The test scenario is exactly the same for each test cycle. We have about 8 test cases and the users are distributed between the test cases. Users repeat their test over and over again.
Can you check whether your application spawns extra threads when a request comes in?
If you are able to do so, simplify your test so that you know which flow it takes through your application and can pinpoint where the problem lies.
A couple of things I think you should check:
1) Regarding using 200 users and having a queue of thousands of requests... When you say that you have 200 users, this means that you will have a maximum of 200 threads running simultaneously on the client side, right?
2) What is the read timeout on the client side? That could explain the number of threads in use in WebLogic. Threads might be left running, and if the client stops waiting for the response and initiates a new connection, it could lead to the scenario you are describing.
3) As you said that the web application starts consuming less CPU when the issue happens and at the same time the queue increases, it seems to me that you might have contention in the web services. Are you also monitoring the web services? Do they run on a different server? Are they deployed on the same physical server but in a different WebLogic server, or in the same WebLogic server?
4) Did you run a performance test against the web services on their own to make sure that they scale?
5) In this situation, I would run a separate performance test against the web services. Then I would stub the web services and run a performance test against the web application. By stubbing the web services, I'm able to create any kind of behaviour I want, e.g. simulate slow response times, respond with a consistent response time, etc.
6) Is there a read timeout between the web application and the web service?
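Points 2, 5 and 6 can be tried out in isolation with a sketch like this: a stub that answers after a configurable delay, and a client call with a read timeout. It uses the JDK's built-in com.sun.net.httpserver.HttpServer and HttpURLConnection, not your ADF or web service stack; the class name, the /stub path and the delays are arbitrary choices for illustration.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URL;

public class SlowStubDemo {

    /** Starts a stub on a free local port that waits delayMillis before answering. */
    static HttpServer startStub(final long delayMillis) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/stub", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                try {
                    Thread.sleep(delayMillis); // simulate a slow back end
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                byte[] body = "ok".getBytes("UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                OutputStream out = exchange.getResponseBody();
                out.write(body);
                out.close();
            }
        });
        server.start();
        return server;
    }

    /** Returns true if the read timed out, false if the response arrived in time. */
    static boolean callTimesOut(int port, int readTimeoutMillis) throws IOException {
        URL url = new URL("http://127.0.0.1:" + port + "/stub");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(1000);
        conn.setReadTimeout(readTimeoutMillis);
        try {
            InputStream in = conn.getInputStream();
            while (in.read() != -1) {
                // drain the response body
            }
            in.close();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer stub = startStub(500);
        try {
            int port = stub.getAddress().getPort();
            System.out.println("100 ms read timeout, timed out: " + callTimesOut(port, 100));
            System.out.println("3000 ms read timeout, timed out: " + callTimesOut(port, 3000));
        } finally {
            stub.stop(0);
        }
    }
}
```

Note that when the client times out, the stub still sleeps and then tries to answer, so the server-side thread stays occupied after the client has given up. If the client then retries, requests can pile up in the way point 2 describes.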
Edited by: Fabio Douek on Mar 13, 2013 3:25 PM
I'm curious about your question:
"A couple of things I think you should check:
1) Regarding using 200 users and having a queue of thousands of requests... When you say that you have 200 users, this means that you will have maximum 200 threads simultaneously on the client side, right?"
We are having a similar issue, and I guess it is caused by the thread pool limits (work managers). We have given higher values for the self-tuning threads, but honestly it is still unclear to me how this really works. With execute queues it was much clearer.
Is there any documentation that you are aware of explaining how to monitor threads and work managers?