On the Overload configuration tab, the Shared Capacity For Work Managers attribute can be set (the default is 65536).
It could be that you are running against an OS limit, such as the maximum number of open file descriptors (see /etc/security/limits.conf).
On some systems the default value for the TIME_WAIT interval is too high and needs to be adjusted. To determine the number
of sockets in TIME_WAIT we can use netstat -a | grep TIME_WAIT | wc -l. When that number approaches the maximum number
of file descriptors per process, the application's throughput degrades, i.e., new connections have to wait for a free slot in the
application's file descriptor table. To check the current value we can use cat /proc/sys/net/ipv4/tcp_keepalive_time.
More on OS tuning - http://middlewaremagic.com/weblogic/?p=8133
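A quick way to gather these numbers on Linux (a sketch; the sysctl path is the usual default location, so verify it on your distribution):

```shell
# Count sockets currently in TIME_WAIT (prints 0 if none):
netstat -an | awk '/TIME_WAIT/ { n++ } END { print n + 0 }'
# Per-process open file descriptor limit for the current shell:
ulimit -n
# A related kernel timeout in seconds (usual Linux location):
cat /proc/sys/net/ipv4/tcp_fin_timeout
```

Sampling the TIME_WAIT count every few seconds during the test makes it easy to see whether it trends toward the descriptor limit.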
When you do a load test, you can see what individual threads are doing by using top -H -p <PID>.
Based on the output you can retrieve a thread id (and convert it to a hex value), take a thread dump,
and see (based on the hex value of the thread id) what the threads are doing, which is shown in the
stack trace that is dumped for the specific thread.
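The steps above can be sketched as follows (Linux; the PID below is a placeholder for the WebLogic server's process id, and 12345 stands in for a thread id read from top's output):

```shell
PID=$$                                   # placeholder: substitute the WebLogic server's pid
top -b -n 1 -H -p "$PID" | head -n 15    # one batch-mode snapshot of per-thread CPU usage
printf 'nid=0x%x\n' 12345                # decimal thread id -> the hex form used in dumps
jstack "$PID" > dump.txt 2>/dev/null || true   # thread dump (only works on a Java pid)
grep -A 20 'nid=0x3039' dump.txt || true       # that thread's stack trace, if present
```

In a HotSpot thread dump each thread line carries its native id as nid=0x..., which is why the decimal id from top has to be converted to hex first.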
Thanks for the reply. We are using Windows, not Linux.
I should have added that our scenario is ramping up users over a 1-2 hour period and hitting this issue with less than 200 users (typically 150-175). This is a load test, but a realistic load test where users have think times up to 30-60 seconds. This is not a stress test. Our server is very powerful with 12 physical CPUs (24 virtual) and 144 GB RAM. It is a physical box and not a VM image. No one else is on this machine during our tests.
I can say that we confirmed the TCP/IP tuning is correct: the TCP starting port is 1025 with the number of ports set to 64510. I did not check the number of connections in the TIME_WAIT state during the tests, but we do limit TIME_WAIT to 30 seconds via the TcpTimedWaitDelay registry setting, and with the load we were running I would not expect to have too many connections in that state.
We also confirmed that the Java heap size and garbage collection are not a bottleneck, that disk utilization and disk queue length are very low, and that context switching is not a problem.
One question I have is why would the queue length grow from ~50 to 8192 in just a couple of minutes with less than 200 users on the system?
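A rough sanity check on those numbers (assuming ~45 s average think time and a roughly two-minute window, both taken from the figures above):

```shell
# 200 users each issuing about one request per 45 s of think time:
awk 'BEGIN { printf "%.0f\n", 200 / 45 }'          # ~4 requests/second arriving
# Queue growth from ~50 to 8192 over ~120 s:
awk 'BEGIN { printf "%.0f\n", (8192 - 50) / 120 }' # ~68 requests/second of net backlog
```

A net backlog rate more than an order of magnitude above what the users can generate suggests the requests are being retried or amplified inside the server rather than arriving from the load tool.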
I can provide a jfr file if that will help. I don't see a way to do that here though.
Edited by: user722303 on Feb 28, 2013 10:54 AM
"One question I have is why would the queue length grow from ~50 to 8192 in just a couple of minutes with less than 200 users on the system?"
Can you check whether there is any recursion in the application? Also check how certain requests are handled: for example, are EJBs
(or something similar) calling other EJBs, which in turn call back the caller?
Also, are you using JMS in your application? Note that JMS (and also the transaction manager) serve so-called higher-priority requests,
for which overload management is provided by JMS and the transaction manager themselves. For JMS you can set thresholds and quotas; for JTA you can set a maximum number of transactions.
When using JMS, also check what happens once a message has been received and is being processed by the message listener.
When the queue length is reached, you can take a thread dump and check what the application is doing by examining the stack traces of the threads that are runnable.
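For a quick triage of such a dump (a sketch; dump.txt is assumed to hold jstack output taken while the queue was full, and the state strings are the standard HotSpot ones):

```shell
# Threads actively doing work vs. threads stuck on locks:
grep -c 'java.lang.Thread.State: RUNNABLE' dump.txt || true
grep -c 'java.lang.Thread.State: BLOCKED'  dump.txt || true
# Which monitor the blocked threads are queued on:
grep -B 2 'waiting to lock' dump.txt | head -n 30
```

Many RUNNABLE threads all inside the same web-service call, or many BLOCKED threads waiting on the same lock, point straight at the bottleneck.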
We are not using either EJBs or JMS directly. Ours is an ADF application using bean data controls, which talk to our back-end services via web services. The test scenario is exactly the same for each test cycle. We have about 8 test cases and the users are distributed between them. Users are repeating their tests over and over again.
A couple of things I think you should check:
1) Regarding using 200 users and having a queue of thousands of requests... When you say that you have 200 users, this means that you will have maximum 200 threads simultaneously on the client side, right?
2) What is the read timeout on the client side? That could explain the number of threads in use in weblogic. Threads might be left running, and if the client stops waiting for the response and initiates a new connection, it could lead to the scenario you are describing.
3) As you said that the web application starts consuming less CPU when the issue happens, while at the same time the queue increases, it seems to me that you might have contention on the web services. Are you also monitoring the web services? Do they run on a different server? Are they deployed on the same physical server but in a different WebLogic server, or in the same WebLogic server?
4) Did you run performance tests against the web services on their own to make sure that they scale?
5) In this situation, I would run a separate performance test against the web services. Then I would stub the web services and run a performance test against the web application. By stubbing the web services, I can create any kind of behaviour I want, e.g. simulate slow response times, respond with a consistent response time, etc.
6) Is there a read timeout between the web application and the web service?
Edited by: Fabio Douek on Mar 13, 2013 3:25 PM
I'm curious about your question:
"A couple of things I think you should check:
1) Regarding using 200 users and having a queue of thousands of requests... When you say that you have 200 users, this means that you will have maximum 200 threads simultaneously on the client side, right?"
We are having a similar issue, and I guess this is because there are thread pool limits (work managers). We have set higher values for the self-tuning thread pool, but honestly, it is still obscure how this really works. With execute queues it was much clearer.
Is there any documentation you are aware of that explains how to monitor threads and work managers?