WL : 10.0 MP1
Config: Clustered, with 9 Managed Servers on 3 hosts, load balanced behind an F5
App Profile: WebServices only - no web ui
We have an app that was running stably with high availability until the vendor and the customer ran into a discrepancy over the performance of the application.
The flow is somewhat like this:
(Customer App) ---- LAN ---- (Customer F5) ---- Dedicated T1 line ---- (Vendor F5) ---- LAN ---- (WebLogic App)
Initially we were measuring the response times in the WebLogic application logs, which were not in line with what the customer was measuring at their F5 layer.
To get to the bottom of the problem we started measuring the response times at 3 layers:
1. Application logs
2. WebLogic web server access logs
3. F5 logs
What we are noticing is sporadic latency between 1 & 2 and also between 2 & 3.
This is not happening for any specific API, managed server, or host, or at a particular time of day. Yes, it does happen more during peak hours, but that goes with any app.
When I sample calls at a random time, I see a latency of anywhere between 300 ms and 2 seconds between the app and the web server, and also between the web server and the F5, sometimes at the same time and sometimes not.
Since it is happening both within WebLogic (i.e. between the app container and the web server) and outside WebLogic (i.e. between WebLogic and the F5), I thought I would clean my own house first before I go to the network team.
Are there any tools to debug issues in the webserver layer?
Are there any known bugs in the Apache that is bundled with WL 10.0 MP1?
Are there any diagnostics I can collect that may help get to the bottom of this?
Any other ideas you may have are welcome.
I can see that you have the WebLogic app server, but are you saying that you also have a web server such as Apache or SunOne iPlanet sitting in front of WLS as a reverse proxy?
Either way, the approach I would suggest is to first identify whether the delay you are observing occurs after the request has reached the WebLogic server.
1) To check that, you could hit the application directly, or invoke the web service deployed on WLS, bypassing the LB and any other web server components.
2) Once you have a reading of the response time from hitting the WLS server directly, you may want to move back a layer and measure the response time when accessing the application through the web server.
3) After that, you may want to access the application through the LB and look at the response time.
By breaking your tests in this way you could identify where the possible delay is.
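To make the staged comparison above concrete, here is a minimal sketch of a timing harness. It is only an illustration: the URLs in the trailing comment are hypothetical placeholders, and it issues plain GET requests, whereas a real test against a web service would POST a representative SOAP request. The idea is simply to run the same call against each entry point and compare the median, 95th percentile and maximum.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return wall-clock seconds for one GET of url, including reading the body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def summarize(samples_ms):
    """Return median, nearest-rank 95th percentile and max of millisecond timings."""
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer arithmetic.
    rank = max(1, (95 * len(ordered) + 99) // 100)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def compare(targets, runs=50, fetch=time_request):
    """targets maps a label (e.g. 'direct', 'via_lb') to a URL.

    fetch is injectable so the harness can be exercised without a network.
    """
    report = {}
    for label, url in targets.items():
        samples_ms = [fetch(url) * 1000.0 for _ in range(runs)]
        report[label] = summarize(samples_ms)
    return report


# Example call (hypothetical host names, adjust to your environment):
# compare({
#     "direct": "http://managed-server.example:7001/myservice?wsdl",
#     "via_lb": "http://f5-vip.example/myservice?wsdl",
# })
```

If the direct numbers stay flat while the numbers through the LB show a long tail, that points away from WLS; if both show the same tail, the delay is more likely inside the server.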
Now, coming to diagnosing the slowness with tools:
1) You could use a network tool like Wireshark to capture the TCP packets on the web server or the app server and inspect the delay based on the time the request reaches the server in the TCP stream.
2) If you are using a browser, you can use the iehttpheader utility, which tracks all the header information to and from the browser.
3) If you have a web server with the proxy plugin configured, you may want to enable "DEBUG=ALL" for additional logging, which will also show you the timestamps of the requests.
4) If you want to profile application calls, you may want to use one of the numerous profiler tools available on the market, such as JProfiler.
There is no standalone web server as such; I was referring to the web server layer within WebLogic.
We are not trying to solve the latency of the call up to its arrival in the application queue, so to speak. As I mentioned, the application logs suggest most of the calls are being serviced in < 100 ms.
When you look at the web server logs (i.e. the HTTP access.log within WebLogic), we sporadically notice a latency of anywhere between 100 ms and 2 seconds on top of what the application took to service the call.
And when you look at the F5 logs, we notice a further addition of 100 ms to 2 seconds.
Remember, this is sporadic; I cannot recreate this scenario at will.
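For anyone wanting to line up the three measurements described here, a rough sketch follows. It assumes each layer's log has already been reduced to a mapping of request ID to elapsed milliseconds. That shared request ID is an assumption on my part (for example a correlation header logged at every layer); nothing in this thread says one exists. The function names and the 100 ms threshold are illustrative only.

```python
def layer_overheads(app_ms, access_ms, f5_ms):
    """Compute the time added at each hop for requests seen at all three layers.

    Each argument is a dict of request_id -> elapsed milliseconds as recorded
    by the application log, the WebLogic access log and the F5 log.
    """
    rows = []
    common_ids = app_ms.keys() & access_ms.keys() & f5_ms.keys()
    for rid in sorted(common_ids):
        rows.append({
            "id": rid,
            "app": app_ms[rid],
            # Time added between the application and the WebLogic HTTP layer.
            "weblogic_overhead": access_ms[rid] - app_ms[rid],
            # Time added between the WebLogic HTTP layer and the F5.
            "after_weblogic": f5_ms[rid] - access_ms[rid],
        })
    return rows


def slow_rows(rows, threshold_ms=100):
    """Keep only the requests where either hop added at least threshold_ms."""
    return [
        r for r in rows
        if r["weblogic_overhead"] >= threshold_ms
        or r["after_weblogic"] >= threshold_ms
    ]
```

Separating the two overheads per request shows whether the slow calls inside WebLogic are the same calls that are slow at the F5, or two unrelated populations.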
We tried to run a controlled load test using soapUI, aiming first at one managed server and then at the F5 load balancer, but we could not recreate the issue.
I was looking for ideas more along the lines of passing JVM parameters so that more data is captured in the WebLogic logs or HTTP access logs to help me debug this problem.
The problem with troubleshooting at the networking layer is that we notice this only at production volumes, and capturing packets to solve this problem is like looking for a needle in a haystack.
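One way to shrink the haystack might be to use the logs that already exist to decide when and where to capture packets. The sketch below assumes the slow calls can be exported as (ISO timestamp, managed server name, overhead in ms) tuples; that record layout and the function name are hypothetical, and the 300 ms threshold is just the lower bound mentioned earlier in the thread. It counts slow requests per minute and per server so the busiest windows stand out.

```python
from collections import Counter
from datetime import datetime


def slow_windows(records, threshold_ms=300, top=5):
    """Return the (minute, server) buckets with the most slow requests.

    records is an iterable of (iso_timestamp, managed_server, overhead_ms).
    """
    counts = Counter()
    for ts, server, overhead_ms in records:
        if overhead_ms >= threshold_ms:
            # Truncate the timestamp to the minute to form the bucket key.
            minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
            counts[(minute.isoformat(), server)] += 1
    return counts.most_common(top)
```

If the slow calls cluster in a few minutes on a few servers, a short targeted capture on those hosts becomes more realistic than capturing at full production volume.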