Recently I've been having issues in OLT with some VUs 'hanging' while others run OK. This tends to happen when running multiple scripts.
For example, each morning I run a smoke test of all our applications (multiple scripts, 10 iterations each, from 2 load agents). I can virtually guarantee that at least one of the 17 scripts will 'hang', yet the majority will run to completion OK. I work around this by simply re-running the whole test (a different VU will 'hang', but at least I've seen every script complete at least once).
It looks like the scripts are actually running (though I have seen at least one occurrence of a VU not even completing the 1st step in its script), but they get stuck on a step and never recover. Shouldn't the VU time out at some point? I know there are options for Socket Timeout / Request Timeout / Connection Idle Timeout (set to 120s / 120s / 60s), but these don't seem to apply to this scenario.
Has anyone seen this type of behaviour before?
Any suggestions are appreciated.
ADDENDUM: I suspect this is related to an issue with our applications where no response is received, but why doesn't the VU time out?
Yes, you are right: the requests should time out. Have you been able to inspect the agent log files? Are you running 17 virtual users (17 scripts) in parallel, or just 2 virtual users running 17 scripts one after another? I will assume you are running 17 virtual users, and you mentioned 2 agent load machines, so that is only 8 VUs on one agent machine and 9 on the other; hardly high stress. Still, these are some options when running a high number of virtual users:
1. Change the script profile (or the default scenario profile settings) to increase the Java agent heap size, e.g. =4096
2. And/or change the profile for "Maximum users per process". Set to e.g. half the number of virtual users you are running. This will kick off 2 agent processes on each agent load machine.
3. Though probably not necessary in your case, this is a common Oracle Support provided solution for running lots of virtual users from the same agent machine.
Tune TCP/IP parameters on Windows as shown below.
Launch the Windows Registry Editor (regedit)
Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TcpIP\parameters
Configure the following two parameters. If not found, create those parameters by selecting Edit -> New -> DWORD Value from the menu bar. Select "Decimal" under Base.
TcpTimedWaitDelay : 30 [seconds]
MaxUserPort : 65534
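If you would rather not click through the Registry Editor on every agent machine, the two values above can also be written as a .reg file. This is only a minimal sketch: the helper name `render_reg_file` is made up for illustration, the key path and values are copied from the steps above, and it only prints the file content rather than touching the registry.

```python
# Key path as given in the steps above (registry paths are case-insensitive).
TCPIP_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TcpIP\parameters"


def render_reg_file(key, dword_values):
    """Return .reg file text that sets the given DWORD values under key.

    Hypothetical helper for illustration; values are decimal integers and
    are written in the 8-digit hex form that .reg files use for DWORDs.
    """
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dword_values.items():
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\r\n".join(lines) + "\r\n"


reg_text = render_reg_file(
    TCPIP_KEY,
    {"TcpTimedWaitDelay": 30, "MaxUserPort": 65534},
)
print(reg_text)
```

Save the output as a .reg file and import it with administrator rights. TCP/IP parameters like these are generally only picked up after a restart.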
Thanks Glenn. I leaped straight on the registry changes only to find that I'd already made them! I now recall that they are mentioned somewhere in the release/installation notes (at the time I think I was looking into "ephemeral port exhaustion").
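For anyone wondering why those two values matter for port exhaustion, here is a rough back-of-the-envelope sketch. It rests on assumptions that are not from this thread: that the ephemeral range starts at port 1025, that older Windows defaults are MaxUserPort 5000 and TcpTimedWaitDelay 240 seconds, and that every closed connection holds its port for the full TIME_WAIT delay. The function name is made up.

```python
def ephemeral_port_budget(max_user_port, timed_wait_delay_s, first_port=1025):
    """Estimate usable ephemeral ports and the sustainable new-connection rate.

    Assumes (simplification) that each closed connection keeps its local
    port unavailable for the whole TIME_WAIT delay.
    """
    ports = max_user_port - first_port + 1
    return ports, ports / timed_wait_delay_s


# Assumed older Windows defaults versus the tuned values from this thread.
default_ports, default_rate = ephemeral_port_budget(5000, 240)
tuned_ports, tuned_rate = ephemeral_port_budget(65534, 30)

print(f"defaults: {default_ports} ports, ~{default_rate:.1f} new connections/s")
print(f"tuned:    {tuned_ports} ports, ~{tuned_rate:.0f} new connections/s")
```

With only 30 VUs the tuned budget should be far more than enough, which fits with the registry changes not making any difference here.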
Unfortunately I've still got a long-standing Support Request about the inability to see the VU Log "Content" (since this was changed in v12.1) so I can't check easily what's actually happening to the ones which 'hang'. However, I suspect this is due to a known issue with the test environment at the moment.
I've already got the default options set to 20 users per process (a historical issue with a particular script) but will try reducing this along with your other suggestion of the heap size.
To clarify something though. In this particular scenario there are actually a total of 13x2 + 4x1 = 30VUs. This isn't actually a performance test, purely a quick way of ensuring everything is working.
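The arithmetic, plus the effect of the "Maximum users per process" setting, can be sketched as below. The assumptions are mine: I read 13x2 + 4x1 as 13 scripts with 2 VUs each and 4 scripts with 1 VU, I assume a roughly even split of 15 VUs per agent, and I assume the agent starts one process per "Maximum users per process" block of users (rounded up), which matches the "half the users gives 2 processes" suggestion. The function names are made up.

```python
import math


def total_vus(groups):
    """Sum VUs over (number_of_scripts, vus_per_script) pairs."""
    return sum(scripts * vus for scripts, vus in groups)


def agent_processes(users_on_agent, max_users_per_process):
    """Assumed rule: one agent process per block of max_users_per_process users."""
    return math.ceil(users_on_agent / max_users_per_process)


total = total_vus([(13, 2), (4, 1)])
print(total)

# Assumed even split of 15 VUs per agent machine.
print(agent_processes(15, 20))  # current setting of 20 users per process
print(agent_processes(15, 8))   # roughly half the users per process
```

So with the current setting of 20 each agent would run everything in a single process, and dropping it to about 8 should split each agent into two.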
Your clarification that there are only 30 virtual users, along with the note in your original post about only 10 iterations, confirms that this is hardly high stress on the agent machines. Perhaps, if your scripts are very big, the increased heap size and splitting the users across more processes will help. Do you have at least a 1 second iteration delay? Are you downloading images? Roughly how many pages per script, and roughly how many navigations per page? Do you have the virtual users slowed down with at least a 1 second think time between those pages?
Why do you need two agent machines for only 30 virtual users? If you run the virtual users from only one agent machine, or from the controller itself, is the problem worse?
Oh, are you using http.beginConcurrent() in your scripts to asynchronously request a number of navigations? If so, have you increased the maxConnections in the profile settings?
This is actually a modified "peak production volume" test, but with a minimal number of VUs/iterations, purely used as a "smoke test" to ensure everything is working (typically only run first thing in the morning in case there have been any changes overnight). It is also run after any major changes, before starting any "real" performance tests.
The only reason there are 2xVUs/Agents is so that we can generate load from 2 IP addresses (we don't really need 2 agent machines other than for this reason). Plus it can highlight if I've forgotten to upgrade the agent on the other server <cough> ;-) A similar issue occurs if I only use one agent.
These won't be downloading images but a few of the scripts involve 20-40 steps, so are quite large.
I'm almost certain this is actually an issue due to problems with our test environment at the moment. Normally this scenario runs fine (and in fact the peak production scenario it is based on can run quite happily for hours). However, it looks like if the scripts encounter an issue (no response at key steps?) they do not time out. The odd thing is that I would expect OLT to time out when this occurs, but I haven't been seeing this.
Annoyingly, the environment is "behaving itself" at the moment, so I can't try your suggestions of heap size / iteration delay (I originally had 0 delay but have also tried with 1s or 10s in the past). However, I'm sure the environmental issue will resurface in the next couple of days.
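To illustrate the behaviour I would expect, here is a generic sketch of an overall deadline wrapped around a single step. This is not OLT or OpenScript code and says nothing about how OLT implements its timeouts; `run_step_with_deadline` is a made-up name and the "step" is just any callable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as StepTimeout


def run_step_with_deadline(step, deadline_s):
    """Run step() in a worker thread and give up waiting after deadline_s seconds.

    Returns ("ok", result) or ("timed out", None). Hypothetical helper.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(step)
    try:
        return ("ok", future.result(timeout=deadline_s))
    except StepTimeout:
        return ("timed out", None)
    finally:
        # Do not wait for a stuck step. Python cannot kill the worker thread,
        # so it keeps running in the background until the step returns.
        pool.shutdown(wait=False)


print(run_step_with_deadline(lambda: "fast step", 1.0))
print(run_step_with_deadline(lambda: time.sleep(0.5), 0.05))
```

The point is only that the deadline covers the whole step, however the step is stuck, so the caller always gets control back and can log the failure.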