This certainly looks like an I/O issue on the log filesystem. Did you happen to capture any I/O metrics for that disk device around the time of the issue? Also, you have the TimesTen checkpoint files and transaction logs co-located, which is not recommended. What kind of virtualisation are you using, and is TimesTen running inside a VM? Many types of virtualisation have a negative impact on I/O. There is not much else to be said based on the available info. I would suggest that if this happens again you:
1. Capture snapshots from sys.monitor table every 10 seconds throughout the stress test.
2. Capture checkpoint history (call ttCkptHistory) every few minutes throughout the stress test.
3. Capture O/S level I/O metrics for all disks during the stress test.
4. Not kill the 'blocked' process after 20 seconds but wait and see if it recovers; if this is due to an 'overload' then it should eventually become unblocked.
Feel free to post this info here but better still log an SR with Oracle support.
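The capture steps above could be scripted roughly as follows. This is only a sketch under stated assumptions, not a supported tool: the DSN name my_dsn, the output file names, and the 3-minute checkpoint-history interval are placeholders, and it assumes Python 3.6+ plus ttIsql and iostat on the PATH. The iostat flags shown are the Solaris form; on Linux, "iostat -x" would be the rough equivalent.

```python
import subprocess
import time
from datetime import datetime

DSN = "my_dsn"              # hypothetical DSN; replace with your own
MONITOR_INTERVAL_S = 10     # sys.monitor snapshot every 10 seconds
CKPT_INTERVAL_S = 180       # "every few minutes"; 3 minutes is an assumption
TICKS_PER_CKPT = CKPT_INTERVAL_S // MONITOR_INTERVAL_S


def ttisql_command(dsn, sql):
    """Build a non-interactive ttIsql invocation that runs one statement."""
    return ["ttIsql", "-v", "1", "-connStr", "DSN=" + dsn, "-e", sql + "; quit;"]


def captures_for_tick(tick):
    """Return the statements to run on a given 10-second tick."""
    statements = ["select * from sys.monitor"]
    if tick % TICKS_PER_CKPT == 0:
        statements.append("call ttCkptHistory")
    return statements


def run_capture(duration_s, dsn=DSN, out_path="tt_capture.log",
                iostat_path="iostat.log"):
    """Collect TimesTen snapshots and O/S I/O metrics for duration_s seconds."""
    with open(out_path, "a") as out, open(iostat_path, "a") as io_out:
        # Extended per-device statistics for all disks (Solaris syntax).
        iostat = subprocess.Popen(
            ["iostat", "-xn", str(MONITOR_INTERVAL_S)],
            stdout=io_out, stderr=subprocess.STDOUT)
        try:
            start = time.monotonic()
            tick = 0
            while time.monotonic() - start < duration_s:
                for sql in captures_for_tick(tick):
                    # Timestamp each snapshot so it can be lined up with iostat.
                    out.write("--- %s %s\n" % (datetime.now().isoformat(), sql))
                    out.flush()
                    subprocess.run(ttisql_command(dsn, sql), stdout=out,
                                   stderr=subprocess.STDOUT, check=False)
                tick += 1
                # Sleep until the next tick boundary to avoid drift.
                next_at = start + tick * MONITOR_INTERVAL_S
                time.sleep(max(0.0, next_at - time.monotonic()))
        finally:
            iostat.terminate()
            iostat.wait()
```

Calling run_capture(3600), for example, would collect for one hour. Starting it just before the stress test and leaving it running through any stall gives timestamped snapshots covering the blocked period.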
Thanks a lot for offering the further explanation and suggestions.
On the Netra T2000 box, we run Sun Logical Domains with Solaris 10. A TimesTen instance and an application instance run in each VM. As we run 8 instances on one physical host, it seems expensive to provide separate disks for the different log files and for each VM. As the issue happened at midnight on a weekend, we had no I/O metrics for that time when doing the post-mortem analysis.
Based on the information you provided and our logs and performance counters, we tend to believe this issue was caused by disk access, though we don't understand why the TimesTen function was stuck for more than 20 seconds while other processes could still write logs during the same period. In real deployments we would have much less traffic at midnight, so we now believe the issue may not be as severe as it appears in the lab. This is the first time (and the only time so far) we have run into the issue. We will collect the information you suggested if the issue happens again and then contact you or open a ticket.
Separate disks are the optimal setup, but of course ultimately it all depends on the overall I/O load from both TimesTen and other sources. While I have a strong suspicion that this was due to I/O 'overload', I am not discounting other possibilities, hence my suggestion that if it occurs again an SR would be the best route to get an in-depth investigation and definitively establish the root cause. I agree that a blockage of 20 seconds seems very excessive, but there could have been multiple factors involved (e.g. a blocked transaction may have been holding locks, which then caused other transactions to go into lock waits, etc.).