We are encountering a JVM process that dies with little explanation other than an exit code of 141. There is no HotSpot error file (hs_err_*) and no crash dump. To date, the process runs anywhere from 30 minutes to 8 days before the problem occurs. The last application log entry is always the report of a lost SSL connection, the result of a thrown SSLException. (The exception itself is unavailable at this time; the JVM dies before it is logged. Working on that.)
How can a JVM produce an exit code of 141, and nothing else? Can anyone suggest ideas for capturing additional diagnostic information? Any help would be greatly appreciated! Environment and efforts to date are described below.
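For context, on Linux an exit code of 141 typically follows the convention of 128 + signal number, i.e. termination by SIGPIPE (13). A small demo (hypothetical code, not from the application) shows how the JDK reports a signal-terminated child through Process.waitFor():

```java
public class ExitCode141Demo {
    public static void main(String[] args) throws Exception {
        // The shell sends SIGPIPE to itself; a child terminated by a signal
        // is reported by waitFor() as 128 + signal number, so SIGPIPE (13)
        // surfaces as exit code 141.
        Process p = new ProcessBuilder("sh", "-c", "kill -s PIPE $$").start();
        int exit = p.waitFor();
        System.out.println("child exit code: " + exit);
    }
}
```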
Host machine: 8x Xeon server with 256GB memory, RHEL 6 (or RHEL 5.5) 64-bit
Java: Oracle Java SE 7u21 (or 6u26)
java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
Diagnostics attempted to date:
- LD_PRELOAD=libjsig.so. A modified version of libjsig.so was created to report all signal handler registrations and to report SIGPIPE signals received. (Exit code 141 could be interpreted as 128+SIGPIPE(13).) No JNI libraries are registering any signal handlers, and no SIGPIPE signal is reported by the library for the duration of the JVM run. Calls to ::exit() are also intercepted and reported. No call to exit() is reported.
- Inspect /var/log/messages for any indication that the OS killed the process, e.g. via the Out Of Memory (OOM) Killer. Nothing found.
- Set 'ulimit -c unlimited', in case the default limit of 0 (zero) was preventing a core file from being written. Still no core dump.
- 'top' reports the VIRT size of the process can grow to 20GB or more in a matter of hours, which is unusual compared to other JVM processes. The RES (resident set size) does not grow beyond about 375MB, however, which is considered normal.
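To complement the libjsig.so interception described in the first bullet above, a SIGPIPE handler can also be installed from inside the JVM via the unsupported sun.misc.Signal API, assuming the VM will hand the signal over (it will not under -Xrs). A hedged sketch, with a hypothetical class name:

```java
import sun.misc.Signal;
import sun.misc.SignalHandler;

// Sketch only: sun.misc.Signal is an unsupported, JDK-internal API, and
// the VM refuses to delegate signals it reserves for itself. If the
// registration succeeds, the handler logs each SIGPIPE delivered.
public class SigPipeProbe {
    static volatile int pipeCount = 0;

    public static void install() {
        Signal.handle(new Signal("PIPE"), new SignalHandler() {
            public void handle(Signal sig) {
                pipeCount++;
                System.err.println("SIGPIPE received: " + sig);
            }
        });
    }

    public static void main(String[] args) {
        install();
        System.out.println("SIGPIPE probe installed");
    }
}
```

Note that a SIGPIPE caught this way would not itself kill the process; the point is to confirm whether the signal is arriving at all, from a second vantage point inside the JVM.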
This JVM process creates many short-lived Thread objects by way of a thread pool, averaging 1 thread every 2 seconds, and these objects end up referenced only by a weak reference. The CMS collector seems lazy about collecting them, and upwards of 2000 Thread objects have been seen (in heap dumps) held only by weak references. (The Java heap averages about 100MB, so the collector is not under any pressure.) However, a forced collection (via jconsole) cleans out the Thread objects as expected. No relationship between this and the VIRT size or the JVM disappearance can be established, however.
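The clean-up behavior described above can be reproduced in isolation. The sketch below (hypothetical code, not from the application) parks a terminated Thread behind a WeakReference and shows that the referent is only cleared once a collection actually runs:

```java
import java.lang.ref.WeakReference;

public class WeakThreadDemo {
    public static void main(String[] args) throws Exception {
        Thread t = new Thread(new Runnable() {
            public void run() { /* short-lived task */ }
        });
        t.start();
        t.join();                       // the thread has terminated...
        WeakReference<Thread> ref = new WeakReference<Thread>(t);
        t = null;                       // ...and is now only weakly reachable

        // The dead Thread object survives until the collector actually runs,
        // which CMS may defer indefinitely when the heap is under no pressure.
        for (int i = 0; i < 10 && ref.get() != null; i++) {
            System.gc();                // a forced collection clears it
            Thread.sleep(50);
        }
        System.out.println("cleared: " + (ref.get() == null));
    }
}
```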
The process also uses NIO and direct buffers, and maintains a DirectByteBuffer cache. There is some DirectByteBuffer churn. MBeans report stats like:
Direct buffer pool: allocated=669 (20,824,064 bytes), released=665 (20,725,760), active=4 (98,304) [note: equals 2x 32K buffers and 2x 16K buffers]
java.nio.BufferPool > direct: Count=18, MemoryUsed=1343568, TotalCapacity=1343568
These numbers appear normal and also do not seem to correlate with the VIRT size or the JVM disappearance.
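For reference, the java.nio.BufferPool figures above can also be read in-process through the standard BufferPoolMXBean API (available since Java 7); a minimal sketch:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class BufferPoolStats {
    public static void main(String[] args) {
        ByteBuffer.allocateDirect(16 * 1024); // ensure the direct pool is non-empty
        // The platform exposes one BufferPoolMXBean per pool ("direct", "mapped").
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.println(pool.getName()
                    + ": Count=" + pool.getCount()
                    + ", MemoryUsed=" + pool.getMemoryUsed()
                    + ", TotalCapacity=" + pool.getTotalCapacity());
        }
    }
}
```

Logging these periodically from within the application would make it possible to correlate buffer-pool growth against the VIRT size without attaching jconsole.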