...We're really at a dead end here in diagnosing this problem, as it clearly seems to be outside our app. Any suggestion?
This is the first I am hearing of a problem of such a bizarre nature.
What did truss reveal as to the identity of the thread(s) forking off the new process(es)
and the circumstances under which this was happening? What did ptree reveal as to
the ancestry of the descendant processes? For a sampling of "parent-child" pairs
in this tree/chain, could you post the complete pstack of the two processes?
Do you have a test case or a reproducible set-up that you could take to
your Sun support representative to open a ticket? Please make sure to quote the
exact version of Solaris 10 you were using when this phenomenon was observed.
We're also seeing this on a Solaris 10 machine. Our Java process sits for a long time, doing the same thing (which is sending UDP broadcasts and listening for a response), and at some point many hours later another process is forked from the JVM. This new process has the original JVM as its parent PID, and has not consumed any CPU (ps reports CPU usage of 0:00). The command line of the second process appears to be identical, indicating this is just a plain fork, and not an exec.
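For anyone who wants to catch such a child from inside the application rather than with ps, here is a minimal sketch that lists the JVM's direct children with their CPU time and command line. It assumes a Java 9 or later JVM (the ProcessHandle API does not exist in 1.5), and the class and method names are made up for illustration.

```java
import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;

public class ChildWatch {
    /** Returns one line per direct child of this JVM: pid, CPU time, command line. */
    static List<String> describeChildren() {
        return ProcessHandle.current().children()
                .map(ChildWatch::describe)
                .collect(Collectors.toList());
    }

    static String describe(ProcessHandle p) {
        ProcessHandle.Info info = p.info();
        // Both values are optional: the OS may not expose them for every process.
        String cpu = info.totalCpuDuration().map(Duration::toString).orElse("unknown");
        String cmd = info.commandLine().orElse("unknown");
        return "pid=" + p.pid() + " cpu=" + cpu + " cmd=" + cmd;
    }

    public static void main(String[] args) {
        for (String line : describeChildren()) {
            System.out.println(line);
        }
    }
}
```

Logging this periodically would show when a child with the JVM's own command line and zero CPU time first appears.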
Actually, we've discovered that that's not really what was going on. I still believe there's a bug in the JVM, but the fork was happening because our Java code tries to exec a command line tool once a minute. After hours of this, we get a rogue child process with this stack (which is where we are forking this command line tool once a minute):
JVM version is 1.5.0_08-b03
Thread t@38: (state = IN_NATIVE)
 - java.lang.UNIXProcess.forkAndExec(byte, byte, int, byte, int, byte, boolean, java.io.FileDescriptor, java.io.FileDescriptor, java.io.FileDescriptor) @bci=168980456 (Interpreted frame)
 - java.lang.UNIXProcess.forkAndExec(byte, byte, int, byte, int, byte, boolean, java.io.FileDescriptor, java.io.FileDescriptor, java.io.FileDescriptor) @bci=0 (Interpreted frame)
 - java.lang.UNIXProcess.<init>(byte, byte, int, byte, int, byte, boolean) @bci=62, line=53 (Interpreted frame)
 - java.lang.ProcessImpl.start(java.lang.String, java.util.Map, java.lang.String, boolean) @bci=182, line=65 (Interpreted frame)
 - java.lang.ProcessBuilder.start() @bci=112, line=451 (Interpreted frame)
 - java.lang.Runtime.exec(java.lang.String, java.lang.String, java.io.File) @bci=16, line=591 (Interpreted frame)
 - java.lang.Runtime.exec(java.lang.String, java.lang.String, java.io.File) @bci=69, line=429 (Interpreted frame)
 - java.lang.Runtime.exec(java.lang.String) @bci=4, line=326 (Interpreted frame)
 ...
 - java.lang.Thread.run() @bci=11, line=595 (Interpreted frame)

There are also several dozen other threads all with the same stack:

Thread t@32: (state = BLOCKED)
Error occurred during stack walking:
sun.jvm.hotspot.debugger.DebuggerException: can't map thread id to thread handle!
    at sun.jvm.hotspot.debugger.proc.ProcDebuggerLocal.getThreadIntegerRegisterSet0(Native Method)
    at sun.jvm.hotspot.debugger.proc.ProcDebuggerLocal.getThreadIntegerRegisterSet(ProcDebuggerLocal.java:364)
    at sun.jvm.hotspot.debugger.proc.sparc.ProcSPARCThread.getContext(ProcSPARCThread.java:35)
    at sun.jvm.hotspot.runtime.solaris_sparc.SolarisSPARCJavaThreadPDAccess.getCurrentFrameGuess(SolarisSPARCJavaThreadPDAccess.java:108)
    at sun.jvm.hotspot.runtime.JavaThread.getCurrentFrameGuess(JavaThread.java:252)
    at sun.jvm.hotspot.runtime.JavaThread.getLastJavaVFrameDbg(JavaThread.java:211)
    at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:50)
    at sun.jvm.hotspot.tools.JStack.run(JStack.java:41)
    at sun.jvm.hotspot.tools.Tool.start(Tool.java:204)
    at sun.jvm.hotspot.tools.JStack.main(JStack.java:58)

I'm pretty sure this is because the fork part of UNIXProcess.forkAndExec uses the Solaris fork1 system call, which copies only the calling thread into the child. The child's copy of the Java context still lists all those other threads, so jstack thinks they exist, whereas the actual threads don't exist in that process.
It seems to me that something is broken in the native code of UNIXProcess.forkAndExec: it did the fork but not the exec, and the forked child just sits there forever. And of course, the child is still holding all the file descriptors of the original process, which means that if we restart our process, we can't reopen our listening sockets or do whatever else needs those descriptors.
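For the once-a-minute exec itself, a bounded wrapper at least keeps the parent from waiting forever and closes the parent's ends of the pipes. This is only a sketch, not a fix for whatever is going wrong inside forkAndExec. It assumes Java 9 or later (for Redirect.DISCARD; the timed waitFor is Java 8), and the class name, method name, and the -1 timeout sentinel are my own choices.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class BoundedExec {
    /** Runs a command; returns its exit code, or -1 if it was killed after the timeout. */
    static int run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process p = pb.start();
        try {
            p.getOutputStream().close(); // nothing to send to the child's stdin
            if (p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return p.exitValue();
            }
            p.destroyForcibly();
            p.waitFor(); // reap the killed child
            return -1;
        } finally {
            if (p.isAlive()) {
                p.destroyForcibly(); // e.g. we were interrupted while waiting
            }
            p.getInputStream().close();
            p.getErrorStream().close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(List.of("sh", "-c", "exit 3"), 10));
    }
}
```

Whether destroyForcibly can reach a child that is stuck between fork and exec depends on whether the parent ever got the child's pid back, which the stack traces here don't tell us.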
There is another possibility, which I can't completely rule out: this child process just happened to be the one that was forked when the parent process called Runtime.halt(), which is how the Java process exits. In that case we decided to exit halfway through a Runtime.exec() and left this child process stuck. But I don't think that's what happens: from the data we collected, we see this same child process created at some point in time, and it doesn't go away.
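One way to rule that possibility out would be to make sure the exit path can never run while an exec is in flight. A rough sketch using a read/write lock follows; the class and method names are hypothetical, and the two Runnables stand in for the real Runtime.exec and Runtime.halt calls.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ExecGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Runs a launch action (e.g. a Runtime.exec call) while holding the shared lock. */
    void launch(Runnable action) {
        lock.readLock().lock();
        try {
            action.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Waits for in-flight launches to finish, then runs the exit action (e.g. Runtime.halt). */
    void shutdown(Runnable exitAction) {
        lock.writeLock().lock(); // blocks until no launch is in progress
        try {
            exitAction.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

If the rogue child still shows up with this in place, the halt-during-exec theory can be dropped.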
Yes, I realize that my JVM is very old, but I cannot find any bug fixes in the release notes that claim to fix something like this. And since this only happens once every day or two, I'm reluctant to just throw a new JVM at this--although I'm sure I will shortly.
Has anyone else seen anything like this?