Today I've digged deeper into one of the nastiest problems I ever encounted since I became a Java programmer.
The problem was this — I have the Hudson slave agent program, which blocks on stdin for read almost all the time. This process uses other threads to perform other activities (what this thread does is actually driven by what it reads from stdin, but that's irrelevant to this bug.)
On Windows, I've seen on several occasions that this process hangs. The interesting thing about this hang is that it doesn't react to jstack, and so I couldn't see the stack trace (whereas if this was caused by a usual Java level deadlock, say, incorrect use of synchronized, jstack would show you how the deadlock is occurring.)
If I bring in a Windows native debugger, I can see a number of threads all suspending at Win32 system calls (KiFastSystemCallRet, to be exact), but I couldn't really make any sense out of it. You see, when you look at a Java process from a Windows debugger, mostly you just see what appears to be a JIT-generated code in the call stack, so it really doesn't tell you what's going on.
In an attempt to better understand the problem, I start a JVM with a debug option, which magically makes the problem disappear. So for a longest time, my "workaround" for this problem was to run the JVM with a debugger support.
A break-through happened, when I started seeing this in my unit tests, and so decided to revisit this issue. Instead of usual Visual Studio debugger, this time I was using WinDbgbecause I didn't wait to install Visual Studio on this virtual machine that I run Windows in.
After attaching to this JVM, WinDbg failed to break the execution. But unlike Visual Studio, it reported this helpful message:
Break-in sent, waiting 30 seconds... WARNING: Break-in timed out, suspending. This is usually caused by another thread holding the loader lock
What's the "loader lock"? Does anyone hold it? With this info, I started looking at the stack trace more carefully, and indeed I found the thread that appears to be holding this lock:
12 Id: 814.8b8 Suspend: 1 Teb: 7ffa8000 Unfrozen ChildEBP RetAddr 02f1e750 7c90d99a ntdll!KiFastSystemCallRet 02f1e754 7c810f63 ntdll!NtQueryVolumeInformationFile+0xc 02f1e784 77c2c9f9 kernel32!GetFileType+0x7e 02f1e7e8 77c1f01d msvcrt!_ioinit+0x19f 02f1e88c 7c90118a msvcrt!__CRTDLL_INIT+0xac 02f1e8ac 7c91c4fa ntdll!LdrpCallInitRoutine+0x14 02f1e9b4 7c916371 ntdll!LdrpRunInitializeRoutines+0x344 02f1ec60 7c9164d3 ntdll!LdrpLoadDll+0x3e5 02f1ef08 7c801bbd ntdll!LdrLoadDll+0x230 02f1ef70 7c801d72 kernel32!LoadLibraryExW+0x18e 02f1ef84 7c801da8 kernel32!LoadLibraryExA+0x1f 02f1efa0 77de8830 kernel32!LoadLibraryA+0x94 02f1f05c 6d3eb1be ADVAPI32!CryptAcquireContextA+0x512 WARNING: Stack unwind information not available. Following frames may be wrong. 02f1f13c 6d99c844 java_6d3e0000!Java_sun_security_provider_NativeSeedGenerator_nativeGenerateSeed+0x6e 02f1f154 6d99cf9c jvm!AsyncGetCallTrace+0x3f374 02f1f1cc 6da241e1 jvm!AsyncGetCallTrace+0x3facc 02f1f210 6d99d01d jvm!JVM_FindSignal+0x53111 02f1f22c 6d98d64c jvm!AsyncGetCallTrace+0x3fb4d 02f1f2e8 6d98d88b jvm!AsyncGetCallTrace+0x3017c 02f1f300 6d98db52 jvm!AsyncGetCallTrace+0x303bb
Someone is callingsun.security.provider.NativeSeedGenerator.nativeGenerateSeed(...), which results in the system loading a DLL — presumably some cryptography related DLL. The initialization of this includes_ioinit from Visual C runtime library (MSVCRT), which results in the GetFileType system call, and that's hanging.
For those of you lucky enough who have never done Win32 programming, GetFileType is a Windows system call that tells you the kind of a file descriptor, given a file descriptor (why C runtime needs to know, I have no idea.) MSDN article on GetFileType doesn't say anything, but the source code of _ioinit has a comment that saysGetFileType will hang if there are other pending read operations on the same file handle.
Of course, I know of one such thread — the one that's reading stdin! I can see its stack trace through WinDbg:
10 Id: 814.880 Suspend: 1 Teb: 7ffaa000 Unfrozen ChildEBP RetAddr 02e7d7e4 7c90d9da ntdll!KiFastSystemCallRet 02e7d7e8 7c801879 ntdll!NtReadFile+0xc 02e7d850 6d3e4a78 kernel32!ReadFile+0x16c WARNING: Stack unwind information not available. Following frames may be wrong. 02e7d870 6d3e422d java_6d3e0000!handleRead+0x25 02e7f898 6d3e1dae java_6d3e0000!VerifyClassCodesForMajorVersion+0x16a 02e7f8b8 00979e27 java_6d3e0000!Java_java_io_FileInputStream_readBytes+0x1d 02e7f904 00972ec5 0x979e27 02e7f944 00972ec5 0x972ec5 02e7f97c 00972da1 0x972ec5 02e7f9bc 00972ec5 0x972da1 02e7f9ec 00972ec5 0x972ec5 02e7fb14 6d99c844 0x972ec5 02e7fb2c 6d99cf9c jvm!AsyncGetCallTrace+0x3f374 02e7fba4 6da241e1 jvm!AsyncGetCallTrace+0x3facc 02e7fbe8 6d99d167 jvm!JVM_FindSignal+0x53111 02e7fc24 6d99d1dd jvm!AsyncGetCallTrace+0x3fc97 02e7fcd0 6d9c6290 jvm!AsyncGetCallTrace+0x3fd0d 02e7fd24 6da80424 jvm!jio_fprintf+0x140 02e7fd50 6da23e5c jvm!JVM_FindSignal+0xaf354 02e7ff80 7c349565 jvm!JVM_FindSignal+0x52d8c
And this makes it easy to understand why the use of the debug option eliminates the issue — it presumably changes the order of the DLL loading, in such a way that this race condition doesn't occur, at least with the crypto DLL.
Now I understand the problem — if you have a thread that's doing a blocking read on stdin, DLL load will block. The reason this problem doesn't manifest more often is that as soon as you send something to its stdin, the deadlock will resolve itself. Also, you only tend to actually load more DLLs early in the life of a process (later LoadLibrary calls more likely just find what you've already loaded, rather than loading a brand-new DLL.) This explains why I only see this during the start up of a Hudson slave.
So the challenge now is how to fix it, and this is the part that I still don't have a good answer, and this is a part of the reason why I'm writing this. I thought of 2 approaches, but neither of them work:
- Duplicating the handle into a higher value
- Upon reading _ioinit source code, I noticed thatGetFileType call only happens for stdin,stdout, and stderr. So if I could DuplicateHandle stdin to a higher file descriptor number, I can work around this problem rather nicely. Unfortunately, Java doesn't appear to have any API to duplicate a file handle. So in short of using JNI or JNA, I can't duplicate a handle, and that's a dependency I couldn't afford in this part of Hudson.
- Read stdin with timeout
- If I can read from stdin with timeout and instead do so in a loop, I can let other threads proceed with GetFileType, thus breaking the deadlock. Unfortunately, FileChannel is not SelectableChannel, so even with NIO, I really can't seem to issue a read with timeout.
In the end, I resorted to a very poor approach; I callSecureRandom before I start reading stdin to eagerly load whatever crypto DLL that it needs to load. This apparently "fixes" the problem in this case, but clearly, it's not a real solution as it can happen with any DLL.
If anyone has a better idea about how to fix this, please let me know. And hopefully Google indexes this blog so that others in the same boat won't have to waste as much time as I did.