A few Fridays back, in eager anticipation of the weekend, woke up to bad news on one of our production servers not accepting requests. Cursing the timing of the issue, logged into the weblogic console and found one of the servers in "warning" state. Drilling down into the server that was showing up "warning" state, found that there were stuck threads. And there started my tryst with unraveling the mystery of stuck threads.I take you along with me on this exploration on an analysis of a production issue of ours!
What is a stuck thread?
Java thread lifecycle involves the following stages: NEW, RUNNABLE,TIMED_WAITING, WAITING, BLOCKED, TERMINATED. Refer to this article for more information on thread state lifecycle.
Out of the many threads that the weblogic server's JVM is processing, the decision as to whether a thread is stuck is decided by this setting:
So based on the 'Max Stuck Thread Time', any thread that exceeds this time is marked as STUCK! Any RUNNABLE,TIMED_WAITING, WAITING, BLOCKED thread that exceeds this 'Max Stuck Thread Time' is marked as STUCK.
It would be interesting to note that a long running process could also be marked as STUCK when it really is not. An example is that a RUNNABLE thread running for more than10 minutes could be marked by the weblogic server as STUCK even though it may eventually finish in another few minutes.This is one reason why it is advisable to take multiple thread dumps over a period of time to zero down on the exact thread that is the cause of the issue.
Where do I start - Thread Dumps!
Thread dumps are a snapshot of the JVM at particular point of time listing out all the threads that are alive at that moment.
Discussion in this blog is restricted to a thread dump from our servers when a stuck thread was hogging the server.
"Benign" Threads - Ignore them!!
To start with on the analysis, the thread dump has a lot of content and it is important to know what to ignore. Usually there are many "benign" threads which are harmless and we should ignore these to narrow down to the actual thread that is causing an issue.
For ex. in the attached thread dump, ignore weblogic specific internal housekeeping threads.
- Ignore threads having Muxer on their trace.
"ExecuteThread: '0' for queue: 'weblogic.socket.Muxer'" id=73 idx=0x16c tid=8993 prio=5 alive, blocked, native_blocked, daemon -- Blocked trying to get lock: java/lang/String@0x1c7dcdcc8[fat lock] at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method) at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized] at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized] at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized] at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized] at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized] at weblogic/socket/EPollSocketMuxer.processSockets(EPollSocketMuxer.java:153) at weblogic/socket/SocketReaderRequest.run(SocketReaderRequest.java:29) at weblogic/socket/SocketReaderRequest.execute(SocketReaderRequest.java:42) at weblogic/kernel/ExecuteThread.execute(ExecuteThread.java:145) at weblogic/kernel/ExecuteThread.run(ExecuteThread.java:117) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace
Ignore threads having 'ExecuteThread.waitForRequest'
"[ACTIVE] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" id=6407 idx=0x180 tid=36008 prio=5 alive, waiting, native_blocked, daemon -- Waiting for notification on: weblogic/work/ExecuteThread@0x1f9daaec0[fat lock] at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method) at jrockit/vm/Locks.wait(Locks.java:1973)[inlined] at java/lang/Object.wait(Object.java:485)[inlined] at weblogic/work/ExecuteThread.waitForRequest(ExecuteThread.java:205)[inlined] at weblogic/work/ExecuteThread.run(ExecuteThread.java:233)[optimized] ^-- Lock released while waiting: weblogic/work/ExecuteThread@0x1f9daaec0[fat lock] at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) -- end of trace
Analyzing the remaining threads
Start from the "STUCK" thread and follow the chain of locks. In the thread dump, you here observe that the lock is held by JDBCConnection which was initiated by the application code 'BoxAppModuleImpl.saveCertification'. This code is the culprit and on drilling down further, this was firing a bad performing SQL which had a filter with a function that did not hit the index :
( (UPPER(PK_COL) = UPPER(:pVal) ) )
"[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" id=68 idx=0x160 tid=8984 prio=1 alive, in native, daemon at jrockit/net/SocketNativeIO.readBytesPinned(Ljava/io/FileDescriptor;[BIII)I(Native Method) at jrockit/net/SocketNativeIO.socketRead(SocketNativeIO.java:32)[inlined] at java/net/SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I(SocketInputStream.java)[inlined] at java/net/SocketInputStream.read(SocketInputStream.java:129)[optimized] at java/net/ManagedSocketInputStreamHighPerformanceNew.read(ManagedSocketInputStreamHighPerformanceNew.java:274)[optimized] at oracle/net/ns/Packet.receive(Packet.java:300) at oracle/net/ns/DataPacket.receive(DataPacket.java:106) at oracle/net/ns/NetInputStream.getNextPacket(NetInputStream.java:315)[optimized] at oracle/net/ns/NetInputStream.read(NetInputStream.java:260)[optimized] at oracle/net/ns/NetInputStream.read(NetInputStream.java:185)[inlined] at oracle/net/ns/NetInputStream.read(NetInputStream.java:102)[optimized] at oracle/jdbc/driver/T4CSocketInputStreamWrapper.readNextPacket(T4CSocketInputStreamWrapper.java:124)[inlined] at oracle/jdbc/driver/T4CSocketInputStreamWrapper.read(T4CSocketInputStreamWrapper.java:80)[optimized] at oracle/jdbc/driver/T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1137)[inlined] at oracle/jdbc/driver/T4CTTIfun.receive(T4CTTIfun.java:290)[optimized] at oracle/jdbc/driver/T4CTTIfun.doRPC(T4CTTIfun.java:192)[optimized] at oracle/jdbc/driver/T4C8Oall.doOALL(T4C8Oall.java:531)[inlined] at oracle/jdbc/driver/T4CPreparedStatement.doOall8(T4CPreparedStatement.java:207)[optimized] at oracle/jdbc/driver/T4CPreparedStatement.fetch(T4CPreparedStatement.java:1119) at oracle/jdbc/driver/OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:375)[inlined] at oracle/jdbc/driver/OracleResultSetImpl.next(OracleResultSetImpl.java:279)[optimized] ^-- Holding lock: oracle/jdbc/driver/T4CConnection@0x10f233a18[recursive] ^-- Holding lock: oracle/jdbc/driver/T4CConnection@0x10f233a18[thin lock] at weblogic/jdbc/wrapper/ResultSet_oracle_jdbc_driver_OracleResultSetImpl.next()Z(Unknown Source)[optimized] at oracle/jbo/server/QueryCollection.hasNextInResultSet(QueryCollection.java:4925)[optimized] at oracle/jbo/server/ViewObjectImpl.hasNextForCollection(ViewObjectImpl.java:7365) at oracle/jbo/server/QueryCollection.hasNext(QueryCollection.java:4893)[inlined] at oracle/jbo/server/QueryCollection.populateRow(QueryCollection.java:3793)[optimized] at oracle/jbo/server/QueryCollection.fetch(QueryCollection.java:3623)[optimized] ^-- Holding lock: oracle/jbo/JboSyncLock@0x1fdb66220[recursive] at oracle/jbo/server/QueryCollection.get(QueryCollection.java:2336)[optimized] at oracle/jbo/server/ViewRowSetImpl.getRow(ViewRowSetImpl.java:5180)[inlined] at oracle/jbo/server/ViewRowSetIteratorImpl.doFetch(ViewRowSetIteratorImpl.java:2971)[inlined] at oracle/jbo/server/ViewRowSetIteratorImpl.ensureRefreshed(ViewRowSetIteratorImpl.java:2827)[inlined] at oracle/jbo/server/ViewRowSetIteratorImpl.refresh(ViewRowSetIteratorImpl.java:3068)[inlined] at oracle/jbo/server/ViewRowSetImpl.notifyRefresh(ViewRowSetImpl.java:2841)[optimized] ^-- Holding lock: oracle/jbo/JboSyncLock@0x1fdb66220[recursive] at oracle/jbo/server/ViewRowSetImpl.execute(ViewRowSetImpl.java:1311)[inlined] at oracle/jbo/server/ViewRowSetImpl.executeQueryForMasters(ViewRowSetImpl.java:1449)[inlined] at oracle/jbo/server/ViewRowSetImpl.executeQueryForMode(ViewRowSetImpl.java:1355)[inlined] at oracle/jbo/server/ViewRowSetImpl.executeQuery(ViewRowSetImpl.java:1340)[inlined] at oracle/jbo/server/ViewObjectImpl.executeQuery(ViewObjectImpl.java:7236)[optimized] ^-- Holding lock: oracle/jbo/JboSyncLock@0x1fdb66220[recursive] ^-- Holding lock: oracle/jbo/JboSyncLock@0x1fdb66220[thin lock] at com/xyz/box/model/service/BoxAppModuleImpl.saveCertification(BoxAppModuleImpl.java:1496) at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method) at jrockit/vm/Reflect.invokeMethod(Ljava/lang/Object;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Native Method) at sun/reflect/NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Native Method) at sun/reflect/NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
We are bound to have more complex thread dumps where you may need to navigate through thread lock chains to arrive at the exact root cause.
And finally - The Remedy
As you would have guessed by now, there is no single shot solution to stuck thread. The solution approach relies on the diagnosis and eliminating the cause of the stuck thread which could be many based on the thread state -
i. The thread could have been waiting on a resource which is indicated by TIMED_WAITING or WAITING state of the thread.
ii. The thread could have been locked due a wrong synchronized block in the java code where in the developer needs to revisit the code. This would be the approach when the thread identified to be causing the issue is in BLOCKED state.
That said, from my experience, on usual applications, it is one of the below:
1. Bad Performing SQL - This usually is the culprit on data driven applications where we have a bad performing SQL hogging the server resources. This is indicated by a thread in 'alive' state and the stack trace showing PreparedStatement class. As a generic approach to prevent stuck threads from bad performing SQLs, we should look at the JDBC statement timeout settting in WLS console - Oracle Support Doc ID 1140473.1.
2. JCA Inbound Adapter - On SOA servers with OSB, we could see a flavour of stuck thread due to the polling nature of adapters. But I think Oracle has resolved this in the newer versions - Patch 14831608 is available for versions 18.104.22.168 and 22.214.171.124.
3. If the stack trace does not point to any clues and is ending with a framework class, it is beest to open an SR with Oracle with the thread dump uploaded.
It would be great to hear from you on any experiences you have had with stuck threads.