This discussion is archived
1 2 Previous Next 15 Replies Latest reply: Jul 26, 2012 6:42 AM by greybird RSS

Deadlock found in je-5.0.48

928454 Newbie
Currently Being Moderated
Hi,

We've seen what looks like a deadlock in je-5.0.48. Here are the offending threads:


_"RMI TCP Connection(5358)-10.119.152.33" - Thread t@41433_
java.lang.Thread.State: WAITING
     at sun.misc.Unsafe.park(Native Method)
     - parking to wait for <469f6330> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
     at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
     at com.sleepycat.je.latch.SharedLatch.acquireExclusive(SharedLatch.java:95)
     at com.sleepycat.je.tree.IN.latch(IN.java:473)
     at com.sleepycat.je.tree.IN.findParent(IN.java:3149)
     at com.sleepycat.je.tree.Tree.getParentINForChildIN(Tree.java:773)
     at com.sleepycat.je.tree.Tree.getParentINForChildIN(Tree.java:704)
     at com.sleepycat.je.tree.Tree.getParentINForChildIN(Tree.java:656)
     at com.sleepycat.je.tree.Tree.getNextBinInternal(Tree.java:993)
     at com.sleepycat.je.tree.Tree.getNextBin(Tree.java:924)
     at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1621)
     at com.sleepycat.je.Cursor.retrieveNextAllowPhantoms(Cursor.java:2924)
     at com.sleepycat.je.Cursor.retrieveNextNoDups(Cursor.java:2801)
     at com.sleepycat.je.Cursor.retrieveNext(Cursor.java:2775)
     at com.sleepycat.je.Cursor.getNext(Cursor.java:1128)
     at org.apache.qpid.server.store.berkeleydb.BDBMessageStore.getContent(BDBMessageStore.java:1965)
     .....
     at java.lang.Thread.run(Thread.java:662)

Locked ownable synchronizers:
     - locked <5e37baac> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

     - locked <1788a0f5> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

     
_"RMI TCP Connection(5062)-10.119.128.156" - Thread t@41101_
java.lang.Thread.State: WAITING
     at sun.misc.Unsafe.park(Native Method)
     - parking to wait for <148e36dd> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
     at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
     at com.sleepycat.je.latch.SharedLatch.acquireExclusive(SharedLatch.java:95)
     at com.sleepycat.je.latch.SharedLatch.acquireShared(SharedLatch.java:135)
     at com.sleepycat.je.tree.IN.latchShared(IN.java:484)
     at com.sleepycat.je.tree.Tree.searchSubTreeInternal(Tree.java:1534)
     at com.sleepycat.je.tree.Tree.searchSubTree(Tree.java:1358)
     at com.sleepycat.je.tree.Tree.search(Tree.java:1218)
     at com.sleepycat.je.dbi.CursorImpl.searchAndPosition(CursorImpl.java:2069)
     at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2666)
     at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2576)
     at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2430)
     at com.sleepycat.je.Cursor.search(Cursor.java:2397)
     - locked <54bb010e> (a com.sleepycat.je.Cursor)
     at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1727)
     at org.apache.qpid.server.store.berkeleydb.BDBMessageStore.getContent(BDBMessageStore.java:1926)
     at org.apache.qpid.server.store.berkeleydb.BDBMessageStore$StoredBDBMessage.getContent(BDBMessageStore.java:2410)
     at org.apache.qpid.server.message.AMQMessage.getContent(AMQMessage.java:240)
     at org.apache.qpid.server.queue.AMQQueueMBean.viewMessageContent(AMQQueueMBean.java:416)
     at sun.reflect.GeneratedMethodAccessor166.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
     at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
     at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
     at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
     at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
     at javax.management.StandardMBean.invoke(StandardMBean.java:391)
     at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
     at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
     at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at org.apache.qpid.server.management.MBeanInvocationHandlerImpl.invoke(MBeanInvocationHandlerImpl.java:203)
     at $Proxy8.invoke(Unknown Source)
     at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
     at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
     at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1367)
     at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
     at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
     at java.lang.reflect.Method.invoke(Method.java:597)
     at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
     at sun.rmi.transport.Transport$1.run(Transport.java:159)
     at java.security.AccessController.doPrivileged(Native Method)
     at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
     at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
     at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)

Locked ownable synchronizers:
     - locked <3fe04982> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)


_"IoReceiver - /169.93.75.223:55970" - Thread t@261965_
java.lang.Thread.State: WAITING
     at sun.misc.Unsafe.park(Native Method)
     - waiting to lock <5e37baac> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync) owned by "RMI TCP Connection(5358)-10.119.152.33" t@41433
     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)
     at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
     at com.sleepycat.je.latch.SharedLatch.acquireShared(SharedLatch.java:149)
     at com.sleepycat.je.tree.IN.latchShared(IN.java:484)
     at com.sleepycat.je.tree.Tree.searchSplitsAllowed(Tree.java:1271)
     at com.sleepycat.je.tree.Tree.findBinForInsert(Tree.java:2205)
     at com.sleepycat.je.dbi.CursorImpl.putInternal(CursorImpl.java:834)
     at com.sleepycat.je.dbi.CursorImpl.put(CursorImpl.java:779)
     at com.sleepycat.je.Cursor.putAllowPhantoms(Cursor.java:2243)
     at com.sleepycat.je.Cursor.putNoNotify(Cursor.java:2200)
     at com.sleepycat.je.Cursor.putNotify(Cursor.java:2117)
     at com.sleepycat.je.Cursor.putNoDups(Cursor.java:2052)
     at com.sleepycat.je.Cursor.putInternal(Cursor.java:2020)
     - locked <38e6eb39> (a com.sleepycat.je.Transaction)
     at com.sleepycat.je.Database.putInternal(Database.java:1324)
     at com.sleepycat.je.Database.put(Database.java:1194)
     at org.apache.qpid.server.store.berkeleydb.BDBMessageStore.addContent(BDBMessageStore.java:1793)
     .....
     at java.lang.Thread.run(Thread.java:662)

There are lots of threads similar to "IoReceiver - /169.93.75.223:55970" - Thread t@261965 that are waiting to lock *5e37baac*. There is no other report in the thread dump of what has got lock *469f6330*.

- Has anyone seen this before?
- Is this fixed in 5.0.58?
- I am wondering if this problem is the reason why 5.0.48 is no longer available for download. Can anyone confirm this?
- I can't find a way of reporting bugs directly, other than on this forum. Please let me know if there's a better way for me to report this issue.


Thanks,
Phil

Edited by: philipharvey44 on 25-Jul-2012 04:21
  • 1. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    Hello Phil,
    We've seen what looks like a deadlock in je-5.0.48. Here are the offending threads:
    When you say it looks like a deadlock, do you mean that all threads are frozen in this position when you take repeated thread dumps, i.e., there is no change whatsoever? The reason I ask is that it is not an uncommon situation for many threads to be waiting on Btree latches such as you've shown. I have seen such blocking, for example, when write activity is high and in particular when there is a spike in log cleaning activity at the same time.

    If you believe this is a true deadlock, please email me the complete thread dump: mark.hayes at o.com (o == oracle). I'll reply to this thread on the forum, rather than by email.
    - Has anyone seen this before?
    - Is this fixed in 5.0.58?
    - I am wondering if this problem is the reason why 5.0.48 is no longer available for download. Can anyone confirm this?
    No, there is no deadlock fix in recent releases.
    - I can't find a way of reporting bugs directly, other than on this forum. Please let me know if there's a better way for me to report this issue.
    Oracle doesn't expose its bug database, so reporting bugs here is the right way to do it.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
  • 2. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    Oracle doesn't expose its bug database, so reporting bugs here is the right way to do it.
    However, if you have an Oracle support contract for BDB JE, please be sure to file an SR through the official support channel.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 3. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    Phil,

    I took a look at the complete thread dumps you sent but just reached the same conclusion that you did: many threads are waiting on a thread that holds a latch X, and the thread holding latch X is waiting for latch Y, but no thread holds latch Y. As far as I know, this shouldn't happen with Java ReentrantReadWriteLock.
    We’re using Oracle JVM 1.6.0_25-b06, running on Linux version: 2.6.18-194.32.1.el5.
    One approach would be to try to search for a Java bug that could cause this sort of problem with ReentrantReadWriteLock, or if you have a Java support contract to go through the Oracle Java support channel. I have not done this search in the Java bug database myself.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  • 4. Re: Deadlock found in je-5.0.48
    909806 Newbie
    Currently Being Moderated
    Coincidentally, I've just recently run into a similar-sounding issue in a context not involving BDB at all. My initial suspicion was that there's a bug in my code, wandering off an somehow missing an unlock, but it's possible that it's a JDK bug. In my case as well there's no obvious holder of one of the locks, and the JVM isn't detecting the deadlock; I'm not sure under what conditions the JVM is unable to detect a deadlock. But if you find a JVM or JDK bug, please do report back here.
  • 5. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    many threads are waiting on a thread that holds a latch X, and the thread holding latch X is waiting for latch Y, but no thread holds latch Y. As far as I know, this shouldn't happen with Java ReentrantReadWriteLock.
    After some more reading I can say I was wrong. If one thread holds a read lock and another thread tries to acquire a write lock (on the same ReentrantReadWriteLock), the thread dump doesn't show that the read lock is held. This is because there isn't a record of which thread holds a read lock, there is only a counter maintained.

    Since JE uses read and write locks for latching (unless je.env.sharedLatches is set to false), it could be that one of the threads in the thread dump does hold a read lock on latch Y, or it's possible that a thread holding the read lock has died.

    In other words, the fact that the thread dump doesn't show the owner of Y probably just means that a read lock is held on Y.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 6. Re: Deadlock found in je-5.0.48
    943066 Newbie
    Currently Being Moderated
    I believe I'm seeing similar behaviour:

    -----
    "Grizzly(10)" - Thread t@33
    java.lang.Thread.State: WAITING
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <e1ecc07> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
    at com.sleepycat.je.latch.SharedLatch.acquireExclusive(SharedLatch.java:95)
    at com.sleepycat.je.tree.IN.latch(IN.java:473)
    at com.sleepycat.je.dbi.CursorImpl.latchBIN(CursorImpl.java:471)
    at com.sleepycat.je.dbi.CursorImpl.removeCursor(CursorImpl.java:523)
    at com.sleepycat.je.dbi.CursorImpl.close(CursorImpl.java:596)
    at com.sleepycat.je.dbi.CursorImpl.close(CursorImpl.java:579)
    at com.sleepycat.je.Cursor.close(Cursor.java:516)
    at com.sleepycat.je.Cursor.readPrimaryAfterGet(Cursor.java:3769)
    at com.sleepycat.je.SecondaryCursor.readPrimaryAfterGet(SecondaryCursor.java:1470)
    at com.sleepycat.je.SecondaryCursor.retrieveNext(SecondaryCursor.java:1448)
    at com.sleepycat.je.SecondaryCursor.getNext(SecondaryCursor.java:560)
    at com.sleepycat.je.SecondaryCursor.getNext(SecondaryCursor.java:492)
    -----

    In my case, lock e1ecc07 is the lock causing many of my threads to wait, however is not mentioned as being held anywhere in the thread dump.
    If one thread holds a read lock and another thread tries to acquire a write lock (on the same ReentrantReadWriteLock), the thread dump doesn't show that the read lock is held. This is because there isn't a record of which thread holds a read lock, there is only a counter maintained.
    In the case above, it appears that the waiting thread is attempting to get a write lock - considering that only write-locks are exclusive in a ReentrantReadWriteLock, shouldn't it definitely be possible to see which thread holds this existing exclusive lock? (as write locks aren't just counter maintained)

    As a related question - if we know a BDB cursor is only going to be used for reading, is it possible to use it in such a way that it doesn't try to obtain an exclusive/write lock on a record as a way of helping avoid these problems?

    Edited by: fb on Jul 25, 2012 1:04 AM
  • 7. Re: Deadlock found in je-5.0.48
    928454 Newbie
    Currently Being Moderated
    Thank guys for looking into this. I also have a heap dump, captured while this problem was occurring. I will try to find out there this supposed read lock is held.

    I'll let you know how I get on, and will obviously mark this question as "answered" if it looks like a non-BDB problem.

    Phil
  • 8. Re: Deadlock found in je-5.0.48
    943066 Newbie
    Currently Being Moderated
    Actually, just realised an error in my logic - If write locks are exclusive, then it would, by nature, be impossible for it to be shared with a read-lock, hence the term: "exclusive" :)

    If it helps anyone - the cases when I've seen the above locking case happen is when you are disk-IO bound and locks are held for increasing amounts of time.

    Making sure you have appropriate things in cache helps a lot :)

    However, I'm still interested in my question regarding read-only cursors (or similar) as a way of reducing exclusive locks.
  • 9. Re: Deadlock found in je-5.0.48
    928454 Newbie
    Currently Being Moderated
    We've analysed our heap dump and wanted to share our findings. Our semi-educated guess is that this may represent a locking bug in BDB but we'd appreciate a second opinion. We're confident that the system was genuinely locked because we took several thread dumps over several minutes and checked what every thread was doing.

    In the diagram below we've represented threads on the left hand side and sync objects (java.util.concurrent.locks.ReentrantReadWriteLock.NonfairSync) on the right, along with the object addresses that correspond to the thread dump in my original question.
    t1 ----------------------------------+
                                         |
    t2 ----------------------------------+
                                         |
    ...                                  |
                                         |
    tn ----------------------------------+
                                         |
                                         | Parked trying to acquire lock (see [1])
                                         |
                  Has exclusive lock     v
    t@41433 ---------------------------> sync-5e37baac
        |
        |   Parked trying to acquire
        +------------------------------> sync-469f6330
                                          ^
                                          |
                                          | reference to (see [2])
                                          |
    t@41101 --> Java Local --> IN --> SharedLatch
       |
       |
       |   Parked trying to acquire 
       +-------------------------------> sync-148e36dd (see [3])
    [1] Lots of threads are blocked trying to acquire sync-5e37baac, which causes our system hang.

    [2] We think the reference from t@41101 to sync-469f6330 might mean that this thread is holding its read lock and would appreciate a second opinion about how likely this is.

    [3] However, we can't any similar references to sync-148e36dd, so have no idea what's holding its read lock. We fear that this means a thread that holds a read lock on this sync has failed to release it.
  • 10. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    If it helps anyone - the cases when I've seen the above locking case happen is when you are disk-IO bound and locks are held for increasing amounts of time.

    Making sure you have appropriate things in cache helps a lot
    Very true.
    However, I'm still interested in my question regarding read-only cursors (or similar) as a way of reducing exclusive locks.
    You may be confusing record locks with latches, which are used for Btree internal nodes not records. The bottom internal node (BIN) that contains the record data (leaf nodes or LNs) for a range of keys is always latched exclusively while doing the record lookup, because it more often than not has to be modified, even during a read. For example, if the LN is fetched, the BIN must be exclusively latched to modify its slot that contains the reference to the LN. Latches are implemented as ReentrantReadWriteLocks.

    Note that the BIN latch is only held long enough to ensure that the target LN is present -- a very short period. Then the record lock is acquired, which is either a read or write lock, depending on the operation (read or write). Record locks are not implementing using ReentrantReadWriteLock, they are implemented using JE lock objects using lower level Object.wait/notify methods. Record locks may be long lived.

    The duration of the record lock depends on the transaction isolation mode and this is described in detail in the guide on writing transactional apps. For example, a write lock is always held until the transaction ends. A read lock can be released earlier (depending on the isolation mode) or may not even be acquired at all (when the isolation mode is READ_UNCOMMITTED). This is a traditional database record locking design.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
  • 11. Re: Deadlock found in je-5.0.48
    greybird Expert
    Currently Being Moderated
    Phil,
    [2] We think the reference from t@41101 to sync-469f6330 might mean that this thread is holding its read lock and would appreciate a second opinion about how likely this is.
    Yes, that seems likely. I've never tried using Java local references to track down a latch held, but it makes sense that there must be a local reference (although possibly indirect) to a latch object that's been acquired for read or write, since otherwise we would have no way to release it when the stack unwinds. And the Btree latches are acquired and released within the scope of a method call. Good thinking.

    We also know that this thread holds a read lock. The stack for t@41101 shows that this thread has acquired at least one latch for read, prior to blocking. Some Btree code paths latch for read, some don't. This one does, and due to the structure of the Btree and the code for this method I can tell that at least one read lock has been acquired. I can go into more detail about how I know this, if needed.
    [3] However, we can't any similar references to sync-148e36dd, so have no idea what's holding its read lock. We fear that this means a thread that holds a read lock on this sync has failed to release it.
    There is a little more I can tell you from the stack.

    t@41101 is currently blocked waiting for an exclusive (write) latch on the BIN, not a shared (read) latch as you might think. The SharedLatch.acquireShared method on the stack turns around and calls acquireExclusive, because this is a BIN (see my reply above to fb). I know it's confusing, but the SharedLatch class does both shared and exclusive latching. A SharedLatch for a BIN always does exclusive latching.

    Because t@41101 is blocked on a lock for a BIN, and a BIN latch can never be latched for read, we can infer that another thread must hold a write lock on the BIN latch. But we also know that the thread dump should tell us when another thread holds a write lock on 148e36dd, and yet there is no such thread.

    This brings me back to wondering how an exclusive lock can be held yet not show up in the thread dump. Perhaps this happens when a thread that holds a latch dies -- I have not looked into this. It also brings me back to wondering about a JVM bug with ReentrantReadWriteLock. Let me know if you agree.

    In case this is a JE bug, there are some general things to know about potential problems with latches not being released in JE. We rarely see such bugs (at this stage in the product's lifetime) but they do happen. For example, the most recent one was [#21328] (fixed in 5.0.48) and occurred only when key prefixing and a custom comparator were used together. To find such a problem, it almost has to be reproducible. In the case of [#21328] it was reproducible and we added debugging code to the SharedLatch class in order to find it. Such debugging code isn't appropriate for a deployment, and we provided it to the customer in a one-off instrumented je.jar file. The customer ran their test repeatedly over a multi-week period before we found the bug.

    So if you can reproduce the problem in a test environment, the approach I just described could be used to find the problem. Otherwise, the only real hope is to find a related exception that will give us more information. In almost all cases (I can't think of an exception) when a latch is not released it is because of inadequate error handling when an exception is thrown while the latch is held. JE will not log the exception, if it is propogated up through the API. (JE does log exceptions in its daemon threads, however.) I assume your app is logging exceptions. If you can find the original exception in your logs, it may tell us exactly what happened or at least will give us a clue.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 12. Re: Deadlock found in je-5.0.48
    909806 Newbie
    Currently Being Moderated
    We also know that this thread holds a read lock. The stack for t@41101 shows that this thread has acquired at least one latch for read, prior to blocking. Some Btree code paths latch for read, some don't. This one does, and due to the structure of the Btree and the code for this method I can tell that at least one read lock has been acquired.
    Is there any possibility that this read lock it has is part of the same ReentrantReadWriteLock for which this same thread is trying to acquire the write lock? If so, that will definitely hang forever. That is, simply calling "lock.readLock().lock(); lock.writeLock().lock();" will block forever. See: https://gist.github.com/3180847#file_single_thread_deadlock.java
    we can infer that another thread must hold a write lock on the BIN latch. But we also know that the thread dump should tell us when another thread holds a write lock on 148e36dd, and yet there is no such thread.
    That's not quite right; the thread dump will only (it seems) show held intrinsic locks (i.e., synchronized blocks and methods) and the lock (or other synchronization primitive) on which a thread is currently blocking. For instance, see https://gist.github.com/3180847#file_forgotten_lock_test.java for an example of blocking on a lock without evidence in the thread dump about the lock holder. (Although, maybe inclusion of the "Locked ownable synchronizers" summary is platform-dependent--that part doesn't seem to show up for me.)
  • 13. Re: Deadlock found in je-5.0.48
    928454 Newbie
    Currently Being Moderated
    Thanks for your detailed analysis Mark.

    >
    This brings me back to wondering how an exclusive lock can be held yet not show up in the thread dump. Perhaps this happens when a thread that holds a latch dies -- I have not looked into this. It also brings me back to wondering about a JVM bug with ReentrantReadWriteLock. Let me know if you agree.
    >

    We wrote a test program to explore this. When a thread holding an exclusive lock on a ReentrantReadWriteLock dies (i.e. either its run() method returns or throws an exception), the lock is not automatically released. We also found from a heap dump that the lock.sync.exclusiveOwnerThread field still points to the "dead" thread. This seems like reasonable behaviour to me.

    However, in the heap dump from our production problem, the exclusiveOwnerThread field of the lock in question was null, which I think means it's been released, but our threads were nevertheless unable to lock it. I don't know how to explain this, other than to assume it's a JVM bug.

    We have raised this general locking question on the Oracle Java forum (My ReentrantReadWriteLock.writeLock's owning thread has gone missing and let you know what we discover.

    Phil

    Edited by: philipharvey44 on 26-Jul-2012 08:22
  • 14. Re: Deadlock found in je-5.0.48
    928454 Newbie
    Currently Being Moderated
    Jeffstyr wrote:
    That's not quite right; the thread dump will only (it seems) show held intrinsic locks (i.e., synchronized blocks and methods) and the lock (or other synchronization primitive) on which a thread is currently blocking. For instance, see https://gist.github.com/3180847#file_forgotten_lock_test.java for an example of blocking on a lock without evidence in the thread dump about the lock holder. (Although, maybe inclusion of the "Locked ownable synchronizers" summary is platform-dependent--that part doesn't seem to show up for me.)
    If you provide the "-l" option to jstack it prints the "locked ownable synchronizers", which lists the write locks held by each thread. Credit to Keith Wall for pointing out this flag to me.
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points