4 Replies Latest reply on Nov 18, 2010 1:04 AM by Greybird-Oracle

    Deadlock in evict with BDB/je 4.0.92

      Hi All,

      I had a misbehaving application today and it was reported by jstack as having 1 deadlock. With 178 threads BLOCKED on doEvict(Evictor.java:300 ) and 2 threads BLOCKED on lockInternal(Txn.java:465). In any case there are a 5 threads interacting with a 0x00002aab0c28d570 a (com.sleepycat.je.txn.Txn) and 178 threads interacting with a lock 0x00002aaadc022cd0 (a com.sleepycat.je.evictor.SharedEvictor).

      The SharedEvictor (0x00002aaadc022cd0) lock is held by one thread and 179 threads are waiting for it.

      The situation for the Txn locked on seems more complicated.
      Found one Java-level deadlock:
        waiting for ownable synchronizer 0x00002aab0fe420d8, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
        which is held by "system.timer"
        waiting to lock monitor 0x00002aac3849e8d8 (object 0x00002aab0c28d570, a com.sleepycat.je.txn.Txn),
        which is held by "1075304977@qtp-569616903-472"
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00002aab0fe420d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
              at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
              at com.sleepycat.je.latch.SharedLatch.acquireExclusive(SharedLatch.java:95)
              at com.sleepycat.je.latch.SharedLatch.acquireShared(SharedLatch.java:135)
              at com.sleepycat.je.tree.IN.latchShared(IN.java:344)
              at com.sleepycat.je.tree.Tree.searchSubTreeInternal(Tree.java:1867)
              at com.sleepycat.je.tree.Tree.searchSubTree(Tree.java:1678)
              at com.sleepycat.je.tree.Tree.search(Tree.java:1544)
              at com.sleepycat.je.tree.Tree.getParentBINForChildLN(Tree.java:969)
              at com.sleepycat.je.recovery.RecoveryManager.undo(RecoveryManager.java:2410)
              at com.sleepycat.je.recovery.RecoveryManager.abortUndo(RecoveryManager.java:2360)
              at com.sleepycat.je.txn.Txn.undo(Txn.java:1149)
              at com.sleepycat.je.txn.Txn.abortInternal(Txn.java:1079)
              - locked <0x00002aab0c28d570> (a com.sleepycat.je.txn.Txn)
              at com.sleepycat.je.txn.Txn.abort(Txn.java:1039)
              at com.sleepycat.je.txn.Txn.abort(Txn.java:1012)
              at com.sleepycat.je.Transaction.abort(Transaction.java:116)
              at org.expasy.core.jobs.BdbJobStore.close(BdbJobStore.java:214)
      This is the one of our stores trying to close itself. And abort all running transactions.
              at com.sleepycat.je.txn.Txn.lockInternal(Txn.java:465)
              - waiting to lock <0x00002aab0c28d570> (a com.sleepycat.je.txn.Txn)
              at com.sleepycat.je.txn.ReadCommittedLocker.lockInternal(ReadCommittedLocker.java:88)
              at com.sleepycat.je.txn.Locker.nonBlockingLock(Locker.java:487)
              at com.sleepycat.je.dbi.CursorImpl.lockLNDeletedAllowed(CursorImpl.java:2677)
              at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2615)
              at com.sleepycat.je.dbi.CursorImpl.delete(CursorImpl.java:948)
              at com.sleepycat.je.Cursor.deleteNoNotify(Cursor.java:1606)
              at com.sleepycat.je.Cursor.deleteInternal(Cursor.java:1586)
              at com.sleepycat.je.Cursor.delete(Cursor.java:573)
              at org.expasy.core.jobs.BdbJobStore.removeOldJobs(BdbJobStore.java:194)
      And this is a a cursor loop that takes out records sees if they are older than one week and deletes them if that is the case.
      If you are interested in the BDB/je data I took a copy (1.9GB) and have the full stack trace if you are interested.

        • 1. Re: Deadlock in evict with BDB/je 4.0.92
          Hi Jerven,

          Thanks for reporting this. It appears to be a JE bug. I will take a closer look and get back to you later today.

          • 2. Re: Deadlock in evict with BDB/je 4.0.92
            Jerven, it looks like there is a race condition, where:

            - with a given txn, one thread tries to access a record (to delete it, in this case)
            - another thread aborts the same txn

            Coming up with a solution to the race condition in JE is going to take a little time.

            I see that before closing a "store" (Database), you abort any active txns for that database. Before aborting the txns, would it be possible for you to to notify your deleter thread, and wait for it to close the cursor/txn, or at least wait until you are certain the deleter thread is not accessing the database? This would work around the problem.

            What happens in your deleter thread today, when the txn is aborted? JE exceptions will be thrown when accessing the database with the invalid (aborted) txn. Do you expect those exceptions, and ignore them?

            • 3. Re: Deadlock in evict with BDB/je 4.0.92
              Yes you are correct. There is a race condition. It hadn't happend before because this is the first time that we shut down during a cleaning session.
              I had not seen a problem in testing with aborting transactions in this way. And in this particular case I restarted the server running the bdb/je and did not see any extra exceptions or log statements indicating that the exception path was taken. When the abort transactions are running no new transactions are spawned. Running transactions that are aborted may throw a DatabaseException (as I understand it) or will it be IllegalStateException?. In case of a database exception a new bug in my code would be triggered and I did not see that. The bug would be that the environment would be reinitialized and that is not what we need right then ;) (I quickly put in a guard against that for the future) For the rest we do not reuse transactions for multiple bdb/je writing/reading actions except in the delete old jobs cursor loop so a IllegalStateException might not be thrown.

              The work around would be easy to implement so I will be doing that in the morning. I think that I will wait for all transactions to finish instead of aborting them. This is probably a lot cleaner. I will also log how many active transactions there are when stating to close. Right now we do a checkpoint of this database before closing it which is hopefully a good idea.

              For the rest it seems that we met the doEvict serial performance problem. The issue that made us shut down the server in the first place. So I am happy to note that we will be upgrading to 4.1.6 in any case.
              • 4. Re: Deadlock in evict with BDB/je 4.0.92

                We have fixed the problem that caused the hard deadlock, and the fix will be in the next JE 4.1.x release. However, you are very wise to change your app to gracefully shut down the transactions. That way, you won't get exceptions that you'll have to handle or ignore.

                For future reference, the ticket for this fix is [#19321], which is what will appear in the change log.