This discussion is archived
1 2 Previous Next 26 Replies Latest reply: Oct 9, 2012 6:29 PM by greybird Go to original post RSS
  • 15. Re: Understanding LockTimeOutExceptions better
    vinothchandar Newbie
    Currently Being Moderated
    Also, can you please explain the relationship to duplicates? Is this a known issue? I recall that you mentioned the locking of the dupCountLN before. But given that is only 1% of these exceptions, I am curious to know how duplicates impact the transaction locking.

    Thanks
    Vinoth
  • 16. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    I guess "com.sleepycat.je.LockTimeoutException: (JE 4.0.92) Lock expired. Locker 1723973692 124031744_voldemort-niosocket-server14_Txn: waited for lock on database=store1 LockAddr:924552162 node=87042505 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1349656528721 endTime=1349656529221" is the exception.toString() ?
    Nope. The toString gives you the stack trace of the owner and the waiters, due to the debug params you set.

    What we're missing is the stack trace of the caller. It does not show up in the waiters list because it is removed when the timeout occurs. I'm pretty sure you can log full stack traces with log4j, but I understand that you may not be able to make a production change.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
  • 17. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    Also, can you please explain the relationship to duplicates? Is this a known issue? I recall that you mentioned the locking of the dupCountLN before. But given that is only 1% of these exceptions, I am curious to know how duplicates impact the transaction locking.
    It is not a known issue, and I can only give you a little more information about what's happening based on the second exception you posted, the one that shows an owner and one waiter. For both the owner and the waiter, a lookup is being performed: BdbStorageEngine.put is calling Cursor.getSearchKey.

    In both stacks you'll notice an internal call to CursorImpl.getNextDuplicate. This is because the first record (first duplicate for the key) found was deleted but not yet compressed away (deleted records are compressed out of the Btree asynchronously). So there are two (or more) records for the same key, and one (or more) with the lower valued data is deleted and no longer locked.

    I'm not pointing out a problem in JE -- I don't see any evidence of a problem -- I'm only pointing out what I can tell you about what's happening, from the stack trace. Perhaps this will give you a clue as to what might be causing lots of contention, i.e., a hot spot of some kind in your application.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 18. Re: Understanding LockTimeOutExceptions better
    vinothchandar Newbie
    Currently Being Moderated
    This is because the first record (first duplicate for the key) found was deleted but not yet compressed away (deleted records are compressed out of the Btree asynchronously).
    The app does have a lot of concurrent writes on the same key i.e multiple simultaeneous transactiions (where we do a searchKey, delete, put). So, is the issue because its waiting for the delete to be compressed?
    If so, when does the compression happen or who does it?
  • 19. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    The app does have a lot of concurrent writes on the same key i.e multiple simultaeneous transactiions (where we do a searchKey, delete, put). So, is the issue because its waiting for the delete to be compressed?
    No. The uncompressed/deleted record will simply be locked, examined, and skipped over by cursors. As I mentioned, in both stack traces the deleted record was already skipped over when the timeout occurred. I mention this because it tells you something about what's happening in the app: that the app deleted a record, and other threads are trying to access records with the same key, and this is all happening in a short time period.
    If so, when does the compression happen or who does it?
    Compression of deleted records is performed in a JE background thread, but any delay in compressing a deleted record will not cause blocking.

    If you have many threads that are accessing the same record(s), this will cause lots of contention and you may have to increase the lock timeout and/or do retries.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 20. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    I did not make any code changes for this. We just catch a DatabaseException and log it using log4j. https://github.com/voldemort/voldemort/blob/master/src/java/voldemort/store/bdb/BdbStorageEngine.java#L381
    You're logging just the toString of the exception here, since you're calling this log4j method:
    http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Category.html#error(java.lang.Object)

    To log the stack trace you use the error method with two params:
    http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Category.html#error(java.lang.Object, java.lang.Throwable)

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 21. Re: Understanding LockTimeOutExceptions better
    vinothchandar Newbie
    Currently Being Moderated
    Hi Mark,

    This then goes back to what I was asking before. We have a timeout of 500ms, which I consider very high ,given we are on SSDs. On the average, our operations finish in a millisecond. So, even if all the 50 threads we have in the app are contending on the same key, why would it cause the lock to expire? In other words, the worst thing would be for the last waiter to finish in 50 ms average?

    Based on what you mentioned, I think removing duplicates and simplifying this transaction to a vanilla get()/put() might speed up things and thus reduce the window of contention.

    In theory, this can happen with N threads if all of them go to the same key and an operation takes a maximum of t ms, with a locktimeout < t * N milliseconds.

    Thanks
    Vinoth
  • 22. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    As Jeff mentioned, a lock timing out after 500ms doesn't always mean that the operations it was waiting for took 500ms of real time, since those operations are sharing the CPUs with other threads.

    Yes, moving away from JE 4.x duplicates will reduce contention, but with the information you've given me I don't see this as an issue, so far. I would expect to see locks blocked on the DupCountLN if duplicates were the issue.

    I'm not sure there has been enough analysis yet to determine what is causing the timeouts. The first step is for you to understand the information that is printed in the exception messages, and it will be much easier to analyze if you have the stack trace -- otherwise, you have no information on what the thread that timed out is doing -- the exception message only contains information about the other threads, the owner and remaining waiters. I think if you have the complete information and you understand what it means, you'll be able to figure out what operations are really in play when the timeout occurs, and this will give you clues as to what is going on.

    I assume you've read all about locks, blocking and timeouts our transaction guide.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 23. Re: Understanding LockTimeOutExceptions better
    vinothchandar Newbie
    Currently Being Moderated
    yes mark. Our new code has just a basic RMW txn. So things should improve.

    Let me try to confirm the missing information, so I can try adding it to the code and roll it out eventually. So, the exception that is logged was thrown from one of the threads calling trying to do the transaction. Currently, JE just dumps out every stack trace except the erroring thread? So, doing a e.printStackTrace() on the exception thrown would complete the picture?
  • 24. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    Let me try to confirm the missing information, so I can try adding it to the code and roll it out eventually. So, the exception that is logged was thrown from one of the threads calling trying to do the transaction. Currently, JE just dumps out every stack trace except the erroring thread? So, doing a e.printStackTrace() on the exception thrown would complete the picture?
    Yes.
  • 25. Re: Understanding LockTimeOutExceptions better
    909806 Newbie
    Currently Being Moderated
    I haven't traced through the details of how BDB-JE does locking, but with (for example) java.util.concurrent.locks.ReentrantLock you can easily get timeouts due to non-fair locking (that is, when locking doesn't attempt to hand out locks in first-waited-for order--because throughput is higher if you don't implement fairness). For a test case you can try out, see [url https://gist.github.com/3862472]. In that case, there are 50 threads fighting for a resource using tryLock() with a 500ms timeout, and once locked a thread hogs the lock for 1ms. Running that test, I get a constant stream of timeouts.

    My point is just, there may not be anything wrong. (And BDB-JE may implement fair locking, but that's typically much much slower than non-fair locking, such that you might still get timeouts.) Of course, this test case is maximizing contention and is not a perfect model for what's going on in BDB-JE, but it serves to demonstrate perhaps unexpected timeout behavior with highly contended locks, even when they locks are not being held for long periods of time.
  • 26. Re: Understanding LockTimeOutExceptions better
    greybird Expert
    Currently Being Moderated
    JE implements record locking (long lived locks) using Object.wait/notify. When a lock is released, the lock is granted to the next waiter in the waiter list, in the order the locks were requested. So it is fair in that sense. Record locks are what are being discussed here, and are what the LockTimeoutException applies to.

    JE implements Btree latching -- very short lived locks on the internal nodes of the Btree -- using ReentrantReadWriteLock in default (nonfair) mode. These have no timeout interval.

    Full GCs could also have a big impact.

    --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points