Sequential scan on large database

2789627
2789627 Member Posts: 6
edited May 23, 2017 3:08PM in Berkeley DB Java Edition

I find that when the database does not fit in memory and there are a lot of parallel scans going on, JE needs to fetch some IN records directly from the file. This causes too many random-seek IOs. Even though the entire data set resides in the file system cache, there are many calls like the stack trace below. I wonder what I can do to make the scan much faster and reduce the GC problem. The record size is around 200 bytes, yet each LogManager.getLogEntry seems to allocate around 2 KB of memory. This has a significant impact on GC given that we are trying to scan several hundred thousand records per second from multiple threads.

  java.lang.Thread.State: RUNNABLE

        at java.io.RandomAccessFile.seek0(Native Method)

        at java.io.RandomAccessFile.seek(RandomAccessFile.java:557)

        at com.sleepycat.je.log.FileManager.readFromFileInternal(FileManager.java:1907)

        - locked <0x00000001f7011dd8> (a com.sleepycat.je.log.FileManager$DefaultRandomAccessFile)

        at com.sleepycat.je.log.FileManager.readFromFile(FileManager.java:1862)

        at com.sleepycat.je.log.FileManager.readFromFile(FileManager.java:1799)

        at com.sleepycat.je.log.FileSource.getBytes(FileSource.java:92)

        at com.sleepycat.je.log.LogManager.getLogEntryFromLogSource(LogManager.java:909)

        at com.sleepycat.je.log.LogManager.getLogEntry(LogManager.java:826)

        at com.sleepycat.je.log.LogManager.getLogEntryAllowInvisibleAtRecovery(LogManager.java:788)

        at com.sleepycat.je.tree.IN.fetchLN(IN.java:2808)

        at com.sleepycat.je.tree.IN.fetchLN(IN.java:2746)

        at com.sleepycat.je.dbi.CursorImpl.getCurrent(CursorImpl.java:2235)

        at com.sleepycat.je.dbi.CursorImpl.lockAndGetCurrent(CursorImpl.java:2083)

        at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:2552)

        at com.sleepycat.je.Cursor.retrieveNextNoDups(Cursor.java:3064)

        at com.sleepycat.je.Cursor.retrieveNext(Cursor.java:2836)

        at com.sleepycat.je.Cursor.getNext(Cursor.java:1117)

        at com.sleepycat.util.keyrange.RangeCursor.doGetNext(RangeCursor.java:1126)

        at com.sleepycat.util.keyrange.RangeCursor.getNext(RangeCursor.java:650)

        at com.sleepycat.persist.BasicCursor.next(BasicCursor.java:116)

        at com.sleepycat.persist.BasicIterator.hasNext(BasicIterator.java:85)

java.lang.Thread.State: WAITING (parking)

        at sun.misc.Unsafe.park(Native Method)

        - parking to wait for  <0x000000017a22c300> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)

        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)

        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)

        at java.util.concurrent.ArrayBlockingQueue.offer(ArrayBlockingQueue.java:327)

        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1361)

        at com.sleepycat.je.evictor.Evictor.alert(Evictor.java:2179)

        at com.sleepycat.je.evictor.Evictor.doCriticalEviction(Evictor.java:2041)

        at com.sleepycat.je.dbi.EnvironmentImpl.criticalEviction(EnvironmentImpl.java:2880)

        at com.sleepycat.je.dbi.CursorImpl.criticalEviction(CursorImpl.java:776)

        at com.sleepycat.je.dbi.CursorImpl.close(CursorImpl.java:693)

        at com.sleepycat.je.Cursor.endMoveCursor(Cursor.java:4683)

        at com.sleepycat.je.Cursor.retrieveNextNoDups(Cursor.java:3072)

        at com.sleepycat.je.Cursor.retrieveNext(Cursor.java:2836)

        at com.sleepycat.je.Cursor.getNext(Cursor.java:1117)

        at com.sleepycat.util.keyrange.RangeCursor.doGetNext(RangeCursor.java:1126)

        at com.sleepycat.util.keyrange.RangeCursor.getNext(RangeCursor.java:650)

        at com.sleepycat.persist.BasicCursor.next(BasicCursor.java:116)

        at com.sleepycat.persist.BasicIterator.hasNext(BasicIterator.java:85)

Answers

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited May 18, 2017 10:35AM

    Hi,

    Do you know if the GC overhead is because the BINs in the JE cache have become "old generation" before they are evicted from the cache by JE? This is a problem we have seen.

    If so, you could try using CacheMode.EVICT_BIN. I see you are doing a sequential scan (you're using an iterator). EVICT_BIN will evict the BIN from the JE cache when the scan reads the last record in the BIN. Therefore the BIN will only be in cache long enough to read its records (roughly 100 records). Hopefully the BIN can then be GC'd quickly, before it becomes old generation.

    I assume that, in this workload, you don't need to keep BINs in the JE cache because you are relying on the file system cache for that.

    --mark
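    As a sketch of what this looks like with the DPL cursor API (a minimal example, not from the thread; the entity class, index types, and process() handler are hypothetical, and EntityCursor.setCacheMode is assumed to be available in your JE version):

```java
import com.sleepycat.je.CacheMode;
import com.sleepycat.persist.EntityCursor;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;

public class EvictBinScan {

    // Hypothetical per-record handler; substitute your application logic.
    static void process(MyEntity e) { /* ... */ }

    static void scan(EntityStore store) {
        // Hypothetical key/entity types; substitute your own.
        PrimaryIndex<Long, MyEntity> index =
            store.getPrimaryIndex(Long.class, MyEntity.class);
        EntityCursor<MyEntity> cursor = index.entities();
        try {
            // Evict each BIN from the JE cache once the scan has read
            // its last record, so the BIN can be GC'd while still young.
            cursor.setCacheMode(CacheMode.EVICT_BIN);
            for (MyEntity e = cursor.next(); e != null; e = cursor.next()) {
                process(e);
            }
        } finally {
            cursor.close();
        }
    }
}
```

    Note that the cache mode is set per cursor, so other (random-access) workloads in the same environment are unaffected.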

  • 2789627
    2789627 Member Posts: 6
    edited May 18, 2017 12:36PM

    Mark,

    I think the key problem is a chicken-and-egg problem here. If the item is not in memory, it must be fetched from disk, which is expensive; the GC cost comes after that. And I think the reason the GC can be quite expensive could be partly that the LRU cache gets updated, so pointers get moved around, where an older item can point to a newer item and vice versa. It would be nice if there were a concept of predictive fetching, where a large read fetches data in advance.

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited May 18, 2017 12:47PM

    Yes, predictive fetching would help with the expense of the IO. It's a good idea but I can't promise anything there.

    On the GC problem, CacheMode.EVICT_BIN should help because it effectively disables the LRU for the BIN. Although we go through the mechanics of updating the LRU list during the read operation, the BIN will be removed from the LRU list as soon as it is no longer needed by the scan. We have used this approach successfully to solve GC problems with LNs (CacheMode.EVICT_LN). Please give it a try if you haven't already. I think it is the best approach for sequential scans.

    The other solution is to increase the heap size (and JE cache size) so it is large enough to hold all the BINs. But my guess is that you would already be doing that, if it were possible with the hardware you have. You could also use the off-heap cache instead of the file system cache, but you would still need to use EVICT_BIN to avoid GC problems for the case where the majority of BINs do not fit in the main cache and you're scanning sequentially.

    --mark
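    A hedged sketch of the sizing side of this advice (assuming a JE 7.x EnvironmentConfig; setOffHeapCacheSize and the sizes shown are illustrative assumptions, not recommendations from the thread):

```java
import com.sleepycat.je.EnvironmentConfig;

public class CacheSizing {

    static EnvironmentConfig configure() {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        // On-heap JE cache: keep it modest so cached BINs do not
        // accumulate in the old generation and pressure the GC.
        envConfig.setCacheSize(4L * 1024 * 1024 * 1024);         // 4 GB, illustrative
        // Off-heap cache (JE 7+): memory outside the Java heap, so it
        // is not scanned by the GC, unlike BINs held in the main cache.
        envConfig.setOffHeapCacheSize(32L * 1024 * 1024 * 1024); // 32 GB, illustrative
        return envConfig;
    }
}
```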

  • 2789627
    2789627 Member Posts: 6
    edited May 18, 2017 12:53PM

    If I set CacheMode to EVICT_BIN mode and most of the items there are never accessed randomly, would it be much worse (i.e. take longer and perform worse) as the item will never be in the cache.

    By the way, I forgot to mention that the heap size is around 30 GB and we have more than 1,000 databases running in different environments that share the same cache. I think part of the problem is also that updating the LRU hits the shared cache significantly: the sequential scans update random cache buckets from different threads, and that seems to generate a lot of lock contention as well.

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited May 18, 2017 1:38PM

    You can increase the number of LRU lists to help with contention.
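    If I understand correctly, this is the knob in question (assuming the EVICTOR_N_LRU_LISTS parameter, "je.evictor.nLRULists"; check the javadoc for your JE version):

```java
import com.sleepycat.je.EnvironmentConfig;

public class LruListConfig {

    static EnvironmentConfig configure() {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        // Each LRU list has its own lock, so more lists means less
        // contention when many threads update the LRU concurrently.
        // The thread notes that 32 is the maximum accepted value.
        envConfig.setConfigParam(EnvironmentConfig.EVICTOR_N_LRU_LISTS, "32");
        return envConfig;
    }
}
```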

  • 2789627
    2789627 Member Posts: 6
    edited May 18, 2017 1:59PM

    I already maxed out the LRU lists at 32. Still, the contention is high enough that a lot of locks are being taken at any point in time.

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited May 18, 2017 2:08PM

    Do you have concurrent read/write workloads and scan workloads? I can think of two possibilities:

    1. If the scans are performed separately from other workloads, then don't use a shared cache for the scans. And use EVICT_BIN to reduce GC costs.

    2. If you don't care about LRU accuracy, you could hack in a change to increase the maximum size beyond 32. I'm happy to point you to the code if you would like. (I assume you don't have a support contract, BTW, and therefore this is something you may be willing to try).

    Also, from earlier, I'm sorry but I don't understand this sentence:

    """

    If I set CacheMode to EVICT_BIN mode and most of the items there are never accessed randomly, would it be much worse (i.e. take longer and perform worse) as the item will never be in the cache.

    """

    Could you please rephrase?

    If most of the items are not accessed randomly, then CacheMode.EVICT_BIN is a good choice. If you want to support random access as well, try CacheMode.UNCHANGED instead. Testing is the best way to determine the benefit.

    --mark

  • 2789627
    2789627 Member Posts: 6
    edited May 18, 2017 8:07PM

    Thank you, Mark. By the way, it turns out that the access is not quite sequential. This is what I got from running strace. At first, I thought the LSM would try to order the data by primary key. I guess that's why it can only perform around 200 reads/s from the IO perspective.

    read(3853, "}D\241\360#\0\21s;\0\306\0\0\0\4|b\0008\0\v\0zb\370\24\377x47\200\0"..., 2048) = 2048
    lseek(3853, 3895162, SEEK_SET)          = 3895162
    read(3853, "UA\26\306#\0Zo;\0\267\0\0\0\4|\246\0028\0\v\0z\\\370\24\377x%7\200\0"..., 2048) = 2048
    lseek(3853, 3674773, SEEK_SET)          = 3674773
    read(3853, "\321;ze#\0u\0228\0\276\0\0\0\4|\21\0218\0\v\0zo\337\24\377x,7\200\0"..., 2048) = 2048
    lseek(3853, 3897725, SEEK_SET)          = 3897725
    read(3853, "\366G\251D#\0]y;\0\322\0\0\0\4|\205\0268\0\v\0zm\370\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3896431, SEEK_SET)          = 3896431
    read(3853, "\230H+\322#\0Ot;\0\322\0\0\0\4|\360\0308\0\v\0zd\370\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3897395, SEEK_SET)          = 3897395
    read(3853, "[I\245\32#\0\23x;\0\322\0\0\0\4|[\0338\0\v\0zj\370\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3907610, SEEK_SET)          = 3907610
    read(3853, "\354P\226%#\0\372\237;\0\322\0\0\0\4|\260\225;\0\v\0z3\371\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3684582, SEEK_SET)          = 3684582
    read(3853, "\251=\3561#\0\30688\0\276\0\0\0\4|b78\0\v\0z\177\340\24\377x,7\200\0"..., 2048) = 2048
    lseek(3853, 3907292, SEEK_SET)          = 3907292
    read(3853, "\23G(\361#\0\274\236;\0\306\0\0\0\4|\372\226;\0\v\0z0\371\24\377x47\200\0"..., 2048) = 2048
    lseek(3853, 3907940, SEEK_SET)          = 3907940
    read(3853, "\246E\236\36#\0D\241;\0\270\0\0\0\4|\200\224;\0\v\0z4\371\24\377x&7\200\0"..., 2048) = 2048
    lseek(3853, 3934070, SEEK_SET)          = 3934070
    read(3853, "\355Mm\215#\0V\7<\0\322\0\0\0\4|\10Z8\0\v\0z\16\374\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3954367, SEEK_SET)          = 3954367
    read(3853, "7F\252F#\0\237V<\0\306\0\0\0\4|g\\8\0\v\0z\376\375\24\377x47\200\0"..., 2048) = 2048
    lseek(3853, 3700667, SEEK_SET)          = 3700667
    read(3853, "9K*\207#\0\233w8\0\355\0\0\0\4|\10v8\0\v\0z\256\341\24\377x[7\200\0"..., 2048) = 2048
    lseek(3853, 3933130, SEEK_SET)          = 3933130
    read(3853, "\260Cx\363#\0\252\3<\0\270\0\0\0\4|\260y8\0\v\0z\10\374\24\377x&7\200\0"..., 2048) = 2048
    lseek(3853, 3703017, SEEK_SET)          = 3703017
    read(3853, "*<$3#\0\311\2008\0\276\0\0\0\4|e\1778\0\v\0z\24\342\24\377x,7\200\0"..., 2048) = 2048
    lseek(3853, 3704215, SEEK_SET)          = 3704215
    read(3853, ":<\217\31#\0w\2058\0\276\0\0\0\4|\23\2048\0\v\0zi\342\24\377x,7\200\0"..., 2048) = 2048
    lseek(3853, 3705393, SEEK_SET)          = 3705393
    read(3853, "\275=\23I#\0\21\2128\0\302\0\0\0\4|\251\2108\0\v\0z\216\342\24\377x07\200\0"..., 2048) = 2048
    lseek(3853, 3927020, SEEK_SET)          = 3927020
    read(3853, "SJ\221\214#\0\314\353;\0\306\0\0\0\4|\t\2148\0\v\0zS\373\24\377x47\200\0"..., 2048) = 2048
    lseek(3853, 3706571, SEEK_SET)          = 3706571
    read(3853, "\274;q\300#\0\253\2168\0\276\0\0\0\4|G\2158\0\v\0z\236\342\24\377x,7\200\0"..., 2048) = 2048
    lseek(3853, 3707149, SEEK_SET)          = 3707149
    read(3853, "\202B\4\263#\0\355\2208\0\275\0\0\0\4|\212\2178\0\v\0z\300\342\24\377x+7\200\0"..., 2048) = 2048
    lseek(3853, 3925740, SEEK_SET)          = 3925740
    read(3853, "wDj]#\0\314\346;\0\266\0\0\0\4|:\2258\0\v\0zR\373\24\377x$7\200\0"..., 2048) = 2048
    lseek(3853, 3709491, SEEK_SET)          = 3709491
    read(3853, "\300>m\202#\0\23\2328\0\302\0\0\0\4|\253\2308\0\v\0z\34\343\24\377x07\200\0"..., 2048) = 2048
    lseek(3853, 3925410, SEEK_SET)          = 3925410
    read(3853, "\306K\3\271#\0\202\345;\0\322\0\0\0\4|\27\2348\0\v\0zN\373\24\[email protected]\200\0"..., 2048) = 2048
    lseek(3853, 3932812, SEEK_SET)          = 3932812
    read(3853, "\363G\352\342#\0l\2<\0\306\0\0\0\4|\327\0<\0\v\0z\5\374\24\377x47\200\0"..., 2048) = 2048
    lseek(3853, 3939413, SEEK_SET)          = 3939413
  • 2789627
    2789627 Member Posts: 6
    edited May 18, 2017 8:18PM

    Also the cache miss rate seems to be quite high even though the file system cache size is huge.

    sudo ./cachestat 1
    Counting cache functions... Output every 1 seconds.
        HITS   MISSES  DIRTIES    RATIO   BUFFERS_MB   CACHE_MB
      154421    87222      213    63.9%           74     398068
      114272    68722      210    62.4%           74     398297
       84430    80484      334    51.2%           74     398493
      239422    59953      478    80.0%           74     398631
       84908    98699      212    46.2%           74     398924
       89881    52245      887    63.2%           74     399103
       94761    62662      406    60.2%           74     399272
       62116    73781      722    45.7%           74     399513
       66246    59304     1077    52.8%           74     399684
       59092    78807      127    42.9%           74     399981
      109731    78009      600    58.4%           74     400148
       92263    68459      718    57.4%           74     400251
      131698    62940     1266    67.7%           74     400223
      111774    61981      491    64.3%           74     400322
      162338    64500     1359    71.6%           74     400307
      119647    51358      282    70.0%           74     400281
       95369    94097      284    50.3%           74     400285
      155442   131097      282    54.2%           74     400355
       96646    70328      859    57.9%           74     400253
      111868    73463      752    60.4%           74     400325
       94426    69726     1022    57.5%           74     400374
      117022    57552      189    67.0%           74     400353
       86936    80807      901    51.8%           74     400247
  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited May 23, 2017 3:08PM

    The file system cache has drawbacks and you could probably benefit from using the off-heap cache. See:

    https://sleepycat-tools.us.oracle.com/support.web/doc_builds/doc.je/java/com/sleepycat/je/EnvironmentConfig.html#MAX_OFF…

    On your prior question, you will often not see sequential IO when doing a sequential key scan, since JE stores records (very roughly) in insertion order.

    --mark

This discussion has been closed.