Application thread pauses for 5 seconds

3293742 Member Posts: 1
edited Nov 14, 2016 4:39PM in Berkeley DB Java Edition

1 or 2% of our write requests pause for 5 to 8 seconds during peak traffic. All the remaining requests finish in a few milliseconds.

Some details about the env:

1. 16GB of java heap

2. 3 GB of BDB cache (the cache size estimation utility suggested these figures: 1.3 GB with only IN nodes in the cache, 3.4 GB with both IN and leaf nodes in the cache)

3. OS: 64-bit Linux with JDK 1.8 (note that when I ran the cache size estimation utility, I didn't use the compressed oops JVM flag, -XX:+UseCompressedOops)

4. We don't use transactions

5. We are using je-5.0.73.jar (we can't upgrade as of now)

Prior to the above cache settings, we had only a 350 MB BDB cache with a 5 GB Java heap. We saw poor write latencies, which is why we ran the cache size estimation utility and increased the cache size.

My suspicion is that cleaning and checkpointing are causing heavy IO (perhaps the write buffer getting flushed?), which in turn pauses the application threads. The threads seem to be either waiting on contention or doing IO.
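One way to tell contention apart from IO during a stall is to snapshot thread states from inside the JVM. The sketch below uses only the standard java.lang.management API; the class and method names are hypothetical. Many BLOCKED threads point to monitor contention, whereas application threads that stay RUNNABLE with file-write frames at the top of their stacks (visible with jstack) point to IO, since a thread blocked in a native write still reports RUNNABLE.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSampler {

    /** Counts live threads by Thread.State from a single JVM-wide snapshot. */
    public static Map<Thread.State, Integer> sample() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip any missing entry
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few one-second-apart snapshots; run this while a stall is in progress.
        for (int i = 0; i < 3; i++) {
            System.out.println(sample());
            Thread.sleep(1000);
        }
    }
}
```

In practice this would run on a background thread inside the application, so that snapshots are taken while a 5 to 8 second pause is actually happening.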

The blog post here suggests that setting EnvironmentConfig.CHECKPOINTER_HIGH_PRIORITY along with adding more cleaner threads may help. I am wondering whether my use case falls into that category.

When I look at the stats, the checkpoint numbers are as follows:

checkPointStart - 1109819556300535, checkPointEnd - 1109823853844035 (difference comes to 4297543500) - are these high values?

Thanks for the help.

Sathish


Answers

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Oct 3, 2016 12:34PM

    Hi,

    I do not recommend using the high-priority checkpoint option; in fact, this feature has been removed in later releases. It does more harm than good. I also doubt that checkpointing itself is causing the problem you report. If the checkpointer is behind, checkpoints will take a long time to finish, but since there is only one checkpointer thread, I don't think it is causing the pauses in the app threads.

    The most common cause of a pause in the app threads, by far, is a GC pause. Have you collected a GC log and looked to see if the app pauses are correlated to GC pauses?

    While it is possible that IO is causing the pauses, perhaps due to a mistuned file system, I suggest looking at GC first, if only to rule it out.
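    On JDK 8, a GC log can be enabled with flags such as -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime, and its timestamps compared with those of the slow requests. As a complementary in-process check, the sketch below times an operation and prints the GC time accumulated during the same window whenever the operation exceeds a threshold. The class, method names, and 5-second threshold are illustrative assumptions. Note that getCollectionTime() is cumulative collection time, which for concurrent collectors is not the same as stop-the-world pause time, so the GC log remains the authoritative source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseCheck {

    /** Sums accumulated collection time (ms) over all collectors, skipping undefined (-1) values. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if this collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /**
     * Runs op; if it takes at least thresholdMillis, prints its duration next to the
     * GC time accumulated during the same window. Returns that GC time delta in ms.
     */
    public static long timeAndCompare(Runnable op, long thresholdMillis) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        op.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        long gcDelta = totalGcMillis() - gcBefore;
        if (elapsedMs >= thresholdMillis) {
            System.out.printf("slow op: %d ms, GC time in same window: %d ms%n", elapsedMs, gcDelta);
        }
        return gcDelta;
    }

    public static void main(String[] args) {
        // Placeholder for a real write (for example a JE Database.put call); here it just sleeps.
        Runnable write = () -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        timeAndCompare(write, 5_000L);
    }
}
```

    If a slow write shows a GC delta close to its own duration, GC is the likely culprit; a slow write with a near-zero GC delta points elsewhere, such as IO.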

    >>> checkPointStart - 1109819556300535, checkPointEnd - 1109823853844035 (difference comes to 4297543500) - are these high values?

    Sorry, these are pretty difficult for you to interpret. They are LSNs (log sequence numbers). The high 32 bits of the 'long' value are the file number and the low 32 bits are the offset within that file. In hex these are 0x3f160006b06f7 and 0x3f16100925643. So the end file number is one larger than the start file number, and the end offset is 2576204 larger than the start offset. To convert this into a byte count we need to know your log file size. Anyway, as I say, it is unlikely that this is directly related to the pauses.
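    The decoding described above can be sketched in a few lines of plain Java. This is an illustration, not JE's own API; the class and method names are hypothetical. The byte estimate also assumes every log file is exactly the configured maximum size, here taken to be 10,000,000 bytes (the usual je.log.fileMax default; substitute your own setting), which real files only approximate.

```java
public class LsnDecode {

    /** File number: the high 32 bits of the LSN. */
    public static long fileNumber(long lsn) {
        return lsn >>> 32;
    }

    /** Offset within the file: the low 32 bits of the LSN. */
    public static long fileOffset(long lsn) {
        return lsn & 0xFFFFFFFFL;
    }

    /**
     * Approximate bytes of log between two LSNs, assuming each log file is
     * exactly fileSize bytes long (a simplifying assumption).
     */
    public static long approxBytesBetween(long startLsn, long endLsn, long fileSize) {
        long files = fileNumber(endLsn) - fileNumber(startLsn);
        return files * fileSize + fileOffset(endLsn) - fileOffset(startLsn);
    }

    public static void main(String[] args) {
        long start = 1109819556300535L; // checkPointStart from the stats
        long end = 1109823853844035L;   // checkPointEnd from the stats
        System.out.printf("start: file %d (0x%x), offset %d%n",
                fileNumber(start), fileNumber(start), fileOffset(start));
        System.out.printf("end:   file %d (0x%x), offset %d%n",
                fileNumber(end), fileNumber(end), fileOffset(end));
        // Assumed log file size; replace with your je.log.fileMax value.
        System.out.println("approx bytes: " + approxBytesBetween(start, end, 10_000_000L));
    }
}
```

    With these inputs it reports file 258400 at offset 7014135 and file 258401 at offset 9590339, or roughly 12.6 MB of log written during the checkpoint under the 10 MB file size assumption.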

    --mark

  • 5b460d5c-b14f-4119-a7cc-f900ee261016
    edited Oct 11, 2016 2:13AM

    Hi Mark,

    Thanks for the response. We did check the GC logs and they look clean. iostat shows intermittent bursts of heavy IO, which makes us believe that the OS file cache reaches a threshold at which it flushes (destages) dirty pages while blocking the application threads.

    Note that our application is IO intensive: 80% of the data is written directly to the file system and 20% goes to BDB. We may need to tune the OS parameters for flushing. I found some information about this here: https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
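    To check whether the pauses line up with a buildup of dirty page cache, one can sample the Dirty and Writeback fields of /proc/meminfo once per second and compare the timestamps with the slow requests. Roughly speaking, once dirty memory reaches the vm.dirty_ratio limit discussed in the linked article, the kernel throttles processes that write. The sketch below is Linux-only, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirtyPageMonitor {

    /** Returns the value in kB for a /proc/meminfo key such as "Dirty", or -1 if absent. */
    public static long parseKb(String meminfo, String key) {
        for (String line : meminfo.split("\n")) {
            // Lines look like "Dirty:            123456 kB"; the colon keeps
            // "Writeback" from matching "WritebackTmp".
            if (line.startsWith(key + ":")) {
                String[] parts = line.substring(key.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        Path meminfo = Paths.get("/proc/meminfo"); // Linux only
        if (!Files.exists(meminfo)) {
            System.out.println("/proc/meminfo not found (not Linux?)");
            return;
        }
        for (int i = 0; i < 10; i++) {
            String text = new String(Files.readAllBytes(meminfo));
            System.out.printf("%d Dirty=%d kB Writeback=%d kB%n",
                    System.currentTimeMillis(), parseKb(text, "Dirty"), parseKb(text, "Writeback"));
            Thread.sleep(1000);
        }
    }
}
```

    If Dirty climbs steadily and then drops sharply while Writeback spikes at the same moments the slow writes occur, lowering vm.dirty_background_ratio so that background flushing starts earlier is one option the linked article discusses.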

    I will update on the progress. Thanks again.

    Sathish

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Nov 14, 2016 4:39PM

    Sounds like it is IO, then. Here is what we recommend for our NoSQL DB product (which uses BDB JE):

    Appendix D. Tuning

    --mark
