This discussion is archived
5 Replies. Latest reply: May 19, 2011 7:39 AM by Linda Lee

improving 99th percentile read latency

788748 Newbie
Hello,

I'm seeing very high 99th percentile read latency, and I'm trying to figure out which settings I should look into. Please let me know if you have any suggestions.

Setup:
- Berkeley DB Java Edition 4.1.6
- Operating System: RHEL4
- Number of records: 20 million
- Record key size: ~20B
- Record value size: 4KB
- Java heap size: 4GB
- je.properties has 2 entries: je.env.isTransactional true, je.log.fileMax 1073741824

Workload:
- Database is accessed through RPC server with 32 worker threads.
- 32 RPC client processes.
- 550 reads/second

Latency:
Average: 24 ms
95th percentile: 60 ms
99th percentile: 235 ms

Thanks!
--Michi
  • 1. Re: improving 99th percentile read latency
    Linda Lee Journeyer
    Michi,

    The first thing you'll have to do is to get some idea of what is causing the high latencies. Without some hints of that sort, it's not really possible to recommend any general settings. Some of the topics we think about when we see high operation latencies are:

    1. Java GC activity
    2. BDBJE checkpointing activity
    3. BDBJE log cleaning
    4. BDBJE cache misses

    EnvironmentStats, obtained via Environment.getStats(), has some fields that tell you whether (2), (3) or (4) occurred.
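
    EnvironmentStats covers items (2) through (4); item (1) can be watched with the JDK's own management beans. The following is a minimal sketch, not part of BDBJE: the class name GcSampler and its method are hypothetical, and it only uses java.lang.management. It reports how many collections ran, and how long they took, since the previous sample, so the deltas can be lined up against latency measurements.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reports GC activity between successive samples. */
final class GcSampler {
    private long lastCount;
    private long lastTimeMs;

    /**
     * Returns {collections, collectionTimeMs} accumulated since the previous call.
     * The first call returns the totals since JVM start.
     */
    synchronized long[] sampleDelta() {
        long count = 0;
        long timeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            timeMs += Math.max(0, gc.getCollectionTime());
        }
        long[] delta = {count - lastCount, timeMs - lastTimeMs};
        lastCount = count;
        lastTimeMs = timeMs;
        return delta;
    }

    public static void main(String[] args) throws InterruptedException {
        GcSampler sampler = new GcSampler();
        // Sample once per second; a real application would log this next to its latency numbers.
        for (int i = 0; i < 3; i++) {
            long[] d = sampler.sampleDelta();
            System.out.println("gcCount=" + d[0] + " gcTimeMs=" + d[1]);
            Thread.sleep(1000);
        }
    }
}
```

    The same periodic loop is a natural place to call Environment.getStats() as well, so that GC and BDBJE numbers share one timeline.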

    Of course, there are all sorts of application specific possibilities that must be considered:
    - are BDB data operations evenly distributed over time, or is there some kind of peak load point which is causing high latency?
    - if the latency is a measurement of your application operation, do some application operations translate into more than the usual number of BDB data operations?
    - is there other non-BDB activity that is spiking?

    We often instruct people to take periodic samples of Environment.getStats() and Java GC stats, look at those values over time, and correlate them to the performance seen. This works well when assessing throughput, but it can be harder to find latency outliers that way. In our own test frameworks, one useful tool is a utility that takes thread dumps when latency spikes, to give us a snapshot of what activity is happening throughout the JVM. These thread dumps can be obtained through java.lang.Thread.getAllStackTraces(). You could write your own simple version of this for your own application.

    Dumping stack traces is a very disruptive activity, so you have to apply some thought to this. You'd want the thread dumps to be triggered when an operation takes a long time, and you want to be sure that your application does the dump once, or periodically, so as not to completely skew the test.
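
    A simple version of such a utility might look like the sketch below. It is an illustration only, not our test framework's code: the class name SlowOpDumper, the threshold, and the minimum interval between dumps are all hypothetical choices. It takes a dump only when an operation exceeds the threshold, and at most once per interval, so that the dumps do not skew the test.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: dumps all thread stacks when an operation is slow, rate limited. */
final class SlowOpDumper {
    private final long thresholdNanos;
    private final long minIntervalNanos;
    private final AtomicLong lastDumpNanos;

    SlowOpDumper(long thresholdMs, long minIntervalMs) {
        this.thresholdNanos = thresholdMs * 1_000_000L;
        this.minIntervalNanos = minIntervalMs * 1_000_000L;
        // Start far enough in the past that the first slow operation is allowed to dump.
        this.lastDumpNanos = new AtomicLong(System.nanoTime() - minIntervalNanos);
    }

    /** Call after each operation with its elapsed time. Returns the dump text, or null if no dump was taken. */
    String maybeDump(long elapsedNanos) {
        if (elapsedNanos < thresholdNanos) {
            return null;
        }
        long now = System.nanoTime();
        long last = lastDumpNanos.get();
        // Rate limit: at most one dump per interval; the CAS lets only one thread win.
        if (now - last < minIntervalNanos || !lastDumpNanos.compareAndSet(last, now)) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            sb.append(e.getKey().getName()).append(" [").append(e.getKey().getState()).append("]\n");
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        SlowOpDumper dumper = new SlowOpDumper(100, 60_000);
        long start = System.nanoTime();
        Thread.sleep(150); // stands in for a slow read
        String dump = dumper.maybeDump(System.nanoTime() - start);
        System.out.println(dump != null ? dump : "no dump taken");
    }
}
```

    In a worker thread, the call would wrap each read: record System.nanoTime() before the operation, then pass the difference to maybeDump() afterwards and log any non-null result.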

    That's just one suggestion for gathering some more data. If you can find more clues, we may be able to follow on with recommendations.

    Regards,

    Linda
  • 2. Re: improving 99th percentile read latency
    788748 Newbie
    Linda, thank you for your quick response!

    > 1. Java GC activity
    > 2. BDBJE checkpointing activity
    > 3. BDBJE log cleaning
    > 4. BDBJE cache misses
    >
    > EnvironmentStats, obtained via Environment.getStats(), has some fields that tell you whether (2), (3) or (4) occurred.

    I will definitely try that. I already tried turning off the checkpointer and cleaner threads, but I haven't paid attention to Java GC and cache misses.

    > - are BDB data operations evenly distributed over time, or is there some kind of peak load point which is causing high latency?

    BDB operations are evenly distributed over time.

    > - if the latency is a measurement of your application operation, do some application operations translate into more than the usual number of BDB data operations?

    No, there is only one type of operation (read), and that translates to one BDB read.

    > - is there other non-BDB activity that is spiking?

    I don't think so, but I'll closely monitor it the next time.

    > We often instruct people to take periodic samples of Environment.getStats() and Java GC stats, look at those values over time, and correlate them to the performance seen. This works well when assessing throughput, but it can be harder to find latency outliers that way. In our own test frameworks, one useful tool is a utility that takes thread dumps when latency spikes, to give us a snapshot of what activity is happening throughout the JVM. These thread dumps can be obtained through java.lang.Thread.getAllStackTraces(). You could write your own simple version of this for your own application.

    I'll give it a try.

    Thanks!
    --Michi
  • 3. Re: improving 99th percentile read latency
    Linda Lee Journeyer
    Michi,

    A colleague had some additional thoughts to add. He said:

    If the read load is truly random, and since the cache size, at 2.4G, can only keep a small fraction (about 3%) of the data (> 80G: 20 million records × 4KB) in the JE cache, the user may simply be bottlenecking on disk IOPS (input/output operations per second).

    Running

    iostat -xdm 1

    will give the disk utilization and service times. The 235 ms 99th percentile latency looks suspiciously like the time spent in the disk wait queue when all of the roughly 30 threads are waiting for the disk to work through an occasional pileup, where there are no cache hits (in JE or the FS cache) and the latency is thus roughly 30 threads × 8 ms per seek = 240 ms.

    In that case, it might even be possible that reducing the number of threads has an impact on the latency outliers.
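
    The back-of-the-envelope estimate can be written down as a small sketch. It assumes the simplest possible model, which is only an approximation: every thread has one read outstanding, the disk serves them one at a time, and each takes about 8 ms. The class and method names are hypothetical.

```java
/** Hypothetical helper: worst-case disk queue wait under a one-at-a-time service model. */
final class QueueWaitEstimate {
    /** Wait, in ms, for a request that queues behind `outstanding` requests of `serviceMs` each. */
    static double worstCaseWaitMs(int outstanding, double serviceMs) {
        return outstanding * serviceMs;
    }

    public static void main(String[] args) {
        double seekMs = 8.0; // assumed per-seek service time
        for (int threads : new int[] {30, 16, 8}) {
            System.out.printf("%d threads -> ~%.0f ms worst-case wait%n",
                    threads, worstCaseWaitMs(threads, seekMs));
        }
    }
}
```

    With 30 threads the model gives 240 ms, close to the observed 235 ms; halving the thread count halves the estimate, which is why fewer threads may trim the outliers even though total throughput stays disk-bound.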

    Regards,

    Linda
  • 4. Re: improving 99th percentile read latency
    788748 Newbie
    Hi Linda,

    > If the read load is truly random, and since the cache size, at 2.4G, can only keep a small fraction of the (> 80G data) in JE cache, the user may simply be bottlenecking on disk IOPs (input/output operations)
    >
    > Running
    >
    > iostat -xdm 1
    >
    > will give the disk utilization and service times.

    I think you are right: the disk is maxed out. Here is the iostat output (sorry, iostat on my box doesn't have the -m option):
    Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s    wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
    sda          0.00   4.12 645.36  2.06 18210.31   49.48  9105.15    24.74    28.20     5.27    8.12   1.59 102.99
    sda          0.00   0.00 664.58  0.00 19700.00    0.00  9850.00     0.00    29.64     5.46    8.27   1.56 103.96
    sda          0.00 115.31 766.33  8.16 23502.04  987.76 11751.02   493.88    31.62     7.42    9.46   1.32 102.24
    sda          0.00   0.00 722.68  0.00 22647.42    0.00 11323.71     0.00    31.34     6.54    9.07   1.43 103.09
    sda          0.00   0.00 741.05  0.00 22130.53    0.00 11065.26     0.00    29.86     6.42    8.73   1.41 104.53
    sda          0.00   1.03 700.00  2.06 20420.62   24.74 10210.31    12.37    29.12     6.21    8.83   1.47 103.30
    sda          0.00   0.00 677.32  0.00 20049.48    0.00 10024.74     0.00    29.60     6.06    8.97   1.52 103.20
    sda          1.03 134.02 680.41  7.22 20618.56 1129.90 10309.28   564.95    31.63     6.08    8.82   1.49 102.27
    sda          0.00   0.00 751.04  0.00 21833.33    0.00 10916.67     0.00    29.07     6.46    8.62   1.39 104.38

    > In that case, it might even be possible that reducing the number of threads has an impact on the latency outliers.

    I reduced the number of worker threads and the number of client processes to 16, and the 99th percentile latency improved significantly (from 235ms to 75ms). At this point, I'm thinking that there are 3 options:

    - get a better disk
    - increase MAX_MEMORY to reduce cache misses
    - reduce concurrency

    Let me know if there are other things I should pay attention to.

    Thanks!
    --Michi
  • 5. Re: improving 99th percentile read latency
    Linda Lee Journeyer
    Michi,

    > I think you are right: disk is maxed out. Here is the iostat output (sorry it didn't have -m option on my box):
    > [iostat output snipped]

    Yes, it does seem that you have high activity, and are incurring multiple IO operations for each of your read requests.
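
    One rough way to check this is to compare the disk read rate with the application read rate. The sketch below is only an estimate: it uses the r/s column from the nine iostat samples posted in this thread, and it assumes those samples were taken under the 550 reads/second workload described in the first post. The class name is hypothetical.

```java
/** Hypothetical helper: estimates disk reads issued per application read. */
final class IostatCheck {
    // r/s column from the nine iostat samples posted in this thread.
    static final double[] DISK_READS_PER_SEC = {
        645.36, 664.58, 766.33, 722.68, 741.05, 700.00, 677.32, 680.41, 751.04
    };
    // Assumption: the samples were captured while serving 550 application reads per second.
    static final double APP_READS_PER_SEC = 550.0;

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    static double diskReadsPerAppRead() {
        return mean(DISK_READS_PER_SEC) / APP_READS_PER_SEC;
    }

    public static void main(String[] args) {
        System.out.printf("mean disk r/s = %.2f%n", mean(DISK_READS_PER_SEC));
        System.out.printf("disk reads per application read = %.2f%n", diskReadsPerAppRead());
    }
}
```

    Under that assumption the ratio comes out above one, meaning that on average a read request costs more than a single disk read, which is consistent with misses on btree nodes as well as on the data record.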

    > - get a better disk
    > - increase MAX_MEMORY to reduce cache misses
    > - reduce concurrency

    In regards to reducing cache misses, you may also want to experiment with the BDBJE cache policies and CacheMode (http://download.oracle.com/docs/cd/E17277_02/html/java/com/sleepycat/je/CacheMode.html). Looking at the BDBJE environment stats can give you more information about the number of cache misses, and what is being evicted, to help guide you as to what settings to experiment with. Information about changing the caching policy can be found in the FAQ and the CacheMode javadoc.

    For example, looking at the stats may tell you whether you are missing on one, two, or three levels of the btree when you do read requests. Setting je.evictor.lruOnly=false and je.evictor.nodesPerScan=200 in your EnvironmentConfig settings can help to keep the internal nodes (INs) of the btree in cache. If you know, from running com.sleepycat.je.util.DbCacheSize, that your database cannot fit entirely in cache, and your reads are entirely random, you may find it useful to specify one of the CacheModes in your read operations.
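
    As a sketch, the two evictor settings named above could also be written as entries in je.properties, the file already used in this thread for je.env.isTransactional and je.log.fileMax. This assumes the JE version in use recognizes both parameters, so check them against its EnvironmentConfig javadoc before relying on them.

```
# je.properties fragment (sketch; values taken from the paragraph above)
je.evictor.lruOnly=false
je.evictor.nodesPerScan=200
```

    These are starting points for experimentation, to be judged by the cache miss and eviction counts in the environment stats.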

    Increasing max memory is a mixed bag. The memory not allocated to the JVM heap can be used by the machine for its own file system cache, which can really benefit JE operations. If increasing the JE cache ensures that the entire database can fit within cache, it's always a win. But if only part of the database fits, increasing the JE cache will also decrease the size of the file system cache and the number of LNs (data records) that may be fortuitously stored there, and that can offset the advantage. An increased JVM Heap can also put more stress on GC.

    For example, an application with random access reads sometimes works best when you actually decrease the cache size so that just the internal nodes (INs and BINs) fit within cache, as determined by DbCacheSize, and specify the CacheMode.EVICT_LN option. That means every read operation will incur a single miss to read the data record, but the data record will hopefully be sitting within the file system cache, and GC will be less stressed.

    Hope that these pointers help you in your tuning!

    Linda

    Edited by: Linda Lee on May 19, 2011 10:38 AM
    Typo: TC -> GC
