7 Replies. Latest reply: Mar 26, 2012 9:25 AM by greybird

Estimating cache miss cost on SSDs

vinothchandar Newbie
Hi,

We recently moved to SSDs, which in theory should let us tolerate many more cache misses. Given that I know the depth d of the Btree (based on the number of keys) and an average SSD access time of t ms,
can I safely estimate the worst-case cache miss as taking time proportional to t x d? (Not accounting for contention, of course.)

I tried to determine this experimentally by setting the BDB cache size to the 96 KB minimum, hoping that it would stress the disk on every operation. But I found that overall throughput dropped due to contention within BDB.
What is a good way to go about determining a loose worst-case bound for this?
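
Roughly, the setup I tried looks like the sketch below (assuming JE's standard Environment/EnvironmentConfig API; the path and exact sizes are just placeholders):

    import java.io.File;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;

    public class TinyCacheExperiment {
        public static void main(String[] args) {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            // Ask for a ~96 KB cache; JE enforces its own internal minimum.
            envConfig.setCacheSize(96 * 1024);
            Environment env = new Environment(new File("/tmp/je-env"), envConfig);
            // ... open the Database, run timed reads, record latencies ...
            env.close();
        }
    }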

Thanks
Vinoth
  • 1. Re: Estimating cache miss cost on SSDs
    greybird Expert
    Hi Vinoth,

    Contention will definitely be an issue, since a parent Btree node is latched exclusively while fetching its child node. So I don't know of a good way to measure it. That said, t * d is a good rough estimate, disregarding contention.
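
    For example, with a tree depth of 4 and an average SSD random-read latency of around 0.1 ms (illustrative numbers, not measurements), that puts a single worst-case miss at roughly 4 x 0.1 = 0.4 ms, before any contention or GC pauses.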

    Note that regardless of the cache size setting you specify, a minimum of 500 KB is set aside for the Btree cache. Without this, JE can't function in any practical sense.

    You should also be aware that for a read-write application, the BINs (bottom internal nodes) should be in cache for good performance. This is not just an issue of latency caused by file IOs. See:
    http://www.oracle.com/technetwork/database/berkeleydb/je-faq-096044.html#WhyshouldtheJEcachebelargeenoughtoholdtheBtreeinternalnodes

    --mark
  • 2. Re: Estimating cache miss cost on SSDs
    vinothchandar Newbie
    Hi Mark,

    The problem is that our data sets are big enough that all the internal nodes won't fit inside the BDB cache.
    For example, the following is for one of our stores. It needs a maximum of 27 GB, which is not possible (since we also have other databases on the same machine).

    inputs: records=178680261 keySize=78 dataSize=-1 nodeMax=512 binMax=512 density=80% overhead=10%

    === Cache Sizing Summary ===

         Cache Size       Btree Size  Description
    ---------------  ---------------  -----------
     25,723,870,731   23,151,483,658  Minimum, internal nodes only
     27,216,517,984   24,494,866,186  Maximum, internal nodes only


    In practice, workloads tend to have hotsets, as mentioned in the FAQ. My understanding is that as long as the cache is big enough to house the hotset (which should naturally happen due to LRU), we should be good? As you can see, the data set here is only 178M records, and it's not possible to have a big enough cache.

    That said, when planning for the future (which is what I am doing right now), it's hard to predict how this hotset will grow. The workloads I deal with are almost always read heavy (hence the BDB cleaners do not come into play). So I was trying to estimate the worst-case latency of going to disk. I am effectively trying to come up with a capacity model here, and if you add GC to the picture, wow, it's hard!

    Comments or experiences welcome.

    Thanks
    Vinoth
  • 3. Re: Estimating cache miss cost on SSDs
    greybird Expert
    In practice, workloads tend to have hotsets, as mentioned in the FAQ. My understanding is that as long as the cache is big enough to house the hotset (which should naturally happen due to LRU), we should be good?
    Yes. My intention in mentioning the item in the FAQ was to discourage you from running in a mode where the Btree internal nodes for the hot data, especially data being written, are not resident in the cache. If you're not aware of the issue in the FAQ, you might believe that as long as disk IO is fast enough (for example, using an SSD), keeping the internal nodes in the cache is not necessary.
    inputs: records=178680261 keySize=78 dataSize=-1 nodeMax=512 binMax=512 density=80% overhead=10%
    This output indicates you're not running JE 5 yet. You'll see different values from DbCacheSize in JE 5. The JE 5 DbCacheSize javadoc also talks about key size and a config param that can reduce the amount of memory used by internal nodes by quite a bit. Reducing the amount of memory needed has a big impact of course.
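
    (As an aside, and not necessarily the parameter referred to above: one JE setting that shrinks the memory used for keys in internal nodes is key prefixing, sketched here.)

        import com.sleepycat.je.DatabaseConfig;

        public class PrefixingConfig {
            static DatabaseConfig makeConfig() {
                DatabaseConfig dbConfig = new DatabaseConfig();
                dbConfig.setAllowCreate(true);
                // Store shared key prefixes once per Btree node rather than per
                // entry, which can noticeably reduce internal-node memory when
                // keys have common prefixes.
                dbConfig.setKeyPrefixing(true);
                return dbConfig;
            }
        }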
    That said, when planning for the future (which is what I am doing right now), it's hard to predict how this hotset will grow. The workloads I deal with are almost always read heavy (hence the BDB cleaners do not come into play). So I was trying to estimate the worst-case latency of going to disk. I am effectively trying to come up with a capacity model here, and if you add GC to the picture, wow, it's hard!
    I see, that makes sense; I believe your thinking is correct. Yeah, it is hard. Sorry I don't have more suggestions.

    The only approach we know works for certain is to actually test the workload at scale, and that takes time and also hardware resources. We have been able to extrapolate with some success when testing with smaller hardware and workloads (and then multiplying, plus adding a fudge factor of some kind). But that doesn't always work; for example, when you cross the 32 GB heap size boundary you'll see a big change.

    --mark
  • 4. Re: Estimating cache miss cost on SSDs
    vinothchandar Newbie
    Hi Mark,

    Our situation is tricky since we need to be able to tell whether we can add one more database to the same instance without pushing latencies beyond a guaranteed maximum.
    So there are more dimensions to scaling:

    1. by ops/sec
    2. by more databases
    3. by number of entries, which will increase tree size

    Anyway, thanks for helping out. Looks like the practical option is to use a rough d * t estimate and just keep GC out of the way as much as possible.
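
    In code form, the back-of-envelope estimate I have in mind looks roughly like this (every number below is an illustrative assumption, not a measurement):

        public class WorstCaseEstimate {
            public static void main(String[] args) {
                int treeDepth = 4;           // d: assumed Btree depth for this store
                double ssdReadMs = 0.15;     // t: assumed average SSD random-read latency (ms)
                double gcHeadroomMs = 2.0;   // assumed allowance for a minor GC pause (ms)
                double worstCaseMs = treeDepth * ssdReadMs + gcHeadroomMs;
                System.out.printf("worst-case cache-miss read ~= %.2f ms%n", worstCaseMs);
            }
        }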

    One final quick question ... contention is only for writes, I believe? Can reads with LockMode.READ_UNCOMMITTED be blocked by writes in a concurrent transaction?


    Thanks
    Vinoth
  • 5. Re: Estimating cache miss cost on SSDs
    greybird Expert
    There are two types of blocking. Record-level blocking has to do with two transactions accessing the same record. You have isolation modes in JE to control this. As you say, with LockMode.READ_UNCOMMITTED there is no record-level blocking. With other lock modes, record locks may be held for the duration of a transaction.
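
    For illustration, a dirty read that bypasses record-level locking looks like this (a minimal sketch; 'db' is assumed to be an already-open Database handle):

        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseEntry;
        import com.sleepycat.je.LockMode;
        import com.sleepycat.je.OperationStatus;

        public class DirtyReadExample {
            static OperationStatus dirtyRead(Database db, byte[] keyBytes, DatabaseEntry data) {
                DatabaseEntry key = new DatabaseEntry(keyBytes);
                // READ_UNCOMMITTED does not wait on record locks held by concurrent writers.
                return db.get(null, key, data, LockMode.READ_UNCOMMITTED);
            }
        }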

    The contention I was referring to earlier is Btree node latching, which is not record-level blocking. When a Btree child node is not present in the cache and must be fetched (to do a read or write), its parent node is latched exclusively while fetching the child from the file system and placing a reference to the child in its parent slot.

    In the worst theoretical case, imagine that the root node is the only node resident in the cache (if you were able to set the cache that small, which you're not). Every time you do a read or write, that node will be latched exclusively in order to fetch its child node (and so on) in order to finally fetch the leaf node and perform the operation. This would serialize all access to the Btree.

    In what we hope to be the common case, all internal nodes (or those in the hot areas) are in cache. When doing a read, shared latching is used to traverse down to the BIN (bottom internal node). The BIN is latched exclusively in order to fetch the LN, but all parent nodes (above the BIN) were latched shared.

    So there is normally little contention, although some have reported contention on the BIN if many threads access records in the same BIN, because the BIN is latched exclusively. There are 128 entries per node by default and you can configure this with EnvironmentConfig.NODE_MAX_ENTRIES or DatabaseConfig.setNodeMaxEntries.
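
    For example (a sketch; 512 simply mirrors the nodeMax value shown earlier in this thread):

        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.EnvironmentConfig;

        public class NodeSizeConfig {
            static void configure(EnvironmentConfig envConfig, DatabaseConfig dbConfig) {
                // Environment-wide default for entries per Btree node.
                envConfig.setConfigParam(EnvironmentConfig.NODE_MAX_ENTRIES, "512");
                // Or override it for a single database.
                dbConfig.setNodeMaxEntries(512);
            }
        }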

    Also note that latch coupling is used. When traversing down the tree from the root, after getting (and possibly fetching) the child node, and after latching it, the parent latch is released. Also, latches are never held in between operations (all latches are released before any API call returns) or internally while waiting for a record lock.

    --mark

    Edited by: greybird on Mar 23, 2012 10:49 AM
  • 6. Re: Estimating cache miss cost on SSDs
    vinothchandar Newbie
    Ah, I see... I'd like to learn more about this. Is the architecture white paper (from Sept 2006) still relevant for understanding the latching and the general flow of operations?

    Thanks a lot, Mark. Your help is greatly appreciated.
  • 7. Re: Estimating cache miss cost on SSDs
    greybird Expert
    The basic latching approach hasn't changed since then, but I'm sure many details have changed. Feel free to ask specific questions.
    --mark
