
Changing nodeMaxEntries

vinothchandar Newbie
Hi,

I am considering changing our B+Tree fanout (NODE_MAX_ENTRIES) from 512 to the default of 128 (all our data is currently at the 512 fanout).
From experimentation, I have found that this reduces writing/cleaning by fitting more BINs in memory (steady-state random workload).

I have two questions:

1. Is it possible to change the fanout without data conversion? How does this work in 4.1.x and 5.0.73?

2. Given that 5.0.73 fixes problems around BIN eviction (i.e., it writes less metadata by logging deltas on eviction too), what is the real trade-off here between a higher and a lower fanout?



Thanks
Vinoth
  • 1. Re: Changing nodeMaxEntries
    Bogdan Coman Journeyer
    Hi Vinoth,
    vinothchandar wrote:
    1. Is it possible to change the fanout without data conversion?
    The fanout can be modified after the database has been created, as indicated by this table: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/DatabaseConfig.html

    Regarding data conversion, the existing Btree nodes are not converted to use the new value. The new value is used only when a new Btree node is created as a result of a record insertion. So if there are old records that will stick around, you may be better off (from a performance standpoint) reloading the database into a new one.
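    For reference, setting the new fanout when reopening the database is just a DatabaseConfig change; here is a minimal sketch (the environment path and database name are hypothetical):

        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;
        import java.io.File;

        public class SetFanout {
            public static void main(String[] args) {
                EnvironmentConfig envConfig = new EnvironmentConfig();
                envConfig.setAllowCreate(true);
                Environment env = new Environment(new File("/path/to/env"), envConfig);

                DatabaseConfig dbConfig = new DatabaseConfig();
                dbConfig.setAllowCreate(true);
                // In JE 5 the fanout is a mutable, persistent per-database attribute.
                // Existing Btree nodes keep their old fanout; newly created nodes use 128.
                dbConfig.setNodeMaxEntries(128);
                Database db = env.openDatabase(null, "myDatabase", dbConfig);

                db.close();
                env.close();
            }
        }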
    How does this work in 4.1.x and 5.0.73?
    What changed between 4.1 and 5.0 is that the node fanout was a mutable but transient attribute in 4.1, whereas in 5.0 it is a mutable and persistent database attribute. This means that in 4.1 the new value could sometimes revert to the previously existing setting. This was fixed in [#18262].
    2. Given that 5.0.73 fixes problems around BIN eviction (i.e., it writes less metadata by logging deltas on eviction too), what is the real trade-off here between a higher and a lower fanout?
    It is hard to speculate here without knowing how big the cache is (more exactly, whether the BINs fit in cache or not). Since you mention the JE 5 BIN eviction design change, my understanding is that the cache is not big enough, so I suggest you read more about the trade-offs here: http://www.oracle.com/technetwork/database/berkeleydb/je-faq-096044.html#WhyshouldtheJEcachebelargeenoughtoholdtheBtreeinternalnodes

    You are correct about BIN eviction; you are referring to [#19671], which also mentions:
    "By significantly reducing writing, the new approach provides overall performance improvements. However, there is also an additional cost to the new approach: When a BIN is not in cache, fetching the BIN now often requires two random reads instead of just one; one read to fetch the BINDelta and another to fetch the last full BIN. For applications where all active BINs fit in cache, this adds to the I/O cost of initially populating the cache. For applications where active BINs do not fit in cache, this adds to the per-operation cost of fetching a record (an LN) when its parent BIN is not in cache. In our tests, the lower write rate more than compensates for the additional I/O of fetching the BINDelta, but the benefit is greatest when all BINs fit in cache."

    Now, a lower fanout (smaller internal nodes, i.e. INs and BINs) increases the percentage of JE metadata overhead in the on-disk footprint of each internal node, and it increases the depth of the Btree, but in exchange it gives you a smaller unit to write to disk when cleaning. Maybe you obtained better results when testing with a smaller fanout (I still don't know which version you tested with: 4.1, 5.0, or both?) because smaller INs let you pull smaller-granularity items into the cache. A cache statistics output would allow me to confirm whether this is the case.
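    Collecting that output is cheap; something like the following sketch, printed periodically (it assumes you already hold the Environment handle), would be enough:

        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentStats;
        import com.sleepycat.je.StatsConfig;

        public class StatsDump {
            // Dump and reset environment statistics, e.g. once per minute.
            // The printed output includes cache and cleaner counters such as
            // nCacheMiss and nCleanerEntriesRead.
            public static void dumpStats(Environment env) {
                StatsConfig config = new StatsConfig();
                config.setClear(true); // clear counters so each dump is a per-interval delta
                EnvironmentStats stats = env.getStats(config);
                System.out.println(stats);
            }
        }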

    Thanks,
    Bogdan Coman
  • 2. Re: Changing nodeMaxEntries
    vinothchandar Newbie
    the existing Btree nodes are not converted to use the new value. The new value is used only when a new Btree node is created as a result of a record insertion.
    If I understand correctly, cleaning and updates will not change the fanout to the lower value? [I guess that seems fair, since if you create 4 BINs from a single BIN on the log (512 fanout to 128 fanout), you would also need to adjust the parents to make room for four pointers instead of one.]
    Maybe you obtained better results when testing with a smaller fanout (I still don't know which version you tested with: 4.1, 5.0, or both?) because smaller INs let you pull smaller-granularity items into the cache. A cache statistics output would allow me to confirm whether this is the case.
    I think so. In almost all of our data sets, the BINs don't fit in memory. Here are some results from 4.1.17.
    I think it's a 200M-key DB with 100-byte keys and 1 KB values:

    Fanout 128, cache size 1GB
    avgCleanerEntriesRead/min : 35K
    avgNumReadBytes/min : 7639MB
    avgNumWriteBytes/min : 1670MB

    Fanout 256, cache size 1GB
    avgCleanerEntriesRead/min : 52K
    avgNumReadBytes/min : 7847MB
    avgNumWriteBytes/min : 1786MB

    Fanout 512, cache size 1GB
    avgCleanerEntriesRead/min : 79K
    avgNumReadBytes/min : 8414MB
    avgNumWriteBytes/min : 2087MB

    I am going to try some runs with BDB-JE 5.0.73. Given that eviction now logs delta records, I am beginning to think it may be okay to stay at 512. We just converted all of the data to a new on-disk format; another data conversion would be operationally expensive for us. :)

    Thanks
    Vinoth
  • 3. Re: Changing nodeMaxEntries
    Bogdan Coman Journeyer
    vinothchandar wrote:
    the existing Btree nodes are not converted to use the new value. The new value is used only when a new Btree node is created as a result of a record insertion.
    If I understand correctly, cleaning and updates will not change the fanout to the lower value? [I guess that seems fair, since if you create 4 BINs from a single BIN on the log (512 fanout to 128 fanout), you would also need to adjust the parents to make room for four pointers instead of one.]
    The already existing tree will keep the 512 format until it gets updated. This would be useful only if the old records get updated/deleted by the application; if the old records stick around, reloading the database into a new one makes more sense from a performance point of view.
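    A reload can be a straightforward cursor copy into a database created with the new setting; a rough sketch (database names are hypothetical, and a production reload would want batching and error handling):

        import com.sleepycat.je.Cursor;
        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.DatabaseEntry;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.LockMode;
        import com.sleepycat.je.OperationStatus;

        public class ReloadDb {
            public static void reload(Environment env) {
                Database oldDb = env.openDatabase(null, "store", new DatabaseConfig());

                DatabaseConfig newConfig = new DatabaseConfig();
                newConfig.setAllowCreate(true);
                newConfig.setNodeMaxEntries(128); // every node in the new Btree uses 128
                Database newDb = env.openDatabase(null, "store-128", newConfig);

                // Copy every record; new Btree nodes are built with the new fanout.
                DatabaseEntry key = new DatabaseEntry();
                DatabaseEntry data = new DatabaseEntry();
                Cursor cursor = oldDb.openCursor(null, null);
                while (cursor.getNext(key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                    newDb.put(null, key, data);
                }
                cursor.close();
                oldDb.close();
                newDb.close();
                // Afterwards the old database can be removed and the new one renamed:
                // env.removeDatabase(null, "store"); env.renameDatabase(null, "store-128", "store");
            }
        }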
    Maybe you obtained better results when testing with a smaller fanout (I still don't know which version you tested with: 4.1, 5.0, or both?) because smaller INs let you pull smaller-granularity items into the cache. A cache statistics output would allow me to confirm whether this is the case.
    I think so. In almost all of our data sets, the BINs don't fit in memory. Here are some results from 4.1.17.
    I think it's a 200M-key DB with 100-byte keys and 1 KB values:

    Fanout 128, cache size 1GB
    avgCleanerEntriesRead/min : 35K
    avgNumReadBytes/min : 7639MB
    avgNumWriteBytes/min : 1670MB

    Fanout 256, cache size 1GB
    avgCleanerEntriesRead/min : 52K
    avgNumReadBytes/min : 7847MB
    avgNumWriteBytes/min : 1786MB

    Fanout 512, cache size 1GB
    avgCleanerEntriesRead/min : 79K
    avgNumReadBytes/min : 8414MB
    avgNumWriteBytes/min : 2087MB
    Thanks for running the tests. Now, a smaller fanout (pulling smaller-granularity items into the cache) is generally better for random-access applications, which yours is. We ran tests/benchmarks to identify the best default value for nodeMaxEntries, which is 128. Increasing it fourfold is a lot, but that is not to say you can't get better results with certain configurations and access patterns. Running a few tests with your application and different values is the best thing to do.
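    As a starting point, the harness can be as simple as this sketch of a steady-state random-update loop with per-minute stats samples (paths, names, and sizes are hypothetical placeholders mirroring the numbers above):

        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.DatabaseEntry;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;
        import com.sleepycat.je.StatsConfig;
        import java.io.File;
        import java.util.Random;

        public class FanoutBench {
            public static void main(String[] args) {
                EnvironmentConfig envConfig = new EnvironmentConfig();
                envConfig.setAllowCreate(true);
                envConfig.setCacheSize(1024L * 1024 * 1024); // 1 GB cache, as in the tests above
                Environment env = new Environment(new File("/path/to/env"), envConfig);

                DatabaseConfig dbConfig = new DatabaseConfig();
                dbConfig.setAllowCreate(true);
                dbConfig.setNodeMaxEntries(128); // the value under test: 128, 256 or 512
                Database db = env.openDatabase(null, "bench", dbConfig);

                StatsConfig perInterval = new StatsConfig();
                perInterval.setClear(true); // each sample becomes a per-interval delta

                Random rnd = new Random();
                byte[] value = new byte[1024]; // ~1 KB values, as in the tests above
                long nextSample = System.currentTimeMillis() + 60000;
                for (long i = 0; i < 1000000000L; i++) { // run long enough to reach steady state
                    // 100-byte zero-padded keys drawn uniformly from a 200M keyspace.
                    byte[] key = String.format("%0100d", rnd.nextInt(200000000)).getBytes();
                    db.put(null, new DatabaseEntry(key), new DatabaseEntry(value));
                    if (System.currentTimeMillis() >= nextSample) {
                        System.out.println(env.getStats(perInterval)); // cleaner and cache counters
                        nextSample = System.currentTimeMillis() + 60000;
                    }
                }
                db.close();
                env.close();
            }
        }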
    I am going to try some runs with BDB-JE 5.0.73. Given that eviction now logs delta records, I am beginning to think it may be okay to stay at 512. We just converted all of the data to a new on-disk format; another data conversion would be operationally expensive for us. :)
    I'm curious about the JE 5 results too, to compare them with the 4.1 results above. Also, maybe you can let us know what your decision is.

    Thanks,
    Bogdan
  • 4. Re: Changing nodeMaxEntries
    vinothchandar Newbie
    The already existing tree will keep the 512 format until it gets updated. This would be useful only if the old records get updated/deleted by the application; if the old records stick around, reloading the database into a new one makes more sense from a performance point of view.
    I am sort of unclear again. Hmm, the application is not guaranteed to update/delete all the keys in a reasonable period (say, a month). But given that a cleaner migration is simply an update, I am hoping most of these "updates" will come from the cleaner. So, will the cleaner "updates" lower the fanout dynamically?

    Right now, we are testing the waters with JE 5.0.73 at the same 512 fanout. If this goes well, the next step would be to try going down to the 128 fanout. Our decision would be to go with the 128 fanout unless it mandates a data conversion. Will keep the forum posted.
  • 5. Re: Changing nodeMaxEntries
    Bogdan Coman Journeyer
    vinothchandar wrote:
    The already existing tree will keep the 512 format until it gets updated. This would be useful only if the old records get updated/deleted by the application; if the old records stick around, reloading the database into a new one makes more sense from a performance point of view.
    I am sort of unclear again. Hmm, the application is not guaranteed to update/delete all the keys in a reasonable period (say, a month). But given that a cleaner migration is simply an update, I am hoping most of these "updates" will come from the cleaner. So, will the cleaner "updates" lower the fanout dynamically?
    LATER EDIT: I was wrong all along about the format update: a new Btree node will not be created due to updates, cleaner migrations (which are equivalent to updates), or deletions. Only when a new IN is created due to a split, caused by many insertions, will the new max-entries setting be used.

    Edited by: Bogdan Coman on Apr 16, 2013 9:17 AM
