Question about IN/BIN usage in BDB-JE

Gaojie Member Posts: 5
edited Apr 28, 2017 5:59PM in Berkeley DB Java Edition

Hi BDB-JE developers/users,

Our team is using BDB-JE (je-5.0.104) as our storage backend. We are seeing a fair amount of disk space occupied by unnecessary B+Tree nodes (INs and BINs) in both our testing and production environments.

Here is the test:

1. Ingest 10M records, each with a key of around 50 bytes and a value of 200 bytes, into a single BDB database;

2. All input records are sorted by key bytes in lexicographic order (we found that sorted input greatly improves the ingestion rate); a sketch of the load loop is below.
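
Roughly, the load loop looks like this (a simplified sketch: the key/value generation is hypothetical, and we use auto-commit, i.e. one transaction per put):

    import java.nio.charset.StandardCharsets;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;

    public class LoadSketch {
        // Assumes an already-opened, transactional Database.
        static void load(Database db) {
            DatabaseEntry key = new DatabaseEntry();
            DatabaseEntry data = new DatabaseEntry();
            for (long i = 0; i < 10_000_000L; i++) {
                // ~50-byte keys, generated in ascending lexicographic order.
                key.setData(String.format("%050d", i).getBytes(StandardCharsets.US_ASCII));
                data.setData(new byte[200]); // 200-byte value
                db.put(null, key, data);     // null txn => one auto-commit txn per put
            }
        }
    }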

Output of 'DbPrintLog':

type           total count  provisional count  total bytes  min bytes  max bytes  avg bytes  entries as % of log
MapLN                   57                  0        35781         28       2120        627             6.60E-04
NameLN_TX                2                  0           96         44         52         48             1.77E-06
FileSummaryLN          191                  0       317032         74       3832       1659       0.005846822627
IN                   37501                269    946486271         40      34306      25238          17.45545353
BIN                  76400              76233   1346145602         33      34077      17619          24.82612028
DbTree                  32                  0         7833         51        439        244             1.44E-04
Commit            10000002                  0    319765880         25         32         31          5.897241863
CkptStart               29                  0          838         27         31         28             1.55E-05
CkptEnd                 29                  0         1794         47         63         61             3.31E-05
BINDelta              1723               1642       708203         39       1193        411         0.0130609444
Trace                    1                  0          109        109        109        109             2.01E-06
FileHeader              87                  0         3306         38         38         38             6.10E-05
INS_LN_TX         10000000                  0   2808822752        272        281        280          51.80135892
key bytes                                        568888890                                            10.49166152
data bytes                                      2000000000                                            36.88474745

Output of 'DbStat':

BTree: Composition of btree, types and counts of nodes.

  binCount=36,860

  binEntriesHistogram=[20-29%: 1; 40-49%: 32,937; 50-59%: 1,538; 60-69%: 161; 70-79%: 56; 80-89%: 5; 90-99%: 2,162]

  binsByLevel=[level 1: count=36,860]

  deletedLNCount=0

  inCount=136

  insByLevel=[level 2: count=135; level 3: count=1]

  lnCount=10,000,000

  mainTreeMaxDepth=3

BTree: Composition of btree, types and counts of nodes.

  nRootSplits=0

  relatchesRequired=0

Based on the output of 'DbStat', the number of INs actually used in the B+Tree is 136, but the total number of INs on disk is 37,501.

The number of BINs actually used in the B+Tree is 36,860, but the total number of BINs on disk is 76,400.

Given that all input records are sorted by key bytes in lexicographic order, the BINDelta overhead is minimal, which is expected.

But why is BDB-JE generating so many obsolete/duplicate INs/BINs during ingestion?

I checked some of the code related to this logic, and it seems BDB-JE force-flushes B+Tree nodes when splitting, but I am not sure whether that is the root cause.

Has anyone else seen similar behavior? Is there a way to tune the configuration to avoid this unnecessary disk usage?

I don't want to rely on the cleaner thread to clean up those unused entries, since that would take a pretty long time.

Any comment is welcome.

Answers

  • Greybird-Oracle Member Posts: 2,690
    edited Apr 25, 2017 2:37PM

    Hi,

    Before answering with details, could you please tell us what sort of production workload you have? Based on your questions, I'm guessing that you do an initial load and then run a read-only workload in production, and you want to minimize disk usage.

    Or do you also do record writes in production? If so, then optimizing disk usage after the initial load won't help, since as time goes on, more BINs/INs will be written and you will have to rely on the JE cleaner.

    --mark

  • Gaojie Member Posts: 5
    edited Apr 25, 2017 2:59PM

    Hi Mark,

    Thanks a lot for the reply.

    You are right, this is a read-only workload in production, and we won't write records to this store after the initial load.

    That is why we would like to minimize the disk usage since there will be a lot of stores in production.

    Please let me know if you need more info.

    Thanks,

    Gaojie

  • Greybird-Oracle Member Posts: 2,690
    edited Apr 25, 2017 4:23PM Accepted Answer

    OK. First, you are right that splits cause extra logging of upper INs (INs that are not BINs). Also, even when loading in key order, a small BIN will be written when the first record is added to it, and again later by the next checkpoint.

    Because you have a read-only workload you can optimize for space fairly easily. During the load:

    1. Open databases in DeferredWrite mode. This will prevent logging of splits and the initial logging of a small BIN.

    2. Disable the checkpointer, so that BINs are not logged at all until you close the environment after the load.

    3. Be sure to configure the JE cache size to be large enough to hold all BINs, so that BINs are not logged when they are evicted.

During the load, you cannot use transactions, because they are not supported by DeferredWrite mode. But when you open this environment later in production, you do not need to open the databases in DeferredWrite mode, and you can use transactions if you wish. A sketch of the load-time setup follows.
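
    Here is a minimal sketch of that setup (JE 5.x APIs; the environment path, cache size, and database name are illustrative):

        import java.io.File;

        import com.sleepycat.je.Database;
        import com.sleepycat.je.DatabaseConfig;
        import com.sleepycat.je.Environment;
        import com.sleepycat.je.EnvironmentConfig;

        public class BulkLoadSetup {
            public static void main(String[] args) throws Exception {
                EnvironmentConfig envConfig = new EnvironmentConfig();
                envConfig.setAllowCreate(true);
                // (2) Disable the checkpointer for the duration of the load.
                envConfig.setConfigParam(
                    EnvironmentConfig.ENV_RUN_CHECKPOINTER, "false");
                // (3) Make the cache large enough to hold all dirty BINs;
                //     1 GB here is purely illustrative.
                envConfig.setCacheSize(1024L * 1024 * 1024);
                Environment env = new Environment(new File("/path/to/env"), envConfig);

                // (1) DeferredWrite mode: splits and small initial BINs are
                //     not logged; transactions are not supported.
                DatabaseConfig dbConfig = new DatabaseConfig();
                dbConfig.setAllowCreate(true);
                dbConfig.setDeferredWrite(true);
                Database db = env.openDatabase(null, "myDatabase", dbConfig);

                // ... put() all records here, in key order ...

                // Closing runs a final checkpoint, flushing each dirty
                // Btree node to disk once.
                db.close();
                env.close();
            }
        }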

    You'll find that BINs and INs are reduced, and LNs are also smaller because there are no transactions. The Commit entries will also disappear.

    See:

    DatabaseConfig (Oracle - Berkeley DB Java Edition API)

    Chapter 7. Databases

    Please let us know how it goes.

    --mark

  • Gaojie Member Posts: 5
    edited Apr 26, 2017 1:04PM

    Hi Mark,

    Thanks a lot for all the useful info.

    I tested with a deferred-write database, with the checkpointer enabled and sorted input.

    The results are very good; here is the output of 'DbPrintLog':

    type           total count  provisional count  total bytes  min bytes  max bytes  avg bytes  entries as % of log
    MapLN                   38                  0        14727         28       1120        387             4.23E-04
    NameLN_TX                1                  0           44         44         44         44             1.26E-06
    NameLN                   1                  0           51         51         51         51             1.46E-06
    FileSummaryLN          116                  0        16884         70        630        145             4.85E-04
    IN                     226                 15      3410714         40      34102      15091        0.09792861847
    BIN                  40681              40681    740507037         33      34072      18202          21.26148106
    DbTree                  23                  0         5032         51        400        218             1.44E-04
    Commit                   1                  0           25         25         25         25             7.18E-07
    CkptStart               20                  0          577         27         31         28             1.66E-05
    CkptEnd                 20                  0         1173         47         60         58             3.37E-05
    BINDelta                17                 16         9868        331       3565        580             2.83E-04
    Trace                    1                  0          112        112        112        112             3.22E-06
    FileHeader              56                  0         2128         38         38         38             6.11E-05
    UPD_LN            10000000                  0   2738888890        268        274        273          78.63913689
    key bytes                                        568888890                                            16.33397085
    data bytes                                      2000000000                                            57.42411617

    Output of 'DbStat':

    BTree: Composition of btree, types and counts of nodes.

      binCount=36,860

      binEntriesHistogram=[20-29%: 1; 40-49%: 32,937; 50-59%: 1,538; 60-69%: 161; 70-79%: 56; 80-89%: 5; 90-99%: 2,162]

      binsByLevel=[level 1: count=36,860]

      deletedLNCount=0

      inCount=136

      insByLevel=[level 2: count=135; level 3: count=1]

      lnCount=10,000,000

      mainTreeMaxDepth=3

    BTree: Composition of btree, types and counts of nodes.

      nRootSplits=0

      relatchesRequired=0

    'DbSpace' showed that the overall utilization is 98%.

    The reason we keep the checkpointer enabled is that we still want to be able to recover the data load if it stops in the middle.

    Another interesting finding is that with sorted input, BDB-JE doesn't really need a big cache to hold all the BINs in memory. This is important for us, since our product can have a lot of data loads going on at the same time, and it is not possible to hold all the BINs of all the databases in memory. If the input data is not sorted, however, it does seem necessary to hold all the BINs in memory, otherwise the result is very bad: with the same input file (unsorted) and a cache smaller than the B+Tree index, the final BDB-JE database occupied 100 GB of disk space. It also seems that a deferred-write database won't generate BINDeltas at all when BDB-JE needs to evict BINs to disk.

    Let me know if you have any questions about the above test.

    Thanks,

    Gaojie

  • Greybird-Oracle Member Posts: 2,690
    edited Apr 26, 2017 3:56PM

    Thanks for the summary, and I'm glad it's working well for you.

    Sorry, when I recommended a cache size that holds all BINs, I had forgotten that you insert in key order. Ordered key insertions fill one BIN and then move on to the next, etc. So you only need enough cache for a few BINs in memory at a time. When the cache overflows, the oldest BIN will be written to disk, and that's OK because it won't be modified again.

    But with random key insertions you really do need enough cache to hold all BINs in cache. Otherwise, the BINs will be repeatedly evicted from cache and logged to disk, then fetched again and modified, evicted, etc. This results in each BIN being written to disk multiple times.

    Another thing I noticed is that the average BIN size in your test is fairly large, about 18 KB. You may be able to reduce BIN size significantly by using key prefix compression, if you're not already using it. See DatabaseConfig.setKeyPrefixing; a small example follows.
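
    For example (a sketch; the other config calls are just context for where this setting lives):

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setDeferredWrite(true);
        // Store only the non-shared suffix of each key within a node;
        // long sorted keys with common prefixes make BINs much smaller.
        dbConfig.setKeyPrefixing(true);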

    --mark

  • Gaojie Member Posts: 5
    edited Apr 26, 2017 5:35PM

    Hi Mark,

    Thanks for the explanation of the BINDelta behavior. We noticed exactly the same behavior you describe, and that is the main reason we switched from out-of-order input to sorted input.

    As for the key-prefixing feature, we have run several tests, and it did reduce the average size of the IN/BIN nodes; we plan to enable it if we get clearly positive results with customers' real data.

    BTW, the max-entries-per-node config is 512 in our product right now; do you suggest using an even smaller value? I assume a smaller value will mean more INs/BINs and greater tree depth, but we can definitely test with smaller config values.

    Thanks,

    Gaojie

  • Greybird-Oracle Member Posts: 2,690
    edited Apr 28, 2017 11:56AM

    To be honest, I'm not sure about max entries per node. We only do performance testing with 128, the default value. In your use case, because each BIN will only be written once, I think a larger value makes sense, since, as you say, it reduces Btree depth. It can be set per database, as sketched below.
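
    If you want to experiment, the fanout is a per-database setting (512 below just mirrors your current value):

        DatabaseConfig dbConfig = new DatabaseConfig();
        // Maximum entries per Btree node; JE's default is 128.
        dbConfig.setNodeMaxEntries(512);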

    --mark

  • Gaojie Member Posts: 5
    edited Apr 28, 2017 5:59PM

    That is good enough for us for now.

    Thanks a lot for all the help, Mark!

This discussion has been closed.