3 Replies Latest reply: Apr 18, 2012 5:36 PM by Greybird-Oracle RSS

    Clarification on 4.x handling of duplicates

    vinothchandar
      Hi,

      I know from the architecture paper that prior to BDB5, the duplicates were stored as a separate tree, linked from the BIN. This tree has a DIN, DBIN and LN I suppose.

      Here are my questions :

      1. Assuming sortedDuplicates is enabled, does this duplicate tree exist always or created only when a key has a duplicate? i.e if only 50% of my keys have duplicates, will all of them still have this tree?

      2. How does evict_ln work? Are the DBIN and DIN nodes also cached? or are they treated as LNs since they are lower than the BIN.

      Appreciate your help.

      Thanks
      Vinoth
        • 1. Re: Clarification on 4.x handling of duplicates
          Greybird-Oracle
          Hi Vinoth,

          On 1: In JE 4.1 and earlier, if there is only one LN with a given key, then there is no duplicate tree (no DIN or DBINs). This is an optimization I regret, because it gives the false impression that a duplicates DB is meant to be used for such cases. Duplicates DBs are meant to be used when duplicates are common, not uncommon. In JE 5, duplicates DBs work very well when duplicates are common, but they perform quite badly (use more memory, for example) when duplicates rarely exist.

          This is why the approach taken by Voldemort -- using a duplicates DB when true duplicates only occur rarely, when there are record version conflicts -- will not work well in JE 5. This is very unfortunate, because in general duplicates perform much better in JE 5, but not in the way they are used in Voldemort.

          The only good solution, that I know of, is: 1) use a unique key in Voldemort and disable duplicates in JE, 2) convert existing databases to this new format before upgrading to JE 5. This is the approach that Diego is taking.

          On 2: The LNs are evicted in the same way, whether a dup tree (DIN, DBIN) is present or not. The LN is evicted from whatever parent it has -- a BIN or a DBIN.

          --mark                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
          • 2. Re: Clarification on 4.x handling of duplicates
            vinothchandar
            Hi Mark.

            Thanks for clarification. This was actually for the Voldemort capacity modelling work I am doing.

            About the duplicate issues, I do understand the problem. But unfortunately, ruling out duplicates is impossible with Voldemort, since it needs to handle concurrent writes and support conflict resolution later on., Diego's solution makes sense. But there are some Voldemort/distributed systems issues that needs to be flushed out. We are working with him on that.

            Thanks again for your fantastic help

            Thanks
            vinoth
            • 3. Re: Clarification on 4.x handling of duplicates
              Greybird-Oracle
              Thanks for clarification. This was actually for the Voldemort capacity modelling work I am doing.
              I see, I didn't mean to jump to conclusions. If you have additional questions on JE 4.1 Btree, just let me know.
              About the duplicate issues, I do understand the problem. But unfortunately, ruling out duplicates is impossible with Voldemort, since it needs to handle concurrent writes and support conflict resolution later on., Diego's solution makes sense. But there are some Voldemort/distributed systems issues that needs to be flushed out. We are working with him on that.
              Great, I'm glad to hear that!

              For others' sake, I should explain a little more about the duplicates format in JE 5.0 vs 4.1 This is also described in the change log to some extent.

              In general, duplicates databases are intended for indexes where true duplicates are the common case. If you are using a duplicates database as your "primary database" (as done in Voldemort), and true duplicates are uncommon, please read on.

              In JE 4.1, the duplicates Btree had reduced concurrency and a larger memory size, due to the separate sub-Btree for each unique key, and the transactional maintenance of the duplicate count. This has been greatly improved in JE 5, by instead using a two-part internal key in a duplicates database. The first part of each internal key is the "key" param you specify in the API. The second part of the internal key is the "data" param you specify in the API.

              You may ask, why doesn't the two-part key use more memory, since the data is always present in the Btree internal nodes? There are several reasons:

              1) The data (LN) is never kept resident in the Btree, so it is not redundant (memory-wise) with the second part of the internal key.

              2) Key prefix compression is always enabled for duplicates databases, meaning that when there is a sufficient number of duplicates per key (more than the fan out, or 128 by default), the entire first part of the internal key is compressed away. Even with less duplicates per key, some portion of the first part of the key will often be compressed away.

              3) In fact the data is also present in the internal key in JE 4.1 and earlier, when there is at least one duplicates. When there are true duplicates, the JE 5 format takes less memory than the JE 4.1 format.

              The last point is important. When a duplicates database does not in fact have duplicates (or they are uncommon), the old JE 4.1 format is more efficient than the new 5.0 format. This is the opposite of what we consider to be the normal case. In such cases, it is best to convert the database to use a unique key, using a non-duplicates (normal) database, before upgrading to JE 5.

              Another important point is that duplicates databases should not normally be used when the "data" in the key-data pair is large. Because the data is stored in the internal Btree nodes of a duplicates database (as part of the internal key), large data will result in large internal nodes and surprising large amounts of overall memory usage. This is true of JE 4.1 (when there are true duplicates) and JE 5.0 (always).

              Another reason for using a normal (non-duplicates) database for primary databases is to allow for the addition of secondary indices. SecondaryDatabases cannot be used if the primary database is configured for duplicates. Note that with a SecondaryDatabase, the "data" is the key of the primary database, and therefore normally not large.

              In general, normal (non-duplicates) databases are intended for primary databases and duplicates databases are intended for secondary databases.

              --mark