12 Replies. Latest reply: Jul 19, 2012 5:04 PM by Greybird-Oracle

    Optimizing large DB scans

    vinothchandar
      Hi,

      This is more of a generic JE question. Here is the situation.

      Voldemort supports a daily cleanup job, which does a cursor walk over the entire database and then calls delete() on items that are stale. As you can see, this is a costly operation with millions of keys.
      A better way to do this would be to amortize the cleanup while the BDB cleaner reads and migrates records.
      Is there a way to write a hook into the BDB cleaner to execute some custom code during migration?
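
      For reference, a minimal sketch of the kind of scan-and-delete pass described above. The record layout is made up: the value is assumed to hold just an 8-byte timestamp, and the one-day TTL is purely illustrative.

      import com.sleepycat.bind.tuple.LongBinding;
      import com.sleepycat.je.*;

      public class DailyCleanup {

          static final long TTL_MS = 24L * 60 * 60 * 1000;  // illustrative one-day TTL

          // Hypothetical: the value is assumed to be an 8-byte timestamp.
          static boolean isStale(DatabaseEntry value) {
              long ts = LongBinding.entryToLong(value);
              return System.currentTimeMillis() - ts > TTL_MS;
          }

          // Walk every record with a cursor and delete the stale ones.
          static void run(Database db) {
              DatabaseEntry key = new DatabaseEntry();
              DatabaseEntry value = new DatabaseEntry();
              Cursor cursor = db.openCursor(null, null);
              try {
                  while (cursor.getNext(key, value, LockMode.DEFAULT) ==
                         OperationStatus.SUCCESS) {
                      if (isStale(value)) {
                          cursor.delete();  // delete the record at the cursor position
                      }
                  }
              } finally {
                  cursor.close();
              }
          }
      }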

      Thanks
      Vinoth
        • 1. Re: Optimizing large DB scans
          Greybird-Oracle
          Hi Vinoth,

          No, there is no such feature. We may do that someday, but after doing an analysis of how best to improve deletion performance we determined that the most important thing was to allow deleting a record without having to fetch the LN. This is implemented in JE 5, and I think you'll see a significant improvement in deletions with JE 5, if your app doesn't read the record before it is deleted.

          I suppose if you are deleting stale records then you have to check a timestamp of some kind, so perhaps you do have to read the record, in which case this optimization doesn't help you. But I thought I should mention it.
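
          For the case where the app already knows the stale keys and doesn't need to read the record first, the delete-by-key path is just the following (a sketch; the class and method names are made up). In JE 5 this no longer fetches the LN, which is where the improvement comes from.

          import com.sleepycat.je.*;

          class DeleteByKey {
              // Delete a record by key without reading it first.
              static OperationStatus delete(Database db, Transaction txn, byte[] keyBytes) {
                  return db.delete(txn, new DatabaseEntry(keyBytes));
              }
          }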

          --mark
          • 2. Re: Optimizing large DB scans
            Greybird-Oracle
            I realized that when you ask about the cost of the deletion scan, part of the cost is the scan itself and the other part is the deletion. I mentioned the JE 5 optimizations for the deletion part above.

            As far as the scan goes, what is the ratio of records deleted to total records scanned? Is it fairly small?

            --mark
            • 3. Re: Optimizing large DB scans
              vinothchandar
              Hi mark,

              It's fairly small: less than 1/7th of the data. But currently we scan the entire data set daily, which is wasteful.
              So I'm thinking of ways to amortize/optimize it.

              Thanks
              Vinoth
              • 4. Re: Optimizing large DB scans
                Greybird-Oracle
                I was going to suggest using a disk-ordered scan to find the records to be deleted, and then a regular cursor lookup/delete on the selected records. But then I remembered the disk-ordered scan is only available in JE 5. Sorry about that. But maybe a thought for the future.
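
                For the record, a rough sketch of that idea against the JE 5 API. The names and value layout (an 8-byte timestamp) are made up; DiskOrderedCursor reads in log order without locking, and the actual deletes still go through ordinary delete-by-key calls.

                import com.sleepycat.bind.tuple.LongBinding;
                import com.sleepycat.je.*;

                class DiskOrderedSweep {
                    // Hypothetical: the value is assumed to be an 8-byte timestamp.
                    static boolean isStale(DatabaseEntry value) {
                        long ts = LongBinding.entryToLong(value);
                        return System.currentTimeMillis() - ts > 24L * 60 * 60 * 1000;
                    }

                    // JE 5 only: scan in disk order to find stale records, then
                    // delete them by key.
                    static void run(Database db) {
                        DatabaseEntry key = new DatabaseEntry();
                        DatabaseEntry value = new DatabaseEntry();
                        DiskOrderedCursor doc = db.openCursor(new DiskOrderedCursorConfig());
                        try {
                            while (doc.getNext(key, value, null) == OperationStatus.SUCCESS) {
                                if (isStale(value)) {
                                    db.delete(null, new DatabaseEntry(key.getData()));
                                }
                            }
                        } finally {
                            doc.close();
                        }
                    }
                }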

                --mark
                • 5. Re: Optimizing large DB scans
                  vinothchandar
                  Hi Mark,

                  Sure. But we are on SSDs anyway, so we are not bottlenecked by IOPS. It's the cache churn due to the scan that hurts.
                  Anyway, thanks for the input. I will keep thinking.

                  Thanks
                  Vinoth
                  • 6. Re: Optimizing large DB scans
                    Greybird-Oracle
                    We may be going around in circles on this, but I'll ask anyway: Using EVICT_LN for all operations doesn't help with the cache churn issue? With SSDs, EVICT_LN is a good policy for all operations.
                    • 7. Re: Optimizing large DB scans
                      vinothchandar
                      Hi Mark,

                      Sorry, forgot to hit enter on the reply. :)

                      So, EVICT_LN does help. But then we ran into that bloat issue with 4.1.17, which set us back to 4.0.92. I am rewriting the duplicate handling, but it's a huge, huge effort to get all the data converted to that format (which was part of the reason we were hesitant about BDB 5 before). Anyway, once this conversion happens, we can go back to 4.1.17 and set EVICT_BIN for the scan cursor and EVICT_LN for normal operations. We were seeing improvements with this, but unfortunately the bloat ruined it.
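
                      For concreteness, that cache-mode split might look roughly like this (a sketch; it assumes JE 4.1+, where EVICT_BIN is available, and uses DatabaseConfig.setCacheMode for the database-wide default):

                      import com.sleepycat.je.*;

                      class CacheModes {
                          // Normal operations: evict the LN after each access so data
                          // records don't churn the cache (a good policy on SSDs).
                          static void configure(DatabaseConfig dbConfig) {
                              dbConfig.setCacheMode(CacheMode.EVICT_LN);
                          }

                          // Cleanup scan: also evict the BIN as the cursor moves on, so
                          // the scan doesn't wipe out the working set of other DBs.
                          static Cursor openScanCursor(Database db) {
                              Cursor cursor = db.openCursor(null, null);
                              cursor.setCacheMode(CacheMode.EVICT_BIN);
                              return cursor;
                          }
                      }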

                      Anyway, I am also thinking about something else to handle the cleanup jobs. I want to run it by you, to see if there are any gotchas I might hit.

                      To recap :
                      We have a cleanup job that scans the entire DB daily and deletes expired entries by looking at the timestamp in the value. This causes an enormous amount of GC churn and impacts every other DB on the server.

                      Planned fix:
                      Use a secondary index on the timestamp present in the value (the secondary key can just be an integer denoting 'number of days since an epoch'). That way we will have a small set of keys in the secondary index, each pointing to all records that were updated on a given day. Given this, we can simply issue a delete() via the secondary index, which should delete all stale entries.

                      This avoids the costly scan of the entire DB, trading it for the performance hit of maintaining the secondary index plus the extra storage. Do you see any problems with this approach?
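
                      A rough sketch of such a day-based secondary index (the names are made up, and the value is again assumed to hold an 8-byte timestamp):

                      import java.util.concurrent.TimeUnit;
                      import com.sleepycat.bind.tuple.IntegerBinding;
                      import com.sleepycat.bind.tuple.LongBinding;
                      import com.sleepycat.je.*;

                      class DayIndex {
                          // Secondary key = days since the epoch, taken from the
                          // timestamp in the value.
                          static final SecondaryKeyCreator KEY_CREATOR = new SecondaryKeyCreator() {
                              public boolean createSecondaryKey(SecondaryDatabase secondary,
                                                                DatabaseEntry key,
                                                                DatabaseEntry data,
                                                                DatabaseEntry result) {
                                  long ts = LongBinding.entryToLong(data);  // hypothetical value layout
                                  int day = (int) TimeUnit.MILLISECONDS.toDays(ts);
                                  IntegerBinding.intToEntry(day, result);
                                  return true;
                              }
                          };

                          static SecondaryDatabase open(Environment env, Database primary) {
                              SecondaryConfig config = new SecondaryConfig();
                              config.setAllowCreate(true);
                              config.setTransactional(true);
                              config.setSortedDuplicates(true);  // many primaries share one day key
                              config.setKeyCreator(KEY_CREATOR);
                              return env.openSecondaryDatabase(null, "byDay", primary, config);
                          }

                          // Delete every primary record whose timestamp falls on the given day.
                          static void deleteDay(SecondaryDatabase byDay, Transaction txn, int day) {
                              DatabaseEntry dayKey = new DatabaseEntry();
                              IntegerBinding.intToEntry(day, dayKey);
                              byDay.delete(txn, dayKey);  // also deletes the associated primary records
                          }
                      }

                      The cleanup job would then call deleteDay() once for each day older than the retention period, instead of scanning the whole DB.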

                      Some questions :
                      -- Can I simply add secondary index support to an existing environment (duplicates on)?
                      -- Do you see any GC churn due to the bulk deletes? Is there a more efficient way of doing this?

                      Help really appreciated.

                      Thanks
                      Vinoth
                      • 8. Re: Optimizing large DB scans
                        943066
                        Hey Vinoth, sorry to butt in but I've just been dealing with some (similar-ish) stuff.

                        A secondary database/index works quite well as an alternate means of traversal; however, it does come with some caveats:

                        As you note, you effectively double your write load (i.e. you're updating two indexes for every write, rather than one). For me, the index grew significantly (close to 35% bigger), though I have large secondary keys.

                        You must run in transactional mode with a secondary index.

                        In your case I would think key prefixing should work really well, as timestamps will share a common temporal prefix (which should reduce the size a lot).
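
                        For reference, key prefixing is just a flag on the database config (a sketch; it only pays off when adjacent keys actually share a prefix):

                        import com.sleepycat.je.DatabaseConfig;

                        class PrefixedDbConfig {
                            static DatabaseConfig create() {
                                DatabaseConfig dbConfig = new DatabaseConfig();
                                dbConfig.setAllowCreate(true);
                                dbConfig.setTransactional(true);
                                // Store common key prefixes once per BIN rather than once per key.
                                dbConfig.setKeyPrefixing(true);
                                return dbConfig;
                            }
                        }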

                        I'm also struggling with GC/cache churn a bit, so I'd be really interested in hearing how you go.

                        Now I'm sure Mark can give you some more authoritative answers :)
                        • 9. Re: Optimizing large DB scans
                          vinothchandar
                          Ah. A fellow GC crusader. Heart warming! (Your name please? :))

                          Prefixing keys with a timestamp assumes an insert-only workload. If you put the timestamp inside the key, how could we ever update that key again? So it might not work out well.

                          Personally, I don't think transactional mode has too much overhead (by default we run with no_sync), so it's definitely doable. I have not done any perf testing with secondary indexes. We are on SSDs, so I don't really care if the average latency goes from 2ms to 4ms. If the secondary index helps the cleanup, then it's a life saver, since we literally have hundreds of millions of keys. But you are right, it is effectively a double write.
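
                          For context, "transactional with no_sync" here corresponds to an environment setup roughly like this (a sketch using the JE Durability API, not Voldemort's actual configuration):

                          import java.io.File;
                          import com.sleepycat.je.Durability;
                          import com.sleepycat.je.Environment;
                          import com.sleepycat.je.EnvironmentConfig;

                          class NoSyncEnv {
                              static Environment open(File home) {
                                  EnvironmentConfig envConfig = new EnvironmentConfig();
                                  envConfig.setAllowCreate(true);
                                  envConfig.setTransactional(true);
                                  // Commit records are not flushed or synced at commit time,
                                  // keeping transaction latency low; a crash can lose the
                                  // most recent commits.
                                  envConfig.setDurability(Durability.COMMIT_NO_SYNC);
                                  return new Environment(home, envConfig);
                              }
                          }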

                          These large scans are inherently costly, and any system that relies on them must pay a premium upfront. Sure, I'd love to share more as I go along. Feel free to connect with me on LinkedIn (duh!).

                          Thanks
                          Vinoth
                          • 10. Re: Optimizing large DB scans
                            Greybird-Oracle
                            So, EVICT_LN does help. But then we ran into that bloat issue with 4.1.17, which set us back to 4.0.92. I am rewriting the duplicate handling, but it's a huge, huge effort to get all the data converted to that format (which was part of the reason we were hesitant about BDB 5 before). Anyway, once this conversion happens, we can go back to 4.1.17 and set EVICT_BIN for the scan cursor and EVICT_LN for normal operations. We were seeing improvements with this, but unfortunately the bloat ruined it.
                            Once you've done the conversion I strongly suggest upgrading directly to JE 5 rather than 4.1, since we've done more optimization of cache management (and many other optimizations, for example, for deletions) there.
                            This avoids the costly scan of the entire DB, trading it for the performance hit of maintaining the secondary index plus the extra storage. Do you see any problems with this approach?
                            The doubling of writes is the cost, as fb points out, and I would definitely not use this approach without testing it first. I will be a little surprised if the doubling of writes is acceptable, but it really depends on your particular situation. Using a time granularity of a day (or longer) is a good idea, since it means you can update the primary within that interval without changing the index.

                            Note that you can't add a secondary to a duplicates DB, so you can't do this until you've made your primary keys unique.
                            -- Can I simply add secondary index support to an existing environment (duplicates on)?
                            Yes, but it's expensive, see SecondaryConfig.setAllowPopulate.
                            -- Do you see any GC churn due to the bulk deletes? Is there a more efficient way of doing this?
                            JE doesn't have a bulk delete. A big potential advantage of deleting by key (without having to read the primary data to look at the timestamp) is that you avoid the read on the primary data and associated GC cost. But you don't get this benefit in JE 4.1 and earlier, because JE must read the record in order to lock it (during the delete). In JE 5, we no longer have to read the record and we see a large improvement in delete performance.
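
                            For what it's worth, building the secondary from an existing, already-populated primary is just a flag at open time; the expense is that the population itself scans the whole primary. Continuing the hypothetical DayIndex sketch from earlier in the thread:

                            import com.sleepycat.je.*;

                            class OpenExistingSecondary {
                                static SecondaryDatabase open(Environment env, Database primary) {
                                    SecondaryConfig config = new SecondaryConfig();
                                    config.setAllowCreate(true);
                                    config.setTransactional(true);
                                    config.setSortedDuplicates(true);
                                    config.setKeyCreator(DayIndex.KEY_CREATOR);  // day-based key creator from the earlier sketch
                                    config.setAllowPopulate(true);  // scan the primary and build the index on open
                                    return env.openSecondaryDatabase(null, "byDay", primary, config);
                                }
                            }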

                            --mark
                            • 11. Re: Optimizing large DB scans
                              vinothchandar
                              JE doesn't have a bulk delete.
                              I referred to the deletion of all primary records for a particular "day" (the secondary key) as a bulk delete.
                              A big potential advantage of deleting by key (without having to read the primary data to look at the timestamp) is that you avoid the read on the primary data and associated GC cost.
                              Okay, so if I issue a delete(dayKey) on the secondary index, it will silently mark all the corresponding primary records dead on disk, and the cleaner will get rid of them when it runs? (In JE 5, of course.)
                              But you don't get this benefit in JE 4.1 and earlier, because JE must read the record in order to lock it (during the delete). In JE 5, we no longer have to read the record and we see a large improvement in delete performance
                              Ideally, we want to go one step at a time. So, a quick confirmation: if I disable duplicates, I don't have to run the preupgrade to move to JE 5 (from 4.1.x), right?

                              Thanks for the inputs Mark.
                              -Vinoth
                              • 12. Re: Optimizing large DB scans
                                Greybird-Oracle
                                Okay, so if I issue a delete(dayKey) on the secondary index, it will silently mark all the corresponding primary records dead on disk, and the cleaner will get rid of them when it runs? (In JE 5, of course.)
                                No, the deletion is always logged and the record is always locked, in all versions of JE. The difference in JE 5 is that it doesn't have to read the LN in order to lock and delete it, so there is less I/O and less cache churn.
                                If I disable duplicates, I don't have to run the preupgrade to move to JE 5 (from 4.1.x), right?
                                No, you always have to run the preupgrade utility before opening with JE 5, for an environment created with any earlier release.

                                --mark