4 Replies Latest reply: Dec 17, 2012 12:03 PM by 979004

    Advice needed: is BDB a good fit for what I aim at?

    979004
      Hello everyone,

      I'm not a BDB user (yet), but I really think the BDB library
      IS the perfect fit for my needs.

      I'm designing an application with a "tricky" part, that requires a very fast
      data storage/retrieval solution, mainly for writes (but for reads too).

      Here's a quick summary of this tricky part, that should at least use
      2 databases:
      - the first db will hold references to contents, with a few writes per hour
      (the references being "pushed" to it from a separate admin back end), but
      expected high numbers of reads
      - the second db will log requests and other events on the references
      contained in the first db: it is planned that, on average, one read from DB1
      will produce five times as many writes into DB2.

      To illustrate:
      DB1 => ~25 writes / ~100 000 reads per hour
      DB2 => ~500 000 writes / *(60?) reads per hour

      (*will explain about reads on DB2 later in this post)

      Reads and writes on both DBs are not evenly distributed: out of 500 000 writes
      in an hour, the first 250 000 could happen within 20 minutes, for instance.
      There will be peaks of activity, and low-activity phases as well.

      That being said, do the BDB experts here think that BDB is a good fit for
      such a need? Either way, could you please let me know what leads you to
      that conclusion? Many thanks in advance.

      Now, about the "*(60?) reads per hour" for DB2: actually, data from DB2
      should be accessed in real time for reporting. As of now, here is what
      I think I should do to ensure and preserve a high write throughput and not
      miss any write in DB2 => once per minute, another "DB2" is created that will
      record the new events from then on. The "previous" DB2 is then dumped/exported
      into another database which will then be queried for real-time (not exactly
      real-time, but up to five minutes is an acceptable delay) reporting.
      So, in my first approach, DB2 is "stopped" then dumped each minute, to another
      DB (not necessarily BDB, by the way - the data could probably be restructured
      another way into another kind of NoSQL storage to facilitate querying and
      retrieval from the admin back end), which would make 60 reads per hour
      (but "entire" reads, of the full db)

      The questions are:
      - do you think that renewing DB2 this often would improve or strain performance?
      - is BDB good and fast at doing massive dumps/exports? (OK: 500 000 entries per
      hour would make ~8300 entries per minute on average, so let's say that a dump's
      max size is 24 000 rows of data)
      - would it or not be better to read directly from the current DB2 while it is
      (intensively) storing new rows, which would avoid the need to dump each
      minute and provide more real-time features? (then only a daily dump would
      be needed, to archive the "old" data)

      Anyone who has had to face such questions already is welcome, as is
      any BDB user who thinks they can help on this topic!

      Many thanks in advance for your advice and knowledge.

      Cheers,
      Jimshell
        • 1. Re: Advice needed: is BDB a good fit for what I aim at?
          Ashok_Ora
          100,000 reads/hour translates to about 27-30 reads/sec on DB1.

          500,000 writes/hour translates to about 150 writes/sec for DB2.
          *60 translates to 900 writes/sec for DB2.

          Can you confirm that I understand your throughput requirements correctly? BDB can certainly handle this throughput for reads as well as writes.

          What is the min/max/average size of a record? How many records do you expect to have in the database? Will DB2 keep growing forever?

          You may not need to do the massive dump/export that you mention. Berkeley DB allows you to use multiple databases, each database could be stored in its own file. You could set it up so that the writer writes data to a database for a period of time, then creates and opens a new database and closes the old one. The reporting application could read and report from the databases that the writer application has closed.
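
          For illustration, the writer side of that pattern could look roughly like this with the BDB C API (an untested sketch; error handling is omitted, and the file naming and 60-second interval are just placeholders):

              #include <db.h>
              #include <stdio.h>
              #include <time.h>

              /* Sketch: open a fresh database file named after the current minute,
               * write to it for 60 seconds, close it, and start a new one. The
               * reporting application only ever opens files the writer has closed. */
              static DB *open_minute_db(DB_ENV *env, time_t now)
              {
                  DB *db;
                  char name[64];

                  snprintf(name, sizeof(name), "db2.%ld.db", (long)(now / 60));
                  db_create(&db, env, 0);
                  db->open(db, NULL, name, NULL, DB_BTREE, DB_CREATE, 0);
                  return db;
              }

              void writer_loop(DB_ENV *env)
              {
                  time_t start = time(NULL);
                  DB *current = open_minute_db(env, start);

                  for (;;) {
                      time_t now = time(NULL);
                      if (now - start >= 60) {         /* rotate once per minute */
                          current->close(current, 0);  /* old file is now safe to read */
                          current = open_minute_db(env, now);
                          start = now;
                      }
                      /* ... fill key/data DBTs from the incoming event and call
                       * current->put(current, NULL, &key, &data, 0); ... */
                  }
              }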

          Hope this answer was helpful. If you can confirm my throughput estimates and also provide the additional details that I mentioned above, that'll be great!

          Warm regards.
          ashok
          • 2. Re: Advice needed: is BDB a good fit for what I aim at?
            979004
            Hi Ashok!

            Many thanks for your reply, and sorry for getting back here so late.

            >
            100,000 reads/hour translates to about 27-30 reads/sec on DB1.

            That's right.


            >
            500,000 writes/hour translates to about 150 writes/sec for DB2.
            That's right again :-)


            >
            *60 translates to 900 writes/sec for DB2.
            I wrote it wrong, actually - sorry for the misleading "*" in my example.
            Actually, I primarily thought that the whole content of DB2 would be "dumped" each minute, which comes to 60 (full) reads per hour (one per minute).
            This is in the scenario where I do not query DB2 for reporting, but dump it into another, maybe more efficient db system for complex queries (for instance: MongoDB, CouchDB, etc. - I really don't know at this stage, and this should be studied as well). So very few reads on DB2, but each read implies reading all the data (and putting it elsewhere).


            >
            Can you confirm that I understand your throughput requirements correctly? BDB can certainly handle this throughput for reads as well as writes.
            Yes, you got it right - except for the confusing "*(60?)", sorry about that; I think it's clearer now.


            >
            What is the min/max/average size of a record? How many records do you expect to have in the database? Will DB2 keep growing forever?
            As for DB1: record sizes are expected to be => 200 bytes / 600 bytes / 1.5 KB (min/average/max)
            As for DB2 (which will have the highest write rate) => 300 bytes / 1 KB / 3 KB (min/average/max)


            >
            You may not need to do the massive dump/export that you mention. Berkeley DB allows you to use multiple databases, each database could be stored in its own file. You could set it up so that the writer writes data to a database for a period of time, then creates and opens a new database and closes the old one. The reporting application could read and report from the databases that the writer application has closed.
            That is very interesting.
            I didn't see it that way, but you're totally right regarding the "simple" closing of the "old" (previous minute) DB2 and opening of a fresh, "new" DB2 each minute.
            But I guess it would become pretty tough to manage at some point: after one hour, I'd end up with 60 databases to query (sequentially?) to retrieve stats?
            The goal of "dumping" old DB2s is to put the old data into a single DB for faster reads/complex querying, actually. That could be done in BDB directly, maybe? (though I feel it might be simpler/faster with some other DB systems that are optimized for complex queries/searches) Ideally, the "old" data should be available up to (say) 3 months back. I guess the right strategy here would be to maintain a specific "3-month collecting" DB, which would roll data (on a daily basis?) into another, archive DB (or several), which would be used pretty rarely.


            >
            Hope this answer was helpful. If you can confirm my throughput estimates and also provide the additional details that I mentioned above, that'll be great!
            Yes, your answer is very useful, many thanks! :-)
            It makes things much clearer to me, definitely.
            Please let me know your thoughts on the additional information I detailed in this reply, and if you think this is all relevant.


            >
            Warm regards.
            ashok
            Best Regards,
            Jimshell

            Edited by: 976001 on Dec 17, 2012 08:01
            • 3. Re: Advice needed: is BDB a good fit for what I aim at?
              Ashok_Ora
              Great -- thanks for the clarification.

              Based on this information, it appears that you're generating about 12 GB/day into DB2, which is about a terabyte of data every 3 months. Here are some things to consider for ad-hoc querying of about 1 TB of data (which is not a small amount of data).
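
              For reference, the back-of-the-envelope arithmetic behind that estimate, using your ~1 KB average record size for DB2:

                  500,000 writes/hour x ~1 KB ≈ 500 MB/hour
                  500 MB/hour x 24 hours      ≈ 12 GB/day
                  12 GB/day x 90 days         ≈ 1.1 TB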

              Query performance is dramatically improved by using indexes. On the other hand, indexing data during the insert operation is going to add some overhead to the insert - this will vary depending on how many fields you want to index (how many secondary indices you want to create). BDB automatically indexes the primary key. Generally, any approach that you consider for satisfying the reporting requirement will benefit from indexing the data.
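
              To make the secondary index idea concrete, here is a rough sketch with the BDB C API using DB->associate() (the record layout and all names are invented for illustration; error handling is abbreviated):

                  #include <db.h>
                  #include <string.h>

                  /* Hypothetical record layout, for illustration only. */
                  struct event {
                      long timestamp;
                      char app_id[32];
                      char payload[960];
                  };

                  /* Callback: derive the secondary key (app_id) from a primary record. */
                  static int get_app_id(DB *sec, const DBT *key, const DBT *data, DBT *result)
                  {
                      const struct event *ev = data->data;

                      memset(result, 0, sizeof(*result));
                      result->data = (void *)ev->app_id;
                      result->size = sizeof(ev->app_id);
                      return 0;
                  }

                  /* Associate a secondary with the primary; BDB then maintains the
                   * index automatically on every put into the primary. */
                  int add_index(DB_ENV *env, DB *primary, DB **secondaryp)
                  {
                      DB *sec;
                      int ret;

                      if ((ret = db_create(&sec, env, 0)) != 0)
                          return ret;
                      sec->set_flags(sec, DB_DUPSORT);   /* many events per app_id */
                      if ((ret = sec->open(sec, NULL, "db2_by_app.db", NULL,
                                           DB_BTREE, DB_CREATE, 0)) != 0)
                          return ret;
                      if ((ret = primary->associate(primary, NULL, sec, get_app_id, 0)) != 0)
                          return ret;
                      *secondaryp = sec;
                      return 0;
                  }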

              Here are some alternatives to consider, for the reporting application:
              - Move the data to another system like MongoDB or CouchDB as you suggest and run the queries there. The obvious cost is the movement of data and maintaining two different repositories. You can implement the data movement in the way I suggested earlier (close "old" and open "new" periodically).
              - Use BDB's SQL API to insert and read data in DB1 and DB2. You should be able to run ad-hoc queries using SQL. After doing some experiments, you might decide to add a few indices to the system. This approach eliminates the need to move the data and maintain separate repositories. It's simpler. (See the sketch after this list.)
              - Use the Oracle external table mechanism (Overview and how-to - http://docs.oracle.com/cd/B28359_01/server.111/b28319/et_concepts.htm) to query the data from Oracle database. Again, you don't need to move the data. You won't be able to create indices on the external tables. If you do want to move data from the BDB repository into Oracle DB, you can run an "insert into <oracle_table> select * from <external_table_in_DB2>;". As you know, Oracle database is an excellent database for all sorts of applications, including complex reporting applications.
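
              To give a feel for the SQL API option: BDB's SQL interface is SQLite-compatible, so the reporting side could be as simple as the following sketch (the file, table, and column names are made up for illustration):

                  #include <stdio.h>
                  #include "sqlite3.h"   /* BDB's SQL API is SQLite-compatible */

                  /* Sketch: count the last hour's events per application. */
                  int report(void)
                  {
                      sqlite3 *db;
                      sqlite3_stmt *stmt;

                      if (sqlite3_open("db2.dbsql", &db) != SQLITE_OK)
                          return 1;

                      /* Ad-hoc query; an index on app_id would speed this up. */
                      sqlite3_prepare_v2(db,
                          "SELECT app_id, COUNT(*) FROM events "
                          "WHERE ts >= strftime('%s','now','-1 hour') "
                          "GROUP BY app_id;", -1, &stmt, NULL);

                      while (sqlite3_step(stmt) == SQLITE_ROW)
                          printf("%s %d\n",
                                 (const char *)sqlite3_column_text(stmt, 0),
                                 sqlite3_column_int(stmt, 1));

                      sqlite3_finalize(stmt);
                      sqlite3_close(db);
                      return 0;
                  }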

              Hope this was helpful. Let me know your thoughts.

              Just out of curiosity, can you tell us some additional details about your application?

              Thanks and warm regards.
              ashok
              • 4. Re: Advice needed: is BDB a good fit for what I aim at?
                979004
                Hi Ashok

                Many thanks for your fast reply again :)
                Ashok_Ora wrote:
                Great -- thanks for the clarification.
                Thank YOU, my first post was indeed a bit confusing, at least about the reads on DB2.

                Ashok_Ora wrote:
                Based on this information, it appears that you're generating about 12 GB/day into DB2, which is about a terabyte of data every 3 months. Here are some things to consider for ad-hoc querying of about 1 TB of data (which is not a small amount of data).
                That's right, this is quite a huge amount of data, and it will keep growing, and growing... Although the main goal of the app is to achieve (almost) real-time reporting, it will also (potentially) need to compute data over different time ranges, including yearly ranges for instance - but in that case, the real-time capabilities wouldn't be relevant, I guess: if you look at data over a year's span, you probably don't need it to be accurate down to the day, for instance (well, I guess), so this part of the app would probably only use the "very old" data (not the current day's data), whatever it is stored in...

                Ashok_Ora wrote:
                Query performance is dramatically improved by using indexes. On the other hand, indexing data during the insert operation is going to add some overhead to the insert - this will vary depending on how many fields you want to index (how many secondary indices you want to create). BDB automatically indexes the primary key. Generally, any approach that you consider for satisfying the reporting requirement will benefit from indexing the data.
                Thanks for pointing that out! I did envisage using indexes, but my concern was (and you guessed it) the expected overhead they bring. At this stage (but I may be wrong - this is just a study in progress, which will also need proper tests and benchmarking), I plan to favour write speed over anything else, to ensure that all the incoming data is indeed stored, even if it is quite tough to handle in its primary stored form.
                I prefer to envisage (but again, it's not certain that this is the right way of doing it) very fast inserts, then possibly re-processing (sort of) the data later, and (maybe? certainly?) elsewhere, in order to have it more "query friendly" and efficient for the moderately complex queries behind legible reports/charts.

                Ashok_Ora wrote:
                Here are some alternatives to consider, for the reporting application:
                - Move the data to another system like MongoDB or CouchDB as you suggest and run the queries there. The obvious cost is the movement of data and maintaining two different repositories. You can implement the data movement in the way I suggested earlier (close "old" and open "new" periodically).
                This is pretty much "in line" with what I had in mind when posting my question here :).
                I found out in several benchmarks (there are not a lot, but I did find some ^^) that BDB, amongst others, is optimized for bulk reads - that is, retrieving a whole batch of data is faster than, for instance, retrieving the same row n times. Is that right? Now, I guess this is tightly related to the configuration and the server's performance...
                The process would then feed data into a new "DB2" instance every 60 seconds, "dumping"/merging the previous one into another DB (BDB or else), which would grow until some defined limit.
                Would the "old DB2" -> "main, current archive" merge be a heavy/tricky process, in your opinion? Especially as the "archive" DB keeps growing - what would be a decent "limit" to take into account? I guess that 1 TB for 3 months of data would be a bit big, wouldn't it?

                Ashok_Ora wrote:
                - Use BDB's SQL API to insert and read data in DB1 and DB2. You should be able to run ad-hoc queries using SQL. After doing some experiments, you might decide to add a few indices to the system. This approach eliminates the need to move the data and maintaining separate repositories. It's simpler.
                I read a bit about it, and these are indeed very interesting capabilities - especially as I know how to write decent SQL statements.
                That would mean that DB2 could grow beyond just a 60-second time span - but would this growth alter the write throughput? I guess so... This will require proper tests, definitely.
                Now, I plan for the "real" data (the "meaningful part of the data"), except timestamps, to be stored in quite a "NoSQL" way (the term is fashionable...), say as JSON objects (or something close to it).
                This is why I envisaged MongoDB, for instance, as the DB layer for the reporting part, as it is able to query directly into JSON, with a specific way of handling "indexes" too. But I'm no MongoDB expert in any way, so I'm not sure at all, again, that it is a good fit (just as I'm not sure right now what the proper, most efficient approach is, at this stage).
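
                Concretely, what I have in mind on the BDB side is just storing the JSON text as an opaque value, keyed by something like a timestamp plus a sequence number - again a rough, untested sketch, with invented names:

                    #include <db.h>
                    #include <stdio.h>
                    #include <string.h>

                    /* Rough sketch: store a JSON document as an opaque value under
                     * a "timestamp:sequence" key. BDB doesn't interpret the value;
                     * JSON-aware querying would happen after export (e.g. MongoDB). */
                    int log_event(DB *db2, long ts, long seq, const char *json)
                    {
                        DBT key, data;
                        char keybuf[32];

                        snprintf(keybuf, sizeof(keybuf), "%012ld:%06ld", ts, seq);

                        memset(&key, 0, sizeof(key));
                        memset(&data, 0, sizeof(data));
                        key.data = keybuf;
                        key.size = (u_int32_t)strlen(keybuf);
                        data.data = (void *)json;
                        data.size = (u_int32_t)strlen(json) + 1;  /* keep the NUL */

                        return db2->put(db2, NULL, &key, &data, 0);
                    }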

                Ashok_Ora wrote:
                - Use the Oracle external table mechanism (Overview and how-to - http://docs.oracle.com/cd/B28359_01/server.111/b28319/et_concepts.htm) to query the data from Oracle database. Again, you don't need to move the data. You won't be able to create indices on the external tables. If you do want to move data from the BDB repository into Oracle DB, you can run an "insert into <oracle_table> select * from <external_table_in_DB2>;". As you know, Oracle database is an excellent database for all sorts of applications, including complex reporting applications.
                This is VERY interesting. VERY.
                And Oracle DB is, you're right, a very powerful and flexible database for every kind of workload.
                I'll look into the docs carefully, many thanks for pointing that out (again!) :)

                I have not yet decided whether the final application will be free or open source, but this will eventually be a real question. Right now, I don't want to think about it, and just want to find the best technical solution(s) to achieve the best possible results.
                And BDB and Oracle DB are very serious competitors, definitely ;)

                Ashok_Ora wrote:
                Hope this was helpful. Let me know your thoughts.
                It definitely is very useful! It makes things clearer, allows me to get deeper into BDB (and Oracle as well, with your latest reply), and that's much appreciated. :)

                As I said, my primary goal is to ensure the highest write throughput - I cannot miss any incoming data, as there is no (easy/efficient) way to re-ask for what would be lost and get it back while being sure it hadn't changed (the simple act of re-asking would actually introduce data flaws).
                So everything else (including reporting, stats, etc.) IS secondary, as long as whatever comes in is reliably stored (almost) as soon as it comes in.
                This is why, in this context, "real" real-time is not really crucial, and it can be "1 minute delayed" real time (it could even be "5 minutes delayed", actually, but let's be a bit demanding ^^).

                Ashok_Ora wrote:
                Just out of curiosity, can you tell us some additional details about your application?
                Of course, I owe you a bit more details as you help me a lot in my research/study :)
                The application is sort of a tracking service. It is primarily thought to serve the very specific needs of a client of mine: they have several applications that all use the same "contents". Those contents can be anything, text, HTML, images, whatever, and they need to know almost in real time what application (used by which external client/device) is requesting ressources, which ones, from where, in which locale/area and language, etc.
                Really a kind of "Google Analytics" stuff (which I pointed out at the very beginning, but they need something more specific, and, above all, they need to keep all the data with them, so GA is not a solution here).
                So, as you can guess, this is pretty much... big. On the paper, at least. Not sure if this will ever be implemented one day, to be honest with you, but I really want to do the technical study seriously and bring the best options so that they know where they plan to go.
                As of me, I would definitely love it if this could become reality, this is very interesting and exciting stuff. Especially as it requires to see things as they are and not to fall into the "NoSQL fashion" for the sake of being "cool". I don't want a cool application, I want an efficient one, that fits the needs ;) What is very interesting here is that BDB is not new at all, though it's one of the most serious identified players so far!

                Ashok_Ora wrote:
                Thanks and warm regards.
                ashok
                Many thanks again, Ashok!
                I'll leave this question open, in order to keep posting as I progress (and, above all, to be able to get your thoughts and rewarding comments and advice :) )

                Cheers,
                Jimshell