
Unable to insert 9 million records.

535253 Member Posts: 7
edited Sep 24, 2006 8:14PM in Berkeley DB Java Edition
Hi,

I'm trying to build a large JE index and am unable to get beyond 9 million inserts before my insertion rate drops to a trickle.

I'm running with transactions enabled, but with the txnWriteNoSync setting turned on. I've also set je.evictor.lruOnly to false and je.evictor.nodesPerScan to 100, since my database is far larger than available RAM and lookup requests are expected to be highly random.

The keys for the records are (essentially random) 32-byte values that correspond to the SHA-256 digest of the data I'm indexing; my test code simply generates random 32-byte key values. Each record's data is a fixed 1024-byte value.
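
For concreteness, the core of the test is roughly the following (a simplified sketch; the real code has timing and logging around it, and db is an already-open Database):

// Rough sketch of the test loop: random 32-byte keys, fixed 1024-byte values.
// 'db' is an already-open com.sleepycat.je.Database; auto-commit is used per put.
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.DatabaseException;
import java.util.Random;

public class InsertLoopSketch {
    static void run(Database db, long count) throws DatabaseException {
        final Random rnd = new Random();
        final byte[] key = new byte[32];     // stands in for a SHA-256 digest
        final byte[] data = new byte[1024];  // fixed-size record value
        for (long i = 0; i < count; i++) {
            rnd.nextBytes(key);
            db.put(null, new DatabaseEntry(key), new DatabaseEntry(data));
        }
    }
}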

I believe JE's performance will be acceptable even when the cache is not being hit as I intend to have several machines handling portions of my total index.

However, it's impossible, even with my trivial test code, to insert 10 million entries into JE with the keys as described above. CPU usage becomes 100% at the point that insertion rates plunge, and taking frequent thread dumps shows that much of the time is spent in the Cleaner threads (I wouldn't swear to this as I don't have a profiler attached at this point).

I've tried this on several machines, all of server class, and experience the same problem.

Is this a known limitation or not? Are there any settings I can tweak to move this problem out? I would, ideally, like to get 200 million entries into a single database.

Oh, I'm building this on Linux 2.6 on an XFS filesystem (I saw an old post about XFS not being a good choice but wonder if that's still true?) using jdk 1.5.0b8.

I'm running the following tests tomorrow to try to resolve this but, since the tests take hours to run, I'm posting here in the hope of a shortcut!

* Use ext2 instead of ext3.
* Increase the log size.
* Take manual control of cleaning.

Comments

  • Charles Lamb Member Posts: 836
    Hi Robert,

    We've talked this over here and offer the following comments:

    . Random inserts are probably exacerbating the problem, so if these are not typical of your actual load scenario, you might consider using sequential inserts instead. Sequential inserts will offer a significantly different performance profile.

    . Are you setting the cache size? If so, what are you setting it to? If not, recall that JE sets its cache size to 60% of the JVM heap size (-Xmx if set) and 7% of that is used for log buffers.

    . We suspect that the internal nodes of the tree (at least the ones involved in your working set) are not remaining in cache (and yes, this can cause the cleaner-running-overtime situation that you're seeing). You might want to run the DbCacheSize utility to see how much cache you need. Here's the URL of the FAQ entry describing it:

    http://www.oracle.com/technology/products/berkeley-db/faq/je_faq.html#34

    . Finally, it looks like you have submitted a pretty comprehensive description of the environment and benchmark, but could you please confirm that you haven't set any JE parameters that you didn't mention (e.g. cleaner utilization)?

    . You might want to dump the Environment Stats periodically both before and after the performance degrades -- that should tell us better what sorts of events are occurring to cause the problem.
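
    For example, something along these lines is one way to do it (a minimal sketch; clearing the stats on each call so every dump covers just the latest interval is optional):

    // Minimal sketch: print EnvironmentStats periodically so the before/after
    // behaviour of the evictor, cleaner and cache can be compared.
    import com.sleepycat.je.DatabaseException;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.StatsConfig;

    public class StatsDumpSketch {
        static void dumpStats(Environment env) throws DatabaseException {
            StatsConfig sc = new StatsConfig();
            sc.setClear(true);  // reset counters so each dump covers one interval
            System.out.println(env.getStats(sc));
        }
    }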

    By the way, the "control-backslash style metering" that you mention is probably "good enough" at this point, especially for a first blush on the problem. In other words, if you're seeing a lot of cleaner stack traces, that's probably what is happening.

    Let us know how these suggestions go and we'll work from there.

    Regards,

    Charles Lamb
  • 535253 Member Posts: 7
    Hi,

    The random inserts are, sadly, typical of my actual load, since I'll be indexing documents by their SHA-256 hash (which is effectively random).

    I am not setting the cache size, though the JVM has -Xmx512m for this test. The JE settings I mentioned are the complete set for the run I described, but here's the code snippet:

    final EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    environmentConfig.setTransactional(TRANSACTIONS);
    if (TRANSACTIONS) {
        environmentConfig.setTxnWriteNoSync(true);
    }
    // with lruOnly=false, eviction prefers lower btree levels (keeps internal nodes longer)
    environmentConfig.setConfigParam("je.evictor.lruOnly", "false");
    environmentConfig.setConfigParam("je.evictor.nodesPerScan", "100");

    where TRANSACTIONS=true.

    I have managed to reach 20 million inserts (though the insertion rate is beginning to tank) after adding the following settings:

    environmentConfig.setConfigParam("je.log.fileMax", Long.toString(100 * 1024 * 1024));
    environmentConfig.setConfigParam("je.cleaner.lookAheadCacheSize", "32768");
    environmentConfig.setConfigParam("je.cleaner.minAge", "10");
    environmentConfig.setConfigParam("je.cleaner.maxBatchFiles", "20");

    Unfortunately, I added them all at once, so I don't know if I only need one of these settings, or if there are conflicts between them, but they made sense to me.

    In future runs I will dump the environment stats. I do have extensive CSV-style logs of the insertion rate, which make for a very interesting graph. If you're interested in that, let me know how I can get it to you.

    Given the randomness of the keys, do you think increasing the cache size will help? I can certainly give 2 GB over to this for my test on the original box, and 6 GB on one of my other servers. However, my understanding was that caching the working set would be nearly useless given the uncorrelated insert and fetch load (hence the LRU settings).

    So, in summary, the additional settings have moved the wall out by at least a factor of two. I would be happy to reach 50m entries as long as the process remains I/O bound, since I can just throw disks at the problem (the servers I use have either four or twenty-four independently spinning drives).
  • Charles Lamb Member Posts: 836
    Hi Robert,

    Since the inserts need to be random, it is important that the internal nodes of the tree stay in memory. This will help improve performance of the inserts, reduce cache eviction, and improve cleaning performance. So it would definitely be worth your while to run the DbCacheSize program to see what kind of cache size is needed to achieve that.

    Assuming you're evicting Internal Nodes (as opposed to Leaf Nodes), anything you can do to keep more of those nodes in cache will be better. So increasing the cache size will most likely help.
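
    For example (a minimal sketch; the 2 GB figure is only a placeholder, so substitute whatever DbCacheSize reports for your record count and key/data sizes):

    // Sketch: give JE an explicit cache budget instead of the default 60% of the heap.
    import com.sleepycat.je.EnvironmentConfig;

    public class CacheConfigSketch {
        static EnvironmentConfig makeConfig() {
            EnvironmentConfig cfg = new EnvironmentConfig();
            cfg.setCacheSize(2L * 1024 * 1024 * 1024);  // placeholder: 2 GB
            // Alternatively, size it as a fraction of the JVM heap:
            // cfg.setCachePercent(85);
            return cfg;
        }
    }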

    Also consider the following:

    . Can you reduce the cleaner utilization (je.cleaner.minUtilization) below the default of 50%?

    . If this is a bulk load process, are you willing to defer cleaning until the end of the load (i.e. when you've loaded all the records)? If so, you can disable the cleaner and run it manually at the end (I'll supply details on how to do this if you need them).

    . Do you need transactions for the inserts? If not, you might consider using Deferred Write databases. We have debated this option internally; in a low-cache situation it might actually slow performance, since checkpoints are more expensive, but it might not, so it's worth a try (a minimal sketch follows these points).

    . If it is possible to do any inserts-in-a-sorted-batch kind of processing, it might prove beneficial. That is, gathering a batch of documents, sorting them by key, and then inserting them in that order could help, though that may not be practical for your application.
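
    Here is a minimal sketch of the Deferred Write option mentioned above (the database name is just a placeholder, and Database.sync is how the buffered writes are made durable when you choose to):

    // Minimal sketch of a non-transactional, deferred-write load. Writes are buffered
    // in the JE cache and flushed explicitly rather than logged per operation.
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseException;
    import com.sleepycat.je.Environment;

    public class DeferredWriteSketch {
        // assumes a non-transactional Environment
        static Database openDeferred(Environment env) throws DatabaseException {
            DatabaseConfig dbc = new DatabaseConfig();
            dbc.setAllowCreate(true);
            dbc.setDeferredWrite(true);  // buffer writes in cache; flush explicitly below
            return env.openDatabase(null, "index", dbc);  // "index" is a placeholder name
        }

        static void flush(Database db) throws DatabaseException {
            db.sync();  // flush the deferred writes to disk
        }
    }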

    But in any case, I do recommend starting with the DbCacheSize program so that you at least have a handle on an approximate cache size that will allow keeping the internal nodes in memory.

    Please let me know how this works out.

    Charles Lamb
  • 535253 Member Posts: 7
    I will run the DbCacheSize program and I'll report the answer here.

    I have done runs with utilization set to 20% and these are still running (though slowing down, obviously).

    The purpose of the test is to prove whether it's feasible to insert 50m-100m entries at all, so I could disable transactions. On the flip side, in the real world, we'll need to know that entries are really durable at some point (we're replicating the data a number of times as well, so maybe our requirements here are loose). And, finally, if transactions are 'merely' a linear slow-down then it won't make any significant difference to my test. If it's an order of magnitude, or related to the number of records, or the size of the database, then I can see it would have a much more significant impact.

    I would like to know how to disable cleaning as scheduling this in production may well be feasible (we have tons of disk space, so the overspill may be acceptable).

    I can certainly try sorting the keys in batches, but given that they're random, I don't think it will help. However, the real system is receiving requests asynchronously via JMS, so we can certainly batch them in practice (our acknowledgements are also asynchronous).

    I'll start by increasing the cache size and Xmx to the maximum, since the box will do nothing but run this database in production. Is there a penalty for having too large a cache (gc, for example)?

    B.
  • 535253 Member Posts: 7
    Oh, btw, my latest test has written 24m entries; it seems to be consuming 90-100% of the disk I/O and roughly 12-20% of the CPU. I don't know what percentage of that I/O is 'real' work (i.e., my records) versus the cleaner moving data out of under-utilized files...

    Here's a snip from the log; the columns are: total record count, total duration, duration since the last line. If you graph the first two columns, it's a horrible curve indicating I'll never reach 30m. If you graph the first and last columns, things look much better: the insertion rate is slowing at a slightly worse than linear rate. I'd love to send you the log, but I don't think this forum supports attachments.

    24709000,29887804,5967
    24710000,29894829,7024
    24711000,29900927,6098
    24712000,29906083,5156
    24713000,29911829,5745
    24714000,29918013,6183
    24715000,29923094,5081
  • Greybird-Oracle Member Posts: 2,690
    I can certainly try sorting the keys in batches, but given that they're random, I don't think it will
    help. However, the real system is receiving requests asynchronously via JMS, so we can
    certainly batch them in practice (our acknowledgements are also asynchronous).
    I'd like to emphasize one of the points Charles made and say that such batching and sorting will likely have a large positive impact. The larger the batch, the greater the impact. Since your responses are asynchronous, you have a great opportunity to take advantage of this.

    A similar advantage can be gained by designing the hash for your documents in a way that is less truly random. A hash designed for security purposes, such as SHA1, is the worst thing to use. A hash that gives similar values for similar documents will produce keys that are more likely to be close together in value and cause less I/O.

    Mark
  • 535253 Member Posts: 7
    I did understand the point. However, I'm specifically building a content-addressed storage system, so the notion of finding things by their digest value is key (no pun intended).

    I'm doing a further run without transactions (since I only really need durability at some point, I can rely on the checkpointer) and with some of the above suggestions applied. Obviously the numbers from the start of the run are much better (1-2 orders of magnitude), so this might get me to 50m regardless of the random-key problem.

    I can experiment with batching, but since even 1000 digest values are likely to be radically different, I'm unsure it can help much. Now, indexing by some other derived value is interesting, though I don't know of anything remotely like it for arbitrary-length documents.

    I'll have more data on my latest run tomorrow. I plan to let it run till Monday anyway, and I'll post the results here. Here's the complete environment for the new run:

    final EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    environmentConfig.setTransactional(false);
    environmentConfig.setCachePercent(85);
    environmentConfig.setConfigParam("je.log.fileMax", Long.toString(250 * 1024 * 1024));
    environmentConfig.setConfigParam("je.evictor.lruOnly", "false");
    environmentConfig.setConfigParam("je.evictor.nodesPerScan", "100");
    environmentConfig.setConfigParam("je.cleaner.minAge", "10");
    environmentConfig.setConfigParam("je.cleaner.maxBatchFiles", "20");
    environmentConfig.setConfigParam("je.cleaner.minUtilization", "20");

    The machine is running with an -Xmx of 2500m (it's a 3 GB machine, but I've left some headroom).

    Some sample output from the beginning of the run:

    2317000,365662,62
    2318000,365722,60
    2319000,365781,59

    Which is great but I don't know if it will last. :)

    Thanks for all the input, your collective suggestions and thoughts have really helped.

    One final point: a future version of JE might want to automatically scale back cleaning when it reaches the point where it's the only activity performing I/O. It was pretty easy to reach the point where the cleaner hogged my machine.

    B.
  • Charles Lamb Member Posts: 836
    The purpose of the test is to prove whether it's
    feasible to insert 50m-100m entries at all, so I
    could disable transactions. On the flip-side, in the
    real world, we'll need to know that entries are
    really durable at some point (we're replicating the
    data a number of times as well, so maybe our
    requirements here are loose).
    Just to be clear, I assume you realize that with the txnWriteNoSync property set, you're not making things durable -- you're only ensuring that they're written to the file system (and therefore protecting against JVM crashes, but not OS or hardware crashes). I suspect you know that though.
    And, finally, if
    transactions are 'merely' a linear slow-down then it
    won't make any significant difference to my test. If
    it's an order of magnitude, or related to the number
    of records, or the size of the database, then I can
    see it would have a much more significant impact.
    The transactions are not really causing a slowdown -- their overhead is fairly low. I was suggesting that if you were able to relax the transaction constraints, you might be able to achieve some batching. But in retrospect, if txnWriteNoSync is what you need (i.e. for durability guarantees) then you will need some form of transactions. You might also want to read: http://blogs.oracle.com/charleslamb/2006/09/22#a14
    I would like to know how to disable cleaning as
    scheduling this in production may well be feasible
    (we have tons of disk space, so the overspill may be
    acceptable).
    To disable the cleaner, set je.env.runCleaner to false. Then, you need to iteratively call Environment.cleanLog. See the javadoc for Environment.cleanLog for the complete wording on this, but here's the relevant excerpt from it:

    In the second use case, "batch cleaning", the application disables the cleaner thread
    for maximum performance during active periods, and calls cleanLog during periods
    when the application is quiescent or less active than usual. If the cleaner has a large
    number of files to clean, cleanLog may stop without reaching the target utilization; to
    ensure that the target utilization is reached, cleanLog should be called in a loop until it
    returns zero. And to complete the work of cleaning, a checkpoint is necessary. An
    example of performing batch cleaning follows.

    Environment env;
    boolean anyCleaned = false;
    while (env.cleanLog() > 0) {
        anyCleaned = true;
    }
    if (anyCleaned) {
        CheckpointConfig force = new CheckpointConfig();
        force.setForce(true);
        env.checkpoint(force);
    }
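
    And to be explicit about the first step, turning the cleaner thread off is just the parameter mentioned above, for example:

    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setConfigParam("je.env.runCleaner", "false");  // disable the background cleaner thread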
    I'll start by increasing the cache size and Xmx to
    the maximum, since the box will do nothing but run
    this database in production. Is there a penalty for
    having too large a cache (gc, for example)?
    If you're running a 32-bit JVM, then I think the max heap size is around 3.5 GB. If you want to go higher, you need to run a 64-bit JVM (we can handle that OK). There is always more GC overhead with that large a heap because the object copying will cost more. Here's another relevant blog entry on this topic and JE:

    http://blogs.oracle.com/charleslamb/2006/09/22#a23

    Finally, with regard to batching and sorting the inserts, it may still be useful even though the keys are not exactly sequential. Just the act of sorting them prior to insert will cause those inserts to "move across" the bottom leaves of the tree rather than randomly accessing the tree. By moving across the tree, it may be possible to reuse some of the elements in the cache for the elements of the batch.
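
    Here's a rough sketch of the kind of batching I mean (untested; the unsigned byte-by-byte comparison should match JE's default key ordering):

    // Sketch: gather a batch of (key, value) pairs, sort by key, then insert in key order
    // so consecutive puts touch neighbouring leaf nodes instead of random parts of the tree.
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.DatabaseException;
    import java.util.Comparator;
    import java.util.Map;
    import java.util.TreeMap;

    public class SortedBatchSketch {
        // Unsigned lexicographic byte[] comparison, i.e. JE's default key order.
        static final Comparator<byte[]> KEY_ORDER = new Comparator<byte[]>() {
            public int compare(byte[] a, byte[] b) {
                int n = Math.min(a.length, b.length);
                for (int i = 0; i < n; i++) {
                    int d = (a[i] & 0xff) - (b[i] & 0xff);
                    if (d != 0) return d;
                }
                return a.length - b.length;
            }
        };

        static void insertBatch(Database db, Map<byte[], byte[]> batch) throws DatabaseException {
            TreeMap<byte[], byte[]> sorted = new TreeMap<byte[], byte[]>(KEY_ORDER);
            sorted.putAll(batch);
            for (Map.Entry<byte[], byte[]> e : sorted.entrySet()) {
                db.put(null, new DatabaseEntry(e.getKey()), new DatabaseEntry(e.getValue()));
            }
        }
    }

    The bigger the batch, the more of those consecutive puts will land on the same or neighbouring leaf nodes.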

    Regards,

    Charles Lamb
  • 535253 Member Posts: 7
    Yes, I did know that. I was happy with the level of durability it gave, since I have to handle a rebuild in the event of an OS or hardware failure anyway (which is why I'm replicating). However, acknowledging after a manual checkpoint also works fine, assuming the records are durable after that completes.

    The settings in my last post seem to have improved things considerably:

    12994000,3371888,107
    12995000,3372000,112
    12996000,3372125,124
    12997000,3372240,115
    12998000,3372354,113
    12999000,3372463,108
    13000000,3372574,111
    13001000,3372689,115

    I assume this is because thousands of records are lazily flushing to disk whereas before the transaction settings were forcing at least a flush to the filesystem (XFS in this case)?

    Anyway, as long as the transaction overhead is 'merely' a multiplier, I can do my tests with transactions off and then add boxes when I turn them back on.

    I do have some 64-bit machines with 6-8 GB of RAM, and I'll try those out too.

    I'll try the sorted batch, your explanation sounds good to me.

    The current run is suspiciously great; I assume I'll have to call Environment.sync() more frequently than it's occurring now, perhaps once a minute?

    The settings I have for cleaning seem to prevent it from swamping my machine, so I don't think I'll need manual control after all. However, your code sample and explanation are illuminating, thanks.

    I'll add a periodic sync() call and see how long it takes to reach 50m. Once I'm through that, I think I've resolved my major headache with JE. Obviously I'll need to add update and retrieval traffic too to see what my real world looks like, but that really only tells me how many machines and/or disks I need. Given the scale of the data I need to manage, I have lots of both; I just have to avoid exponentially worsening conditions, since linear is generally fine.
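
    Something like the following is what I have in mind for the periodic sync (an untested sketch; the one-minute interval is just a starting guess that I'll have to tune):

    // Sketch: flush JE to stable storage once a minute on a background thread.
    import com.sleepycat.je.Environment;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class PeriodicSyncSketch {
        static ScheduledExecutorService startSyncer(final Environment env) {
            ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
            ses.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    try {
                        env.sync();          // flush to stable storage
                    } catch (Exception e) {
                        e.printStackTrace(); // good enough for the test harness
                    }
                }
            }, 1, 1, TimeUnit.MINUTES);
            return ses;
        }
    }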

    Thanks, again, for all the help.
  • Charles Lamb Member Posts: 836
    It's probably worth opening a Support Request for this whole issue. Drop me an email (charles.lamb at the domain-you'd-think-it-is) and I'll get one opened up.

    In the tracking triplets you posted, are those units of time msecs or secs? I assume msecs.

    Charles Lamb
This discussion has been closed.