I'm moving a question to a new thread. The original question was here: Re: Questions re: opening a large BDB
jhalex / Jeff said:
I have a BDB/JE with a table in it that is taking about 45 minutes to open. It is about 23GB of data, and there are around 20,000,000 records of the problem type. I'm using the DPL and all the time is taken up in the getPrimaryIndex method for this type. I'm assuming that this is the point where the DB needs to do its consistency check, and I suspect that this particular database was not closed cleanly. However, based on what I've read here, it seems like even if the database wasn't closed cleanly, there should be a checkpoint somewhere near the end of the DB that allows it to open fairly quickly, if not as fast as it would normally. I'm fairly certain this DB was created by a program that did a whole lot of puts of new objects and not any updating/deleting. It is possible that some long running read-only transactions from cursors could have been interrupted when the DB was closed. Could that kind of thing explain this excessively long time?
I tried running the DbPrintLog command above, although I could only run it without the "-ty20" command without an error. It gave me 100 million lines of XML, and I can tell you that there is a "CkptEnd" line within the last 20,000 lines of the file (I don't have an exact number -- it's a lot of data to filter!).
I'd love to have the DB open faster than this even following a failure. Even 5 minutes is a lot better than 45! Is there anything I can be doing to help that? Any patterns to avoid that would cause it to take this long?
A performance issue with getPrimaryIndex is not related to recovery, which is performed in the Environment constructor. The only thing I can think that could cause getPrimaryIndex to be so slow is when secondary indexes are "populated".
When a primary index is opened read-write (the default), all its secondary indices and related (via relatedEntity in the @SecondaryIndex annotation) indices are also opened. Whenever any secondary index is opened, the DPL uses a mechanism in the base API -- com.sleepycat.je.DatabaseConfig.setAllowPopulate -- to automatically populate the index. This allows you to add a new secondary index and automatically populate it the first time it is opened. Automatic population works by reading through the entire primary index and writing out any require secondary index records. This can take a long time if the primary index is large.
One issue is that the automatic population feature is triggered when the secondary index is empty. If it is empty, the entire primary index will be read. If by chance you have a secondary index that should be empty because none of the entity's secondary key fields are non-null, the population process will occur every time the primary is opened. This is a case we don't handle very well, obviously.
So the question is: Do you have any secondary indices that are empty? You can check this by dumping the secondary databases. Or you can simply get a thread dump during the call to getPrimaryIndex and see what's happening (post it here).
Ah-ha! Got it. Since this was a test database, we do have a couple secondary indices that are totally empty. So it sounds like BDB thinks they may have been new and populates them. So if we create even a single record that has a value for each of the otherwise unused secondary keys, we should be in good shape?
Yes, in a real-world scenario, it is unlikely that any secondary index would be empty. Just to make sure, I'm adding a bit of code that will insert an invalid record (that shouldn't ever match any queries) that has all its fields defined.