I am doing massive updates on a database. In order to improve performance, and since I can "redo" the updates if necessary, I am configuring the transactions with "DB_TXN_NOSYNC" (I do checkpointing from time to time, and a final FLUSH when I finish). Since I am doing "bulk" updates, I use too the DB_TXN_BULK flag.
The problem is, I am seeing A LOT of synchronous traffic to disk, at transaction commit times (fdsync). Not all transactions are synchronous. In fact, most of them are not. But I have TONS of synchronous writes, when I am using "DB_TXN_NOSYNC" just to avoid it.
Reading the documentation I see this:
Enable transactional bulk insert optimization. When this flag is set, the transaction avoids logging the contents of insertions on newly allocated database pages. In a transaction that inserts a large number of new records, the I/O savings of choosing this option can be significant.
Users of this option should be aware of several issues. When the optimization is in effect, page allocations that extend the database file are logged as usual; this allows transaction aborts to work correctly, both online and during recovery. *At commit time, the database's pages are flushed to disk*, eliminating the need to roll-forward the transaction during normal recovery. However, there are other recovery operations that depend on roll-forward, and care must be taken when DB_TXN_BULK transactions interact with them.
Bold is mine.
So, DB_TXN_BULK seems a good idea when you are updating records ALREADY in the database, but when you are populating a new database, and so you are creating tons of new pages, it looks like you are going to be VERY slow, even when you are using DB_TXN_NOSYNC just to avoid it.
Berkeley DB 5.3.15 here.
In my case, deleting the BULK flag provides a massive speed improvement (orders of magnitude).
Am I missing something?. Advices?.
What is happening here is that the two optimizations are canceling each other out. DB_TXN_NOSYNC optimizes by not waiting for the logs to be flushed to disk before continuing processing. DB_TXN_BULK works by not logging the data that is inserting into new pages in the database, but instead flushing the data files at commit time to ensure durability. Usually database pages are not flushed at commit time, because they can be recreated from the logs that were flushed if there is an application failure. The extra cost of flushing the pages at the end is usually made up by the savings in not having to flush so many logs to disk, but since you are using DB_TXN_NOSYNC, the logs are not flushed synchronously anyway, so you loss the savings.
When populating a new database, DB_TXN_BULK does provide a lot of savings because it does not log the data that is inserted into the pages added to the database (although it still has to log that the page was created). In long transactions flushing the new data pages instead of logging all the insertions is faster. Data added to existing pages in the database still has to be logged, so the flag is not very useful for lots of updates of existing records.
On a related note, have you tried the Bulk API (http://docs.oracle.com/cd/E17076_02/html/programmer_reference/am_misc_bulk.html) to speed up populating the database?