I'm seeing a peculiar performance issue with the BDB page cache being flushed to disk. I've reproduced it reliably on three out of three quite different systems, so it is at least well-defined.
My usage pattern for the database in question is that I periodically (perhaps once every 10-60 seconds) need to read a number of values (around 500-2000) from a database containing a rather large number of keys (in the millions, at least). There are a few writes for every such batch, but not many (a few tens). The keys read in each batch are quite random, and very likely to be completely different from batch to batch. The database is a DB_HASH.
When I do that, BDB seems to dirty a lot of pages in the page cache (which I have currently sized at 512 MB so that pages don't have to be forced out of it), as far as I can tell by manipulating refcounts and similar bookkeeping. All in all, a single batch seems to dirty some 10-40 MB of the mmapped cache region. (I check this using pmap -x on Linux.) Note that when I speak of pages and the dirtying of them here, I mean at the VM level, not the BDB level.
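For anyone who wants to track the same number from a script rather than by eyeballing pmap, here is a minimal sketch that sums the dirty figures from /proc/<pid>/smaps for the region-file mappings. The function name dirty_kb and the "__db." path fragment (the usual naming of BDB region files) are my own assumptions, not anything taken from the setup above.

```python
import re

# An smaps mapping header looks like: "start-end perms offset dev inode [path]".
_HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+[0-9a-f]+\s+\S+\s+\d+\s*(.*)$")
# Per-mapping dirty counters, reported in kB.
_DIRTY = re.compile(r"^(Shared_Dirty|Private_Dirty):\s+(\d+)\s+kB$")


def dirty_kb(smaps_text, path_fragment):
    """Sum Shared_Dirty + Private_Dirty (in kB) over all mappings whose
    path contains path_fragment. Hypothetical helper, for illustration."""
    total = 0
    current_path = None
    for raw in smaps_text.splitlines():
        line = raw.strip()
        header = _HEADER.match(line)
        if header:
            # A new mapping starts; remember which file it maps.
            current_path = header.group(1)
            continue
        counter = _DIRTY.match(line)
        if counter and current_path is not None and path_fragment in current_path:
            total += int(counter.group(2))
    return total
```

Calling it as dirty_kb(open("/proc/self/smaps").read(), "__db.") before and after a batch should give a rough idea of how much of the region a batch dirties at the VM level.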
A while after this has happened, the VM comes around and wants to flush the dirty pages to disk, so it writes large portions of them to the backing block device in a batch (often the entire set of dirty pages, but sometimes only 10-20 MB or so at a time; this detail shouldn't matter). Since the dirty pages are often rather scattered across the region file, such a flush usually requires a couple of thousand write ops, so it can sometimes take up to 10-20 seconds for the requests to complete.
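To illustrate why scattered pages translate into so many write ops: if I assume that adjacent dirty pages can be merged into one request while every gap forces a new one, then the number of contiguous runs is a rough lower bound on the number of writes. This is only a sketch under that assumption; count_write_runs is a made-up helper and the page indices are invented.

```python
def count_write_runs(dirty_pages):
    """Count maximal runs of consecutive page indices.

    Each run could, at best, be written with a single request, so the
    result is a rough lower bound on the write ops a flush needs.
    """
    runs = 0
    previous = None
    for page in sorted(set(dirty_pages)):
        # A gap (or the very first page) starts a new run.
        if previous is None or page != previous + 1:
            runs += 1
        previous = page
    return runs
```

With random hash lookups touching pages all over a 512 MB region, almost every dirty page ends up in its own run, which fits with the thousands of write ops I see per flush.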
If the program then tries to dirty any of those pages again while they are waiting to be flushed, which is often the case, the VM blocks it until the page in question has been flushed. This means that the thread in question may well be blocked for up to 20 seconds, causing quite annoying wait times.
How should I deal with this problem? I've considered putting the region files on tmpfs or similar, but that seems like an excessive measure for a problem which, from what I can tell, should be commonplace.
On a closely related note, I've noticed a large discrepancy in I/O performance between the systems I've tried this on. Two of the systems manage some 200-500 write ops per second on my test load, while the third manages closer to 2000-3000 write ops per second, which makes quite a difference. What makes it very weird is that the faster system uses the exact same hard drive as one of the slower systems. I know this isn't exactly a BDB-specific question, but I thought someone around here might have experience in the matter. All three systems use Linux and SATA hard disks (not SSDs), but they use different SATA host adapters and different kernel versions, and are configured in quite different ways.
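To put those rates in terms of stall time, here is the back-of-the-envelope arithmetic as a tiny sketch. The figure of 4000 write ops is just an illustrative stand-in for "a couple of thousand", and flush_seconds is a made-up name.

```python
def flush_seconds(write_ops, ops_per_second):
    """Estimate how long a flush takes if the device sustains a fixed rate."""
    return write_ops / ops_per_second


# Illustrative numbers only: one flush of 4000 scattered writes.
for rate in (200, 500, 2000):
    print(f"{rate} ops/s -> {flush_seconds(4000, rate):.0f} s")
```

At 200 ops per second that comes to 20 seconds, which matches the worst stalls I see on the slower systems, while the faster system would get through the same flush in a couple of seconds.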
Thanks for reading my wall of text! I'm sorry for dragging on so long, but I didn't know how to describe the situation more briefly.
As a follow-up on this, it appears that the blocking behavior was introduced in Linux 3.0 to stabilize pages under writeback.
It seems that the commits that introduced the behavior can be safely patched away, and also that the behavior is due to change in 3.9, but for now, that is not the route I took to solve it.
Rather, I wrote a patch to Berkeley DB that lets me store the region files in a directory other than the environment root directory, and used it to store them in /dev/shm -- that is, on tmpfs, which avoids writeback of the region files altogether.
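If you go the same way, it is worth verifying that the chosen directory really is on tmpfs, since a misconfigured path would silently bring the writeback stalls back. Below is a minimal sketch that looks up the filesystem type in the text of /proc/mounts. The function name filesystem_type and the directory /dev/shm/bdb-regions are hypothetical, and the sketch ignores the octal escaping /proc/mounts applies to mount points containing spaces.

```python
import os


def filesystem_type(mounts_text, path):
    """Return the fstype of the longest mount point containing path,
    or None if nothing matches. Hypothetical helper, for illustration."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        # Compare with a trailing slash so "/dev/shm" does not match "/dev/shmx".
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point win.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fstype
    return best_type
```

For example, filesystem_type(open("/proc/mounts").read(), "/dev/shm/bdb-regions") should return "tmpfs" on a typical Linux system.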
If you want the patch, it is here for db4.8 (which is what Debian Stable uses), and here for 5.1, which is what Debian Testing uses.
(For some reason, the hyperlink format suggested by the forum doesn't seem to be working?)