Hi, I've managed to configure BDB in synchronous mode (i.e., each put is persisted to disk at commit time). However, I'm now doing 2,000 puts per second, each with a payload of 10 to 250 kilobytes, yet iostat reports that each disk transfer is only about 6 kilobytes (23 megabytes written to disk per second, divided by 3,750 transfers per second). How is that even possible? Is there a way to tell BDB to minimize the number of disk transfers per second in SYNC mode? It looks as if BDB is breaking each put's payload into smaller pieces and only then saving them to disk across a bunch of disk transfers.
JE does not split up a single write into multiple writes -- and certainly doesn't do an fsync for each one.
JE may do multiple writes (but not fsyncs) for a single, multi-operation txn if the write buffer fills, and it will do multiple writes for a single operation if the record is larger than the write buffer. However, it doesn't sound like this (overflowing the write buffer) is what you're experiencing. In any case, you can configure the size of the JE write buffer with EnvironmentConfig.LOG_TOTAL_BUFFER_BYTES, LOG_NUM_BUFFERS, and LOG_BUFFER_SIZE.
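To make those parameter names concrete, here is a minimal configuration sketch (assuming a recent BDB JE release where these are String constants on EnvironmentConfig; the sizes and the environment path are purely illustrative, not recommendations):

```java
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.File;

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
envConfig.setTransactional(true);

// Illustrative sizes: a 4 MB total buffer region split into 4 x 1 MB
// buffers, so a single record of up to ~1 MB fits in one buffer
// without overflowing it.
envConfig.setConfigParam(EnvironmentConfig.LOG_TOTAL_BUFFER_BYTES,
                         String.valueOf(4 * 1024 * 1024));
envConfig.setConfigParam(EnvironmentConfig.LOG_NUM_BUFFERS, "4");
envConfig.setConfigParam(EnvironmentConfig.LOG_BUFFER_SIZE,
                         String.valueOf(1024 * 1024));

// "/path/to/envHome" is a placeholder for your environment directory.
Environment env = new Environment(new File("/path/to/envHome"), envConfig);
```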
Another thing is that JE will group fsyncs (this is called "group commit") when multiple threads are committing concurrently with SYNC durability. In this case you'll see a smaller number of physical writes than the number of commits.
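For reference, this is roughly what a SYNC commit looks like from each worker thread (a sketch using the standard JE transaction API; `env`, `db`, `key`, and `value` are assumed to be set up elsewhere). When several threads reach commit() concurrently, JE's group commit can satisfy them with a single physical fsync:

```java
import com.sleepycat.je.Durability;
import com.sleepycat.je.Transaction;
import com.sleepycat.je.TransactionConfig;

TransactionConfig txnConfig = new TransactionConfig();
// Commit blocks until the transaction's log entries are fsync'd.
txnConfig.setDurability(Durability.COMMIT_SYNC);

// Each thread runs its put in its own SYNC transaction; concurrent
// commits may be folded into one disk flush by group commit.
Transaction txn = env.beginTransaction(null, txnConfig);
db.put(txn, key, value);
txn.commit();
```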
I asked a colleague who has more experience with iostat than I do about this, and he gave me the following information:
We would expect there to be one sync per put on average, assuming the application is doing serial writes and there are no group commits to further obfuscate the issue. Given the high sync write rate, the writes are presumably going to an SSD, or to spinning rust with a large non-volatile disk write cache.
I'm not sure what he means by disk transactions in iostat. Perhaps he means the number of disk transfer requests issued to the device, listed as tps (transfers per second) in iostat output.
If he is using ext3 and does not have the file system mounted with noatime, he may be observing an extra write request per put to update the file system's "atime" metadata. So for 2K sync puts/sec he would see roughly 4K write requests/sec (2K puts + 2K atime updates), and since the atime write payload is negligible, each put would actually be writing ~12KB (23MB/s divided by 2K puts), which would be consistent with the application's put behavior. This is all a guess.
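Redoing that arithmetic with the numbers from this thread (23 MB/s written, 3,750 transfers/s from iostat, 2,000 application puts/s) shows how atime updates would dilute the per-transfer average:

```java
public class IostatMath {
    public static void main(String[] args) {
        double kbPerSec = 23.0 * 1024; // 23 MB/s written, in KB/s
        int observedTps = 3750;        // transfers/sec reported by iostat
        int putsPerSec = 2000;         // application puts/sec

        // What iostat reports: average payload per transfer.
        double perTransfer = kbPerSec / observedTps; // ~6.3 KB

        // If roughly half the transfers are tiny atime updates, the
        // data-carrying writes are just the 2K puts:
        double perPut = kbPerSec / putsPerSec;       // ~11.8 KB, i.e. ~12 KB

        System.out.printf("per transfer: %.1f KB, per put: %.1f KB%n",
                          perTransfer, perPut);
    }
}
```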
I hope this helps.
Thanks, Mark. Yeah, I was referring to transfers per second. (I don't know why on earth I read "transactions".) I'll keep investigating here. It definitely helps to know that BDB does not split a write into multiple writes.
What bothers me is that ZooKeeper (the application I'm competing with) managed to reach 40 KB/transfer, while with BDB I barely reached 6 KB/transfer.