I have an application that uses persistent queues. Currently I am using BTREEs. Performance is great because the locality of reference (writes are done at the end, reads can be "streamed"). There is only a problem: queue entries deletion.
The persistent queues can grow without bound. That is good and by design. From time to time I delete a bulk of entries at the beginning of the queue. Several millions. Here performance suffers quite a bit. The tree must be rebalanced a lot, other pages must be read for it, etc.
Investigating choices I studied QUEUE access method. Data is stored in "extends", and when all records in an extend are deleted, the extend (the file) is physically deleted. So deletes are quite fast, and the traffic is local to those pages. There is a problem, thought. A roadblock: all record must have the same size.
Then I tried RECNO. Record sizes can vary. Nice. But it doesn't use extends, so I wonder how deletes are managed. So I wrote this python test program:
db = bsddb3.db.DB()
db.open("z.db", dbtype=bsddb3.db.DB_RECNO, flags=bsddb3.db.DB_CREATE)
for i in xrange(1,1000000) :
for i in xrange(1, 1000000, 2) :
for i in xrange(1000000, 2000000) :
for i in xrange(2, 1000000, 2) :
for i in xrange(2000000, 3000000) :
for i in xrange(1000000, 3000000) :
for i in xrange(3000000, 6000000) :
Thu Mar 15 16:42:45 2012 Local time
53162 Btree magic number
9 Btree version number
Little-endian Byte order
0 Fixed-length record size
0x20 Fixed-length record pad
4096 Underlying database page size
3 Number of levels in the tree
3000000 Number of records in the tree
3000000 Number of data items in the tree
52 Number of tree internal pages
4510 Number of bytes free in tree internal pages (97% ff)
20662 Number of tree leaf pages
24M Number of bytes free in tree leaf pages (71% ff)
0 Number of tree duplicate pages
0 Number of bytes free in tree duplicate pages (0% ff)
0 Number of tree overflow pages
0 Number of bytes free in tree overflow pages (0% ff)
0 Number of empty pages
0 Number of pages on the free list
I see this: when a record is deleted, is marked as it, BUT the space is NOT reused for other records, so the database grows without limits. Moreover, when you iterate over the database, it takes "a while" to skip over the deleted areas.
What are I missing?
I just read about the DB_REVSPLITOFF flag for BTREE. Would it solve the disk traffic problem/performance I am experiencing with massive deleting of old records?.
Would the use of Berkeley DB BULK delete interface improve performance, beside the obvious call overhead of a million "delete".
PS: The real databases are transactional. The code showed is only for testing&evaluation. if the behaviour is different, let me know.
After a lot of thinking and experimentation, I surrender.
There is no nice and efficient way.
Last week I implemented a daily queue. Each notification is stored in a daily BTREE "queue", with an extra database for metadata (range covered by each day, list of days covered, etc). Now deleting an old day is trivial: you delete the entire (daily) database and update the metadata.