We are also seeing sequence skipping in production, perhaps a function of the cleaner not catching up (although we have two cleaner threads configured).
Yesterday, in an attempt to perform maintenance that would bring our growing set of log files back down to a reasonable number, I was running a sequence of DbDump/DbLoad operations through a tool I wrote. It simply iterates through the database names in each log file set, dumps them to a temp directory, and rebuilds a new log file set (with many fewer log files) by importing each database back into a new environment.
While doing this, I got a BufferUnderflowException on a logfile set that had perfect sequence numbers, which surprised me. Thus, I'm not able to do a DbDump and DbLoad on that data set. The exception stack trace I get is:
ERROR 2009-01-26 07:01:03,157 | Maintenance:dumpAndReload | error dumping persist#FlightPlanStore#com.foreflight.plan.FlightPlanning from /workspace/mymetar_com/exploded/webapp/data/flightplan
com.sleepycat.je.DatabaseException: (JE 3.3.75) fetchTarget of 0x3/0x68ef71 parent IN=17457 IN class=com.sleepycat.je.tree.IN lastFullVersion=0x8/0x8c47f8 parent.getDirty()=false state=0 com.sleepycat.je.DatabaseException: java.nio.BufferUnderflowException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Caused by: com.sleepycat.je.DatabaseException: java.nio.BufferUnderflowException
... 26 more
Caused by: java.nio.BufferUnderflowException
... 28 more
With this, I figured there is some corruption and I need to recover. I abandoned the dump/load Java util I wrote and reverted to the command line DbScavenger and DbLoad utils. I can run DbScavenger and dump something to a target directory, but the result doesn't contain what I expected: there are a bunch of "dbX.dump" files and one "<index>.dump" file instead of a set of "<index name>.dump" files. Maybe this is a result of the tool not being able to recover all the data?

I'm wondering if I have a time bomb with this set of log files in production, where some keys/records can be accessed but others can't, and we just haven't yet had users with corrupted data access their records. Happy to upload the log files somewhere if having them would be helpful. It's time we buy that maintenance. This is a mission-critical db for us.
Thanks for posting a new thread.
The BufferUnderflowException you're getting is not related to log cleaning / file deletion, so whether there are any deleted files is unrelated.
This exception is the equivalent of a checksum failure, but it is detected earlier, while parsing the log entry header, before checksum verification can take place. It has the same meaning as a checksum failure: the log is corrupt, possibly as the result of a hardware, file system or OS failure.
In this particular case I believe the log file is incomplete, although I'm not sure that tells us anything about why the corruption occurred. JE was attempting to fetch this LSN: 0x3/0x68ef71. The file is 00000003.jdb and the entry is at offset 0x68ef71. You may want to take a look at the length of that file and see whether it ends just past this offset.
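If it helps, the LSN in the exception can be decoded mechanically. Below is a small illustrative helper (not a JE API; it just relies on JE's standard log file naming of 8 hex digits plus ".jdb") that converts the "0x&lt;file&gt;/0x&lt;offset&gt;" form from the stack trace into a file name and byte offset, so you can compare the offset against the file's actual length:

```java
import java.io.File;

public class LsnLocator {

    /** Returns the .jdb file name for the file portion of an LSN, e.g. "0x3/..." -> "00000003.jdb". */
    public static String fileNameForLsn(String lsn) {
        String filePart = lsn.split("/")[0];                     // e.g. "0x3"
        long fileNum = Long.parseLong(filePart.substring(2), 16);
        return String.format("%08x.jdb", fileNum);               // JE pads to 8 hex digits
    }

    /** Returns the byte offset portion of an LSN, e.g. ".../0x68ef71" -> 6877041. */
    public static long offsetForLsn(String lsn) {
        String offsetPart = lsn.split("/")[1];                   // e.g. "0x68ef71"
        return Long.parseLong(offsetPart.substring(2), 16);
    }

    public static void main(String[] args) {
        String lsn = "0x3/0x68ef71";
        String fileName = fileNameForLsn(lsn);
        long offset = offsetForLsn(lsn);
        System.out.println(fileName + " @ offset " + offset);

        // If run from the environment home, compare against the real file length.
        File logFile = new File(fileName);
        if (logFile.exists() && logFile.length() <= offset) {
            System.out.println("file ends at or before the bad LSN (likely truncated)");
        }
    }
}
```

Running this against the LSN from your trace points at 00000003.jdb, byte offset 6877041; if that file's length is only just past (or short of) that offset, the file is probably truncated.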
The three approaches for dealing with a file corruption are to 1) revert to a backup, 2) recreate the data from some other source, or 3) try to scavenge the file and reconstruct the data. You are attempting (3) so I assume you don't have a useful backup or another source for the data. Is that correct?
DbScavenger outputs one dump file per database. The file name contains the name of the database, but if the name cannot be determined then the internal database ID is used instead. In the latter case, you'll have to figure out which database it belongs to by examining the contents. Assuming you can determine which database a file belongs to, you can use DbLoad to re-create that database in a fresh environment.
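To sort through the scavenger output, a quick way to separate files that still carry a database name from the "dbN.dump" files that need manual inspection is to match on the naming convention described above. This is a hypothetical helper, not part of JE:

```java
public class DumpFileClassifier {

    /** True if the dump file was named from an internal database ID (dbN.dump) rather than a db name. */
    public static boolean isInternalId(String dumpFileName) {
        return dumpFileName.matches("db\\d+\\.dump");
    }

    public static void main(String[] args) {
        String[] files = { "FlightPlanStore.dump", "db12.dump", "db3.dump" };
        for (String f : files) {
            System.out.println(f + (isInternalId(f)
                    ? " -> internal ID; inspect contents to identify the database"
                    : " -> named database; can be loaded directly with DbLoad"));
        }
    }
}
```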
DbScavenger attempts to skip over corrupt portions of the log files and it writes out only the records it can read successfully -- the records that are fully intact and have valid checksums. How useful this is depends on how much is corrupted and whether you have the ability to piece together what is left. You'll need to examine the output and try to determine for yourself whether it is worthwhile to try to reconstruct the data from that information, or use approach (1) or (2) instead.
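For reference, a sketch of the scavenge-and-reload workflow from the command line. The environment paths and database name here are placeholders, and you should check the JE utility javadoc for the full option list:

```shell
# Scavenge whatever records survive into per-database dump files.
# -h = environment home, -o = output directory for the .dump files.
java -cp je.jar com.sleepycat.je.util.DbScavenger -h <envhome> -o <dumpdir>

# For each dump file you can identify, load it into a brand-new environment.
# -s names the database to create; -f is the dump file to read.
java -cp je.jar com.sleepycat.je.util.DbLoad -h <new-envhome> -s <dbname> -f <dumpdir>/<dbname>.dump
```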