I have successfully reproduced it using 3.3.87. I would like you to try out 4.0 on your machine to see if it happens there (it does not on my machine with 4.0). If you send me an email I will give you instructions on how to download 4.0.
I've been testing a 100 GB+ dataset using the 4.0.60 build, and so far no issues at all. It seems to have resolved the problem. What was the root cause? Is Windows 7 doing something funky with the filesystem now?
I'm not ready to say what the cause is quite yet. I am reasonably certain I understand the problem, but am not ready to go out on a limb and say what it is in case I'm wrong. I'll post a report when I am certain.
At this point I am reasonably certain that the problem has to do with a write() call being initiated on a file when an fsync() is already in progress in another thread (i.e. a concurrent fsync and write on the same file, but not with the same file descriptors). JE routinely performs concurrent IO operations on a given file. In the particular test case that user Ambber sent me, it is by virtue of the checkpointer initiating an fsync while the user application thread is writing.
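For anyone who wants to exercise that I/O pattern outside of JE, here is a minimal sketch: one thread appends to a file while another thread repeatedly calls fsync on the same file through a separate descriptor, and the file is then read back and checked. This is not JE code and not the JETester program; the class name, method name, block size, and fill byte are all made up for illustration, and a clean run on one machine proves nothing about another.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stress sketch: concurrent write() and fsync() on the same file,
// each thread using its own file descriptor. Not taken from JE.
public class ConcurrentFsyncWrite {

    /** Returns the number of bytes that differ from what was written (0 if the file is intact). */
    static long run(File file, int writes, int blockSize) throws Exception {
        byte[] block = new byte[blockSize];
        Arrays.fill(block, (byte) 0x5A);              // known pattern, easy to verify later
        CountDownLatch start = new CountDownLatch(1); // releases both threads together
        AtomicReference<Throwable> failure = new AtomicReference<>();

        // Writer thread: appends blocks sequentially through its own descriptor.
        Thread writer = new Thread(() -> {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                start.await();
                for (int i = 0; i < writes; i++) {
                    raf.write(block);
                }
            } catch (Throwable t) {
                failure.compareAndSet(null, t);
            }
        });

        // Syncer thread: opens the same file separately and forces it to disk repeatedly.
        Thread syncer = new Thread(() -> {
            try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
                start.await();
                for (int i = 0; i < writes; i++) {
                    raf.getFD().sync();
                }
            } catch (Throwable t) {
                failure.compareAndSet(null, t);
            }
        });

        writer.start();
        syncer.start();
        start.countDown();
        writer.join();
        syncer.join();
        if (failure.get() != null) {
            throw new Exception("I/O thread failed", failure.get());
        }

        // Read everything back and count length and content mismatches.
        byte[] contents = Files.readAllBytes(file.toPath());
        long bad = Math.abs((long) contents.length - (long) writes * blockSize);
        for (byte b : contents) {
            if (b != 0x5A) {
                bad++;
            }
        }
        return bad;
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("fsync-write", ".dat");
        f.deleteOnExit();
        System.out.println("mismatched bytes: " + run(f, 200, 4096));
    }
}
```

A nonzero result would suggest the file contents changed underneath the writer. Reading the file back right after writing will usually hit the OS cache, so a more thorough check would reopen the file after a remount or reboot.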
On ext3 we previously encountered a performance slowdown because that file system takes an exclusive mutex on the inode for any IO operation, so an fsync blocks reads and writes. JE 4.0 has a "fix" for this problem, which is described here.
That said, there seems to be a true Windows 7 bug here, if for no other reason than I can observe corruption on sector boundaries in the log files (JE does no operations on sector boundaries).
Edited by: Charles Lamb on Oct 28, 2009 7:58 AM
It is reproducible not only on Windows 7. I have the same issue, but on a Linux machine. It was reproduced on Java 1.5_07 and 1.6_16. I'm using MontaVista Linux with an ext3 filesystem. The exception I got was the same:
<DaemonThread name="Cleaner-1"/> caught exception: com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x0/0x488bd70 expected 3031505114 got 3055098078
com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x0/0x488bd70 expected 3031505114 got 3055098078
at java.lang.Thread.run(Unknown Source)
It was reproduced 3 times when the database file was getting big. The maximum size of each JE log file was set to 200 MB.
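For readers unfamiliar with the "expected ... got ..." wording in the exception, here is a rough sketch of how a checksum mismatch on read is typically detected: a checksum is stored when the data is written, recomputed when it is read, and the two are compared. JE's actual log entry layout is not shown in this thread, so the use of java.util.zip.Adler32 and all class and method names below are assumptions for illustration only.

```java
import java.util.zip.Adler32;

// Hypothetical illustration of checksum verification; not JE's real code.
public class ChecksumCheck {

    /** Computes a checksum over the whole payload. */
    static long checksum(byte[] payload) {
        Adler32 adler = new Adler32();
        adler.update(payload, 0, payload.length);
        return adler.getValue();
    }

    /** Throws if the payload no longer matches the checksum stored at write time. */
    static void verify(byte[] payload, long expected) {
        long got = checksum(payload);
        if (got != expected) {
            throw new IllegalStateException("expected " + expected + " got " + got);
        }
    }

    public static void main(String[] args) {
        byte[] entry = "some log entry bytes".getBytes();
        long stored = checksum(entry);   // value that would be stored with the entry

        verify(entry, stored);           // intact data passes silently

        entry[3] ^= 0x01;                // simulate a single corrupted bit on disk
        try {
            verify(entry, stored);
        } catch (IllegalStateException e) {
            System.out.println("corruption detected: " + e.getMessage());
        }
    }
}
```

The point is only that the error is raised by the reader of the data (here the cleaner thread), which may be long after the bad bytes reached the disk.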
Also, could you please let us know what parameters you had set, including VERIFY_CHECKSUMS? Or was this the JETester program that you gave us? Does DbPrintLog -s 0x0 -e 0x1 produce the same exception? Does it happen on ext3 w/ JE 4.0?
The database configuration is as follows:
EnvironmentConfig envConf = new EnvironmentConfig();
The database is used in non-transactional mode and, as you can see, VERIFY_CHECKSUMS is not set.
After getting this error I restarted everything from scratch, so 00000000.jdb was lost. I'll try to reproduce the error once more and will send the file to you.
I will try JETester on that configuration.
I haven't tried JE 4.0, because I don't know where to get it.
Thanks for the update.
I've run the JETester against 4.0.60 several times now. I haven't seen the checksum error at all, either in JETester or in my own app. Is this fix likely to be put into a 3.x series release?