This discussion is archived
1 2 Previous Next 27 Replies Latest reply: Oct 30, 2009 7:17 PM by Charles Lamb RSS

Corruption on Windows 7 ?

722161 Newbie
Currently Being Moderated
I've recently moved my development machine to 64bit Windows 7 build 7600 and Sun 32bit JDK 1.6.0_20.

I've seen DbChecksumException 3 times now when working on a dataset of about 20gb in size. I didn't track it the first two times, but the exception from the third instance is below. The first two times I removed the log files and reloaded them from a bulk import to clear the error. I'm not sure, but I think the other exceptions occured in my program threads and not the cleaner thread.

This is on a transaction-enabled database using the collection APIs. Has anyone seen similar issues on Windows 7? What information would be helpful to troubleshoot this?

<DaemonThread name="Cleaner-1"/> caught exception: com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x23/0x3fb8ee expected 4034848955 got 3710585889
com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x23/0x3fb8ee expected 4034848955 got 3710585889
     at com.sleepycat.je.log.ChecksumValidator.validate(ChecksumValidator.java:96)
     at com.sleepycat.je.log.FileReader.validateChecksum(FileReader.java:593)
     at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:314)
     at com.sleepycat.je.cleaner.FileProcessor.processFile(FileProcessor.java:386)
     at com.sleepycat.je.cleaner.FileProcessor.doClean(FileProcessor.java:233)
     at com.sleepycat.je.cleaner.FileProcessor.onWakeup(FileProcessor.java:138)
     at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:141)
     at java.lang.Thread.run(Thread.java:619)
  • 1. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Hello Ambber,

    We've seen another issue on Windows 7, but it had to do with fsync's. Still, it might be related to what you're seeing.

    Could you please give us the JE parameters you're using along with any java options you're passing in on the command line?

    What kind of hardware are you running on? In general is your application doing multi-threaded access to JE? I suspect it is M/T.

    How readily can you reproduce the problem? If you've seen it three times, it sounds like you can reproduce it fairly easily, but over what time period did the 3 occurrences happen?

    Do you see the same problem on other platforms besides Win 7?

    Could you please run with je.log.verifyChecksums=true and tell us what you get?

    You can contact me via email at charles.lamb @ o.com

    Thanks.

    Charles Lamb
  • 2. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    Hi Charles,

    My environment is configured as follows:

    final EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    environmentConfig.setTransactional(true);
    environmentConfig.setConfigParam(EnvironmentConfig.MAX_MEMORY, String.valueOf(10* 1000 * 1000));
    environmentConfig.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, String.valueOf(10* 1000 * 1000));

    And the database is configured as follows:

    final DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    dbConfig.setTransactional(true);


    No other bdb-je parameters are set via command line.

    This is on a thinkpad t61p laptop with a core2 duo cpu. I did not see this issue when I was running on vista up to a few weeks ago on the same machine. The app is mostly deployed on Linux servers and I've never seen this error there either.

    My app is a servlet so it is indeed multi-threaded, but when the error occurred a data bulkload was occuring on a single thread with no user accesses. In the times I had the error it was shortly after a large table (15m or so keys) had been truncated. I haven't worked on the project since I first posted, but next time I see it I'll set the verifyChecksums to true and see what the error is. I did spend some time time trying to reproduce on purpose but I was unsuccessful.
  • 3. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Thanks for the info.

    In order for verify checksums to be useful, it will have to be set when you see the exception, not afterwards. It is a debugging param which causes checksums to be verified just before they are written to disk.

    Thanks again.

    Charles Lamb

    Edited by: Charles Lamb on Oct 22, 2009 9:08 AM
  • 4. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    I started with fresh log files, and ran through my program again, this time with

    environmentConfig.setConfigParam(EnvironmentConfig.LOG_VERIFY_CHECKSUMS, "true");

    set from the beginning, and ended up with the same exception:

    <DaemonThread name="Cleaner-1"/> caught exception: com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x439/0x78a1a0 expected 2616883590 got 1417244505
    com.sleepycat.je.log.DbChecksumException: (JE 3.3.87) Location 0x439/0x78a1a0 expected 2616883590 got 1417244505
         at com.sleepycat.je.log.ChecksumValidator.validate(ChecksumValidator.java:96)
         at com.sleepycat.je.log.FileReader.validateChecksum(FileReader.java:593)
         at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:314)
         at com.sleepycat.je.cleaner.FileProcessor.processFile(FileProcessor.java:386)
         at com.sleepycat.je.cleaner.FileProcessor.doClean(FileProcessor.java:233)
         at com.sleepycat.je.cleaner.FileProcessor.onWakeup(FileProcessor.java:138)
         at com.sleepycat.je.utilint.DaemonThread.run(DaemonThread.java:141)
         at java.lang.Thread.run(Thread.java:619)

    So verify checksums didn't change the behavior or exception.

    I seem to be getting good at making this happen in my app. I add in about 20gb (takes many hours on my laptop with these config settings) of data to a single database with the cleaner disabled, the stop the app, restart with the cleaner on and truncate the database, and I'll get the exception within a few minutes. I'm not sure yet which part of this is the trigger. It doesn't happen if I only populate a few gb of data. I've writen a test app that basically does the same thing and so far I can't get it to fail.. I'm going to try next by not disabling the cleaner at all and see if that has anything to do with it.
  • 5. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Hi,

    When you say you truncate the database, you mean you call Database.truncateDatabase(), correct? I'm sure that's what you're doing, but I want to just make sure you're not actually rm'ing *.jdb files.

    When you restart with the cleaner running, is anything else going on or do you just have the cleaner running? i.e. is there any application work going on?

    After the checksum exception, are you able to still open the database with an application or is the environment corrupted permanently?

    BTW, the verifyChecksums just checks the checksums when they're being written out as well as after they're written out. It's interesting that it's not being detected at write time.

    If you do DbPrintLog -s 0x439 on the corrupted log environment that you have does it blow out at the same lsn? 0x439/0x78a1a0. Are you able to give us that file (0x00000439.jdb) or does it have confidential info in it? Is 00000439.jdb the last file or are there more after it? If you can give me that file, please email it to me (charles.lamb @ o.com).

    In a previous post you said that you are setting CLEANER_MIN_UTILIZATION:

    environmentConfig.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, String.valueOf(10* 1000 * 1000));

    Do we really accept a value of 10,000,000 for a percentage without throwing an exception at y?

    Also, it seems like 10MB of cache is pretty small for a 20GB database.

    If you try to recreate the exception, and you run DbPrintLog 2>&1 | grep Exception after the load and before you start the cleaner, does it throw a checksum exception from DbPrintLog? In other words, is the environment ok before you start the cleaner? That would tell us a lot.

    I'm working on getting my hands on a win 7 machine. Is it possible for you to ship your app to me so that I can run it here?

    Thanks for your help.

    Charles Lamb

    Edited by: Charles Lamb on Oct 22, 2009 9:08 AM
  • 6. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    When you say you truncate the database, you mean you call Database.truncateDatabase(), correct? I'm sure that's what you're doing, but I want to just make sure you're not actually rm'ing *.jdb files.
    Correct, truncateDatabase() is what im doing, no *.jdb files were manipulated outside the api.
    After the checksum exception, are you able to still open the database with an application or is the environment corrupted permanently?
    In this case, the enviroment and all databases open and close happily. It's just the cleaner thread that seems to be having problems. I'm not sure what problems I would have if I kept using the app, I havent really tested anything much after I spotted the error.
    BTW, the verifyChecksums just checks the checksums when they're being written out as well as after they're written out. It's interesting that it's not being detected at write time.
    Perhaps I should try adding a read value after write during my population routine, it probably will be happy too. I suspect windows odditties. My current theory, which I base on absolutely nothing, is that for a jdb file to exhibit this problem it has to be evicted from the OS disk cache. Maybe I'll try a 2gb data set and a reboot in between....
    If you do DbPrintLog -s 0x439 on the corrupted log environment that you have does it blow out at the same lsn? 0x439/0x78a1a0. Are you able to give us that file (0x00000439.jdb) or does it have confidential info in it? Is 00000439.jdb the last file or are there more after it?
    DbPrintLog barfs at the same place. 439.jdb is faiirly elderly among the 2800 log files, d84 is the youngest.
    In a previous post you said that you are setting CLEANER_MIN_UTILIZATION:
    environmentConfig.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, String.valueOf(10* 1000 * 1000));
    Do we really accept a value of 10,000,000 for a percentage without throwing an exception at y?
    Busted. I goofed during copy and paste, the values on those lines are actually constants from another class so I pasted in the values for readability. The real value for that line should be 80. The other mistake I just noticed is I'm on 1.6.0_16, not 1.6.0_20.
    Also, it seems like 10MB of cache is pretty small for a 20GB database
    True, and only 64mb heap. But this should result in slowness, or possibly an OOM (with potential for subsequent corruption), but I'm just seeing the corruption. Now that I'm getting closer towards repeatability, maybe I'll try with bigger heap/cache. It basically takes an overnight data load for a test cycle now, I have to use the machine during the day for other stuff.

    I'm not sure about sending the log file / app, I'll have to check on that...

    I'm still focusing on trying make a reproducible test case. So far there isn't any real impact to this problem since the only place it exists is on my laptop. It's just a curiosity, at least until it happens on a real machine :)
  • 7. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    After the checksum exception, are you able to still open the database with an application or is the environment corrupted permanently?
    In this case, the enviroment and all databases open and close happily. It's just the cleaner thread that seems to be having problems. I'm not sure what problems I would have if I kept using the app, I havent really tested anything much after I spotted the error.
    You'd eventually run into it if you read that piece of data from the log with your app.

    >
    BTW, the verifyChecksums just checks the checksums when they're being written out as well as after they're written out. It's interesting that it's not being detected at write time.
    Perhaps I should try adding a read value after write during my population routine, it probably will be happy too. I suspect windows odditties. My current theory, which I base on absolutely nothing, is that for a jdb file to exhibit this problem it has to be evicted from the OS disk cache. Maybe I'll try a 2gb data set and a reboot in between....
    If you do a read after write you'll have to make sure that the data you're reading is not in the cache, otherwise it will just take the cached data.

    DbPrintLog will accomplish this, but, it will do it by reading every record in the log, whether or not they are "live". You could do a DbDump, or use your own app to do a read of the entire database. Those programs will only read "live" data. It would be interesting to know if the thing it is stumbling upon is live or dead. That said, since you're doing a load, it is probably live.
    If you do DbPrintLog -s 0x439 on the corrupted log environment that you have does it blow out at the same lsn? 0x439/0x78a1a0. Are you able to give us that file (0x00000439.jdb) or does it have confidential info in it? Is 00000439.jdb the last file or are there more after it?
    DbPrintLog barfs at the same place. 439.jdb is faiirly elderly among the 2800 log files, d84 is the youngest.
    In a previous post you said that you are setting CLEANER_MIN_UTILIZATION:
    environmentConfig.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, String.valueOf(10* 1000 * 1000));
    Do we really accept a value of 10,000,000 for a percentage without throwing an exception at y?
    Busted. I goofed during copy and paste, the values on those lines are actually constants from another class so I pasted in the values for readability. The real value for that line should be 80. The other mistake I just noticed is I'm on 1.6.0_16, not 1.6.0_20.
    OK, no problem. That explains that.

    >
    Also, it seems like 10MB of cache is pretty small for a 20GB database
    True, and only 64mb heap. But this should result in slowness, or possibly an OOM (with potential for subsequent corruption), but I'm just seeing the corruption. Now that I'm getting closer towards repeatability, maybe I'll try with bigger heap/cache. It basically takes an overnight data load for a test cycle now, I have to use the machine during the day for other stuff.
    I expect that this will be squirmy. You may be able to reproduce it on a smaller run some of the time, but not as often. Just a guess.
    I'm not sure about sending the log file / app, I'll have to check on that...

    I'm still focusing on trying make a reproducible test case. So far there isn't any real impact to this problem since the only place it exists is on my laptop. It's just a curiosity, at least until it happens on a real machine :)
    I'd like to have you try 4.0 on this if that's ok with you. I can't post the download link to it on the forum, but I can email it to you. If that still shows the bug, then I'll start giving you debugging jars built on 4.0.

    What is bothersome is that with the verify checksums enabled we are doing a read-after-write (see src/com/sleepycat/je/log/FileManager.java:1344).

    Edited by: Charles Lamb on Oct 22, 2009 9:07 AM
  • 8. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    Here's a test case. On my machine, this generally fails in the early part of the read cycle, but sometimes fails during the initial write, and sometimes doesn't fail at all, and a reboot of the machine may prompt the next run to produce a failure.

    I started with something more complicated and have been whittling down, and I think there can be more whittling....



    ------


    import com.sleepycat.bind.tuple.TupleBinding;
    import com.sleepycat.collections.StoredMap;
    import com.sleepycat.je.*;

    import java.io.File;
    import java.util.Random;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Date;
    import java.text.SimpleDateFormat;

    public class JETester {
    private static final int ITEM_COUNT = 10 * 1000 * 1000;
    private static final int BATCH_SIZE = 10 * 1000;
    private static final String DB_NAME = "test";
    private static final Random RANDOM = new Random();
    private static final File DB_LOCATION = new File("H:\\temp");

    public static void main(final String[] args) throws DatabaseException, InterruptedException {
    {
    log("starting write cycle");
    final File databaseDirectory = DB_LOCATION;
    final Environment environment = setupEnvironment(databaseDirectory);
    final Database database = getDatabase(environment);
    final StoredMap<String,String> storedMap = setupMap(database);
    populateMap(storedMap, ITEM_COUNT);
    for (int i = 0; i <= ITEM_COUNT; i++ ) {
    storedMap.get(String.valueOf(i).toLowerCase());
    if (i % (BATCH_SIZE * 10) == 0) log("read tick-" + i);
    }
    database.close();
    environment.truncateDatabase(null, DB_NAME, false);
    environment.close();
    log("read complete");
    }
    }


    private static void populateMap(final StoredMap<String, String> storedMap, final int size) {
    Map<String,String> buffer = new HashMap<String,String>();
    int counter = 0;
    while (counter < size) {
    while (buffer.size() < BATCH_SIZE && counter < size) {
    buffer.put(String.valueOf(counter++),randomStringValue(RANDOM.nextInt(256) + 256));
    }
    storedMap.putAll(buffer);
    buffer.clear();
    if (counter % (BATCH_SIZE * 10) == 0) log("write tick-" + counter);
    }
    }

    private static void log(String str) {
    System.out.println(SimpleDateFormat.getInstance().format(new Date()) + " " + str);
    }

    private static String randomStringValue(int length) {
    StringBuilder sb = new StringBuilder();
    while (sb.length() < length) sb.append((char)(RANDOM.nextInt(26) + 32) );
    return sb.toString();
    }

    private static Database getDatabase(Environment environment) throws DatabaseException {
    final DatabaseConfig dbConfig = new DatabaseConfig();
    dbConfig.setAllowCreate(true);
    return environment.openDatabase(null, DB_NAME, dbConfig);
    }

    private static Environment setupEnvironment(File databaseDirectory) throws DatabaseException {
    final EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    environmentConfig.setConfigParam(EnvironmentConfig.MAX_MEMORY, String.valueOf(10 * 1000 * 1000 ));
    environmentConfig.setConfigParam(EnvironmentConfig.LOG_FILE_MAX, String.valueOf(10 * 1000 * 1000));
    environmentConfig.setConfigParam(EnvironmentConfig.CLEANER_MIN_UTILIZATION, "80");
    environmentConfig.setConfigParam(EnvironmentConfig.LOG_VERIFY_CHECKSUMS, "true");
    return new Environment(databaseDirectory, environmentConfig);
    }

    private static StoredMap<String,String> setupMap(final Database database) throws DatabaseException {
    final TupleBinding<String> stringBinding = TupleBinding.getPrimitiveBinding(String.class);
    return new StoredMap<String,String>(database, stringBinding, stringBinding, true);
    }
    }
  • 9. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Thank you. I'll run it various platforms and see what happens. I won't be getting my copy of Win 7 until tomorrow but will try it there as soon as I get it installed.

    Will you be upgrading to the "real" release of Win 7?

    Charles Lamb

    Edited by: Charles Lamb on Oct 22, 2009 9:07 AM
  • 10. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    I'm running on the real release now (Build 7600). It's been available for awhile for MSDN subscribers.
  • 11. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Tell me about the disk "H:" which you're running this on. Since it's your laptop, it seems like H: is kind of "high" given that you probably only have one drive (but maybe you have a bunch of partitions). Is H: a remotely mounted network file system? Or a local HDD? Or is it an SSD? Is it a standard SATA notebook drive? What file system type is it? NTFS or something else? Is it a file system which was created using Vista or is it a fresh Win 7-created file system?

    Thanks.

    Charles Lamb

    Edited by: Charles Lamb on Oct 22, 2009 9:07 AM
  • 12. Re: Corruption on Windows 7 ?
    Charles Lamb Pro
    Currently Being Moderated
    Good news. We have reproduced this (quite easily) on a Windows 7 machine. The good news is that we can't reproduce it on any other machine. Stay tuned. It'll take me some time to diagnose it.

    Thank you for the test case -- it helped a lot.

    Charles Lamb
  • 13. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    Oh that... Sorry, I'm just using the old DOS "subst" command. It's just my plain old NTFS C: drive. And problem occurs if I actually reference it as C: drive.
  • 14. Re: Corruption on Windows 7 ?
    722161 Newbie
    Currently Being Moderated
    Great news indeed! I'm no longer alone! It would be great if you could post an analysis once you've got it sorted out, I'm terribly curious at this point.
1 2 Previous Next

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points