4 Replies Latest reply: May 8, 2014 8:40 AM by userBDBDMS-Oracle

    Stress-testing CDB as used by RPM causes a crash

    982876

      (Re-opening https://community.oracle.com/message/10776786, which was archived because I was way too slow to respond.)

       

      Summary:

       

      1.) RPM in RHEL corrupts its database when used extensively

      2.) I've been able to craft a reproducer for the issue.

      3.) I can reliably reproduce the problem with all versions I used (including latest db-6.0.20)

      4.) Cindy Zeng was not able to reproduce the problem; the reproducer does not crash for her

       

      My reproducer is available here: http://v3.sk/~lkundrak/bdb-crash/

      A run on RHEL 7 Beta on x86_64:

       

      bdb-crash♥ make CFLAGS="-I/usr/local/BerkeleyDB.6.0/include" LDFLAGS="-L/usr/local/BerkeleyDB.6.0/lib -Wl,-rpath=/usr/local/BerkeleyDB.6.0/lib"
      cc -I/usr/local/BerkeleyDB.6.0/include -c -o reader.o -DREADERS test.c
      cc -L/usr/local/BerkeleyDB.6.0/lib -Wl,-rpath=/usr/local/BerkeleyDB.6.0/lib -ldb -lpthread  reader.o   -o reader
      cc -I/usr/local/BerkeleyDB.6.0/include -c -o writer.o -DWRITER test.c
      cc -L/usr/local/BerkeleyDB.6.0/lib -Wl,-rpath=/usr/local/BerkeleyDB.6.0/lib -ldb -lpthread  writer.o   -o writer
      bdb-crash♥ sh test.sh
      Mon Feb 10 14:35:21 CET 2014
      BDB0113 Thread/process 30877/30877 failed: BDB1507 Thread died in Berkeley DB library
      test.c:53: BDB0087 DB_RUNRECOVERY: Fatal error, run database recoveryReader died
      test.sh: line 12: 30871 Terminated              ( while ./writer; do
          :;
      done; echo 'Writer died' )
      Mon Feb 10 14:35:21 CET 2014
      bdb-crash♥ sh test.sh
      Mon Feb 10 14:35:23 CET 2014
      test.c:53: BDB0087 DB_RUNRECOVERY: Fatal error, run database recoveryReader died
      test.sh: line 12: 30900 Terminated              ( while ./writer; do
          :;
      done; echo 'Writer died' )
      Mon Feb 10 14:35:28 CET 2014
      bdb-crash♥

        • 1. Re: Stress-testing CDB as used by RPM causes a crash
          userBDBDMS-Oracle

          We definitely want to take a look at this. Thanks for the reproducer. Can you contact me directly on this issue so we can discuss it in more detail? Just email me at michael.brey@oracle.com.

           

          thanks

          mike

          • 2. Re: Stress-testing CDB as used by RPM causes a crash
            982876

            A couple more observations: this does not seem to be caused by anything specific to RHEL. It happens on openSUSE as well.

            Looks like there are two ways it can fail:

             

            BDB0087 DB_RUNRECOVERY: Fatal error, run database recoveryReader died

            and BDB0113 Thread/process 4388/4388 failed: BDB1507 Thread died in Berkeley DB library

             

            The second one (BDB0113) is easier to hit when running under strace(1), possibly due to the slowdown. There's a fairly obvious race in src/env/env_failchk.c:__env_in_api():

            When the check runs while another process is starting and being added to the table, that process's ip->dbth_state changes while the body of the SH_TAILQ_FOREACH(ip...) loop is executing. Each if() conditional can therefore see a different value, with a chance that none of them will match.

            I'm not sure how to fix that, though. A big switch() instead of the conditionals would cause the state to be evaluated only once, but the same issue affects other fields (tid, pid) as well; the other process can reuse the slot, changing those fields to its own identification mid-air.
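
To make the re-read problem concrete, here is a minimal sketch in C. It is not the Berkeley DB code: the types, names and state values below are hypothetical stand-ins, and the concurrent writer is simulated with a scripted sequence of values so the interleaving is deterministic. It only illustrates why a chain of conditionals that each re-read a shared field can fall through, while a single read cannot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT Berkeley DB types. */
enum slot_state { SLOT_FREE, SLOT_ACTIVE, SLOT_BLOCKED };
enum verdict { V_SKIP, V_CHECK_ALIVE, V_UNMATCHED };

/*
 * A slot whose state is "changed by another process" between reads.
 * The race is simulated deterministically: the n-th read returns
 * script[n], and reads past the end repeat the last value.
 */
struct slot {
    const enum slot_state *script;
    size_t len;
    size_t nreads;
};

enum slot_state read_state(struct slot *ip)
{
    assert(ip != NULL && ip->len > 0);
    size_t i = ip->nreads < ip->len ? ip->nreads : ip->len - 1;
    ip->nreads++;
    return ip->script[i];
}

/* Racy pattern: every conditional re-reads the shared field. */
enum verdict classify_rereading(struct slot *ip)
{
    if (read_state(ip) == SLOT_FREE)
        return V_SKIP;
    if (read_state(ip) == SLOT_ACTIVE || read_state(ip) == SLOT_BLOCKED)
        return V_CHECK_ALIVE;
    return V_UNMATCHED;     /* reachable if the state changed mid-check */
}

/* Snapshot pattern: read once, then decide on that single value. */
enum verdict classify_snapshot(struct slot *ip)
{
    switch (read_state(ip)) {
    case SLOT_FREE:
        return V_SKIP;
    case SLOT_ACTIVE:
    case SLOT_BLOCKED:
        return V_CHECK_ALIVE;
    }
    return V_UNMATCHED;     /* only for values outside the enum */
}
```

With a script of ACTIVE, FREE, FREE (the slot is released right after the first comparison), classify_rereading() matches none of its branches, whereas classify_snapshot() decides on the first value it saw. As noted above, a snapshot of the state alone would not cover the pid and tid fields, which can still change if the slot is reused.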

             

            Message was edited by: 982876: The race affects other fields too

            • 3. Re: Stress-testing CDB as used by RPM causes a crash
              userBDBDMS-Oracle

              Thanks for this information; it is useful to know. I am hoping it will reproduce for us as well.

               

              thanks

              mike

              • 4. Re: Stress-testing CDB as used by RPM causes a crash
                userBDBDMS-Oracle

                Just following up on this item.

                 

                We found an obscure race condition in the code that was the cause of this issue. We have fixed it, and the fix will be included in the next release.

                 

                thanks

                mike