7 Replies. Latest reply: Jan 16, 2012 1:24 PM by JesúsCea

    BDB 5.2.36 can hang in dbenv->close when using replication

    JesúsCea
      I am the developer of pybsddb.

      Since upgrading to Berkeley DB 5.2.36, I am seeing that "DBEnv->close()" can hang if the election/replication procedure is not yet complete. That is, I start an election, but I call "DBEnv->close()" before the election has fully completed.

      The hang is permanent; there is no timeout. I have to kill the process with "kill".

      I see this effect in 5.2.36, but not under the 5.2.28 release, or previous BDB releases.

      The stack is (in reverse order; running Solaris 10 here):

      febebb45 pollsys (8044950, 0, 80449a0, 0)
      feb9530a pselect (0, fec613d0, fec613d0, fec613d0, 80449a0, 0) + 18e
      feb95600 select (0, 0, 0, 0, 80449d8, fe54ee70) + 82
      fe5ff5ca __os_yield (84f52f0, 1, 0, fe54888d, 84f52f0, fec5f000) + 8a
      fe53dc2e __env_rep_enter (84f52f0, 0, fecf546d, 8412ae8, 8060b88, fef2dc00) + ce
      fe5c60f2 __env_close_pp (8581a10, 0, 8044ad8, feeb7ab5, 8044acc, 0) + 232
      [...]

      What I am doing is:

      1. Launch two threads, activate replication in both as CLIENT.
      2. Request an election in one thread.
      3. The other thread receives the HOLD_ELECTION and casts its vote.
      4. One of the threads becomes the MASTER, and reconfigures itself.
      5. Shut down the threads.
      6. Try to close the environments <- HANG

      If I add a small delay between steps 4 and 5, so that all messages finish processing, everything works fine. This is not necessary with 5.2.28 and previous releases.
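
The fixed delay between steps 4 and 5 could, in a test, be replaced by waiting on an explicit signal with a timeout. The following is a minimal sketch using only the Python standard library. The names `election_done` and `wait_for_election` are hypothetical and are not part of pybsddb or Berkeley DB. It assumes the application has some way of its own to learn that the election has finished and to set the event. Whether waiting for that signal is enough to avoid the hang is not established in this thread.

```python
import threading


def wait_for_election(election_done, timeout=10.0):
    """Block until `election_done` is set or `timeout` seconds elapse.

    Returns True if the event was set, False if the wait timed out.
    `election_done` is a hypothetical threading.Event that the
    application sets itself once a master has been elected.
    """
    return election_done.wait(timeout)


# Simulated usage: a timer stands in for the replication thread that
# would signal the end of the election in a real test.
election_done = threading.Event()
signaller = threading.Timer(0.05, election_done.set)
signaller.start()

finished = wait_for_election(election_done, timeout=2.0)
signaller.join()

# Only after `finished` is True would the test go on to stop the
# threads and close the environments.
```

Unlike a fixed sleep, the wait returns as soon as the signal arrives, and a False result gives the test a clear place to report a timeout.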

      Since a remote client can die at any time, I think this is a serious regression.

      Can you confirm? Any suggestions?

      I am rolling back my Berkeley DB installation to 5.2.28 for now.
        • 1. Re: BDB 5.2.36 can hang in dbenv->close when using replication
          JesúsCea
          You can reproduce this issue by downloading pybsddb 5.2.0 (http://pypi.python.org/pypi/bsddb3/5.2.0) and running the test suite. The hang happens in "test03_master_election (bsddb3.tests.test_replication.DBBaseReplication)".

          This is a race condition, so you may need to try a few times (it could depend on OS scheduling decisions, too), but it is quite reproducible under x86 Solaris 10 Update 10.
          • 2. Re: BDB 5.2.36 can hang in dbenv->close when using replication
            JesúsCea
            Correction: I can also reproduce this issue in 5.2.28. I guess I was just lucky.

            After a few hundred test suite cycles over 5.1.25, it seems to work reliably.
            • 3. Re: BDB 5.2.36 can hang in dbenv->close when using replication
              JesúsCea
              5.1.25, 5.0.32 fail too.

              Either I was lucky before, or some change in Solaris 10 Update 10 (I upgraded a couple of days ago) is triggering this error.

              I will try to find an easy test case for you, or find an error in my test suite.
              • 4. Re: BDB 5.2.36 can hang in dbenv->close when using replication
                Paula B-Oracle
                Jesus,

                I am one of the BDB Replication developers. I started looking at your problem when it seemed to involve the change from 5.2.28 to 5.2.36. I just checked the forum again and I see it is not confined to that change in BDB version.

                Please let me know what you find with regard to Solaris changes or an easy test case. I may need some help getting set up to run a test case. You can email me at firstname.lastname@oracle.com.

                Meanwhile, can you post or send me the stacks for your other threads at the time of the hang?

                Thanks,
                Paula Bingham
                Oracle
                • 5. Re: BDB 5.2.36 can hang in dbenv->close when using replication
                  JesúsCea
                  I won't be able to invest more time in this issue until next Friday, at least.

                  At the moment of the hang, no other threads are involved. They have finished the election and no longer exist (remember, this is a test suite, not a real application).

                  I will try to debug this myself. I will keep you informed.

                  Do you prefer that I post here or use your email?

                  I have a fundamental issue here: I want to close the environment and shut down the replication threads. If I kill the threads first and then close the environment, some replication messages are still pending and the test hangs (my current case). If I close the environment first and any replication messages are still pending, the replication threads will try to use an already closed environment. Any ideas?
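
One common way to order such a shutdown is to stop feeding new messages, let the worker threads drain what is already queued, join them, and only then close the environment. The sketch below illustrates that ordering with the Python standard library only. `FakeEnv`, `worker`, and `shutdown` are hypothetical names for illustration; `FakeEnv` is a stand-in, not a pybsddb class. It assumes the application owns the message queue, and it does not address any blocking that happens inside Berkeley DB itself.

```python
import queue
import threading


class FakeEnv:
    """Hypothetical stand-in for a database environment (not pybsddb)."""

    def __init__(self):
        self.closed = False
        self.processed = []
        self._lock = threading.Lock()

    def process_message(self, msg):
        with self._lock:
            if self.closed:
                raise RuntimeError("message delivered to a closed environment")
            self.processed.append(msg)

    def close(self):
        with self._lock:
            self.closed = True


_STOP = object()  # sentinel telling a worker to exit


def worker(env, inbox):
    """Process messages from `inbox` until the sentinel is received."""
    while True:
        msg = inbox.get()
        if msg is _STOP:
            break
        env.process_message(msg)


def shutdown(env, inbox, threads):
    """Drain pending messages, stop the workers, then close the environment."""
    # The queue is FIFO, so each sentinel sits behind every message that
    # was already pending when shutdown began.
    for _ in threads:
        inbox.put(_STOP)
    # join() also waits for any message a worker is still processing.
    for t in threads:
        t.join()
    # No worker can touch the environment after this point.
    env.close()


env = FakeEnv()
inbox = queue.Queue()
threads = [threading.Thread(target=worker, args=(env, inbox)) for _ in range(2)]
for t in threads:
    t.start()
for i in range(5):
    inbox.put(i)

shutdown(env, inbox, threads)
```

The sentinel approach avoids both failure modes described above in this simplified model: no message is left pending when the threads stop, and no thread is alive when the environment is closed.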

                  I am testing only the master election, not full replication (that works fine, btw).
                  • 6. Re: BDB 5.2.36 can hang in dbenv->close when using replication
                    Paula B-Oracle
                    Let's take this to email so that we can record the details internally. I'll send you an email shortly.

                    Paula Bingham
                    Oracle
                    • 7. Re: BDB 5.2.36 can hang in dbenv->close when using replication
                      JesúsCea
                      Short summary:

                      - Berkeley DB can deadlock during client initialization (replication) if the right circumstances occur. This is a known issue.

                      - Solaris 10 Update 10 has changed "something" in the thread scheduling, so now I hit the deadlock easily when running the test suite. That is, the problem was ALWAYS there, but after upgrading to Solaris 10 Update 10 (from Update 9), the problem is VERY visible.

                      - I have added a workaround to the test suite, as suggested by Paula Bingham in a private email dated 20110929.

                      The problem is a known issue in Berkeley DB and it is still there, but the Python bindings test suite now passes.