You can reproduce this issue by downloading pybsddb 5.2.0 (http://pypi.python.org/pypi/bsddb3/5.2.0) and running the testsuite. The hang happens in "test03_master_election (bsddb3.tests.test_replication.DBBaseReplication)".
This is a race condition, so you may need to try a few times (it could depend on OS scheduling decisions, too), but it is fairly reproducible under x86 Solaris 10 Update 10.
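Since the hang is timing dependent, one way to catch it is to run the test repeatedly under a timeout. This is only a minimal sketch: the `run_until_hang` helper is a made-up name and not part of pybsddb, and the command shown in the comment is an assumption about how the test could be launched.

```python
import subprocess
import sys


def run_until_hang(cmd, attempts=20, timeout=120):
    """Run cmd up to `attempts` times.

    Return the 1-based attempt that exceeded `timeout` seconds
    (a suspected hang), or None if every run finished in time.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses.
            subprocess.run(cmd, timeout=timeout,
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            return attempt
    return None


# Hypothetical invocation (adjust to however you run the testsuite):
# run_until_hang([sys.executable, "-m", "unittest",
#     "bsddb3.tests.test_replication.DBBaseReplication.test03_master_election"])
```

A timeout only flags a suspected hang; a slow machine can trip it too, so pick the limit generously.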
I am one of the BDB Replication developers. I started looking at your problem when it seemed to involve the change from 5.2.28 to 5.2.36. I just checked the forum again and I see it is not confined to that change in BDB version.
Please let me know what you find with regard to Solaris changes or an easy test case. I may need some help getting set up to run a test case. You can email me at email@example.com.
Meanwhile, can you post or send me the stacks for your other threads at the time of the hang?
I won't be able to invest more time in this issue until next Friday, at least.
At the moment of the hang, there are no other threads involved. They have finished the election and no longer exist (remember, this is a testsuite, not a real application).
I will try to debug this myself. I will keep you informed.
Do you prefer that I post here or email you?
I have a fundamental issue here: I want to close the environment and shut down the replication threads. If I kill the threads first and then close the environment, some replication messages are still pending and the test hangs (my current case). If I close the environment first and there is any pending replication message, the replication threads will try to use an already closed environment. Any ideas?
I am testing only the master election, not full replication (that works fine, btw).
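To make the ordering problem concrete, here is a minimal sketch of a "drain, then join, then close" shutdown using only the Python standard library. Everything in it is an assumption for illustration: `FakeEnv`, `worker` and `shutdown` are hypothetical names, `FakeEnv` is a stand-in and not the bsddb3 API, and a plain in-process queue replaces the real replication transport. It is not the workaround used in the testsuite.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker to exit


class FakeEnv:
    """Stand-in for a database environment (hypothetical, not bsddb3)."""

    def __init__(self):
        self.closed = False
        self.processed = []

    def process_message(self, msg):
        if self.closed:
            raise RuntimeError("message delivered to a closed environment")
        self.processed.append(msg)

    def close(self):
        self.closed = True


def worker(env, inbox):
    # Handle messages until the sentinel arrives. The queue is FIFO, so
    # everything queued before the sentinel is processed first.
    while True:
        msg = inbox.get()
        if msg is _STOP:
            break
        env.process_message(msg)


def shutdown(env, inbox, thread):
    inbox.put(_STOP)  # 1. nothing is handled after what is already queued
    thread.join()     # 2. wait for the worker to drain the queue and exit
    env.close()       # 3. only now close the environment


env = FakeEnv()
inbox = queue.Queue()
t = threading.Thread(target=worker, args=(env, inbox))
t.start()
for i in range(5):
    inbox.put(i)
shutdown(env, inbox, t)
```

The point of the sentinel is that the thread is never killed with messages still pending, and the environment is never closed while the thread can still touch it. Whether the same ordering is enough with real replication messages depends on Berkeley DB internals that this sketch does not model.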
- Berkeley DB can deadlock in the client initialization (replication) under the right circumstances. This is a known issue.
- Solaris 10 Update 10 has changed "something" in the thread scheduling, so now I am hitting the deadlock easily when running the testsuite. That is, the problem was ALWAYS there, but after upgrading to Solaris 10 Update 10 (from Update 9), the problem is VERY visible.
- I have added a workaround in the testsuite, as suggested by Paula Bingham in a private email dated 20110929.
The problem is a known issue in Berkeley DB and it is still there, but the Python bindings testsuite now passes.