A couple more observations: this does not seem to be caused by anything specific to RHEL. It happens on openSUSE as well.
Looks like there are two ways it can fail:
BDB0087 DB_RUNRECOVERY: Fatal error, run database recoveryReader died
and BDB0113 Thread/process 4388/4388 failed: BDB1507 Thread died in Berkeley DB library
The second one (BDB0113) is easier to hit when running under strace(1), possibly due to the slowdown. There's a quite obvious race in src/env/env_failchk.c:__env_in_api():
When the check runs while another process is starting and being added to the table, that process's ip->dbth_state changes while the body of the SH_TAILQ_FOREACH(ip...) loop is executing. Each if() conditional can therefore see a different value, with a chance that none of them will match.
I'm not sure how to fix that, though. A big switch() instead of the conditionals would cause the state to be evaluated only once, but the same issue affects other fields (tid, pid) as well: the other process can reuse the slot, changing those fields to its own identification mid-check.
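To illustrate the evaluate-once idea, here is a minimal sketch in C. It is not the Berkeley DB code: the struct, the state names and the classify() function are all made up for illustration, and it only covers the state field, not the tid/pid reuse problem.

```c
#include <stdatomic.h>

/* Hypothetical slot states; NOT the real Berkeley DB definitions. */
enum slot_state { SLOT_UNUSED, SLOT_ACTIVE, SLOT_BLOCKED, SLOT_OUT };

/* Hypothetical result of inspecting one slot. */
enum verdict { V_SKIP, V_IN_API, V_NOT_IN_API, V_UNKNOWN };

/* Hypothetical thread-table entry shared between processes. */
struct slot {
    _Atomic int state;
};

/*
 * Read the state exactly once into a local, then branch on the local.
 * A concurrent writer can still change s->state afterwards, but every
 * case below sees the same snapshot, so exactly one branch is taken.
 */
static enum verdict classify(struct slot *s)
{
    int st = atomic_load(&s->state);    /* the single read */

    switch (st) {
    case SLOT_UNUSED:
        return V_SKIP;
    case SLOT_ACTIVE:
    case SLOT_BLOCKED:
        return V_IN_API;
    case SLOT_OUT:
        return V_NOT_IN_API;
    default:
        return V_UNKNOWN;               /* unexpected value, handled explicitly */
    }
}
```

This removes the "no conditional matched" window for the state alone. The snapshot can still be stale by the time it is acted on, and a consistent view of state, pid and tid together would need something more, such as a lock or a per-slot generation counter.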
Message was edited by: 982876: The race affects other fields too
Thanks for this information, it is useful to know. I am hoping it will reproduce for us as well.
Just doing a follow up on this item.
We found an obscure race condition in the code that was the cause of this issue. We have it fixed and the fix will be included in the next release.