I have to debug a BDB system that hangs when the code calls c_get on a cursor for a table:
hBdb->bdberr = hBdb->dbcp->c_get(hBdb->dbcp, &key, &data, flag);
The trace in gdb is
#0 0x00007fe28ea35d29 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007fe28ec723b1 in __db_pthread_mutex_lock (dbenv=0xb0b0e0, mutex=<value optimized out>) at ../dist/../mutex/mut_pthread.c:218
#2 0x00007fe28ec71fb7 in __db_tas_mutex_lock (dbenv=0xb0b0e0, mutex=47) at ../dist/../mutex/mut_tas.c:183
#3 0x00007fe28ed44922 in __memp_fget (dbmfp=0xb0c420, pgnoaddr=0x7fffae5ccbdc, txn=0x0, flags=0, addrp=<value optimized out>) at ../dist/../mp/mp_fget.c:233
#4 0x00007fe28ec85811 in __bam_search (dbc=0xb0c9c0, root_pgno=1, key=0x7fffae5cd580, flags=1409, slevel=1, recnop=0x0, exactp=0x7fffae5ccce4)
#5 0x00007fe28ec76746 in __bamc_search (dbc=0xb0c9c0, root_pgno=0, key=0x7fffae5cd580, flags=26, exactp=0x7fffae5ccce4) at ../dist/../btree/bt_cursor.c:2486
#6 0x00007fe28ec7792a in __bamc_get (dbc=0xb0c9c0, key=<value optimized out>, data=0x7fffae5cd550, flags=26, pgnop=0x7fffae5ccd84)
#7 0x00007fe28ed0175a in __dbc_get (dbc_arg=0xb0c660, key=0x7fffae5cd580, data=0x7fffae5cd550, flags=26) at ../dist/../db/db_cam.c:697
#8 0x00007fe28ed0ab9a in __dbc_get_pp (dbc=0xb0c660, key=0x7fffae5cd580, data=0x7fffae5cd550, flags=26) at ../dist/../db/db_iface.c:2021
#9 0x00007fe28e2c19cc in gd_DbLocate (h=0xb09f70, ptrart=0x7fffae5cd6e0 ",\001", flag=26) at ../src/DbBDB.c:456
I'm trying to find out why the c_get call (issued from frame 9) blocks, so that I can correct the situation. Any hints?
Do you have a reproducible test case? What does your runtime environment look like? What isolation level are you using? From the call stack, the page holding the record you want is locked by some other process or thread. It is normal for database systems to take out locks, as this is how they enforce the ACID properties. Most likely the way to address this is to look at the application design and make some adjustments. If two processes go after the same page at the same time, only one of them gets it and the other has to wait.
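To make the waiting behavior concrete, here is a minimal sketch in plain C with pthreads. It is not BDB code: page_t, contender_t, and contend_for_page are hypothetical names invented for illustration, and an ordinary mutex stands in for whatever protects the page. The second thread uses pthread_mutex_trylock so that the demo reports contention instead of hanging the way the trace above does.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a cached page; not a BDB structure. */
typedef struct {
    pthread_mutex_t latch;
} page_t;

typedef struct {
    page_t *page;
    int got_latch;              /* 1 if the contender acquired the latch */
} contender_t;

/* Second thread: try to take the latch without blocking. */
static void *contender(void *arg)
{
    contender_t *c = arg;
    int rc = pthread_mutex_trylock(&c->page->latch);

    assert(rc == 0 || rc == EBUSY);
    c->got_latch = (rc == 0);
    if (rc == 0)
        pthread_mutex_unlock(&c->page->latch);
    return NULL;
}

/*
 * Returns 1 if the second thread got the latch, 0 if it found the
 * latch busy, and -1 on setup failure. With holder_keeps_latch set,
 * the first thread still owns the latch when the second one arrives.
 */
int contend_for_page(int holder_keeps_latch)
{
    page_t page;
    contender_t c = { &page, 0 };
    pthread_t tid;
    int result;

    if (pthread_mutex_init(&page.latch, NULL) != 0)
        return -1;

    pthread_mutex_lock(&page.latch);
    if (!holder_keeps_latch)
        pthread_mutex_unlock(&page.latch);

    if (pthread_create(&tid, NULL, contender, &c) == 0) {
        pthread_join(tid, NULL);
        result = c.got_latch;
    } else {
        result = -1;
    }

    if (holder_keeps_latch)
        pthread_mutex_unlock(&page.latch);
    pthread_mutex_destroy(&page.latch);
    return result;
}
```

Calling contend_for_page(1) models the situation in the trace: the latch is still owned, so the second thread cannot get it. A blocking pthread_mutex_lock in its place would simply wait until the owner lets go, which never happens if the owner is gone.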
BDB is an embedded library. If a process dies while inside the library, it may have held locks that now have no owner. From a separate process you can run failchk, a utility that looks for situations like this and *tries* to clean them up. If failchk detects that the process died while updating the database, it cannot clean up and reports a DB_RUNRECOVERY error. The reason: a process that was in the middle of writes could have been in the middle of a txn, leaving a partial txn behind that could corrupt the db. Running recovery fixes that problem.
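Finding locks whose owner has died comes down to asking whether a given process still exists. Below is a minimal sketch of such a liveness check in plain C, assuming a POSIX system. The names process_is_alive and run_and_reap_child are hypothetical helpers made up for this example, not part of BDB.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a process with this pid exists, 0 otherwise.
 * Signal 0 delivers nothing; kill() only performs its checks.
 * EPERM means the process exists but we may not signal it.
 */
int process_is_alive(pid_t pid)
{
    assert(pid > 0);    /* 0 and negative values address process groups */

    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;
}

/*
 * Demo helper: start a child that exits immediately, reap it, and
 * return its pid (or -1 on failure). Once reaped, the pid no longer
 * names a process unless the system happens to reuse it.
 */
pid_t run_and_reap_child(void)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0)
        _exit(0);       /* child: exit without running parent cleanup */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return child;
}
```

With BDB itself, a check like this would typically sit behind the callback registered through DB_ENV->set_isalive before DB_ENV->failchk is called. Check the documentation for your BDB version for the exact signatures. Note that a pid can be reused by an unrelated process, so this check alone is only an approximation.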