Yes, that should protect you. In particular as long as you set the priority of the
site early on and don't change it. Elections are always based on a site having
the latest LSN/log records. However, having sites that have 0 priority muddies the
water and those sites could still get HANDLE_DEAD. If all sites have non-zero
priority and the same ack policy then I think what you described will work.
Do you intend on having all sites electable?
Yes, I intend on having all sites be electable. As far as I am concerned, all sites have equal priority, so if I set them all to the same priority (say 1), never change the priority, and all replicas have a commit policy of wait-for-ack-from-quorum-of-electable-peers, would that be enough to avoid the HANDLE_DEAD case? Also, what error would I see if I tried to do a write on a replica that is no longer the master?
I think it is important to separate the txn commit guarantees from the
HANDLE_DEAD error return. What you are describing mitigates the
chance of getting that error, but you can never eliminate it 100%.
Your app description for your group (all electable, quorum ACKs)
uses the best scenario for providing the guarantees for txn commit.
Of course the caveats still remain: you run a risk if you use TXN_NOSYNC,
and if you have total group failure, things in memory are lost.
Also, it is important to separate making a txn guarantee at the master site
with getting the HANDLE_DEAD return value at a client site. The
client can get that error even with all these safeguards in place.
But, let's assume you have a running group, as you described, and
you have only the occasional failure of a single site. I will describe
at least 2 ways a client can get HANDLE_DEAD while your txn integrity
is still maintained.
Both examples assume a group of 5 sites, call them A, B, C, D, E
and site A is the master, with all sites electable and a quorum ack policy.
In the first example, site E is slower and more remote than the other 4
sites. So, when A commits a txn, sites B, C, and D quickly apply that
txn and send an ack. They meet the quorum policy and processing
on A continues. Meanwhile, E is slow and slowly gets further and
further behind the rest of the group. At some point, the master runs
log_archive and removes most of its log files because it has sufficient
checkpoint history. Then, site E requests a log record from the master
that is now archived. The master sends a message to E saying it has
to perform an internal initialization because it is impossible to
provide that old log record. Site E performs this initialization (under the
covers and not directly involving the application) but any
DB handles that were open prior to the initialization will now get
HANDLE_DEAD because the state of the world has changed and
they need to be closed and reopened.
Technically, no txns were lost, the group has still maintained its
txn integrity because all the other sites have all the txns. But E cannot
know what may or may not exist as a result of this initialization so
it must return HANDLE_DEAD.
In the second example, consider that a network partition has happened
that leaves A and B running on one side, and C, D, and E on the other.
A commits a txn. B receives the txn and applies it, and sends an ack.
Site A never hears from C, D, or E, so quorum is not met and PERM_FAILED
is returned. In the meantime, C, D, and E notice that they can no longer
communicate with the master and hold an election. Since they have a
majority of the sites, they elect one, say C to be a new master. Now,
since A received PERM_FAILED, it stops. If the network partition
is resolved, B will find the new master C. However, B still has the
txn that was not sufficiently ack'ed. So, when B sync's up with C, it
will unroll that txn. And then HANDLE_DEAD will be returned on B.
In this case, the unrolled txn was never confirmed as durable by A to
any application, but B can get the HANDLE_DEAD return. Again, B
should close and reopen the database.
I think what you are describing provides the best guarantees,
but I don't think you can eliminate the possibility of getting that error
return on a client. But you can know about your txn durability on the
master.
You might also consider master leases. You can find a description of
them in the Reference Guide. Leases provide additional guarantees
that reads at the master return current data.
Thanks for the details.
My application does not use TXN_NOSYNC. I also intend to set up my application so that the master process exits if it does not get acks from the quorum within the timeout period (by registering for DB_EVENT_REP_PERM_FAILED events). The replicas exist purely for fault tolerance, so all reads and writes will be served by the master. Failures are not expected very often, so under these circumstances I'm guessing it would be very rare to see the HANDLE_DEAD case, and it could be further avoided by simply reopening the handles on all sites after a new election. As I understand it, I can't reopen the handles in the callback function directly and need to do this in a separate thread. Is that right?
Yes, it is true that you must refrain from invoking any Berkeley DB
functions from within a callback function.
Even in the scenario that you outline, certainly Sue's first example
still applies. So it would still be possible to see HANDLE_DEAD.
The recommended behavior is for applications to check for the
HANDLE_DEAD return from each operation. When you get the HANDLE_DEAD
error return, all you have to do is close the DB handle, and then
re-open it and you should be able to retry the operation.
In my application I have multiple threads sharing the same DB handle.
Wouldn't doing a close and then an open on the DB handle introduce a race condition? I'd rather not have to use my own mutex to protect all DB handle access to avoid such a case. Assuming that close and open are individually atomic, what error would I see if another thread tried to access the DB using a closed handle? I suppose I could treat both these errors the same way I currently treat DB_DEADLOCK, which is to wait a random amount of time and retry the operation.
By the way, is there a way to "atomically" reopen (do close+open in a single atomic step) the handles?
No, there is no "reopen" API.
You are correct that closing and opening a DB handle introduces the
race condition. I suspect most multi-threaded apps have each thread open
up its own DB handle, while sharing the DB_ENV handle, for the reason
you stated. If you have a pool of worker threads then you would only need
to open when the thread starts and close when it exits or when it would
get HANDLE_DEAD, which should be rare, but possible. That way each
thread could just operate on its own DB handle without conflict.
Otherwise, if you ultimately choose to share a handle, you also have the
responsibility to protect its access and modification as needed. A mutex
would work or a read/write locking mechanism, where any change in the
dbp itself (close/open) would need the write lock and the vast majority of
operations (get, put, del, etc) would just need the read lock.
If you access the DB handle after a close the results are undefined. You
are accessing freed memory. It could potentially be reallocated elsewhere,
or reinitialized. BDB itself, on close, ultimately calls free() and puts the
handle back on the memory heap and any call to malloc() within your
process could reallocate the memory that was the handle.
Thanks for the suggestions, Sue. I use clients only for replication, not for any application read/writes.
In addition, for now it is sufficient that no data is lost, and it's acceptable if the application shuts down and all replication sites are manually restarted (after copying an archive from a replica to a fresh machine if required). Given this, I think I can ignore the HANDLE_DEAD case for now.
Eventually, I will follow your suggestion and rewrite my code so that threads don't share the DB handle and instead only share the DBEnv. Then I can deal with the HANDLE_DEAD case as it arises on a site which switches from client to master (clients will continue to be used only for replication/durability).