This discussion is archived
7 Replies. Latest reply: Oct 11, 2011 4:25 AM by 857786

Understanding deadlock

857786 Newbie
Hello,
I would like to ask a few questions about deadlock detection within Berkeley DB.
We are using BDB 5.1.19. We use the replication manager to provide highly available applications.

If we are not writing from multiple threads, is it ever possible to get a deadlock?
I guess the answer to the above is yes, but only at election time, since BDB might lock the DB during the election while our single thread is writing. Is that understanding correct, or could there be a deadlock even with a single thread writing?

The deadlocks I am talking about here refer to a DB_LOCK_DEADLOCK return value when trying to perform a db->put as follows:
dbp->put(dbp, txn, &k, &d, 0);

Our application currently handles the following two return codes in the same way:
DB_LOCK_DEADLOCK:
EACCES:

We just abort the transaction. Is that the correct behavior?

I would like to better understand the locking. When a db->put is performed, is the lock at the database level (like a single table) or at the environment level (all the tables)? I am asking because, in the scenario above, even if we abort a transaction when we receive a deadlock, there could still be another put before the abort, but on a different database (a different table). Could that be a problem? It is guaranteed, though, that there will be no put on the same database once a deadlock is received. Should EACCES be handled differently with respect to the above?


I am raising the above because we are seeing a cluster (2 boxes) get stuck after we split-brain the cluster and then reconnect it. After the reconnection, the cluster stays stuck forever. Could the above theory explain that?

Thx in advance.
  • 1. Re: Understanding deadlock
    857786 Newbie
    Any answers to the above questions, please?

    Edited by: 854783 on Sep 19, 2011 6:01 AM
  • 2. Re: Understanding deadlock
    524722 Explorer
    If you are using HA then you can definitely get DB_LOCK_DEADLOCK errors.
    Certain situations in HA cause BDB to need to lock out the application from
    API calls. If a user tries to make an API call during that lock out period, the
    application will get DB_LOCK_DEADLOCK returned.

    It is an indication to abort the txn and possibly retry. It seems like EACCES
    would be dependent on your app. Certainly, from BDB's perspective, the
    application has the choice to abort any txn that it chooses, so if your app
    chooses to abort for whatever reason, that is fine. If you were to retry (which you do
    not discuss, so I assume you do not), it isn't clear
    to me what would change on an EACCES error, since that is usually dealing
    with file and directory permissions. So unless there is something else going
    on that will change those permissions in the meantime, a retry would seem
    likely to get the same error.
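A minimal sketch of the abort-and-retry policy described above, in C. Everything here is hypothetical: `SIM_LOCK_DEADLOCK` stands in for the real `DB_LOCK_DEADLOCK` from `<db.h>` (its value is arbitrary), and the `work`/`abort_txn` callbacks stand in for a transaction that calls `dbp->put()` and commits, and for aborting that transaction. The policy: abort on every failure, retry only on the deadlock code, and return `EACCES` (or any other error) immediately, since a retry would most likely hit the same permission problem.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for DB_LOCK_DEADLOCK from <db.h>; value is arbitrary. */
#define SIM_LOCK_DEADLOCK (-1001)

/* One unit of transactional work (begin txn, put, commit); 0 on success. */
typedef int (*txn_work_fn)(void *ctx);
/* Hypothetical abort hook; a real program would abort its DB_TXN here. */
typedef void (*txn_abort_fn)(void *ctx);

/* Abort after every failure; retry only on the deadlock code.
 * EACCES and all other errors are returned without retrying. */
int run_with_retry(txn_work_fn work, txn_abort_fn abort_txn, void *ctx,
                   int max_attempts, int *attempts_out)
{
    assert(work != NULL && abort_txn != NULL && max_attempts > 0);
    int ret = 0;
    int attempt;
    for (attempt = 1; attempt <= max_attempts; attempt++) {
        ret = work(ctx);
        if (ret == 0)
            break;                  /* committed */
        abort_txn(ctx);             /* a failed txn is always aborted */
        if (ret != SIM_LOCK_DEADLOCK)
            break;                  /* not retryable */
    }
    if (attempts_out != NULL)
        *attempts_out = attempt > max_attempts ? max_attempts : attempt;
    return ret;
}

/* Simulated work for demonstration only: fails `fail_times` times with
 * `fail_code`, then succeeds; counts how often it was aborted. */
struct sim_ctx { int fail_times; int fail_code; int aborts; };

static int sim_work(void *p)
{
    struct sim_ctx *c = p;
    if (c->fail_times > 0) {
        c->fail_times--;
        return c->fail_code;
    }
    return 0;
}

static void sim_abort(void *p) { ((struct sim_ctx *)p)->aborts++; }
```

The retry count is bounded on purpose: if the deadlock return is really an HA lockout that lasts a while, an unbounded loop would spin instead of surfacing the problem.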

    For most access methods, locking on updates is done at a page-level. The
    queue access method does record-level locking.
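To make the granularity difference concrete, here is a toy model, not BDB's internal lock structures: a write lock names an object, and two write locks conflict only if they name the same object. The types `record_loc` and `lock_obj` and the functions below are hypothetical. Under page-level locking, two updates to different records on the same page collide; under record-level locking (as with the queue access method) they do not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: which object a writer must lock. */
enum granularity { PAGE_LEVEL, RECORD_LEVEL };

struct record_loc {
    unsigned file_id;   /* which database file */
    unsigned pgno;      /* page the record lives on */
    unsigned recno;     /* record number within the database */
};

struct lock_obj {
    unsigned file_id;
    unsigned id;        /* page number or record number */
};

/* Derive the lock object for a write at the given granularity. */
static struct lock_obj lock_object_for(struct record_loc r, enum granularity g)
{
    assert(g == PAGE_LEVEL || g == RECORD_LEVEL);
    struct lock_obj o = { r.file_id, g == PAGE_LEVEL ? r.pgno : r.recno };
    return o;
}

/* Two write locks conflict exactly when they name the same object. */
static bool write_locks_conflict(struct record_loc a, struct record_loc b,
                                 enum granularity g)
{
    struct lock_obj x = lock_object_for(a, g);
    struct lock_obj y = lock_object_for(b, g);
    return x.file_id == y.file_id && x.id == y.id;
}
```

For example, records 100 and 101 stored on page 7 of the same file conflict at page level but not at record level.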

    Sue LoVerso
    Oracle
  • 3. Re: Understanding deadlock
    857786 Newbie
    Thx, Sue, for the answer; that is helpful.
    I would like to narrow down my understanding regarding what you said.

    You say:
    "Certain situations in HA cause BDB to need to lock out the application from
    API calls. If a user tries to make an API call during that lock out period, the
    application will get DB_LOCK_DEADLOCK returned."

    Can you please expand on what are those situations? Would an election cause that? Are there other situations too?


    You say:
    "For most access methods, locking on updates is done at a page-level. The
    queue access method does record-level locking."

    Excuse my ignorance, but I might not be familiar with some of the concepts you are using. If there is a doc I should have read that covers the following questions, please let me know.
    When you say page-level, what is the relation between a page and a DB table (a database in BDB terminology)?
    Would a page contain one table, multiple tables, or even part of a table? Thx

    What do you mean by "the queue access method does record-level locking"?
    What is the queue access method?
    I also didn't understand from your response when exactly there is page-level locking and when there is only record-level locking.

    Thx in advance for your patience.
  • 4. Re: Understanding deadlock
    524722 Explorer
    I suggest you read chapters 2, 3, and 4 of the BDB Reference Guide. The access method is the type of database (table) you choose to have (btree, hash, queue, etc.), and different access methods structure the database in different ways. Your database (table) is made up of pages. You can choose the page size, and there are different reasons for choosing different page sizes. I think you'll find a lot of information in the Reference Guide on this topic.

    The HA subsystem needs to lockout the API during certain situations where the system may be inconsistent or needs to be essentially single-threaded. An election is not one of those situations. However, a site that is elected and upgrades to master is. Anytime a site changes roles is a situation where a lockout of the API is necessary. Anytime a replica site is synchronizing with a master (by running a recovery, which must be single-threaded in BDB) or a replica is outdated and performs an internal initialization, a lockout is necessary. Those are some examples, but not an exhaustive list.

    An internal initialization means a replica reinitializes all databases and logs from the master. While we are transferring database pages, and have partial databases that may be inconsistent, we cannot allow the app to attempt to read or traverse the database, for example, because some pages may not exist yet and a crash would occur if a traversal were attempted.

    Sue LoVerso
    Oracle
  • 5. Re: Understanding deadlock
    857786 Newbie
    Thx Sue.
    I'll make sure to read the relevant documentation.
    My first understanding, though, is that a page will only contain data relevant to a single database table, so if I have 2 database tables they will never land on the same page. Correct?
    If so, are the following statements correct (putting aside HA for this question)?
    - Since locking is at the page level and 2 databases are on different pages, a deadlock can never happen due to simultaneous writes to different tables; it can happen only due to simultaneous writes to the same table. Is that a valid statement?
    - Following the same reasoning in more detail, calling db->put() or txn_begin() on different tables should cause no problem. Is that correct?


    Now, when it comes to HA, I would like to validate my understanding. Basically, when a site changes role (either from master to client or vice versa), there is a lockout, which means that a db->put or txn_begin on any database table will block in that case, or return rep_lockout if we are using conf_nowait. Is that correct?
  • 6. Re: Understanding deadlock
    524722 Explorer
    854783 wrote:
    My first understanding, though, is that a page will only contain data relevant to a single database table, so if I have 2 database tables they will never land on the same page. Correct?
    That is correct when your two databases reside in separate physical files.
    If they share the same physical file, there is a meta-page where they could collide.
    If so, are the following statements correct (putting aside HA for this question)?
    - Since locking is at the page level and 2 databases are on different pages, a deadlock can never happen due to simultaneous writes to different tables; it can happen only due to simultaneous writes to the same table. Is that a valid statement?
    At this point, it is important to distinguish the concept of a deadlock (T1 holds resource A and needs resource B; T2 holds resource B and needs resource A) from the DB_LOCK_DEADLOCK return value. There is a large degree of overlap, but the return value encompasses other things (including, but not limited to, HA).
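The classic deadlock described above can be detected with a waits-for graph: draw an edge from T1 to T2 when T1 waits for a lock that T2 holds, and a deadlock exists exactly when the graph contains a cycle. The sketch below is a generic depth-first cycle check over a small adjacency matrix, offered for illustration only; it is not BDB's own detector, and `MAX_TXNS` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TXNS 8  /* hypothetical bound for this sketch */

/* g[i][j] is true when transaction i waits for a lock held by transaction j.
 * state: 0 = unvisited, 1 = on the DFS stack, 2 = fully explored. */
static bool dfs_has_cycle(bool g[][MAX_TXNS], int n, int v, int state[])
{
    state[v] = 1;
    for (int w = 0; w < n; w++) {
        if (!g[v][w])
            continue;
        if (state[w] == 1)
            return true;    /* back edge: v waits, transitively, on itself */
        if (state[w] == 0 && dfs_has_cycle(g, n, w, state))
            return true;
    }
    state[v] = 2;
    return false;
}

/* The first n transactions are deadlocked iff their waits-for graph has a cycle. */
static bool has_deadlock(bool g[][MAX_TXNS], int n)
{
    assert(n >= 0 && n <= MAX_TXNS);
    int state[MAX_TXNS] = {0};
    for (int v = 0; v < n; v++)
        if (state[v] == 0 && dfs_has_cycle(g, n, v, state))
            return true;
    return false;
}
```

With only "T1 waits for T2" there is no cycle; adding "T2 waits for T1" closes the loop, which is the T1/T2 situation described in the reply.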

    Assuming separate physical files, then simultaneous, but separate writes should not deadlock on page resources.
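A toy illustration of the separate-files point, assuming lock objects are identified only by (file, page number). The names, including `META_PGNO` for the file's meta-page, are hypothetical, and whether a given write touches the meta-page is supplied as an input rather than modeled. Writes in different files can never name the same page, whereas two databases sharing one file can meet on that file's meta-page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define META_PGNO 0u  /* hypothetical page number of the file's meta-page */

struct page_lock { unsigned file_id; unsigned pgno; };

/* Toy description of one write: its file, the data page it modifies,
 * and whether it also needs the file's meta-page. */
struct write_op { unsigned file_id; unsigned data_pgno; bool touches_meta; };

/* Fill `out` (capacity 2) with the page locks the write needs; return count. */
static size_t locks_for(struct write_op w, struct page_lock out[2])
{
    assert(w.data_pgno != META_PGNO);
    size_t n = 0;
    out[n++] = (struct page_lock){ w.file_id, w.data_pgno };
    if (w.touches_meta)
        out[n++] = (struct page_lock){ w.file_id, META_PGNO };
    return n;
}

/* True if two writes need at least one common page lock. */
static bool may_contend(struct write_op a, struct write_op b)
{
    struct page_lock la[2], lb[2];
    size_t na = locks_for(a, la), nb = locks_for(b, lb);
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (la[i].file_id == lb[j].file_id && la[i].pgno == lb[j].pgno)
                return true;
    return false;
}
```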
    - Following the same reasoning in more detail, calling db->put() or txn_begin() on different tables should cause no problem. Is that correct?
    The phrase "cause no problem" is not correct. Beginning a txn is not affiliated with any database table. It is its own handle and then database operations occur within that txn. Users may choose to do a single operation in a txn, multiple operations to a single database, or multiple operations to many databases, all within a txn. Trying to do a db->put to different database tables should not cause an actual page resource deadlock, but could result in an error return indicating a variety of problems.

    Now, when it comes to HA, I would like to validate my understanding. Basically, when a site changes role (either from master to client or vice versa), there is a lockout, which means that a db->put or txn_begin on any database table will block in that case, or return rep_lockout if we are using conf_nowait. Is that correct?
    Actually, db->put (and similar operations) is handled differently from txn/cursor resources. A db->put operation is a short-term "in the library" operation, while a txn/cursor is a long-lived handle; the two are quite distinct. The db->put operation will never wait for the lockout to be released. It will always return right away (with either an error value or success). The txn/cursor usage will block waiting for BDB to release the lockout, unless you set the NOWAIT configuration flag.
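The difference between the two kinds of calls can be modeled with a toy simulation. This is not BDB code: `sim_put`, `sim_txn_begin`, `SIM_LOCK_DEADLOCK` and `SIM_REP_LOCKOUT` are hypothetical, their values are arbitrary, and the lockout is simulated by a tick counter rather than by another thread. The put-like call fails immediately during a lockout (modeled with the deadlock code, as in the earlier reply), while the txn-begin-like call waits the lockout out unless a no-wait flag is set, in which case it returns a lockout error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real return codes live in <db.h>. */
#define SIM_LOCK_DEADLOCK (-1001)
#define SIM_REP_LOCKOUT   (-1002)

/* Toy replication state: the API is locked out while lockout_ticks > 0.
 * Each simulated wait step lets the lockout progress by one tick. */
struct sim_env { int lockout_ticks; bool nowait; int waits; };

static void sim_wait_one_tick(struct sim_env *env)
{
    env->waits++;
    if (env->lockout_ticks > 0)
        env->lockout_ticks--;
}

/* Short "in the library" call: never waits for the lockout to end. */
static int sim_put(struct sim_env *env)
{
    assert(env != NULL);
    return env->lockout_ticks > 0 ? SIM_LOCK_DEADLOCK : 0;
}

/* Long-lived handle creation: waits out the lockout unless nowait is set. */
static int sim_txn_begin(struct sim_env *env)
{
    assert(env != NULL);
    while (env->lockout_ticks > 0) {
        if (env->nowait)
            return SIM_REP_LOCKOUT;
        sim_wait_one_tick(env);
    }
    return 0;
}
```

In the simulation, a put during a 3-tick lockout fails at once, a blocking txn begin succeeds after 3 waits, and a no-wait txn begin returns the lockout code without waiting.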


    Sue LoVerso
    Oracle
  • 7. Re: Understanding deadlock
    857786 Newbie
    Thx a lot. That's helpful.
