26 Replies. Latest reply: May 2, 2007 8:49 AM by 524761

Regenerate a replica

566200 Newbie
I (currently) have two machines: one master (the intention is to have it as a 'shadow' master, forced to always be master) and one slave (there will naturally be more in the end, but...).

I start the client with an empty database, and it is synchronized with the master (very quickly, I must add; I was surprised!), and updates on the master are propagated to the client. But if I then stop both the master and the slave and remove the slave database again (to force a new sync), the sync stops after a couple of seconds, and in the master's error log (nothing in the client's) I see:

unable to join the environment

And no matter how many times I retry this, the client won't retrieve the whole database. The only way to 'restore replication' is to start completely fresh on the master...
  • 1. Re: Regenerate a replica
    566200 Newbie
    Hmm... Actually, when I start updating the master, MOST of the database will 'migrate' (?) to the slave. But not all, and one table isn't migrated at all...
  • 2. Re: Regenerate a replica
    566200 Newbie
    This is SO weird! A third update (I have a script that stress tests the whole application/BDB and forces a lot of adds) will put the two databases in sync...

    But is there a way that I can force this in the application? Something like a 'go fetch the whole database, and don't set OK until you have the whole thing' kind of thing?
  • 3. Re: Regenerate a replica
    566200 Newbie
    Adding support for all the event parameters in my event callback shows that when the client starts, the master gets MASTER, and the client gets CLIENT and NEWMASTER successfully. But the client does not receive the STARTUPDONE message. And neither does the master...

    And if I md5sum the files on both the master and the client, they differ but have the exact same size... I guess I have to write a program or something that goes through the whole database and looks for differences. But any idea why I have to do that many updates (80000 * 3 runs) to get the databases to look (on the surface) like they are in sync? And why don't I get the STARTUPDONE?
  • 4. Re: Regenerate a replica
    566200 Newbie
    In another client application (which uses the exact same BDB library API we wrote for the app above, but opens the database read-only), the synchronization starts and runs for about a second, then it stops and never continues... Repeated starts/stops will (would; I don't have that patience :) probably synchronize all the tables eventually.

    The stop occurs even if I put a sleep(10) right after DB_ENV->repmgr_start() which is weird... I don't understand why it just ... stops!
  • 5. Re: Regenerate a replica
    524761 Journeyer
    Does the application first open the environment with recovery (DB_RECOVER or DB_RECOVER_FATAL), and in a single thread?

    My colleague provides the following information about the "unable to join the environment" error:

    "It usually means that the environment is being recovered. Often it
    happens if recovery previously failed (so it never finishes). We
    loop because it could just be a race between two processes trying to
    create the environment at the same time. It could mean that the code
    was compiled differently than the code that created the environment
    and the structures do not line up right."

    Alan Bram
    Oracle
  • 6. Re: Regenerate a replica
    566200 Newbie
    I did use DB_RECOVER but tried DB_RECOVER_FATAL instead. It didn't make much difference. It seems like I can't use them at the same time(?).

    These are the flags I (_currently_) use:
    DB_CREATE | DB_INIT_MPOOL | DB_INIT_TXN | DB_RECOVER_FATAL | DB_INIT_LOCK | DB_INIT_LOG | DB_THREAD | DB_INIT_REP
  • 7. Re: Regenerate a replica
    524761 Journeyer
    DB_RECOVER (without the _FATAL) is probably what you want, unless you've restored log files after a media failure.

    But, what I was trying to get at in my previous question: at the master, do you open the environment just once, in a single thread, and wait for that to complete before proceeding with any other operations? Or do you have multiple env opens, perhaps in different processes?

    Alan Bram
    Oracle
  • 8. Re: Regenerate a replica
    566200 Newbie
    Sorry, yes only one environment open and only one thread/process.

    The application is currently designed to be multi-process (forked), but that has been disabled in my svn branch (it's being rewritten in another branch to be fully thread-safe; mine isn't). So the app isn't thread-safe, but as I said, I only run one thread and open the environment only once (in both the master and the slave).

    The problem is that replication 'kinda works'. I start the master, wait for it to be ready (open the environment and databases, start replication, etc.) and then start the slave... The slave gets a couple of hundred kilobytes of the database, then stops. Nothing in the logs...

    Message was edited by:
    Turbo
  • 9. Re: Regenerate a replica
    566200 Newbie
    After talking about this with a colleague and our tech lead, we decided to check whether there is something wrong with the database(s) after running for a couple of minutes...

    This is the output of 'db_stat -c' on the MASTER:
    ----- s n i p -----
    44 Last allocated locker ID
    0x7fffffff Current maximum unused locker ID
    9 Number of lock modes
    1000 Maximum number of locks possible
    1000 Maximum number of lockers possible
    1000 Maximum number of lock objects possible
    0 Number of current locks
    12 Maximum number of locks at any one time
    0 Number of current lockers
    20 Maximum number of lockers at any one time
    0 Number of current lock objects
    12 Maximum number of lock objects at any one time
    33 Total number of locks requested
    33 Total number of locks released
    0 Total number of locks upgraded
    11 Total number of locks downgraded
    0 Lock requests not available due to conflicts, for which we waited
    0 Lock requests not available due to conflicts, for which we did not wait
    0 Number of deadlocks
    0 Lock timeout value
    0 Number of locks that have timed out
    0 Transaction timeout value
    0 Number of transactions that have timed out
    344KB The size of the lock region
    0 The number of region locks that required waiting (0%)
    ----- s n i p -----

    And this is the same command but on the slave:
    ----- s n i p -----
    777 Last allocated locker ID
    0x7fffffff Current maximum unused locker ID
    9 Number of lock modes
    1000 Maximum number of locks possible
    1000 Maximum number of lockers possible
    1000 Maximum number of lock objects possible
    12 Number of current locks
    16 Maximum number of locks at any one time
    33 Number of current lockers
    34 Maximum number of lockers at any one time
    12 Number of current lock objects
    16 Maximum number of lock objects at any one time
    372117 Total number of locks requested
    372105 Total number of locks released
    0 Total number of locks upgraded
    12 Total number of locks downgraded
    0 Lock requests not available due to conflicts, for which we waited
    0 Lock requests not available due to conflicts, for which we did not wait
    0 Number of deadlocks
    0 Lock timeout value
    0 Number of locks that have timed out
    0 Transaction timeout value
    0 Number of transactions that have timed out
    344KB The size of the lock region
    17 The number of region locks that required waiting (0%)
    ----- s n i p -----

    My untrained eye immediately notices the 'Number of current locks' on the slave...
    db_stat is run AFTER shutting down the application(s).
  • 10. Re: Regenerate a replica
    524761 Journeyer
    I'm still concerned about the "unable to join the environment" error,
    because that indicates a fundamental problem. The environment is the
    foundation upon which the rest of the system is supposed to work, so
    if that's broken it's not surprising that replication would not work.

    Did you see that just one time? Or do you see that every time you try
    the experiment?

    That message is generated in the function __db_e_attach
    (env/env_region.c). We should only be able to get there by "goto
    retry", because we know that "ret == 0" at that point. There are 3
    places within that function where we goto retry. Can you put a
    debugger breakpoint and/or printf call at those 3 places, and tell me
    which one is happening?

    Other notes:

    When you say you "remove the slave database again (to force a sync
    again)", do you mean that you remove the entire environment (all
    database files, all transaction log files, and all region files such
    as "__db.001", etc.)? Or do you mean just one single database file?
    Please remember that Berkeley DB terminology is a bit nonstandard: we
    use the term "database" to refer to what corresponds to a single table
    in a usual relational system; an "environment" can have many databases
    associated with it.

    Normally if you connect a brand new (empty) client to an existing
    master, the client should synchronize itself with the master
    completely. However, the STARTUPDONE event normally does not occur
    when that synchronization is complete, unless there has been new
    activity (i.e., additional transactions generated) at the master site,
    after the beginning of the synchronization. (This behavior is
    admittedly a bit confusing, and has been rectified in the upcoming
    release.) Also, STARTUPDONE is a client concern; it is never generated
    at the master.

    Are you using the Replication Framework in all of these test cases?

    Finally, in your message about the locks you say "after shutting down
    the application(s)". Does shutting down include a successful closing
    of the Berkeley DB environment (i.e., a call to DB_ENV->close() which
    returns "0")?


    Alan Bram
    Oracle
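
    A minimal sketch of the 'don't report OK until the client has caught up' idea asked about earlier in the thread. It assumes the application's event callback (the one registered with DB_ENV->set_event_notify()) calls a helper when it sees DB_EVENT_REP_STARTUPDONE. The names mark_startup_done() and wait_for_startup_done() are made up for illustration and are not part of Berkeley DB. As noted above, in this release the event may not arrive until there is new activity at the master, so the wait takes a timeout rather than blocking forever:

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <stdatomic.h>
    #include <time.h>

    /* Set to 1 once the client has been told it has caught up with the master. */
    static atomic_int startup_done = 0;

    /*
     * Hypothetical hook: call this from the function registered with
     * DB_ENV->set_event_notify() when event == DB_EVENT_REP_STARTUPDONE.
     * The callback may run in a library thread, hence the atomic store.
     */
    void mark_startup_done(void)
    {
        atomic_store(&startup_done, 1);
    }

    /*
     * Poll until the flag is set or roughly timeout_ms milliseconds pass.
     * Returns 1 if startup completed, 0 on timeout.
     */
    int wait_for_startup_done(long timeout_ms)
    {
        const long step_ms = 50;
        struct timespec step = { 0, step_ms * 1000000L };
        long waited_ms = 0;

        assert(timeout_ms >= 0);
        while (!atomic_load(&startup_done)) {
            if (waited_ms >= timeout_ms)
                return 0;
            nanosleep(&step, NULL);
            waited_ms += step_ms;
        }
        return 1;
    }
    ```

    The client would call wait_for_startup_done() after DB_ENV->repmgr_start() and treat a 0 return as "not known to be in sync yet" rather than as an error.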
  • 11. Re: Regenerate a replica
    566200 Newbie
    Sorry I haven't replied earlier, but I've been busy trying to make the apps 'more threadable' by removing the global transaction and cursor pointers, which didn't seem to help. The apps are STILL very far from thread-safe (meaning a lot of quite large global variables etc.)...

    I haven't seen the 'unable to join the environment' error in quite a while, so it might just have been a problem with my flags to the open() calls etc...

    For me, the environment = database, but I see your point. I'll try to adjust my vocabulary to BDB terms :). I.e., I was (and am) 'removing the environment' (not the database).

    If using repmgr_start() et al. means using the 'Replication Framework', then yes, I'm using that in all my tests, and the app is shutting down the database(s) and the environment correctly with close() etc. I also make sure (but obviously am not succeeding) that any transactions and cursors are closed as well. At least the code to do that is there and is executed (I have logs to confirm that). I'll have to triple-check that it actually returns '0' though. Not all the calls check that.

    UPDATE: After adding some checks for the return values of the close() calls, I now end up with the following messages on the server (unrelated to the changes, but I've gotten them a couple of times previously, so I might as well show them here):

    redundant incoming connection will be ignored
    writing data: Connection reset by peer
    DB_ENV->rep_process_message: Operation not permitted
    message thread failed: Operation not permitted
    PANIC: Operation not permitted

    The first line comes when the client receives the NEW_MASTER message, and the rest when I shutdown the client...
    My own debugging/messaging say:

    Replication event PANIC received.
    Failed to close table. Description: DB_RUNRECOVERY: Fatal error, run database recovery

    One thing that would be interesting to know is how the master database could be corrupted just because there was a problem at the client end. No updates were done at the master...

    UPDATE: You were right. The application didn't close the database(s) and the environment when it got a SIGINT, only on SIGSEGV (which didn't happen). Now there are no lingering locks, but it still doesn't replicate. And there is nothing in the logs that can shed some light on the matter...

    Message was edited by:
    Turbo
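
    A sketch of one way to get a clean close on SIGINT, as discussed in this post: the handler only sets a flag, and the main loop does the actual DB->close() and DB_ENV->close() calls once it notices the flag. The function names and the do_work/close_all callbacks are hypothetical placeholders, not the application's real code:

    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <signal.h>
    #include <string.h>

    /* Set by the handler; polled by the main loop. */
    static volatile sig_atomic_t shutdown_requested = 0;

    /* Async-signal-safe: only records that a shutdown was asked for. */
    void on_shutdown_signal(int signo)
    {
        (void)signo;
        shutdown_requested = 1;
    }

    /* Route SIGINT and SIGTERM to the handler. Returns 0 on success, -1 on error. */
    int install_shutdown_handlers(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_shutdown_signal;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGINT, &sa, NULL) != 0)
            return -1;
        if (sigaction(SIGTERM, &sa, NULL) != 0)
            return -1;
        return 0;
    }

    /*
     * Hypothetical main loop. close_all is where the application would close
     * cursors, resolve transactions, call DB->close() on every database and
     * finally DB_ENV->close(), checking each return value. Its result is
     * passed back so the caller can see whether the close succeeded.
     */
    int run_until_shutdown(void (*do_work)(void), int (*close_all)(void))
    {
        assert(do_work != NULL && close_all != NULL);
        while (!shutdown_requested)
            do_work();
        return close_all();
    }
    ```

    Keeping the Berkeley DB calls out of the handler itself is deliberate: only the flag is touched in signal context, and the close sequence runs on the normal code path.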
  • 12. Re: Regenerate a replica
    566200 Newbie
    I got a second pair of eyes on the problem today (from one of the guys that helped 'design' the whole thing, all the applications etc.) and eventually he wanted to try putting the database on a normal filesystem.
    I hadn't noticed (I've only been on the project a couple of weeks), but the whole environment was in a tmpfs!
    Putting the environment and all the databases on a normal filesystem helped: it now seems to have finished synchronizing. I apologize for wasting everyone's time...

    I DO however still get the PANIC (on the master!) when I shut down the client, so maybe the time isn't completely wasted...

    Also, come to think of it, I configured a 2GB logfile size limit, and I got 15 files, with the *.{idx,tbl} files only being 104MB! Why do I get 30GB worth of logs with only 104MB worth of data?

    Message was edited by:
    Turbo
  • 13. Re: Regenerate a replica
    524761 Journeyer
    The key error is this line:

    DB_ENV->rep_process_message: Operation not permitted

    Could you turn on verbose replication messages, and let's see what message the master is trying to process at the time it gets this error.

    Alan Bram
    Oracle
  • 14. Re: Regenerate a replica
    566200 Newbie
    I haven't seen this in a while now, but I'll keep an eye open for it...

    Now I'm getting 'DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock' on the client and two 'EOF on connection from site <client:port>' messages. The first comes at about the same time as DB_EVENT_REP_NEWMASTER is received and the other when the DB_LOCK_DEADLOCK occurs. The client gets 11MB out of 11.5MB of the first database (there are 11 databases in the environment) from the master (missing -293376 bytes) before the DB_LOCK_DEADLOCK occurs...

    The DB_LOCK_DEADLOCK happens in/at the DB->open() call. DB->open() is called with the following flags: DB_THREAD | DB_RDONLY | DB_AUTO_COMMIT | DB_READ_UNCOMMITTED.

    DB_ENV->open() flags (just for completeness): DB_CREATE | DB_INIT_MPOOL | DB_INIT_TXN | DB_RECOVER | DB_INIT_LOCK | DB_INIT_LOG | DB_THREAD | DB_INIT_REP.

    Running with the verbose mode DB_VERB_DEADLOCK doesn't give me anything.


    Every time I fix something, something else breaks. All I did since yesterday (when it worked, but slowly) was add nested transactions (to speed up bulk loads, which WAS faster), but that's nowhere near the DB*->open() calls...

    How sensitive is BDB about global variables/'thread-safeness'? I wish I could post the code for extra eyes, but it's not open code...
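
    Berkeley DB applications usually treat DB_LOCK_DEADLOCK as a signal to back out and retry the operation. A minimal sketch of a bounded retry loop follows, assuming the failing call can be wrapped in a function that returns its error code. Whether retrying is enough for the DB->open() case described in this post is not established in this thread. retry_on_code(), fake_open() and FAKE_DEADLOCK are illustrative names only; FAKE_DEADLOCK is a stand-in and not the real value of DB_LOCK_DEADLOCK:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /*
     * Call op(arg) until it returns something other than retry_code, or until
     * max_attempts calls have been made. Returns the last value op returned.
     * In a Berkeley DB program retry_code would be DB_LOCK_DEADLOCK and op
     * would wrap the failing call, aborting any enclosing transaction first.
     */
    int retry_on_code(int (*op)(void *), void *arg, int retry_code, int max_attempts)
    {
        int ret = retry_code;
        int attempt;

        assert(op != NULL && max_attempts > 0);
        for (attempt = 0; attempt < max_attempts; attempt++) {
            ret = op(arg);
            if (ret != retry_code)
                break;
        }
        return ret;
    }

    /* Placeholder error code for the demo below; NOT the real DB_LOCK_DEADLOCK. */
    #define FAKE_DEADLOCK (-1)

    /* Stand-in operation: fails *arg more times, then succeeds with 0. */
    int fake_open(void *arg)
    {
        int *failures_left = arg;

        if (*failures_left > 0) {
            (*failures_left)--;
            return FAKE_DEADLOCK;
        }
        return 0;
    }
    ```

    With two simulated failures and a limit of five attempts, retry_on_code(fake_open, &n, FAKE_DEADLOCK, 5) succeeds on the third call. With more failures than attempts it gives up and returns the retry code, so the caller can log it.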