5 Replies Latest reply: May 23, 2013 4:19 AM by Andrei Costache, Oracle

Re: db_recover -c always fails at 67%

1010357 Newbie
Hello
I am using Berkeley DB release db-4.5.20.NC and I have the same problems recovering with db_recover.

I had a system crash this week.
Two weeks ago I took a snapshot of all the files in the environment directory, and every day a cron job copies all the logs from the environment directory to a backup directory.
The first attempt to recover, with "db_recover -c" and db_recover, failed, so the new logs could not be applied to the database. After removing some log files, db_recover completed OK, but no new logs were applied; the information corresponds to two weeks ago.

I also have a full backup (slapcat) from two weeks ago; it has now been restored (slapadd) and is in production.

So two weeks of work are now missing, but:
- I have all the logs for these two weeks, up to the crash.
- All the logs contain only new data to ADD; no modifications, no deletes.
- What I need is only to ADD the data in the logs to the database, if possible...

For me, it is not clear what happens in db_recover or "db_recover -c" when:
a) some of the logs are old logs that are not necessary but, by mistake, are there;
b) logs are partially or totally missing.

Can you give me some help with these suggestions?
Using db_recover (normal or catastrophic), db_verify, db_printlog:
1 - Using today's database and logs, is there any way to select the proper logs from the last two weeks, put them in the working directory with today's logs (openldap-data), and then run db_recover (catastrophic?) so as to ADD as much data as possible?

2 - Using the backup from two weeks ago, is there any way to select the proper logs from the last two weeks, put them in the backup system, run db_recover (catastrophic?) so as to ADD as much data as possible to the backup database, then create an LDIF file with the two weeks of work using an ldap_search with a CreateTime/ModificationTime filter, and then "ldap_add" this data?


Jose
  • 1. Re: db_recover -c always fails at 67%
    Andrei Costache, Oracle Journeyer
    Hi Jose,
    user13384449 wrote:
    I had a system crash this week.
    Two weeks ago I took a snapshot of all the files in the environment directory, and every day a cron job copies all the logs from the environment directory to a backup directory.
    Some remarks here, though you are probably doing things this way. You should be copying the log files from the smallest numbered log file to the largest numbered log file, in other words, in ascending order (based on the log file's number).
    Also, you should be overwriting the log files that already exist in the backup directory (since you're copying in more up to date log files).
    Your backup strategy looks to be a hot backup followed by daily incremental backups.
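    For illustration, a minimal sketch of such a daily log copy (both paths are examples; adapt them to your setup):
    # the shell expands log.* in lexical order, which matches the numeric order
    # of the fixed-width log file names, i.e. ascending order
    for f in /usr/local/var/openldap-data/log.*; do
        cp -p "$f" /backup/openldap-data/    # overwrite any older copy in the backup
    done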
    The first attempt to recover, with "db_recover -c" and db_recover, failed, so the new logs could not be applied to the database. After removing some log files, db_recover completed OK, but no new logs were applied; the information corresponds to two weeks ago.
    Even if recovery completed you probably lost the modifications from the log files that were removed from the live environment. Also, removing log files just to make recovery complete is not safe. Log files should be removed by following the guidelines here. Note that enabling automatic log file removal using the DB_LOG_AUTO_REMOVE flag will make catastrophic recovery impossible as you'll not have a chance to copy the log files to a backup location before they are removed; same applies to a log_archive(DB_ARCH_REMOVE) call -- before removing the logs, copy them to the backup directory.
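    For illustration only, a sketch of a safer removal sequence using the db_archive utility (paths are examples; db_archive -d removes the log files that are no longer needed for normal recovery):
    ENV=/usr/local/var/openldap-data     # live environment (example path)
    BKP=/backup/openldap-data            # backup location (hypothetical path)
    # first copy every log file to the backup location
    cp -p "$ENV"/log.* "$BKP"/
    # only then remove the logs that are no longer needed by normal recovery
    db_archive -d -h "$ENV"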

    In addition, have a look over the backup and recovery procedures:
    http://docs.oracle.com/cd/E17076_02/html/programmer_reference/transapp_recovery.html
    http://docs.oracle.com/cd/E17076_02/html/gsg_txn/C/recovery.html
    http://docs.oracle.com/cd/E17076_02/html/gsg_txn/C/backuprestore.html

    If, after the system crash, a normal recovery fails (db_recover, or opening the environment with the DB_RECOVER flag) and a catastrophic recovery fails too (db_recover -c, or opening the environment with the DB_RECOVER_FATAL flag), then the next steps depend on whether or not you have a backup available on top of which you can add the log files that were created in the live environment since the backup was taken.
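    For reference, the two command-line forms look like this (the environment path is just an example):
    db_recover -v -h /usr/local/var/openldap-data      # normal recovery
    db_recover -c -v -h /usr/local/var/openldap-data   # catastrophic recovery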

    My understanding is that you have all the log files that were created in the live environment since the last time you took an incremental backup.
    I also have a full backup (slapcat) from two weeks ago; it has now been restored (slapadd) and is in production.

    So two weeks of work are now missing, but:
    - I have all the logs for these two weeks, up to the crash.
    - All the logs contain only new data to ADD; no modifications, no deletes.
    - What I need is only to ADD the data in the logs to the database, if possible...
    You can try to copy the backup to an empty directory (we'll call it a recover directory), then copy all the log files from the live environment into that empty directory too, and perform a catastrophic recovery. If the catastrophic recovery succeeds (db_recover -c), then you can use that recover directory as the live env directory, or move the files to the designated live env directory (overwriting any existing files in there).
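    A minimal sketch of that procedure, with hypothetical paths for the backup, the live environment, and the recover directory:
    RECOVER=/tmp/recover-env                                # empty recover directory (hypothetical)
    mkdir -p "$RECOVER"
    cp -p /backup/openldap-data/*.bdb "$RECOVER"/           # databases from the backup
    cp -p /backup/openldap-data/log.* "$RECOVER"/           # logs from the backup
    cp -p /usr/local/var/openldap-data/log.* "$RECOVER"/    # newer logs from the live env overwrite the older copies
    db_recover -c -v -h "$RECOVER"                          # catastrophic recovery in the recover directory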
    For me, it is not clear what happens in db_recover or "db_recover -c" when:
    a) some of the logs are old logs that are not necessary but, by mistake, are there;
    There is no problem if, during catastrophic recovery, some old log files are present -- that is, log files containing changes that have already been applied to the databases. (Catastrophic recovery is the one that replays all the existing log files; a normal recovery only replays the logs since the last checkpoint.)
    b) logs are partially or totally missing.
    Catastrophic recovery requires a continuous series of log files and will likely fail if there are gaps, that is, missing log files. By partially missing logs, I assume you are referring to partially missing log records -- this rather indicates log file corruption; in this case too, recovery will likely fail.
    Can you give me some help with these suggestions?
    Using db_recover (normal or catastrophic), db_verify, db_printlog:
    1 - Using today's database and logs, is there any way to select the proper logs from the last two weeks, put them in the working directory with today's logs (openldap-data), and then run db_recover (catastrophic?) so as to ADD as much data as possible?

    2 - Using the backup from two weeks ago, is there any way to select the proper logs from the last two weeks, put them in the backup system, run db_recover (catastrophic?) so as to ADD as much data as possible to the backup database, then create an LDIF file with the two weeks of work using an ldap_search with a CreateTime/ModificationTime filter, and then "ldap_add" this data?
    See the suggestion I mentioned a few paragraphs above. Let me know if you manage to successfully recover the database(s).

    Regards,
    Andrei
  • 2. Re: db_recover -c always fails at 67%
    1010357 Newbie
    Hi Andrei,
    Thanks for your advice, but we have already tried the recovery on the backup database, which is the only one we have left due to the disk failure on the main database.

    My understanding is that you have all the log files that were created in the live environment since the last time you took an incremental backup.

    I don't have the live environment due to the disk problem; the only one I have was made this way.
    First, a ready-to-work snapshot is taken to the backup computer:
    with slapd stopped:
    - db_remove unnecessary logs
    - scp /usr/local/var/openldap-data/__*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/objct*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/log.*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/dn2id.bdb     user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/id2entry.bdb user1@192.168.1.202:/usr/local/var/openldap-data/

    On a daily basis, the logs are copied to the NEW directory,
    with slapd running:
    - scp /usr/local/var/openldap-data/log.* user1@192.168.1.202:/usr/local/var/openldap-data/NEW/

    Catastrophic recovery requires a continuous series of log files and will likely fail if there are gaps, that is, missing log files. By partially missing logs, I assume you are referring to partially missing log records -- this rather indicates log file corruption; in this case too, recovery will likely fail.

    Theoretically, all the logs are there. The first attempt to recover was:
    a) copy all the NEW/log.* files to /usr/local/var/openldap-data (2 logs replaced)
    b) db_recover -c -- I got a PANIC message about an LSN...

    The questions are:
    - In this particular situation, should the system recover properly and insert all the data into the database, or not, because of the replacement of the two log files?
    - In another situation, where one log is damaged, is there some way to overcome the problem, with a "continue" option for instance, in order to get as many inserts as possible into the database?

    The last resort is to develop a program to extract as much data as possible from the logs, because we are not in such a difficult situation: the schema is always the same, and all the data just needs to be added to the database.


    Regards,
    Jose
  • 3. Re: db_recover -c always fails at 67%
    Andrei Costache, Oracle Journeyer
    Hi Jose,

    Thanks for the clarifications. However, my suggestion still applies, as I do not think you are properly "assembling" the logs for running recovery.
    user13384449 wrote:
    I don't have the live environment due to the disk problem; the only one I have was made this way.
    First, a ready-to-work snapshot is taken to the backup computer:
    with slapd stopped:
    - db_remove unnecessary logs
    - scp /usr/local/var/openldap-data/__*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/objct*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/log.*      user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/dn2id.bdb     user1@192.168.1.202:/usr/local/var/openldap-data/
    - scp /usr/local/var/openldap-data/id2entry.bdb user1@192.168.1.202:/usr/local/var/openldap-data/
    On a daily basis, the logs are copied to the NEW directory,
    with slapd running:
    - scp /usr/local/var/openldap-data/log.* user1@192.168.1.202:/usr/local/var/openldap-data/NEW/
    Okay, so an offline/cold backup was taken first, followed by incremental backups, copying the new log files daily to the backup directory.
    However, I do not see the point of copying the new logs into a subdirectory in the backup directory -- you can simply retain the same directory structure as that of the live env directory.
    Theoretically, all the logs are there. The first attempt to recover was:
    a) copy all the NEW/log.* files to /usr/local/var/openldap-data (2 logs replaced)
    b) db_recover -c -- I got a PANIC message about an LSN...
    I do not understand why you copied the logs from the backup directory into the live environment directory. You basically overwrote/replaced up-to-date logs with their older versions.
    Suppose that after a daily incremental backup you have copied logs 500 to 600, the last two logs being 599 and 600. Log 600 might not be completely filled with log records, as it is the current log in the live environment. During the next day of application activity, log 600 gets filled entirely and new logs get created in the live env. I am not sure whether you have automatic log file removal enabled or remove log files by calling log_archive() in the live env, so assume that you remove log files in the live env and that the last checkpoint is in log 599; hence all the logs from log 599 onwards (including log 599) remain in the live env. Then you get the crash, and neither a normal recovery nor a catastrophic recovery works in the live env. If at this point you copy the logs from the backup directory on top of the logs in the live env, you will end up with a log 600 that is incomplete, meaning it does not reflect all the changes in the databases.

    Since you mentioned an LSN error, it should be simple to figure out whether the above is the case: check whether the log file number in the LSN reported in the error is one of the logs that you wrongly replaced by copying from the backup directory to the live env directory, or the log file immediately following these two (in terms of log file number).

    Again, try to create an empty directory in which recovery will run: copy there all the environment files, including the databases, followed by the logs from the backup directory (192.168.1.202:/usr/local/var/openldap-data/log.* and 192.168.1.202:/usr/local/var/openldap-data/NEW/log.*), then copy all the logs from the live env directory, and then try a catastrophic recovery (db_recover -cv); a command sketch is shown after the list below.
    If catastrophic recovery still fails, please:
    -- explain in detail the exact actions you took after the system crash, and provide the exact errors you received when you first tried a normal and then a catastrophic recovery in the live environment after the system crash,
    -- provide the error output from db_recover -cv when you attempt to perform a catastrophic recovery as explained above, in that empty/recover directory, and a recursive directory listing of this directory.
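    A sketch of those steps, assuming the recover directory is created on the backup machine (192.168.1.202) where the snapshot and the "NEW" directory live; the recover path is hypothetical:
    RECOVER=/tmp/recover-env
    mkdir -p "$RECOVER"
    cp -p /usr/local/var/openldap-data/*.bdb     "$RECOVER"/   # database files from the snapshot
    cp -p /usr/local/var/openldap-data/log.*     "$RECOVER"/   # logs from the snapshot
    cp -p /usr/local/var/openldap-data/NEW/log.* "$RECOVER"/   # newer logs overwrite the older copies
    db_recover -c -v -h "$RECOVER"                             # catastrophic recovery, verbose
    ls -lR "$RECOVER"                                          # recursive listing to include if it still fails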
    The questions are:
    - In this particular situation, should the system recover properly and insert all the data into the database, or not, because of the replacement of the two log files?
    As explained above, you wrongly replaced those two log files, thus ending up with pieces of info (log records) missing from the log files, log records that (catastrophic) recovery needs in order to consistently reconstruct the databases.
    - In another situation, where one log is damaged, is there some way to overcome the problem, with a "continue" option for instance, in order to get as many inserts as possible into the database?
    A damaged log file would be a log file that is corrupted -- but in this case you would have seen a checksum error, not an LSN error.
    If so, then you would need to verify the databases using db_verify; if they do not verify correctly, then you would need to salvage-dump and reload them using db_dump -r/db_load, then reset the LSNs in the databases with db_load -r lsn before using them in a transactional environment.
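    As an illustration (the environment path and database file names follow the layout mentioned earlier in the thread; treat them as examples):
    ENV=/usr/local/var/openldap-data
    db_verify -h "$ENV" id2entry.bdb                      # verify one database file
    db_dump -r -h "$ENV" -f id2entry.dump id2entry.bdb    # salvage dump
    db_load -h "$ENV" -f id2entry.dump id2entry_new.bdb   # reload into a fresh database file
    db_load -r lsn -h "$ENV" id2entry_new.bdb              # reset the LSNs before transactional use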


    Regards,
    Andrei
  • 4. Re: db_recover -c always fails at 67%
    1010357 Newbie
    Hi Andrei,

    Thanks for the response. There are some misunderstandings.
    - I have a production/live system that I don't touch. I only copy logs from there; I do not copy the logs back again!
    - Creating an "empty" directory, as you suggested, is already done. I have a backup ready to work AND a "NEW" directory with logs to try db_recover.
    - I have 2 problematic files (logs 500 and 501) in the "NEW" logs that have the same names as files in the backup. But file 501 in the "NEW" directory has more transactions!

    Before copying the logs from the "NEW" directory, the backup database works fine, because it is a copy of the working/live database and uses the same environment (with logs 500 and 501). But, as you explained in your last response, after copying the log files that are in "NEW" (500 to 600), there are 2 logs that replace 500 and 501 in the BACKUP. I think this is the initial problem: db_recover -r finds the changes and stops, giving an error. After some tries ("db_recover -c", "db_recover") it starts working, but the new data from logs 502 to 600 is not in the database!

    - In that situation, and others where some log is damaged, is there some way to overcome the problem, with an option, in order to get as many inserts as possible into the database?


    Best regards
    Jose
  • 5. Re: db_recover -c always fails at 67%
    Andrei Costache, Oracle Journeyer
    Hi Jose,

    I think I understand now what you are doing. However, I asked for some information in my previous post. Could you please provide it?
    Also, you did not mention whether and how you are removing log files from the live env.
    Note that you should always run a catastrophic recovery -- db_recover -c or DB_RECOVER_FATAL -- and NOT a normal recovery (not db_recover or DB_RECOVER) when recovering from a snapshot or a backup.

    Recovery cannot skip portions of logs, in other words, jump over specific sections in the logs; there are no such options.
    I'll send you by e-mail the details for uploading an archive containing the entire backup directory (including the "NEW" subdirectory) -- I might be able to better understand the issue if I examine the files.

    Regards,
    Andrei