Recovering From Catastrophic Failure — oracle-tech

Recovering From Catastrophic Failure

bblasing-Oracle Member Posts: 64 Employee
edited June 2015 in Oracle HSM

Certain events, such as flooding in a computer room, can be classified as catastrophic failures. This page describes the procedure to follow after such an event.

Recovery Task Overview

You should not recover any system component, software element, or file system that has not failed. However, you might need to reconfigure the file system on a restored system to regain access to file systems or to determine whether any file system has failed.

The process of recovering from a catastrophic failure involves the following tasks:

Task | For More Information
Determine the failed system component. | How to Restore Failed System Components
Disable the archiver and the recycler until all files are restored. | How to Disable the Archiver and Recycler Until All Files Are Restored
Compare previous and current configuration files, and reconcile inconsistencies. | How to Keep and Compare Previous and Current Configuration and Log Files
Repair disks. | How to Repair Disks
Restore or build new library catalog files. | How to Restore or Build New Library Catalog Files
Create new file systems and restore from samfsdump output. | How to Make New File Systems and Restore From samfsdump Output

How to Restore Failed System Components

  1. Ascertain which components have failed.
  2. If a hardware component has failed, restore it to operation, preserving any available data.
    If the failing component is a disk drive that has not totally failed, preserve as much information as possible. Before replacing or reformatting the disk, identify any salvageable files, and copy these files to a tape or to another disk for future use in the recovery process. Salvageable files to identify and copy include the following:
    • File system dumps
    • Sun SAM configuration files, archiver log files, or library catalogs
  3. If the Oracle Solaris Operating System (OS) has failed, restore it to operation.
    Verify that the Solaris OS is functioning correctly before proceeding.
  4. If the Sun SAM or Sun QFS package has been damaged, remove and reinstall it from a backup copy or from its distribution file.
    You can verify whether a package has been damaged by using the pkgchk(1M) utility.
  5. If disk hardware used by the Sun SAM software was repaired or replaced in Step 2, configure the disks (for RAID binding or mirroring) as necessary.
    Reformat disks only if they have been replaced or if it is otherwise absolutely necessary.
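The salvage copy described in Step 2 can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name salvage_files and the destination directory are hypothetical, and the sketch covers copying to another disk only, not to tape. It keeps each file's original path under the destination so that you can tell later where the file came from.

```shell
#!/bin/sh
# Hypothetical helper: copy salvageable files (dumps, configuration files,
# archiver logs, catalogs) to another disk, preserving their original paths
# below the destination directory.
# Usage: salvage_files <destination-dir> <file-or-dir>...
salvage_files() {
    dest=$1
    shift
    status=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            # Report unreadable or missing items but keep going.
            echo "not found, skipped: $f" >&2
            status=1
            continue
        fi
        target="$dest/$(dirname "$f")"
        # -p keeps timestamps and modes, -R copies directories recursively.
        mkdir -p "$target" && cp -pR "$f" "$target/" || status=1
    done
    # Non-zero when at least one item could not be copied.
    return "$status"
}
```

For example, `salvage_files /mnt/salvage /etc/opt/SUNWsamfs/archiver.cmd` would place the copy at /mnt/salvage/etc/opt/SUNWsamfs/archiver.cmd, where /mnt/salvage is an assumed mount point on a healthy disk.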

How to Disable the Archiver and Recycler Until All Files Are Restored

Before You Begin

If the recycler runs before all files are restored, cartridges that hold good archive copies might be improperly relabeled.

  1. Add a single global wait directive to the archiver.cmd file or add a file-system-specific wait directive for each file system for which you want to disable archiving.
  2. Open the /etc/opt/SUNWsamfs/archiver.cmd file for editing and find the section in which you want to insert the wait directive.
    In the following sample file, local archiving directives exist for two file systems, samfs1 and samfs2.

    # vi /etc/opt/SUNWsamfs/archiver.cmd
    ...
    fs = samfs1
      allfiles .
      1 10s
    fs = samfs2
      allfiles .
      1 10s

  3. Add the wait directive.
    • To apply the directive globally, insert it before the first fs = command (fs = samfs1), as shown here:

      wait
      fs = samfs1
        allfiles .
        1 10s
      fs = samfs2
        allfiles .
        1 10s
      :wq

    • To apply the directive to a single file system, insert it after the fs = command for that file system, as shown here:

      fs = samfs1
        wait
        allfiles .
        1 10s
      fs = samfs2
        wait
        allfiles .
        1 10s
      :wq

  4. Add a global ignore directive to the recycler.cmd file, or add a library-specific ignore directive for each library for which you want to disable recycling.
  5. Open the /etc/opt/SUNWsamfs/recycler.cmd file for editing, as shown in the following example.

    # vi /etc/opt/SUNWsamfs/recycler.cmd
    ...
    logfile = /var/adm/recycler.log
    lt20 -hwm 75 -mingain 60
    hp30 -hwm 90 -mingain 60 -mail root
    gr47 -hwm 95 -mingain 60 -mail root

  6. Add the ignore directives.
    The following example shows ignore directives added for three libraries.

    # recycler.cmd.after - example recycler.cmd file
    #
    logfile = /var/adm/recycler.log
    lt20 -hwm 75 -mingain 60 -ignore
    hp30 -hwm 90 -mingain 60 -ignore -mail root
    gr47 -hwm 95 -mingain 60 -ignore -mail root
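If you prefer not to edit the files by hand, the two changes can be sketched as shell functions. This is a minimal sketch, not part of the documented procedure. The function names add_global_wait and add_ignore are hypothetical, and the sketch assumes the layouts shown in the samples: archiver.cmd contains at least one "fs =" line, and each library line in recycler.cmd carries the -hwm option. Both functions print the edited file to standard output, so you can review the result before you replace the real file.

```shell
#!/bin/sh
# Hypothetical helpers. They never modify the input file; they print the
# edited content to stdout for review.

# Insert a global "wait" directive before the first "fs =" line of an
# archiver.cmd-style file, unless a wait line already precedes that line.
# If the file has no "fs =" line, nothing is inserted.
add_global_wait() {
    awk '
        /^[ \t]*wait[ \t]*$/ && !seen_fs { have_wait = 1 }
        /^[ \t]*fs[ \t]*=/ && !seen_fs {
            if (!have_wait) print "wait"
            seen_fs = 1
        }
        { print }
    ' "$1"
}

# Add "-ignore" to every library line that does not already have it.
# Assumption: a library line is any non-comment line that contains "-hwm".
add_ignore() {
    awk '
        /^[ \t]*#/ { print; next }
        /-hwm/ && !/-ignore/ {
            # Place -ignore before "-mail" when present, as in the sample file.
            if (match($0, /[ \t]+-mail/)) {
                print substr($0, 1, RSTART - 1) " -ignore" substr($0, RSTART)
            } else {
                print $0 " -ignore"
            }
            next
        }
        { print }
    ' "$1"
}
```

A cautious way to use these is to write the output to a new file, compare it with the original, and only then copy it into /etc/opt/SUNWsamfs.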

How to Keep and Compare Previous and Current Configuration and Log Files

Follow these steps before rebuilding the system.

  1. Recover any available Sun SAM configuration files or archiver log files from the system's disks.
  2. Compare the restored versions of all configuration files represented in the SAMreport with those restored from the system backups.
  3. If inconsistencies exist, determine the effect of the inconsistencies and reinstall the file system, if necessary, using the configuration information in the SAMreport file.
    For more information on the SAMreport file, see the samexplorer(1M) man page.
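The comparison in Step 2 can be done with the standard diff(1) utility. The following is a minimal sketch that assumes you have placed the two sets of configuration files in two separate directories. The function name compare_configs and the directory names in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show every difference between two directory trees of
# configuration files, for example files recovered from the system disks
# and files restored from the system backups.
# Usage: compare_configs <recovered-dir> <backup-dir>
compare_configs() {
    recovered_dir=$1
    backup_dir=$2
    # diff -r walks both trees recursively. It prints the differing lines
    # and reports files that exist on one side only.
    # Exit status: 0 = identical, 1 = differences found, >1 = error.
    diff -r "$recovered_dir" "$backup_dir"
}
```

For example, `compare_configs /salvage/etc/opt/SUNWsamfs /restore/etc/opt/SUNWsamfs` prints nothing and returns 0 when the two copies agree, assuming those two directories hold the recovered and restored files.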

How to Repair Disks

  1. For file systems that reside on disks that have not been replaced, run the samfsck(1M) utility to repair small inconsistencies, reclaim lost blocks, and so on.
    For command-line options to the samfsck utility, see the samfsck(1M) man page.
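One cautious way to apply this step is to run a check first and repair only when the check reports a problem. The sketch below rests on assumptions you should verify against the samfsck(1M) man page for your release: that samfsck without options only checks, that the -F option requests a repair, that a non-zero exit status from the check means inconsistencies were reported, and that the file system is unmounted before the repair. The function name check_then_repair and the SAMFSCK variable are hypothetical.

```shell
#!/bin/sh
# Command to run. It can be overridden, for example to test the logic on a
# host where the SAM software is not installed.
SAMFSCK=${SAMFSCK:-samfsck}

# Hypothetical helper: check a file system and repair it only if the check
# reports inconsistencies.
# Assumptions: "-F" requests a repair, a non-zero exit status from the
# check-only run means inconsistencies, and the file system is unmounted.
# Usage: check_then_repair <family-set-name>
check_then_repair() {
    fs=$1
    if "$SAMFSCK" "$fs"; then
        echo "$fs: no inconsistencies reported"
    else
        echo "$fs: inconsistencies reported, running repair"
        "$SAMFSCK" -F "$fs"
    fi
}
```

For example, `check_then_repair samfs1` would run the check on samfs1 and call the repair only when needed.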

How to Restore or Build New Library Catalog Files

  1. Restore the most recent library catalog file copies from the removable media files, from Oracle's Sun StorageTek SAM server disks, or from the most recent file system archive copies.
  2. If the library catalogs are unavailable, build new catalogs by using the command and the library catalog section of the most recent SAMreport as input.
    Use the newest library catalog copy available for each automated library.
    Note -
    Sun SAM systems automatically rebuild library catalogs for SCSI-attached automated libraries. This does not occur for ACSLS-attached automated libraries. Tape usage statistics are lost.

How to Make New File Systems and Restore From samfsdump Output

Follow these steps for file systems that were partially or completely resident on disks that were replaced or reformatted.

  1. Obtain the most recent copy of the samfsdump(1M) output file.
  2. Make a new file system and restore the file system using the samfsdump output file.
  3. Use the sammkfs(1M) command to make a new file system.

    # mkdir /sam1
    # sammkfs samfs1
    # mount samfs1

  4. Use the samfsrestore(1M) command with the -f and -g options, using the following syntax:

    samfsrestore -f <output-file-location> -g <log-file>

    • output-file-location is the location of the samfsdump output file.
    • log-file is the path name of the new log file that will list all the files that were online.
      For example:

      # cd /sam1
      # samfsrestore -f /dump_sam1/dumps/040120 -g /var/adm/messages/restore_log

      Note -
      Once all file systems have been restored, the system can be made available to users in degraded mode.

  5. On the file systems you have just restored, perform the following steps:
    1. Run the script against the log file, and stage all files that were known to be online before the outage. In a shared environment, this script must be run on the metadata server.
    2. Run the sfind(1M) command against the file system to determine which files are labeled as damaged.
      These files might or might not be restorable from tape, depending on the content of the archive log files. Determine the most recently available archive log files from one of the following sources, in this order:
      • The removable media file.
      • The Sun SAM server disk.
      • The most recent file system archive. This source is likely to be slightly outdated.
    3. Run the grep(1) command against the most recent archive log file to search for the damaged files.
      This will enable you to determine whether any of the damaged files were archived to tape after the last time the samfsdump(1M) command was run.
    4. Examine the archive log files to identify any archived files that do not exist in the file system.
  6. Use the star(1M) command to restore the damaged and nonexistent files identified in substeps 3 and 4 of Step 5.
  7. Reimplement disaster recovery scripts, methods, and cron(1M) jobs using information from the backup copies.
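The grep(1) search in Step 5 can be run for all damaged files at once. The following is a minimal sketch that assumes you have saved the damaged file names to a text file, one per line. The function name find_archived_copies is hypothetical. The match is a plain substring match, so check how your archiver log records path names (for example, relative to the mount point) and write the list in the same form.

```shell
#!/bin/sh
# Hypothetical helper: print every archiver log line that mentions one of
# the damaged files.
# Usage: find_archived_copies <damaged-list-file> <archiver-log-file>
find_archived_copies() {
    damaged_list=$1
    archiver_log=$2
    # An empty line in a pattern file matches every log line, so refuse
    # lists that contain one.
    if grep -q '^$' "$damaged_list"; then
        echo "error: $damaged_list contains an empty line" >&2
        return 2
    fi
    # -F treats each name as a fixed string rather than a regular
    # expression; -f reads the names from the list file.
    # Note: a name also matches longer names that contain it.
    grep -F -f "$damaged_list" "$archiver_log"
}
```

The function returns 0 when at least one log line matches and 1 when none does, which follows the exit status of grep.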