This discussion is archived
4 Replies Latest reply: Jan 16, 2013 11:15 AM by Marty

How to fix: Deleting snapshot causes kernel panic

Marty Newbie
I have spent the last 16 hours or so trying to get a system somewhat stable. The system runs Solaris 11.1 with three ZFS pools on Fibre Channel drives. After a power failure, the machine started crashing at random with:

BAD TRAP: type=e (#pf Page fault) rp=fffffffc8f133610 addr=30 occurred in module "zfs" due to a NULL pointer dereference

I narrowed it down to (at least) one time-slider monthly snapshot. When I tried to delete that snapshot, the machine crashed. A scrub of the pool returned zero errors. I did make a backup of the filesystem, but when I tried to recursively destroy the filesystem, the machine crashed again. For now I have disabled time-slider and renamed both the filesystem and the snapshot.

Must I back up and destroy the pool? Right now a backup is being taken of all of the filesystems in this pool.

In my many years of using Solaris, I have never seen anything so debilitating. To be fair, so far the data appears to be intact.

Thanks,
Marty
  • 1. Re: How to fix: Deleting snapshot causes kernel panic
    800381 Explorer
    Google for uses of zdb. You might find something useful.

    http://docs.oracle.com/cd/E23823_01/html/816-5166/zdb-1m.html
    Description

    The zdb command is used by support engineers to diagnose failures and gather statistics. Since the ZFS file system is always consistent on disk and is self-repairing, zdb should only be run under the direction by a support engineer.

    If no arguments are specified, zdb, performs basic consistency checks on the pool and associated datasets, and report any problems detected.

    Any options supported by this command are internal to Sun and subject to change at any time.
    Remember - "the ZFS file system is always consistent on disk and is self-repairing".

    Always. </SARCASM>
  • 2. Re: How to fix: Deleting snapshot causes kernel panic
    Marty Newbie
    Thanks for the tip. I am running zdb now against the pool. I don't know if it normally takes 20 minutes or 20 months to complete. Hopefully it will find something useful.
  • 3. Re: How to fix: Deleting snapshot causes kernel panic
    Marty Newbie
    I think my pool is pretty much borked because zdb tells me:

    Traversing all blocks to verify metadata checksums and verify nothing leaked ...
    zdb_blkptr_cb: Got error 50 reading <0, 0, 0, 65> -- skipping
    zdb_blkptr_cb: Got error 50 reading <0, 3249, 1, 0> -- skipping

    I am trying to get a clean backup now of the remaining filesystems. This pool only has a few TB on it and it has not crashed yet today (fingers crossed). Sadly, this recovery is hampered by a strange set of timeouts every few minutes of the fibre disks similar to the following:

    Jan 16 10:49:57 dl585 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
    Jan 16 10:49:57 dl585 /scsi_vhci/disk@g20000014c350f580 (sd40): Command Timeout on path fp7/disk@w21000014c350f580,0

    Those only happen during replication, but they effectively freeze pool activity for a minute or so every few minutes.

    Worse, there is some strange nuance with the (Sun branded) nVidia 3D card when the machine reboots. It throws some error about invalid settings on the card, causing the console and Sun Rays to be unusable. The only way to get a clean boot is to power down then up, so crashes restart to an unusable server. This would be more palatable if each boot didn't take 8 minutes.

    This is going to be a long day...
  • 4. Re: How to fix: Deleting snapshot causes kernel panic
    Marty Newbie
    My bad, it seems I do have some shaky drives. Using 'iostat -E' and smartctl, I found the main offenders. I am running a background scan on the drives and the timeouts have calmed down.
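
For anyone triaging a similar failure, the two kinds of diagnostic lines quoted in this thread (the zdb "Got error" lines and the scsi_vhci "Command Timeout" lines) can be tallied with a short script. The following is only a sketch, not anything the posters ran. The function names and regular expressions are hypothetical and were written against the exact lines quoted above. Two assumptions are worth flagging: the four numbers inside the angle brackets are taken to be objset, object, level and block id, and the block id is kept as a string because it may be printed in hexadecimal. As far as I know, errno 50 on Solaris is EBADE, which ZFS reuses as its checksum-error code, but treat that reading as an assumption too.

```python
import re
from collections import Counter

# Sample lines copied verbatim from the thread above.
SAMPLE = """\
zdb_blkptr_cb: Got error 50 reading <0, 0, 0, 65> -- skipping
zdb_blkptr_cb: Got error 50 reading <0, 3249, 1, 0> -- skipping
Jan 16 10:49:57 dl585 scsi: [ID 243001 kern.warning] WARNING: /scsi_vhci (scsi_vhci0):
Jan 16 10:49:57 dl585 /scsi_vhci/disk@g20000014c350f580 (sd40): Command Timeout on path fp7/disk@w21000014c350f580,0
"""

# Hypothetical patterns, matched only against the lines quoted in the thread.
ZDB_RE = re.compile(
    r"zdb_blkptr_cb: Got error (\d+) reading "
    r"<(\d+), (\d+), (-?\d+), ([0-9a-fA-F]+)>"
)
TIMEOUT_RE = re.compile(r"\((sd\d+)\): Command Timeout on path (\S+)")


def parse_zdb_errors(text):
    """Return one dict per 'Got error' line found in zdb output."""
    errors = []
    for match in ZDB_RE.finditer(text):
        errno, objset, obj, level, blkid = match.groups()
        errors.append({
            "errno": int(errno),
            # ASSUMPTION: field order is objset, object, level, block id.
            "objset": int(objset),
            "object": int(obj),
            "level": int(level),
            # ASSUMPTION: may be hexadecimal, so it is left unconverted.
            "blkid": blkid,
        })
    return errors


def count_timeouts(text):
    """Count 'Command Timeout' messages per (device, path) pair."""
    return Counter(TIMEOUT_RE.findall(text))


if __name__ == "__main__":
    for err in parse_zdb_errors(SAMPLE):
        print(err)
    for (device, path), hits in count_timeouts(SAMPLE).most_common():
        print(device, path, hits)
```

Run against a full messages file, the per-device counts give a rough ranking of which disks time out most often, which can then be cross-checked against 'iostat -E' and smartctl as described in the last reply.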
