SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5

1001999 Newbie
I have a number of Sun Fire x4500s and x4540s running S11.1 SRU5.5. One of the x4540s recently started hitting kernel panics within minutes of powering on and sharing its file systems over NFS (it provides user directories for a computing cluster). If I don't share over NFS, the panics do not occur.

When I run mdb on the kernel panic coredump I get:
> ::panicinfo
cpu 1
thread fffffffc80e52c20
message
assertion failed: BSWAP_32(hdr->sa_magic) == SA_MAGIC, file: ../../common/fs/zfs/sa.c, line: 2079
[...]
> ::stack
vpanic()
0xfffffffffbac1081()
sa_get_field64+0x96(ffffc1c0aaf34a00, 18)
zfs_space_delta_cb+0x46(2c, ffffc1c0aaf34a00, fffffffc80e52840, fffffffc80e52848)
dmu_objset_userquota_get_ids+0xca(ffffc1c1085d9008, 0, ffffc1c080796640)
dnode_sync+0xa7(ffffc1c1085d9008, ffffc1c080796640)
dmu_objset_sync_dnodes+0x7f(ffffc1c0299a5b00, ffffc1c0299a5a80, ffffc1c080796640)
dmu_objset_sync+0x1e2(ffffc1c0299a5940, ffffc1c07f0d17c8, ffffc1c030756680, ffffc1c080796640)
dsl_dataset_sync+0x63(ffffc1c0418cd300, ffffc1c07f0d17c8, ffffc1c080796640)
dsl_pool_sync+0xdb(ffffc1c0285493c0, 2c720a)
spa_sync+0x395(ffffc1c01b1f6a80, 2c720a)
txg_sync_thread+0x244(ffffc1c0285493c0)
thread_start+8()
From some Google searching, I'm able to infer from similar lines of code in OpenSolaris (and various ZFS derivatives) that the point of this line is to ensure that a ZFS block has a particular header with the correct endianness. The lines just prior to it swap the endianness if it is wrong (at least in those derivatives). For example:

[ZFS corruption probably from bad RAM|https://github.com/zfsonlinux/spl/issues/157]
[FreeBSD kernel panic|http://lists.freebsd.org/pipermail/freebsd-stable/2012-July/068882.html]
[ZFSOnLinux Panic|https://github.com/zfsonlinux/zfs/issues/1303]
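
Paraphrasing what I see in those sources, the check looks roughly like this (this is just my reading of the open-source derivatives, not the actual Solaris code, so treat it as a sketch):

if (hdr->sa_magic == SA_MAGIC) {
        /* native byte order: use the header fields as-is */
} else {
        /* the only other legal value is the byteswapped magic, written by a
         * host of the opposite endianness; anything else means the on-disk
         * header is garbage, and this is the assertion that panics */
        ASSERT(BSWAP_32(hdr->sa_magic) == SA_MAGIC);
        /* ...then byteswap the header fields before using them... */
}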

So basically, this assertion ought to hold unless the header was written to disk incorrectly. I recently did a resilver after a disk failed, and I'm currently in the middle of a scrub with an estimated 90 h to go. The scrub has already repaired about 12 KB of errors, even though it started right after the resilver.

Nevertheless, I don't think this should be able to happen unless the redundant copies of this header were also corrupted.

I wrote a little script that loops over the process IDs and addresses in the core dump and runs this for each one:

echo $addr | awk '{print $1,"::pfiles"}' | mdb $DMPNO
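
In full, the loop is roughly the following (a sketch; my real script differs in the details, and ::walk proc is simply how I'm collecting the process addresses):

#!/bin/sh
# $DMPNO is the crash dump number under /var/crash/<hostname> (e.g. 0 for unix.0/vmcore.0)
DMPNO=0
# walk every proc_t in the dump and print its open files
echo '::walk proc' | mdb $DMPNO | while read addr; do
        echo "=== $addr ==="
        echo $addr | awk '{print $1,"::pfiles"}' | mdb $DMPNO
done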

The files that are open are all device files (e.g. /devices/pseudo/...) or things in /var, /etc, /lib, and the like. The user directories/files do not appear to be actively open in the core dump. That said, I cannot match the panicking thread above to a process ID, so maybe there are open files that are not captured in the core dump.

Since the machine will panic when I make NFS available, I am currently doing two things to try to resolve the problem:

(1) a zpool scrub
(2) "zfs send"ing a snapshot of every filesystem straight to /dev/null

The idea behind (2) is that eventually I'll read the bad block and force the kernel panic to happen. I'll infer from the log of the script which filesystem is bad and NFS share all the others.
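
Concretely, (2) is roughly the following (a sketch of what I'm running; the snapshot name and log path are just what I happened to pick, and export is the data pool here):

#!/bin/sh
# read every filesystem in the export pool via zfs send, logging progress so
# that if/when the panic comes, the log points at the bad filesystem
for fs in $(zfs list -H -o name -t filesystem -r export); do
        echo "$(date) starting $fs" >> /var/tmp/send_test.log
        zfs snapshot "$fs@panichunt"
        zfs send "$fs@panichunt" > /dev/null
        echo "$(date) finished $fs" >> /var/tmp/send_test.log
        zfs destroy "$fs@panichunt"
done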

You should assume that I can spell mdb but not much more than that; everything I've done so far has been mimicking things I found through Google searches.

Thanks for any advice the community has to help me fix the bad file/filesystem. I am happy to provide more details from mdb or whatever is needed.


Tom

  • 1. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Hi Tom,

    This is a known problem so I apologize.

    We think you are hitting a couple of issues related to bugs 15791909 and 15850819.
    You have the fix for 15791909, which added the assertion that your system is tripping over.

    Do you have a support contract?

    If so, we would recommend that you open a MOS service request to get this panic reviewed
    and also to get an IDR for 15850819.

    Thanks, Cindy
  • 2. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    Cindy:

    Glad that this is a known problem and might have a solution/diagnosis in an IDR.

    I am new to my post and am currently reviewing the state of our Oracle Support. I am certain that the support identifiers are not yet in my Oracle.com account.

    Would stepping back a release help me any? I have a boot environment with S11 SRU 10.5, but I have not tried booting into it since upgrading all zpools to version 34 and all ZFS file systems to version 6. I'm having a little trouble figuring out which versions SRU 10.5 supports.
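
    (For reference, this is how I confirmed the current versions; export is the data pool:)

    # zpool get version export
    # zfs get version export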

    Yours,

    Tom
  • 3. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Hi Tom,

    No, I don't think going back would help.

    If you upgraded your pools and file systems when you moved to S11.1, then you probably wouldn't be able to boot back to an earlier BE.

    I think opening a support case with Oracle is really the best resolution.

    Thanks, Cindy
  • 4. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    One more thing...

    Diagnosis via email is difficult, which is why you need a support expert to look at this panic.
    We suspect the bug I referenced; however, it looks like something else might be involved.

    Have you reviewed your FMA output? Do you have faults or errors accumulating from these commands:

    # fmadm faulty
    # fmdump
    # fmdump -eV | more
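
    (The -eV output can be overwhelming; the plain summary form, one line per error report, is easier to scan first:)

    # fmdump -e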

    Thanks, Cindy
  • 5. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    The output of fmdump is below. On March 27th, three disks were automatically identified as having too many errors and were pulled from their zpools. I replaced them on April 2nd (I believe) and, while they were resilvering, kernel panics started happening on April 6th. After several panic/reboot cycles, I disabled NFS and have not experienced any panics since. The resilver finished sometime on April 7th (I think), and I started a zpool scrub relatively soon thereafter.

    Mar 27 11:21:05.6894 2b5fca6d-8a55-6a16-d59a-86abf600ce1e DISK-8000-0X Diagnosed
    Mar 27 11:21:18.2224 335f323c-c723-e67f-c194-a346edde5f41 DISK-8000-0X Diagnosed
    Mar 27 11:21:43.8004 f515a18e-235d-c9e8-9b11-ff7284ed1b48 DISK-8000-0X Diagnosed
    Mar 31 12:27:08.8977 864387c6-9b50-6f90-d710-9e213a0ba327 SUNOS-8000-KL Diagnosed
    Apr 02 15:37:45.7040 2b5fca6d-8a55-6a16-d59a-86abf600ce1e FMD-8000-58 Updated
    Apr 02 15:37:45.8345 335f323c-c723-e67f-c194-a346edde5f41 FMD-8000-58 Updated
    Apr 02 15:37:45.9517 f515a18e-235d-c9e8-9b11-ff7284ed1b48 FMD-8000-58 Updated
    Apr 02 15:50:44.1012 2b5fca6d-8a55-6a16-d59a-86abf600ce1e FMD-8000-4M Repaired
    Apr 02 15:50:44.1429 2b5fca6d-8a55-6a16-d59a-86abf600ce1e FMD-8000-6U Resolved
    Apr 02 15:51:00.5162 f515a18e-235d-c9e8-9b11-ff7284ed1b48 FMD-8000-4M Repaired
    Apr 02 15:51:00.5614 f515a18e-235d-c9e8-9b11-ff7284ed1b48 FMD-8000-6U Resolved
    Apr 02 15:51:18.9518 335f323c-c723-e67f-c194-a346edde5f41 FMD-8000-4M Repaired
    Apr 02 15:51:18.9959 335f323c-c723-e67f-c194-a346edde5f41 FMD-8000-6U Resolved
    Apr 02 15:55:22.0946 1ccd6f62-7607-6f63-9069-9d947691df4b ZFS-8000-LR Diagnosed
    Apr 02 15:56:14.7847 f8d032c1-5d6e-e742-f01a-e2f38ccc6ce8 ZFS-8000-QJ Diagnosed
    Apr 02 15:59:05.6633 5564cf7e-3648-cd69-915f-9b6757509bae ZFS-8000-QJ Diagnosed
    Apr 02 15:59:05.8154 0214c0e7-59b6-47ca-c8c6-853d76713257 ZFS-8000-D3 Diagnosed
    Apr 02 16:01:14.5517 513db3ba-27a7-c8f2-f50e-8ef58c0fa71c ZFS-8000-QJ Diagnosed
    Apr 06 05:49:43.6917 eddb3a8f-9117-cfa8-8784-a1fceb242f7f SUNOS-8000-KL Diagnosed
    Apr 06 06:44:24.2562 6186711a-735a-67f7-9226-be419378c70e SUNOS-8000-KL Diagnosed
    Apr 06 07:20:45.2416 33b36e54-c2cd-ee0e-b370-821a00d342fe SUNOS-8000-KL Diagnosed
    Apr 06 08:05:29.7940 a7f6d49b-6b15-ef5a-ca58-911e2c6b1196 SUNOS-8000-KL Diagnosed
    Apr 06 08:28:06.7965 e4eea2e9-c800-e1c9-bd53-c45af9b616b0 SUNOS-8000-KL Diagnosed
    Apr 06 08:56:21.9017 a51779c9-549a-4334-f102-d26ff958c566 SUNOS-8000-KL Diagnosed
    Apr 06 09:26:33.7999 25e4e137-7962-e70f-ea91-f77dfdd8238b SUNOS-8000-KL Diagnosed
    Apr 06 09:47:56.5975 2d416320-f69e-e5c5-ea66-cafab1997546 SUNOS-8000-KL Diagnosed
    Apr 06 10:19:35.7141 ae4e7ef4-bbf5-e80a-8e30-8a99bafbc352 SUNOS-8000-KL Diagnosed
    Apr 06 11:07:14.7489 abc3c89c-d9f9-ee00-e546-99cbff435810 SUNOS-8000-KL Diagnosed
    Apr 06 11:51:31.1321 7aa0ab5e-d54c-4201-e56f-e8a502bb063e SUNOS-8000-KL Diagnosed
    Apr 06 12:17:42.4073 46d85a95-8f15-c0ea-e061-f241628d124a SUNOS-8000-KL Diagnosed
    Apr 06 12:48:15.4996 d16c0c54-33d5-c463-dae8-fc8cc82e517b SUNOS-8000-KL Diagnosed
    Apr 06 13:14:45.4799 021e8659-5359-edea-c7e9-8551c7de6d2c SUNOS-8000-KL Diagnosed
    Apr 06 13:51:56.5406 89620151-c97f-4256-8347-f412ee416282 SUNOS-8000-KL Diagnosed
    Apr 06 14:11:47.2594 b9923bc1-46d6-4fb6-ec7d-f3a1e0076419 SUNOS-8000-KL Diagnosed
    Apr 07 12:54:49.3759 513db3ba-27a7-c8f2-f50e-8ef58c0fa71c FMD-8000-4M Repaired
    Apr 07 12:54:49.4121 513db3ba-27a7-c8f2-f50e-8ef58c0fa71c FMD-8000-6U Resolved
    Apr 07 12:54:49.5092 0214c0e7-59b6-47ca-c8c6-853d76713257 FMD-8000-4M Repaired
    Apr 07 12:54:49.5368 0214c0e7-59b6-47ca-c8c6-853d76713257 FMD-8000-6U Resolved
    Apr 07 12:54:49.5864 5564cf7e-3648-cd69-915f-9b6757509bae FMD-8000-4M Repaired
    Apr 07 12:54:49.6064 5564cf7e-3648-cd69-915f-9b6757509bae FMD-8000-6U Resolved
    Apr 07 12:54:49.6367 f8d032c1-5d6e-e742-f01a-e2f38ccc6ce8 FMD-8000-4M Repaired
    Apr 07 12:54:49.6498 f8d032c1-5d6e-e742-f01a-e2f38ccc6ce8 FMD-8000-6U Resolved
    Apr 07 12:54:49.7409 1ccd6f62-7607-6f63-9069-9d947691df4b FMD-8000-4M Repaired
    Apr 07 12:54:49.7585 1ccd6f62-7607-6f63-9069-9d947691df4b FMD-8000-6U Resolved

    The outputs of fmadm faulty and fmdump -eV are quite long:

    https://pantherfile.uwm.edu/downes/public/fmdump_ev.txt
    https://pantherfile.uwm.edu/downes/public/fmadm_faulty.txt

    I note a number of events on April 9th in the output of fmdump -eV. Maybe these are the scrub fixing things?

    Tom
  • 6. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Yes, FMA is quite verbose. I was looking for something like CPU or memory issues, but I don't see anything.
    If the export pool is the pool with the panic, then I see the disk problems. Is this pool redundant? Somewhat
    concerning are the ereport...zfs.corrupt.data entries with April dates; that corruption can be recovered on a
    redundant pool, but not on a non-redundant one. If this pool is available, can you provide the zpool status -v output?

    Thanks, Cindy
  • 7. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    The zpool is healthy and in the middle of a scrub. It has found 12K (KB?) of repairable errors. I think that's the source of some of the fmdump -eV messages, since the scrub probably started on the 7th.

    https://pantherfile.uwm.edu/downes/public/zpool_status.txt

    Tom
  • 8. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Hi Tom,

    Besides a non-redundant pool, a very large RAIDZ pool is my next least favorite config.

    Your export pool has RAIDZ2 redundancy (better than RAIDZ1), which means each RAIDZ2 disk grouping (or RAIDZ2 VDEV) can withstand the failure of only *2* disks. I see 2 disks failed from one grouping, so you should be okay, but after the scrub completes, clear the pool error with zpool clear and then run zpool status -v to see whether this pool has any data corruption. (Corruption could explain why the panic looks worse than the known bug, in which case hitting this assertion is a blessing in disguise.)

    Does this pool perform well? We recommend only 3-9 disks per RAIDZ VDEV. You only have one spare, which is worrisome for such a large pool. Always have good backups. Monitor this pool closely. I would recommend a smaller config (2 mirrored pools) or RAIDZ3 for such a large pool.
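
    For example (the device names here are made up), a layout along these lines keeps each VDEV small and leaves more spares:

    # zpool create tank \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        spare c2t0d0 c2t1d0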

    See our best practices here:

    http://docs.oracle.com/cd/E26502_01/html/E29007/practice-1.html#scrolltoc

    Thanks, Cindy
  • 9. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    Cindy:

    I've read that page as I configure new zpools. The previous sysadmin was unaware of these practices and, so far as I know, Oracle is not working on the ability to shrink a zpool (consider this another vote in favor of prioritizing that). So basically I'm stuck with this thumper as it came to me. It would take quite a bit of work and downtime to shuffle the data around in a way that would still give our users acceptable performance. That may be work that needs to be done, but it is lower priority than other issues in the data center.

    It does take next to forever to resilver/scrub. The 2-disk failure in one RAIDZ2 group was resolved with new disks. I will watch the post-resilver scrub complete, but it's unclear to me whether there's any meaningful way to determine, without actually re-enabling NFS, whether the bug from my first message will bite me as soon as I do.

    Still working on figuring out the purchase history, etc. to determine support. I came into this position a few months ago.

    Tom
  • 10. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Okay, thanks for the update. It's really important that you run zpool status -v to be sure that corrupted data is not making this problem worse before you make your NFS shares available. We are working on device removal, but I understand if reconfiguring this pool is too much. 5 TB is a lot to back up, but you need more safety nets in addition to monitoring this pool weekly. With only one spare on this pool, you should make sure you have extra disks on a shelf. Make sure HBA/disk firmware is up to date. Let us know what zpool status -v turns up.

    Thanks, Cindy
  • 11. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    Are we looking at the same zpool? It's 35TB of data.

    I do have extra disks and an hourly monitor of zpool health. So overall I think I'm at least in an acceptable place even though I agree with all your comments regarding zpool vdevs and having more spares.

    I happen to have had another disk failure so it's in the middle of a resilver. The spare went in automatically.

    When I run "zpool status -v" you are asking me to look for corrupted files after it lists all the members of the pool? I have never seen any corruptions listed on this pool at any point. I have seen these on other systems and took the time to rebuild the pool.

    Tom
  • 12. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    FWIW, this was the disk that was just automatically removed:

    https://pantherfile.uwm.edu/downes/public/fmdump_failure.txt
  • 13. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    cindys Pro
    Yes, I see now, a 35TB pool. Always have good backups because if more than 2 disks fail in one VDEV, your data might not be recoverable.

    If you have corrupt data in a pool, you'll see it in zpool status -v output, after the device info, like this example:

    http://docs.oracle.com/cd/E26502_01/html/E29007/gbbwl.html#scrolltoc

    Repairing a Corrupted File or Directory
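
    That is, at the bottom of the zpool status -v output you would see a section like this (the file name is made up):

    errors: Permanent errors have been detected in the following files:

            /export/home/someuser/somefile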

    My concern for your pool is that the fmdump -eV output shows corrupted data on the failed devices. It could be that the corruption was resolved because your pool is redundant, but I want to rule it out because your panic looks unusual.

    The fmdump_failure text you posted looks like a fmdump -eV error. Do you have fmadm faulty output that identifies the failed or failing disk?
    The fmdump -eV command logs (recoverable/unrecoverable) errors and fmadm faulty indicates whether the error (problem) is now an actual fault.

    Thanks, Cindy
  • 14. Re: SA_MAGIC header corrupt on ZFS causing kernel panic in S11.1 SRU5.5
    1001999 Newbie
    At no point has zpool status -v reported corrupted files on this pool. As you say, the lack of pool-level corruption is probably thanks to the RAIDZ2 redundancy, since there have been bad disks at various points over the last week.

    I've managed to update my account with the only support identifier that I am able to track down. I can see bug 15791909 but not 15850819.

    15791909 looks different from my issue, which is an SA_MAGIC header panic.

    This is the relevant output from fmadm faulty. The disk was automatically replaced by the spare which is already being resilvered. Still no corrupt files listed.

    I'm verifying the status of our backups as we speak...

    Apr 09 20:41:32 40fdac1a-24c2-eee4-9516-be7ddc099d93 DISK-8000-3E Critical
    Manufacturer : Sun-Microsystems
    Problem Status : isolated
    Diag Engine : eft / 1.16n
    System
    Manufacturer : unknown
    Name : unknown
    Part_Number : unknown
    Serial_Number : unknown

    System Component
    Manufacturer : Sun-Microsystems
    Name : Sun-Fire-X4540
    Part_Number : THORXATO

    ----------------------------------------
    Suspect 1 of 1 :
    Fault class : fault.io.scsi.cmd.disk.dev.rqs.derr
    Certainty : 100%
    Affects : dev:///:devid=id1,sd@n5000cca373de0e37//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
    Status : faulted and taken out of service

    FRU
    Location : "HD_ID_3"
    Manufacturer : unknown
    Name : unknown
    Part_Number : HITACHI-H7210CA30SUN1.0T-1036A43NAL
    Revision : JP4OA3CB
    Serial_Number : JPW9K0HD243NAL
    Chassis
    Manufacturer : Sun-Microsystems
    Name : Sun-Fire-X4540
    Part_Number : unknown
    Serial_Number : 1043AMR007
    Status : faulty

    Description : A non-recoverable hardware failure was detected by the device
    while performing a command.

    Response : The device may be offlined or degraded.

    Impact : The device has failed. The service may have been lost or
    degraded.

    Action : Use 'fmadm faulty' to provide a more detailed view of this event.
    Please refer to the associated reference document at
    http://support.oracle.com/msg/DISK-8000-3E for the latest service
    procedures and policies regarding this diagnosis.