4 Replies Latest reply on Feb 25, 2013 7:48 PM by Cindys-Oracle

    ZFS Integrity issues

      Hi All,

First I want to say sorry if my tone comes off a little jaded. I have LOVED Solaris since I converted to it last summer, but these stability issues are really eating at me. Late nights fixing servers are getting old.

I seem to be having trouble keeping my ZFS filesystems, and Solaris 11.1 in general, stable. I am running Solaris 11.1 on five servers: three of them primarily as SAN machines for ZFS and iSCSI, and the rest as application and virtual machine hosts.

Each of the machines has a different hardware configuration; some are similar, but all of them seem to have their quirks. The VM hosts tend to lock up their network adapters after a month or so of uptime; I don't know why, but a reboot fixes it. I have general issues with console keyboards on all the machines: they lock up randomly, and the screen acts as if the Enter key is being held down. Each computer has different chipsets and hardware, yet every Solaris 11.1 install exhibits this behavior. All of the machines also tend to lock up during kernel boot with the new Intel encryption acceleration (the AES-NI BIOS option) enabled. That would be nice for SSH'd zfs sends, but alas, I need to disable it.

The SANs... well, that's what this post is about, because losing data is scary, and I can't seem to keep data intact for any length of time on Solaris 11.1 ZFS anymore.

My two primary SANs have been the most problematic. The first one started giving me kernel panics and reboots within a few weeks of operation, and would randomly do so every few weeks after that. I replaced the CPU... nope. The RAM... nope. The motherboard... nope. GAH! I called Oracle, and they said that ZFS was building up too many errors. OK... but why does it have to kernel panic if it's my storage? So I bought a nice shiny new Areca 1320-8i SAS HBA, bought all new 10K RPM enterprise WD SAS drives, and upgraded to Solaris 11.1. Things went great for another couple of weeks. Then I ran a scrub...
       pool: tank
       state: DEGRADED
      status: One or more devices has experienced an error resulting in data
              corruption.  Applications may be affected.
      action: Restore the file in question if possible. Otherwise restore the
              entire pool from backup.
              Run 'zpool status -v' to see device specific details.
         see: http://support.oracle.com/msg/ZFS-8000-8A
        scan: scrub repaired 0 in 5h46m with 3 errors on Mon Feb 18 23:27:07 2013
              NAME        STATE     READ WRITE CKSUM
              tank        DEGRADED     0     0     3
                mirror-0  DEGRADED     0     0     6
                  c1t0d0  DEGRADED     0     0     6
                  c1t1d0  DEGRADED     0     0     6
                mirror-1  DEGRADED     0     0     4
                  c1t2d0  DEGRADED     0     0     4
                  c1t3d0  DEGRADED     0     0     4
                mirror-2  DEGRADED     0     0     2
                  c1t4d0  DEGRADED     0     0     2
                  c1t5d0  DEGRADED     0     0     2
                mirror-3  DEGRADED     0     0     6
                  c1t6d0  DEGRADED     0     0     6
                  c1t7d0  DEGRADED     0     0     6
                c5t0d0s1  ONLINE       0     0     0
                c5t0d0s0  ONLINE       0     0     0
What??!? I KNOW not all my drives can be bad. What's going on here? Everything is running fine; I can't find any corruption in the data that's there, and all my programs work great. So... what the heck? Everything in the machine has been completely switched out. I am afraid of doing anything to the array at this point for fear of taking down production.

So, while building another array for another application, I used a new Adaptec 6804e. It appeared to work great, but under load the machine started pausing every 30 minutes for about 3 minutes, and I kept seeing "IOP Reset successfully completed" in the kernel logs. So I tried another Adaptec card... same thing. I switched to an Areca 1320-4i: no issues. For now.

I am at my wits' end. I have tried every combination of hardware I can find, including several modern off-the-shelf SAS cards that support Solaris (there aren't that many available).

Am I just running into a problem with SAS HBA manufacturers having cruddy Solaris support? If so, this is kind of horrible; ZFS (and maybe zones, fast SMB, and iSCSI) is the primary reason I use Solaris. If the OS has no reliably supported SAS HBAs on the market, I can't use it. Is my only option to spend $20K on an Oracle/Sun server with similar hardware specs, just with Oracle's hardware?

Sorry, now I'm just ranting. Anyhow, would anyone possibly have insight into what the heck happened to my array, why it's still working, and what I can do to get it back into a less fragile state?


      Edited by: TomS on Feb 20, 2013 2:31 PM
        • 1. Re: ZFS Integrity issues
          Sounds very frustrating.

We run our own gear (Oracle/Sun hardware and disks) with mirrored ZFS pools, and I don't see problems where all devices are DEGRADED unless there is some catastrophic hardware failure. You are correct: data doesn't go bad overnight. Devices do, but not all at the same time; when they are all connected to the same array, the array itself looks suspect.

          You can use FMA to determine when the current problems started with this pool:

          # fmadm faulty
          # fmdump -eV | more
          # iostat -En

          This output should give you some ideas. Continuous H/W resets are not a good sign.
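For example, a small sketch that runs all three checks in one pass and summarizes which error-report classes are piling up (the pool name `tank` is from the original post, and the class-counting awk at the end is just my own convenience, not Oracle tooling):

```shell
#!/bin/sh
# Sketch: quick FMA triage for a misbehaving pool (Solaris only).
POOL=tank
if command -v fmdump >/dev/null 2>&1; then
  # Any diagnosed faults?
  fmadm faulty
  # Count raw error-report classes in the error log (skip the header line);
  # a pile of ereport.fs.zfs.checksum or ereport.io.* entries points at
  # the data path (HBA/cabling/driver) rather than the disks themselves.
  fmdump -e | awk 'NR > 1 { n[$NF]++ } END { for (c in n) print n[c], c }' | sort -rn
  # Per-device soft/hard/transport error totals:
  iostat -En
else
  echo "fmdump not available on this host; run this on the Solaris SAN itself."
fi
```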

Maybe someone else can suggest different hardware. Lots of people use non-Oracle hardware with ZFS, but I'm just not familiar with it.

          Thanks, Cindy
          • 2. Re: ZFS Integrity issues
A flaky host adapter (and the corresponding support in its driver)
seems to be a common source of problems with disk arrays.

A colleague had persistent, similar problems on Linux (with a RAID
array, however) until he changed the host adapter (and manufacturer).
The disks that had been flagged as broken turned out to be fine.

            • 3. Re: ZFS Integrity issues
Thanks. I am running this Areca 1320 in two other machines, and this is the first time I am encountering the issue (hopefully it's not the adapter). Nothing seems to want to just die outright anymore.

              Anyhow I noticed these three errors:
              errors: Permanent errors have been detected in the following files:
Any way for me to correct these? The output says metadata, so I am guessing it has something to do with the attributes on the file, like compression, etc. How would I identify which ZFS attributes are at issue here? If I can get rid of the three errors, I can at least get the array out of the degraded state.
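For reference, this is roughly how I pulled the list of damaged objects (just a sketch; `tank` is my pool, and the `-v` flag is what expands the error summary into individual entries):

```shell
#!/bin/sh
# Sketch: list the objects behind "Permanent errors have been detected".
POOL=tank
if command -v zpool >/dev/null 2>&1; then
  # 'zpool status -v' prints one line per damaged object: either a file
  # path, or an entry like "<metadata>:<0x...>" when pool metadata
  # (not a user file) is what's damaged.
  zpool status -v "$POOL"
else
  echo "zpool not available on this host; run this on the Solaris SAN itself."
fi
```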
              • 4. Re: ZFS Integrity issues
If the hardware is now stable, you might try zpool scrub and zpool clear until those errors are resolved, but they might not be: corrupted metadata is not always recoverable.

                This is worth a try but only if the hardware is stable.

                # zpool scrub tank
                # zpool clear tank

                If errors still exist, then try again:

                # zpool scrub tank
                # zpool clear tank
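If you want to script that retry, here is a rough sketch (it polls zpool status because Solaris 11.1's zpool scrub has no wait option; adjust the pool name and polling interval to taste):

```shell
#!/bin/sh
# Sketch: scrub, wait for completion, clear, then re-check for errors.
POOL=tank
if command -v zpool >/dev/null 2>&1; then
  zpool scrub "$POOL"
  # Poll until the "scan: scrub in progress" line disappears.
  while zpool status "$POOL" | grep -q "scrub in progress"; do
    sleep 60
  done
  # Clear the error counters, then see whether permanent errors remain.
  zpool clear "$POOL"
  zpool status -v "$POOL"
else
  echo "zpool not available on this host; run this on the Solaris SAN itself."
fi
```

Only run this against stable hardware; scrubbing through a flaky HBA can add errors rather than repair them.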

You might also review this section of the ZFS Admin Guide for other options:


                Thanks, Cindy