12 Replies Latest reply: Dec 17, 2012 7:58 PM by user2110550

    Extremely Disappointed and Frustrated with ZFS

    980054
      I feel like crying. In the last 2 years of my life ZFS has caused me more professional pain and grief than anything else. At the moment I'm anxiously waiting for a resilver to complete; I expect that my pool will probably fault and I'll have to rebuild it.

      I work at a small business, and I have two Solaris 11 servers which function as SAN systems, primarily serving virtual machine disk images to a Proxmox VE cluster. Each is fitted with an Areca 12-port SATA controller in JBOD mode and populated with WD Caviar Black 2TB drives (probably my first and biggest mistake was not using enterprise-class drives). One system is configured as a ZFS triple mirror and the other as a double mirror; both have 3 hot spares.

      About a year ago I got CKSUM errors on one of the arrays. I promptly ran zpool scrub on it, and stupidly decided to scrub the other array at the same time. The scrubs quickly turned into resilvers on both arrays as more CKSUM errors were uncovered, and as the resilvers continued the CKSUM counts rose until both arrays faulted and were irrecoverable. Irrecoverable metadata corruption was the error I got from ZFS. After 20+ hours of attempted recovery, trying to play back the ZIL, I had to destroy them both and rebuild everything. I never knew for certain what the cause was, but I suspected disk write caching on the controller and/or the use of a non-enterprise-class flash drive for the ZIL.

      In the aftermath I did extremely thorough checking of all devices. I checked each backplane port, each drive for bad sectors, the SATA controller's onboard RAM, and main memory, ran extended burn-in testing, etc. I then rebuilt the arrays without controller write caching and with no separate ZIL device. I also scheduled weekly scrubbing and scripted ZFS alerting.
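
      The weekly scrub is just a root crontab entry, something along these lines (the day and hour here are illustrative):

      <pre>
      # crontab entry (root): scrub the 'san' pool every Sunday at 03:00
      0 3 * * 0 /usr/sbin/zpool scrub san
      </pre>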

      Yesterday I got an alert from the controller on my array with the triple mirror about read errors on one of the ports. I ran a scrub, which completed, and then I proceeded to replace the drive I was getting read errors on. I offlined the old drive, inserted a brand new drive, and ran zpool replace. The resilver started fast, but the rate quickly dropped to 1.0MB/s and my controller began spitting out a multitude of SATA command timeout errors on the port of the newly inserted drive. Since the whole array had essentially frozen up, I popped the drive out and everything ran back at full speed, resilvering against one of the hot spares. The resilver soon started uncovering CKSUM errors similar to the disaster I had last year. Error counts rose, and now my system has dropped another drive in the same mirror set and is resilvering 2 drives in that set, with the third drive in the set showing 6 CKSUM errors. I'm afraid I'm going to lose the whole array again, as the only drive left in the set is showing errors as well. WTF?!

      So I suspect I have a bad batch of disks; however, why the heck did zpool scrub complete and show no errors? What is the point of ZFS scrub if it doesn't accurately uncover errors? I'm so frustrated that these types of errors seem to show up only during resilvers. I'm beginning to think ZFS isn't as robust as advertised...
        • 1. Re: Extremely Disappointed and Frustrated with ZFS
          980054
          Amazingly, the resilver completed and I didn't lose the pool. I'm not sure what to do now, however, as ZFS seems so damn fragile that me just giving the server the evil eye will probably crash it. Here is how the pool currently sits:


          <pre>
            pool: san
           state: DEGRADED
          status: One or more devices are faulted in response to persistent errors.
                  Sufficient replicas exist for the pool to continue functioning in a
                  degraded state.
          action: Replace the faulted device, or use 'zpool clear' to mark the device
                  repaired.
            scan: resilvered 189G in 0h37m with 0 errors on Fri Dec 14 13:11:16 2012
          config:

                  NAME           STATE     READ WRITE CKSUM
                  san            DEGRADED     0     0     0
                    mirror-0     DEGRADED     0     0     0
                      spare-0    DEGRADED     0     0     0
                        c6t0d0   DEGRADED     0     0    83  too many errors
                        c6t3d0   ONLINE       0     0     0
                      c6t4d0     ONLINE       0     0     6
                      spare-2    DEGRADED     0     0     0
                        c6t8d0   FAULTED      1 42.0K     0  too many errors
                        c6t7d0   ONLINE       0     0     0
                    mirror-1     ONLINE       0     0     0
                      c6t1d0     ONLINE       0     0     0
                      c6t5d0     ONLINE       0     0     0
                      c6t9d0     ONLINE       0     0     2
                    mirror-2     ONLINE       0     0     0
                      c6t2d0     ONLINE       0     0     0
                      c6t6d0     ONLINE       0     0     0
                      c6t10d0    ONLINE       0     0     0
                  spares
                    c6t3d0       INUSE     currently in use
                    c6t7d0       INUSE     currently in use
                    c6t11d0      AVAIL

          errors: No known data errors
          </pre>

          c6t8d0 was the disk that the controller reported read errors on, which started this whole thing. I replaced it, and then the replacement froze up and I removed it (that is the one you see here as faulted); the problem cascaded and took c6t0d0 offline. As you can see, c6t4d0 also shows errors, but fortunately it didn't surpass the error threshold during the resilver, which would have taken it offline and killed my pool.

          Let me reiterate: prior to the resilver, when the original c6t8d0 was online, I scrubbed this pool and no errors were reported, 0K repaired.

          Edited by: 977051 on Dec 14, 2012 2:44 PM
          • 2. Re: Extremely Disappointed and Frustrated with ZFS
            903009
            Sounds like you have a hardware problem somewhere that is giving ZFS fits.

            Memory (especially non-ECC RAM), the controller, HBA, or backplane,
            the mobo's host bus, or a flaky PSU
            can all cause problems similar to that.


            Since you say this happened during scrubs of 2 different pools, I suspect your RAM or HBA (the way it happened sounds totally voltage-related), and I wonder how full your disks are.
            • 3. Re: Extremely Disappointed and Frustrated with ZFS
              Cindys-Oracle
              I'm sorry that you are so frustrated. I see that you have scripted ZFS alerts, but I'm wondering
              if multiple hardware issues are occurring. FMA provides CSI-like data that you can review to see
              what the underlying causes are, such as:

              # fmadm faulty
              # fmdump
              # fmdump -eV | more

              A problem I see in the ZFS community is that admins are unaware that their hardware is failing,
              so it's important that you use the commands above to routinely monitor your gear and data.
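
              As a rough illustration only (the schedule and mail handling are up to you), a check like this can be
              run periodically from cron so that any diagnosed faults land in root's mailbox:

              <pre>
              #!/bin/sh
              # Hypothetical periodic FMA check: mail 'fmadm faulty' output to root
              # whenever it is non-empty (it prints nothing when there are no faults).
              FAULTS=`/usr/sbin/fmadm faulty 2>&1`
              if [ -n "$FAULTS" ]; then
                  echo "$FAULTS" | mailx -s "FMA faults on `hostname`" root
              fi
              </pre>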

              Let us know if the output from the above helps provide more data to find a root cause.

              That you are using triple mirrors is a very good idea, but the quality of your devices and hardware
              should equal or exceed the importance of your data.

              Thanks, Cindy
              • 4. Re: Extremely Disappointed and Frustrated with ZFS
                user2110550
                Actually, it would have been nice if the FM framework would, by default, deliver notices of real problems
                to root via a cron service. At least then you couldn't blame the system admin for being ignorant of them.
                ---Bob
                • 5. Re: Extremely Disappointed and Frustrated with ZFS
                  980054
                  Thanks for responding. Last year when I lost my pools, I lost two pools on two separate systems. I have two servers with identical hardware on two networks. After the crash I thoroughly checked all the hardware; well, I should say, I thought I did. Obviously there are still issues.

                  I think it's disk related, but it could be the PSU or maybe, as you suggested, a bad voltage regulator on the HBA. I pumped a lot of data through the HBA during testing, filling and checking the pool and intentionally removing drives to force resilvers, and everything checked out good. I put the system back into production in August and didn't have any issues with it until last Friday.

                  Another lapse in judgment was buying all the HDDs for my arrays at once. I bought 44 drives of the same model, at the same time, from the same vendor. 24 were used in these arrays, 12 in one and 12 in the other. Out of the 44 drives I've identified 14 that are bad. All were manufactured on the same day and have similar serial numbers. I thought I had weeded out all the bad drives (using badblocks -w on Linux) back in August, but I think I need to ditch the whole lot.
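
                  (For reference, that test was a destructive write pass over each disk, roughly as below; the device name is a placeholder.)

                  <pre>
                  # Destructive four-pattern write test; wipes the disk.
                  # Replace /dev/sdX with the drive under test.
                  badblocks -wsv /dev/sdX
                  </pre>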

                  The disks are not very full. The pool is at 5% capacity: 5.44TB usable, 283GB used.

                  Edited by: callen on Dec 17, 2012 9:54 AM
                  • 6. Re: Extremely Disappointed and Frustrated with ZFS
                    Cindys-Oracle
                    FMA does notify admins automatically through the smtp-notify service via mail notification to root. You
                    can customize this service to send notification to your own email account on any system.

                    The poster said he has scripted ZFS alerting, but I don't know if that means reviewing zpool status or
                    FMA data.

                    With ongoing hardware problems, you need to review FMA data as well. See the example below.

                    Rob Johnston has a good explanation of this smtp-notify service, here:

                    https://blogs.oracle.com/robj/entry/fma_and_email_notifications
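
                    As a rough sketch only (Rob's post has the full details, and the address here is just an example), the Solaris 11 setup looks something like this:

                    <pre>
                    # Enable the FMA notification service, then direct problem diagnoses
                    # to an email address of your choosing (address is only an example).
                    svcadm enable smtp-notify
                    svccfg setnotify problem-diagnosed mailto:admin@example.com
                    </pre>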

                    For the system below, I had to enable sendmail to see the failure notice in root's mail,
                    but that was it.

                    Thanks, Cindy

                    I failed a disk in a pool:

                    <pre>
                    # zpool status -v tank
                      pool: tank
                     state: DEGRADED
                    status: One or more devices are unavailable in response to persistent errors.
                            Sufficient replicas exist for the pool to continue functioning in a
                            degraded state.
                    action: Determine if the device needs to be replaced, and clear the errors
                            using 'zpool clear' or 'fmadm repaired', or replace the device
                            with 'zpool replace'.
                      scan: resilvered 944M in 0h0m with 0 errors on Mon Dec 17 10:30:05 2012
                    config:

                            NAME        STATE     READ WRITE CKSUM
                            tank        DEGRADED     0     0     0
                              mirror-0  DEGRADED     0     0     0
                                c3t1d0  ONLINE       0     0     0
                                c3t2d0  UNAVAIL      0     0     0

                    device details:

                            c3t2d0    UNAVAIL       cannot open
                            status: ZFS detected errors on this device.
                                    The device was missing.
                               see: http://support.oracle.com/msg/ZFS-8000-LR for recovery

                    errors: No known data errors
                    </pre>

                    Check root's email:

                    # mail
                    From noaccess@tardis.space.com Mon Dec 17 10:48:54 2012
                    Date: Mon, 17 Dec 2012 10:48:54 -0700 (MST)
                    From: No Access User <noaccess@tardis.space.com>
                    Message-Id: <201212171748.qBHHmsiR002179@tardis.space.com>
                    Subject: Fault Management Event: tardis:ZFS-8000-LR
                    To: root@tardis.space.com
                    Content-Length: 751

                    SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
                    EVENT-TIME: Mon Dec 17 10:48:53 MST 2012
                    PLATFORM: SUNW,Sun-Fire-T200, CSN: 11223344, HOSTNAME: tardis
                    SOURCE: zfs-diagnosis, REV: 1.0
                    EVENT-ID: c2cfa39b-71f4-638e-fb44-9b223d9e0803
                    DESC: ZFS device 'id1,sd@n500000e0117173e0/a' in pool 'tank' failed to open.
                    AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
                    IMPACT: Fault tolerance of the pool may be compromised.
                    REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
                    • 7. Re: Extremely Disappointed and Frustrated with ZFS
                      980054
                      Thanks for this info. It's very helpful, and I wasn't aware of fmadm/fmdump. I've been looking through the errors and all of them are from disk devices, including the PCI-E flash drive I was using as a ZFS cache device. I removed the cache device from my pool. It had been logging errors all the way back to September.

                      Why doesn't ZFS automatically take a cache device offline when it's experiencing errors? At the very least ZFS ought to show the cache device as degraded when the admin checks the pool's status. And again... why doesn't ZFS scrub uncover corrupted data? This alone is pushing me to ditch ZFS altogether. Any file system whose data integrity tool can run to completion and find zero errors even when errors exist belongs in the dump heap.

                      So basically I have to intentionally degrade my pool by removing a drive once a week in order to check and see if my data is sound? What a stupid system.
                      • 8. Re: Extremely Disappointed and Frustrated with ZFS
                        980054
                        If I continue using ZFS I'll certainly take your advice and include FMA alerts via smtp-notify. I scripted ZFS alerting by checking the output of 'zpool status' every 15 minutes and looking for anything abnormal.
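
                        It's nothing fancy; a simplified sketch of that kind of check (not my exact script) looks like this:

                        <pre>
                        #!/bin/sh
                        # Check pool health: 'zpool status -x' prints "all pools are healthy"
                        # when everything is fine, so mail anything else to root.
                        # Run from cron, e.g.: 0,15,30,45 * * * * /path/to/zfs-check.sh
                        STATUS=`/usr/sbin/zpool status -x 2>&1`
                        if [ "$STATUS" != "all pools are healthy" ]; then
                            echo "$STATUS" | mailx -s "zpool alert on `hostname`" root
                        fi
                        </pre>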
                        • 9. Re: Extremely Disappointed and Frustrated with ZFS
                          980054
                          I really appreciate all of you responders giving your time and knowledge to help me. I apologize for being emotional in my bashing of ZFS. I obviously have outstanding hardware issues that I need to flush out. I love the features ZFS offers and would like to continue to use it. But I'm having a really hard time right now getting past the fact that scrub either doesn't check every block, is buggy and doesn't properly check blocks, or just plain lies.

                          FMA gives me another important level of information that I should have been looking at all along. I feel dumb for not knowing about this. However, that's still not a good enough safety net knowing that scrub isn't thorough enough.

                          I admit my Solaris 11 knowledge is not in-depth. I've been a Solaris admin for many years but have only been running 11 on a few systems for a year or so, and most of my Solaris experience has been on Solaris 8/9. I'm the lone sysadmin at my company, so my knowledge is broad but not deep in any one area, as I wear too many hats. Hell, we are even still running OpenVMS on VAXen at my work.
                          • 10. Re: Extremely Disappointed and Frustrated with ZFS
                            Cindys-Oracle
                            A couple of clarifications around the data that is shared between FMA and ZFS:

                            1. The fmadm faulty and fmdump logs show problems that are severe enough
                            to become an actual fault. If it is a fault that is related to a ZFS pool, pool device,
                            or data, it's passed up to ZFS.

                            2. The fmdump -eV output identifies both faults and errors. The error data has
                            not yet become a fault. If your cache device had actual faults, you would see them
                            in your zpool status output. It sounds like it was just accumulating errors.

                            3. You don't have to degrade your pool devices to catch problems. Watch the
                            various FM data instead until you can determine what is causing the h/w problems.

                            4. Routine pool scrubs are good but in some cases, scrubbing a pool with multiple
                            h/w issues can cause more problems. If you suspect you are having hardware issues,
                            isolate and resolve those first, then scrub the pool.

                            5. A pool scrub does uncover corrupted data, but if multiple issues are occurring then
                            it's going to take another pass. See 4 above.

                            Thanks, Cindy
                            • 11. Re: Extremely Disappointed and Frustrated with ZFS
                              980054
                              You are a wealth of information! I'm reading the man pages on fmadm and fmdump, and I'll pull up the documentation online on FMA in a few.

                              I realize all bets are off when dealing with faulty hardware. Yet it still seems rather unlikely to me that, with bad hardware, scrub wouldn't find any errors and then, a few minutes later during a resilver, a multitude of errors would surface. I get that stressing my hardware with a scrub might cause more problems, but given that, performing weekly scrubs on my bad hardware should have been finding lots of errors, especially if scrub was introducing new errors, since subsequent scrubs would find the new problems.

                              I suppose all the data corruption could have happened fairly recently. I'm just questioning if the hardware was sound months ago... This is essentially the same platform that bit the dust a year ago.
                              • 12. Re: Extremely Disappointed and Frustrated with ZFS
                                user2110550
                                cindys wrote:
                                FMA does notify admins automatically through the smtp-notify service via mail notification to root. You
                                can customize this service to send notification to your own email account on any system.
                                Ah, my apologies; I was looking at an older system that didn't have it. And reading Rob's article,
                                it's nice that the email can be customized through the service. More updates to do, I guess,
                                but it's worth it.

                                ---Bob