This discussion is archived
12 Replies · Latest reply: Dec 17, 2012 5:58 PM by user2110550

Extremely Disappointed and Frustrated with ZFS

980054 Newbie
I feel like crying. In the last 2 years of my life, ZFS has caused me more professional pain and grief than anything else. At the moment I'm anxiously waiting for a resilver to complete; I expect that my pool will probably fault and I'll have to rebuild it.

I work at a small business, and I have two Solaris 11 servers which function as SAN systems, primarily serving virtual machine disk images to a Proxmox VE cluster. Each is fitted with an Areca 12-port SATA controller in JBOD mode and populated with WD Caviar Black 2TB drives (probably my first and biggest mistake: not using enterprise-class drives). One system is configured as a ZFS triple mirror, the other as a double mirror; both have 3 hot spares.

About a year ago I got CKSUM errors on one of the arrays. I promptly ran zpool scrub on that array and, stupidly, decided to scrub the other array at the same time. The scrubs quickly turned into resilvers on both arrays as more CKSUM errors were uncovered, and as the resilvers continued the CKSUM counts rose until both arrays faulted and were irrecoverable. "Irrecoverable metadata corruption" was the error I got from ZFS. After 20+ hours of attempted recovery, trying to play back the ZIL, I had to destroy them both and rebuild everything. I never knew for certain what the cause was, but I suspected disk write caching on the controller and/or the use of a non-enterprise-class flash drive for the ZIL.

In the aftermath I did extremely thorough checking of all devices. I checked each backplane port, each drive for bad sectors, the SATA controller's onboard RAM, and main memory, and ran extended burn-in testing. I then rebuilt the arrays without controller write caching and with no separate ZIL device. I also scheduled weekly scrubbing and scripted ZFS alerting.
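For what it's worth, the weekly scrub is just a root crontab entry, roughly like this on each box (the pool name here is the 'san' pool shown below; the schedule is just what I happen to use):

<pre>
# run a scrub of the 'san' pool every Sunday at 02:00
0 2 * * 0 /usr/sbin/zpool scrub san
</pre>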

Yesterday I got an alert from the controller on the array with the triple mirror about read errors on one of the ports. I ran a scrub, which completed cleanly, and then proceeded to replace the drive that was getting read errors. I offlined the old drive, inserted a brand-new drive, and ran zpool replace. The resilver started fast, but the rate quickly dropped to 1.0MB/s and the controller began spitting out a multitude of SATA command timeout errors on the port of the newly inserted drive. Since the whole array had essentially frozen up, I popped the drive out and everything ran at full speed again, resilvering against one of the hot spares. The resilver soon started uncovering CKSUM errors similar to the disaster I had last year. Error counts rose, and now the system has dropped another drive in the same mirror set and is resilvering 2 drives in that set, with the third drive in the set showing 6 CKSUM errors. I'm afraid I'm going to lose the whole array again, as the only drive left in the set is showing errors as well. WTF?!?!?!?!?!?!
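For the record, the replacement sequence was essentially the following (c6t8d0 is the disk the controller complained about; the new drive went into the same bay, so no new device name was needed):

<pre>
# zpool offline san c6t8d0
(physically swap the drive in the same bay)
# zpool replace san c6t8d0
</pre>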

So I suspect I have a bad batch of disks. However, why the heck did zpool scrub complete and show no errors? What is the point of a ZFS scrub if it doesn't accurately uncover errors? I'm so frustrated that these types of errors seem to show up only during resilvers. I'm beginning to think ZFS isn't as robust as advertised...
  • 1. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    Amazingly, the resilver completed and I didn't lose the pool. I'm not sure what to do now, however, as ZFS is so damn fragile that just giving the server the evil eye will probably crash it. Here is how the pool currently sits:


    <pre>
      pool: san
     state: DEGRADED
    status: One or more devices are faulted in response to persistent errors.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Replace the faulted device, or use 'zpool clear' to mark the device
            repaired.
      scan: resilvered 189G in 0h37m with 0 errors on Fri Dec 14 13:11:16 2012
    config:

            NAME           STATE     READ WRITE CKSUM
            san            DEGRADED     0     0     0
              mirror-0     DEGRADED     0     0     0
                spare-0    DEGRADED     0     0     0
                  c6t0d0   DEGRADED     0     0    83  too many errors
                  c6t3d0   ONLINE       0     0     0
                c6t4d0     ONLINE       0     0     6
                spare-2    DEGRADED     0     0     0
                  c6t8d0   FAULTED      1 42.0K     0  too many errors
                  c6t7d0   ONLINE       0     0     0
              mirror-1     ONLINE       0     0     0
                c6t1d0     ONLINE       0     0     0
                c6t5d0     ONLINE       0     0     0
                c6t9d0     ONLINE       0     0     2
              mirror-2     ONLINE       0     0     0
                c6t2d0     ONLINE       0     0     0
                c6t6d0     ONLINE       0     0     0
                c6t10d0    ONLINE       0     0     0
            spares
              c6t3d0       INUSE     currently in use
              c6t7d0       INUSE     currently in use
              c6t11d0      AVAIL

    errors: No known data errors
    </pre>

    c6t8d0 was the disk the controller reported read errors on, which started this whole thing. I replaced it, and then the replacement froze up and I removed it (that's the one you see here as FAULTED); the problem cascaded and took c6t0d0 offline. As you can see, c6t4d0 also shows errors, but fortunately it didn't surpass the error threshold and get taken offline during the resilver, which would have killed my pool.

    Let me reiterate: prior to the resilver, when the original c6t8d0 was online, I scrubbed this pool and no errors were reported, 0K repaired.

    Edited by: 977051 on Dec 14, 2012 2:44 PM
  • 2. Re: Extremely Disappointed and Frustrated with ZFS
    903009 Newbie
    Sounds like you have a hardware problem somewhere that is giving ZFS fits.

    Memory (especially non-ECC RAM), the controller, HBA, or backplane,
    the mobo's host bus, or a flaky PSU
    can all cause problems like that.


    Since you say this happened during scrubs of 2 different pools, I suspect your RAM or HBA (the way it happened sounds totally voltage-related). I also wonder: how full are your disks?
  • 3. Re: Extremely Disappointed and Frustrated with ZFS
    cindys Pro
    I'm sorry that you are so frustrated. I see that you have scripted ZFS alerts, but I'm wondering
    if multiple hardware issues are occurring. FMA provides CSI-like data that you can review to see
    what the underlying causes are, such as:

    # fmadm faulty
    # fmdump
    # fmdump -eV | more

    A problem I see in the ZFS community is that admins are unaware that their hardware is failing,
    so it's important that you use the commands above to routinely monitor your gear and data.

    Let us know if the output from the above helps provide more data to find a root cause.

    Using triple mirrors is a very good idea, but the quality of your devices and hardware
    should be commensurate with the importance of your data.

    Thanks, Cindy
  • 4. Re: Extremely Disappointed and Frustrated with ZFS
    user2110550 Newbie
    Actually, it would have been nice if the FM framework would, by default, deliver notices of real
    problems to root via a cron service. At least then you couldn't blame the system admin for being ignorant of them.
    ---Bob
  • 5. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    Thanks for responding. Last year when I lost my pools, I lost two pools on two separate systems. I have two servers with identical hardware on two networks. After the crash I thoroughly checked all the hardware, or I should say, I thought I did. Obviously there are still issues.

    I think it's disk related, but it could be the PSU or, as you suggested, a bad voltage regulator on the HBA. I pumped a lot of data through the HBA during testing, filling and checking the pool and intentionally removing drives to force resilvers, and everything checked out good. I put the system back into production in August and didn't have any issues with it until last Friday.

    Another lapse in judgement I made was buying all the HDDs for my arrays at once. I bought 44 drives of the same model, at the same time, from the same vendor. 24 were used in these arrays, 12 in one and 12 in the other. Out of the 44 drives I've identified 14 that are bad. All were manufactured on the same day and have similar serial numbers. I thought I had weeded out all the bad drives (using badblocks -w on Linux) back in August, but I think I need to ditch the whole lot.
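    The burn-in I did on each drive was basically a destructive badblocks pass from a Linux box, something like this (the device name is just an example, and -w destroys everything on the disk):

    <pre>
    # badblocks -wsv /dev/sdb
    </pre>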

    The disks are not very full. The pool is at 5% capacity: 5.44TB usable, 283GB used.

    Edited by: callen on Dec 17, 2012 9:54 AM
  • 6. Re: Extremely Disappointed and Frustrated with ZFS
    cindys Pro
    FMA does notify admins automatically through the smtp-notify service via mail notification to root. You
    can customize this service to send notification to your own email account on any system.
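    For example, on Solaris 11 you can route FMA problem notifications to your own address with
    svccfg setnotify, roughly like this (substitute your own address; this is just a sketch):

    <pre>
    # svccfg setnotify problem-diagnosed,problem-updated mailto:admin@example.com
    # svcadm enable smtp-notify
    # svcadm enable sendmail
    </pre>

    The first command tags newly diagnosed and updated problems for mail delivery; the other two
    make sure the notification and mail services are running.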

    The poster said he has scripted ZFS alerting, but I don't know whether that means reviewing zpool status
    or FMA data.

    With ongoing hardware problems, you need to review FMA data as well. See the example below.

    Rob Johnston has a good explanation of this smtp-notify service, here:

    https://blogs.oracle.com/robj/entry/fma_and_email_notifications

    For the system below, I had to enable sendmail to see the failure notice in root's mail,
    but that was it.

    Thanks, Cindy

    I failed a disk in a pool:

    # zpool status -v tank
      pool: tank
     state: DEGRADED
    status: One or more devices are unavailable in response to persistent errors.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or 'fmadm repaired', or replace the device
            with 'zpool replace'.
      scan: resilvered 944M in 0h0m with 0 errors on Mon Dec 17 10:30:05 2012
    config:

            NAME        STATE     READ WRITE CKSUM
            tank        DEGRADED     0     0     0
              mirror-0  DEGRADED     0     0     0
                c3t1d0  ONLINE       0     0     0
                c3t2d0  UNAVAIL      0     0     0

    device details:

            c3t2d0    UNAVAIL   cannot open
            status: ZFS detected errors on this device.
                    The device was missing.
               see: http://support.oracle.com/msg/ZFS-8000-LR for recovery

    errors: No known data errors

    Check root's email:

    # mail
    From noaccess@tardis.space.com Mon Dec 17 10:48:54 2012
    Date: Mon, 17 Dec 2012 10:48:54 -0700 (MST)
    From: No Access User <noaccess@tardis.space.com>
    Message-Id: <201212171748.qBHHmsiR002179@tardis.space.com>
    Subject: Fault Management Event: tardis:ZFS-8000-LR
    To: root@tardis.space.com
    Content-Length: 751

    SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
    EVENT-TIME: Mon Dec 17 10:48:53 MST 2012
    PLATFORM: SUNW,Sun-Fire-T200, CSN: 11223344, HOSTNAME: tardis
    SOURCE: zfs-diagnosis, REV: 1.0
    EVENT-ID: c2cfa39b-71f4-638e-fb44-9b223d9e0803
    DESC: ZFS device 'id1,sd@n500000e0117173e0/a' in pool 'tank' failed to open.
    AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
    IMPACT: Fault tolerance of the pool may be compromised.
    REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
  • 7. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    Thanks for this info. It's very helpful, and I wasn't aware of fmadm/fmdump. I've been looking through the errors, and all of them are from disk devices, including the PCI-E flash drive I was using as a ZFS cache device. I removed the cache device from my pool; errors from it had been logged all the way back in September.
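    For reference, pulling the cache device was a simple zpool remove; the device name below is just a stand-in for my PCI-E flash device:

    <pre>
    # zpool remove san c8t0d0
    </pre>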

    Why doesn't ZFS automatically take a cache device offline when it's experiencing errors? At the very least, ZFS ought to show the cache device as degraded when the admin checks the pool's status. And again... why doesn't ZFS scrub uncover corrupted data? This alone is pushing me to ditch ZFS altogether. Any file system whose data integrity tool can run to completion and find zero errors when errors exist belongs on the scrap heap.

    So basically I have to intentionally degrade my pool by removing a drive once a week in order to check and see if my data is sound? What a stupid system.
  • 8. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    If I continue using ZFS I'll certainly take your advice and set up FMA alerts via smtp-notify. I scripted my ZFS alerting by checking the output of 'zpool status' every 15 minutes and looking for anything abnormal.
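    The check itself is nothing fancy, roughly the following, run from cron every 15 minutes (the recipient and paths are just how I happen to have it set up):

    <pre>
    #!/bin/sh
    # crude ZFS health check: mail root if any pool is not healthy
    status=`/usr/sbin/zpool status -x`
    if [ "$status" != "all pools are healthy" ]; then
        echo "$status" | mailx -s "ZFS alert on `hostname`" root
    fi
    </pre>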
  • 9. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    I really appreciate all of you responders giving your time and knowledge to help me. I apologize for being emotional in my bashing of ZFS. I obviously have outstanding hardware issues that I need to flush out. I love the features ZFS offers and would like to continue using it. But I'm having a really hard time right now getting past the fact that scrub either doesn't check every block, is buggy and doesn't properly check blocks, or just plain lies.

    FMA gives me another important level of information that I should have been looking at all along. I feel dumb for not knowing about this. However, that's still not a good enough safety net, knowing scrub isn't thorough enough.

    I admit my Solaris 11 knowledge is not in-depth. I've been a Solaris admin for many years, but I have only been running 11 on a few systems for a year or so, and most of my Solaris experience has been on Solaris 8/9. I'm the lone sysadmin at my company, so my knowledge is broad but not deep in any one area, as I wear too many hats. Hell, we are even still running OpenVMS on VAXen at my work.
  • 10. Re: Extremely Disappointed and Frustrated with ZFS
    cindys Pro
    A couple of clarifications around the data that is shared between FMA and ZFS:

    1. The fmadm faulty and fmdump logs show problems that are severe enough
    to become an actual fault. If a fault is related to a ZFS pool, pool device,
    or data, it's passed up to ZFS.

    2. The fmdump -eV output identifies both faults and errors. Error data has
    not yet risen to the level of a fault. If your cache device had actual faults, you would
    see them in your zpool status output; it sounds like it was just accumulating errors
    (see the example after this list).

    3. You don't have to degrade your pool devices to catch problems. Watch the
    various FM data instead until you can determine what is causing the h/w problems.

    4. Routine pool scrubs are good but in some cases, scrubbing a pool with multiple
    h/w issues can cause more problems. If you suspect you are having hardware issues,
    isolate and resolve those first, then scrub the pool.

    5. A pool scrub does uncover corrupted data, but if multiple issues are occurring then
    it may take another pass to find everything. See 4 above.
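    As a concrete example of point 2, you can narrow the error log to a time window to see
    whether a device has been quietly accumulating errors, something along these lines
    (the date is only an example):

    <pre>
    # fmdump -e -t 01Sep12
    # fmdump -eV -t 01Sep12 | more
    </pre>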

    Thanks, Cindy
  • 11. Re: Extremely Disappointed and Frustrated with ZFS
    980054 Newbie
    You are a wealth of information! I'm reading the man pages on fmadm and fmdump, and I'll pull up the documentation online on FMA in a few.

    I realize all bets are off when dealing with faulty hardware, yet it still seems rather unlikely to me that, with bad hardware, a scrub would find no errors and then a few minutes later a resilver would surface a multitude of them. I get that stressing my hardware with a scrub might cause more problems, but even so, the weekly scrubs on my bad hardware should have been finding lots of errors, especially if scrubbing was introducing new ones, since subsequent scrubs would have found them.

    I suppose all the data corruption could have happened fairly recently. I'm just questioning if the hardware was sound months ago... This is essentially the same platform that bit the dust a year ago.
  • 12. Re: Extremely Disappointed and Frustrated with ZFS
    user2110550 Newbie
    cindys wrote:
    FMA does notify admins automatically through the smtp-notify service via mail notification to root. You
    can customize this service to send notification to your own email account on any system.
    Ah, my apologies, I was looking at an older system that didn't have it. And reading Rob's article,
    it's nice that the email can be customized through the service. More updates to do, I guess,
    but it's worth it.

    ---Bob
