9 Replies. Latest reply: May 22, 2013 11:55 AM by cindys

ZPool performance during resilver

994707 Newbie
Hi everyone. Hoping someone can provide some guidance here.

We are trying to develop some low-end NAS devices using Solaris 11.1 to house virtual machines. Between the RAID configuration and the SSDs we have been able to obtain very solid performance under max load. The problem is when a drive fails.

Disk latency, more precisely read latency, climbs sharply. For example, we have one recently built unit that is not yet housing many machines. When a resilver starts, read latency jumps from 25ms to 235ms, which is a problem for VMs.

1.37TB used on a 14TB pool, almost 200GB to resilver.

Hardware controllers, in my experience, are able to rebuild a drive without a noticeable performance impact. I understand this is a different situation.

I am trying to locate a tunable or some other suggestion to help resolve this issue. I suppose the same can be said for scrubs. We currently do not perform scrubs because it brings the system to its knees.
  • 1. Re: ZPool performance during resilver
    cindys Pro
    Hi--

    Not sure what is going on yet but I have a couple of questions:

    1. Are you saying that this is a very large ZFS RAIDZ pool?

    2. If so, is the system busy when the resilver occurs?

    For overall best performance, a RAIDZ pool works best for large I/Os like streaming video.
    Mirrored pools perform best for random reads/writes.

    Thanks, Cindy
  • 2. Re: ZPool performance during resilver
    994707 Newbie
    This is a large RAID 10 pool consisting of 8 mirrors.

    Yes, the system is busy during resilver. Since it is a VMware storage pool, the VMs are always running.
  • 3. Re: ZPool performance during resilver
    994707 Newbie
    Checking back to see if anyone had any ideas. We know a resilver will reduce performance; we'd just like it not to kill performance.
  • 4. Re: ZPool performance during resilver
    cindys Pro
    Is this a mirrored ZFS pool or a ZFS pool that is sitting on top of a mirrored H/W config? What kind of hardware? Is the pool built directly on whole disks and not virtual disks?

    We have very large mirrored ZFS pools built on hardware arrays in JBOD mode for large build servers, lots of compiling and stuff and I don't think anyone notices when a disk resilvers. At least, they don't mention it.

    My nearby coworkers who know resilvering and scrubs best are in constant meetings this week and last week. I'm hesitant to hand out tuning info on this discussion list without knowing this stuff really well. However, you might start by tuning the zfs_resilver_delay parameter, which is the number of clock ticks to delay resilvering when other work is present. What concerns me is this: if you slow down resilvering, do you risk another disk failure while the first disk is still resilvering? The default is 2. 0 means ignore all other work and increase resilvering speed. You might try setting this to 4:

    set zfs:zfs_resilver_delay=4

    Really, your best approach is to file a MOS SR to get this investigated.

    Thanks, Cindy
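
For readers who want to try the setting above: `set module:variable=value` lines like this are normally placed in /etc/system and read at boot. What follows is a minimal sketch, assuming the zfs_resilver_delay tunable exists on your release as described in this reply. The helper name persist_resilver_delay is hypothetical (it is not a Solaris command), and the example writes to a scratch file so nothing on the live system changes until you have reviewed the result.

```shell
#!/bin/sh
# Sketch only: stage the /etc/system entry in a file for review.
# persist_resilver_delay is a hypothetical helper, not a Solaris command.
persist_resilver_delay() {
    target=$1   # file to append to (use a copy of /etc/system while testing)
    ticks=$2    # delay in clock ticks; the thread suggests trying 4

    # Refuse anything that is not a non-negative integer.
    case $ticks in
        ''|*[!0-9]*) echo "ticks must be a non-negative integer" >&2; return 2 ;;
    esac

    # Leave the file alone if the tunable is already set there.
    if grep '^set zfs:zfs_resilver_delay=' "$target" >/dev/null 2>&1; then
        echo "zfs_resilver_delay already set in $target" >&2
        return 1
    fi

    # /etc/system comment lines start with '*'.
    printf '* Throttle resilver I/O when other work is present\n' >> "$target"
    printf 'set zfs:zfs_resilver_delay=%s\n' "$ticks" >> "$target"
}

# Example run against a scratch file instead of the live /etc/system.
scratch=${TMPDIR:-/tmp}/system.resilver.$$
: > "$scratch"
persist_resilver_delay "$scratch" 4
cat "$scratch"
rm -f "$scratch"
```

The duplicate check is there because a second `set` line for the same variable makes it unclear which value is intended. Keep a backup of /etc/system before editing it, and expect the change to apply only after a reboot.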
  • 5. Re: ZPool performance during resilver
    994707 Newbie
    I understand, but if I am not mistaken, compiling is more CPU-intensive on the client than disk I/O-intensive on the storage side.

    This is a ZFS pool consisting of 8 mirrors. This is all physical, no virtual hardware. The pool is built directly on whole disks presented by an LSI HBA. No H/W RAID.

    Thank you for the delay parameter; this is exactly what I was looking for: something to slow it down just a bit. I am hesitant to change the default values, but I may need to test to see if going to 4 would help us. I am not so concerned about losing another disk during resilver as long as it is not the other disk in the same vdev, which I believe should be unlikely. Thank you for your help. Depending on how this goes, I may open that MOS SR.

    Another quick question. I've seen documentation stating resilver can be stopped but haven't found the command to do so. I know how to stop a scrub "zpool scrub -s".

    Edited by: 991704 on May 21, 2013 10:56 AM
  • 6. Re: ZPool performance during resilver
    cindys Pro
    Hi,

    A build server also has 1000s of files shared out over NFS.

    No, there is no way to stop resilvering. Because resilvering is a key component for resolving failed disks, I think it would be considered too dangerous to provide a shut off knob.

    If this is all bare metal, then I don't understand why resilvering is so painful.

    Do you see device errors collecting in any of the following output:

    # iostat -En
    # fmadm faulty
    # fmdump

    Failing that, I think you would get better assistance by opening an SR. There is a series of tunables that might help you. I'm just not experienced enough in this area to recommend them.

    Thanks, Cindy
  • 7. Re: ZPool performance during resilver
    cindys Pro
    A couple more things I forgot to mention:

    1. If you have a support contract then you might find MOS document ID 1496593.1 helpful.

    2. A google search of "zfs resilvering is slow" for ZFS community postings might be enlightening depending on your hardware.

    Thanks, Cindy
  • 8. Re: ZPool performance during resilver
    994707 Newbie
    Thanks again for your feedback.

    If resilvering cannot be stopped, then Oracle may want to clarify its documentation: http://docs.oracle.com/cd/E19082-01/817-2271/gbcus/index.html

    "Resilvering is interruptible and safe." I guess this is only in terms of an accidental interruption like a power outage.

    I also ran iostat -En and there are a lot of errors. A small portion is below.
    root@pop-tac-san02:~# iostat -En
    c9d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
    Model: WDC WD800AAJS-0 Revision: Serial No: WD-WCAV2N5 Size: 80.03GB <80025845760 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
    Illegal Request: 0
    c10d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
    Model: WDC WD800AAJS-0 Revision: Serial No: WD-WCAV2N4 Size: 80.03GB <80025845760 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
    Illegal Request: 0
    c7t21d0 Soft Errors: 0 Hard Errors: 11 Transport Errors: 23
    Vendor: ATA Product: WDC WD2001FASS-0 Revision: 0101 Serial No: WD-WMAUR0278223
    Size: 2000.40GB <2000398934016 bytes>
    Media Error: 9 Device Not Ready: 0 No Device: 2 Recoverable: 0
    Illegal Request: 12 Predictive Failure Analysis: 0
    c7t22d0 Soft Errors: 0 Hard Errors: 5 Transport Errors: 19
    Vendor: ATA Product: WDC WD2001FASS-0 Revision: 0101 Serial No: WD-WMAUR0371013
    Size: 2000.40GB <2000398934016 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 5 Recoverable: 0
    Illegal Request: 10 Predictive Failure Analysis: 0
    c7t23d0 Soft Errors: 0 Hard Errors: 7 Transport Errors: 28
    Vendor: ATA Product: WDC WD2001FASS-0 Revision: 0101 Serial No: WD-WMAY00096167
    Size: 2000.40GB <2000398934016 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 7 Recoverable: 0
    Illegal Request: 9 Predictive Failure Analysis: 0
    c7t24d0 Soft Errors: 0 Hard Errors: 4 Transport Errors: 16
    Vendor: ATA Product: WDC WD2001FASS-0 Revision: 0101 Serial No: WD-WMAY00129943
    Size: 2000.40GB <2000398934016 bytes>
    Media Error: 0 Device Not Ready: 0 No Device: 4 Recoverable: 0
    Illegal Request: 10 Predictive Failure Analysis: 0

    I'm no expert but I'm guessing this is causing my performance problems during resilver.

    Edited by: 991704 on May 22, 2013 10:34 AM
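
With sixteen or more disks, scanning `iostat -En` output by eye gets tedious. A minimal sketch of one way to list only the devices whose summary line shows non-zero hard or transport errors is below. It assumes the summary-line layout shown in the output above (device name, then the Soft, Hard, and Transport counters); flag_error_devices is a hypothetical helper name, and the sample input is copied from that output.

```shell
#!/bin/sh
# flag_error_devices is a hypothetical helper: it reads iostat -En style
# output on stdin and prints devices with non-zero hard or transport errors.
flag_error_devices() {
    awk '
        # Summary lines look like:
        #   c7t21d0 Soft Errors: 0 Hard Errors: 11 Transport Errors: 23
        # so field 7 is the hard count and field 10 the transport count.
        $2 == "Soft" && $3 == "Errors:" {
            if ($7 + $10 > 0)
                printf "%s hard=%d transport=%d\n", $1, $7, $10
        }'
}

# Sample summary lines taken from the output posted above.
flag_error_devices <<'EOF'
c9d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
c7t21d0 Soft Errors: 0 Hard Errors: 11 Transport Errors: 23
c7t22d0 Soft Errors: 0 Hard Errors: 5 Transport Errors: 19
EOF
```

On the sample input this prints c7t21d0 and c7t22d0 with their counts and skips c9d0. On a live system the same function could be fed with `iostat -En | flag_error_devices`.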
  • 9. Re: ZPool performance during resilver
    cindys Pro
    Ha...the joke's on me. I'll have that text fixed because it implies that you can stop a resilver. What it means is that if the system reboots or the power fails, resilvering will continue where it left off once the OS is running again.

    Are any of the iostat -En errors seen by FMA or ZFS?

    I know that updating the HBA/disk firmware can sometimes reduce the transport errors. I'm not sure about the hard errors. Which LSI HBA model is this?

    Thanks, Cindy
