
Large number of Transport errors on ZFS pool

994707 Newbie
This is sort of a continuation of thread:
Issues with HBA and ZFS

But since it is a separate question, I thought I'd start a new thread.

Because of a bug in 11.1, I had to downgrade to 10_U11. I am using an LSI 9207-8i HBA (SAS2308 chipset). I have no errors on my pools, but I consistently see errors when trying to read from the disks; they are always Retryable or Reset. All in all the system functions, but as I started testing I began seeing a lot of errors in iostat.

bash-3.2# iostat -exmn
extended device statistics ---- errors ----
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b s/w h/w trn tot device
0.1 0.2 1.0 28.9 0.0 0.0 0.0 41.8 0 1 0 0 1489 1489 c0t5000C500599DDBB3d0
0.0 0.7 0.2 75.0 0.0 0.0 21.2 63.4 1 1 0 1 679 680 c0t5000C500420F6833d0
0.0 0.7 0.3 74.6 0.0 0.0 20.9 69.8 1 1 0 0 895 895 c0t5000C500420CDFD3d0
0.0 0.6 0.4 75.5 0.0 0.0 26.7 73.7 1 1 0 1 998 999 c0t5000C500420FB3E3d0
0.0 0.6 0.4 75.3 0.0 0.0 18.3 68.7 0 1 0 1 877 878 c0t5000C500420F5C43d0
0.0 0.0 0.2 0.7 0.0 0.0 0.0 2.1 0 0 0 0 0 0 c0t5000C500420CE623d0
0.0 0.6 0.3 76.0 0.0 0.0 20.7 67.8 0 1 0 0 638 638 c0t5000C500420CD537d0
0.0 0.6 0.2 74.9 0.0 0.0 24.6 72.6 1 1 0 0 638 638 c0t5000C5004210A687d0
0.0 0.6 0.3 76.2 0.0 0.0 20.0 78.4 1 1 0 1 858 859 c0t5000C5004210A4C7d0
0.0 0.6 0.2 74.3 0.0 0.0 22.8 69.1 0 1 0 0 648 648 c0t5000C500420C5E27d0
0.6 43.8 21.3 96.8 0.0 0.0 0.1 0.6 0 1 0 14 144 158 c0t5000C500420CDED7d0
0.0 0.6 0.3 75.7 0.0 0.0 23.0 67.6 1 1 0 2 890 892 c0t5000C500420C5E1Bd0
0.0 0.6 0.3 73.9 0.0 0.0 28.6 66.5 1 1 0 0 841 841 c0t5000C500420C602Bd0
0.0 0.6 0.3 73.6 0.0 0.0 25.5 65.7 0 1 0 0 678 678 c0t5000C500420D013Bd0
0.0 0.6 0.3 76.5 0.0 0.0 23.5 74.9 1 1 0 0 651 651 c0t5000C500420C50DBd0
0.0 0.6 0.7 70.1 0.0 0.1 22.9 82.9 1 1 0 2 1153 1155 c0t5000C500420F5DCBd0
0.0 0.6 0.4 75.3 0.0 0.0 19.2 58.8 0 1 0 1 682 683 c0t5000C500420CE86Bd0
0.0 0.0 0.2 0.7 0.0 0.0 0.0 1.9 0 0 0 0 0 0 c0t5000C500420F3EDBd0
0.1 0.2 1.0 26.5 0.0 0.0 0.0 41.9 0 1 0 0 1511 1511 c0t5000C500599E027Fd0
2.2 0.3 133.9 28.2 0.0 0.0 0.0 4.4 0 1 0 17 1342 1359 c0t5000C500599DD9DFd0
0.1 0.3 1.1 29.2 0.0 0.0 0.2 34.1 0 1 0 2 1498 1500 c0t5000C500599DD97Fd0
0.0 0.6 0.3 75.6 0.0 0.0 22.6 71.4 0 1 0 0 677 677 c0t5000C500420C51BFd0
0.0 0.6 0.3 74.8 0.0 0.1 28.6 83.8 1 1 0 0 876 876 c0t5000C5004210A64Fd0
0.6 43.8 18.4 96.9 0.0 0.0 0.1 0.6 0 1 0 5 154 159 c0t5000C500420CE4AFd0
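
In that output, the s/w, h/w, and trn columns are soft, hard, and transport error counts. To rank the disks by transport errors, something like this works (a rough sketch; trn is the 13th whitespace-separated field in this layout):

bash-3.2# iostat -exmn | sort -rn -k13 | head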


Mar 12 2013 17:03:34.645205745 ereport.fs.zfs.io
nvlist version: 0
     class = ereport.fs.zfs.io
     ena = 0x114ff5c491a00c01
     detector = (embedded nvlist)
     nvlist version: 0
          version = 0x0
          scheme = zfs
          pool = 0x53f64e2baa9805c9
          vdev = 0x125ce3ac57ffb535
     (end detector)

     pool = SATA_Pool
     pool_guid = 0x53f64e2baa9805c9
     pool_context = 0
     pool_failmode = wait
     vdev_guid = 0x125ce3ac57ffb535
     vdev_type = disk
     vdev_path = /dev/dsk/c0t5000C500599DD97Fd0s0
     vdev_devid = id1,sd@n5000c500599dd97f/a
     parent_guid = 0xcf0109972ceae52c
     parent_type = mirror
     zio_err = 5
     zio_offset = 0x1d500000
     zio_size = 0xf1000
     zio_objset = 0x12
     zio_object = 0x0
     zio_level = -2
     zio_blkid = 0x452
     __ttl = 0x1
     __tod = 0x513fa636 0x26750ef1

I know all of these drives are not bad, and I have confirmed they are all running the latest firmware and the correct sector size, 512 bytes (ashift 9). I am thinking it is some sort of compatibility issue with this new HBA, but I have no way of verifying. Anyone have any suggestions?
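
For anyone wanting to verify the same things, roughly (a sketch; output formats vary by release):

bash-3.2# zdb | grep ashift     # ashift: 9 corresponds to 512-byte sectors
bash-3.2# fmdump -eV | less     # verbose FMA ereports like the one pasted above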

  • 1. Re: Large number of Transport errors on ZFS pool
    cindys Pro
    I don't know how else to advise you about these transport errors. I know that a customer with our x86 hardware updated
    the firmware of either the HBA or the devices and his transport errors were resolved. I do know that in the above output,
    zio_err = 5 is an I/O error (EIO).

    Are any of these errors impacting the pool? Are you seeing issues in zpool status -v output?
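
    For example (a minimal check; substitute your pool name for SATA_Pool — non-zero READ/WRITE/CKSUM counters
    or listed files would mean the pool itself is affected):

    bash-3.2# zpool status -v SATA_Pool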

    Thanks, Cindy
  • 2. Re: Large number of Transport errors on ZFS pool
    994707 Newbie
    Cindy, yes, during large data transfers. The pool has four 15K SAS drives configured as a ZFS RAID 10 for the log (write cache). During large transfers these four drives would fail.

    I ended up reloading the system with an older HBA (LSI 3081e) on S10 U11. The transport errors have been reduced significantly. The only time I have a problem now is if I do a scrub; a scrub will cause one of the drives to fail. The SAS drives have zero errors.
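
    Roughly, the failure can be reproduced by starting a scrub and watching the FMA error log from another shell (a sketch):

    bash-3.2# zpool scrub SATA_Pool
    bash-3.2# fmdump -e | tail     # one summary line per ereport, newest last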
  • 3. Re: Large number of Transport errors on ZFS pool
    994707 Newbie
    There must be something small I am missing. We have another system configured nearly the same (same server and HBA, different drives), and it functions fine. I've gone through the recommended storage practices guide. The only item I have not been able to verify is:

    "Confirm that your controller honors cache flush commands so that you know your data is safely written, which is important before changing the pool's devices or splitting a mirrored storage pool. This is generally not a problem on Oracle/Sun hardware, but it is good practice to confirm that your hardware's cache flushing setting is enabled."

    How can I confirm this? As far as I know these HBAs are simply HBAs. No battery backup. No on-board memory. The 9207 doesn't even offer RAID.

  • 4. Re: Large number of Transport errors on ZFS pool
    cindys Pro
    I don't know how to check whether it's disabled on your gear. I wonder if it's time to check all hardware and cables, maybe even switch parts out with the system that is running well. Also, this is a long shot, but make sure that you don't have anything in /etc/system that
    is causing this problem, like disabling ZFS cache flushes, which would be this: set zfs:zfs_nocacheflush = 1
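
    A quick way to check both the file and the live kernel value (the mdb read is a sketch and needs root; 0 means cache flushes are enabled):

    bash-3.2# grep zfs /etc/system
    bash-3.2# echo "zfs_nocacheflush/D" | mdb -k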

    Thanks, Cindy
  • 5. Re: Large number of Transport errors on ZFS pool
    994707 Newbie
    Cindy, thanks for your help. Your suggestions and comments have helped me a great deal. I wanted to let you know I found the problem. It occurred to me that we had switched out every piece of hardware except for the backplane. This backplane has three SAS-8087 ports: one for the HBA and two for daisy-chaining another disk array. I moved the cable from the HBA port to one of the expansion ports, and not a single error has been generated since.

    I have maxed out disk I/O as much as possible and run repeated scrubs. No errors anywhere.
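
    For anyone hitting the same symptom, the verification loop was essentially this (a sketch):

    bash-3.2# zpool scrub SATA_Pool
    bash-3.2# zpool status -v SATA_Pool            # confirm the scrub completes with 0 errors
    bash-3.2# iostat -exmn | awk '$13 + 0 > 0'     # any disks still logging transport errors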
  • 6. Re: Large number of Transport errors on ZFS pool
    cindys Pro
    Good work; your persistence finally paid off. This is the best news I have heard in two weeks. When you said that another system with a similar config was running well, I thought it was time to check cabling and so on.

    Thanks, Cindy
