But since it is a separate question, I thought I'd start a new thread.
Because of a bug in 11.1, I had to downgrade to 10_U11. I am using an LSI 9207-8i HBA (SAS2308 chipset). I have no errors on my pools, but I consistently see errors when trying to read from the disks. They are always Retryable or Reset. All in all the system functions, but as I started testing I am seeing a lot of errors in iostat.
I know these drives are not all bad, and I have confirmed they are all running the latest firmware and the correct sector size, 512 (ashift 9). I am thinking it is some sort of compatibility issue with this new HBA but have no way of verifying. Does anyone have any suggestions?
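For anyone wanting to narrow down which devices are accumulating errors, here is a minimal Python sketch that filters text in the style of `iostat -En` output for devices with nonzero counters. The sample text, device names, and counts are made up for illustration, and the sketch assumes the device name and the three counters sit on one line; adjust the pattern if your output wraps differently.

```python
import re

# Hypothetical sample in the style of Solaris `iostat -En` output.
# Device names, vendor strings, and counts are invented for illustration.
SAMPLE = """\
c0t0d0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: EXAMPLE Product: DISK-A Revision: 0001 Serial No: 000
c0t1d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 17
Vendor: EXAMPLE Product: DISK-B Revision: 0001 Serial No: 001
"""

# Assumes the device name and all three counters appear on a single line.
ERROR_LINE = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)"
)


def devices_with_errors(text):
    """Return {device: counters} for every device with a nonzero counter."""
    result = {}
    for line in text.splitlines():
        m = ERROR_LINE.match(line)
        if not m:
            continue  # skip vendor/size lines and anything unrecognized
        counts = {k: int(m.group(k)) for k in ("soft", "hard", "transport")}
        if any(counts.values()):
            result[m.group("dev")] = counts
    return result


if __name__ == "__main__":
    for dev, counts in devices_with_errors(SAMPLE).items():
        print(dev, counts)
```

Saving the real command output to a file and passing its contents to `devices_with_errors` before and after a hardware change gives a quick way to see whether the counters are still climbing.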
I don't know how else to advise you about these transport errors. I know of a customer with our x86 hardware who updated the firmware of either the HBA or the devices, and his transport errors were resolved. I do know that in the above output, zio_err = 5 is an I/O error.
Are any of these errors impacting the pool...are you seeing issues in zpool status -v?
Cindy, during large data transfers yes. The pool has (4) 15k SAS drives as a ZFS RAID 10 for the log write cache. During large transfers these four drives would fail.
I ended up reloading the system with an older HBA LSI 3081e on S10 U11. The transport errors have reduced significantly. The only time I have a problem now is if I do a scrub. A scrub will cause one of the drives to fail. The SAS drives have zero errors.
There must be something small I am missing. We have another system configured nearly the same (same server and HBA, different drives) and it functions. I've gone through the recommended storage practices guide. The only item I have not been able to verify is:
"Confirm that your controller honors cache flush commands so that you know your data is safely written, which is important before changing the pool's devices or splitting a mirrored storage pool. This is generally not a problem on Oracle/Sun hardware, but it is good practice to confirm that your hardware's cache flushing setting is enabled."
How can I confirm this? As far as I know these HBAs are simply HBAs. No battery backup. No on-board memory. The 9207 doesn't even offer RAID.
I don't know how to check whether it's disabled on your gear. I wonder if it's time to check all hardware and cables, maybe even switch stuff out with the system that is running well. Also, this is a long shot, but make sure that you don't have anything in /etc/system that is causing this problem, like disabling the ZFS cache flushes, which would be this: set zfs:zfs_nocacheflush = 1
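The /etc/system check can be sketched in a few lines of Python. This is only an illustration: the function name and the sample text are hypothetical, and it assumes that comment lines in /etc/system begin with an asterisk.

```python
import re

# Matches an active "set zfs:zfs_nocacheflush = <value>" line.
NOCACHEFLUSH = re.compile(r"^set\s+zfs:zfs_nocacheflush\s*=\s*(\S+)")


def nocacheflush_settings(system_text):
    """Return [(line_number, value)] for uncommented zfs_nocacheflush lines."""
    hits = []
    for lineno, raw in enumerate(system_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("*"):
            continue  # blank line or comment (assumed to start with '*')
        m = NOCACHEFLUSH.match(line)
        if m:
            hits.append((lineno, m.group(1)))
    return hits


# Hypothetical file contents for illustration only.
SAMPLE = """\
* set zfs:zfs_nocacheflush = 1
set zfs:zfs_nocacheflush = 1
set zfs:zfs_arc_max = 0x100000000
"""

if __name__ == "__main__":
    for lineno, value in nocacheflush_settings(SAMPLE):
        print(f"line {lineno}: zfs_nocacheflush = {value}")
```

On a real system you would pass in the contents of /etc/system instead of `SAMPLE`. An empty result means no active line sets the tunable; a value of 1 means ZFS cache flushes are disabled.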
Cindys, thanks for your help. Your suggestions and comments have helped me a great deal. I wanted to let you know I found the problem. It occurred to me that we had switched out every piece of hardware except for the backplane. This backplane has three SAS-8087 ports: one for the HBA and two for daisy-chaining another disk array. I moved the cable from the HBA port to one of the expansion ports, and not a single error has been generated since.
I have maxed out disk I/O, as much as possible, and ran repeated scrubs. No errors anywhere.
Good work, and your persistence finally paid off. This is the best news I have heard in 2 weeks. When you said that another system with a similar config was running well, I thought it was time to check cabling and so on.