11 Replies. Latest reply: Mar 13, 2013 9:41 AM by 994707

Issues with HBA and ZFS

994707 Newbie
I am attempting to build a NAS with Solaris 11.1 using an LSI SAS9207-8i 6Gb HBA. It is correctly using the mpt_sas driver.

The problem I am having: after building the server, creating the ZFS pools, and mirroring the boot drive, I perform a scrub. At this point rpool is the only pool with data. The scrub always comes back with some sort of error (read, write, cksum, or it fails one of the drives completely).

Scenario (after fresh install, boot drive is mirrored, second pool created using 22 additional disks)
1. Scrub rpool & Repair errors that are found.
2. Subsequent scrubs are clean.
3. Upload a 400MB ISO.
4. Run scrub
5. Errors are found and repaired
6. Subsequent scrubs are clean.
7. Copy same ISO to second pool
8. Run scrub
9. Errors are found and repaired
10. Subsequent scrubs are clean.
11. Upload a new file (450MB ISO).
12. Run scrub
13. Errors are found and repaired
14. Subsequent scrubs are clean.
15. Copy same ISO to second pool
16. Run scrub
17. Errors are found and repaired

If the uploaded file is larger, say a 1.5GB ISO, generally one of the disks in rpool will fail.

I have swapped out the motherboard, memory, HBA, HBA cable, and disks, run memtest, reinstalled about ten times, and tried different files to upload (doc, iso, pic), always with the same result. Either ZFS is a bad filesystem (which I highly doubt), I have bad hardware (not likely at this point), or this is software (probably driver) related. I have broached the issue with LSI and their response was "The issue is possible to be a driver problem. We do not have driver support for Solaris 11.1 and this could be an issue in performance with use of an LSI SAS HBA."

The card is listed in the HCL, and I have also verified the driver is correct, as the card uses the SAS2308 chip. Any new data (files, etc.) has errors after being uploaded. I have tried using an older 3Gb card, the SAS3081E-R, but the installer will not boot and only shows consistent errors communicating with disks. At this point I am out of ideas. Any help would be appreciated.
  • 1. Re: Issues with HBA and ZFS
    cindys Pro
    Hello again,

    This issue is starting to sound familiar to me, so gathering a bit more info would help isolate the problem.

    Does the fmdump -eV output include text like this?

    driver-assessment = fail
    un-decode-info = invalid-sense-data

    Thanks, Cindy
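    For anyone checking a long error log for the two markers quoted above, here is a minimal sketch in Python. It assumes the `fmdump -eV` output has been saved to a text file and that each ereport contains `key = value` lines, including a `class` line, as in typical output. The function name `summarize_ereports` is hypothetical, not part of any Solaris tool.

```python
from collections import Counter


def summarize_ereports(text):
    """Count (class, driver-assessment) pairs in saved `fmdump -eV` text.

    Also returns how many records carry `un-decode-info = invalid-sense-data`.
    Assumption: every record has a `class = ...` line before its other fields.
    """
    counts = Counter()
    invalid_sense = 0
    current_class = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(" = ")
        if not sep:
            continue  # timestamps, "nvlist version: 0", blank lines
        if key == "class":
            current_class = value
        elif key == "driver-assessment":
            counts[(current_class, value)] += 1
        elif key == "un-decode-info" and value == "invalid-sense-data":
            invalid_sense += 1
    return counts, invalid_sense
```

    To use it, save the log on the affected host (for example `fmdump -eV > /tmp/fmdump.out`), read the file into a string, and pass that string to the function. A large count of `fail` or `retry` assessments against one device path is the pattern asked about here.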
  • 2. Re: Issues with HBA and ZFS
    994707 Newbie
    The most recent round of file copies and scrubs left the system unbootable. I am performing a re-install of the OS now. Should have an answer within a couple hours.

    When should I run the command? Right after install or should I create the failures again (ZFS and/or Disk) then run the command?
  • 3. Re: Issues with HBA and ZFS
    994707 Newbie
    Cindy, you were right on. After install I ran the command. There are tons of these so I only copied in a couple. What does this mean?

    root@solaris:~# zpool status
    pool: rpool
    state: ONLINE
    scan: none requested
    config:

    NAME                       STATE   READ  WRITE  CKSUM
    rpool                      ONLINE     0      0      0
      c0t5000C500420CDED7d0s1  ONLINE     0      0      0

    errors: No known data errors

    Mar 04 2013 16:00:41.538745414 ereport.io.scsi.cmd.disk.tran
    nvlist version: 0
         class = ereport.io.scsi.cmd.disk.tran
         ena = 0xccda04969f00425
         detector = (embedded nvlist)
         nvlist version: 0
              version = 0x0
              scheme = dev
              cna_dev = 0x5135168e00000041
              device-path = /pci@0,0/pci8086,340e@7/pci1000,3020@0/iport@f0/disk@w5000c500420cded5,0
         (end detector)

         devid = id1,sd@n5000c500420cded7
         driver-assessment = fail
         op-code = 0x2a
         cdb = 0x2a 0x0 0x3 0x88 0xc 0x5c 0x0 0x7 0x43 0x0
         pkt-reason = 0x4
         pkt-state = 0x0
         pkt-stats = 0x10
         __ttl = 0x1
         __tod = 0x51351989 0x201c9a46



    Mar 05 2013 08:45:42.284527395 ereport.io.scsi.cmd.disk.dev.uderr
    nvlist version: 0
         class = ereport.io.scsi.cmd.disk.dev.uderr
         ena = 0x16d21f1a8600001
         detector = (embedded nvlist)
         nvlist version: 0
              version = 0x0
              scheme = dev
              cna_dev = 0x5136050200000018
              device-path = /pci@0,0/pci8086,340e@7/pci1000,3020@0/iport@f0/disk@w5000c500420cded5,0
              devid = id1,sd@n5000c500420cded7
         (end detector)

         devid = id1,sd@n5000c500420cded7
         driver-assessment = retry
         op-code = 0x2a
         cdb = 0x2a 0x0 0x2c 0x9 0x9b 0x76 0x0 0x0 0x5 0x0
         pkt-reason = 0x4
         pkt-state = 0x0
         pkt-stats = 0x10
         stat-code = 0x0
         un-decode-info = invalid-sense-data

         __ttl = 0x1
         __tod = 0x51360516 0x10f58b23
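    As a partial reading of the ereports above: op-code 0x2a is the SCSI WRITE(10) command, so both failing commands are writes. In a WRITE(10) CDB, bytes 2 through 5 hold the logical block address and bytes 7 and 8 hold the transfer length in blocks, both big-endian. The sketch below decodes the two `cdb` lines from this post; the function name `decode_write10` is hypothetical, and this only decodes the command, it does not diagnose the cause.

```python
def decode_write10(cdb_text):
    """Decode a WRITE(10) CDB printed as ten space-separated hex bytes."""
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")     # bytes 2-5: block address
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")  # bytes 7-8: transfer length
    return lba, blocks


# The two cdb values copied from the ereports in this post.
for text in ("0x2a 0x0 0x3 0x88 0xc 0x5c 0x0 0x7 0x43 0x0",
             "0x2a 0x0 0x2c 0x9 0x9b 0x76 0x0 0x0 0x5 0x0"):
    lba, blocks = decode_write10(text)
    print(f"WRITE(10) lba={lba} blocks={blocks}")
# prints:
# WRITE(10) lba=59247708 blocks=1859
# WRITE(10) lba=738827126 blocks=5
```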
  • 4. Re: Issues with HBA and ZFS
    cindys Pro
    I think there is a combination of issues: LSI says this driver is not supported in S11.1, and some change in S11.1 is causing very bad stability problems for this config. Please send me the output of this command:

    # prtconf -vD > /tmp/prtconf.out

    Send it to me directly:

    cindy.swearingen@oracle.com

    I want to add this info to the S11.1 bug, which I think is 15819341.

    Did you ever test this config on Solaris 11? You might consider going back to S11 to test this config or consider switching HBAs.


    Thanks, Cindy
  • 5. Re: Issues with HBA and ZFS
    993561 Newbie
    I've also had a couple of problems with my system (11.1) and LSI SAS cards (3080) with FW: 01.33.00.00; BIOS: 6.36.00.00 in Initiator-target mode.
  • 6. Re: Issues with HBA and ZFS
    994707 Newbie
    Info sent. How can I get version 11? The download page seems to offer only 11.1. I can get 10 easily enough.
  • 7. Re: Issues with HBA and ZFS
    Dave Miner Explorer
    If you have support, previous releases can be downloaded via support.oracle.com; the document ID is 1277964.1.
  • 8. Re: Issues with HBA and ZFS
    994707 Newbie
    Unfortunately I don't have a support contract. I've installed 10_U11 and will let you know how it goes.
  • 9. Re: Issues with HBA and ZFS
    994707 Newbie
    I was able to install Solaris 10 U11. After recreating all pools I did a variety of "stress tests" and was able to upload large amounts of data with no errors found during scrubs. This was definitely my problem. Thanks Cindys
  • 10. Re: Issues with HBA and ZFS
    cindys Pro
    Good to hear this is working out for you.

    I may be paranoid, but I have seen enough sad stories with problematic hardware or bad configs that you need to keep monitoring this config routinely to know this one is a keeper. Use zpool status, iostat -En, and the fm* commands routinely, as described here:

    http://docs.oracle.com/cd/E26505_01/html/E37384/gbbbb.html#scrolltoc
    Resolving General Hardware Problems

    ZFS best practices are here:

    http://docs.oracle.com/cd/E26505_01/html/E37384/practice-1.html#scrolltoc

    Thanks, Cindy
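    One way to make the routine `zpool status` check less manual is to scan its output for non-zero counters. The following is a rough sketch in Python, assuming the usual layout in which a `NAME STATE READ WRITE CKSUM` header is followed by one row per pool or device. The function name `devices_with_errors` is hypothetical, and counters are compared as text because large values may be abbreviated.

```python
def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter.

    Assumption: config rows follow a line starting with NAME and have at
    least five whitespace-separated fields (name, state, three counters).
    """
    bad = {}
    in_config = False
    for raw in status_text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            in_config = True           # header row of a config table
        elif fields[0] == "errors:":
            in_config = False          # summary line ends the table
        elif in_config and len(fields) >= 5:
            name, _state, read, write, cksum = fields[:5]
            if (read, write, cksum) != ("0", "0", "0"):
                bad[name] = (read, write, cksum)
    return bad
```

    Feeding it the text of `zpool status` after each scrub and alerting when the result is non-empty gives an early warning, but it is not a substitute for reading the full status, the `iostat -En` counters, and the fault manager logs.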
  • 11. Re: Issues with HBA and ZFS
    994707 Newbie
    Thanks. Unfortunately I'm not out of the woods yet. When I started using Sol 10 there were no ZFS errors, but later I found a large number of transport errors with iostat. I was eventually able to generate enough volume with a VM that I started seeing hardware errors and ZFS errors.
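    Transport errors like these can be tracked over time by pulling the counters out of saved `iostat -En` output. This is a minimal sketch in Python, assuming each device's first line has the form `<device> Soft Errors: N Hard Errors: N Transport Errors: N`. The function name `transport_error_counts` is hypothetical.

```python
import re

# Assumed shape of the per-device summary line in `iostat -En` output.
_ERR_LINE = re.compile(
    r"^(\S+)\s+Soft Errors:\s*(\d+)\s+Hard Errors:\s*(\d+)"
    r"\s+Transport Errors:\s*(\d+)"
)


def transport_error_counts(iostat_text):
    """Map device name -> transport error count; other lines are ignored."""
    counts = {}
    for line in iostat_text.splitlines():
        match = _ERR_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(4))
    return counts
```

    Comparing the counts from two runs taken before and after a heavy copy shows which devices accumulate transport errors under load, which is the symptom described here before ZFS itself reported anything.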
