
    Errors on new SAN disks (fabric login port failure)

    807557
      I posted something to the storage forum yesterday morning right before
      the switch over to the new format and I think my post was lost. I
      apologize if it later reappears and there are two posts of the same
      subject. In any case, here's my question again...

      We ran into an odd problem getting our machine to see some of our SAN
      disks. In the example below, we restricted it to a single disk. We
      have an x4200 running Solaris 10 x64 with Sun (QLogic) Fibre Channel
      cards (SG-XPCI1FC-QLC) connected through Sun-sold QLogic 5200 switches
      to a couple of StorCase JBODs with a bunch of Seagate disks
      (ST3300007FC Rev 3). We have the Sun drivers:
      root 16: modinfo | grep FC
      111 fffffffff04d2000  19918  58   1  fp (SunFC Port v20051108-1.68)
      113 fffffffff04ee000  18770  61   1  fcp (SunFC FCP v20051108-1.93)
      115 fffffffff0509000   9d80   -   1  fctl (SunFC Transport v20051108-1.50)
      116 fffffffff0512000  c9c48 119   1  qlc (SunFC Qlogic FCA v20051013-2.08)
      164 fffffffff06bb000   9670  59   1  fcip (SunFC FCIP v20051108-1.43)
      165 fffffffffbbb9ff0   5610  62   1  fcsm (Sun FC SAN Management v20051108)
      182 fffffffff0719000   4b90   -   1  zmod (RFC 1950 decompression routines)
      And a relatively recent version of the qlc driver:
      root 17: showrev -p | grep 119131
      Patch: 119131-14 Obsoletes: 119087-05 Requires:  Incompatibles:  Packages: SUNWfctl, SUNWfcip, SUNWfcmdb, SUNWfcp, SUNWfcsm, SUNWqlc
      We've zoned the two switches so that the x4200 can see the drives. We
      noticed that when we turned on the zoning for the disks, the following
      messages appeared in /var/adm/messages for each disk:
      Mar 31 20:41:45 pemsdc fctl: [ID 517869 kern.warning] WARNING: fp(0)::N_x Port with D_ID=202d2, PWWN=21000014c34f774b reappeared in fabric
      Mar 31 20:41:45 pemsdc qlc: [ID 308975 kern.warning] WARNING: qlc(0): login fabric port failed D_ID=202d2h, error=4009h
      Mar 31 20:41:45 pemsdc fp: [ID 517869 kern.info] NOTICE: fp(0): PLOGI to 202d2 failed state=Packet Transport error, reason=No Connection
      Mar 31 20:41:45 pemsdc fctl: [ID 517869 kern.warning] WARNING: fp(0)::PLOGI to 202d2 failed. state=e reason=5.
      Mar 31 20:41:45 pemsdc scsi: [ID 243001 kern.warning] WARNING: /pci@0,0/pci1022,7450@2/pci1077,132@1/fp@0,0 (fcp0):
      Mar 31 20:41:45 pemsdc  PLOGI to D_ID=0x202d2 failed: State:Packet Transport error, Reason:No Connection. Giving up
      Nevertheless, we went ahead and tried to configure the attachment
      point with the cfgadm command, and we saw the following:
      root 17: cfgadm -c configure c6
      cfgadm: Library error: report LUNs failed: 21000014c34f774b
      failed to configure ANY device on FCA port
      
      root 18: cfgadm -c configure c7
      cfgadm: Library error: report LUNs failed: 22000014c34f774b
      failed to configure ANY device on FCA port
      If we have more disks, then it succeeds for some disks and fails for
      others (we picked a case where it fails). For the disks that succeed
      we don't see the error messages in /var/adm/messages. At this point we
      can see the disks in the unconfigured state with cfgadm:
      root 20: cfgadm -al c6 c7
      Ap_Id                          Type         Receptacle   Occupant     Condition
      c6                             fc-fabric    connected    unconfigured unknown
      c6::21000014c34f774b           unavailable  connected    unconfigured failed
      c7                             fc-fabric    connected    unconfigured unknown
      c7::22000014c34f774b           unavailable  connected    unconfigured failed
      But they obviously aren't usable, so format, luxadm, vxdisk, etc.
      can't see them. Has anybody seen this before? What is the 'login
      fabric port' step doing? And what does the 'report LUNs failed'
      response from cfgadm mean? Is it the switch that's not allowing the
      LUNs to be reported, or is the disk not reporting its LUNs correctly?
      Thanks.
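
      In case it's relevant, here's a sketch of how one can ask the stack
      what it sees per port (the devctl path is an assumption based on the
      fcp0 line in the messages above):
        # Report LUNs per attachment point:
        cfgadm -al -o show_SCSI_LUN c6 c7
        # Dump the FC device map the HBA port sees:
        luxadm -e dump_map /devices/pci@0,0/pci1022,7450@2/pci1077,132@1/fp@0,0:devctl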

      Karl
        • 1. Re: Errors on new SAN disks (fabric login port failure)
          807557
          Interesting. Could you log in to the switch and run a "show port X" on the ports where the storage and hosts are attached? Or better yet, run a "create support" and attach the resulting file to a post here.

          Lyle
          • 2. Re: Errors on new SAN disks (fabric login port failure)
            807557
            I've been playing around with the switches quite a bit over the weekend
            so I don't have exactly the same setup. Unfortunately, it was fairly
            easy to reproduce a similar situation. I have 10 disks zoned in a single
            switch now (I wanted to isolate the problem from the multipath stuff). I
            see the following on the host:
            root 18: cfgadm -al c7
            Ap_Id                          Type         Receptacle   Occupant     Condition
            c7                             fc-fabric    connected    configured   unknown
            c7::22000014c34f773e           unavailable  connected    unconfigured failed
            c7::22000014c34f774a           disk         connected    configured   unknown
            c7::22000014c34f774b           disk         connected    configured   unknown
            c7::22000014c34f77be           unavailable  connected    unconfigured failed
            c7::22000014c34f7a8a           disk         connected    configured   unknown
            c7::22000014c34f7d5a           disk         connected    unconfigured unknown
            c7::22000014c34fc1c3           disk         connected    configured   unknown
            c7::22000014c34fc427           disk         connected    configured   unknown
            c7::22000014c34fc42a           unavailable  connected    unconfigured failed
            c7::22000014c34fd551           disk         connected    configured   unknown
            For the disks that are 'unavailable' above, we see the same 'login fabric
            port failed' messages in /var/adm/messages. I tried to unconfigure and
            then configure, and it didn't clear the problem (the exact cycle I ran
            is sketched after the port listings below). The host is attached
            to port 6 and the disks are attached to port 2. I couldn't figure out how
            to attach the switch support file, so here are the show port details for
            the two ports (as taken from the support file):
              CMD: show port 2
              ----
            
              Port Number: 2
              ------------
              AdminState       Online              OperationalState Online
              AsicNumber       0                   PerfTuningMode   MFS
              AsicPort         2                   PortID           010200
              ConfigType       GL                  PortWWN          20:02:00:c0:dd:07:3d:14
              DiagStatus       Passed              RunningType      FL
              EpConnState      None                MediaPartNumber  JSMR21S002B01
              EpIsoReason      NotApplicable       MediaRevision
              IOStreamGuard    Disabled            MediaType        200-M5-SN-I
              LinkSpeed        2Gb/s               MediaVendor      JDS UNIPHASE
              LinkState        Active              MediaVendorID    0000019c
              LoginStatus      LoggedIn            SymbolicName     Port2
              MaxCredit        16                  SyncStatus       SyncAcquired
              MediaSpeeds      1Gb/s, 2Gb/s        XmitterEnabled   True
            
              ALInit          3                      LIP_F8_AL_PS    0
              ALInitError     0                      LIP_F8_F7       0
              BadFrames       0                      LinkFailures    0
              Class2FramesIn  0                      Login           1
              Class2FramesOut 0                      Logout          0
              Class2WordsIn   0                      LoopTimeouts    0
              Class2WordsOut  0                      LossOfSync      0
              Class3FramesIn  845305                 PrimSeqErrors   0
              Class3FramesOut 842510                 RxLinkResets    0
              Class3Toss      0                      RxOfflineSeq    0
              Class3WordsIn   30461729               TotalErrors     0
              Class3WordsOut  30256013               TotalLinkResets 0
              DecodeErrors    0                      TotalLIPsRecvd  1
              EpConnects      0                      TotalLIPsXmitd  3
              FBusy           0                      TotalOfflineSeq 3
              FlowErrors      0                      TotalRxFrames   845305
              FReject         0                      TotalRxWords    30461729
              InvalidCRC      0                      TotalTxFrames   842510
              InvalidDestAddr 0                      TotalTxWords    30256013
              LIP_AL_PD_AL_PS 0                      TxLinkResets    0
              LIP_F7_AL_PS    0                      TxOfflineSeq    3
              LIP_F7_F7       1                      TxWaits         166700000
            And for port 6:
              CMD: show port 6
              ----
            
              Port Number: 6
              ------------
              AdminState       Online              OperationalState Online
              AsicNumber       0                   PerfTuningMode   Normal
              AsicPort         6                   PortID           010600
              ConfigType       GL                  PortWWN          20:06:00:c0:dd:07:3d:14
              DiagStatus       Passed              RunningType      F
              EpConnState      None                MediaPartNumber  JSMR21S002B01
              EpIsoReason      NotApplicable       MediaRevision
              IOStreamGuard    Disabled            MediaType        200-M5-SN-I
              LinkSpeed        2Gb/s               MediaVendor      JDS UNIPHASE
              LinkState        Active              MediaVendorID    0000019c
              LoginStatus      LoggedIn            SymbolicName     Port6
              MaxCredit        16                  SyncStatus       SyncAcquired
              MediaSpeeds      1Gb/s, 2Gb/s        XmitterEnabled   True
            
              ALInit          5                      LIP_F8_AL_PS    0
              ALInitError     0                      LIP_F8_F7       1
              BadFrames       0                      LinkFailures    2
              Class2FramesIn  0                      Login           3
              Class2FramesOut 0                      Logout          2
              Class2WordsIn   0                      LoopTimeouts    0
              Class2WordsOut  0                      LossOfSync      2
              Class3FramesIn  12678                  PrimSeqErrors   0
              Class3FramesOut 21803                  RxLinkResets    3
              Class3Toss      0                      RxOfflineSeq    0
              Class3WordsIn   188523                 TotalErrors     13
              Class3WordsOut  958528                 TotalLinkResets 3
              DecodeErrors    11                     TotalLIPsRecvd  5
              EpConnects      0                      TotalLIPsXmitd  7
              FBusy           0                      TotalOfflineSeq 7
              FlowErrors      0                      TotalRxFrames   12678
              FReject         0                      TotalRxWords    188523
              InvalidCRC      0                      TotalTxFrames   21803
              InvalidDestAddr 0                      TotalTxWords    958528
              LIP_AL_PD_AL_PS 0                      TxLinkResets    0
              LIP_F7_AL_PS    0                      TxOfflineSeq    7
              LIP_F7_F7       4                      TxWaits         264
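            As for the unconfigure/configure cycle I mentioned above, this is
            what it looked like (a minimal sketch, run against the controller
            as a whole):
              # Unconfigure the whole FCA port, then configure it again:
              cfgadm -c unconfigure c7
              cfgadm -c configure c7
              # Re-check the attachment point states:
              cfgadm -al c7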
            I have a number of random questions/items that you might be able to help with as well:

            1) The MS approach seems to work to correct, or change, the problem: if
            the disks are in this failed state and I reboot the machine, they
            either come back fine or the set of failed disks changes. For example,
            in the situation above, for which I pulled the show port details,
            I just rebooted the machine and now I get the following:
            root 4: cfgadm -al c7
            Ap_Id                          Type         Receptacle   Occupant     Condition
            c7                             fc-fabric    connected    configured   unknown
            c7::22000014c34f773e           disk         connected    configured   unknown
            c7::22000014c34f774a           disk         connected    configured   unknown
            c7::22000014c34f774b           disk         connected    configured   unknown
            c7::22000014c34f77be           unavailable  connected    unconfigured failed
            c7::22000014c34f7a8a           disk         connected    configured   unknown
            c7::22000014c34f7d5a           unavailable  connected    unconfigured failed
            c7::22000014c34fc1c3           disk         connected    configured   unknown
            c7::22000014c34fc427           disk         connected    configured   unknown
            c7::22000014c34fc42a           disk         connected    configured   unknown
            c7::22000014c34fd551           disk         connected    configured   unknown
            Needless to say, I'm looking for a better alternative than rebooting until
            all the disks are ok.
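
            One alternative to a reboot that I may try (a sketch; I'm assuming
            luxadm -e forcelip accepts the fp devctl node, with the path taken
            from the fp1 line in item 2 below, and I haven't verified that it
            clears the failed state):
              # Force a fresh link init / fabric re-login on the port, then rescan:
              luxadm -e forcelip /devices/pci@1,0/pci1022,7450@2/pci1077,132@1/fp@0,0:devctl
              cfgadm -al c7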

            2) During some of my poking over the weekend I noticed at one point that
            the disks were being added with different device paths. For example, I
            added one disk (these are all exactly the same kind of disks, all behind
            the SANbox in the same type of chassis) by sticking it in the proper zone,
            and I saw the following messages in /var/adm/messages:
            Apr  2 18:58:47 pemsdc scsi: [ID 799468 kern.info] sd33 at scsi_vhci0: name g20000014c34f7d5a, bus address g20000014c34f7d5a
            Apr  2 18:58:47 pemsdc genunix: [ID 936769 kern.info] sd33 is /scsi_vhci/disk@g20000014c34f7d5a
            Apr  2 18:58:47 pemsdc genunix: [ID 408114 kern.info] /scsi_vhci/disk@g20000014c34f7d5a (sd33) online
            Apr  2 18:58:47 pemsdc genunix: [ID 834635 kern.info] /scsi_vhci/disk@g20000014c34f7d5a (sd33) multipath status: degraded, path /pci@1,0/pci1022,7450@2/pci1077,132@1/fp@0,0 (fp1) to target address: w22000014c34f7d5a,0 is online Load balancing: round-robin
            Notice that the device path is through /scsi_vhci. Luxadm probe then showed:
            root 3: luxadm probe
            
            Found Fibre Channel device(s):
              Node WWN:20000014c34f7d5a  Device Type:Disk device
                Logical Path:/dev/rdsk/c8t20000014C34F7D5Ad0s2
              Node WWN:20000014c34f5faa  Device Type:Disk device
                Logical Path:/devices/scsi_vhci/disk@g20000014c34f5faa:c,raw
              Node WWN:20000014c34f774b  Device Type:Disk device
                Logical Path:/dev/rdsk/c8t20000014C34F774Bd0s2
              Node WWN:20000014c34f7a8a  Device Type:Disk device
                Logical Path:/dev/rdsk/c8t20000014C34F7A8Ad0s2
            Note that the second disk there has the new-style path. Is that the way
            it's supposed to be? To correct this I had to do an unconfigure and then
            a configure with cfgadm. After that the path looked like the regular
            /dev/rdsk path.
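
            My guess is this is the multipathing (MPxIO) layer deciding whether
            it owns the path. A sketch of where that's controlled, assuming the
            stock Solaris 10 config files:
              # MPxIO for the FC (fp) ports is switched in /kernel/drv/fp.conf:
              #   mpxio-disable="no";   -> disks show up under /scsi_vhci
              #   mpxio-disable="yes";  -> plain per-path /dev/rdsk devices
              grep mpxio-disable /kernel/drv/fp.conf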

            Sometime during the weekend, in my desperation, I patched the FC
            stack, so now I have the following driver versions:
            root@pemsdc 19: modinfo | grep FC
            110 fffffffff04d5000  19918  58   1  fp (SunFC Port v20051215-1.69)
            113 fffffffff04f4000  18830  61   1  fcp (SunFC FCP v20051215-1.94)
            114 fffffffff050c000   9d80   -   1  fctl (SunFC Transport v20051215-1.50)
            115 fffffffff0515000  c9c48 119   1  qlc (SunFC Qlogic FCA v20051013-2.08)
            180 fffffffffbbb6310   5610  62   1  fcsm (Sun FC SAN Management v20051215)
            184 fffffffffbbbb380   9670  59   1  fcip (SunFC FCIP v20051215-1.43)
            201 fffffffff05e0000   4b90   -   1  zmod (RFC 1950 decompression routines)
            So fp, fcp and fcsm were all bumped up. Could this have caused the change?

            3) It seems that if I add the disks one at a time, meaning I put them in
            the zone on the switch and then do a 'cfgadm -al' on the host, I can
            add quite a few of them in a row. But if I ever added more than one
            disk at a time, the second one would 'fail' with this 'login fabric
            port failed' message. I did this a number of times but gave up
            because it was too slow.
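
            For the record, the host-side steps per disk look like this (a
            sketch; <new-WWN> stands for whatever attachment point shows up
            after zoning):
              # After adding the disk to the zone on the switch:
              cfgadm -al c7                      # rescan the fabric
              cfgadm -c configure c7::<new-WWN>  # configure the new device
              devfsadm                           # rebuild the /dev links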

            4) Random unrelated-to-problem question: I noticed that cfgadm reports
            the disks attached to controllers c6 and c7, but luxadm probe and
            format both seem to indicate that the disks are attached to c8. Is
            this a function of the multipathing software assigning a new
            controller to the single disk (which has two paths)?
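
            One way to check that guess (a sketch): if the multipathing
            software owns the disk, the c8 link should resolve under
            /devices/scsi_vhci rather than under a physical pci path:
              ls -l /dev/rdsk/c8t20000014C34F774Bd0s2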

            Thanks.

            Karl
            • 3. Re: Errors on new SAN disks (fabric login port failure)
              807557
              We opened up a support case for this, and the engineer pointed out the problem. It turns out that the Fibre Channel HBA cards we have don't actually support this configuration. At least not very well.

              We have a bunch of JBODs in fabric mode. The HBA in our host sees each disk in each JBOD array as a separate target. We have multiple arrays and hundreds of disks (we then virtualize them into volumes with Veritas). Unfortunately, the HBA cards we have, SG-XPCI1FC-QLC, can only accommodate 8 devices in fabric mode, so the problems we were seeing were a function of that limitation. The problems seemed random (different disks could be seen after a reboot) because they depended on the order in which the devices attempted to log in.

              In any case, the more expensive HBA cards (e.g., SG-XPCI1FC-QL2) apparently support more, up to the Fibre Channel spec limit. The takeaway: watch out when you buy something that says "entry level".
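
              Given that limit, a quick sanity check of how many fabric devices
              each FCA port is carrying (a sketch; counts attachment points
              under one controller):
                # Compare against the HBA's 8-device fabric limit:
                cfgadm -al c7 | grep -c '::'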

              Karl