13 Replies Latest reply: Jul 4, 2011 8:20 AM by 854744

    STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number

    854744
      Hi,

      PROBLEM 1 -
      Our Storage team recently configured an STK2540 for our Sun M4000 (where I've installed CAM v6.7.0.12) and I'm happily using the storage. My problem is that the 2540's health is reported via CAM as 'Degraded' and so my scripts to check the array's status via sscs are showing it has a problem.

      I suspect that an alarm is to blame, as the 2540 was originally registered and configured via a cross-over network cable on a private network (as no serial connector was found). We then changed its IP via CAM and plugged it into our main LAN, then un-registered and re-registered it with its new IP. Alarms were generated as below, but do not appear to have been "auto cleared"...

      # /opt/SUNWstkcam/bin/sscs list alarm
      Alarm Id : alarm2
      Severity : Critical
      Type : 2540.CommunicationLostEvent
      Topic : OutOfBand
      Event Code : 70.12.31
      Date : 2011-06-03 11:36:38
      Device : nbukmtr1-stk[SUN.540-7198-02.1106BE90B0]
      Description : Lost out-of-band communication with 2540 nbukmtr1-stk
      State : Open
      Acknowledged By : PJB
      Auto Clear : Y
      Aggregated Count : 0

      Found one alert entry in health database.

      I've tried and failed with...
      un-registering and re-registering the 2540,
      re-opening the alarm and acknowledging it again,
      rebooting M4000 & 2540

      Is there any way to clear this alarm and/or get the health status back to normal?
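A minimal sketch of how a monitoring script might pick the open alarms out of this output. It assumes the `Key : Value` layout pasted above is what `sscs list alarm` prints; the function name `open_alarms` is hypothetical, and on Solaris 10 it would probably need `nawk` or `/usr/xpg4/bin/awk` rather than the legacy `/usr/bin/awk`.

```sh
# Read "sscs list alarm" style output on stdin and print one line per
# alarm whose State is Open: <alarm id> <severity> <type>.
open_alarms() {
    awk '
        {
            # Split at the first colon only, so values that contain
            # colons (dates, descriptions) stay intact.
            pos = index($0, ":")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
        }
        key == "Alarm Id" { emit(); id = val }
        key == "Severity" { sev = val }
        key == "Type"     { atype = val }
        key == "State"    { state = val }
        END { emit() }

        # Print the alarm collected so far if it is open, then reset.
        function emit() {
            if (id != "" && state == "Open") print id, sev, atype
            id = sev = atype = state = ""
        }
    '
}
```

Possible usage: `/opt/SUNWstkcam/bin/sscs list alarm | open_alarms`, treating any output as "array needs attention".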


      PROBLEM 2 -
      Our Storage team have configured an "Access" volume mapping to the "Default Storage Domain" and our M4000 host (LUN 31). From Solaris 10 on the M4000 this appears as a 16MB LUN which has not yet been labelled in format. /var/adm/messages complains as follows (2 paths to same LUN via single 2540 controller)...

      Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@3,700000/SUNW,emlxs@0/fp@0,0/ssd@w202400a0b8760eb6,1f (ssd0):
      Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
      Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@3,700000/SUNW,emlxs@0/fp@0,0/ssd@w202400a0b8760eb6,1f (ssd0):
      Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
      Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@1,700000/SUNW,emlxs@0/fp@0,0/ssd@w203400a0b8760eb6,1f (ssd1):
      Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
      Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@1,700000/SUNW,emlxs@0/fp@0,0/ssd@w203400a0b8760eb6,1f (ssd1):
      Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number

      From what I can gather from the docs, this is the UTM LUN used to communicate with the 2540, so I have left well alone. Is this mapping really needed or can I label the disk? I'm just trying to get rid of the scsi errors going to /var/adm/messages as they are picked up by our monitoring software.
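In the device paths above, the `,1f` after the target WWN is the LUN number in hex (0x1f = 31). As a stopgap for the monitoring side only, a filter along these lines could drop those warning pairs before the log is scanned. This is a sketch: the function name `drop_lun31_label_warnings` is hypothetical, and it assumes each warning header is immediately followed by its "Corrupt label" line, as in the excerpt. It does not stop the kernel from logging the messages.

```sh
# Copy syslog text from stdin to stdout, dropping each
# "Corrupt label; wrong magic number" warning pair that belongs to LUN 31.
drop_lun31_label_warnings() {
    awk '
        # A WARNING header for an ssd target whose LUN field is 1f (hex for 31).
        /WARNING: .*\/ssd@w[0-9a-f]+,1f \(ssd[0-9]+\):[ \t]*$/ {
            if (held != "") print held
            held = $0
            next
        }
        held != "" {
            # Drop the held header together with its label complaint.
            if ($0 ~ /Corrupt label; wrong magic number[ \t]*$/) { held = ""; next }
            # Header was followed by something else: keep it.
            print held
            held = ""
        }
        { print }
        END { if (held != "") print held }
    '
}
```

Possible usage: `drop_lun31_label_warnings < /var/adm/messages`, feeding the result to whatever scans the log.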


      If you need any further info, please let me know.

      Many thanks!!!
      Pete
        • 1. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
          Nik
          Hi.
          The alarm "Description : Lost out-of-band communication with 2540 nbukmtr1-stk" means that CAM can't contact the array over the network.

          You were able to register the array, so it looks like either:
          1. A temporary problem.
          2. A compatibility problem.

          Check:
          1. Can you ping the controllers of the 2540? (Does your array have dual or single controllers?)
          2. The port on the network switch must be configured for auto-negotiation of speed.
          3. Check the Events in CAM to see whether the problem is persistent or not.
          4. Try connecting the 2540 through another switch (as simple as possible).



          About LUN 31: the Access LUN is used for in-band management of the array. This LUN is read-only, so you can't label it.

          You can unmap this LUN via CAM or ignore these error messages.


          Another way to resolve the problem with management access to the 2540 is to use in-band management.
          In that case you don't need a network connection to the array, but LUN 31 must be mapped.

          Read the CAM docs about configuring in-band management.


          Regards.
          • 2. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
            854744
            Hi Nik,

            Thanks for the info. I'll have to plan for the LUN31 removal as the server is now in use, so I'll concentrate on the out-of-band management which is where I want to end up.

            We have a single 2540 controller and the M4000 can ping its mgmt IP fine (same network so no firewall). Snooping shows traffic while I run sscs commands and a CAM session, and there are no further out-of-band alarms, so I'm hoping that it is now working fine and the alarm was just logged while I changed the mgmt IP and swapped cables. I just need to find a way to clear the alarm and get rid of the Degraded status.

            I've found what I think is the alarm on the CAM server...

            # ls -la /var/opt/SUNWsefms/store/Events
            total 18
            drwxr-xr-x 2 root root 512 Jun 9 10:52 .
            drwxr-xr-x 15 root root 512 Jun 3 11:46 ..
            -rw-r--r-- 1 root root 0 Jun 7 16:47 .event.keys
            -rw-r--r-- 1 root root 294 Jun 9 10:52 event.keys
            -rw-r--r-- 1 root root 1191 Jun 9 09:52 event1954
            -rw-r--r-- 1 root root 1191 Jun 9 10:52 event1966
            -rw-r--r-- 1 root root 1543 Jun 3 11:36 event252

            # cat /var/opt/SUNWsefms/store/Events/event.keys
            1307097398223.85a6ff2e.227 event252 Date 1307097398222 Device.type 2540 GridCode 70.12.31 Severity 3 Topic 70.12.31 Device.key SUN.540-7198-02.1106BE90B0
            1307613461575.85a6ff2e.508 event1967 Date 1307613461571 Device.type 2540 GridCode 70.69.16 Severity 1 Device.key SUN.540-7198-02.1106BE90B0

            # grep description /var/opt/SUNWsefms/store/Events/event*
            /var/opt/SUNWsefms/store/Events/event1954: "description":"Finished monitoring run for device. Error generating report.",
            /var/opt/SUNWsefms/store/Events/event1967: "description":"Finished monitoring run for device. Error generating report.",
            /var/opt/SUNWsefms/store/Events/event252: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",

            I've had a go at trying to fool CAM that event252 no longer exists...
            1) Stopped Web Console service
            2) Renamed /var/opt/SUNWsefms/store/Events/event252 and removed its entry from /var/opt/SUNWsefms/store/Events/event.keys
            3) Started Web Console service
            4) Logged into CAM

            It still shows the Degraded state and the alarm, so maybe I need to restart something else. I'll carry on and see if this approach works. Any suggestions greatly appreciated though!
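For locating which event file belongs to a given event code without opening each one, a read-only lookup against the index can help. This is only a sketch: it assumes `event.keys` lines have the shape shown above (event id, file name, then alternating key/value pairs), and the function name `events_with_gridcode` is hypothetical.

```sh
# Read an event.keys style index on stdin and print the file name of
# every entry whose GridCode equals the first argument.
events_with_gridcode() {
    code=$1
    awk -v code="$code" '
        {
            # Fields 1 and 2 are the event id and file name; the rest are
            # assumed to be alternating key/value pairs.
            for (i = 3; i < NF; i += 2)
                if ($i == "GridCode" && $(i + 1) == code) { print $2; break }
        }
    '
}
```

Possible usage: `events_with_gridcode 70.12.31 < /var/opt/SUNWsefms/store/Events/event.keys`.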

            Thanks
            • 3. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
              854744
              Update:

              Found the alarm under /var/opt/SUNWsefms/store/Alarms (doh!) and have managed to get rid of it (well, rename it and update the alarm.keys file). Unfortunately, the health is now "Lost communication". Checking the events, there is a job which is failing to communicate with the controller even though the CAM O-O-B "Array Communication Test" under Troubleshooting passes OK?!?

              Output from "Troubleshooting -> Array Communication Test"...
              # mail
              From Common.Array.Manager@oracle.com Thu Jun 9 11:29:17 2011
              From: Common.Array.Manager@oracle.com
              Date: Thu, 9 Jun 2011 11:29:17 +0100 (BST)
              To: root@nbukmtr1.uk.eu.airbus.corp
              Message-ID: <11048698.0.1307615357418.JavaMail.root@nbukmtr1>
              Subject: Diagnostic Test Results: Array Communication Test on nbukmtr1-stk
              Content-Length: 357

              Test : Array Communication Test
              Host : nbukmtr1
              Target : nbukmtr1-stk
              Status : Passed
              Options : EMAIL = root@localhost
              Result : Test Completed

              **** Test Output Below ****

              Attempting to contact the array using the following address(es):
              44.82.228.120
              0.0.0.0

              Controller A is accessible via:
              44.82.228.120 (oob)

              Controller B is not accessible


              However, I get the following event when running "General Health Monitoring -> Run Agent"...
              Job Overview (6)
              Property            Value
              Host:               nbukmtr1
              Cancelable:         True
              Start Time:         09-Jun-2011 11:50:01
              Last Update Time:   09-Jun-2011 11:50:03
              Status:             Completed Successfully
              Target:             nbukmtr1

              Task completed successfully.

              Monitoring device nbukmtr1-stk
              Gathering log messages for nbukmtr1-stk
              Unable to communicate with nbukmtr1-stk <<<<<<PROBLEM?
              Creating health messages for nbukmtr1-stk
              Unable to retrieve the report for alarm auditing. <<<<<<PROBLEM?
              Saving nbukmtr1-stk data...
              Finished saving nbukmtr1-stk data.


              From the files under /var/opt/SUNWsefms/store, this appears to be due to...
              # cat /var/opt/SUNWsefms/store/Reports/error1
              rc._Type=0
              rc.commType=oob
              rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
              rc.iteration=1729
              rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

              Now I'm stumped. Found a command to clear the "lost comms" sort of thing from the CAM release notes...
              # cd /opt/SUNWsefms/bin
              # ./csmservice -i -a <array-name> -Z UNLOCK -w

              ...but this would...
              "The correct controller firmware will be loaded, the array will reboot, and CAM should report array status"

              Is there any other way around this?

              Thanks
              • 4. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                Nik
                Hi.
                Please show the output of: ls -la /var/opt/SUNWsefms/store/Reports

                Try moving all files from this directory to a backup directory. Wait 5 minutes.
                Check the status of the array.

                If the problem is not resolved:
                unregister the array
                svcadm disable fmservice
                cd /var/opt/SUNWsefms/store
                tar cf <backup_dir>/store.tar .
                Clear all subdirectories
                svcadm enable fmservice
                register the array

                Wait 5 minutes and check the status.
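The backup step in the list above can be sketched as a small function that refuses to report success unless the archive can be read back. This covers only the `tar` step; it does not disable `fmservice`, unregister the array, or clear anything. The function name `backup_store` and its two arguments are hypothetical, and the backup directory is assumed to live outside the store directory.

```sh
# Archive the contents of a store directory into <backup_dir>/store.tar
# and print the archive path on success.
backup_store() {
    store_dir=$1
    backup_dir=$2
    [ -d "$store_dir" ] || { echo "not a directory: $store_dir" >&2; return 1; }
    mkdir -p "$backup_dir" || return 1
    # Resolve to an absolute path because tar runs from inside store_dir.
    backup_dir=$(cd "$backup_dir" && pwd) || return 1
    archive="$backup_dir/store.tar"
    ( cd "$store_dir" && tar cf "$archive" . ) || return 1
    # Listing the archive is a cheap check that it can be read back.
    tar tf "$archive" > /dev/null || return 1
    echo "$archive"
}
```

Possible usage: `backup_store /var/opt/SUNWsefms/store /var/tmp/cam_backup`, where `/var/tmp/cam_backup` is only an example location.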


                If the problem is still not resolved:

                Uninstall CAM:
                /var/opt/CommonArrayManager/<your version of CAM>/bin/uninstall
                cd /var/opt
                rm -rf CommonArrayManager SUNWse6130ui SUNWsefms

                Install CAM again and register the array.


                Regards.
                • 5. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                  854744
                  Hi Nik,

                  I've tried the first suggestion but I still end up with Degraded status.

                  The Events show...
                  # grep descr /var/opt/SUNWsefms/store/Events/event[0-9]*
                  event2: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",
                  event4: "description":"Diagnostic Test Array Communication Test ran on nbukmtr1-stk with Passed result.",
                  event5: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",
                  event7: "description":"Finished monitoring run for device. Error generating report.",

                  The Reports show...
                  # cat /var/opt/SUNWsefms/store/Reports/error[0-9]*
                  rc._Type=0
                  rc.commType=oob
                  rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
                  rc.iteration=4
                  rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

                  The Alarms show...
                  # cat /var/opt/SUNWsefms/store/Alarms/alarm1
                  <object class='com.sun.netstorage.fm.storade.agent.service.alarm.AlarmBean'>
                  <var id='AcknowledgedBy'>PJB</var>
                  <var id='GridId'>2540.CommunicationLostEvent.oob</var>
                  <var id='EventId'>1307711252733.85a6ff2e.1</var>
                  <var id='DeviceType'>2540</var>
                  <var id='RecommendedAction'>@event.CommunicationLostEvent.oob.action</var>
                  <var id='DeviceIP'>nbukmtr1-stk</var>
                  <var id='ProbableCause'>@event.CommunicationLostEvent.oob.cause</var>
                  <var id='Severity'>3</var>
                  <var id='DeviceName'>nbukmtr1-stk</var>
                  <var id='Description'>Lost out-of-band communication with 2540 nbukmtr1-stk
                  :OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::</var>
                  <var id='DateCreated'>1307711252781</var>
                  <var id='CorrelatedEvents'><item>1307711552831.85a6ff2e.4</item></var>
                  <var id='GridCode'>70.12.31</var>
                  <var id='State'>1</var>
                  <var id='AutoClear'>true</var>
                  <var id='DeviceKey'>SUN.540-7198-02.1106BE90B0</var>
                  <var id='Info'>Lost out-of-band communication with 2540 nbukmtr1-stk
                  :OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::</var>
                  <var id='Id'>alarm1</var>
                  <var id='ComponentId'>oob</var>
                  <var id='ComponentName'>OutOfBand</var>
                  </object>


                  I've run the Controller Communication Test again and it still passes. So I'll try re-installing CAM when I can get some free time on the server (next reply maybe in a week or two's time though!).

                  Thanks for all your help
                  Pete

                  P.S. As requested...
                  # ls -la /var/opt/SUNWsefms/store/Reports
                  total 8
                  drwxr-xr-x 2 root root 512 Jun 10 14:17 .
                  drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
                  -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
                  -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
                  -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
                  -rw-r--r-- 1 root root 102 Jun 10 14:17 error.keys
                  -rw-r--r-- 1 root root 180 Jun 10 14:17 error1
                  -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
                  -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys
                  • 6. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                    Nik
                    Hi.

                    Try to unregister and re-register the array again.
                    Please show the output of ls -la /var/opt/SUNWsefms/store/Reports after this.


                    Regards.
                    • 7. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                      854744
                      Hi Nik,

                      Here's the info...

                      BEFORE...
                      root@nbukmtr1 0 /root # ls -la /var/opt/SUNWsefms/store/Reports
                      total 8
                      drwxr-xr-x 2 root root 512 Jun 14 09:27 .
                      drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
                      -rw-r--r-- 1 root root 102 Jun 14 09:27 error.keys
                      -rw-r--r-- 1 root root 183 Jun 14 09:27 error1
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys

                      root@nbukmtr1 2 /root # cat /var/opt/SUNWsefms/store/Reports/error1
                      rc._Type=0
                      rc.commType=oob
                      rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
                      rc.iteration=1097
                      rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

                      From CAM gui, Health is "Degraded".


                      Removed (@09:32am), then registered (@09:37am) nbukmtr1 via IP address and left for >15 minutes.


                      AFTER...
                      root@nbukmtr1 0 /root # ls -la /var/opt/SUNWsefms/store/Reports
                      total 8
                      drwxr-xr-x 2 root root 512 Jun 14 10:02 .
                      drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
                      -rw-r--r-- 1 root root 102 Jun 14 10:02 error.keys
                      -rw-r--r-- 1 root root 183 Jun 14 10:02 error1
                      -rw-r--r-- 1 root root 0 Jun 14 09:32 inventory.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
                      -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys

                      root@nbukmtr1 0 /root # cat /var/opt/SUNWsefms/store/Reports/error1
                      rc._Type=0
                      rc.commType=oob
                      rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
                      rc.iteration=1105
                      rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

                      From CAM gui, Health is "Lost Communication" and there are no Alarms listed.

                      The last time I un-registered/re-registered the array, I think I got the "Lost Communication" state initially, and again when I tried to manually "Run Agent" from the "General Health Monitoring Setup" page. Then it went to Degraded.

                      The oob "Array Communication Test" still works fine.

                      Thanks for taking another look.
                      • 8. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                        Nik
                        Hi.
                        You can install another copy of CAM on a different system (or, for example, in a zone on the same server; use a whole-root zone, not a sparse one).

                        If you have the problem on the fresh CAM too, the array and the network need to be analyzed.

                        If the problem is resolved, you can uninstall CAM on the main server, clean the old logs and CAM config out of /var, and install a fresh CAM.


                        CAM is available for many platforms (Windows, Linux, Solaris) on MOS.


                        Regards.
                        • 9. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                          854744
                          Hi Nik,

                          I've tried a new CAM install on a new (remote) server, but exactly the same thing happens so I'm getting the network team to check things out with the STK2540's management connection. If they don't find anything wrong, is installing controller firmware the next thing to try? I've managed to get hold of a slightly newer one (145965-03), so I could go from 07.35.55.10 to 07.35.55.11. Although I can't see it in the manual, I'm assuming that this would need all data presented to be unmounted and would not lose array config.

                          After that, would performing a "Reset Configuration" be worth a try? I'd lose my data, but I'm not sure where to go from here (other than allowing my sscs monitoring script to ignore "OK", "Lost Communication" or "Degraded" controller states).

                          Thanks again
                          • 10. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                            Nik
                            Hi.
                            Try to collect the support data via the CLI.

                            /opt/SUNWsefms/bin/supportData

                            Example: supportData -d Array-15 -p /tmp -o outputfile

                            Unzip the resulting file.

                            Check the array profile file for unprintable characters.


                            CAM 6.7 is more sensitive to them.


                            Regards.
                            • 11. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                              854744
                              Hi Nik,

                              supportData command works OK. storageArrayProfile.txt shows no unusual chars, but the summary indicates that there is a 2nd controller present whose status is "removed".
                              ...
                              SUMMARY------------------------------
                              Number of controllers:           2
                              Controller redundancy mode:      DUPLEX
                              Needs Attention flag:            true
                              Is Fixing flag:                  false
                              ...
                              Controllers------------------------------
                              Number of controllers: 2

                              Tray.85.Controller.A
                              Status: Optimal
                              ...
                              Tray.85.Controller.B
                              Status: Removed
                              ...

                              I've no idea what the array config was before it was delivered to us, so maybe it had controller B and someone took it out before shipping. I've tried using "Service Advisor" option in CAM to try to set a single controller config but I get the error "The device has been unregistered from this application. Please close this service advisor window.".
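A quick way to pull the per-controller status out of the profile, for example to spot a "Removed" controller from a script. This is a sketch that assumes the layout in the excerpt above (a `Tray.<n>.Controller.<X>` line followed shortly by a `Status:` line); the function name `controller_status` is hypothetical.

```sh
# Read a storageArrayProfile.txt style listing on stdin and print
# "<controller> <status>" for each controller entry found.
controller_status() {
    awk '
        # Remember the most recent controller heading.
        /^[ \t]*Tray\.[0-9]+\.Controller\.[A-Z][ \t]*$/ {
            ctrl = $1
            next
        }
        # The first Status line after a heading belongs to that controller.
        ctrl != "" && $1 == "Status:" {
            sub(/^[ \t]*Status:[ \t]*/, "")
            print ctrl, $0
            ctrl = ""
        }
    '
}
```

Possible usage: `controller_status < storageArrayProfile.txt`.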

                              Checking stateCaptureData.dmp shows the connections being lost but not obviously why (there's a lot of detail but I must admit I don't understand it!).

                              In the meantime, I've changed my monitoring script to ignore array health and only check controller A and disk health to get around this issue.

                              Now that I've arranged some downtime, I'll try firmware upgrade and maybe reset configuration if I get time. Any other suggestions more than welcome.

                              Thanks
                              • 12. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                                861764
                                Hi,

                                We are also facing the same "Lost Communication" issue with CAM 6.7.0.12.
                                After trying many workarounds, we reached the conclusion that this is a bug in CAM v6.7.0.12 that will be fixed in the next release.

                                For now the workaround is to manage the array with in-band management; after switching to in-band, the problem in our environment was resolved.

                                Please download the in-band management docs.

                                http://download.oracle.com/docs/cd/E19377-01/821-1362-10/821-1362-10.pdf


                                Thanks
                                Rajeev
                                • 13. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
                                  854744
                                  Update:
                                  This array has finally come under our support agreement so I recently logged a support request. Sun's recommendation was to...
                                  1) apply the simplex firmware (/opt/SUNWsefms/bin/lsscs modify -a <array> -t system -p <.../fw/images/nge/N1932-735843-902.dlp> -c system firmware),
                                  2) reset the controller,
                                  3) change the redundancy to simplex (/opt/SUNWsefms/bin/service -a <array> -c set -q redundancy -t simplex)

                                  I haven't tried it out yet (and there's still the slight possibility we may get the controller B which used to be in this array), but will update if I ever get to have a go.

                                  Thanks again for all your help!
                                  Pete