13 Replies. Latest reply: Jul 4, 2011 6:20 AM by 854744

STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number

854744 Newbie
Hi,

PROBLEM 1 -
Our Storage team recently configured an STK2540 for our Sun M4000 (where I've installed CAM v6.7.0.12) and I'm happily using the storage. My problem is that the 2540's health is reported via CAM as 'Degraded' and so my scripts to check the array's status via sscs are showing it has a problem.

I suspect that an alarm is to blame, as the 2540 was originally registered and configured via a cross-over network cable on a private network (as no serial connector was found). We then changed its IP via CAM and plugged it into our main LAN, then un-registered and re-registered it with its new IP. Alarms were generated as below, but do not appear to have been "auto cleared"...

# /opt/SUNWstkcam/bin/sscs list alarm
Alarm Id : alarm2
Severity : Critical
Type : 2540.CommunicationLostEvent
Topic : OutOfBand
Event Code : 70.12.31
Date : 2011-06-03 11:36:38
Device : nbukmtr1-stk[SUN.540-7198-02.1106BE90B0]
Description : Lost out-of-band communication with 2540 nbukmtr1-stk
State : Open
Acknowledged By : PJB
Auto Clear : Y
Aggregated Count : 0

Found one alert entry in health database.

I've tried and failed with...
un-registering and re-registering the 2540,
re-opening the alarm and acknowledging it again,
rebooting M4000 & 2540

Is there any way to clear this alarm and/or get the health status back to normal?
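
A minimal sketch of how a status script could pick the open alarms out of the listing above. It assumes the "Key : Value" layout shown in the `sscs list alarm` output; the helper name `open_alarms` is made up for this example and is not part of CAM.

```shell
#!/bin/sh
# Print "id severity type" for every alarm whose State is Open.
# Reads "Key : Value" text (as printed by `sscs list alarm` above) on stdin.
# open_alarms is a hypothetical helper name, not a CAM command.
open_alarms() {
    awk -F' *: ' '
        # Strip leading/trailing whitespace so padded columns still match.
        { gsub(/^[ \t]+|[ \t\r]+$/, "") }
        $1 == "Alarm Id" { id = $2 }
        $1 == "Severity" { sev = $2 }
        $1 == "Type"     { type = $2 }
        $1 == "State" && $2 == "Open" { print id, sev, type }
    '
}
```

It could be fed directly from the command in the post, for example `/opt/SUNWstkcam/bin/sscs list alarm | open_alarms`, and the script could then alert only when that prints something.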


PROBLEM 2 -
Our Storage team have configured an "Access" volume mapping to the "Default Storage Domain" and our M4000 host (LUN 31). From Solaris 10 on the M4000 this appears as a 16MB LUN which has not yet been labelled in format. /var/adm/messages complains as follows (2 paths to same LUN via single 2540 controller)...

Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@3,700000/SUNW,emlxs@0/fp@0,0/ssd@w202400a0b8760eb6,1f (ssd0):
Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@3,700000/SUNW,emlxs@0/fp@0,0/ssd@w202400a0b8760eb6,1f (ssd0):
Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@1,700000/SUNW,emlxs@0/fp@0,0/ssd@w203400a0b8760eb6,1f (ssd1):
Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number
Jun 8 13:44:42 xxxxxxxxx scsi: [ID 107833 kern.warning] WARNING: /pci@1,700000/SUNW,emlxs@0/fp@0,0/ssd@w203400a0b8760eb6,1f (ssd1):
Jun 8 13:44:42 xxxxxxxxx Corrupt label; wrong magic number

From what I can gather from the docs, this is the UTM LUN used to communicate with the 2540, so I have left well alone. Is this mapping really needed or can I label the disk? I'm just trying to get rid of the scsi errors going to /var/adm/messages as they're picked up by our monitoring software.
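
For reference, the `,1f` at the end of each ssd path in the messages above is the LUN in hexadecimal, which is 31 in decimal. A small sketch, assuming the message layout shown above, that decodes the LUN and drops the "Corrupt label" pairs for that one LUN before the log text reaches a monitoring tool. Both function names are invented for this example.

```shell
#!/bin/sh
# Decode the LUN from an ssd device path such as
#   /pci@3,700000/SUNW,emlxs@0/fp@0,0/ssd@w202400a0b8760eb6,1f
# The text after the last comma is the LUN in hex.
lun_of_path() {
    hex=${1##*,}
    printf '%d\n' "0x$hex"
}

# Filter syslog text on stdin: drop each WARNING header for LUN 1f together
# with the "Corrupt label" line that follows it. Everything else passes
# through, including a 1f header followed by a different message.
filter_access_lun() {
    awk '
        held != "" {
            if ($0 ~ /Corrupt label; wrong magic number/) { held = ""; next }
            print held; held = ""
        }
        /WARNING:.*ssd@[^ ]*,1f \(ssd[0-9]+\):/ { held = $0; next }
        { print }
        END { if (held != "") print held }
    '
}
```

This only hides the messages on the monitoring side; it does not change what Solaris logs.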


If you need any further info, please let me know.

Many thanks!!!
Pete
  • 1. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    Nik Expert
    Hi.
    The alarm "Description : Lost out-of-band communication with 2540 nbukmtr1-stk" means that CAM can't contact the array over the network.

    You can register the array, so it looks like one of:
    1. Temporary problem.
    2. Compatibility problem.

    Check:
    1. Can you ping the controllers of the 2540? (Does your array have dual or single controllers?)
    2. The port on the network switch must be configured for speed auto-negotiation.
    3. Check the Events in CAM to see whether the problem is persistent.
    4. Try connecting the 2540 through another switch (as dumb as possible).



    About LUN 31: the Access LUN is used for in-band management of the array. This LUN is read-only, so you can't label it.

    You can unmap this LUN via CAM or ignore these error messages.


    Another way to resolve the problem with management access to the 2540 is to use in-band management.
    In that case you don't need a network connection to the array, but LUN 31 must stay mapped.

    Read the CAM docs on configuring in-band management.


    Regards.
  • 2. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Hi Nik,

    Thanks for the info. I'll have to plan for the LUN31 removal as the server is now in use, so I'll concentrate on the out-of-band management which is where I want to end up.

    We have a single 2540 controller and the M4000 can ping its mgmt IP fine (same network so no firewall). Snooping shows traffic while I run sscs commands and a CAM session, and there are no further out-of-band alarms, so I'm hoping that it is now working fine and the alarm was just logged while I changed the mgmt IP and swapped cables. I just need to find a way to clear the alarm and get rid of the Degraded status.

    I've found what I think is the alarm on the CAM server...

    # ls -la /var/opt/SUNWsefms/store/Events
    total 18
    drwxr-xr-x 2 root root 512 Jun 9 10:52 .
    drwxr-xr-x 15 root root 512 Jun 3 11:46 ..
    -rw-r--r-- 1 root root 0 Jun 7 16:47 .event.keys
    -rw-r--r-- 1 root root 294 Jun 9 10:52 event.keys
    -rw-r--r-- 1 root root 1191 Jun 9 09:52 event1954
    -rw-r--r-- 1 root root 1191 Jun 9 10:52 event1966
    -rw-r--r-- 1 root root 1543 Jun 3 11:36 event252

    # cat /var/opt/SUNWsefms/store/Events/event.keys
    1307097398223.85a6ff2e.227 event252 Date 1307097398222 Device.type 2540 GridCode 70.12.31 Severity 3 Topic 70.12.31 Device.key SUN.540-7198-02.1106BE90B0
    1307613461575.85a6ff2e.508 event1967 Date 1307613461571 Device.type 2540 GridCode 70.69.16 Severity 1 Device.key SUN.540-7198-02.1106BE90B0

    # grep description /var/opt/SUNWsefms/store/Events/event*
    /var/opt/SUNWsefms/store/Events/event1954: "description":"Finished monitoring run for device. Error generating report.",
    /var/opt/SUNWsefms/store/Events/event1967: "description":"Finished monitoring run for device. Error generating report.",
    /var/opt/SUNWsefms/store/Events/event252: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",
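
A small sketch for reading the event.keys lines shown above. It assumes each line is "<event id> <file name>" followed by key/value pairs, which is inferred from the listing rather than from any CAM documentation. The Date value looks like epoch milliseconds: 1307097398222 works out to 2011-06-03 10:36:38 UTC, which is 11:36:38 BST and matches the Date on the alarm in the first post. The function names are made up for this example.

```shell
#!/bin/sh
# Print the value that follows KEY in one event.keys line.
# $1 = the line, $2 = the key (e.g. GridCode). Fields 1 and 2 are assumed
# to be the event id and file name, with key/value pairs from field 3 on.
event_field() {
    printf '%s\n' "$1" | awk -v key="$2" '{
        for (i = 3; i < NF; i += 2)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Convert the millisecond Date value of a line to whole epoch seconds.
event_epoch() {
    ms=$(event_field "$1" Date)
    echo $((ms / 1000))
}
```

For example, `event_field "$line" GridCode` on the first line above would give the same 70.12.31 code that the alarm reports as its Event Code.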

    I've had a go at trying to fool CAM that event252 no longer exists...
    1) Stopped Web Console service
    2) Renamed /var/opt/SUNWsefms/store/Events/event252 and removed its entry from /var/opt/SUNWsefms/store/Events/event.keys
    3) Started Web Console service
    4) Logged into CAM

    It still shows the Degraded state and the alarm, so maybe I need to restart something else. I'll carry on and see if this approach works. Any suggestions greatly appreciated though!

    Thanks
  • 3. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Update:

    Found the alarm under /var/opt/SUNWsefms/store/Alarms (doh!) and have managed to get rid of it (well, rename it and update the alarm.keys file). Unfortunately, the health is now "Lost communication". Checking the events, there is a job which is failing to communicate with the controller even though the CAM O-O-B "Array Communication Test" under Troubleshooting passes OK?!?

    Output from "Troubleshooting -> Array Communication Test"...
    # mail
    From Common.Array.Manager@oracle.com Thu Jun 9 11:29:17 2011
    From: Common.Array.Manager@oracle.com
    Date: Thu, 9 Jun 2011 11:29:17 +0100 (BST)
    To: root@nbukmtr1.uk.eu.airbus.corp
    Message-ID: <11048698.0.1307615357418.JavaMail.root@nbukmtr1>
    Subject: Diagnostic Test Results: Array Communication Test on nbukmtr1-stk
    Content-Length: 357

    Test : Array Communication Test
    Host : nbukmtr1
    Target : nbukmtr1-stk
    Status : Passed
    Options : EMAIL = root@localhost
    Result : Test Completed

    **** Test Output Below ****

    Attempting to contact the array using the following address(es):
    44.82.228.120
    0.0.0.0

    Controller A is accessible via:
    44.82.228.120 (oob)

    Controller B is not accessible


    However, I get the following event when running "General Health Monitoring -> Run Agent"...
    Job Overview (6)
    Property            Value
    Host:               nbukmtr1
    Cancelable:         True
    Start Time:         09-Jun-2011 11:50:01
    Last Update Time:   09-Jun-2011 11:50:03
    Status:             Completed Successfully
    Target:             nbukmtr1

    Task completed successfully.

    Monitoring device nbukmtr1-stk
    Gathering log messages for nbukmtr1-stk
    Unable to communicate with nbukmtr1-stk <<<<<<PROBLEM?
    Creating health messages for nbukmtr1-stk
    Unable to retrieve the report for alarm auditing. <<<<<<PROBLEM?
    Saving nbukmtr1-stk data...
    Finished saving nbukmtr1-stk data.


    From the files under /var/opt/SUNWsefms/store, this appears to be due to...
    # cat /var/opt/SUNWsefms/store/Reports/error1
    rc._Type=0
    rc.commType=oob
    rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
    rc.iteration=1729
    rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0
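
The error report above is a plain "key=value" file, so a status script can pull single values out of it, for example to see whether rc.iteration keeps increasing while rc.error stays the same. A minimal sketch, assuming that layout; `report_prop` is a name invented for this example.

```shell
#!/bin/sh
# Print the value of one key from "key=value" text on stdin.
# Only the first "=" on a line separates key from value, so values that
# contain "=" or ": " are returned whole.
report_prop() {
    awk -v key="$1" '{
        eq = index($0, "=")
        if (eq == 0) next
        k = substr($0, 1, eq - 1)
        sub(/^[ \t]+/, "", k)          # tolerate indented lines
        if (k == key) { print substr($0, eq + 1); exit }
    }'
}
```

Usage would look like `report_prop rc.error < /var/opt/SUNWsefms/store/Reports/error1`.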

    Now I'm stumped. Found a command to clear the "lost comms" sort of thing from the CAM release notes...
    # cd /opt/SUNWsefms/bin
    # ./csmservice -i -a <array-name> -Z UNLOCK -w

    ...but this would...
    "The correct controller firmware will be loaded, the array will reboot, and CAM should report array status"

    Is there any other way around this?

    Thanks
  • 4. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    Nik Expert
    Hi.
    Please show the output of: ls -la /var/opt/SUNWsefms/store/Reports

    Try moving all files from this directory to a backup directory, wait 5 minutes,
    then check the status of the array.

    If the problem is not resolved:
    unregister array
    svcadm disable fmservice
    cd /var/opt/SUNWsefms/store
    tar cf <backup_dir>/store.tar .
    Clear all subdirectories
    svcadm enable fmservice
    register array

    Wait 5 min and check status.
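
The backup-and-clear part of the steps above could be sketched as below. This is only an illustration under stated assumptions: "Clear all subdirectories" is read here as deleting the files inside each subdirectory while keeping the directories, the array is assumed to be unregistered and fmservice already disabled, and the function name is made up.

```shell
#!/bin/sh
# Archive a CAM store directory, then empty its subdirectories.
#   $1 = store directory (the thread uses /var/opt/SUNWsefms/store)
#   $2 = existing backup directory outside the store
# Service handling (svcadm disable/enable fmservice) is deliberately left
# to the caller; this function only touches files.
backup_and_clear_store() {
    store=$1
    backup=$2
    [ -d "$store" ] && [ -d "$backup" ] || return 1

    # Same archive step as in the reply: tar the whole store first.
    (cd "$store" && tar cf "$backup/store.tar" .) || return 1

    # Assumption: "clear" means remove files but keep the directory layout.
    for d in "$store"/*/; do
        [ -d "$d" ] || continue
        find "$d" -type f -exec rm -f {} \; || return 1
    done
}
```

Keeping the tar file means the old Events, Alarms and Reports can be restored if the clean store makes no difference.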


    If the problem is still not resolved, uninstall CAM:
    /var/opt/CommonArrayManager/<your version of CAM>/bin/uninstall
    cd /var/opt
    rm -rf CommonArrayManager SUNWse6130ui SUNWsefms

    Install CAM again and register the array.


    Regards.
  • 5. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Hi Nik,

    I've tried the first suggestion but I still end up with Degraded status.

    The Events show...
    # grep descr /var/opt/SUNWsefms/store/Events/event[0-9]*
    event2: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",
    event4: "description":"Diagnostic Test Array Communication Test ran on nbukmtr1-stk with Passed result.",
    event5: "description":"Lost out-of-band communication with 2540 nbukmtr1-stk\n:OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::",
    event7: "description":"Finished monitoring run for device. Error generating report.",

    The Reports show...
    # cat /var/opt/SUNWsefms/store/Reports/error[0-9]*
    rc._Type=0
    rc.commType=oob
    rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
    rc.iteration=4
    rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

    The Alarms show...
    # cat /var/opt/SUNWsefms/store/Alarms/alarm1
    <object class='com.sun.netstorage.fm.storade.agent.service.alarm.AlarmBean'>
    <var id='AcknowledgedBy'>PJB</var>
    <var id='GridId'>2540.CommunicationLostEvent.oob</var>
    <var id='EventId'>1307711252733.85a6ff2e.1</var>
    <var id='DeviceType'>2540</var>
    <var id='RecommendedAction'>@event.CommunicationLostEvent.oob.action</var>
    <var id='DeviceIP'>nbukmtr1-stk</var>
    <var id='ProbableCause'>@event.CommunicationLostEvent.oob.cause</var>
    <var id='Severity'>3</var>
    <var id='DeviceName'>nbukmtr1-stk</var>
    <var id='Description'>Lost out-of-band communication with 2540 nbukmtr1-stk
    :OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::</var>
    <var id='DateCreated'>1307711252781</var>
    <var id='CorrelatedEvents'><item>1307711552831.85a6ff2e.4</item></var>
    <var id='GridCode'>70.12.31</var>
    <var id='State'>1</var>
    <var id='AutoClear'>true</var>
    <var id='DeviceKey'>SUN.540-7198-02.1106BE90B0</var>
    <var id='Info'>Lost out-of-band communication with 2540 nbukmtr1-stk
    :OSGi.com.sun.storage.cam.agent(com.sun.netstorage.fm.storade.agent.Messages):monitor.CommunicationLost.oob.desc:S17:2540 nbukmtr1-stk:S0::</var>
    <var id='Id'>alarm1</var>
    <var id='ComponentId'>oob</var>
    <var id='ComponentName'>OutOfBand</var>
    </object>
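
Since the alarm file above is simple XML with one `<var id='...'>` element per line, single values can be pulled out with sed when comparing alarms between runs. A minimal sketch, assuming that one-element-per-line layout; `alarm_var` is a hypothetical helper, and multi-line values such as Description are not handled.

```shell
#!/bin/sh
# Print the text of <var id='NAME'>...</var> from alarm XML on stdin.
# Works only when the opening and closing tags are on the same line.
alarm_var() {
    sed -n "s/.*<var id='$1'>\(.*\)<\/var>.*/\1/p"
}
```

For example, `alarm_var GridCode < /var/opt/SUNWsefms/store/Alarms/alarm1` would print the grid code that ties the alarm back to its event.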


    I've run the Controller Communication Test again and it still passes. So I'll try re-installing CAM when I can get some free time on the server (next reply maybe in a week or two's time though!).

    Thanks for all your help
    Pete

    P.S. As requested...
    # ls -la /var/opt/SUNWsefms/store/Reports
    total 8
    drwxr-xr-x 2 root root 512 Jun 10 14:17 .
    drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
    -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
    -rw-r--r-- 1 root root 102 Jun 10 14:17 error.keys
    -rw-r--r-- 1 root root 180 Jun 10 14:17 error1
    -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys
  • 6. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    Nik Expert
    Hi.

    Try unregistering and re-registering the array again.
    Please show the output of ls -la /var/opt/SUNWsefms/store/Reports after this.


    Regards.
  • 7. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Hi Nik,

    Here's the info...

    BEFORE...
    root@nbukmtr1 0 /root # ls -la /var/opt/SUNWsefms/store/Reports
    total 8
    drwxr-xr-x 2 root root 512 Jun 14 09:27 .
    drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
    -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
    -rw-r--r-- 1 root root 102 Jun 14 09:27 error.keys
    -rw-r--r-- 1 root root 183 Jun 14 09:27 error1
    -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys

    root@nbukmtr1 2 /root # cat /var/opt/SUNWsefms/store/Reports/error1
    rc._Type=0
    rc.commType=oob
    rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
    rc.iteration=1097
    rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

    From CAM gui, Health is "Degraded".


    Removed (@09:32am), then registered (@09:37am) nbukmtr1 via IP address and left for >15 minutes.


    AFTER...
    root@nbukmtr1 0 /root # ls -la /var/opt/SUNWsefms/store/Reports
    total 8
    drwxr-xr-x 2 root root 512 Jun 14 10:02 .
    drwxr-xr-x 15 root root 512 Jun 10 14:08 ..
    -rw-r--r-- 1 root root 0 Jun 10 14:02 .chassis.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:03 cache.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 chassis.keys
    -rw-r--r-- 1 root root 102 Jun 14 10:02 error.keys
    -rw-r--r-- 1 root root 183 Jun 14 10:02 error1
    -rw-r--r-- 1 root root 0 Jun 14 09:32 inventory.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 report.keys
    -rw-r--r-- 1 root root 0 Jun 10 14:02 sasdomain.keys

    root@nbukmtr1 0 /root # cat /var/opt/SUNWsefms/store/Reports/error1
    rc._Type=0
    rc.commType=oob
    rc.error=java.lang.ArrayIndexOutOfBoundsException: 0
    rc.iteration=1105
    rc._Trace=Unable to create report due to java.lang.ArrayIndexOutOfBoundsException: 0

    From CAM gui, Health is "Lost Communication" and there are no Alarms listed.

    The last time I un-registered/re-registered the array, I think I got the "Lost Communication" state initially and again when I tried to manually "Run Agent" from the "General Health Monitoring Setup" page. Then it went to Degraded.

    The oob "Array Communication Test" still works fine.

    Thanks for taking another look.
  • 8. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    Nik Expert
    Hi.
    You can install one more CAM on another system (or, for example, in a zone on the same server; use a whole-root zone, not a sparse one).

    If you have the problem on the fresh CAM too, you need to analyze the array and the network.

    If the problem is resolved, you can uninstall CAM on the main server, clean the old logs and CAM config out of /var, and install a fresh CAM.


    CAM is available for many platforms (Windows, Linux, Solaris) on MOS.


    Regards.
  • 9. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Hi Nik,

    I've tried a new CAM install on a new (remote) server, but exactly the same thing happens so I'm getting the network team to check things out with the STK2540's management connection. If they don't find anything wrong, is installing controller firmware the next thing to try? I've managed to get hold of a slightly newer one (145965-03), so I could go from 07.35.55.10 to 07.35.55.11. Although I can't see it in the manual, I'm assuming that this would need all data presented to be unmounted and would not lose array config.

    After that, would performing a "Reset Configuration" be worth a try? I'd lose my data, but I'm not sure where to go from here (other than allowing my sscs monitoring script to ignore "OK", "Lost Communication" or "Degraded" controller states).

    Thanks again
  • 10. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    Nik Expert
    Hi.
    Try getting the support data via the CLI.

    /opt/SUNWsefms/bin/supportData

    example: supportData -d Array-15 -p /tmp -o outputfile

    Unzip the result file.

    Check the array profile file for unprintable characters.
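
One way to do that check from the shell, as a sketch: list every line that contains a byte outside printable ASCII and ordinary whitespace. The function name is invented for this example, and the file name in the usage line is taken from the follow-up reply.

```shell
#!/bin/sh
# Print "line-number:line" for each line of file $1 that contains a byte
# which is neither printable ASCII nor ordinary whitespace.
# LC_ALL=C makes the character classes byte-based. Exit status is that of
# grep: 0 if something was found, 1 if the file is clean.
# Note: some grep versions report "Binary file ... matches" instead of the
# lines themselves when the file contains NUL bytes.
find_unprintable() {
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
}
```

For example, `find_unprintable storageArrayProfile.txt` prints nothing and returns 1 when the profile contains only normal characters.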


    CAM 6.7 is more sensitive to them.


    Regards.
  • 11. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Hi Nik,

    supportData command works OK. storageArrayProfile.txt shows no unusual chars, but the summary indicates that there is a 2nd controller present whose status is "removed".
    ...
    SUMMARY------------------------------
    Number of controllers:           2
    Controller redundancy mode:      DUPLEX
    Needs Attention flag:            true
    Is Fixing flag:                  false
    ...
    Controllers------------------------------
    Number of controllers: 2

    Tray.85.Controller.A
    Status: Optimal
    ...
    Tray.85.Controller.B
    Status: Removed
    ...

    I've no idea what the array config was before it was delivered to us, so maybe it had controller B and someone took it out before shipping. I've tried using "Service Advisor" option in CAM to try to set a single controller config but I get the error "The device has been unregistered from this application. Please close this service advisor window.".

    Checking stateCaptureData.dmp shows the connections being lost but not obviously why (there's a lot of detail but I must admit I don't understand it!).

    In the meantime, I've changed my monitoring script to ignore array health and only check controller A and disk health to get around this issue.
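
A per-controller check of that kind could also be driven from the array profile excerpt above. The sketch below assumes the layout shown there (a "Tray.N.Controller.X" line followed later by a "Status:" line); the real storageArrayProfile.txt has more content between those lines, and `controller_status` is a made-up name, not the poster's actual script.

```shell
#!/bin/sh
# Print "controller status" pairs from array profile text on stdin, e.g.
#   Tray.85.Controller.A Optimal
# Each controller name is paired with the first Status: line after it.
controller_status() {
    awk '
        /^[ \t]*Tray\.[0-9]+\.Controller\.[AB][ \t]*$/ {
            sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")
            name = $0
            next
        }
        name != "" && /^[ \t]*Status:/ { print name, $2; name = "" }
    '
}
```

A monitoring script could then alert only if the line for Controller.A is anything other than Optimal, and ignore the Removed state of controller B.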

    Now that I've arranged some downtime, I'll try firmware upgrade and maybe reset configuration if I get time. Any other suggestions more than welcome.

    Thanks
  • 12. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    861764 Newbie
    Hi,

    We are also facing the same "Lost communication" issue in CAM 6.7.0.12.
    After trying many workarounds, we reached the conclusion that this is a bug in CAM v6.7.0.12 and that it will be fixed in the next release.

    For now the workaround is to manage the array with in-band management; after switching to in-band, the problem in our environment was resolved.

    Please download the in-band management docs.

    http://download.oracle.com/docs/cd/E19377-01/821-1362-10/821-1362-10.pdf


    Thanks
    Rajeev
  • 13. Re: STK2540 in Degraded state and LUN 31 Corrupt label, wrong magic number
    854744 Newbie
    Update:
    This array has finally come under our support agreement so I recently logged a support request. Sun's recommendation was to...
    1) apply the simplex firmware (/opt/SUNWsefms/bin/lsscs modify -a <array> -t system -p <.../fw/images/nge/N1932-735843-902.dlp> -c system firmware),
    2) reset the controller,
    3) change the redundancy to simplex (/opt/SUNWsefms/bin/service -a <array> -c set -q redundancy -t simplex)

    I haven't tried it out yet (and there's still the slight possibility we may get the controller B which used to be in this array), but will update if I ever get to have a go.

    Thanks again for all your help!
    Pete
