Sun Sparc Enterprise T5120 Raid Degraded disk replacement -corrupt label; wrong magic number

4135526
4135526 Member Posts: 8
edited Sep 15, 2020 6:34AM in SPARC Servers

Hello.

We had a failed disk and replaced it with an on-site spare.

After replacing the disk, the RAID volume still shows as degraded:

***                     136.6G  N/A     DEGRADED OFF    RAID1
                0.4.0   136.6G          GOOD
                N/A     136.6G          FAILED

From /var/adm/messages on the day we replaced the disk:

Jun 23 08:17:23 ******* SC Alert: [ID 457296 daemon.notice] IPMI | minor: ID =  11c : 06/23/2020 : 04:10:05 : Entity Presence : /HDD1/PRSNT : Device Absent

Jun 23 08:17:25 ******* SC Alert: [ID 248681 daemon.notice] IPMI | minor: ID =  11d : 06/23/2020 : 04:10:17 : Entity Presence : /HDD1/PRSNT : Device Present

Jun 23 08:17:37 ******* scsi: [ID 193665 kern.info] sd0 at mpt0: target 5 lun 0

Jun 23 08:17:37 ******* genunix: [ID 936769 kern.info] sd0 is /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0

Jun 23 08:17:37 ******* scsi: [ID 107833 kern.warning] WARNING: /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0 (sd0):

Jun 23 08:17:37 *******     Corrupt label; wrong magic number

Jun 23 08:17:37 ******* genunix: [ID 408114 kern.info] /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0 (sd0) online

Jun 23 08:17:37 ******* scsi: [ID 107833 kern.warning] WARNING: /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0 (sd0):

Jun 23 08:17:37 *******      Corrupt label; wrong magic number

Jun 23 08:17:52 ******* SC Alert: [ID 721165 daemon.notice] Audit | minor: root : Open Session : object = /session/type : value = shell : success

Jun 23 08:17:54 ******* SC Alert: [ID 624538 daemon.error] Chassis | major: Hot insertion of HDD1

Does anybody have an idea how to proceed?

Thank you in advance.

Best Answer

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 50 Employee
    edited Aug 25, 2020 7:04AM Answer ✓

    Hello 4135526,

    From the format output, it looks like the disk is configured as a standalone drive (not part of the RAID configuration):

    AVAILABLE DISK SELECTIONS:

    0. c1t0d0 <LSILOGIC-LogicalVolume-3000 cyl 65533 alt 2 hd 16 sec 273>

    /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0

    1. c1t2d0 <LSILOGIC-LogicalVolume-3000 cyl 65533 alt 2 hd 16 sec 273>

    /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0

    2. c1t5d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>          <<<<<<<<<<<<<<<<<<<

    /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0

    In format you should only see c1t0d0 and c1t2d0, which are the two RAID 1 volumes, each consisting of two disks.

    The same applies to iostat.

    Unfortunately, I cannot copy content out of MOS (My Oracle Support) and paste it here to help you with the document.

    In my opinion, what needs to be done next is to unconfigure drive c1t5d0 and remove it physically. You can identify it beforehand by running format/analyze/read, which makes its activity LED blink faster. Then reinsert it and let the RAID controller do its job: it takes over the disk and places it back in the configuration (once the disk is inserted, the resync process starts automatically).

    You asked earlier about making the light on the HDD blink. That is achievable only if you use MegaRAID Storage Manager, which makes both disks that are part of the volume blink, if that helps.

    More details about it can be found here: https://docs.oracle.com/cd/E19203-01/820-4933-15/820-4933-15.pdf

    Typically the targets on a raidctl command should be like:

    [email protected] # raidctl

    Controller: 1

      Volume:c1t0d0

        Disk: 0.0.0    <<< HDD0

        Disk: 0.1.0    <<< HDD1

        Disk: 0.2.0    <<< HDD2

        Disk: 0.3.0    <<< HDD3

    In your case, however, the targets appear to be numbered from 1 to 5, with 0.0.0 and 0.2.0 missing. Could we take a closer look at the following outputs?

    raidctl -l c1t0d0

    raidctl -l c1t2d0

    raidctl -l -g 0.1.0 1

    raidctl -l -g 0.3.0 1

    raidctl -l -g 0.4.0 1

    raidctl -l -g 0.5.0 1

    Best regards,

    Claudiu
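
The check described above (only LSI logical volumes should appear in format) can be sketched in Python. This is a minimal illustration, not an Oracle tool: the sample text is copied from the listing quoted in this post with the device-path lines omitted, and the function name `standalone_disks` and the regular expression are assumptions made for this example.

```python
import re

# Sample text copied from the format listing quoted in this post
# (device-path lines omitted).
FORMAT_OUTPUT = """\
0. c1t0d0 <LSILOGIC-LogicalVolume-3000 cyl 65533 alt 2 hd 16 sec 273>
1. c1t2d0 <LSILOGIC-LogicalVolume-3000 cyl 65533 alt 2 hd 16 sec 273>
2. c1t5d0 <SUN146G cyl 14087 alt 2 hd 24 sec 848>
"""

# Matches "N. cXtYdZ <label ..." and captures the device name and the label.
ENTRY = re.compile(r"^\s*\d+\.\s+(c\d+t\d+d\d+)\s+<(\S+)")


def standalone_disks(format_text):
    """Return device names whose label is not an LSI logical volume."""
    found = []
    for line in format_text.splitlines():
        match = ENTRY.match(line)
        if match and "LogicalVolume" not in match.group(2):
            found.append(match.group(1))
    return found


print(standalone_disks(FORMAT_OUTPUT))  # ['c1t5d0']
```

Any device returned here is visible to the OS as a plain disk rather than through a hardware RAID volume, which is the situation flagged for c1t5d0.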


Answers

  • Nik
    Nik Blocked Member Posts: 2,879 Bronze Crown
    edited Aug 18, 2020 7:32AM

    Hi.

    1.  Read the service manual.

    https://docs.oracle.com/cd/E19839-01/820-2181-15/820-2181-15.pdf

    2.  Show the current state of the RAID, plus the output of the commands format and cfgadm -al.

    3.  Try reseating the new disk according to the service manual. You should unconfigure the disk before removing it.

        After removing the disk, wait at least 30 seconds before installing it again.

    At the moment the system detects the new disk as a standalone disk.

    The message "Corrupt label; wrong magic number" means that the new disk does not have a label (partition table).

    Regards,

       Nik
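
To see which disk instances the kernel reported as unlabeled, the messages file can be scanned for the warning discussed above. The following Python sketch is only an illustration: the sample lines follow the shape of the log lines quoted in this thread, `<device-path>` and `host` are placeholders rather than real values, and the function name `unlabeled_instances` is made up for this example.

```python
import re

# Abbreviated sample in the shape of the /var/adm/messages lines quoted in
# this thread; "<device-path>" and "host" are placeholders, not real values.
MESSAGES = """\
Jun 23 08:17:37 host scsi: [ID 107833 kern.warning] WARNING: <device-path> (sd0):
Jun 23 08:17:37 host     Corrupt label; wrong magic number
Jun 23 08:17:37 host genunix: [ID 408114 kern.info] <device-path> (sd0) online
"""

# A WARNING line ends with the driver instance in parentheses, e.g. "(sd0):".
WARNING = re.compile(r"WARNING: .*\((\w+)\):\s*$")


def unlabeled_instances(log_text):
    """Return driver instances (e.g. 'sd0') warned about a corrupt label."""
    hits = []
    pending = None  # instance named by the immediately preceding WARNING line
    for line in log_text.splitlines():
        match = WARNING.search(line)
        if match:
            pending = match.group(1)
        elif pending and "Corrupt label; wrong magic number" in line:
            if pending not in hits:
                hits.append(pending)
            pending = None
        else:
            pending = None
    return hits


print(unlabeled_instances(MESSAGES))  # ['sd0']
```

The sketch assumes the warning text always follows directly on the line after the WARNING header, as it does in the excerpt above.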

  • 4135526
    4135526 Member Posts: 8
    edited Aug 18, 2020 8:59AM

    This is the output of cfgadm -al:

    c1::dsk/c1t5d0                 disk         connected    configured   unknown

    c2                             fc-private   connected    configured   unknown

    c2::203700a0b85bcde8           disk         connected    configured   unknown

    c3                             fc-private   connected    configured   unknown

    c3::203800a0b85bcde8           unavailable  connected    configured   failed

    usb0/1                         unknown      empty        unconfigured ok

    usb0/2                         unknown      empty        unconfigured ok

    usb0/3                         unknown      empty        unconfigured ok

    usb1/1                         unknown      empty        unconfigured ok

    usb1/2                         unknown      empty        unconfigured ok

    usb2/1                         unknown      empty        unconfigured ok

    usb2/2                         usb-storage  connected    configured   ok

    usb2/3                         unknown      empty        unconfigured ok

    usb2/4                         usb-hub      connected    configured   ok

    usb2/4.1                       unknown      empty        unconfigured ok

    usb2/4.2                       unknown      empty        unconfigured ok

    usb2/4.3                       unknown      empty        unconfigured ok

    usb2/4.4                       unknown      empty        unconfigured ok

    usb2/5                         unknown      empty        unconfigured ok

  • Nik
    Nik Blocked Member Posts: 2,879 Bronze Crown
    edited Aug 18, 2020 9:17AM

    Hi.

    You showed the output of only 1 of the 3 commands.

    I understand that it is a big secret, but my crystal ball is broken and will not tell me what is happening with your system...

    Regards,

       Nik

  • 4135526
    4135526 Member Posts: 8
    edited Aug 18, 2020 10:25AM

    Maybe you meant this:

    cfgadm -al

    c3::203800a0b85bcde8 unavailable  connected    configured                  failed

    cfgadm -c unconfigure  c3::203800a0b85bcde8

    At this point, reseat the disk or replace it.

    # cfgadm -c configure c3::203800a0b85bcde8

    Afterwards, run cfgadm -al to check that everything is fine.

    Is that correct, or am I missing something?

  • Nik
    Nik Blocked Member Posts: 2,879 Bronze Crown
    edited Aug 18, 2020 4:18PM

    Hi.

    You are hiding the command output again, so we are trying to resolve something that is not the real problem.

    c3                                fc-private   connected    configured                    unknown

    c3::203800a0b85bcde8 unavailable  connected    configured                  failed

    c3 is fc-private, which means that c3 is an FC controller.

    So c3::203800a0b85bcde8 is some LUN from an FC array that was configured at some point but is not available at the moment.

    That is a separate issue. It may or may not be a problem. Only you know what else is connected to this server.

    cfgadm -alv  -  shows all controllers with their device paths.

    cfgadm -al -o show_FCP_dev  -  shows all LUNs behind each FC target.

    format  -  shows which disks are available at the moment.

    raidctl -l <disk>  -  shows the state of the RAID volume.

    LUNs from an FC array can have more than one path, so one faulted path does not cause loss of access to the disk.

    The WWNs of the devices on c2 and c3 differ only slightly, so these are two paths to one array. One path is broken, but this has nothing to do with the internal disks.

    c2::203700a0b85bcde8   disk       connected    configured                    unknown

    c3::203800a0b85bcde8 unavailable  connected    configured                  failed

    Use luxadm display /dev/rdsk/<disk>s2 to check the state of the device paths.

    For the internal disk:

      Check the current state of the RAID. If the disk is not detected, remove HDD1, wait 30 seconds, and install it again.

    The first time, you swapped the disk within 2 seconds. That may have caused the current problem.

    Regards,

       Nik.
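
The WWN comparison made above can be written out as a short Python sketch. It is a heuristic illustration only: the rows are copied from the cfgadm -al output in this thread, the helper names `fc_targets` and `differing_digits` are invented for this example, and a one-digit difference merely suggests two ports of the same array; luxadm display remains the way to confirm the paths.

```python
# Sample rows copied from the cfgadm -al output quoted in this thread.
CFGADM_ROWS = """\
c2::203700a0b85bcde8   disk         connected    configured   unknown
c3::203800a0b85bcde8   unavailable  connected    configured   failed
"""

HEX_DIGITS = "0123456789abcdef"


def fc_targets(cfgadm_text):
    """Map each 16-hex-digit WWN to (controller, condition)."""
    targets = {}
    for line in cfgadm_text.splitlines():
        fields = line.split()
        if len(fields) == 5 and "::" in fields[0]:
            ctrl, wwn = fields[0].split("::", 1)
            if len(wwn) == 16 and all(c in HEX_DIGITS for c in wwn):
                targets[wwn] = (ctrl, fields[4])
    return targets


def differing_digits(wwn_a, wwn_b):
    """Count hex-digit positions where two equal-length WWNs differ."""
    return sum(a != b for a, b in zip(wwn_a, wwn_b))


targets = fc_targets(CFGADM_ROWS)
wwn_c2, wwn_c3 = sorted(targets)
print(targets[wwn_c3])                   # ('c3', 'failed')
print(differing_digits(wwn_c2, wwn_c3))  # 1
```

Here the two WWNs differ in a single hex digit and only the c3 entry is in the failed condition, which matches the reading that one of two paths to the same array is down.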

  • 4135526
    4135526 Member Posts: 8
    edited Aug 20, 2020 6:01AM

    First of all, Nik, thank you for the help.

    Here is the output:

    raidctl

    Controller: 1

    Volume:c1t0d0

    Volume:c1t2d0

    Disk: 0.1.0

    Disk: 0.3.0

    Disk: 0.4.0

    Disk: 0.5.0

    raidctl -l c1t0d0

    Volume                  Size    Stripe  Status   Cache  RAID
            Sub                     Size                    Level
                    Disk
    ----------------------------------------------------------------
    c1t0d0                  136.6G  N/A     DEGRADED OFF    RAID1
                    0.4.0   136.6G          GOOD
                    N/A     136.6G          FAILED

    root# raidctl -l c1t2d0

    Volume                  Size    Stripe  Status   Cache  RAID
            Sub                     Size                    Level
                    Disk
    ----------------------------------------------------------------
    c1t2d0                  136.6G  N/A     OPTIMAL  OFF    RAID1
                    0.3.0   136.6G          GOOD
                    0.1.0   136.6G          GOOD

    cfgadm -al

    Ap_Id Type         Receptacle Occupant     Condition

    c1 scsi-bus     connected configured   unknown

    c1::dsk/c1t0d0 disk connected    configured   unknown

    c1::dsk/c1t2d0 disk connected    configured   unknown

    c1::dsk/c1t5d0 disk connected    configured   unknown

    c2 fc-private   connected    configured unknown

    c2::203700a0b85bcde8 disk connected    configured   unknown

    c3 fc-private   connected    configured unknown

    c3::203800a0b85bcde8 unavailable  connected    configured   failed

    usb0/1 unknown empty        unconfigured ok

    usb0/2 unknown empty        unconfigured ok

    usb0/3 unknown empty        unconfigured ok

    usb1/1 unknown empty        unconfigured ok

    usb1/2 unknown empty        unconfigured ok

    usb2/1 unknown empty        unconfigured ok

    usb2/2 usb-storage  connected    configured   ok

    usb2/3 unknown empty        unconfigured ok

    usb2/4 usb-hub      connected configured   ok

    usb2/4.1 unknown empty        unconfigured ok

    usb2/4.2 unknown empty        unconfigured ok

    usb2/4.3 unknown empty        unconfigured ok

    usb2/4.4 unknown empty        unconfigured ok

    usb2/5 unknown empty        unconfigured ok

    Anyway, what I am asking is this: we replaced the disk, but c1t0d0, which is RAID 1 as you can see, is still in degraded status.

    I don't know which disk was replaced. Is there a command to blink 0.5.0, which I think is the N/A one?

    Once I know the location, I can just remove it or follow the process from the manual you provided.
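
For reference, the volume and member states in the raidctl -l output above can be pulled out with a small Python sketch. It is an illustration under stated assumptions: the rows are copied from this post with the header omitted, the parser relies on whitespace-separated fields as shown here, and the names `parse_volumes` and `problems` are made up for this example.

```python
import re

# Rows copied from the raidctl -l output quoted in this post (header omitted).
RAIDCTL_ROWS = """\
c1t0d0 136.6G  N/A     DEGRADED OFF    RAID1
0.4.0   136.6G GOOD
N/A 136.6G          FAILED
c1t2d0 136.6G  N/A     OPTIMAL  OFF RAID1
0.3.0   136.6G GOOD
0.1.0   136.6G GOOD
"""

VOLUME = re.compile(r"^c\d+t\d+d\d+$")


def parse_volumes(text):
    """Return {volume: {"status": str, "disks": [(disk_id, state), ...]}}."""
    volumes = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if VOLUME.match(fields[0]) and len(fields) >= 4:
            # Volume row: name, size, stripe size, status, cache, RAID level.
            current = fields[0]
            volumes[current] = {"status": fields[3], "disks": []}
        elif current and len(fields) == 3:
            # Member row: disk ID, size, state.
            volumes[current]["disks"].append((fields[0], fields[2]))
    return volumes


def problems(volumes):
    """Map each non-OPTIMAL volume to its members that are not GOOD."""
    report = {}
    for name, info in volumes.items():
        if info["status"] != "OPTIMAL":
            report[name] = [d for d in info["disks"] if d[1] != "GOOD"]
    return report


print(problems(parse_volumes(RAIDCTL_ROWS)))
# {'c1t0d0': [('N/A', 'FAILED')]}
```

The failed member of c1t0d0 has no disk ID at all (N/A), so the output alone does not say which physical slot it occupies.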

  • Nik
    Nik Blocked Member Posts: 2,879 Bronze Crown
    edited Aug 20, 2020 10:08AM

    Hi.

    According to the system messages, the disk that was replaced is HDD1. As I understand it, that is disk 0.1.0, but that disk is already OK.

    You should replace disk 0.5.0.

    Check which disks are really present in the system.

    Regards,

      Nik.

  • 4135526
    4135526 Member Posts: 8
    edited Aug 21, 2020 8:00AM

    Hello Nik.

    All 4 disks are present in the system and all of them show a green light (although cfgadm -al shows only 3 disks).

    Is there a way to blink disk 0.5.0 in order to find and replace it?

  • Nik
    Nik Blocked Member Posts: 2,879 Bronze Crown
    edited Aug 21, 2020 10:25AM

    Hi.

    According to the document already provided, the command:

    cfgadm -c unconfigure c1::dsk/c1t5d0

    should unconfigure disk 0.5.0 from the system and light the blue LED on the disk.

    According to SPARC Enterprise T5x20 Device Paths (Doc ID 1352363.1),

    HDD5 has the path /[email protected]/[email protected]/[email protected]/[email protected]/[email protected],0

    Regards,

    Nik

  • ClaudiuO-Oracle
    ClaudiuO-Oracle Member Posts: 50 Employee
    edited Aug 24, 2020 10:59AM

    Hello 4135526,

    My opinion about the issue:

    - You should refer to How to replace an internal disk in a volume under LSI RAID controller (Doc ID 1395234.1), if you have access to it.

    - From the outputs provided, it looks like you replaced the disk in slot 1 (HDD1), which is part of a healthy RAID 1 configuration.

    - The OS should only see the 2 volumes c1t0d0 and c1t2d0, because the disks are under HW RAID and should not be configured using cfgadm commands. In your case, however, we can also see c1t5d0.

    - raidctl reports target 0.5.0, which does not exist physically because this server has only 4 disks (am I wrong? Do you have the 8-disk backplane on this T5120?). This is how the controller treats a disk failure: it takes the failed disk out of the configuration and reports it under a nonexistent target.

    - let's take a look at the following outputs:

    # format

    # iostat -En

    # uptime

    Best regards,

    Claudiu
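
The question of whether a reported target can correspond to a physical bay can be checked mechanically. The Python sketch below is hypothetical: it assumes disk IDs have the form channel.target.lun and that bay N maps to target N (as in the typical listing with HDD0 at 0.0.0), and the function name `targets_outside_bays` is invented for this example. The disk IDs are the ones from the raidctl listing earlier in this thread.

```python
# Disk IDs from the raidctl listing quoted earlier in this thread.
DISK_IDS = ["0.1.0", "0.3.0", "0.4.0", "0.5.0"]


def targets_outside_bays(disk_ids, bay_count):
    """Return disk IDs whose target number is not a valid bay index.

    Assumes IDs have the form channel.target.lun and that bay N maps to
    target N; both are assumptions for this sketch.
    """
    outside = []
    for disk_id in disk_ids:
        _channel, target, _lun = (int(part) for part in disk_id.split("."))
        if not 0 <= target < bay_count:
            outside.append(disk_id)
    return outside


print(targets_outside_bays(DISK_IDS, 4))  # ['0.4.0', '0.5.0']
print(targets_outside_bays(DISK_IDS, 8))  # []
```

Under the 4-bay assumption the check also flags 0.4.0, which raidctl reports as GOOD, so the bay count is worth confirming before drawing conclusions from the target numbers.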