5 Replies Latest reply: Feb 6, 2014 3:59 AM by user8500769

    clusterware not starting up on one node error:PROT-602: Failed to retrieve data from the cluster registry

    user8500769

      Hi,

       

      I have a 11gR2 grid infra on rhel5.

       

      I am not able to start up CRS on one node, but on the other node it is working fine.

      {code}

      [root@rac1 bin]# ./ocrcheck

      PROT-602: Failed to retrieve data from the cluster registry

      PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=8, opn=kgfolclcpi1, dep=204, loc=kgfokge

      AMDU-00204: Disk N0003 is in currently mounted diskgroup CRS

      AMDU-00201: Disk N0003: 'ORCL:VOTE1'

      ] [8]

      [root@rac1 bin]# ./crsctl check crs

      CRS-4638: Oracle High Availability Services is online

      CRS-4535: Cannot communicate with Cluster Ready Services

      CRS-4529: Cluster Synchronization Services is online

      CRS-4533: Event Manager is online

      [root@rac1 bin]# ./crs_stat -t

      CRS-0184: Cannot communicate with the CRS daemon.

       

      [root@rac1 bin]#

      {code}

       

      Getting an error in ocssd.log

      {code}

      2014-01-29 21:10:13.288: [CSSD][2997357456]clssgmEvtInformation: reqtype (11) cmProc (0x9d02110) client ((nil))
      2014-01-29 21:10:13.288: [CSSD][2997357456]clssgmEvtInformation: reqtype (11) req (0x9c1aa70)
      2014-01-29 21:10:13.288: [CSSD][2997357456]clssnmQueueNotification: type (11) 0x9c1aa70
      2014-01-29 21:10:13.999: [GPnP][3046368960]clsgpnpm_newWiredMsg: [at clsgpnpm.c:633] Msg-reply has soap fault 10 (Operation returned Retry (error CLSGPNP_CALL_AGAIN)) [uri "http://www.grid-pnp.org/2005/12/gpnp-errors#"]
      2014-01-29 21:10:13.999: [GPnP][3046368960]clsgpnpm_newWiredMsg: [at clsgpnpm.c:663] Result: (31) CLSGPNP_CALL_AGAIN. Received gpnp meg-reply 0x9cebdb0 with soap fault, reason 'Operation returned Retry (error CLSGPNP_CALL_AGAIN)'(10) 10
      2014-01-29 21:10:13.999: [GPnP][3046368960]clsgpnpm_receiveMsg: [at clsgpnpm.c:2954] Result: (31) CLSGPNP_CALL_AGAIN. error in gpnp-msg from 'ipc' buf=0x9ce9af8
      2014-01-29 21:10:13.999: [GPnP][3046368960]clsgpnpm_exchange: [at clsgpnpm.c:1198] Result: (31) CLSGPNP_CALL_AGAIN. Error soap response from url "ipc://GPNPD_rac1",msg=0x9cebdb0 rdom=0x9d58c40 try=1/500

      {code}

       

      crsd.log file

      {code}

      2014-02-03 11:22:40.021: [GPnP][3046098640]clsgpnpwu_walletfopen: [at clsgpnpwu.c:494] Opened SSO wallet: '/raczone/grid_home/gpnp/rac1/wallets/peer/cwallet.sso'
      2014-02-03 11:22:40.021: [GPnP][3046098640]clsgpnp_getCK: [at clsgpnp0.c:1965] Result: (0) CLSGPNP_OK. Get gpnp wallet - provider 1 of 2 (LSKP-FSW(1))
      2014-02-03 11:22:40.021: [GPnP][3046098640]clsgpnp_getCK: [at clsgpnp0.c:1982] Got gpnp security keys (wallet).>
      2014-02-03 11:22:40.021: [GPnP][3046098640]clsgpnp_Init: [at clsgpnp0.c:837] GPnP client pid=5545, tl=3, f=0

      2014-02-03 11:22:40.058: [GIPCXCPT][3046098640] gipcShutdownF: skipping shutdown, count 2, from [ clsinet.c : 1732], ret gipcretSuccess (0)

      2014-02-03 11:22:40.060: [GIPCXCPT][3046098640] gipcShutdownF: skipping shutdown, count 1, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)

      2014-02-03 11:22:40.069: [  OCRASM][3046098640]proprasmo: Error in open/create file in dg [CRS]

      [  OCRASM][3046098640]SLOS : SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge

      ORA-15077: could not locate ASM instance serving a required diskgroup

       

      2014-02-03 11:22:40.071: [  OCRASM][3046098640]proprasmo: kgfoCheckMount returned [7]

      2014-02-03 11:22:40.071: [  OCRASM][3046098640]proprasmo: The ASM instance is down

      2014-02-03 11:22:40.071: [  OCRRAW][3046098640]proprioo: Failed to open [+CRS]. Returned proprasmo() with [26]. Marking location as UNAVAILABLE.

      2014-02-03 11:22:40.071: [  OCRRAW][3046098640]proprioo: No OCR/OLR devices are usable

      2014-02-03 11:22:40.071: [  OCRASM][3046098640]proprasmcl: asmhandle is NULL

      2014-02-03 11:22:40.071: [  OCRRAW][3046098640]proprinit: Could not open raw device

      2014-02-03 11:22:40.071: [  OCRASM][3046098640]proprasmcl: asmhandle is NULL

      2014-02-03 11:22:40.072: [  OCRAPI][3046098640]a_init:16!: Backend init unsuccessful : [26]

      2014-02-03 11:22:40.072: [  CRSOCR][3046098640] OCR context init failure.  Error: PROC-26: Error while accessing the physical storage ASM error [SLOS: cat=7, opn=kgfoAl06, dep=15077, loc=kgfokge

      ORA-15077: could not locate ASM instance serving a required diskgroup

      ] [7]

      2014-02-03 11:22:40.072: [    CRSD][3046098640][PANIC] CRSD exiting: Could not init OCR, code: 26

      2014-02-03 11:22:40.072: [    CRSD][3046098640] Done.

       


      {code}

       

       

      Any suggestion will be highly appreciated.

        • 1. Re: clusterware not starting up on one node error:PROT-602: Failed to retrieve data from the cluster registry
          Vandana B -Oracle

          Hi,

           

          Verify the following for the underlying OCR device:

           

          - Whether it is presented to the server and accessible from the node that is currently not coming up

          - If it is available and presented, then check the permissions on the underlying OCR device on that node

           

          - Also, looking at the ocrcheck error, it appears the underlying device has been presented to a different diskgroup. Please refer to the document mentioned below for the same.

           

          Ref: ORA-15003 DISKGROUP "DATA" ALREADY MOUNTED IN ANOTHER LOCK NAME SPACE (Doc ID 1531734.1)

           

           

           

          Steps from the above mentioned document

           

          Check whether heartbeat is detected for the failed diskgroup.

          1. Use the AMDU tool before mounting the diskgroup in question.

          $ amdu -diskstring '/dev/asm/data*' -dump DATA  -nomap -nodir   | grep HEARTBEAT
          AMDU-00204: Disk N0001 is in currently mounted diskgroup DATA
          AMDU-00201: Disk N0001: '/dev/asm/data1'
          ** HEARTBEAT DETECTED **


          2. Use kfed. The heartbeat block is located in the last 4k block of aun=1.

           

          If the AU size for the affected diskgroup is 1M, the heartbeat can be checked in the following way:

          $ kfed read <device_path> aus=1048576 aun=1 blkn=255 | grep  kfbh.check  

          The same command for other AU sizes:

          $ kfed read <device_path> aus=4194304 aun=1 blkn=1023 | grep  kfbh.check  
          $ kfed read <device_path> aus=8388608 aun=1 blkn=2047 | grep  kfbh.check  
          $ kfed read <device_path> aus=16777216 aun=1 blkn=4095 | grep  kfbh.check
          $ kfed read <device_path> aus=33554432  aun=1 blkn=8191 | grep  kfbh.check
          $ kfed read <device_path> aus=67108864  aun=1 blkn=16383 | grep  kfbh.check

            <<< If the command above shows a different checksum value over time (say, at a 10-second interval), it indicates that the heartbeat block is being updated from a node other than this server.
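          The blkn values above all follow one rule: the heartbeat sits in the last 4k block of allocation unit 1, so blkn = (AU size / 4096) - 1. A small sketch of that arithmetic (the echoed command is illustrative only; `<device_path>` stays a placeholder):

          ```shell
#!/bin/sh
# Sketch: derive the kfed blkn for the heartbeat block from the AU size.
# The heartbeat is the last 4k block inside aun=1, so blkn = aus/4096 - 1
# (e.g. 255 for a 1M AU, 1023 for 4M). <device_path> is a placeholder.
aus=${1:-1048576}                 # AU size in bytes, 1M by default
blkn=$(( aus / 4096 - 1 ))        # last 4k block within the AU
echo "kfed read <device_path> aus=$aus aun=1 blkn=$blkn | grep kfbh.check"
          ```

          Running it with each AU size from the list above reproduces the blkn values shown (255, 1023, 2047, ...).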

           

          3. Use the OS command dd in the following way.

          $ dd if=<device_path> of=/tmp/1.dd bs=4096 skip=511

          10 seconds later, take another dd output with a different file name:

          $ dd if=<device_path> of=/tmp/2.dd bs=4096 skip=511 

          <<< 1M AU, skip=511 / 4M AU, skip=2047 / 8M AU, skip=4095 / 16M AU, skip=8191 / 32M AU, skip=16383 / 64M AU, skip=32767

          Then check whether they show different content:

             $ diff /tmp/1.dd /tmp/2.dd
             Binary files /tmp/1.dd and /tmp/2.dd differ            

          <<< If these two files show different content, it also indicates that the heartbeat block is being updated from a node other than this server.
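          The two dd reads and the diff can be wrapped in one small script. This is only a sketch, and two details are my assumptions: the skip arithmetic (skip = 2*AU/4096 - 1, i.e. the last 4k block before the end of aun=1, which matches the table above) and an added count=1 so that only the single heartbeat block is copied rather than the rest of the device:

          ```shell
#!/bin/sh
# Sketch of step 3 above: read the heartbeat block twice, 10 s apart,
# and diff the copies. skip = 2*AU/4096 - 1 reproduces the table above
# (511 for a 1M AU, 2047 for 4M, ... 32767 for 64M). count=1 is added
# so only the single 4k heartbeat block is copied.
DEV=${1:?usage: $0 <device_path> [au_size_bytes]}
AUS=${2:-1048576}
SKIP=$(( 2 * AUS / 4096 - 1 ))
dd if="$DEV" of=/tmp/hb1.dd bs=4096 skip="$SKIP" count=1 2>/dev/null
sleep 10
dd if="$DEV" of=/tmp/hb2.dd bs=4096 skip="$SKIP" count=1 2>/dev/null
if diff -q /tmp/hb1.dd /tmp/hb2.dd >/dev/null; then
    echo "heartbeat block unchanged"
else
    echo "heartbeat block updated by another node"
fi
          ```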

           

          4. If heartbeat is detected while the affected diskgroup is not mounted, it indicates that the diskgroup in question is currently mounted by another node, but it does not tell you which node. With the help of the storage admin team, find the node that is currently mounting the diskgroup in question, dismount it there, and then try to mount it on the second node.

           

          Regards,

          Vandana - Oracle

          • 2. Re: clusterware not starting up on one node error:PROT-602: Failed to retrieve data from the cluster registry
            user8500769

            HI,

             

            This error doesn't seem to be related to HEARTBEAT. However, I have executed the commands below, and as per the output it doesn't seem that any of the diskgroups has been mounted.

             

            RAC1, where the error is coming up:

             

            [grid@rac1 bin]$ ./asmcmd lsdsk -G DATA

            Connected to an idle instance.

             

            [grid@rac1 bin]$ ./amdu -diskstring="ORCL:*" -dump DATA  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0001 is in currently mounted diskgroup DATA

            AMDU-00201: Disk N0001: 'ORCL:DATA1'

            ** HEARTBEAT DETECTED **

            [grid@rac1 bin]$ ./amdu -diskstring="ORCL:*" -dump VOTE  -nomap -nodir   | grep HEARTBEAT

            AMDU-00210: No disks found in diskgroup VOTE

            [grid@rac1 bin]$ ./amdu -diskstring="ORCL:*" -dump FRA  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0002 is in currently mounted diskgroup FRA

            AMDU-00201: Disk N0002: 'ORCL:FRA1'

            ** HEARTBEAT DETECTED **

            [grid@rac1 bin]$ ./amdu -diskstring="ORCL:*" -dump CRS  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0003 is in currently mounted diskgroup CRS

            AMDU-00201: Disk N0003: 'ORCL:VOTE1'

            ** HEARTBEAT DETECTED **

            [grid@rac1 bin]$

             

             

             

            RAC2 (Here, clusterware is running properly)

             

            ORACLE_SID = [grid] ? +ASM2

            The Oracle base for ORACLE_HOME=/raczone/grid_home is /raczone/11.2.0

            [grid@rac2 bin]$ ./asmcmd lsdsk -G DATA

            Path

            ORCL:DATA1

             

             

            [grid@rac2 bin]$ ./amdu -diskstring="ORCL:*" -dump DATA  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0001 is in currently mounted diskgroup DATA

            AMDU-00201: Disk N0001: 'ORCL:DATA1'

            ** HEARTBEAT DETECTED **

            [grid@rac2 bin]$

             

             

             

            [grid@rac2 bin]$ ./amdu -diskstring="ORCL:*" -dump VOTE  -nomap -nodir   | grep HEARTBEAT

            AMDU-00210: No disks found in diskgroup VOTE

            [grid@rac2 bin]$ ./amdu -diskstring="ORCL:*" -dump CRS  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0003 is in currently mounted diskgroup CRS

            AMDU-00201: Disk N0003: 'ORCL:VOTE1'

            ** HEARTBEAT DETECTED **

            [grid@rac2 bin]$ ./amdu -diskstring="ORCL:*" -dump FRA  -nomap -nodir   | grep HEARTBEAT

            AMDU-00204: Disk N0002 is in currently mounted diskgroup FRA

            AMDU-00201: Disk N0002: 'ORCL:FRA1'

            ** HEARTBEAT DETECTED **

            • 4. Re: clusterware not starting up on one node error:PROT-602: Failed to retrieve data from the cluster registry
              Vandana B -Oracle

              Hi,

               

              That is great news! Could you share with us how you were able to resolve the issue?

               

              Regards,

              Vandana - Oracle