This discussion is archived
9 Replies Latest reply: Mar 4, 2013 1:34 AM by ClaudioD&T

Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.

ClaudioD&T Newbie
Hello all,
I'm working on an 11gR2 RAC configuration that has been updated to the latest patch set. During the night everything on node 2 went down, and it hasn't been able to come back up, not even after a reboot. I've checked the log files and I think the problem might be traceable to this:
vi ocssd.log
2013-01-15 02:42:47.629: [    CSSD][1104660800]clssnmvDHBValidateNcopy: node 1, dbacc1, has a disk HB, but no network HB, DHB has rcfg 247316625, wrtcnt, 18953467, LATS 165822984, lastSeqNo 18949942, uniqueness 1357546658, timestamp 1358214167/666569324
… after some time...
2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################
2013-01-15 02:43:01.929: [    CSSD][1115699520]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################
2013-01-15 02:43:01.929: [    CSSD][1115699520](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally.
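
A minimal sketch of how an excerpt like the one above could be scanned for the relevant messages. The line layout in the regular expression and the three marker strings are assumptions taken only from the lines quoted in this post; they are not an official list of CSSD messages, and the labels are made up for illustration.

```python
import re

# Assumed layout, inferred only from the excerpt in this post:
# "<date> <time>: [    <component>][<thread id>]<message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"\[\s*(?P<component>\w+)\]\[(?P<thread>\d+)\](?P<msg>.*)$"
)

# Substrings copied from the quoted log lines; the labels are hypothetical.
MARKERS = {
    "has a disk HB, but no network HB": "network heartbeat missing",
    "CSSD aborting": "CSSD abort",
    "CSS daemon is terminating abnormally": "CSSD fatal exit",
}


def scan(lines):
    """Return (timestamp, label) for every line containing a known marker."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        for needle, label in MARKERS.items():
            if needle in match.group("msg"):
                events.append((match.group("ts"), label))
                break
    return events


sample = [
    "2013-01-15 02:42:47.629: [    CSSD][1104660800]clssnmvDHBValidateNcopy: "
    "node 1, dbacc1, has a disk HB, but no network HB, DHB has rcfg 247316625",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520]clssscExit: CSSD aborting "
    "from thread clssnmRcfgMgrThread",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520](:CSSSC00012:)clssscExit: "
    "A fatal error occurred and the CSS daemon is terminating abnormally.",
]

for ts, label in scan(sample):
    print(ts, label)
```

On a real system the list would come from reading ocssd.log line by line instead of the hard-coded sample.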

crsctl check cluster -all
**************************************************************
dbacc1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
dbacc2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************


I also tried pinging the VIP from each machine to the other, and they all answered fine, even with very large packet sizes (I had read that a large ping could expose a possible bug).
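
For reference, a small sketch of how the payload size for such a large ping can be worked out. It assumes IPv4 without IP options and the Linux iputils `ping` (`-M do` forbids fragmentation, `-s` sets the payload size). The two MTU values are only examples; the thread does not say which MTU these nodes use.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


# Example MTUs only: standard Ethernet and a common jumbo-frame setting.
for mtu in (1500, 9000):
    size = max_ping_payload(mtu)
    print(f"MTU {mtu}: ping -M do -s {size} -c 3 <peer-address>")
```

A ping with the computed size that succeeds, while one byte more fails, suggests the path really carries that MTU end to end.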

Any help would be really really appreciated. Thanks.

Edited by: Klawd on 16-Jan-2013 9.29
  • 1. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    JohnWatson Guru
    What is the output of
    crsctl stat res -t -init
    after you try to start the clusterware?
  • 2. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    ClaudioD&T Newbie
    crsctl stat res -t -init on node 1
    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                 STATE_DETAILS
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.asm
          1        ONLINE  ONLINE       dbacc1                 Started
    ora.cluster_interconnect.haip
          1        ONLINE  ONLINE       dbacc1
    ora.crf
          1        ONLINE  ONLINE       dbacc1
    ora.crsd
          1        ONLINE  ONLINE       dbacc1
    ora.cssd
          1        ONLINE  ONLINE       dbacc1
    ora.cssdmonitor
          1        ONLINE  ONLINE       dbacc1
    ora.ctssd
          1        ONLINE  ONLINE       dbacc1                 OBSERVER
    ora.diskmon
          1        OFFLINE OFFLINE
    ora.drivers.acfs
          1        ONLINE  ONLINE       dbacc1
    ora.evmd
          1        ONLINE  ONLINE       dbacc1
    ora.gipcd
          1        ONLINE  ONLINE       dbacc1
    ora.gpnpd
          1        ONLINE  ONLINE       dbacc1
    ora.mdnsd
          1        ONLINE  ONLINE       dbacc1
    crsctl stat res -t -init on node 2
    --------------------------------------------------------------------------------
    NAME           TARGET  STATE        SERVER                 STATE_DETAILS
    --------------------------------------------------------------------------------
    Cluster Resources
    --------------------------------------------------------------------------------
    ora.asm
          1        ONLINE  OFFLINE
    ora.cluster_interconnect.haip
          1        ONLINE  OFFLINE
    ora.crf
          1        ONLINE  ONLINE       dbacc2
    ora.crsd
          1        ONLINE  OFFLINE
    ora.cssd
          1        ONLINE  OFFLINE
    ora.cssdmonitor
          1        ONLINE  ONLINE       dbacc2
    ora.ctssd
          1        ONLINE  OFFLINE
    ora.diskmon
          1        OFFLINE OFFLINE
    ora.drivers.acfs
          1        ONLINE  ONLINE       dbacc2
    ora.evmd
          1        ONLINE  OFFLINE
    ora.gipcd
          1        ONLINE  ONLINE       dbacc2
    ora.gpnpd
          1        ONLINE  ONLINE       dbacc2
    ora.mdnsd
          1        ONLINE  ONLINE       dbacc2
    Edited by: Klawd on 16-Jan-2013 11.49

    Edited by: Klawd on 16-Jan-2013 12.18
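
A minimal sketch for picking out the resources in a `crsctl stat res -t -init` listing whose TARGET is ONLINE but whose STATE is not. The parsing assumes the layout shown in this post (a line starting with `ora.` followed by a line starting with the instance number); the function name is hypothetical and other layouts are not handled.

```python
def wanted_but_not_online(output):
    """Return resource names whose TARGET is ONLINE but whose STATE is not."""
    problems = []
    current = None
    for raw in output.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith("ora."):
            current = fields[0]  # resource name line
        elif current and fields[0].isdigit() and len(fields) >= 3:
            target, state = fields[1], fields[2]
            if target == "ONLINE" and state != "ONLINE":
                problems.append(current)
    return problems


# Node 2 listing as posted above.
node2 = """
NAME           TARGET  STATE        SERVER                 STATE_DETAILS
Cluster Resources
ora.asm
      1        ONLINE  OFFLINE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       dbacc2
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE
ora.cssdmonitor
      1        ONLINE  ONLINE       dbacc2
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       dbacc2
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       dbacc2
ora.gpnpd
      1        ONLINE  ONLINE       dbacc2
ora.mdnsd
      1        ONLINE  ONLINE       dbacc2
"""

for name in wanted_but_not_online(node2):
    print(name)
```

`ora.diskmon` is not reported because its TARGET is OFFLINE on both nodes.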
  • 3. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    JohnWatson Guru
    It is really difficult to read your output, because you didn't use code tags.

    (and by the way, it would be polite to say "thank you for trying to assist", you might get better responses if you did)
  • 4. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    ClaudioD&T Newbie
    I'm very sorry. I tried to find a way to use a fixed-width font for the output but I didn't know how to change it, and I didn't see a code tag on the right panel. Would you care to tell me how to use it, please?
    Also, the thanks was already in the first post; I absolutely didn't mean to be rude. Thanks for the help, John.

    Never mind, I found the correct use of the code tags in the FAQ and fixed the previous post.

    Edited by: Klawd on 16-Jan-2013 12.21
  • 5. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    JohnWatson Guru
    Problem with the interconnect on node 2?
  • 7. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    Gennady Sigalaev Journeyer
    Hi Klawd,

    I don't like the following row:
    ora.asm
          1        ONLINE  OFFLINE
    First of all, try to identify the problem with ASM on the second node (it could be a problem with disks, permissions or something similar). After that, follow "Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1]" to resolve any other problems.

    Best regards,
    Gena.
  • 8. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    ClaudioD&T Newbie
    Ok, thank you, I'll do that tomorrow in the morning. (Italy here ;) )




    EDIT: we still can't find what's wrong with this node. Everything seems to point to the interconnect.

    Edited by: Klawd on 17-Jan-2013 17.30
  • 9. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
    ClaudioD&T Newbie
    Sorry for the late update.
    Just to close the thread and leave some useful data.
    This turned out to be a new bug. We opened an SR about it.
    The workaround that temporarily fixed the problem was to shut everything down, start the second node first, and then start the first node.
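
The ordering described in this workaround can be sketched as a dry-run plan. This is only an illustration: the thread does not say which commands were used, so `crsctl stop crs` and `crsctl start crs` (run as root on each node) are an assumption, and the helper name `restart_plan` is hypothetical. The script prints the steps and executes nothing.

```python
def restart_plan(first_to_start, nodes):
    """Build (node, command) steps: stop the stack on every node, then start
    `first_to_start` before the remaining nodes."""
    if first_to_start not in nodes:
        raise ValueError(f"{first_to_start!r} is not in the node list")
    others = [n for n in nodes if n != first_to_start]
    # Assumed commands; the thread does not state how the stack was stopped.
    plan = [(n, "crsctl stop crs") for n in nodes]
    plan.append((first_to_start, "crsctl start crs"))
    plan += [(n, "crsctl start crs") for n in others]
    return plan


# Node names taken from the thread; node 2 (dbacc2) is started first.
for node, command in restart_plan("dbacc2", ["dbacc1", "dbacc2"]):
    print(f"{node}: {command}")
```

In practice it would make sense to confirm that the first node's stack is fully up before starting the next one.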

    Bye.
