9 Replies Latest reply: Mar 4, 2013 3:34 AM by ClaudioD&T

    Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.

    ClaudioD&T
      Hello all,
      I'm working on an 11gR2 RAC configuration updated to the latest patch set. During the night everything on node 2 went down and it hasn't come back up, not even after a reboot. I've checked the log files and I think the problem can be traced to this:
      vi ocssd.log
      2013-01-15 02:42:47.629: [    CSSD][1104660800]clssnmvDHBValidateNcopy: node 1, dbacc1, has a disk HB, but no network HB, DHB has rcfg 247316625, wrtcnt, 18953467, LATS 165822984, lastSeqNo 18949942, uniqueness 1357546658, timestamp 1358214167/666569324
      … after some time...
      2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################
      2013-01-15 02:43:01.929: [    CSSD][1115699520]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
      2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################
      2013-01-15 02:43:01.929: [    CSSD][1115699520](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally.
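For anyone searching a large ocssd.log for the same two signatures (disk heartbeat present but network heartbeat missing, followed by a CSSD abort), here is a minimal sketch. The function name `scan_ocssd` and the event labels are made up for illustration, and the patterns are derived only from the lines quoted above, so other Clusterware versions may word these messages differently.

```python
import re

TIMESTAMP = r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "

# A node whose disk heartbeat is still seen while its network heartbeat is missing.
MISSING_NET_HB = re.compile(
    TIMESTAMP
    + r".*clssnmvDHBValidateNcopy: node (?P<num>\d+), (?P<name>[^,]+), "
    + r"has a disk HB, but no network HB"
)

# The CSS daemon aborting; the name of the thread that triggered it is captured.
CSSD_ABORT = re.compile(
    TIMESTAMP + r".*clssscExit: CSSD aborting from thread (?P<thread>\S+)"
)


def scan_ocssd(lines):
    """Return (event, timestamp, detail) tuples for the two signatures.

    `scan_ocssd` and the event labels are hypothetical names, not Oracle tooling.
    """
    events = []
    for line in lines:
        line = line.strip()
        match = MISSING_NET_HB.match(line)
        if match:
            events.append(("no_network_hb", match["ts"], match["name"]))
            continue
        match = CSSD_ABORT.match(line)
        if match:
            events.append(("cssd_abort", match["ts"], match["thread"]))
    return events


# Lines copied from the excerpt above.
SAMPLE_LOG = [
    "2013-01-15 02:42:47.629: [    CSSD][1104660800]clssnmvDHBValidateNcopy: "
    "node 1, dbacc1, has a disk HB, but no network HB, DHB has rcfg 247316625, "
    "wrtcnt, 18953467, LATS 165822984, lastSeqNo 18949942, "
    "uniqueness 1357546658, timestamp 1358214167/666569324",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520]###################################",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520]clssscExit: "
    "CSSD aborting from thread clssnmRcfgMgrThread",
    "2013-01-15 02:43:01.929: [    CSSD][1115699520](:CSSSC00012:)clssscExit: "
    "A fatal error occurred and the CSS daemon is terminating abnormally.",
]

for event in scan_ocssd(SAMPLE_LOG):
    print(event)
```

To run it against a real file, pass an open file handle instead of `SAMPLE_LOG`; the location of ocssd.log depends on the Grid Infrastructure installation.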

      crsctl check cluster -all
      **************************************************************
      dbacc1:
      CRS-4537: Cluster Ready Services is online
      CRS-4529: Cluster Synchronization Services is online
      CRS-4533: Event Manager is online
      **************************************************************
      dbacc2:
      CRS-4535: Cannot communicate with Cluster Ready Services
      CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
      CRS-4534: Cannot communicate with Event Manager
      **************************************************************
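On a cluster with more nodes, the same check can be summarised per node. The sketch below is only an illustration: it assumes the layout shown above (a row of asterisks, a `nodename:` line, then `CRS-nnnn` messages) and treats any message that does not end in "is online" as a failure. The function name `nodes_with_failures` is hypothetical.

```python
def nodes_with_failures(output):
    """Map each node name to its CRS lines that do not report 'is online'.

    Assumes the layout quoted in this thread; other versions may differ.
    """
    failures = {}
    node = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line or set(line) == {"*"}:
            continue  # blank line or the row of asterisks between nodes
        if line.endswith(":") and not line.startswith("CRS-"):
            node = line[:-1]  # e.g. "dbacc2:" -> "dbacc2"
            failures[node] = []
        elif line.startswith("CRS-") and node is not None:
            if not line.endswith("is online"):
                failures[node].append(line)
    # Keep only the nodes that reported at least one problem.
    return {name: msgs for name, msgs in failures.items() if msgs}


# Output copied from the post above.
SAMPLE_CHECK = """\
**************************************************************
dbacc1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
dbacc2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
"""

for name, messages in nodes_with_failures(SAMPLE_CHECK).items():
    print(name)
    for message in messages:
        print("  " + message)
```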


      I also tried pinging the VIPs from each machine to the other, and they all answered fine, even with very large ping packets (I had read that this can reveal a possible bug).

      Any help would be really really appreciated. Thanks.

      Edited by: Klawd on 16-Jan-2013 9.29
        • 1. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
          JohnWatson
          What is the output of
          crsctl stat res -t -init
          after you try to start the clusterware?
          • 2. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
            ClaudioD&T
            crsctl stat res -t -init on node 1
            --------------------------------------------------------------------------------
            NAME           TARGET  STATE        SERVER                 STATE_DETAILS
            --------------------------------------------------------------------------------
            Cluster Resources
            --------------------------------------------------------------------------------
            ora.asm
                  1        ONLINE  ONLINE       dbacc1                 Started
            ora.cluster_interconnect.haip
                  1        ONLINE  ONLINE       dbacc1
            ora.crf
                  1        ONLINE  ONLINE       dbacc1
            ora.crsd
                  1        ONLINE  ONLINE       dbacc1
            ora.cssd
                  1        ONLINE  ONLINE       dbacc1
            ora.cssdmonitor
                  1        ONLINE  ONLINE       dbacc1
            ora.ctssd
                  1        ONLINE  ONLINE       dbacc1                 OBSERVER
            ora.diskmon
                  1        OFFLINE OFFLINE
            ora.drivers.acfs
                  1        ONLINE  ONLINE       dbacc1
            ora.evmd
                  1        ONLINE  ONLINE       dbacc1
            ora.gipcd
                  1        ONLINE  ONLINE       dbacc1
            ora.gpnpd
                  1        ONLINE  ONLINE       dbacc1
            ora.mdnsd
                  1        ONLINE  ONLINE       dbacc1
            crsctl stat res -t -init on node 2
            --------------------------------------------------------------------------------
            NAME           TARGET  STATE        SERVER                 STATE_DETAILS
            --------------------------------------------------------------------------------
            Cluster Resources
            --------------------------------------------------------------------------------
            ora.asm
                  1        ONLINE  OFFLINE
            ora.cluster_interconnect.haip
                  1        ONLINE  OFFLINE
            ora.crf
                  1        ONLINE  ONLINE       dbacc2
            ora.crsd
                  1        ONLINE  OFFLINE
            ora.cssd
                  1        ONLINE  OFFLINE
            ora.cssdmonitor
                  1        ONLINE  ONLINE       dbacc2
            ora.ctssd
                  1        ONLINE  OFFLINE
            ora.diskmon
                  1        OFFLINE OFFLINE
            ora.drivers.acfs
                  1        ONLINE  ONLINE       dbacc2
            ora.evmd
                  1        ONLINE  OFFLINE
            ora.gipcd
                  1        ONLINE  ONLINE       dbacc2
            ora.gpnpd
                  1        ONLINE  ONLINE       dbacc2
            ora.mdnsd
                  1        ONLINE  ONLINE       dbacc2
            Edited by: Klawd on 16-Jan-2013 11.49

            Edited by: Klawd on 16-Jan-2013 12.18
            • 3. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
              JohnWatson
              It is really difficult to read your output, because you didn't use code tags.

              (and by the way, it would be polite to say "thank you for trying to assist", you might get better responses if you did)
              • 4. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
                ClaudioD&T
                I'm very sorry. I tried to find a way to use a fixed-width font for the output but I didn't know how to change it, and I didn't see a code tag in the right panel. Would you mind telling me how to use it, please?
                Also, the thanks was already in the first post; I absolutely didn't mean to be rude. Thanks for the help, John.

                Never mind, I found the correct use of the code tags in the FAQ and fixed the previous post.

                Edited by: Klawd on 16-Jan-2013 12.21
                • 5. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
                  JohnWatson
                  Problem with the interconnect on node 2?
                  • 7. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
                    Gennady Sigalaev
                    Hi Klawd,

                    I don't like the following row:
                    ora.asm
                          1        ONLINE  OFFLINE
                    First of all, try to identify the problem with ASM on the second node (it could be a problem with disks, permissions or something else). After that, follow "Troubleshoot Grid Infrastructure Startup Issues [ID 1050908.1]" to resolve any other problems.
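As a side note for later readers, ora.asm is not the only resource in that state on node 2. A short sketch like the one below lists every resource whose STATE differs from its TARGET. It assumes the two-line layout of the `crsctl stat res -t -init` listing posted earlier in the thread (resource name on one line, then instance number, TARGET and STATE on the next); the function name `target_state_mismatches` is made up for illustration.

```python
def target_state_mismatches(output):
    """Return (resource, target, state) for rows where STATE != TARGET.

    Assumes the '-t' tabular layout quoted in this thread.
    """
    mismatches = []
    name = None
    for raw in output.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith("ora."):
            name = fields[0]  # a resource name line, e.g. "ora.asm"
        elif name and fields[0].isdigit() and len(fields) >= 3:
            target, state = fields[1], fields[2]
            if target != state:
                mismatches.append((name, target, state))
        # Header, dashed and "Cluster Resources" lines match neither branch.
    return mismatches


# Node 2 listing copied from the earlier post.
SAMPLE_NODE2 = """\
NAME           TARGET  STATE        SERVER                 STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       dbacc2
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE
ora.cssdmonitor
      1        ONLINE  ONLINE       dbacc2
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       dbacc2
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       dbacc2
ora.gpnpd
      1        ONLINE  ONLINE       dbacc2
ora.mdnsd
      1        ONLINE  ONLINE       dbacc2
"""

for resource, target, state in target_state_mismatches(SAMPLE_NODE2):
    print(f"{resource}: target {target}, state {state}")
```

On the posted output this also flags ora.cssd and ora.cluster_interconnect.haip, which fits the earlier suggestion to look at the interconnect as well as ASM.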

                    Best regards,
                    Gena.
                    • 8. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
                      ClaudioD&T
                      Ok, thank you, I'll do that tomorrow morning. (Italy here ;) )

                      EDIT: we still can't find what's wrong with this node. Everything seems to point to the interconnect.

                      Edited by: Klawd on 17-Jan-2013 17.30
                      • 9. Re: Node 2 in RAC 11gr2 went down and hasn't been able to restart. Help pls.
                        ClaudioD&T
                        Sorry for the late update.
                        Just to close the thread and leave some useful data:
                        This turned out to be a new bug. We opened an SR about it.
                        The workaround that temporarily fixed the problem was to shut everything down, start the second node first, and then start the first node.

                        Bye.