10 Replies Latest reply: Aug 23, 2010 10:50 AM by cayenne

    Fresh cluster inst., servers reboot, cannot restart clusters or ASM

    cayenne
      Hello all,

      I just installed an 11gR2 cluster over 5 nodes. I used ASM in the installer to hold the voting disk, etc.
      I installed the RDBMS binaries successfully across all nodes. NO INSTANCES YET.

      A few days went by....

      I was getting ready to apply post-install patches...and found things looking strange. I found (working on node1) that the clustering system was not running.

      I looked, and the servers (all 5 of them) for some reason had rebooted since install.

      I tried starting the cluster:

      crsctl start cluster -all

      It took a while to return, and then errored with a timeout message.

      I checked to see if it was up:
      ./crsctl check crs
      CRS-4638: Oracle High Availability Services is online
      CRS-4535: Cannot communicate with Cluster Ready Services
      CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
      CRS-4533: Event Manager is online

      It then dawned on me...maybe ASM wasn't up either?

      Nope...not running.
      I tried to start it locally on node 1

      I set the SID, and tried using sqlplus

      I got:
      ORA-01078: failure in processing system parameters
      LRM-00109: could not open parameter file '/u01/app/oracle/product/11.2.0/dbhome_1/dbs/init+ASM1.ora'


      I looked...nothing in that directory at all but a simple init.ora file.

      I tried shutting down the cluster with
      crsctl stop cluster -all

      I got a ton of messages for each node like:
      CRS-4548: Unable to connect to CRSD
      CRS-2678: 'ora.crsd' on 'node1' has experienced an unrecoverable failure
      CRS-0267: Human intervention required to resume its availability.

      I'm trying to get through to Oracle support, but they're running slow.
      Any ideas here?

      I used OUI to create ASM for the cluster...why would it not put an init file there to point to the spfile in ASM?

      I'm guessing this is the reason the nodes couldn't talk or sync. Trouble is...how do I start ASM without an init file? I seem to recall there might be a way to create a file to point to the ASM for the spfile, but I'm new to this too...and not sure where to point or the syntax to use.

      Will starting the cluster up with no ASM have done any damage? If so, how do I fix it?

      As you can tell, learning about clusters/RAC and ASM....and I'm not finding good reference materials on troubleshooting. Heck, the install docs are bad enough....

      Thank you in advance for any advice or links...

      cayenne

      ps. this is on RHEL5

      Edited by: cayenne on Aug 10, 2010 12:33 PM
        • 1. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
          Sebastian Solbach -Dba Community-Oracle
          Hi,

          try shutting down the cluster/crs with force (-f)
          (crsctl stop crs -f)

          This should definitely bring down the stack on the node. (Unfortunately crsctl stop crs -f has to be run on each node; it does not work cluster-wide the way crsctl stop cluster does.)

          Then try to start up one node with crsctl start crs and crsctl start cluster, and see what happens...

          If this node does not come up, then the logs would be interesting... (why ASM cannot be mounted etc.).
          If ASM has problems, the alert.log of the ASM instance would be interesting....

          I recommend using the latest 11.2.0.1.2 Grid PSU... this should solve the problem that brought your cluster down...

          Sebastian
          • 2. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
            cayenne
            Thank you for the reply. I am just not sure what is going on here.

            I tried the crsctl stop crs -f....and that didn't work:

            crsctl stop crs -f
            CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node1'
            CRS-2673: Attempting to stop 'ora.crsd' on 'node1'
            CRS-4548: Unable to connect to CRSD
            CRS-2675: Stop of 'ora.crsd' on 'node1' failed
            CRS-2679: Attempting to clean 'ora.crsd' on 'node1'
            CRS-4548: Unable to connect to CRSD
            CRS-2678: 'ora.crsd' on 'node1' has experienced an unrecoverable failure
            CRS-0267: Human intervention required to resume its availability.
            CRS-2795: Shutdown of Oracle High Availability Services-managed resources on 'node1' has failed
            CRS-4687: Shutdown command has completed with error(s).
            CRS-4000: Command Stop failed, or completed with errors.

            Checking the status, sure enough....some things are still running?
            crsctl check crs
            CRS-4638: Oracle High Availability Services is online
            CRS-4535: Cannot communicate with Cluster Ready Services
            CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
            CRS-4533: Event Manager is online


            Ok, now...checking ASM

            oracleasm status
            Checking if ASM is loaded: no
            Checking if /dev/oracleasm is mounted: no

            I try to start it:
            oracleasm init
            Loading module "oracleasm": failed
            Unable to load module "oracleasm"

            I'm a newbie to ASM and clustering...while I've touched them before, this is my first install, and I'm just puzzled, and Oracle support isn't helping very much so far.

            I'd installed the ASM libraries...it found the disks on my SAN (which had been used on another database, so I left the labels...then dropped and re-added a couple of the disks so they would show up in the cluster OUI as candidate disks)

            I set this up with the cluster OUI....it found the ASM candidate disk I was going to use for the voting disk and OCR...and it seemed to like the setup. Clustering was set up and installed across all 5 nodes. In /etc/oratab on all nodes, I see +ASM1, +ASM2...+ASM5

            I installed the 11gR2 database binaries...and the clustering services distributed them across all 5 nodes. Again...all appeared to be working.

            Somewhere between then and now...all 5 servers got rebooted...and now...it is hung, and won't totally start or shut down.

            I'm guessing at this time...can't hurt anymore to try to reboot one or more of the servers.

            Any help or suggestions greatly appreciated.

            Thank you,

            cayenne
            • 3. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
              Sebastian Solbach -Dba Community-Oracle
              Hi,

              Try running crsctl stop crs -f twice. Sometimes this is needed.

              crsctl check crs should then report that nothing is running anymore.

              If something is still running, then the only way I can think of is to disable the automatic startup of crs:

              crsctl disable crs

              and restart the node... This will definitely bring everything down (and does not start it up after restarting).
              You can then start the crs stack with crsctl start crs.

              Furthermore, don't get confused by oracleasm.
              oracleasm is for ASMLib, which may be used to prepare the storage for ASM but has nothing to do with whether ASM is running or not later on.

              To check the status of ASM, first see whether the cluster stack started successfully (you need to wait a little until it has started):

              crsctl check crs

              Everything should be online... if it is, you can do a crsctl stat res -t, which will show you all resources, including whether ASM is running.
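
A minimal sketch of that check, for anyone who wants to script it. The helper name crs_not_online is made up for illustration; it only filters text, so it assumes the output format shown earlier in this thread ("... is online" for healthy daemons).

```shell
# Print every CRS- line that does NOT report "is online".
# Reads `crsctl check crs` output on stdin, so it can also be
# tried against output that was saved to a file.
crs_not_online() {
    grep 'CRS-' | grep -v 'is online'
}

# Typical use on a node (assumes crsctl is on the PATH):
#   crsctl check crs | crs_not_online
```

If this prints nothing, every daemon reported itself online; any line it does print names a component worth looking up in the logs.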

              If there has been a problem starting up the stack (crsctl check crs) then we have to find out why.

              Check $CRS_HOME/log/<hostname>/alert*.log for error messages.

              If something indicates a problem with ASM do a:

              adrci
              show alert

              and choose the ASM alert.log.

              Search for error messages.
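
If the alert log is long, one rough way to get an overview is to list the distinct ORA- codes in it first. This is only a sketch: ora_codes is a hypothetical helper name, and the log path in the usage comment is a placeholder, not a path taken from this cluster.

```shell
# List each distinct ORA- error code found in log text read from stdin.
ora_codes() {
    grep -o 'ORA-[0-9][0-9]*' | sort -u
}

# Example (the path is a placeholder for your ASM alert log):
#   ora_codes < /path/to/alert_+ASM1.log
```

Each code can then be searched for individually in the log to see its surrounding messages.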

              GL.

              Sebastian
              • 4. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                cayenne
                Thank you, I'd not known of adrci before!! Once I set the ORACLE_HOME, it came right up!

                Ok, it does look like an ASM problem...and clustering can't find the voting disk.

                Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dia0_25186.trc:
                ORA-27508: IPC error sending a message
                ORA-27300: OS system dependent operation:sendmsg failed with status: 22
                ORA-27301: OS failure message: Invalid argument
                ORA-27302: failure occurred at: sskgxpsnd1
                Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dia0_25186.trc:
                ORA-27508: IPC error sending a message
                Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_o000_22504.trc (incident=4801):
                ORA-00603: ORACLE server session terminated by fatal error
                ORA-27504: IPC error creating OSD context
                ORA-27300: OS system dependent operation:if_not_found failed with status: 0
                ORA-27301: OS failure message: Error 0
                ORA-27302: failure occurred at: skgxpvaddr9
                ORA-27303: additional information: requested interface 192.168.100.1 not found. Check output from ifconfig command
                Incident details in: /u01/app/oracle/diag/asm/+asm/+ASM1/incident/incdir_4801/+ASM1_o000_22504_i4801.trc
                Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_gmon_25212.trc:
                ORA-29746: Cluster Synchronization Service is being shut down.
                ORA-29702: error occurred in Cluster Group Service operation
                GMON (ospid: 25212): terminating the instance due to error 29746
                opidrv aborting process O000 ospid (22504) as a result of ORA-603

                -------------------

                Ok, I've been going through the alert log, and the .trc files it indicates. I started by checking the private address 192.168.100.1...it seems to be up with ifconfig..and pingable.

                I'm looking through errors...the one : ORA-27300 returning a value 22 got a hit on Oracle Support...but wasn't the same errors as I got.

                Whew...still plowing through all the files and logs...not seeing anything out there so far that matches my problem...dunno what could have caused this to just all go BANG. I mean...no one but me using the machines, no databases on them yet...all that was installed was clustering with its voting disk and (ocr?) on a single ASM disk group....and RDBMS binaries installed across all 5 nodes.

                Please let me know if you see anything that stands out here...I'm still searching myself...
                :)

                Again, thanks for the adrci hint!!

                cayenne
                • 5. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                  Sebastian Solbach -Dba Community-Oracle
                  Hi,

                  if your servers rebooted, that really could be a hint that something with your private interconnect was not OK.
                  And the messages you get still indicate that something is wrong with the interconnect.

                  Can you run the following for each interconnect interface/bond and paste the output:

                  ifconfig ethX on each host

                  Furthermore try pinging the private interconnect from one host to the other (I assume you used 192.168.100.1 - 5)
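
Both checks could be scripted roughly as below. This is a sketch, not a tested procedure for this cluster: has_addr and ping_interconnect are made-up helper names, and the 192.168.100.1 - 5 range is the same assumption as above.

```shell
# Succeeds if the given IPv4 address appears as a whole word in
# ifconfig-style text read from stdin.
# -F: fixed string, -w: whole word, so 192.168.100.1 does not
# match 192.168.100.10.
has_addr() {
    grep -qwF -- "$1"
}

# Ping each assumed private address once and report the result.
ping_interconnect() {
    for i in 1 2 3 4 5; do
        if ping -c 1 -W 2 "192.168.100.$i" >/dev/null 2>&1; then
            echo "192.168.100.$i reachable"
        else
            echo "192.168.100.$i NOT reachable"
        fi
    done
}

# Usage on a node:
#   ifconfig | has_addr 192.168.100.1 && echo "address is configured"
#   ping_interconnect
```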

                  One side note: if clusterware can't find the voting disks, then this happens before ASM starts.
                  In the bootstrapping process the voting disks are accessed first, and ASM is started shortly afterwards.

                  What does
                  crsctl query css votedisk

                  tell you?

                  Sebastian
                  • 6. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                    cayenne
                    Thank you so much for all the help so far.
                    A question...I'm confused..

                    Doesn't ASM have to start first? Since the voting disk and OCR are ON ASM....doesn't it have to be up, before CRS can reach the voting disk, etc?

                    Thank you,

                    cayenne
                    • 7. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                      Sebastian Solbach -Dba Community-Oracle
                      Hi Cayenne,

                      Well, no. For ASM to start, it needs the CSSD/CRS agent. That is so by design (and was so in the past). Even with a single-instance ASM you needed to have a single-instance CRS/CSSD.

                      Now with 11gR2 the problem is: Oracle needs ASM to access the voting disks + OCR to get the cluster stack up. But at the same time Oracle needs the cluster stack to be able to start ASM. So as a workaround the cluster has 2 things:
                      a.) A local OCR (called OLR) and
                      b.) the cluster knows exactly where to find the voting disks on the ASM disks (in the ASM header).
                      So clusterware really does not need ASM to be up to be able to bootstrap.

                      However, shortly after clusterware has accessed the voting disks (directly, without ASM), ASM can be started, and then clusterware switches its access to the OCR and voting disks to go via ASM.

                      Hope that clarifies it.

                      Sebastian
                      • 8. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                        cayenne
                        Thank you, that does help somewhat.
                        Hmm...interesting. After reading this I went looking around. For some reason the ASM disks are now NOT appearing under the mount point /dev/oracleasm??

                        The SA's have assured me the SAN is still up and running....

                        Dang, this is getting frustrating.

                        Oracle support hasn't answered me in like 3+ days...even with severity 2.

                        I did find this note:
                        Oracleasm Configure gets Error - Unable To Load Module Oracleasm [ID 466428.1]

                        Wondering if that might have done it. I just found that my SA's did do some upgrades to RHEL5. From this note, which isn't completely related to my situation, it did appear that changing the kernel would throw ASM out...wondering if that might be the case here?

                        I checked...the versions still seem compatible...just a minor upgrade...they still seem to match?

                        uname -r
                        2.6.18-194.8.1.el5

                        rpm -qa|grep asm
                        oracleasm-2.6.18-194.el5-2.0.5-1.el5
                        oracleasm-support-2.1.3-1.el5
                        oracleasmlib-2.0.4-1.el5

                        I'm getting lost on what to try next...any suggestions?

                        Thank you,

                        cayenne
                        • 9. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                          692600
                           I checked...the versions still seem compatible...just a minor upgrade...they still seem to match?
                          
                          uname -r
                          2.6.18-194.8.1.el5
                          
                          rpm -qa|grep asm
                          oracleasm-2.6.18-194.el5-2.0.5-1.el5
                          oracleasm-support-2.1.3-1.el5
                          oracleasmlib-2.0.4-1.el5
                          
                          I'm getting lost on what to try next...any suggestions?
                           
                          Well, they do not match, and that looks like where your problem is.

                          The latest driver available for linux is for kernel 2.6.18-194.3.1.el5 and your kernel is 2.6.18-194.8.1.el5
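
The comparison can be sketched as a small shell check. It assumes the ASMLib kernel-driver package is named oracleasm-<kernel release>-<driver version>, which is the pattern visible in the rpm output quoted above; asm_driver_matches is a made-up helper name.

```shell
# Succeeds if the oracleasm kernel-driver package name was built for
# exactly the given kernel release.
asm_driver_matches() {
    kernel=$1   # e.g. the output of: uname -r
    pkg=$2      # e.g. the driver line from: rpm -qa | grep asm
    case "$pkg" in
        "oracleasm-$kernel-"*) return 0 ;;
        *) return 1 ;;
    esac
}

# With the values from this thread the check fails, because the
# running kernel is 194.8.1 and the driver was built for 194:
#   asm_driver_matches 2.6.18-194.8.1.el5 oracleasm-2.6.18-194.el5-2.0.5-1.el5
```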

                          You can try updating the driver when it's available with 'oracleasm update-driver'. Until they release it, your temporary solution would be to drop ASMLib and use block devices for ASM. The steps would be roughly as below:

                          1) stop clusterware normally, or with force; if nothing works, kill the d.bin processes - on all nodes
                          crsctl stop crs -f
                          if that doesn't work, do a ps -ef | grep d.bin and kill -15 the listed processes

                          2) cd /etc/init.d - all nodes
                          ./oracleasm stop
                          ./oracleasm disable

                          3) configure udev rules with your asm disks
                          vi /etc/udev/rules.d/99-oracle-asmdevices.rules
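                          KERNEL=="sdh[5-9]", OWNER="grid", GROUP="asmadmin", MODE="0660"
                          KERNEL=="sdh10", OWNER="grid", GROUP="asmadmin", MODE="0660"
                          (a bracket expression matches a single character, so sdh10 needs its own rule)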

                          if you have not been security savvy, owner would be oracle, group would be dba

                          4) replicate this file to all the nodes
                          # scp 99-oracle-asmdevices.rules

                          5) restart the udev services - all nodes
                          # udevcontrol reload_rules
                          # start_udev

                          6) start your cluster
                          crsctl start crs

                          You should get the stuff up & running. Good luck!
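
For step 3, a rough way to generate the rules file instead of typing it by hand is sketched below. asm_udev_rules is a hypothetical helper, and the device names in the usage comment are only examples; substitute your own ASM devices, owner and group.

```shell
# Emit one udev rule per device name given after the owner and group.
# The mode 0660 follows the example rule in step 3.
asm_udev_rules() {
    owner=$1
    group=$2
    shift 2
    for dev in "$@"; do
        printf 'KERNEL=="%s", OWNER="%s", GROUP="%s", MODE="0660"\n' \
            "$dev" "$owner" "$group"
    done
}

# Example (device names are placeholders):
#   asm_udev_rules grid asmadmin sdh5 sdh6 > 99-oracle-asmdevices.rules
```

Writing one rule per device avoids having to get a wildcard pattern exactly right.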

                          Cheers.
                          • 10. Re: Fresh cluster inst., servers reboot, cannot restart clusters or ASM
                            cayenne
                            Thank you to all for the answers.
                            I had the sys admins revert to the previous version of the OS kernel...reboot the boxes, and voila!

                            Everything came back up running again. Just that slight difference in version really messed things up.