6 Replies Latest reply: Oct 25, 2013 7:11 AM by AjithPathiyil

    Node eviction |Diagwait

    user13549752

      One of the frequent problems we see in RAC environments is node eviction. To get more information in the logs, we set diagwait.

       

      #crsctl set css diagwait 13 -force


      As per Oracle note "Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions" (Doc ID 559365.1):


      Starting with 11.2.0.1, customers do not need to set diagwait, as the architecture has been changed.


      Does this mean that from 11.2.0.1 onward, when a node eviction happens, the logs already contain all the information and nothing more is needed? Or is there an equivalent of the diagwait parameter in 11gR2?
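
      For anyone scripting around this, the version rule quoted from the note can be sketched as a small shell check. This is only an illustration: the function name diagwait_applies is made up, it expects a plain dotted version string such as 11.2.0.1, and it encodes nothing beyond the "starting with 11.2.0.1" statement above.

```shell
# Sketch: return 0 (true) if a Clusterware version is older than 11.2.0.1,
# i.e. a release where, per the note quoted above, diagwait is still relevant.
# diagwait_applies is a hypothetical helper, not an Oracle tool.
diagwait_applies() {
    version=$1
    old_ifs=$IFS
    IFS=.
    set -- $version          # split "11.2.0.1" into positional parameters
    IFS=$old_ifs

    # Compare each component against 11.2.0.1, most significant first.
    for pair in "${1:-0}:11" "${2:-0}:2" "${3:-0}:0" "${4:-0}:1"; do
        have=${pair%%:*}
        want=${pair##*:}
        [ "$have" -lt "$want" ] && return 0
        [ "$have" -gt "$want" ] && return 1
    done
    return 1                 # exactly 11.2.0.1: diagwait is not needed
}

# Usage:
#   if diagwait_applies 10.2.0.4; then echo "diagwait may still be useful"; fi
```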

        • 1. Re: Node eviction |Diagwait
          Aman....

          I believe the same task is now carried out by CSSDMONITOR and CSSDAGENT.

           

          Aman....

          • 2. Re: Node eviction |Diagwait
            user13549752

            Thanks for the update, Aman. Experts, any other views on this?

            • 3. Re: Node eviction |Diagwait
              AjithPathiyil

              Hi,

               

              Is your RAC database in a virtualized environment? If yes, you can check the fencing parameters below.

               

              Fencing is otherwise known as STONITH ("Shoot The Other Node In The Head"). Since the RAC database is created in VMware, resources are limited, so the fencing timeout parameters should be set according to the system resource limitations of the VMware virtual environment.

               

              So the best fencing timeout settings are the ones below. With longer timeouts, the cluster waits longer for the heartbeat and still believes the node(s) are part of the cluster; if the timeout is too short, it shoots the node. These fencing timeout settings work very well for me. I hope they help you too; if so, please mark this as answered.

               

              Increase CRS Fencing Timeout (Shared Filesystem)

              ===================================================

               

               

              These steps are not necessary for a test or production environment. However, they might make your VMware test cluster just a little more stable, and they will provide a good learning opportunity about Grid Infrastructure.

               

               

               

               


              1. Grid Infrastructure must be running on only one node to change these settings. Shut down the clusterware on ajithpathiyil2 as user root.


               

               

              [oracle@ajithpathiyil1 ~]$ ssh ajithpathiyil2

              Last login: Wed Mar 30 14:50:49 2011

              Set environment by typing 'oenv' - default is instance RAC1.

              ajithpathiyil2:/home/oracle[RAC1]$ su -

              Password:

              [root@ajithpathiyil2 bin]# crsctl stop crs

              CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'ajithpathiyil2'

              CRS-2673: Attempting to stop 'ora.crsd' on 'ajithpathiyil2'

              CRS-2790: Starting shutdown of Cluster Ready Services-managed resources on 'ajithpathiyil2'

              ...

              ...

              ...

              CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'ajithpathiyil2' has completed

              CRS-4133: Oracle High Availability Services has been stopped.

               

               

               

               

               

               

               

              2. Return to node ajithpathiyil1. As the root user, increase the misscount so that CRS waits 1.5 minutes before it reboots. (VMware can drag a little on some laptops!)

               

              [root@ajithpathiyil1 ~]# crsctl get css misscount

              30

              [root@ajithpathiyil1 ~]# crsctl set css misscount 90

              Configuration parameter misscount is now set to 90.

                  

              3. Increase the disktimeout so that CRS waits 10 minutes for I/O to complete before rebooting.

                 

              [root@ajithpathiyil1 ~]# crsctl get css disktimeout

              200

              [root@ajithpathiyil1 ~]# crsctl set css disktimeout 600

              Configuration parameter disktimeout is now set to 600.

                  

              4. Restart CRS on the other node.

               

              [root@ajithpathiyil1 bin]# ssh ajithpathiyil2

              [root@ajithpathiyil2 bin]# crsctl start crs
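
              Steps 2 and 3 can be wrapped in a small helper that only raises a timeout when the current value is lower than the target. This is a rough sketch, not a tested procedure: the function name raise_css_timeout is made up, and it assumes "crsctl get css <param>" prints a bare number as in the transcript above. Other releases may wrap the value in a CRS message, in which case the check below refuses to continue.

```shell
# Sketch only: raise a CSS timeout if the current value is below a target.
# Assumes "crsctl get css <param>" prints just a number (as in the transcript
# above). raise_css_timeout is a hypothetical helper name.
raise_css_timeout() {
    param=$1     # e.g. misscount or disktimeout
    target=$2    # desired value in seconds

    current=$(crsctl get css "$param") || return 1

    # Refuse to continue if the output is not a plain integer.
    case $current in
        ''|*[!0-9]*)
            echo "unexpected output for $param: $current" >&2
            return 1
            ;;
    esac

    if [ "$current" -ge "$target" ]; then
        echo "$param already $current, leaving it alone"
    else
        crsctl set css "$param" "$target"
    fi
}

# Usage, as root on the one node still running Grid Infrastructure:
#   raise_css_timeout misscount 90
#   raise_css_timeout disktimeout 600
```

              Comparing first avoids accidentally lowering a timeout that someone has already raised.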

              • 4. Re: Node eviction |Diagwait
                AjithPathiyil

                Hi,


                You can also try the option below to capture the reason for the node eviction. This is a simpler method, assuming yours is 11g R2.


                Please add the callout script below to get notified immediately about a node eviction and its reason (in your case, LUNs not being reachable by CRS would be shown as the reason for the eviction). Please mark this reply as answered if it solves your issue.

                 

                From a shell prompt (logged in as oracle) on each server, navigate to /u01/grid/oracle/product/11.2.0/grid_1/racg/usrco. Create a file there called callout1.sh using vi (or your favorite editor). The contents of the file should be this:

                ----------------------------------------------------------------------------------------------------

                #!/bin/ksh

                umask 022

                FAN_LOGFILE=/tmp/`hostname`_uptime.log

                echo $* "reported="`date` >> $FAN_LOGFILE &

                ----------------------------------------------------------------------------------------------------

                Note the use of backticks around the hostname and date commands.

                 

                [oracle@<node_name> ~]$ chmod 755  /u01/grid/oracle/product/11.2.0/grid_1/racg/usrco/callout1.sh

                 

                [oracle@<node_name> ~]$ tail -f  /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/crsd/crsd.log
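
                Before waiting for a real eviction, it may be worth confirming that the callout is executable and actually writes to its log. The sketch below runs a callout by hand with a made-up marker argument and looks for that marker in the log. The function name check_callout and the marker are hypothetical; real events are passed in by Clusterware and will look different.

```shell
# Sketch: check that a callout script is executable and actually logs.
# The marker argument is made up for this check; real events come from
# Clusterware. check_callout is a hypothetical helper name.
check_callout() {
    script=$1     # path to the callout script, e.g. .../racg/usrco/callout1.sh
    logfile=$2    # file the callout appends to
    marker="CALLOUT_SELFTEST_$$"

    if [ ! -x "$script" ]; then
        echo "$script is not executable" >&2
        return 1
    fi

    "$script" "$marker" || return 1
    sleep 1       # the callout backgrounds its echo; give it a moment
    grep -q "$marker" "$logfile"
}

# Usage (paths as in the post):
#   check_callout /u01/grid/oracle/product/11.2.0/grid_1/racg/usrco/callout1.sh \
#       "/tmp/$(hostname)_uptime.log"
```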

                 

                 

                You can find more information in the MOS notes below:

                MOS Article ID # 1050908.1, How to Troubleshoot Grid Infrastructure Startup Issues

                MOS Article ID # 1050693.1, Troubleshooting 11.2 Clusterware Node Evictions (Reboots)

                • 5. Re: Node eviction |Diagwait
                  user13549752

                  crsctl set css misscount 90 # Is this command only for VMware, or can we use it on physical servers too?

                  • 6. Re: Node eviction |Diagwait
                    AjithPathiyil

                    Hi,

                     

                    It's not a VMware-specific command; you can use it on physical or virtualized servers. Typically, though, it is the network latency in virtualized environments, combined with the default fencing parameter values, that leads to node eviction. If you think the network latency between your physical servers is also high, then you might try the fencing parameter change there too.

                     

                    Note: Please do not try this in production.

                     

                    I would also suggest putting the callout script I pasted above in place to find the exact reason for the node eviction, and then proceeding with the fencing timeout parameter change if required.