
    Checking the LUNs when clusterware doesn't come up

    Kavanagh

      Grid Version: 11.2.0.3.6

      Platform: Oracle Enterprise Linux 6.2

       

      In our 2-node RAC, Node2 got evicted. Once Node2 booted up, CRS didn't start. I couldn't find anything significant in the Grid alert.log, ocssd.log, or crsd.log.

      In Node2, I was able to do fdisk -l on all the LUNs in the OCR_VOTE diskgroup. After a few hours of headache and escalations, we discovered that the LUNs were not actually accessible to the clusterware in Node2, although fdisk -l was correctly showing the partitions.

       

      When the cluster was down, I wanted to check if the voting disk was actually accessible to the CRS (GI), but I couldn't (as shown below).

       

      # ./crsctl start crs

      CRS-4640: Oracle High Availability Services is already active

      CRS-4000: Command Start failed, or completed with errors.

       

       

      # ./crsctl query css votedisk

      Unable to communicate with the Cluster Synchronization Services daemon.

       

      How can I check if the voting disk is accessible to the CRS on a node when the CRS is down?

        • 1. Re: Checking the LUNs when clusterware doesn't come up
          gottikere

          Can you please execute the below command and give the output?

          # ./crsctl stat res -t -init

           

          Thanks,

          http://gssdba.wordpress.com

          • 2. Re: Checking the LUNs when clusterware doesn't come up
            Kavanagh

            I don't have the output of crsctl stat res -t -init as the issue has since been fixed, but all the essential components were OFFLINE.

            • 3. Re: Checking the LUNs when clusterware doesn't come up
              Billy~Verreynne

              There are two layers that need to be working for CRS to start.

               

              Storage. IMO, multipath is mandatory for managing cluster storage at the physical level. To check whether the storage is available, use the multipath -l command to get a device listing. I usually use multipath -l | grep <keyword> to list the LUNs, where the keyword identifies the LUN entries. E.g.

               

              [root@xx-rac01 ~]# multipath -l | grep VRAID | sort
              VNX-LUN0 (360060160abf02e00f8712272de99e111) dm-8 DGC,VRAID
              VNX-LUN1 (360060160abf02e009050a27bde99e111) dm-3 DGC,VRAID
              VNX-LUN2 (360060160abf02e009250a27bde99e111) dm-9 DGC,VRAID
              VNX-LUN3 (360060160abf02e009450a27bde99e111) dm-4 DGC,VRAID
              VNX-LUN4 (360060160abf02e009650a27bde99e111) dm-0 DGC,VRAID
              VNX-LUN5 (360060160abf02e009850a27bde99e111) dm-5 DGC,VRAID
              VNX-LUN6 (360060160abf02e009a50a27bde99e111) dm-1 DGC,VRAID
              VNX-LUN7 (360060160abf02e009c50a27bde99e111) dm-6 DGC,VRAID
              VNX-LUN8 (360060160abf02e009e50a27bde99e111) dm-2 DGC,VRAID
              VNX-LUN9 (360060160abf02e00a050a27bde99e111) dm-7 DGC,VRAID

               

              If the LUN count is wrong and one or more LUNs are missing, I would check /var/log/messages for starters. One can also run a multipath flush and rediscovery (and up the verbosity level if errors are thrown).
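
              A minimal sketch of that check-and-rediscover sequence, run as root (the grep patterns and verbosity level are illustrative, not canonical):

              # check the syslog for multipath / SCSI errors first
              [root@xx-rac01 ~]# grep -i -e multipath -e 'i/o error' /var/log/messages | tail -50

              # flush all unused multipath device maps, then rediscover with more verbose output
              [root@xx-rac01 ~]# multipath -F
              [root@xx-rac01 ~]# multipath -v2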

               

              If all the LUNs are there, check device permissions and make sure that the Oracle s/w stack has access.
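
              For example (a sketch: the device names follow the listing above, and the grid owner is an assumption that depends on how the Grid stack was installed):

              # the Grid software owner must be able to read and write the devices
              [root@xx-rac01 ~]# ls -l /dev/mapper/VNX-LUN*

              # prove a LUN is actually readable, not merely visible to fdisk -
              # run the read as the Grid software owner
              [root@xx-rac01 ~]# su - grid -c "dd if=/dev/mapper/VNX-LUN0 of=/dev/null bs=1M count=1"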

               

              The other layer that needs to be working is the Interconnect. There are two basic things to check. Does the local Interconnect interface exist? This can be checked using ifconfig. And does this Interconnect interface communicate with the other cluster nodes' Interconnect interfaces? This can be checked using ping - or, if InfiniBand is used, via ibhosts and the other ib* commands.
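
              A quick sketch of both checks (eth1 and 192.168.10.2 are placeholders for the private interface and the remote node's private address):

              # does the local interconnect interface exist and is it UP?
              [root@xx-rac01 ~]# ifconfig eth1

              # can it reach the other node's interconnect address over that interface?
              [root@xx-rac01 ~]# ping -c 3 -I eth1 192.168.10.2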

               

              So if CRS does not start, these two checks (storage and Interconnect) would be my first port of call, as in my experience one of these two layers has failed the vast majority of the time.

              • 4. Re: Checking the LUNs when clusterware doesn't come up
                AjithPathiyil

                Hi Kavanagh,

                 

                Please include the below callout script to get notified immediately about a node eviction and its reason (in your case, the LUNs not being reachable by CRS would be shown as the reason for the node eviction). Please mark this reply as answered if it solves your issue.

                 

                From a shell prompt (logged in as oracle) on each server, navigate to /u01/grid/oracle/product/11.2.0/grid_1/racg/usrco. Create a file there called callout1.sh using vi (or your favorite editor). The contents of the file should be this:

                ----------------------------------------------------------------------------------------------------

                #!/bin/ksh

                umask 022

                FAN_LOGFILE=/tmp/`hostname`_uptime.log

                echo $* "reported="`date` >> $FAN_LOGFILE &

                ----------------------------------------------------------------------------------------------------

                Note the use of backticks around the hostname and date commands.

                 

                [oracle@<node_name> ~]$ chmod 755 /u01/grid/oracle/product/11.2.0/grid_1/racg/usrco/callout1.sh

                 

                [oracle@<node_name> ~]$ tail -f /u01/grid/oracle/product/11.2.0/grid_1/log/`hostname -s`/crsd/crsd.log

                 

                 

                You can get more information from the MOS notes below, too:

                MOS Article ID # 1050908.1, How to Troubleshoot Grid Infrastructure Startup Issues

                MOS Article ID # 1050693.1, Troubleshooting 11.2 Clusterware Node Evictions (Reboots)

                • 5. Re: Checking the LUNs when clusterware doesn't come up
                  Kavanagh

                  Will check this. Thank you, Ajith and Billy.