5 Replies Latest reply: May 28, 2014 12:43 AM by Billy~Verreynne

    What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?

    Hasan Al Mamun

      Hi, I am administering an Oracle RAC 11gR2 cluster on Oracle Linux 6.2. Everything is working fine. My concern is this: if connectivity between the nodes and the ASM storage is broken for any reason, what do I have to do to restore my database? My RMAN backups and archived log files are also in the ASM FRA storage. Thanks and regards, Hasan Al Mamun

        • 1. Re: What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?
          teits

          If just one instance's server loses connectivity to the ASM storage, that instance might crash. When you re-establish the connection, instance recovery will be performed as part of the startup process.

          If all of the instance servers lose connectivity to the ASM storage, re-establish the connection to the storage first. The RAC instances may have crashed; if so, the startup process will carry out instance recovery for them.

          The redo logs are essential for instance recovery. You should not have to restore the database as long as you have not lost any physical disk or data on the ASM disks/LUNs.

           

          Tobi

          • 2. Re: What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?
            Hasan Al Mamun

            Hi Tobi,

            Thanks for the reply. How can I re-establish the connection? Do I have to issue oracleasm scandisks or something else? What commands might be needed to re-establish it?

            Thanks,

            Hasan Al Mamun

            • 3. Re: What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?
              Tom321

              Hi,

               

              That depends on the case. If you get your disks back online in the storage and your OS shows them as available, a restart of the CRS stack should be all that is needed (crsctl start crs). If most parts of the CRS stack are still running, you can also try to start only the components that are down; "crsctl stat res -t" shows you what is up and running.
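
To make that decision explicit, here is a minimal sketch in Python. It only builds the list of commands to run; it does not execute anything. The function name and its two boolean inputs are my own assumptions for illustration, and the commands would normally be run as root on the affected node.

```python
from typing import List


def crs_recovery_commands(disks_visible: bool, stack_running: bool) -> List[str]:
    """Return the CRS-level commands suggested for a node after a storage outage.

    disks_visible -- the OS can see the ASM disks again (hypothetical input)
    stack_running -- most of the CRS stack is still up (hypothetical input)
    """
    if not disks_visible:
        # Nothing useful can be done at the CRS level until the OS
        # sees the disks again, so the plan is empty.
        return []
    if stack_running:
        # Stack mostly up: inspect the resources first, then start
        # whatever is offline by hand.
        return ["crsctl stat res -t"]
    # Stack down: restart it, then verify the resources.
    return ["crsctl start crs", "crsctl stat res -t"]


if __name__ == "__main__":
    for command in crs_recovery_commands(disks_visible=True, stack_running=False):
        print(command)
```

The point of returning a list rather than running the commands is that you can review the plan before touching a production node.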

              On Linux, disks sometimes come back with a different device name; you should use ASMLib to label the disks to avoid that problem.

              Besides this, you have both your data and your backups in ASM. While that is fine, I would recommend having an additional "last resort" backup strategy, e.g. backing up the recovery area to tape or to a filesystem (which can be non-clustered). Otherwise you risk losing your entire database.

               

              Regards

              Thomas

              • 4. Re: What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?
                Billy~Verreynne

                Hasan Al Mamun wrote:

                 

                Hi, I am administering an Oracle RAC 11gR2 cluster on Oracle Linux 6.2. Everything is working fine. My concern is this: if connectivity between the nodes and the ASM storage is broken for any reason, what do I have to do to restore my database? My RMAN backups and archived log files are also in the ASM FRA storage. Thanks and regards, Hasan Al Mamun

                 

                That depends on the type and redundancy of your I/O fabric layer. The entire RAC may go down. Only some nodes may go down. All nodes may still be available.

                 

                Example. The fabric layer is 4Gb/s fibre, with dual-port HBAs per node and each port wired to one of two redundant SAN switches. If a switch fails, a single port on each node will fail. Node software such as Multipath will fail I/O over to the working port. The RAC instances will remain up and running.

                 

                Example. Same architecture, but with 2 SANs. ASM is used for mirroring across the SANs (failgroup 1's disks on SAN1 and failgroup 2's disks on SAN2, for a normal redundancy ASM diskgroup). SAN1 fails. ASM will mark failgroup 1 as being "down" (missing disks). After the repair time period (the diskgroup's disk_repair_time attribute) has expired, ASM will proceed to drop all disks in failgroup 1. The RAC instances will remain up and running.

                 

                Example. A single switch/SAN is used and it goes down. The RAC instances can no longer perform db I/O and terminate abnormally. CRS still has a heartbeat via the Interconnect, but none via disk (OCR and voting disks). CRS terminates abnormally.

                 

                So if your architecture is designed for high availability and redundancy, only a serious/catastrophic failure should knock your RAC instances down.
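
The three examples above can be summarised as a small decision function. This is only a simplified sketch in Python under my own assumptions: the function name, its parameters, and the outcome strings are hypothetical, and a real cluster has more variables (voting disk placement, number of failgroups, which diskgroups live on which array).

```python
def predict_outcome(switches_total: int, switches_up: int,
                    sans_total: int, sans_up: int,
                    asm_mirrored_across_sans: bool) -> str:
    """Rough outcome for the RAC instances, following the three examples."""
    if switches_up == 0 or sans_up == 0:
        # No path to any storage: db I/O and the disk heartbeat both fail.
        return "instances and CRS terminate"
    if sans_up < sans_total:
        if asm_mirrored_across_sans:
            # The surviving failgroup still holds a full mirror copy.
            return "failgroup marked offline; instances stay up"
        # Simplification: without mirroring, data on the lost array is gone.
        return "instances and CRS terminate"
    if switches_up < switches_total:
        # One path lost, the other HBA port still works.
        return "multipath fails over; instances stay up"
    return "no failure; instances stay up"


if __name__ == "__main__":
    # Example 1: two switches, one fails.
    print(predict_outcome(2, 1, 1, 1, False))
    # Example 2: two SANs mirrored by ASM, SAN1 fails.
    print(predict_outcome(2, 2, 2, 1, True))
    # Example 3: single switch/SAN goes down.
    print(predict_outcome(1, 0, 1, 1, False))
```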

                • 5. Re: What happens when connection between any Node or all Nodes and Storage is Broken in Oracle RAC?
                  Billy~Verreynne

                  Hasan Al Mamun wrote:

                   

                  How can I re-establish the connection? Do I have to issue oracleasm scandisks or something else? What commands might be needed to re-establish it?

                   

                   

                  I assume you mean a node has lost storage connectivity?

                   

                  The easiest way to get things back to normal is a reboot. A server reset also resets the HBA/HCA ports, has the kernel do a fresh scan of the I/O fabric layer, etc.

                   

                  If the node is still up and partially connected, it is a tad more complex.

                   

                  Example. 2 SANs/storage arrays are used (with ASM mirroring diskgroups over storage arrays). 1 storage array goes down. ASM marks the disks from the down array as missing.

                   

                  The storage array is restarted and its storage is available again. The next step is to ensure that the nodes see the disks of this storage array again. The method depends on the type of I/O fabric layer and storage protocol used. In some cases the nodes will automatically detect that the disks are available again. In other cases you may need to run an iscsi/isr/srp/FCoE/etc. command, flush and reset multipath (which will remove unused/missing devices and rediscover the mpath disks), etc.
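
As a rough illustration of what "the method depends on the protocol" can look like on Linux, here is a minimal Python sketch that assembles a rediscovery plan without running it. The function name, the flags, and the choice to cover only iSCSI and Fibre Channel are my assumptions; "hostN" is a placeholder for the real SCSI host number, and other fabrics need their own tooling.

```python
from typing import Dict, List

# Rescan commands per protocol. Only two protocols are covered here;
# anything else is deliberately left out rather than guessed.
RESCAN_COMMANDS: Dict[str, List[str]] = {
    "iscsi": ["iscsiadm -m session --rescan"],
    # hostN is a placeholder: repeat for each HBA's SCSI host.
    "fc": ['echo "- - -" > /sys/class/scsi_host/hostN/scan'],
}


def rediscovery_plan(protocol: str, uses_multipath: bool = True,
                     uses_asmlib: bool = False) -> List[str]:
    """Return the commands to make a node see the repaired array's disks."""
    key = protocol.lower()
    if key not in RESCAN_COMMANDS:
        raise ValueError("no rescan recipe in this sketch for: " + protocol)
    plan = list(RESCAN_COMMANDS[key])
    if uses_multipath:
        # Flush unused maps, rebuild them, then list the result.
        plan += ["multipath -F", "multipath -v2", "multipath -ll"]
    if uses_asmlib:
        # Only relevant when the disks are labelled with ASMLib.
        plan += ["oracleasm scandisks", "oracleasm listdisks"]
    return plan


if __name__ == "__main__":
    for command in rediscovery_plan("iscsi", uses_multipath=True, uses_asmlib=True):
        print(command)
```

Check the resulting device list against what ASM expects before onlining anything.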

                   

                  When the disks are available again, and ASM has not yet begun to drop the missing disks, you can simply online the missing disks. ASM will detect that the disks are available again and will begin to rebalance that failgroup.

                   

                  If ASM has started dropping the missing disks by the time you have the missing storage array repaired, you can no longer simply online the missing disks. Instead, you need to remove the ASM disk labels from these disks (making them available as candidate disks), add them back to the failgroup (the one with the unknown/missing/dropping disks), and rebalance the diskgroup.
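
The choice between the two paths above can be sketched as follows. This Python snippet only generates the ASM SQL statements for review; it does not connect to ASM. The function name, the diskgroup/failgroup names, and the disk paths are hypothetical, and removing the old disk labels is site specific, so it is noted in a comment rather than scripted.

```python
from typing import List, Sequence


def asm_recovery_sql(diskgroup: str, failgroup: str, drop_started: bool,
                     disk_paths: Sequence[str] = (), power: int = 4) -> List[str]:
    """Return the ALTER DISKGROUP statements for the repaired failgroup."""
    if not drop_started:
        # Repair timer has not expired: the disks can simply be onlined.
        return ["ALTER DISKGROUP %s ONLINE DISKS IN FAILGROUP %s;"
                % (diskgroup, failgroup)]
    if not disk_paths:
        raise ValueError("disk paths are required once ASM has started dropping")
    # Precondition (not scripted here): the old ASM labels have already
    # been removed so the disks show up as candidates.
    disks = ", ".join("'%s'" % path for path in disk_paths)
    return [
        "ALTER DISKGROUP %s ADD FAILGROUP %s DISK %s;" % (diskgroup, failgroup, disks),
        "ALTER DISKGROUP %s REBALANCE POWER %d;" % (diskgroup, power),
    ]


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    for statement in asm_recovery_sql("DATA", "FG1", True,
                                      ["/dev/mapper/san1_disk1",
                                       "/dev/mapper/san1_disk2"]):
        print(statement)
```

To tell which path you are on, the REPAIR_TIMER column of V$ASM_DISK shows how many seconds remain before an offline disk is dropped.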