We are running a two-node Oracle RAC (188.8.131.52) on Red Hat Enterprise Linux 5, 64-bit. Due to a core switch outage, node 2
lost its network connection and went down. The database services were automatically moved to node 1. Now we need to bring node 2 back into the cluster.
What would be the proper order of commands to achieve this?
Thanks for your ideas!
If the server just crashed, you shouldn't have to do anything other than start the server again and make sure that GI (Grid Infrastructure) starts up. Unless you've turned off autostart, GI starts as part of the boot process.
If autostart is turned off, start it manually (as root):
$ORA_GI_HOME/bin/crsctl start crs
The node is still part of the cluster and it will recover itself and join the cluster again.
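As a minimal sketch of the manual start described above: the function below starts the stack only if Oracle High Availability Services does not already report itself online. The function name start_gi_if_down is made up for illustration, ORA_GI_HOME is assumed to point at the Grid home as in the command above, and the check relies on 'crsctl check crs' printing message CRS-4638 when OHAS is up. Run it as root.

```shell
# Hypothetical helper: start the GI stack on this node only if OHAS is not online.
# Assumes ORA_GI_HOME is set to the Grid Infrastructure home.
start_gi_if_down() {
    crsctl="$ORA_GI_HOME/bin/crsctl"
    # CRS-4638 is the "Oracle High Availability Services is online" message.
    if "$crsctl" check crs 2>&1 | grep -q 'CRS-4638'; then
        echo "GI stack already running on this node"
    else
        # Same command as in the answer above.
        "$crsctl" start crs
    fi
}
```

Call it with start_gi_if_down once the server is back up. Note that CRS-4638 only covers the lowest layer of the stack, so the 'crsctl stat res -t' output is still worth checking afterwards.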
What do you mean by 'lost the interconnection'? Did node2 lose the connection to node1 when the switch died and then reboot itself? That is to be expected, but when the server starts up again it should simply rejoin the cluster. If it hasn't, then something is fishy.
You don't have to add any resources back to the Cluster Registry after a node reboot.
Could you run 'crsctl stat res -t -init' and 'crsctl stat res -t' from node2?
I'll post an excerpt from the grid alert log:
[cssd(20848)]CRS-1612:Network communication with node rac1srv0081 (2) missing for 50% of timeout interval. Removal of this node from cluster in 14.410 seconds
[cssd(20848)]CRS-1611:Network communication with node rac1srv0081 (2) missing for 75% of timeout interval. Removal of this node from cluster in 7.390 seconds
[cssd(20848)]CRS-1610:Network communication with node rac1srv0081 (2) missing for 90% of timeout interval. Removal of this node from cluster in 2.380 seconds
[cssd(20848)]CRS-1632:Node rac1srv0081 is being removed from the cluster in cluster incarnation 217939718
[cssd(20848)]CRS-1601:CSSD Reconfiguration complete. Active nodes are rac1srv0080 .
[ctssd(21102)]CRS-2407:The new Cluster Time Synchronization Service reference node is host rac1srv0080.
[crsd(22323)]CRS-5504:Node down event reported for node 'rac1srv0081'.
Maybe this can throw some light onto the subject.
Yes, node2 got evicted because the network was down ("Node rac1srv0081 is being removed from the cluster in cluster incarnation 217939718"). This is expected behaviour: the eviction prevents a split-brain.
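To find such evictions quickly in a long alert log, a small filter like the following may help. It is only a sketch: the function name evicted_nodes is made up, and it assumes the CRS-1632 message has exactly the wording shown in the excerpt above. It reads the files given as arguments, or standard input if none are given.

```shell
# Hypothetical helper: print "<node> <cluster incarnation>" for each CRS-1632 line.
evicted_nodes() {
    # -n with the trailing p prints only lines where the substitution matched.
    sed -n 's/.*CRS-1632:Node \([^ ]*\) is being removed from the cluster in cluster incarnation \([0-9]*\).*/\1 \2/p' "$@"
}
```

For example, evicted_nodes "$ALERT_LOG", where ALERT_LOG is a placeholder for the path to the grid alert log on the surviving node. For the excerpt above it prints rac1srv0081 217939718.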
Is the network back up now? Is node2 (rac1srv0081) up again?
Can you run '$ORA_GI_HOME/bin/crsctl stat res -t -init' and '$ORA_GI_HOME/bin/crsctl stat res -t' from both node2 and node1?