I have a 2 node 11gR2 RAC cluster with both SCAN and node VIPs configured. Users connect to the SCAN address. The SCAN listener then redirects connections to the node listener. Does it redirect to the node listener via the node's address or the node's VIP address?
Here's why I'm asking. I'm trying to troubleshoot a recurring problem where node2's VIPs (SCAN and node VIP) fail over to node1 every few weeks. The CRS logs on node1 show that it (incorrectly) considered node2 inaccessible, so it failed over both VIPs. Within a minute or so, however, node1 can see node2 again without anyone intervening.
Node2, however, remains up the entire time and is running normally. None of the node listener, database instance, or ASM instance logs shows any indication of a problem.
On node2's database instance however, all sessions have dropped even though the instance remained up. If the SCAN listeners redirect directly to node2, I would not expect this to happen and I can pin the problem down to some network component (switch, NIC, etc). If SCAN redirects to the node-VIP then I would expect this and need to troubleshoot why node1 is incorrectly thinking node2 went down.
The redirect packet from the SCAN listener contains the VIP name, so the client has to be able to resolve that name to the valid VIP address.
Once the connection is established, the client is connected directly to the node's VIP address.
I just tried it. If I stop the VIP, a SELECT issued from a connected client hangs, and if the VIP fails over, the client gets an error. So in 11.2 the connection really is bound to the VIP.
However, node1 does not simply take over the VIP on its own. The VIP is failed over when the network check fails.
Look in the orarootagent logfiles ($GI_HOME/log/<nodename>/agent/orarootagent) to see what happened.
Neither agent log indicates what the root cause is. There are simply references to ora.net1.network changing state as CRS moves the VIPs from one node to the other.
CRS has soooo many logs it's hard to figure out where to look for the root cause of something like this.
I'm running 126.96.36.199 and I think I remember reading something somewhere that there was a bug in that version of CRS and that it was a little too quick on the trigger to declare the other node dead and start moving VIPs. Wish I could remember where I read that and if it was true. If so, and if there is no work-around, I may have to upgrade CRS.
Users connect to the SCAN address. The SCAN listener then redirects connections to the node listener. Does it redirect to the node listener via the node's address or the node's VIP address?
That depends on:
a) the service the user requests to be connected to
b) the registered hostnames of that service in the Listener
The SCAN listener is a remote listener with which the local listeners register. It hands off (redirects) connections based on what has been registered with it.
Also, the redirect is done using a hostname and not a dotted IP. This means that, in exceptional circumstances, the client can use a hosts file in which all static and virtual hostnames refer to the same single (static or virtual) IP address.
In that case, whatever hostname the SCAN listener provides for the redirect will be resolved by the client to the same IP address. This is useful when troubleshooting, or when dealing with NAT firewalls.
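To see what the client would actually resolve each name to, a small check like the one below can help. It is only a sketch: the hostnames in the usage comment are hypothetical placeholders, and it uses the operating system resolver (hosts file, then DNS, depending on the client's configuration) rather than anything Oracle-specific.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname_ex):
    """Map each hostname to the sorted IPv4 addresses the client resolves it to.

    A name that cannot be resolved maps to an empty list.
    """
    results = {}
    for name in hostnames:
        try:
            # gethostbyname_ex returns (canonical_name, aliases, ip_addresses)
            _, _, addresses = resolver(name)
            results[name] = sorted(addresses)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            results[name] = []
    return results


def unresolved(results):
    """Return the names that did not resolve to any address."""
    return [name for name, addresses in results.items() if not addresses]


# Example usage (hypothetical names, replace with your SCAN and VIP hostnames):
#   found = resolve_all(["rac-scan.example.com", "node1-vip.example.com"])
#   for name, addresses in found.items():
#       print(name, addresses or "DOES NOT RESOLVE")
```

Run it on the client machine, not on a cluster node, since it is the client's view of name resolution that matters for the redirect.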
"lsnrctl service <scan_IP>" shows the IPs registered with the SCAN listener are the dotted IPs of the VIPs. So if CRS fails the VIP over to the other node, it stands to reason that all connections would be dropped. That symptom is now explained. I still need to figure out why the IP is failing over when there doesn't appear to be a legitimate reason.
<grid-home>/log/<node>/cssd/ocssd.log and <grid-home>/log/<node>/gipcd/gipcd.log should have some more information about your issue. I agree that there is a lot of information in there, but try filtering on the time of the issue with grep or awk.
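If grep or awk gets awkward, the same time filter can be sketched in Python. This assumes each log entry starts with a timestamp of the form "YYYY-MM-DD HH:MM:SS" (check a few lines of your own logs first, the format here is an assumption), and it keeps continuation lines that follow a matching entry. The function name and the default window are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Assumed entry prefix, e.g. "2012-03-05 10:15:23.456: [ CSSD]..."
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def lines_near(lines, incident, window_seconds=120):
    """Yield log lines stamped within window_seconds of the incident time.

    Lines without a leading timestamp are treated as continuations of the
    previous entry and are kept or dropped along with it.
    """
    start = incident - timedelta(seconds=window_seconds)
    end = incident + timedelta(seconds=window_seconds)
    keep = False
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            keep = start <= stamp <= end
        if keep:
            yield line


# Example usage (path and time are placeholders):
#   with open("ocssd.log", errors="replace") as handle:
#       for line in lines_near(handle, datetime(2012, 3, 5, 10, 15, 0)):
#           print(line, end="")
```

Running it with the same incident time against ocssd.log and gipcd.log on both nodes makes it easier to line up what each node saw in the minute around the failover.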