We have a cluster comprising 4 proxy nodes, and around *100* read-only Extend members consume data from it. For the last 3-4 weeks, Extend members created on two particular boxes have been getting disconnected intermittently. Because of this, all near caches get invalidated, and we rely heavily on near caches to meet some very aggressive SLAs.
After initial investigation, the only common thing we could find between the two machines was that they are both behind a firewall. The firewall has been opened so that these boxes can reach the proxy hosts.
15 Jan 2013 00:35:14,515 Logger@9264171 126.96.36.199 DEBUG Coherence - 2013-01-15 00:35:14.514/1455022.351 Oracle Coherence GE 188.8.131.52 <D6> (thread=Proxy:extendTcpProxyService:TcpAcceptor, member=5): Closed: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=false, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=X:X:X:X:0, MachineId=0, Location=site:,machine:XXXX,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:4100, RemoteAddress=X:X:X:X:XXX) due to:
com.tangosol.net.messaging.ConnectionException: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=true, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=184.108.40.206:0, MachineId=0, Location=site:,machine:xxxx,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:XXX, RemoteAddress=X:X:X:X:XXX)
Caused by: java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcher.read0(Native Method)
We suspect the issue lies with the network connection, and we have started capturing all traffic in/out of the client boxes.
I wanted to know if there is something more straightforward we can do to investigate the issue. Also, what should we look for in the tcpdump to confirm whether all is well with the network?
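For the capture itself, a filter pinned to the proxy port keeps the dump manageable. This is just a sketch; the interface name, the `<proxy-host>` placeholder, and port 4100 (taken from the log above) all need to be adjusted for your environment:

```shell
# Capture only Extend traffic between this client and the proxy (port 4100 as
# in the log above). -nn skips name/port resolution; -S prints absolute
# sequence numbers so retransmissions are easy to spot.
sudo tcpdump -i eth0 -nn -S -w extend.pcap 'host <proxy-host> and tcp port 4100'
```

Things worth looking for in the resulting pcap: keepalive probes (zero- or one-byte ACK segments repeating the last sequence number after the idle period) and whether they are answered; an RST arriving mid-connection, which suggests a middlebox tearing the session down; and repeated retransmissions with no replies, i.e. silent drops, which would line up with the "Connection timed out" in the proxy log.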
Your comments/suggestions are much appreciated.
A long shot, but I remember something similar some time ago, and it turned out our firewall was killing 'idle' connections. To fix it, either disable this firewall 'feature' or turn on client/server heartbeats in Coherence:
see the heartbeat-interval and heartbeat-timeout settings in the outgoing-message-handler configuration for the proxy:
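For reference, the config is shaped roughly like this. The element names come from the Coherence Extend cache configuration; the service names and the 60s/30s values are only illustrative — pick an interval comfortably shorter than the firewall's idle timeout:

```xml
<!-- proxy side: outgoing-message-handler inside the proxy's acceptor-config -->
<proxy-scheme>
  <service-name>extendTcpProxyService</service-name>
  <acceptor-config>
    <outgoing-message-handler>
      <!-- ping each client every 60s; drop it if no reply within 30s -->
      <heartbeat-interval>60s</heartbeat-interval>
      <heartbeat-timeout>30s</heartbeat-timeout>
    </outgoing-message-handler>
  </acceptor-config>
  <autostart>true</autostart>
</proxy-scheme>

<!-- client side: the same handler inside the remote scheme's initiator-config -->
<remote-cache-scheme>
  <service-name>ExtendTcpCacheService</service-name>
  <initiator-config>
    <outgoing-message-handler>
      <heartbeat-interval>60s</heartbeat-interval>
      <heartbeat-timeout>30s</heartbeat-timeout>
    </outgoing-message-handler>
  </initiator-config>
</remote-cache-scheme>
```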
It seems that adding the heartbeat has helped; we are not seeing any more timeouts.
However, we already had OS-level keepalive enabled before we added the heartbeat. Has anyone seen a situation where Linux keepalive doesn't stop Extend connections from going stale?
Thanks for the suggestion, BigAndy!
Keepalive may be enabled, but with what settings? I think the default for tcp_keepalive_time is hours, not minutes. So if your firewall is killing things within any reasonable time frame, the default keepalive settings won't help.
You could tweak your OS keepalive settings to kick in before your firewall kicks you out. Not sure how techy you are, so this might help:
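On Linux the current values can be read straight from /proc (paths assume Linux; the stock 7200/75/9 defaults mean the first probe only goes out after two hours of idle time):

```shell
# Current keepalive sysctls: idle seconds before the first probe, seconds
# between probes, and number of unanswered probes before the connection is
# declared dead. Stock defaults are 7200 / 75 / 9.
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes
```

To have probes fire well inside a 1-hour firewall idle timeout you could lower the idle value, e.g. `sysctl -w net.ipv4.tcp_keepalive_time=600` (persisted via /etc/sysctl.conf). One caveat: the kernel settings only apply to sockets that actually enable SO_KEEPALIVE, which is one reason an application-level heartbeat is the more reliable fix.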
The TCP keepalive settings on the boxes where the cluster is deployed are 7200, 75, 9, i.e. the defaults.
However, on the machines where the *Extend members* run they are 1500, 75, 9, i.e. the client machine should send a keepalive probe after 25 minutes of inactivity. The idle timeout at the firewall level is 1 hour.
Going by that configuration, the Extend members shouldn't have been disconnected: after 25 minutes of inactivity the client machine should have sent a TCP keepalive probe, which should have stopped the firewall from killing the connection.
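Spelling the timeline out with the numbers quoted above (1500/75/9 on the client box, 3600 s firewall idle timeout):

```shell
# Client-box keepalive settings quoted above
idle=1500; intvl=75; probes=9
echo "$idle"                        # 1500 s (25 min) of idle time before the first probe
echo $(( idle + intvl * probes ))   # 2175 s until a dead peer would be declared
```

Both figures sit comfortably inside the firewall's 3600 s idle window, so on paper the probes should have kept the session alive. That makes it worth confirming in the capture that the probes were actually going out on that socket, and that the firewall counts them as activity.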
Seems quite probable.
Just an update: after adding the Coherence heartbeat we haven't seen any member disconnections yet.
Is there any Coherence documentation that explains the keepalive/heartbeat requirements for Coherence apps?