This discussion is archived
7 Replies Latest reply: Feb 27, 2013 1:21 AM by 936311

*Extend member disconnections

936311 Newbie
Hi,

We have a cluster comprising 4 proxy nodes, and around *100* read-only extend members consume data from the cluster. For the last 3-4 weeks, however, extend members created on two particular boxes have been getting disconnected intermittently. Because of this, all near caches get invalidated. We rely heavily on near caches to meet some very aggressive SLAs.

After initial investigation, the only thing we could find in common between the two machines is that they are both behind a firewall. The firewall has been opened so that these boxes can reach the proxy hosts.

15 Jan 2013 00:35:14,515 Logger@9264171 3.7.1.4 DEBUG Coherence - 2013-01-15 00:35:14.514/1455022.351 Oracle Coherence GE 3.7.1.4 <D6> (thread=Proxy:extendTcpProxyService:TcpAcceptor, member=5): Closed: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=false, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=X:X:X:X:0, MachineId=0, Location=site:,machine:XXXX,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:4100, RemoteAddress=X:X:X:X:XXX) due to:
com.tangosol.net.messaging.ConnectionException: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=true, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=28.224.160.77:0, MachineId=0, Location=site:,machine:xxxx,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:XXX, RemoteAddress=X:X:X:X:XXX)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onRead(TcpAcceptor.CDB:203)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onSelect(TcpAcceptor.CDB:44)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onNotify(TcpAcceptor.CDB:13)
     at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
     at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Connection timed out
     at sun.nio.ch.FileDispatcher.read0(Native Method)
     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onRead(TcpAcceptor.CDB:53)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onSelect(TcpAcceptor.CDB:44)
     at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onNotify(TcpAcceptor.CDB:13)
     at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
     at java.lang.Thread.run(Thread.java:619)

We suspect that the issue lies with the network connection. We have started capturing all traffic in and out of the client boxes.

I wanted to know if there is something more straightforward we can do to investigate the issue. Also, what should we look for in the TCP dump to tell whether or not all is well with the network?

Your comments/suggestions are much appreciated.

Thanks,
Manish
  • 1. Re: *Extend member disconnections
    user123799 Newbie
    Hi Manish,

    A long shot, but I remember something similar some time ago, and it turned out our firewall was killing 'idle' connections. To fix it, either disable this firewall 'feature' or turn on client/server heartbeats in Coherence.

    See the heartbeat-interval and heartbeat-timeout settings in the proxy's outgoing-message-handler configuration:

    http://docs.oracle.com/cd/E24290_01/coh.371/e22837/appendix_cacheconfig.htm#BABHAFCB
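
As a rough sketch of the heartbeat suggestion above, a proxy-side cache configuration fragment might look something like the following. The service name and port are taken from the log in the original post; the address placeholder and the 30s/10s values are illustrative assumptions, not recommendations. The only real constraint is that the heartbeat interval should be comfortably shorter than whatever idle timeout the firewall applies.

```xml
<proxy-scheme>
  <!-- Service name as it appears in the log in the original post -->
  <service-name>extendTcpProxyService</service-name>
  <acceptor-config>
    <tcp-acceptor>
      <local-address>
        <!-- Placeholder address (assumption); port 4100 is from the posted log -->
        <address>proxy-host</address>
        <port>4100</port>
      </local-address>
    </tcp-acceptor>
    <outgoing-message-handler>
      <!-- Illustrative values (assumption): ping every 30s, wait up to 10s for a reply -->
      <heartbeat-interval>30s</heartbeat-interval>
      <heartbeat-timeout>10s</heartbeat-timeout>
    </outgoing-message-handler>
  </acceptor-config>
  <autostart>true</autostart>
</proxy-scheme>
```

The same outgoing-message-handler element can also be placed in the client's initiator-config if the heartbeat should originate from the extend client instead; check the linked appendix for the exact element descriptions in your Coherence version.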
  • 2. Re: *Extend member disconnections
    936311 Newbie
    Thanks Andy.

    I have followed this up with the network team, and they agree that the firewall could be the problem.

    I'll post the solution.

    Thanks,
    Manish
  • 3. Re: *Extend member disconnections
    936311 Newbie
    Hi,

    It seems that adding the heartbeat has helped. We are not seeing any more timeouts.

    However, we already had OS-level keepalive enabled before we added the heartbeat. Has anyone seen a situation where Linux keepalive doesn't prevent Extend connections from going stale?

    Thanks for your suggestion though, BigAndy!

    Manish
  • 4. Re: *Extend member disconnections
    user123799 Newbie
    Hi Manish,

    Keepalive may be enabled, but with what settings? I think the default for tcp_keepalive_time is hours, not minutes. So if your firewall is killing things within any reasonable time frame, the default keepalive settings won't help.

    You could tweak your OS keep alive settings to kick in before your firewall kicks you out. Not sure how techy you are, so this might help:

    http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

    Andy.
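
To see which keepalive settings a Linux box is actually using, a small sketch like the one below can help. It assumes Linux, where the system-wide values are exposed as files under /proc/sys/net/ipv4; the function name and the `proc_dir` parameter are made up for this example.

```python
from pathlib import Path

# Linux exposes the system-wide TCP keepalive settings as files in this directory.
PROC_DIR = Path("/proc/sys/net/ipv4")


def read_keepalive_settings(proc_dir=PROC_DIR):
    """Return (idle_seconds, probe_interval_seconds, probe_count).

    idle_seconds:            tcp_keepalive_time, idle time before the first probe
    probe_interval_seconds:  tcp_keepalive_intvl, gap between unanswered probes
    probe_count:             tcp_keepalive_probes, probes sent before giving up
    """
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return tuple(int((Path(proc_dir) / name).read_text().strip()) for name in names)
```

Calling `read_keepalive_settings()` on the client and proxy boxes gives the three numbers to compare against the firewall's idle timeout. These values only apply to sockets that have SO_KEEPALIVE switched on, and changing them (for example via `sysctl net.ipv4.tcp_keepalive_time`) normally needs root.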
  • 5. Re: *Extend member disconnections
    936311 Newbie
    Hi Andy,

    The TCP keepalive settings on the boxes where the cluster is deployed are 7200, 75, 9 (tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes), i.e. the defaults.

    However, on the machine where the *Extend member runs they are 1500, 75, 9, i.e. the client machine should send a keepalive probe after 25 minutes of inactivity. The timeout at the firewall level is 1 hour.

    Given this configuration, the extend member shouldn't have been disconnected: after 25 minutes of inactivity the *Extend client machine should have sent a TCP keepalive probe, which should have stopped the firewall from killing the connection.

    Thanks,
    Manish
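
The reasoning in the post above can be written out as a small calculation. This is only a sketch: it assumes keepalive is actually enabled on the Extend sockets and that the firewall counts keepalive probes as traffic, neither of which is confirmed in this thread. The function and constant names are made up for the example; the numbers are the ones quoted above.

```python
def keepalive_timeline(idle_s, interval_s, probes):
    """Return (first_probe_s, dead_peer_detected_s), measured from the last traffic.

    first_probe_s:        idle time before the first keepalive probe is sent
    dead_peer_detected_s: time until the connection is dropped if no probe is
                          ever answered (idle time plus one interval per probe)
    """
    return idle_s, idle_s + interval_s * probes


FIREWALL_IDLE_TIMEOUT_S = 60 * 60  # 1 hour, as stated in the post above

# Client box: 1500, 75, 9. Cluster boxes: 7200, 75, 9 (the defaults).
client_first_probe, client_dead = keepalive_timeline(1500, 75, 9)
server_first_probe, server_dead = keepalive_timeline(7200, 75, 9)

# Does each side send its first probe before the firewall's idle timeout expires?
client_in_time = client_first_probe < FIREWALL_IDLE_TIMEOUT_S
server_in_time = server_first_probe < FIREWALL_IDLE_TIMEOUT_S

print(f"client: first probe after {client_first_probe / 60:.0f} min, in time: {client_in_time}")
print(f"server: first probe after {server_first_probe / 60:.0f} min, in time: {server_in_time}")
```

On these numbers only the client side probes within the hour (25 minutes versus 120 minutes on the cluster boxes), so the connection depends entirely on the firewall honouring the client's probes.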
  • 6. Re: *Extend member disconnections
    user123799 Newbie
    Hi Manish,

    A quick google shows that a lot of people report some/many firewalls ignoring TCP keepalives...

    Andy
  • 7. Re: *Extend member disconnections
    936311 Newbie
    Hi Andy,

    Seems quite probable.

    Just an update: since adding the Coherence heartbeat we haven't seen any member disconnections.

    Is there any Coherence documentation that explains keepalive/heartbeat requirements in Coherence apps?

    Manish
