7 Replies Latest reply: Feb 27, 2013 3:21 AM by manish k

    *Extend member disconnections

    manish k
      Hi,

      We have a cluster comprising 4 proxy nodes, with around *100* read-only Extend members consuming data from it. For the last 3-4 weeks, however, Extend members created on two particular boxes have been getting disconnected intermittently. Each disconnection invalidates all near caches, and we rely heavily on near caches to meet some very aggressive SLAs.

      After initial investigation, the only thing we could find in common between the two machines is that both are behind a firewall. The firewall has been opened so that these boxes can reach the proxy hosts.

      15 Jan 2013 00:35:14,515 Logger@9264171 3.7.1.4 DEBUG Coherence - 2013-01-15 00:35:14.514/1455022.351 Oracle Coherence GE 3.7.1.4 <D6> (thread=Proxy:extendTcpProxyService:TcpAcceptor, member=5): Closed: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=false, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=X:X:X:X:0, MachineId=0, Location=site:,machine:XXXX,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:4100, RemoteAddress=X:X:X:X:XXX) due to:
      com.tangosol.net.messaging.ConnectionException: TcpConnection(Id=0x0000013C235209E00BB0A0A605B6F8EC80A35BD751C8FF8AC9ED472D18FAB99D, Open=true, Member(Id=0, Timestamp=2013-01-03 22:07:28.346, Address=28.224.160.77:0, MachineId=0, Location=site:,machine:xxxx,process:20866, Role=JavaLangThread), LocalAddress=X:X:X:X:XXX, RemoteAddress=X:X:X:X:XXX)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onRead(TcpAcceptor.CDB:203)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onSelect(TcpAcceptor.CDB:44)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onNotify(TcpAcceptor.CDB:13)
           at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
           at java.lang.Thread.run(Thread.java:619)
      Caused by: java.io.IOException: Connection timed out
           at sun.nio.ch.FileDispatcher.read0(Native Method)
           at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
           at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
           at sun.nio.ch.IOUtil.read(IOUtil.java:206)
           at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onRead(TcpAcceptor.CDB:53)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onSelect(TcpAcceptor.CDB:44)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.peer.acceptor.TcpAcceptor$TcpProcessor.onNotify(TcpAcceptor.CDB:13)
           at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
           at java.lang.Thread.run(Thread.java:619)

      We suspect that the issue lies with the network connection. We have started capturing all traffic in and out of the client boxes.

      I wanted to know if there is something more straightforward we can do to investigate the issue. Also, what should we look for in the TCP dump to tell whether or not all is well with the network?

      Your comments/suggestions are much appreciated.

      Thanks,
      Manish
        • 1. Re: *Extend member disconnections
          user123799
          Hi Manish,

          A long shot, but I remember something similar some time ago - and it turned out our firewall was killing 'idle' connections. To fix it, either disable this firewall 'feature' or turn on client/server heartbeats in Coherence.

          See the heartbeat-interval and heartbeat-timeout settings under the proxy's outgoing-message-handler:

          http://docs.oracle.com/cd/E24290_01/coh.371/e22837/appendix_cacheconfig.htm#BABHAFCB
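
          As a rough sketch of what that might look like in the proxy's cache configuration: the service name below is taken from the log in the original post, while the 30s/10s values are placeholders only (pick an interval comfortably below the firewall's idle timeout), and the tcp-acceptor address and port are left out.

```xml
<proxy-scheme>
  <service-name>extendTcpProxyService</service-name>
  <acceptor-config>
    <!-- tcp-acceptor (local address and port) omitted from this sketch -->
    <outgoing-message-handler>
      <!-- Placeholder values, not recommendations -->
      <heartbeat-interval>30s</heartbeat-interval>
      <heartbeat-timeout>10s</heartbeat-timeout>
    </outgoing-message-handler>
  </acceptor-config>
  <autostart>true</autostart>
</proxy-scheme>
```

          The same outgoing-message-handler element should also be available on the client side, under the initiator-config of the remote cache scheme, if you would rather have the clients send the heartbeats.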
          • 2. Re: *Extend member disconnections
            manish k
            Thanks Andy.

            I have followed this up with the network team, and they agree that the firewall could be the problem.

            I'll post the solution.

            Thanks,
            Manish
            • 3. Re: *Extend member disconnections
              manish k
              Hi,

              It seems that adding the heartbeat has helped. We are not seeing any more timeouts.

              However, we already had OS-level keepalive enabled before we added the heartbeat. Has anyone seen a situation where Linux keepalive doesn't prevent Extend connections from going stale?

              Thanks for your suggestion though, BigAndy!

              Manish
              • 4. Re: *Extend member disconnections
                user123799
                Hi Manish,

                Keep alive may be enabled, but with what settings? I think the default for tcp_keepalive_time is hours, not minutes. So if your firewall is killing things within any reasonable time frame, the default keep alive settings won't help.

                You could tweak your OS keep alive settings to kick in before your firewall kicks you out. Not sure how techy you are, so this might help:

                http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

                Andy.
                • 5. Re: *Extend member disconnections
                  manish k
                  Hi Andy,

                  The TCP keepalive settings on the boxes where the cluster is deployed are 7200, 75, 9, i.e. the defaults.

                  However, on the machine where the *Extend member runs they are 1500, 75, 9, i.e. the client machine should send a keepalive probe after 25 minutes of inactivity. The idle timeout at the firewall is 1 hour.

                  Given this configuration, the Extend member shouldn't have been disconnected: after 25 minutes of inactivity the *Extend client machine should have sent a TCP keepalive probe, which should have stopped the firewall from killing the connection.
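
                  To make the arithmetic explicit, here is a small sketch. It assumes the three numbers are tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes (in seconds, seconds and a count), that keepalive is actually enabled on the sockets, and that the firewall counts keepalive probes as traffic. KeepaliveCheck and its methods are hypothetical helpers, not part of Coherence.

```java
// Hypothetical helper for illustration only; not part of Coherence.
final class KeepaliveCheck {

    /** True if the first keepalive probe is sent before the firewall's idle timeout expires. */
    static boolean probesStartBeforeTimeout(int keepaliveTimeSec, int firewallIdleTimeoutSec) {
        return keepaliveTimeSec < firewallIdleTimeoutSec;
    }

    /** Worst-case idle seconds until an unresponsive peer is declared dead. */
    static int deadPeerDetectionSec(int keepaliveTimeSec, int intervalSec, int probes) {
        return keepaliveTimeSec + intervalSec * probes;
    }

    public static void main(String[] args) {
        int firewallIdleTimeoutSec = 3600; // 1 hour, as stated above

        // *Extend client box: 1500, 75, 9
        System.out.println("client probes in time:  "
                + probesStartBeforeTimeout(1500, firewallIdleTimeoutSec));
        System.out.println("client dead-peer (s):   " + deadPeerDetectionSec(1500, 75, 9));

        // Cluster boxes: 7200, 75, 9
        System.out.println("cluster probes in time: "
                + probesStartBeforeTimeout(7200, firewallIdleTimeoutSec));
        System.out.println("cluster dead-peer (s):  " + deadPeerDetectionSec(7200, 75, 9));
    }
}
```

                  With these numbers only the client side (1500 s) would send a probe inside the 1 hour window; the cluster side (7200 s) would not, so everything depends on the firewall honouring the client's probes.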

                  Thanks,
                  Manish
                  • 6. Re: *Extend member disconnections
                    user123799
                    Hi Manish,

                    A quick Google search shows that a lot of people report firewalls ignoring TCP keepalives...

                    Andy
                    • 7. Re: *Extend member disconnections
                      manish k
                      Hi Andy,

                      Seems quite probable.

                      Just an update: since adding the Coherence heartbeat we haven't seen any member disconnections.

                      Is there any Coherence documentation that explains the keepalive/heartbeat requirements for Coherence applications?

                      Manish