0 Replies Latest reply on Sep 2, 2013 1:08 PM by BretCalvey

    "Did not receive a response to a ping within 3000 millis"




      We are seeing the following error very frequently in our Coherence*Extend clients:


      com.tangosol.net.messaging.ConnectionException: TcpConnection(Id=0x00000140DD7B35F5AC170A7883DB248E7A6277696018AF78C850F2B21B6A819F, Open=true, Member(Id=0, Timestamp=2013-09-02 09:02:54.179, Address=, MachineId=0, Location=site:,machine:va-msm01agt,process:30004,member:msm-1, Role=VcintMarketStatusManagerProcess), LocalAddress=, RemoteAddress= did not receive a response to a ping within 3000 millis


      This happens very frequently in our test environment and occasionally in production.


      What normally happens after a client logs this error is that it disconnects from the proxy node. We have logic that reconnects automatically, but we would like to know why it is happening in the first place (our operations staff are alerted every time it does).


      The only information I could find about something similar was here...




      Our platform team has assured us that there is nothing wrong with the network and nothing misconfigured in the VMs that run our test cluster. In production we use physical servers, and the problem occurs far less often there.


      In our test environment, the cluster isn't even under that much load.


      A message like the following usually appears in the proxy node logs at the same time:


      2013-09-02 09:21:05,816 [Logger@9247854] DEBUG Coherence 2013-09-02 09:21:05.417/489919.723 Oracle Coherence GE <D5> (thread=Cluster, member=1): Service guardian is 14808ms late, indicating that this JVM may be running slowly or experienced a long GC
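
For anyone correlating the two logs, the guardian message gives a figure that can be compared directly with the ping timeout: a JVM reported as 14808 ms late would plausibly be unable to answer a ping within 3000 ms. The sketch below is only an illustration of that comparison. The class and method names (`GuardianLateness`, `parseLateMillis`, `exceedsPingTimeout`) are hypothetical helpers and not part of Coherence, and the regular expression assumes the message wording stays exactly as quoted above.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Coherence API: extracts the lateness figure
// from a "Service guardian is Nms late" log line.
class GuardianLateness {
    // Assumes the message wording matches the proxy log line quoted in the post.
    private static final Pattern LATE =
            Pattern.compile("Service guardian is (\\d+)ms late");

    // Returns the reported lateness in milliseconds, or empty if the line does not match.
    static OptionalLong parseLateMillis(String logLine) {
        Matcher m = LATE.matcher(logLine);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    // True when the reported stall is longer than the given ping timeout.
    static boolean exceedsPingTimeout(String logLine, long pingTimeoutMillis) {
        OptionalLong late = parseLateMillis(logLine);
        return late.isPresent() && late.getAsLong() > pingTimeoutMillis;
    }

    public static void main(String[] args) {
        String line = "Oracle Coherence GE <D5> (thread=Cluster, member=1): "
                + "Service guardian is 14808ms late, indicating that this JVM "
                + "may be running slowly or experienced a long GC";
        System.out.println(exceedsPingTimeout(line, 3000));
    }
}
```

Running this over the proxy logs and lining the matches up against the client-side ConnectionException timestamps would show whether every missed ping coincides with a stalled proxy JVM, or only some of them.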


      Three seconds seems like a long time not to respond to a ping, so I suspected a problem with the network or with the VM setup, but I have been assured that this is not the case.


      Has anyone else experienced anything similar?


      Should we just increase the timeouts? (I'd rather not, because it seems something nasty is happening somewhere, but I have no idea how to track it down.)
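
One possible way to track it down, independent of Coherence, is to measure whether the JVM or the VM underneath it is freezing. The sketch below is a minimal pause detector under stated assumptions: a thread sleeps for a short, fixed interval and records how much longer than requested it actually took to wake up. The class name `PauseDetector`, the 100 ms interval, and the 1000 ms reporting threshold are all illustrative choices, not values taken from the cluster in question.

```java
import java.util.concurrent.TimeUnit;

// Illustrative pause detector: a large overrun on a short sleep suggests the
// whole JVM (or the VM hosting it) was stalled, e.g. by GC or CPU starvation.
class PauseDetector {
    // Sleeps for sleepMillis and returns how many extra milliseconds elapsed.
    static long measureOverrunMillis(long sleepMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return Math.max(0, elapsed - sleepMillis);
    }

    // Repeats the measurement and returns the worst overrun seen.
    static long worstOverrunMillis(long sleepMillis, int iterations) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            worst = Math.max(worst, measureOverrunMillis(sleepMillis));
        }
        return worst;
    }

    public static void main(String[] args) {
        long thresholdMillis = 1000; // assumed reporting threshold
        // 600 iterations of 100 ms is roughly one minute of sampling.
        for (int i = 0; i < 600; i++) {
            long overrun = measureOverrunMillis(100);
            if (overrun > thresholdMillis) {
                System.out.println("Stall of about " + overrun + " ms detected");
            }
        }
    }
}
```

If stalls show up on the proxy hosts at the same moments as the ping failures, enabling GC logging (for example with `-verbose:gc`) would help separate long collections from pauses imposed by the hypervisor, since the latter would appear here without a matching GC entry.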


      Thanks in advance