1 Reply. Latest reply: Jun 17, 2013 1:13 AM by Jonathan.Knight

communication delay with a storage-disabled node

dadashy Newbie

Recently we have been seeing a series of messages similar to:

Oracle Coherence GE 3.7.1.4 <Warning> (thread=PacketPublisher, member=52):

Experienced a 6523 ms communication delay (probable remote GC) with Member …. 57 packets rescheduled, PauseRate=0.0, Threshold=2080


The communication delay keeps going up and down but sometimes reaches 30 seconds (and perhaps higher).


Almost all of the warning messages point to a specific storage-disabled node which acts as a gateway to our clients (it is not a proxy node, though). The process runs some (heavy) CQCs, has a memory footprint in the region of 2 GB, and a high number of threads (>1000).
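
For reference, each CQC on this node is registered roughly along the lines of the sketch below; the cache name, filter and listener are illustrative only, not our actual code:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.net.cache.ContinuousQueryCache;
    import com.tangosol.util.MapEvent;
    import com.tangosol.util.MultiplexingMapListener;
    import com.tangosol.util.filter.EqualsFilter;

    public class GatewayCqcSketch {

        public static void main(String[] args) {
            // Back cache held by the storage-enabled members of the cluster
            NamedCache orders = CacheFactory.getCache("orders");

            // The CQC keeps a local, continuously updated view of every entry
            // matching the filter. All inserts/updates/deletes for matching
            // entries are pushed over the cluster to this storage-disabled member.
            ContinuousQueryCache cqc = new ContinuousQueryCache(
                    orders,
                    new EqualsFilter("getStatus", "OPEN"),
                    new MultiplexingMapListener() {
                        protected void onMapEvent(MapEvent evt) {
                            // Forward the change on to downstream clients here
                            System.out.println("CQC event: " + evt);
                        }
                    });

            System.out.println("Local view size: " + cqc.size());
        }
    }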


Looking at the Coherence tuning documentation, there is a comment about fine-tuning the network switches, i.e. setting the amount of buffer space available to each Ethernet port. However, our network guys are not convinced that this is an infrastructure problem; they think it is specific to this particular process, and I tend to agree with them.


What exactly is this message saying? What sort of communication is building up? Considering that this is a storage-disabled node, is it only the CQC data transfer?

How can we identify the root cause and remedy the situation?

Does moving the process out of the cluster and behind a proxy help?


Any suggestion that might shed some light on this issue would be much appreciated.

  • 1. Re: communication delay with a storage-disabled node
    Jonathan.Knight Expert

    Hi,


    The message is basically saying that the member logging the message is having trouble communicating with another member. In this case it has not managed to communicate with the specified member for 6.5 seconds. The typical cause of this is a big GC pause on the other member or sometimes a network issue.


    Now, given what you have said about your application, I would think that this was indeed caused by a big GC pause. Do you have GC logging enabled for the gateway node you describe? Is it showing big GC pauses? At the end of the day a 6.5 second GC may not be too bad if your application can put up with it, but it does mean the cluster is not performing at its best, so maybe you should do some GC tuning.
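
    If GC logging is not already on, adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the gateway JVM's startup options is the easiest way to see the pauses. As a rough alternative you can poll the GC MXBeans from a thread inside the process; the sketch below is only illustrative (class name, interval and threshold are made up), not anything Coherence-specific:

        import java.lang.management.GarbageCollectorMXBean;
        import java.lang.management.ManagementFactory;

        public class GcActivityWatcher {

            public static void main(String[] args) throws InterruptedException {
                long lastTotalTime = 0L;
                while (true) {
                    // Sum cumulative GC time (ms) and collection count across all collectors
                    long totalTime = 0L;
                    long totalCount = 0L;
                    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                        totalTime += gc.getCollectionTime();
                        totalCount += gc.getCollectionCount();
                    }

                    long delta = totalTime - lastTotalTime;
                    if (delta > 1000) {
                        // More than a second spent in GC during the last interval
                        System.out.println(totalCount + " collections so far, +" + delta
                                + " ms of GC in the last 10 seconds");
                    }
                    lastTotalTime = totalTime;
                    Thread.sleep(10000);
                }
            }
        }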


    JK
