We have a Coherence cluster in prod with 109 nodes [comprising storage, engine, proxy and other non-storage nodes].
Our link speed is 1 Gbps. It is a switched network and all boxes are in the same subnet with no routers.
We have noticed the PBS [PublisherSuccessRate] drop below 98% after around 40 nodes have started up.
The JMX node is listed as the weakest channel for the affected nodes. We have a single, centralized JMX server.
We have tried a few options:
We have tuned the JMX server's GC.
We have executed the datagram tests to validate the throughput. We get a throughput of 114 MB/sec.
We have also tried disabling the registration of MBeans.
The above have helped to a degree by delaying the degradation of the PBS rate, but after around 25 minutes it begins to drop again.
SL makes RTView OCM, a monitoring tool for Coherence, so we have a lot of experience with this. It's all too easy to stress the JMX MBean server and see the publisher success rate go below 99% with larger clusters (in terms of the number of MBeans). This can happen especially when you are querying MBean data faster than the JMX node can return it. It's also typical during cluster startup, when every node registers its MBeans with the JMX node as it joins the cluster.
I assume you have a dedicated JMX node and that you have management=all only on that node? I also assume that the PublisherSuccessRate is below 99% only on the JMX node, and that you have a tool calculating this rather than looking at the JMX MBean value (which is calculated as an average since the node start time)?
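To illustrate why the lifetime average can hide a recent drop, here is a minimal sketch of an interval-based calculation. It assumes your tool can read two cumulative counters per node, total packets sent and packets resent, at two points in time, and that the success rate is one minus the resent fraction. The class and method names are hypothetical, and the numbers in main are made up.

```java
// Sketch only: success rate over one sampling interval, from counter deltas.
// Assumption: the inputs are cumulative sent / resent packet counts read
// from the same node at two points in time.
final class IntervalSuccessRate {

    /** One reading of the cumulative counters (hypothetical holder type). */
    static final class Sample {
        final long packetsSent;
        final long packetsResent;

        Sample(long packetsSent, long packetsResent) {
            this.packetsSent = packetsSent;
            this.packetsResent = packetsResent;
        }
    }

    /** Fraction of packets sent in the interval that did not need a resend. */
    static double rate(Sample previous, Sample current) {
        long sent = current.packetsSent - previous.packetsSent;
        long resent = current.packetsResent - previous.packetsResent;
        if (sent <= 0 || resent < 0) {
            // No traffic, or the counters were reset between the two readings.
            return 1.0;
        }
        return Math.max(0.0, 1.0 - (double) resent / sent);
    }

    public static void main(String[] args) {
        Sample before = new Sample(1_000_000L, 5_000L); // made-up numbers
        Sample after = new Sample(1_100_000L, 9_000L);

        // Average since node start versus the most recent interval only.
        double lifetime = 1.0 - (double) after.packetsResent / after.packetsSent;
        double interval = rate(before, after);
        System.out.println("lifetime=" + lifetime + " interval=" + interval);
    }
}
```

With these made-up numbers the lifetime figure is still roughly 0.992 while the last interval is 0.96, which is the kind of gap I am asking about.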
How many MBeans do you have in the cluster? Are you polling every MBean? How often? We find it takes about 1 msec per MBean to retrieve data in the best case; the worst case is much higher.
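As a rough back-of-the-envelope check, you can compare the time for one full sweep against your polling interval. The sketch below assumes polling is sequential and uses the best-case 1 msec per MBean figure; the MBean count and polling interval in main are hypothetical placeholders, so substitute your own.

```java
// Sketch only: does one full polling sweep fit inside the polling interval?
// Assumption: MBeans are polled one after another at a fixed per-MBean cost.
final class PollBudget {

    /** Estimated time for one full sweep of all MBeans, in milliseconds. */
    static double sweepMillis(long mbeanCount, double millisPerMBean) {
        return mbeanCount * millisPerMBean;
    }

    /**
     * Share of the polling interval used by one sweep.
     * A value above 1.0 means the poller asks for data faster than it can be returned.
     */
    static double utilization(long mbeanCount, double millisPerMBean, long pollIntervalMillis) {
        return sweepMillis(mbeanCount, millisPerMBean) / pollIntervalMillis;
    }

    public static void main(String[] args) {
        long mbeanCount = 20_000L;         // hypothetical cluster-wide MBean count
        double millisPerMBean = 1.0;       // best-case figure quoted above
        long pollIntervalMillis = 10_000L; // hypothetical 10 second polling interval

        System.out.println("sweep ms    = " + sweepMillis(mbeanCount, millisPerMBean));
        System.out.println("utilization = "
                + utilization(mbeanCount, millisPerMBean, pollIntervalMillis));
    }
}
```

If the utilization comes out near or above 1.0 even with the best-case cost, polling fewer MBeans or polling less often is the first thing to try.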