Hi,
The message means that the service guardian inside Coherence has detected that one of the threads it monitors has been processing the same task for too long without sending a heartbeat to the guardian. A common cause is a long GC pause in the node concerned. There are many other reasons why a thread can take a long time to run, e.g. CPU usage is so high that the process is not getting much CPU time, thread contention for a resource, or a thread deadlock. The most serious of these is obviously a deadlock, as it can bring the whole node, and occasionally the whole cluster, to a stop. By default Coherence tries to interrupt the thread concerned and ultimately stops the service, which can of course kill the node. You can change the default behavior so that Coherence just logs a thread dump when there is an issue. This will not fix your problem, but it will stop Coherence from killing the service.
To track down the cause you need to look at GC logs to see whether GC pauses are the problem, and at thread dumps to see why the thread is taking too long. The name of the thread to look at is in the error message, in this case thread=Cluster.
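As a rough illustration of the thread-dump and deadlock checks mentioned above, here is a minimal sketch that uses only the standard JDK `java.lang.management` API. It is not part of Coherence. The class name `ThreadDumpSketch` and its helper methods are made up for this example, and the "Cluster" filter simply reuses the thread name from the error message. In practice a tool such as jstack gives the same information from outside the process.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not a Coherence class.
final class ThreadDumpSketch {

    // Builds a text dump of every live thread whose name contains nameFragment.
    // Pass "" to include all threads.
    static String dumpThreads(String nameFragment) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only ask for lock details that this JVM says it can provide.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadName().contains(nameFragment)) {
                // ThreadInfo.toString() prints the thread state, lock details
                // and only the top few stack frames.
                sb.append(info);
            }
        }
        return sb.toString();
    }

    // Returns the ids of threads the JVM reports as deadlocked (empty if none).
    static long[] deadlockedThreadIds() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadIds().length);
        // "Cluster" is the thread name taken from the error message in this thread.
        System.out.print(dumpThreads("Cluster"));
    }
}
```

Running this inside the affected JVM (or taking several jstack dumps a few seconds apart) should show whether the named thread is blocked on a lock, stuck in the same stack frame, or simply not getting scheduled.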
Thanks for your response; the information you provided is quite helpful. One more question: how can we change the default behavior so that Coherence logs thread dumps when there is an issue?
There are two ways to change the service guardian so that it logs a thread dump.
The first is to use the -Dtangosol.coherence.guard.timeout=0 property. Although this works, it is deprecated.
The second, non-deprecated way is to use the <service-failure-policy> setting in the cache schemes in your cache configuration file, documented here: http://docs.oracle.com/cd/E24290_01/coh.371/e22837/appendix_cacheconfig.htm#BABDCBED
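For illustration only, a cache scheme using that setting might look roughly like the fragment below. This is a sketch, not a tested configuration: the scheme name `example-distributed` and the surrounding elements are placeholders, and it assumes `logging` is the policy value that makes the guardian log instead of terminating the service. Check the linked reference for the values and element order your Coherence version accepts.

```xml
<distributed-scheme>
  <!-- placeholder names, replace with your own scheme and service -->
  <scheme-name>example-distributed</scheme-name>
  <service-name>DistributedCache</service-name>
  <!-- assumed value: log the problem rather than stop the service -->
  <service-failure-policy>logging</service-failure-policy>
  <backing-map-scheme>
    <local-scheme/>
  </backing-map-scheme>
  <autostart>true</autostart>
</distributed-scheme>
```

Note that a setting placed in a cache scheme applies to the service that scheme defines, so it may not cover a thread such as Cluster that does not belong to a cache service.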
I have the same problem. Searching OTN I found this thread: "cluster is falling apart because of long GCs, what is the proper parameter?"
I've configured the packet delivery timeout and the service guardian timeout and am going to deploy them on the server. (I'm not quite sure whether leaving the timeouts unspecified is the problem.)
But there are still some parts unclear to me:
This happens to me just once a week, always after the weekend, in the early working hours. (Does it really have to do with the server being idle over the weekend? Why just once a week, and exactly on the first working day? If the GC is really running slowly, why does it happen at this time?)
And the last question: does the service guardian monitor only threads belonging to Coherence, or does it also monitor other threads for possible deadlocks?