4 Replies Latest reply: Sep 4, 2013 2:06 PM by Linked AC

    Oracle Coherence GE service guardian


      We are running a WebLogic Server version that is integrated with Oracle Coherence, on a 4-member cluster that hosts reports. We keep running into an issue where one of the members goes down. In the logs we pick up the message below, which
      I honestly don't understand. Could you kindly advise what might cause it? This is now occurring repeatedly in our production environment, and the only way to recover is to restart the affected member.

      Oracle Coherence GE <D5> (thread=Cluster, member=3): Service guardian is 16691ms late, indicating that this JVM may be running slowly or experienced a long GC

      Please also let us know if you require any additional information.

      Thank you in this regard.
        • 1. Re: Oracle Coherence GE service guardian

          The message means that the service guardian inside Coherence has detected that one of the threads it monitors has been processing the same task for too long without sending a heartbeat to the guardian. A common cause is a long GC pause in the node concerned. There are many other reasons why a thread might take a long time to run, e.g. CPU usage is so high that the process is not getting much CPU time, thread contention for a resource, or thread deadlocks. The most serious is obviously a deadlock, as this can cause the whole node, and occasionally the whole cluster, to come to a stop. By default Coherence tries to interrupt the thread concerned and ultimately stops the service, which can of course kill the node. You can change the default behavior so that Coherence just logs a thread dump when there is an issue; this will not fix your problem, but it would stop Coherence killing the service.

          To track down the cause you need to look at the GC logs to see whether GC is the problem, and at thread dumps to see why the thread is taking too long. The name of the thread to look at is in the error message, in this case thread=Cluster.

          • 2. Re: Oracle Coherence GE service guardian
            Hi JK,

            Thanks for your response; the information you provided is quite helpful. One more thing I'd like to ask: how can we change the default behavior to log thread dumps when there is an issue?

            • 3. Re: Oracle Coherence GE service guardian

              There are two ways to change the service guardian to log a thread dump.

              The first is to use the -Dtangosol.coherence.guard.timeout=0 property; although this works, it is deprecated.

              The second, non-deprecated way is to use the <service-failure-policy> setting in the cache schemes in your cache configuration file, documented here: http://docs.oracle.com/cd/E24290_01/coh.371/e22837/appendix_cacheconfig.htm#BABDCBED
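
              As a rough illustration of the second approach, a scheme in the cache configuration file might look something like the fragment below. This is only a sketch: the scheme name and service name are made-up placeholders, the rest of the scheme (backing map and so on) is left out, and the exact element order should be checked against the documentation linked above for your Coherence version.

              ```xml
              <!-- Sketch only: "example-distributed" and "ExampleDistributedCache"
                   are hypothetical names, not taken from the original post. -->
              <distributed-scheme>
                <scheme-name>example-distributed</scheme-name>
                <service-name>ExampleDistributedCache</service-name>
                <!-- Log the problem instead of stopping the service.
                     The other built-in values are exit-cluster (the default)
                     and exit-process. -->
                <service-failure-policy>logging</service-failure-policy>
                <autostart>true</autostart>
              </distributed-scheme>
              ```

              With the logging policy the guardian still reports the stuck thread, so the underlying cause (GC pause, contention, deadlock) still needs to be investigated.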

              • 4. Re: Oracle Coherence GE service guardian
                Linked AC

                Hi Jonathan,


                I have the same problem. Searching OTN, I found this thread: "cluster is falling apart because of long GCs, what is the proper parameter?"

                I've configured the packet delivery timeout and the service guardian timeout and am going to deploy them to the server. (I'm not sure whether leaving the timeouts unspecified is the problem.)

                But some parts are still unclear to me:

                This happens to me just once a week, always after the weekend, in the early working hours. (Does it really have to do with the server being idle over the weekend? Why just once a week, and exactly on the first working day? If the GC really is running slowly, why is it happening at this time?)

                And the last question: does the service guardian monitor only threads belonging to Coherence, or does it also monitor other threads for possible deadlocks?