9 Replies Latest reply: Apr 27, 2011 2:02 AM by robvarga

    cluster is falling apart because of long GCs, what is the proper parameter?

    749938
      Once a week we have a really bad GC (70 seconds) and it kills the cluster - there are some 65000ms timeout defaults around, right?

      My question is: what should I adjust so that a 70-second GC is not a problem?
      I have already adjusted cluster-config>packet-publisher>timeout-milliseconds and cluster-config>service-guardian>timeout-milliseconds, but it did not help. What am I doing wrong?

      <coherence>
      <cluster-config>
           <packet-publisher>
                <timeout-milliseconds>900000</timeout-milliseconds>
           </packet-publisher>
           <service-guardian>
                <timeout-milliseconds>900000</timeout-milliseconds>
           </service-guardian>
      ....
      and the logs:
      : 9291339K->9113607K(12199552K), 77.2953050 secs] 9387245K->9113607K(12544576K), [CMS Perm : 2091432K->167774K(2097152K)], 77.2955050 secs] [Times: user=75.27 sys=2.02, real=77.30 secs]

      2010-07-13 08:57:37.857/272401.824 Oracle Coherence GE 3.5.2/463 <D5> (thread=Cluster, member=1): TcpRing: disconnected from member 2 due to a disconnect request
      2010-07-13 08:57:37.858/272401.825 Oracle Coherence GE 3.5.2/463 <D5> (thread=Cluster, member=1): TcpRing: disconnected from member 3 due to a disconnect request
      2010-07-13 08:57:37.858/272401.825 Oracle Coherence GE 3.5.2/463 <D5> (thread=Cluster, member=1): Service guardian is 77109ms late, indicating that this JVM may be running slowly or experienced a long GC
      ....
      2010-07-13 08:56:51.253/272173.816 Oracle Coherence GE 3.5.2/463 <Warning> (thread=PacketPublisher, member=3): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=1, Timestamp=2010-07-10 05:17:37.4, Address=10.130.102.22:7088, MachineId=3667, Location=process:15703)
      by MemberSet(Size=1, BitSetCount=2
      Member(Id=2, Timestamp=2010-07-10 05:20:33.702, Address=10.130.102.17:7088, MachineId=60945, Location=process:8552, Role=ScannerImpl)
      )
        • 1. Re: cluster is falling apart because of long GCs, what is the proper parameter?
          725968
          I would like to bet you a beer that you are using JDK1.5. If so, upgrade to JDK1.6 and your problems will magically go away. A.
          • 2. Re: cluster is falling apart because of long GCs, what is the proper parameter?
            749938
            Using: /usr/java/latest//bin/java -server -showversion -XX:+JavaMonitorsInStackTrace -XX:+PrintCompilation -verbose:gc -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParNewGC -Xms64m -Xmx1024m -Dsun.lang.ClassLoader.allowArraySyntax=true -Dtangosol.coherence.localport=7088 -Dtangosol.coherence.distributed.backupcount=100 -Dtangosol.coherence.guard.timeout=900000 -d64 -Xms12g -Xmx12g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseMembar -Djmx.html.port=7027 -Dmars.address=:9557 -Dscanner.coherence.main=true -Dtangosol.coherence.localhost=10.130.102.22 -XX:+HeapDumpOnOutOfMemoryError -XX:PermSize=256m -XX:MaxPermSize=2048m -Dtangosol.coherence.guard.timeout=600000 impl.Scanner t3://:7001

            java version "1.6.0_20"
            Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
            Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

            Unfortunately, we are entirely on 1.6.
            Anyway, what is wrong with 1.5? The length of the GC is not something I am trying to address right now; it is Coherence's reaction to this freeze that I want to control better.
            • 3. Re: cluster is falling apart because of long GCs, what is the proper parameter?
              Mfalco-Oracle
              Hi Andrey,

              You are setting the correct parameters, though you should set the values to at least twice your maximum expected GC. I would also warn against setting the values so high in Coherence versions prior to 3.6, as it means that some types of node death will take up to that long to detect. Coherence 3.6, which has just been released, includes significant improvements in this area that safely allow quite high packet timeouts; in fact, 5 minutes is the default.
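
              As a rough illustration (just a sketch, not tuned for your environment): with a worst-case 70-second GC, twice that is 140 seconds, so both timeouts would need to be at least on the order of 140000-150000ms, for example via the guard timeout override you already pass on the command line:

                  -Dtangosol.coherence.guard.timeout=150000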

              A few other things worth noting: you are running a very unusual configuration, with a backup count of 100. It is exceedingly rare to run with a backup count of more than 1. Since backups are stored off-machine by default, the fault-tolerance benefit of increasing the backup count is very slim, and setting it to such a high value really shouldn't be necessary. If the data is read-mostly, you might be better off using a replicated cache topology instead. Your current values are likely to result in very poor write performance.

              The other thing worth mentioning is that if you allow such high GC pause times on cache servers (storage-enabled nodes), overall cache performance may suffer greatly. For instance, if the cache server that owns partition X is in a one-minute GC, and requests are evenly distributed across all partitions, you should find that all clients pretty quickly end up blocked waiting for a response from that member. A good first step towards minimizing your GC pauses would be to allocate those 12GB of RAM to multiple JVMs rather than to a single JVM. With 1.6 JVMs, going up to 4GB per JVM is fairly safe, and I'd bet (Andrew's beer) that three such JVMs will outperform a single 12GB JVM, both in terms of GC times and in general.
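
              As a sketch, not an exact recipe (it reuses the flags already on your command line; each of the three JVMs on the box would need its own unicast port - e.g. 7088, 7090, 7092, the latter two being made-up values here), each cache server could be started roughly like this:

                  /usr/java/latest/bin/java -server -d64 -Xms4g -Xmx4g \
                      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
                      -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseMembar \
                      -XX:+HeapDumpOnOutOfMemoryError \
                      -Dtangosol.coherence.localhost=10.130.102.22 \
                      -Dtangosol.coherence.localport=7088 \
                      impl.Scanner t3://:7001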

              One final note: I see that you specified Xms/Xmx and other JVM settings multiple times and with different values. I'm going to assume that the last value (12GB) is what the JVM chose, as that would explain the long GCs.

              thanks,

              Mark

              Oracle Coherence
              • 4. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                749938
                But if I am adjusting the correct settings, the values being set are 600 000 and 900 000 (with FIVE zeros), which is roughly 10 times more than the "Service guardian is 77109ms late" - about 80 000, with FOUR zeros :) Still, this does not help? Why?

                Great hint, I will upgrade to 3.6 right away!

                Right now I only use Coherence for remote task execution, which is probably not the typical use case - so, in fact, there is exactly zero data in the cache, i.e. nothing actually gets backed up, etc.
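
                For context, the pattern is roughly the standard Invocable one (a simplified sketch, not our actual code; it assumes an invocation service named "InvocationService" is defined in the configuration):

                    import com.tangosol.net.AbstractInvocable;
                    import com.tangosol.net.CacheFactory;
                    import com.tangosol.net.InvocationService;
                    import java.util.Map;

                    // Hypothetical task: each member simply reports its current time.
                    public class PingTask extends AbstractInvocable {
                        public void run() {
                            setResult(Long.valueOf(System.currentTimeMillis()));
                        }

                        // Submit the task to all members running the invocation service.
                        public static void main(String[] args) {
                            InvocationService svc = (InvocationService)
                                    CacheFactory.getService("InvocationService");
                            Map results = svc.query(new PingTask(), null); // null = all members
                            System.out.println(results);
                            CacheFactory.shutdown();
                        }
                    }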

                Edited by: Andrey Belomutskiy on Jul 15, 2010 10:04 AM
                • 5. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                  Mfalco-Oracle
                  Hi Andrey,

                  Sorry, my bad, I totally missed that fifth zero ;) Now that you've corrected me, I can see what appears to be the real issue. Your configuration for the packet-publisher's timeout-milliseconds is missing an intermediate element: the timeout-milliseconds element should be nested within a packet-delivery element, i.e.:
                  <coherence>
                      <cluster-config>
                          <packet-publisher>
                              <packet-delivery>
                                  <timeout-milliseconds>900000</timeout-milliseconds>
                              </packet-delivery>
                          </packet-publisher>
                  
                          <service-guardian>
                              <timeout-milliseconds>900000</timeout-milliseconds>
                          </service-guardian>
                  
                  ...
                      </cluster-config>
                  </coherence>
                  The configuration you posted would continue to use the default setting for the packet timeout, which explains what you are seeing. While the guardian had been configured properly, it will still warn if it encounters long (multi-second) delays, though it will not take action unless the configured timeout is crossed.

                  thanks,

                  Mark
                  Oracle Coherence
                  • 6. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                    749938
                    Mark, thanks a lot! This explains a lot :)

                    May I suggest you adjust the configuration processing to emit a warning/error message when an invalid configuration is used? Anyway, thanks for your help once again!
                    • 7. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                      Mfalco-Oracle
                      Hi Andrey,

                      I completely agree, and we do intend to address this in the future. For now, the best practice is to use an editor which will validate your configuration against the DTD, but I absolutely agree that Coherence needs to help out here as well.
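
                      For example (a sketch, assuming the 3.x operational descriptor DTD, coherence.dtd, is extracted from coherence.jar and placed next to your override file), declaring the DTD lets a validating editor flag misplaced elements:

                          <?xml version="1.0"?>
                          <!DOCTYPE coherence SYSTEM "coherence.dtd">
                          <coherence>
                              <cluster-config>
                                  ...
                              </cluster-config>
                          </coherence>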

                      thanks,

                      Mark
                      Oracle Coherence
                      • 8. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                        user13323132
                        It's an interesting issue indeed.

                        Although I cannot offer an explanation, I can report the same behaviour for 3.6.1.0, 3.6.1.1 and 3.6.1.2. I have performed a similar test on a two-server cluster with a variable but small number of storage nodes (5-10) deployed across the 2 servers.

                        If I had to guess, partition distribution for a small asymmetrical cluster gets stuck because different nodes request partitions from each other at the same time.

                        So it looks like you need to restore the same (symmetric?) cluster to get the MACHINE-SAFE HA status back.

                        Cheers,
                        Alexey
                        • 9. Re: cluster is falling apart because of long GCs, what is the proper parameter?
                          robvarga
                          user13323132 wrote:
                          It's an interesting issue indeed.

                          Although I cannot offer an explanation, I can report the same behaviour for 3.6.1.0, 3.6.1.1 and 3.6.1.2. I have performed a similar test on a two-server cluster with a variable but small number of storage nodes (5-10) deployed across the 2 servers.

                          If I had to guess, partition distribution for a small asymmetrical cluster gets stuck because different nodes request partitions from each other at the same time.

                          So it looks like you need to restore the same (symmetric?) cluster to get the MACHINE-SAFE HA status back.

                          Cheers,
                          Alexey
                          Hi Alexey,

                          You should not really use a cluster with two physical boxes only.

                          One reason is that asymmetrical clusters (a different number of nodes on the different boxes) cannot be made machine-safe, as you don't have enough nodes on one machine to hold all the backups of the partitions for the larger number of master copies of partitions on the other machine.

                          The other is that the Coherence error detection and recovery protocols need at least 3 (or, even better, 4) nodes to work best. Imagine the scenario where the two boxes are not able to communicate with each other: neither box can tell whether the other box is dead or only the communication between them is affected.

                          If you had a third box, then Coherence would have a chance to detect that it is a communication failure between the two boxes and not the death of the other box.

                          Also, if you have only 2 boxes, then after losing one box the cluster cannot get back to a highly available state anymore.


                          Best regards,

                          Robert