0 Replies Latest reply on Apr 22, 2010 5:25 AM by 843829

    How to diagnose and eliminate frequent Full GCs using G1GC

    843829
      For years, we've been waiting to be able to use the G1 garbage collector, and in 1.6u20, it looks like the stability issues we've been running into have been addressed. We've been pounding it pretty hard for days and haven't had any crashes.

      Unfortunately, we're hitting STW Full GCs pretty quickly with one very important customer's load. Under moderate load it happens within an hour, and with light load, it takes about five hours. (With this same load we get promotion failures with CMS, so we're very keen on getting G1 to work). I've tried the G1 options that I could find documented online, and I downloaded the source and ran across promising things like -XX:G1SteadyStateUsed, -XX:G1SteadyStateUsedDelta, and -XX:G1MinReservePercent. Unfortunately, these do not make the problem go away.

Here are the key points of our application (the same as in the question I just posted about CMS):

      - We need to maintain about 4GB of a changing set of long-lived data. (It's the cache for a Berkeley DB b-tree database, so the bulk of memory is byte[][] and byte[]. None of these are that large, maybe 5KB at the largest).

- There is a lot of churn in these live objects, and it's fairly common for old objects to reference new objects. (This is a cache of a database that is much too large to fit in memory, so our application is constantly pulling different nodes into the b-tree.)

- We don't have any really large long-lived objects.

      - Pauses of 100-200ms are okay. The occasional pause of ~1 second is tolerable, but a full GC takes about a minute, and it kills us.

- We can allocate a fairly large heap if we need to, even though we don't need much of it for the cache. We've been testing with 20GB, but going up to 40GB is doable.

      We're running 1.6u20 with the following JVM options:

      -XX:+AlwaysPreTouch -XX:G1MinReservePercent=40 -XX:+G1ParallelRSetScanningEnabled -XX:+G1ParallelRSetUpdatingEnabled -XX:InitialHeapSize=21474836480 -XX:MaxHeapSize=21474836480 -XX:+PrintAdaptiveSizePolicy -XX:-ReduceInitialCardMarks -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC

And these for diagnostic purposes:

      -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
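For completeness, the full launch command ends up looking roughly like this (com.example.Main is just a stand-in for our actual entry point):

```shell
# Hypothetical launch line combining the flags listed above;
# com.example.Main is a placeholder for our application's main class.
java \
  -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions \
  -XX:+UseG1GC \
  -XX:InitialHeapSize=21474836480 -XX:MaxHeapSize=21474836480 \
  -XX:+AlwaysPreTouch \
  -XX:G1MinReservePercent=40 \
  -XX:+G1ParallelRSetScanningEnabled -XX:+G1ParallelRSetUpdatingEnabled \
  -XX:-ReduceInitialCardMarks \
  -XX:+PrintAdaptiveSizePolicy \
  -XX:+PrintCommandLineFlags -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution \
  com.example.Main
```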

      When we put the application under load, we can watch the amount of used heap gradually creep up until we hit a Full GC, and the heap drops back down to ~40% utilized. This process repeats itself every 20 minutes after that.

      We have a fixed size heap of 20GB (20480M), so we just 'tail -f | grep' for lines that look like the following to watch the heap grow:

      [ 8360M->8149M(20480M)]

What we see is that the number on the left (the heap used before the GC) gradually creeps up to ~12288M, which is 60% of 20480M and follows from having -XX:G1MinReservePercent=40. It hovers around this value for a long time, the GCs increase in frequency, and the number in the middle (the heap used after the GC) continues to grow. So G1GC is definitely trying to respect the -XX:G1MinReservePercent=40 setting, but for whatever reason the heap keeps growing, and eventually we hit a Full GC. Here's the snippet of the log from when we hit the first Full GC (I've got the whole log if that helps).
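To make that trend easier to eyeball, something like the following pipeline pulls the before/after figures out of each line. The echo stands in for 'tail -f' against the real log, and the sample log line is made up to match the "[ 8360M->8149M(20480M)]" format above:

```shell
# Extract "before after" heap usage (in MB) from each G1 collection line.
# The echoed sample line is fabricated; feed the real GC log in practice.
echo '1591.143: [GC pause (young) 8360M->8149M(20480M), 0.1500 secs]' \
  | grep -Eo '[0-9]+M->[0-9]+M\([0-9]+M\)' \
  | sed 's/M->/ /; s/M(.*//'
# prints: 8360 8149
```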

      1591.143:
      Desired survivor size 331350016 bytes, new threshold 15 (max 15)
      - age 1: 4830848 bytes, 4830848 total
      [GC concurrent-mark-start]
      [Times: user=1.89 sys=0.01, real=0.15 secs]
      Heap after GC invocations=511 (full 2):
      garbage-first heap   total 20971520K, used 16024991K [0xfffffd7af3800000, 0xfffffd7ff3800000, 0xfffffd7ff3800000)
        region size 8192K, 1 young (8192K), 1 survivors (8192K)
      compacting perm gen  total 40960K, used 38518K [0xfffffd7ff3800000, 0xfffffd7ff6000000, 0xfffffd7ff8c00000)
         the space 40960K,  94% used [0xfffffd7ff3800000, 0xfffffd7ff5d9db98, 0xfffffd7ff5d9dc00, 0xfffffd7ff6000000)
      No shared spaces configured.
      }
      1598.456: [GC concurrent-mark-end, 7.3131780 sec]
      1598.457: [GC remark, 0.0111664 secs]
      [Times: user=0.01 sys=0.00, real=0.01 secs]
      1598.468: [GC concurrent-count-start]
      {Heap before GC invocations=511 (full 2):
      garbage-first heap   total 20971520K, used 20963999K [0xfffffd7af3800000, 0xfffffd7ff3800000, 0xfffffd7ff3800000)
        region size 8192K, 604 young (4947968K), 1 survivors (8192K)
      compacting perm gen  total 40960K, used 38518K [0xfffffd7ff3800000, 0xfffffd7ff6000000, 0xfffffd7ff8c00000)
         the space 40960K,  94% used [0xfffffd7ff3800000, 0xfffffd7ff5d9db98, 0xfffffd7ff5d9dc00, 0xfffffd7ff6000000)
      No shared spaces configured.
      1600.963: [Full GC 20472M->6580M(20480M)
      Desired survivor size 327155712 bytes, new threshold 15 (max 15)
      - age   1:    4830848 bytes,    4830848 total
      , 54.7080965 secs]
      [Times: user=64.96 sys=0.02, real=54.71 secs]
      Heap after GC invocations=512 (full 3):
      garbage-first heap   total 20971520K, used 6738882K [0xfffffd7af3800000, 0xfffffd7ff3800000, 0xfffffd7ff3800000)
        region size 8192K, 0 young (0K), 0 survivors (0K)
      compacting perm gen  total 40960K, used 38518K [0xfffffd7ff3800000, 0xfffffd7ff6000000, 0xfffffd7ff8c00000)
         the space 40960K,  94% used [0xfffffd7ff3800000, 0xfffffd7ff5d9db98, 0xfffffd7ff5d9dc00, 0xfffffd7ff6000000)
      No shared spaces configured.
      }

A few questions:

      1. The obvious one first. Does anyone know what could be causing this, and how to tune around it?

      2. What else can I try to track down what is causing this?

3. Most of the knobs that I found in the G1 source code are marked as "develop", and the JVM complains when I try to use them. Is there an equivalent of -XX:+UnlockDiagnosticVMOptions that unlocks these options?

I realize this is an experimental feature in 1.6, so I hate to bug you guys too much with it. But G1GC could solve a real problem for us. Any ideas about how to address this, or suggestions about how to debug the problem on my own, are greatly appreciated.

      Thanks.