0 Replies Latest reply on Apr 22, 2010 5:22 AM by 843829

    CMS tuning guidance for delaying a "promotion failure"

      We've gotten pretty good at tuning CMS for our application, but with one customer's dataset we're hitting a promotion failure (presumably due to fragmentation) within a few hours of the load starting, and we need advice on how to delay this as long as possible. I've been reading what I can about similar issues, but so far I haven't had much success mitigating the problem.

      Here are the key points of our application:

      - We need to maintain about 4GB of a changing set of long-lived data. (It's the cache for a Berkeley DB b-tree database, so the bulk of memory is byte[][] and byte[]. None of these are that large, maybe 5K at the largest).

      - There is a lot of churn in these live objects, and it's fairly common for old objects to reference new objects. (This is a cache of a database that is much too large to fit in memory, so our application will constantly be pulling different nodes into the b-tree database).

      - We don't have any really large long lived objects.

      - Pauses of 100-200ms are okay. The occasional pause of ~1 second is tolerable, but a full GC takes about a minute, and it kills us.

      - We can allocate a fairly large heap if we need to even though we don't need much of it for the cache. We've been testing with 20GB, but going up to 40GB is doable.

      We're running 1.6u20 with the following JVM options:

      -d64 -server -Xmx20g -Xms20g -XX:MaxNewSize=1g -XX:NewSize=1g -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=10000 -XX:CMSInitiatingOccupancyFraction=30 -XX:+UseParNewGC -XX:+UseMembar -XX:+UseBiasedLocking -XX:+UseLargePages -XX:PermSize=64M -XX:+HeapDumpOnOutOfMemoryError -XX:+AlwaysPreTouch

      along with these for diagnostic purposes:

      -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintCommandLineFlags -XX:+PrintTenuringDistribution -XX:PrintFLSStatistics=1
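
      For reference, here is a back-of-the-envelope sketch of what these sizes imply. It assumes the old generation is simply the heap minus the new generation (20GB - 1GB) and that CMSInitiatingOccupancyFraction is applied as a percentage of old-generation occupancy. The class and method names are made up for illustration.

```java
public class CmsThreshold {
    public static final long GB = 1024L * 1024 * 1024;

    // Assumption: old generation = total heap minus the (fixed) new generation.
    public static long oldGenBytes(long heapBytes, long newGenBytes) {
        return heapBytes - newGenBytes;
    }

    // Assumption: the initiating fraction is a percentage of old-gen capacity.
    public static long initiatingBytes(long oldGenBytes, int fractionPercent) {
        return oldGenBytes * fractionPercent / 100;
    }

    public static void main(String[] args) {
        long oldGen = oldGenBytes(20 * GB, 1 * GB);   // -Xmx20g, -XX:NewSize=1g
        long trigger = initiatingBytes(oldGen, 30);   // -XX:CMSInitiatingOccupancyFraction=30
        System.out.printf("old gen: %.1f GB, CMS cycle starts near: %.1f GB%n",
                oldGen / (double) GB, trigger / (double) GB);
    }
}
```

      Under those assumptions a CMS cycle would start at roughly 5.7GB of old-generation occupancy, which is not far above the roughly 4GB of long-lived data.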

      When the application is under load, we can watch the "Max Chunk Size" value provided by -XX:PrintFLSStatistics=1 and see that it starts off at 2GB and slowly marches down to 3K, and then there is a "promotion failure" followed by a full GC. Does anyone have advice on which JVM options to tune to try and delay this promotion failure? I've tried ones that sound promising, like -XX:FLSCoalescePolicy=4, without much luck.
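
      In case it helps anyone reproduce this, below is a minimal sketch of one way the "Max Chunk Size" trend could be pulled out of a GC log. It assumes the statistics appear on lines shaped like "Max   Chunk Size: 382064598" (the exact spacing may differ between JVM builds, so the pattern tolerates any whitespace). The class name, method name, and log-file argument are hypothetical, and the values are printed in whatever unit the JVM reports, with no conversion.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlsMaxChunk {
    // Assumed line shape: "Max   Chunk Size: <number>"; whitespace is matched loosely.
    private static final Pattern MAX_CHUNK =
            Pattern.compile("Max\\s+Chunk\\s+Size:\\s*(-?\\d+)");

    // Returns every "Max Chunk Size" value found, in log order.
    public static List<Long> maxChunkSizes(List<String> logLines) {
        List<Long> sizes = new ArrayList<Long>();
        for (String line : logLines) {
            Matcher m = MAX_CHUNK.matcher(line);
            if (m.find()) {
                sizes.add(Long.parseLong(m.group(1)));
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a GC log written with -XX:PrintFLSStatistics=1.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (long size : maxChunkSizes(lines)) {
            System.out.println(size);
        }
    }
}
```

      Plotting the printed values over time should show the same steady decline described above, and makes it easier to compare how quickly different option sets reach the failure point.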

      By the way, we've tried G1GC in 1.6 update 20, but we're hitting full GCs there as well. I will post a separate message for that.