This discussion is archived
7 Replies Latest reply: Mar 18, 2009 7:36 AM by 843829 RSS

64 Bit JVM hanging after long periods of idle time.

843829 Newbie
Currently Being Moderated
OS: Windows 2003 enterprise R2 64 bit edition
Hardware: 8 proccessor (2 quad core Intel Xeon Processors 5400 ) with 16 Gigs of ram
JVM: 1.6.0_04

Recently we went into production, and found that our servers were hanging. No errors are ever thrown but, while we have logging, there is logging of STDERR.

Our Application contains large collections of data (approx 3.2 gigs in memory) and updates these daily. The update is wholesale, the new collections are loaded and replace the old ones.
The machines were not running out of memory, simply hanging. The service which runs them kills them and restarts them automatically.

The interesting thing here is that this was never detected during our testing. All of or testing was done loaded fielding 600+ requests a second. And when in production they only failed after long periods of idle. So in the morning when people started coming in, and making requests, thats when they hung. They would never hang during a busy period, only at the start after being idle for a long time. Every day they are idle for 12+ hours at night. I currently only have 5 instances of this failure, but some patterns seem to be emerging: It happened after 2-3 days. None of the server made it to day 4, ever. The collections are updated once a day, and they never failed to load.

This hanging behavior was reproduced on identical hardware.

We found by removing our JVM GC options (listed below) completely, we can advoid this problem (we got to 6 days of uptime).

here are our JVM settings that exhibit the problem:
-XX:MaxTenuringThreshold=0
-XX:SurvivorRatio=20000
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+ParallelRefProcEnabled
-XX:+CMSPermGenSweepingEnabled
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:ParallelGCThreads=4
-XX:NewSize=128M
-XX:MaxNewSize=128M
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:CMSIncrementalDutyCycleMin=0
-XX:CMSIncrementalDutyCycle=10
-XX:CMSMarkStackSize=8M
-XX:CMSMarkStackSizeMax=32M
-XX:+UseLargePages
-XX:+DisableExplicitGC

I am not a master of GC tuning and found these somewhere else, from someone with a simular memory requirement.

I have limited hardware (2 boxes) and with a turn around time of 5 days before i'm satisfied that it won't hang.

I am trying to advoid a Full GC. because with 12gigs of our data in the heap it takes approx 30-40 seconds of gc and they server will be paused.

currently I have 4 days uptime on one with no JVM GC options
and 2 days on the other currently running the flags (the last 3 are for
debugging and wouldn't be in production):
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:ParallelGCThreads=4
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps


Anybody have any ideas?
  • 1. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    Hardware: 8 proccessor (2 quad core Intel Xeon Processors 5400 ) with 16 Gigs of ram
    JVM: 1.6.0_04

    Recently we went into production, and found that our servers were hanging. No errors are ever thrown but, while we have logging, there is logging of STDERR.
    Aside:-
    -XX:+CMSPermGenSweepingEnabled
    Use: -XX:+CMSClassUnloadingEnabled instead.
    currently I have 4 days uptime on one with no JVM GC options
    and 2 days on the other currently running the flags (the last 3 are for
    debugging and wouldn't be in production):
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:ParallelGCThreads=4
    -XX:+UseCMSCompactAtFullCollection
    -XX:CMSFullGCsBeforeCompaction=0
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    If and when a hang ensues again, could you check using appropriate tools on Windows
    whether there is any activity in the target process (on Solaris you would use prstat -L -p <pid>
    and perhaps pstack <pid> for such investigation -- i.e. whether this is a livelock or
    a deadlock, and in either case obtain a stack retrace of the threads in the process.

    Could it be that this is not actually hang, but rather a very slow process perhaps as a result
    of paging? Do you see evidence of paging at the time at which the process appears slow
    or hung? (On Solaris you would use vmstat to obtain such data.) Of course, this "paging
    theory" would not be consistent with your observation that without any flags such a hang
    is not observed, since I would imagine that if something is causing the heap to get paged out
    during idle perods, it should not matter which GC you are using, the slow responses would
    be identical independent of the GC options used.

    Could you post the GC logs -XX:+PrintGCTimeStanps -XX:+PrintGCDetails prior to and
    during the observed hang to see if there are any clues to be had from the logs?

    thanks!
  • 2. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    Yeah, I am attempting to capture them now, I should have them by monday.
  • 3. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    As far as paging it might be the case, but the server has 12gig allocated to it, on a 16g system. The heap stays between 4 and 8 gig. I thought the use of +useLargePages would prevent the OS from paging anything.                                                                                                                                                                                                                                                                                                                                                                                                                                           
  • 4. Re: 64 Bit JVM hanging after long periods of idle time.
    456795 Newbie
    Currently Being Moderated
    Make sure you've enable large pages in the OS.

    Be careful, you can't reserve pages in Microsoft Windows. From http://developer.amd.com/TechnicalArticles/Articles/Pages/322006145.aspx "Note that on Windows, you can't reserve pages the way you can with Linux. So, for this reason, the best policy is to start your JVM as soon as you can after rebooting"
  • 5. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    We've experienced similar problems as well. Here is the last part of the garbage logs. We ended up killing the VM and restarting.
    Heap after GC invocations=1190 (full 120):
     par new generation   total 1225728K, used 136192K [0x00002aaaaeaa0000, 0x00002aab01ca0000, 0x00002aab01ca0000)
      eden space 1089536K,   0% used [0x00002aaaaeaa0000, 0x00002aaaaeaa0000, 0x00002aaaf12a0000)
      from space 136192K, 100% used [0x00002aaaf12a0000, 0x00002aaaf97a0000, 0x00002aaaf97a0000)
      to   space 136192K,   0% used [0x00002aaaf97a0000, 0x00002aaaf97a0000, 0x00002aab01ca0000)
     concurrent mark-sweep generation total 4804608K, used 4496788K [0x00002aab01ca0000, 0x00002aac270a0000, 0x00002aac270a0
    000)
     concurrent-mark-sweep perm gen total 108544K, used 76582K [0x00002aac270a0000, 0x00002aac2daa0000, 0x00002aac2daa0000)
    }
    2008-08-21T04:30:35.138-0700: 158575.323: [GC [1 CMS-initial-mark: 4496788K(4804608K)] 4643774K(6030336K), 0.1385320 sec
    s] [Times: user=0.14 sys=0.00, real=0.13 secs]
    2008-08-21T04:30:35.277-0700: 158575.462: [CMS-concurrent-mark-start]
    2008-08-21T04:30:43.081-0700: 158583.266: [CMS-concurrent-mark: 7.804/7.804 secs] [Times: user=8.32 sys=0.11, real=7.81
    secs]
    2008-08-21T04:30:43.081-0700: 158583.266: [CMS-concurrent-preclean-start]
    2008-08-21T04:30:43.098-0700: 158583.284: [CMS-concurrent-preclean: 0.017/0.017 secs] [Times: user=0.02 sys=0.00, real=0
    .02 secs]
    2008-08-21T04:30:43.098-0700: 158583.284: [CMS-concurrent-abortable-preclean-start]
     CMS: abort preclean due to time 2008-08-21T04:30:48.174-0700: 158588.359: [CMS-concurrent-abortable-preclean: 0.458/5.0
    76 secs] [Times: user=3.65 sys=0.00, real=5.07 secs]
    2008-08-21T04:30:48.177-0700: 158588.362: [GC[YG occupancy: 364276 K (1225728 K)]158588.362: [Rescan (parallel) , 0.2898
    800 secs]158588.652: [weak refs processing, 0.0284260 secs] [1 CMS-remark: 4496788K(4804608K)] 4861065K(6030336K), 0.318
    5690 secs] [Times: user=0.60 sys=0.00, real=0.32 secs]
    2008-08-21T04:30:48.496-0700: 158588.681: [CMS-concurrent-sweep-start]
    2008-08-21T04:30:52.691-0700: 158592.876: [CMS-concurrent-sweep: 4.195/4.195 secs] [Times: user=4.21 sys=0.01, real=4.20
     secs]
    2008-08-21T04:30:52.691-0700: 158592.876: [CMS-concurrent-reset-start]
    2008-08-21T04:30:52.723-0700: 158592.909: [CMS-concurrent-reset: 0.033/0.033 secs] [Times: user=0.03 sys=0.01, real=0.03
     secs]
    2008-08-21T04:30:54.728-0700: 158594.913: [GC [1 CMS-initial-mark: 4490131K(4804608K)] 4865560K(6030336K), 0.3873970 sec
    s] [Times: user=0.39 sys=0.00, real=0.39 secs]
    2008-08-21T04:30:55.116-0700: 158595.301: [CMS-concurrent-mark-start]
    2008-08-21T04:31:00.898-0700: 158601.083: [CMS-concurrent-mark: 5.782/5.782 secs] [Times: user=6.26 sys=0.09, real=5.78
    secs]
    2008-08-21T04:31:00.898-0700: 158601.083: [CMS-concurrent-preclean-start]
    2008-08-21T04:31:00.916-0700: 158601.101: [CMS-concurrent-preclean: 0.018/0.018 secs] [Times: user=0.02 sys=0.00, real=0
    .02 secs]
    2008-08-21T04:31:00.916-0700: 158601.102: [CMS-concurrent-abortable-preclean-start]
     CMS: abort preclean due to time 2008-08-21T04:31:05.927-0700: 158606.112: [CMS-concurrent-abortable-preclean: 0.459/5.0
    11 secs] [Times: user=4.16 sys=0.08, real=5.01 secs]
    2008-08-21T04:31:05.929-0700: 158606.115: [GC[YG occupancy: 796822 K (1225728 K)]158606.115: [Rescan (parallel)
  • 6. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    It won't solve the problem, but i think that with 8 CPUs is better to set -XX:ParallelGCThreads=8
  • 7. Re: 64 Bit JVM hanging after long periods of idle time.
    843829 Newbie
    Currently Being Moderated
    We are seeing the same problem.
    java version "1.5.0_17"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_17-b04)
    Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_17-b04, mixed mode)

    /usr/lib/jvm/java/bin/java -Xms10g -Xmx10g -Xmn1g -XX:MaxPermSize=256m -XX:+UseConcMarkSweepGC -XX:+ParallelRefProcEnabled -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:-TraceClassUnloading -Xloggc:gc.log
    In the middle of what seems to be the WeakRef processing on a CMS rescan, the GC log just stops:

    72306.785: [GC[YG occupancy: 30187 K (1047616 K)]72306.785: [Rescan (parallel) , 0.0203340 secs]72306.805: [weak refs processing

    pstack shows:

    Thread 833 (Thread 1076111680 (LWP 26730)):#0  0x0000003b6bab8ab7 in sched_yield () from /lib64/libc.so.6
    #1  0x00002b8d1e719bd9 in os::yield ()
    #2  0x00002b8d1e7b17a8 in ParallelTaskTerminator::offer_termination ()
    #3  0x00002b8d1e480bb0 in CMSRefProcTask::do_work_steal ()
    #4  0x00002b8d1e48090a in CMSRefProcTask::work ()
    #5  0x00002b8d1e81ea7d in GangWorker::run ()
    #6  0x00002b8d1e71b5e5 in _start ()
    #7  0x0000003b6c206307 in start_thread () from /lib64/libpthread.so.0
    #8  0x0000003b6bad1ded in clone () from /lib64/libc.so.6
    Thread 832 (Thread 1077164352 (LWP 26731)):
    #0  0x0000003b6c20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    #1  0x00002b8d1e719b16 in os::sleep ()
    #2  0x00002b8d1e7b17e7 in ParallelTaskTerminator::offer_termination ()
    #3  0x00002b8d1e480bb0 in CMSRefProcTask::do_work_steal ()
    #4  0x00002b8d1e48090a in CMSRefProcTask::work ()
    #5  0x00002b8d1e81ea7d in GangWorker::run ()
    #6  0x00002b8d1e71b5e5 in _start ()
    #7  0x0000003b6c206307 in start_thread () from /lib64/libpthread.so.0
    #8  0x0000003b6bad1ded in clone () from /lib64/libc.so.6
    Thread 831 (Thread 1078217024 (LWP 26732)):
    #0  0x0000003b6c20a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    #1  0x00002b8d1e719b16 in os::sleep ()
    #2  0x00002b8d1e7b17e7 in ParallelTaskTerminator::offer_termination ()
    #3  0x00002b8d1e480bb0 in CMSRefProcTask::do_work_steal ()
    #4  0x00002b8d1e48090a in CMSRefProcTask::work ()
    #5  0x00002b8d1e81ea7d in GangWorker::run ()
    #6  0x00002b8d1e71b5e5 in _start ()
    #7  0x0000003b6c206307 in start_thread () from /lib64/libpthread.so.0
    #8  0x0000003b6bad1ded in clone () from /lib64/libc.so.6
    [...]
    #8 0x0000003b6bad1ded in clone () from /lib64/libc.so.6
    Thread 826 (Thread 1083480384 (LWP 26737)):
    #0 0x00002b8d1e485bd7 in CMSCollector::par_take_from_overflow_list ()
    #1 0x00002b8d1e480b8b in CMSRefProcTask::do_work_steal ()
    #2 0x00002b8d1e48090a in CMSRefProcTask::work ()
    #3 0x00002b8d1e81ea7d in GangWorker::run ()
    #4 0x00002b8d1e71b5e5 in _start ()
    #5 0x0000003b6c206307 in start_thread () from /lib64/libpthread.so.0
    #6 0x0000003b6bad1ded in clone () from /lib64/libc.so.6


    Looks like a problem with ParallelRefProcEnabled