This discussion is archived
3 Replies Latest reply: Aug 12, 2009 5:21 AM by 807557

Memory issues with Java RTS

807557 Newbie
Hello,

I am currently facing some serious memory issues with Java RTS: memory consumption is much higher than I expect (much, much higher than under Java SE 6). Of course, I have tried to balance and tune Java RTS so that the RTGC gets enough time to finish its work, as I will show later in this post. Still, the memory problem persists.

My application is a high-throughput application, ported from Java SE 6 to Java RTS 2.2 (see thread http://forums.sun.com/thread.jspa?threadID=5399293).
I am using a Sun T5220 machine (1 CPU, 8 cores x 8 HW threads = 64 virtual cores), running Solaris SunOS 5.10.

With Java SE 6, this application is capable of more than 10,000 TPS at an average latency of 5 ms, in 32-bit mode. The start parameters are:
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx1G -Xms1G
There are 26 worker threads (for request processing). The application uses 1 GB of heap. At 9,000 TPS the process uses 25% of the total CPU. There is a major collection about every 70 seconds, with minor collections about every 100 ms. The 1 GB is quite enough for this throughput.

Now to the same application running on Java RTS. The 26 worker threads are real-time threads with a very high priority (MaxPrio-1), using a processor set (1) containing 28 CPUs. DTrace has confirmed that there is no contention.
The issue I am facing is that memory indirectly becomes a bottleneck for throughput. It seems that much more memory is needed than with Java SE.

In the 32-bit case, here are my start parameters:
-Xms3600m -Xmx3600m -XX:RTSJBindRTTToProcessorSet=1
-XX:RTGCNormalWorkers=30 -XX:NormalMinFreeBytes=3600m -XX:RTGCWaitDuration=0 -XX:RTGCCriticalReservedBytes=800m
-XX:RTGCBoostedWorkers=30 -XX:RTGCCriticalBoundary=59 -XX:RTGCBoostedPriority=59
-XX:+PreResolveConstantPools
If I keep a constant throughput of 2,100 TPS (about a fifth of the Java SE throughput), the application constantly needs about 1.8 GB of memory, almost double the memory that was available in the Java SE case at 9,000 TPS. The RTGC log looks like this:
1847142K->1788418K(3686400K, non fragmented: current 1897981K / min 1450247K / worst 1399243K, biggest 166K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 405223K in 2471149 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 14.97 recycling / 14.95 since last GC}[GC, 0.0000012 secs]
1793853K->1847900K(3686400K, non fragmented: current 1838500K / min 1499453K / worst 1399243K, biggest 138K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 427946K in 2550507 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 14.89 recycling / 14.87 since last GC}[GC, 0.0000013 secs]
1851645K->1762616K(3686400K, non fragmented: current 1923783K / min 1458358K / worst 1399243K, biggest 133K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 395862K in 2449501 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 14.88 recycling / 14.87 since last GC}[GC, 0.0000010 secs]
For 2,400 TPS --> 2.1 GB.
For 2,700 TPS --> 2.4 GB. RTGC log:
2152903K->2480645K(3686400K, non fragmented: current 1205754K / min 1055435K / worst 943027K, biggest 71K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 460827K in 2666089 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 13.27 recycling / 13.26 since last GC}[GC, 0.0000012 secs]
2487073K->2485239K(3686400K, non fragmented: current 1201160K / min 704177K / worst 704177K, biggest 71K, blocked threads: max 1 / still blocked 0 requesting 0K, dark matter: 444528K in 2610442 blocks smaller than 2048 bytes) <crossed RTGCCriticalReservedBytes threshold (819200K)> {CPU load: 13.37 recycling / 13.36 since last GC}[GC, 0.0000013 secs]
2490803K->2552113K(3686400K, non fragmented: current 1134286K / min 807463K / worst 704177K, biggest 229K, blocked threads: max 1 / still blocked 0 requesting 0K, dark matter: 499789K in 2839566 blocks smaller than 2048 bytes) <crossed RTGCCriticalReservedBytes threshold (819200K)> {CPU load: 13.28 recycling / 13.26 since last GC}[GC, 0.0000014 secs]
Now the RTGC goes into boosted mode (since the critical threshold has been reached) and the memory fills up, which leads to outliers of over 1 second (whereas the average latency is 6 ms).

At first, I thought that the RTGC did not get enough CPU to clean everything up. Therefore I specified the parameters so that the RTGC runs continuously with 30 threads, making sure it has enough CPU (36 out of the 64 available CPUs).
Also, I have implemented the metronome policy for the RTGC, as explained by E. Bruno and G. Bollella in "Real-Time Java Programming": a highest-priority real-time thread triggers the FullyConcurrentGarbageCollector every 100 ms, boosting it to the highest priority for 30 ms. Still, the same behavior.
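In sketch form, the metronome pattern looks roughly like this. It is written against the standard JDK so it stays self-contained: on Java RTS, the timer would be a MaxPrio RealtimeThread with PeriodicParameters, and the boostCollector()/unboostCollector() stubs (hypothetical names) would raise and restore the RTGC priority:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MetronomeSketch {
    static final long PERIOD_MS = 100; // trigger interval from the book's example
    static final long BOOST_MS = 30;   // boost window within each period
    static final AtomicInteger cycles = new AtomicInteger();

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "gc-metronome");
            t.setPriority(Thread.MAX_PRIORITY); // stand-in for a MaxPrio RealtimeThread
            return t;
        });
        timer.scheduleAtFixedRate(() -> {
            boostCollector();       // hypothetical: raise the RTGC to the highest priority
            sleepQuietly(BOOST_MS); // keep it boosted for the 30 ms window
            unboostCollector();     // hypothetical: restore the normal RTGC priority
            cycles.incrementAndGet();
        }, 0, PERIOD_MS, TimeUnit.MILLISECONDS);

        Thread.sleep(550);          // let a few periods elapse
        timer.shutdown();
        System.out.println("metronome ticks: " + cycles.get());
    }

    static void boostCollector() { /* placeholder for the Java RTS boost call */ }
    static void unboostCollector() { /* placeholder for the restore call */ }
    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
```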

Could you please help me understand what happens with the memory? Is the high memory consumption normal? I understand that the RTGC is a non-generational collector, so I cannot expect the efficiency of the CMS collector of Java SE (since I have many short-lived small objects). Still, the difference seems huge to me, and the high memory consumption is a show-stopper for me.
Another issue I encounter is that as soon as the RTGCCriticalReservedBytes threshold is reached, I start getting outliers in my asynchronously processed requests (latency goes up to over 1 second, whereas the requests are usually processed within 20 ms).

I am thankful for every hint I could get.

Thank you,


Sergiu Burian
  • 1. Re: Memory issues with Java RTS
    807557 Newbie
    ...continuing the first post:

    Here is my 64-bit experience:
    The intention was to give much more heap space, so I can go higher with the throughput. So, here are the settings:
    -d64 -cp $CLASSPATH -Xms8G -Xmx8G -XX:RTSJBindRTTToProcessorSet=1
    -XX:RTGCNormalWorkers=30 -XX:NormalMinFreeBytes=8G -XX:RTGCWaitDuration=0
    -XX:RTGCBoostedWorkers=30 -XX:RTGCCriticalReservedBytes=1G -XX:RTGCBoostedPriority=59 -XX:RTGCCriticalBoundary=59 
    -XX:+PreResolveConstantPools
    What actually happened: I started the application at 1,500 TPS and was surprised to see that the memory consumption is now at 3.3 GB (whereas with Java RTS 32-bit, at this throughput, the memory consumption was 1.2 GB, and on Java SE 6 much lower)!
    Here's the RTGC log:
    3496849K->3568919K(8388608K, non fragmented: current 4819688K / min 3883674K / worst 3883674K, biggest 804K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 709795K in 3902228 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 15.85 recycling / 15.84 since last GC}[GC, 0.0000009 secs]
    3575632K->3389543K(8388608K, non fragmented: current 4999064K / min 4015362K / worst 3883674K, biggest 636K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 727209K in 3775545 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 15.81 recycling / 15.80 since last GC}[GC, 0.0000010 secs]
    For 1,800 TPS --> 4.1 GB (about 0.8 GB more for 300 more TPS).
    For 2,100 TPS --> 4.9 GB (again about 0.8 GB more for 300 more TPS).
    For 2,400 TPS (again 300 more), the memory consumption first increases to 6.1 GB and then, without any change in throughput, keeps increasing until it reaches the critical reserved bytes threshold. The RTGC log:
    6361574K->6213294K(8388608K, non fragmented: current 2175313K / min 1232639K / worst 1232639K, biggest 220K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 1156254K in 4440866 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 12.52 recycling / 12.51 since last GC}[GC, 0.0000009 secs]
    6221396K->6579545K(8388608K, non fragmented: current 1809063K / min 1306510K / worst 1232639K, biggest 237K, blocked threads: max 0 / still blocked 0 requesting 0K, dark matter: 1165354K in 4422250 blocks smaller than 2048 bytes) <completed without boosting> {CPU load: 12.56 recycling / 12.56 since last GC}[GC, 0.0000010 secs]
    6591026K->7131582K(8388608K, non fragmented: current 1257024K / min 902677K / worst 902677K, biggest 221K, blocked threads: max 1 / still blocked 0 requesting 0K, dark matter: 1279674K in 4711580 blocks smaller than 2048 bytes) <crossed RTGCCriticalReservedBytes threshold (1048576K)> {CPU load: 12.59 recycling / 12.58 since last GC}[GC, 0.0000010 secs]
    The 64-bit version takes much, much more memory ...
  • 2. Re: Memory issues with Java RTS
    807557 Newbie
    Hi,

    There are several issues in that post. Here is some feedback on some of them. Feel free to come back with additional questions.

    I'll send a second reply with some tips on the options used and my recommended design.

    First, note that Java RTS uses the client compiler and is based on Java 5. Hence, to get an idea of the reachable throughput, I usually recommend trying with Java 5 and "-client".

    The second point is about memory consumption. I have not done the math to double-check how you measure the memory consumption from these logs, but Java RTS does consume more memory. In fact, whenever using more memory can speed up the JVM or improve determinism, we usually use more memory. Memory is not very expensive, and now that we support 64 bits, we have a huge margin :-)

    The first cause of the additional consumption is minimum object size and alignment:
    - RTGC info per object (we need two additional pointers). This increases the minimum object size: on 32 bits, the header size is 16 bytes; on 64 bits, 32 bytes.
    - Internal fragmentation. We round objects up to the RTGC block size, 32 bytes by default. JDK 6 uses a smaller rounding (I think it is 8 bytes, which is the size of its header). Usually, this does not lead to a big increase in memory consumption.
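    A quick per-object calculation based on the figures above (the 24-byte payload is just an assumed example size, not taken from the application):

```java
public class ObjectOverhead {
    // Round a size up to a multiple of the given alignment.
    static long roundUp(long size, long align) {
        return ((size + align - 1) / align) * align;
    }

    public static void main(String[] args) {
        long payload = 24; // assumed example: an object with a handful of int fields

        // JDK 6 (32-bit): 8-byte header, 8-byte rounding (per the reply above)
        long jdk6 = roundUp(8 + payload, 8);

        // Java RTS (32-bit): 16-byte header, rounded up to 32-byte RTGC blocks
        long rts32 = roundUp(16 + payload, 32);

        // Java RTS (64-bit): 32-byte header, same 32-byte blocks
        long rts64 = roundUp(32 + payload, 32);

        // prints: JDK6: 32, RTS 32-bit: 64, RTS 64-bit: 64
        System.out.printf("JDK6: %d, RTS 32-bit: %d, RTS 64-bit: %d%n", jdk6, rts32, rts64);
    }
}
```

    So a small object that costs 32 bytes on JDK 6 can cost 64 bytes on Java RTS, a factor of two before any SATB effect is counted.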

    However, the second cause is inherent to RTGC technology. Many RTGCs use what is called snapshot-at-the-beginning (SATB). In short, this means that only what is already useless when the GC starts is recycled; what becomes useless while the GC runs will be recycled on the next GC cycle.

    As an example, suppose there is no long-lived data and all your objects could be garbage collected very quickly. With SATB, what is allocated during the current GC cycle will only be recycled on the NEXT cycle. This means that at any time we have:
    - what was 'dead' at GC start, which the current GC is going to recycle at the end (while the application uses the remaining memory)
    - what the application allocates while this GC cycle runs (which will be recycled by the next GC)
    The memory you need is thus (2 * memory_consumed_per_GC_cycle).
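    As a back-of-envelope sketch of that formula (the allocation rate and GC cycle time below are assumed for illustration, not derived from the logs in this thread):

```java
public class SatbFloor {
    // Minimum heap demand for a SATB collector with no long-lived data:
    // twice the memory allocated during one GC cycle.
    static double satbFloorMb(double allocMbPerSec, double gcCycleSec) {
        return 2 * allocMbPerSec * gcCycleSec;
    }

    public static void main(String[] args) {
        double allocMbPerSec = 300; // assumed application allocation rate
        double gcCycleSec = 3.0;    // assumed duration of one full RTGC cycle

        // prints: SATB floor: 1800 MB
        System.out.printf("SATB floor: %.0f MB%n", satbFloorMb(allocMbPerSec, gcCycleSec));
        // A stop-the-world collector under the same load has no such floor:
        // halving the heap simply makes the GC run twice as often.
    }
}
```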

    The memory consumed per GC cycle depends mainly on:
    - how fast the application allocates
    - how much CPU time is not used by the application and thus available for the RTGC
    - the GC 'cost'... which might be hard to evaluate but which should be pretty constant (I won't go into these details here)

    With other GCs, the GC starts when memory is full and recycles everything. If everything is short-lived, there is not really any minimum: if you have less memory, the GC will just run more often. Thus, the difference between the minimum requirement for JDK 6 and for the RTGC can be huge... particularly if the GC does not get a lot of CPU cycles because the system is heavily loaded (remember that the application uses the client compiler, which is slower than the server compiler).

    Bertrand.
  • 3. Re: Memory issues with Java RTS
    807557 Newbie
    Now, a few tips.

    First, a general tip for machines with a lot of virtual cores (let's call them CPUs to keep it simple). We have found that, above a given number of RTGC threads, adding more threads is not necessarily a good idea: it creates a lot of contention on the memory bus, slowing down both the RTGC and the application threads. My recommendation is to start with 25% of the CPUs for the GC, running continuously (starting a new GC cycle as soon as the previous one completes). You may need more than 25% if your allocation rate is too big for the GC to keep up with only that number of worker threads. Try to find the number of CPUs that is just sufficient to keep up with your application when running full time (with some safety margin).

    In fact, in your example, I would not run your threads at the maximum priority. My recommendation is to consider your system as soft real-time. This might be more powerful on Java RTS than what soft real-time means on other systems. In addition, you will see that you can still use the hard real-time capabilities of Java RTS to improve the quality of service.

    Thus:
    - I would not use RTGCCriticalReservedBytes (at least in a first step)
    - I would let your threads run at the default real-time priority (which has soft real-time semantics by default on Java RTS)

    As stated above, in "normal mode" I would use approximately 25% of the CPUs doing GC work full time (at the default GC priority):
    -XX:RTGCNormalWorkers=16 -XX:NormalMinFreeBytes=2G (or whatever the heap size is)

    Your threads would be able to preempt the RTGC (but in your case, they should not since you have a lot of CPUs).

    As long as the 16 GC workers can keep up with your allocations, this works fine and is more deterministic than waking up the GC from time to time. In fact, it is better to reduce the number of RTGC workers than to decrease NormalMinFreeBytes (but ensure that the number stays high enough to keep up with temporary overloads).

    Thus, you can decrease RTGCNormalWorkers if the GC logs show that the minimum amount of non fragmented memory is high enough.
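    Putting the normal-mode tips together, a possible starting command line for this machine might look like the following (the flag values are starting points to tune against the GC logs, and the application jar name is a placeholder):

```shell
# Soft real-time baseline: ~25% of the 64 CPUs (16 workers) doing GC work
# full time, no critical reservation yet. Heap size and processor-set
# binding are taken from the original 32-bit setup.
java -Xms3600m -Xmx3600m \
     -XX:RTSJBindRTTToProcessorSet=1 \
     -XX:RTGCNormalWorkers=16 \
     -XX:NormalMinFreeBytes=3600m \
     -XX:+PrintGC \
     -jar app.jar
```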

    However, if you detect that memory nearly fills up, you may want to increase the number of GC workers, or ensure that they are not preempted by your application (if you temporarily have more application real-time threads), to increase robustness.

    My recommendation is to use the boosted mode for that. With the following parameters, the RTGC will be boosted (above the default application thread priority) and will use up to 32 CPUs if free memory goes below 512 MB:
    -XX:RTGCBoostedWorkers=32 -XX:BoostedMinFreeBytes=512M

    Once boosted, the 32 RTGC workers are no longer preempted by your application threads and might have a better chance to complete on time.

    However, as stated above, having a lot of RTGC workers is not necessarily a good idea (more contention on the memory bus). Thus, the boosted mode may not be sufficient to ensure the RTGC can keep up with the application threads that continue running.

    In that case, it means the system is not feasible under that load. On other JVMs, there would be no way to work around an infeasible load.

    With Java RTS, you can go a bit further. It might be better to have:
    - a few critical threads (at MaxPrio-1) monitoring the overload and taking corrective actions (hard real-time monitoring)
    - memory reserved for them:
    -XX:RTGCCriticalReservedBytes=126M
    - the other threads blocked (because of RTGCCriticalReservedBytes) while the optimal number of GC threads runs (let's assume it is 24):
    -XX:RTGCBoostedWorkers=24
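    Combined with the normal-mode settings from earlier in this reply, that fallback configuration would look roughly like this (values are illustrative; 24 boosted workers is the assumed optimum, and the jar name is a placeholder):

```shell
# Hard real-time fallback on top of the soft RT baseline: boosted workers
# capped at the assumed optimum, plus a critical reservation for the
# MaxPrio-1 monitoring threads.
java -Xms3600m -Xmx3600m \
     -XX:RTSJBindRTTToProcessorSet=1 \
     -XX:RTGCNormalWorkers=16 -XX:NormalMinFreeBytes=3600m \
     -XX:RTGCBoostedWorkers=24 -XX:BoostedMinFreeBytes=512M \
     -XX:RTGCCriticalReservedBytes=126M \
     -jar app.jar
```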

    Of course, this means the non-critical threads will 'pause' while the GC completes its cycle with these 24 CPUs. This could be a huge pause.

    If you want, you can still process part of your load (for instance, the oldest requests) in the critical range (i.e. at a priority higher than the max RTGC priority). This can reduce the worst-case per-transaction response time, and it is another way to mix the hard real-time capabilities of the JVM with its soft real-time capabilities.

    The key idea is to execute in the critical range only threads whose memory load is clearly identified and controlled, so that you can be SURE the RTGC can keep up with them. This is a requirement if you really want hard real-time.

    While most JVMs require the whole application to be hard real-time, Java RTS allows you to mix hard and soft real-time. This makes it easier (and less expensive in terms of hardware) to ensure that the hard real-time part works properly, protected from the non-hard-real-time part.

    IMHO, this is better than trying to use Java RTS in a Metronome-like manner... particularly with a high number of virtual cores. Explicitly controlling the RTGC priorities should be useful only on uniprocessors, where you cannot run the RTGC concurrently with your application threads.

    Regards,

    Bertrand.