2 Replies Latest reply on Oct 14, 2009 9:21 PM by 843829

    Strange GC behaviour


      We are experiencing some strange GC behaviour, and I'm hoping somebody might be able to provide some insight as to what the issue may be.

      Whenever a full compacting GC occurs -- due either to an explicit GC or to a promotion failure (we are using the CMS collector) -- our minor collection times suddenly get a lot worse, and continue to degrade, until a regular CMS collection is triggered (often several hours later), at which point the minor collection times return to normal. This is very consistent behaviour. It is illustrated in the two log extracts below: the first showing a full GC triggering the behaviour, the second showing a CMS collection apparently curing it.
      [GC [ParNew: 105897K->838K(118016K), 0.0035310 secs] 372762K->267706K(773376K), 0.0036190 secs]
      [GC [ParNew: 105798K->690K(118016K), 0.0032320 secs] 372666K->267565K(773376K), 0.0033260 secs]
      [GC [ParNew: 105650K->951K(118016K), 0.0035450 secs] 372525K->267843K(773376K), 0.0036300 secs]
      [GC [ParNew: 105909K->905K(118016K), 0.0035790 secs] 372800K->267798K(773376K), 0.0036630 secs]
      [GC [ParNew: 105865K->900K(118016K), 0.0035990 secs] 372758K->267794K(773376K), 0.0036830 secs]
      [GC [ParNew: 105860K->745K(118016K), 0.0059650 secs] 372754K->267640K(773376K), 0.0060460 secs]
      [Full GC (System) 423443.821: [CMS: 266895K->203518K(655360K), 1.3077080 secs] 269575K->203518K(773376K), [CMS Perm : 26573K->26496K(44312K)], 1.3078410 secs]
      [GC [ParNew: 104960K->1668K(118016K), 0.0115650 secs] 308478K->205186K(773376K), 0.0116540 secs]
      [GC [ParNew: 106628K->1993K(118016K), 0.0219700 secs] 310146K->205512K(773376K), 0.0220590 secs]
      [GC [ParNew: 106953K->1993K(118016K), 0.0273290 secs] 310472K->205511K(773376K), 0.0274200 secs]
      [GC [ParNew: 106953K->3223K(118016K), 0.0365280 secs] 310471K->206742K(773376K), 0.0366240 secs]
      [GC [ParNew: 108183K->2365K(118016K), 0.0423430 secs] 311702K->206379K(773376K), 0.0424440 secs]
      [GC [ParNew: 106458K->2582K(118016K), 0.0426810 secs] 310472K->207061K(773376K), 0.0427830 secs]
      [GC [ParNew: 107542K->2753K(118016K), 0.0383520 secs] 312021K->207454K(773376K), 0.0384460 secs]
      [GC [ParNew: 107713K->2735K(118016K), 0.0384070 secs] 312414K->207723K(773376K), 0.0385080 secs]
      [GC [ParNew: 107695K->2440K(118016K), 0.0514300 secs] 312683K->207839K(773376K), 0.0515290 secs]
      [GC [ParNew: 107400K->2200K(118016K), 0.0594310 secs] 312799K->207880K(773376K), 0.0595350 secs]
      [GC [ParNew: 107158K->2636K(118016K), 0.0753660 secs] 312838K->208671K(773376K), 0.0754680 secs]
      [GC [ParNew: 107596K->2517K(118016K), 0.0840430 secs] 313631K->208815K(773376K), 0.0841420 secs]
      [GC [ParNew: 107465K->2973K(118016K), 0.0758820 secs] 313763K->209580K(773376K), 0.0759750 secs]
      [GC [ParNew: 107618K->3333K(118016K), 0.0753020 secs] 314225K->210245K(773376K), 0.0753960 secs]
      [GC [ParNew: 108293K->2733K(118016K), 0.0763260 secs] 315205K->210011K(773376K), 0.0764210 secs]
      [GC [ParNew: 107693K->3205K(118016K), 0.0740810 secs] 314971K->210757K(773376K), 0.0741850 secs]
      [GC [ParNew: 108165K->2966K(118016K), 0.0829440 secs] 315717K->210829K(773376K), 0.0830480 secs]
      [GC [ParNew: 107926K->3473K(118016K), 0.0905890 secs] 315789K->211713K(773376K), 0.0906910 secs]
      [GC [ParNew: 108343K->4061K(118016K), 0.1072280 secs] 316583K->212747K(773376K), 0.1073410 secs]
      [GC [ParNew: 109021K->3149K(118016K), 0.1200460 secs] 317707K->212367K(773376K), 0.1201530 secs]
      [GC [ParNew: 107348K->2832K(118016K), 0.1581360 secs] 662728K->558598K(773376K), 0.1582300 secs]
      [GC [ParNew: 106676K->2988K(118016K), 0.1526330 secs] 662441K->559077K(773376K), 0.1527270 secs]
      [GC [ParNew: 107948K->2804K(118016K), 0.1370790 secs] 664037K->559407K(773376K), 0.1371660 secs]
      [GC [ParNew: 107764K->2878K(118016K), 0.1341090 secs] 664367K->559820K(773376K), 0.1342100 secs]
      [GC [ParNew: 107192K->2535K(118016K), 0.1409910 secs] 664133K->559938K(773376K), 0.1410890 secs]
      [GC [1 CMS-initial-mark: 557403K(655360K)] 561352K(773376K), 0.0039230 secs]
      [CMS-concurrent-mark: 0.451/0.451 secs]
      [CMS-concurrent-preclean: 0.004/0.004 secs]
      CMS: abort preclean due to time 293974.322: [CMS-concurrent-abortable-preclean: 0.587/5.064 secs]
      [GC[YG occupancy: 12388 K (118016 K)]293974.323: [Rescan (parallel) , 0.0052360 secs]293974.328: [weak refs processing, 0.0003530 secs] [1 CMS-remark: 557403K(655360K)] 569792K(773376K), 0. 0056680 secs]
      [CMS-concurrent-sweep: 1.380/1.383 secs]
      [CMS-concurrent-reset: 0.008/0.008 secs]
      [GC [ParNew: 107085K->1741K(118016K), 0.0034360 secs] 307437K->202097K(773376K), 0.0035210 secs]
      [GC [ParNew: 106701K->606K(118016K), 0.0031490 secs] 307057K->200967K(773376K), 0.0032300 secs]
      [GC [ParNew: 105566K->644K(118016K), 0.0030150 secs] 305927K->201007K(773376K), 0.0031000 secs]
      [GC [ParNew: 105604K->600K(118016K), 0.0030200 secs] 305967K->200967K(773376K), 0.0031050 secs]
      [GC [ParNew: 105560K->639K(118016K), 0.0029650 secs] 305927K->201011K(773376K), 0.0030490 secs]
      Our application is a distributed system consisting of a number of different types of node. This problem only affects certain classes of node, and only since deploying a recent version of our code. Something we changed in the code, therefore, seems to be causing this behaviour.

      We are able to avoid this problem by avoiding full compacting GCs: by enabling ExplicitGCInvokesConcurrent, and by lowering the CMSInitiatingOccupancyFraction to reduce the incidence of promotion failures. This is obviously not an ideal solution, and we would like to better understand what might be going on here.

      So, why might a full compacting GC cause this kind of pathological minor collection behaviour?

      We are running JRE 1.6.0_13 with the following settings:

        • 1. Re: Strange GC behaviour
          As you can see, the scavenge times appear to be closely correlated
          with the amount of data that is promoted into the old generation.
          This appears to increase dramatically (in relative terms)
          following an explicit compacting GC, and falls equally dramatically
          later following a CMS collection cycle. Based on this, my
          colleague conjectures that what you are seeing here is
          a case of premature promotion at an explicit full gc (of
          an object A that might otherwise have died in the young gen),
          followed by promotion due to nepotism in subsequent
          scavenges (the now dead but prematurely promoted A
          artificially holds references to a chain of objects -- possibly
          a linked list -- whose dead members continue to be
          promoted on subsequent scavenges). When a CMS collection
          occurs, these "zombies" (cohorts of A) are reclaimed and the nepotism
          ceases, allowing promotion volume and scavenges to revert to
          their earlier lower levels.

          When you use ExplicitGCInvokesConcurrent, the premature
          promotion that you would see with compacting full collection
          is avoided and you do not see the degradation in performance
          that you saw as a result of nepotism.

          If you have support from Sun ( see http://www.sun.com/software/javaforbusiness/index.jsp )
          you might have access to certain tools that might allow
          you to confirm if this is what is happening and
          further to identify the objects that are contributing to the
          nepotistic behaviour, and take corrective steps in your
          application to avoid nepotism despite such premature promotion.
          You might even be able to request an enhancement to the
          compacting collector that tries to avoid the onset of nepotism
          by avoiding premature promotion.

          Another workaround (besides the ExplicitGCInvokesConcurrent
          that you are currently using) would be -XX:+DisableExplicitGC
          which would (in the case of CMS) avoid this behaviour, provided
          the above conjecture regarding the root-cause is correct.

          Note that premature promotion and nepotism are general
          issues, even without explicit gc's; tuning of survivor spaces
          helps mitigate premature promotion and thus nepotism;
          compacting full gc's as currently implemented in HotSpot
          cause premature promotion and in some cases nepotism,
          and that latter behaviour is independent of survivor space

          best regards!
          • 2. Re: Strange GC behaviour
            I wrote:
            As you can see, the scavenge times appear to be closely correlated
            with the amount of data that is promoted into the old generation.More correctly, it's correlated with some linear combination of the
            amount in the survivor spaces and the amount that is promoted (i.e. net of
            amount copied). Although the correlation with the latter is
            much more obvious because of the much higher cost (in CMS)
            of promoting into the old generation versus copying into the
            survivor spaces.

            (See related plots here.)