5 Replies Latest reply on Aug 27, 2014 6:10 PM by John Gregg

    Interpreting the hot methods table

    John Gregg

      Hi all,


      TL;DR: What causes some methods to show up more frequently than others in the method hot spot table?


      I'm using Mission Control with 64-bit JDK 1.7.0_40 for Linux.  Sorry this is a little long.


      I have an unusual performance problem where a certain batch job runs very fast (<10 minutes) after an app server restart but almost always takes 80-90 minutes on subsequent nights.  The job copies data from Teradata to Oracle by running a single JDBC query against Teradata, reading ~50k records, writing them to Oracle in batches of ~2000 in parallel using the Coherence work manager, and repeating until about 15m rows have been processed.  This used to run consistently fast under 32-bit JRockit R28.x (Java 5).
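      The shape of the job can be sketched in plain Java. This is a minimal sketch, not the actual code: it swaps the Coherence work manager for a plain ExecutorService and fakes the rows, but the read-batch-submit-repeat structure matches the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchCopySketch {
    static final int BATCH_SIZE = 2000;

    public static void main(String[] args) throws Exception {
        // Stand-in for the Coherence work manager (assumption, not the real API)
        ExecutorService workManager = Executors.newFixedThreadPool(4);
        List<Future<Integer>> pending = new ArrayList<>();
        List<Object[]> batch = new ArrayList<>(BATCH_SIZE);

        // Simulated rows; in the real job this loop is driven by rs.next()
        for (int row = 0; row < 50_000; row++) {
            batch.add(new Object[] { row, "value-" + row });
            if (batch.size() == BATCH_SIZE) {
                pending.add(submit(workManager, batch));
                batch = new ArrayList<>(BATCH_SIZE);
            }
        }
        if (!batch.isEmpty()) pending.add(submit(workManager, batch));

        int total = 0;
        // Blocking here corresponds to the java monitor wait events in the recording
        for (Future<Integer> f : pending) total += f.get();
        workManager.shutdown();
        System.out.println("rows written: " + total);
    }

    static Future<Integer> submit(ExecutorService pool, List<Object[]> batch) {
        List<Object[]> copy = new ArrayList<>(batch);
        // The real worker does a JDBC batch insert into Oracle here
        return pool.submit(copy::size);
    }
}
```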


      I've made many flight recordings of the batch job.  What jumps out at me are the gaps between the socket read events to Teradata.  The event log shows this sequence for the main batch job thread:


      socket read event from Teradata (15-20ms consistently)

      method profiling events reading the result set (total time varies from night to night)

      socket read event from Teradata (15-20ms consistently)

      method profiling events reading the result set (total time varies from night to night)

      socket read event from Teradata (15-20ms consistently)

      method profiling events reading the result set (total time varies from night to night)

      java monitor wait events waiting for the work manager to complete (<500ms consistently)



      There are typically groups of 3 socket read events between calls to the work manager.  If the batch job runs quickly, there are very small gaps (around 100ms) between the socket reads, with very few method profiling events.  If it runs in the 30-minute range (rarely), those gaps are about 1 second.  If it runs in the 90-minute range (commonly), those gaps are about 3 seconds.  In other words, it takes longer to read the results out of the result set buffer than it does to insert them into Oracle!  The only thing drawing my attention in the flight recording is that there are a bunch of method profiling events in there.  It appears the methods reading data from the result set, resultSet.getString(int) for example, take longer on some nights than others.  How could that be?


      I already switched from name-based result set calls like resultSet.getString(name) to position-based calls like resultSet.getString(int) based on what earlier recordings (and my analysis of the decompiled Teradata driver) showed.  The hot methods look somewhat different now, but the performance is the same.
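      The point of that switch is that a name-based call has to map the column label to an index on every single row, while the index can be resolved once before the read loop (e.g. via rs.findColumn).  A small illustrative sketch of the resolve-once pattern, with made-up column names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: why position-based ResultSet access can beat name-based
// access. rs.getString("CUSTOMER") resolves the label on every row; resolving
// indexes once up front pays that cost a single time.
public class ColumnIndexCache {
    // Stand-in for ResultSetMetaData column labels; JDBC columns are 1-based.
    static Map<String, Integer> resolveOnce(List<String> labels) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            byName.put(labels.get(i), i + 1);
        }
        return byName;
    }

    public static void main(String[] args) {
        // Hypothetical column names, not from the actual job
        Map<String, Integer> idx = resolveOnce(List.of("ORDER_ID", "CUSTOMER", "AMOUNT"));
        // Inside the row loop you would then call rs.getString(idx.get("CUSTOMER"))
        System.out.println("CUSTOMER -> " + idx.get("CUSTOMER")); // CUSTOMER -> 2
    }
}
```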


      The server has 4 modern CPUs, CPU usage is around 20-30%, and GC throughput is well over 99%.





        • 1. Re: Interpreting the hot methods table
          Klara Ward, Java Mission Control Dev-Oracle

          Sorry for the late response; it's vacation time in Sweden.

          Some general info about the hot methods table: it is based on periodic samples of the call stack.  The items in the table are groups of call stack samples that share the same top frame method.  The more samples that end in a certain method, the more time that method probably consumes.
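          That grouping can be illustrated with a toy sketch (made-up frame names, not JFR internals): each sample is a call stack with the top frame first, and samples are tallied by top frame, so methods that sit on top of many stacks look "hot".

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of building a hot-methods view from stack samples.
public class HotMethodsSketch {
    public static void main(String[] args) {
        // Each inner list is one sampled call stack, top frame first (hypothetical names)
        List<List<String>> samples = List.of(
            List.of("getString", "readRow", "main"),
            List.of("getString", "readRow", "main"),
            List.of("socketRead0", "read", "main"),
            List.of("getString", "readRow", "main"));

        Map<String, Integer> byTopFrame = new LinkedHashMap<>();
        for (List<String> stack : samples) {
            byTopFrame.merge(stack.get(0), 1, Integer::sum);
        }
        System.out.println(byTopFrame); // {getString=3, socketRead0=1}
    }
}
```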


          Could there be any native calls in there somewhere?  Flight Recorder doesn't really track those.

          Did you get any further in your investigations? You could potentially share the recordings with us if you don't get any further.

          • 2. Re: Interpreting the hot methods table
            John Gregg

            Thanks Klara.  I hope you had a good vacation!


            I had assumed the contents of the hot methods table were there either because they took a long time or were called many times, and I think you've pretty much confirmed that.


            I'm not sure about native calls.  I really have no idea where to expect native calls to be used.  Since we're reading the results of a DB query, I think what happens is this:


            1. Query execution

            2. Teradata reads some number of rows from the DB into some kind of buffer

            3. We read the values out of the result set, which in turn reads them from the buffer

            4. When the buffer is exhausted, repeat steps 2 and 3 until all rows have been read


            I think this is pretty common JDBC driver behavior.
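            Steps 2-4 above can be mirrored in plain Java (no real JDBC, just a sketch of the pattern): rows are pulled from the "database" a block at a time into a buffer, and next() drains the buffer, refilling it whenever it is exhausted.  Each refill corresponds to one socket read event in the recording; real drivers control the block size via rs.setFetchSize(n).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Plain-Java mirror of block-fetching JDBC driver behavior (illustrative only).
public class BlockFetch {
    static final int FETCH_SIZE = 4;   // real drivers: rs.setFetchSize(n)
    static final int TOTAL_ROWS = 10;  // pretend query result size

    int produced = 0;                  // rows "sent" by the database so far
    int refills = 0;                   // number of round trips (socket reads)
    final Deque<Integer> buffer = new ArrayDeque<>();

    Integer next() {
        if (buffer.isEmpty() && produced < TOTAL_ROWS) {
            refills++;                 // one socket read event
            for (int i = 0; i < FETCH_SIZE && produced < TOTAL_ROWS; i++) {
                buffer.add(produced++);
            }
        }
        return buffer.poll();          // null once all rows are consumed
    }

    public static void main(String[] args) {
        BlockFetch rs = new BlockFetch();
        int count = 0;
        while (rs.next() != null) count++;
        System.out.println(count + " rows in " + rs.refills + " fetches");
    }
}
```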


            As for progress on my problem, my current theory is that the code cache is getting flushed and some of my batch job's hot methods aren't hot enough to get themselves recompiled.  I'm going to enable PrintCompilation and watch for "made zombie" in the output.  I do know that the memory pool mbean is reporting a peak code cache usage of around 46MB (max is the default of 48MB) and a current usage around 28MB, so I know at least some flushing has occurred.  We never saw this behavior with JRockit.
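            The code cache numbers above can be read programmatically via the memory pool MXBeans.  A small sketch that works on any HotSpot JVM; note the pool is named "Code Cache" on JDK 7/8 but is split into several "CodeHeap ..." pools on JDK 9+, so this matches either:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints current and peak usage of the JIT code cache pool(s).
public class CodeCachePeek {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.equals("Code Cache") || name.startsWith("CodeHeap")) {
                MemoryUsage u = pool.getUsage();
                System.out.printf("%s: used=%dKB max=%dKB peak=%dKB%n",
                        name, u.getUsed() / 1024, u.getMax() / 1024,
                        pool.getPeakUsage().getUsed() / 1024);
            }
        }
    }
}
```

A peak close to max, as reported here (46MB of 48MB), is exactly the signature of flushing pressure.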



            • 3. Re: Interpreting the hot methods table
              John Gregg

              This turned out to be caused by the code cache filling up.  Once we enabled PrintCompilation and had Wily track the size of the code cache, it was obvious.  We quadrupled our code cache size and now actually use about 80-90MB.  25k methods are compiled.  This particular step of our batch job went from 90 minutes to 7.  Overall CPU usage was cut in half.  Not bad.

              • 4. Re: Interpreting the hot methods table
                Klara Ward, Java Mission Control Dev-Oracle

                Good that you found the problem!

                It would be interesting to know if you could detect it with JFR as well, just as feedback to us.

                I can't remember right now if we have data about the code cache or not, or if you need to enable more of it when you start the recording.

                Would you like to share the recording or have a look yourself?

                • 5. Re: Interpreting the hot methods table
                  John Gregg

                  I don't see code cache in the recording.  I see heap before and after and perm gen before and after, but nothing related to the code cache.  What would REALLY be useful would be to see the state of the compiler (disabled or not), in addition to the cache size.  The compiler getting disabled is a pretty serious issue and should be made more obvious.  Only once have I ever seen a message in the log saying the compiler was disabled.  Unfortunately I don't even see an MBean you can look at.  There is java.lang:type=Compilation, but it doesn't seem to have any property that indicates the state of the compiler.


                  I can see code cache via the JMX console.
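                  For reference, this is everything the java.lang:type=Compilation bean exposes through the standard CompilationMXBean: a name, whether compilation-time monitoring is supported, and cumulative compile time.  As noted above, there is no attribute reporting whether the compiler has been disabled.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Dumps the standard CompilationMXBean attributes.
public class CompilationPeek {
    public static void main(String[] args) {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            // Interpreted-only VMs have no compilation MXBean at all
            System.out.println("no JIT compiler present");
            return;
        }
        System.out.println("compiler: " + jit.getName());
        if (jit.isCompilationTimeMonitoringSupported()) {
            System.out.println("total compile time ms: " + jit.getTotalCompilationTime());
        }
    }
}
```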