4 Replies Latest reply: Dec 11, 2008 9:10 AM by 807557

    Java Real time system poor inline performance

    807557
      I am comparing the performance of JDK 1.5.0_16 and Java RTS 2.1.

      I'm seeing some very big differences that I suspect are due to inlining. Here is one concrete example:
      import java.util.Random;
      public class DoubleToFloatBenchmark {
          private static final int INNER_LOOP = 10000;
          private static final int OUTER_LOOP = 1000;
      
          public static void main(String[] args) throws InterruptedException {
              Random random  = new Random(0);
              double[] values = new double[INNER_LOOP];
              long[] results = new long[INNER_LOOP];
              for (int i = 0; i < values.length; i++) {
                  values[i] = random.nextDouble();
              }
      
              test(values, results);
              test(values, results);
              test(values, results);
              test(values, results);
              test(values, results);
          }
      
          private static void test(double[] values, long[] results) throws InterruptedException {
              long time = Long.MAX_VALUE;
              for (int i = 0; i < OUTER_LOOP; i++) {
                  long start = System.nanoTime();
                  for (int j = 0; j < INNER_LOOP; j++) {
                      results[i] = Double.doubleToLongBits(values[i]);
                  }
                  long end = System.nanoTime();
                  time = Math.min(time, end - start);
              }
              System.out.format("time= %-,3.3fns\n", 1.0 * time / INNER_LOOP);
              Runtime.getRuntime().gc();
              Thread.sleep(10);
          }
      }
      Here is the output:
      bash-3.00$ java -cp . DoubleToFloatBenchmark
      time= 7.345ns
      time= 5.196ns
      time= 0.108ns
      time= 0.108ns
      time= 0.108ns
      bash-3.00$ /opt/SUNWrtjv/bin/java -cp . DoubleToFloatBenchmark
      time= 41.243ns
      time= 41.297ns
      time= 41.295ns
      time= 41.293ns
      time= 41.292ns
      Any ideas on how to speed up the Java RTS version? It's roughly 400 times slower.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
        • 1. Re: Java Real time system poor inline performance
          807557
          Mak,

          What you are seeing are the effects of the HotSpot server compiler versus the client compiler. Java RTS only supports the client compiler (even if you use -server you get -client). The server compiler can perform very aggressive optimizations compared to the client compiler, because if it makes a wrong assumption it stops the world, deoptimizes the affected code, recompiles it the right way (perhaps immediately, or perhaps leaving it for later dynamic compilation) and continues on its way. The client compiler is much less sophisticated and does not do these aggressive optimizations. For Java RTS the server compiler's mode of operation would completely kill predictability, so deoptimization cannot be allowed, and therefore the aggressive optimizations are not allowed either.
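A quick way to check which VM a given launcher really gives you is to print the standard `java.vm.name` system property. This is only a minimal sketch: the class name `VmInfo` is made up for illustration, and the exact strings printed vary by VM and release.

```java
public class VmInfo {
    // Returns the VM name reported by the running JVM. On HotSpot this
    // string typically says whether the client or server VM is in use.
    static String vmName() {
        return System.getProperty("java.vm.name");
    }

    public static void main(String[] args) {
        System.out.println("java.vm.name    = " + vmName());
        System.out.println("java.vm.version = " + System.getProperty("java.vm.version"));
        System.out.println("java.version    = " + System.getProperty("java.version"));
    }
}
```

Running it once with each launcher (and with -client and -server) shows what each one actually selects.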

          Here are the results I get for client, server and then JRTS:
          # /mirrors/j2se-mirrors/5.0u17/solaris-i586/bin/java -client DoubleToFloatBenchmark
          time= 66.503ns
          time= 65.146ns
          time= 65.146ns
          time= 65.146ns
          time= 65.146ns
          
          # /mirrors/j2se-mirrors/5.0u17/solaris-i586/bin/java -server DoubleToFloatBenchmark
          time= 10.850ns
          time= 7.711ns
          time= 0.140ns
          time= 0.139ns
          time= 0.139ns
          
          # rtj DoubleToFloatBenchmark
          time= 74.191ns
          time= 74.190ns
          time= 73.739ns
          time= 73.739ns
          time= 73.739ns
          These are the sort of results I'd expect to see. JRTS is approx 13% slower than J2SE client.

          Looking at your example, this is a classic problem with micro-benchmarking - see Cliff Click's "famous" JavaOne 2002 talk on "How not to write a microbenchmark":

          http://www.azulsystems.com/events/javaone_2002/microbenchmarks.pdf

          There are numerous similar articles following up on that showing how easy it is for the server compiler to throw away the precious code you are so desperately trying to measure the performance of. It's a real eye-opener. See Brian Goetz's article: http://www.ibm.com/developerworks/java/library/j-jtp02225.html
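One common precaution against having the measured work thrown away is to make the results observable, for example by folding them into a checksum that is printed at the end. The following is a sketch under that assumption, not a guarantee against every optimization; the class name `KeepResultsLive` and the method `convertAll` are hypothetical.

```java
import java.util.Random;

public class KeepResultsLive {
    // Converts every element and folds the outputs into a checksum,
    // so each conversion contributes to a value the caller can observe.
    static long convertAll(double[] values, long[] results) {
        long checksum = 0;
        for (int j = 0; j < values.length; j++) {
            results[j] = Double.doubleToLongBits(values[j]);
            checksum ^= results[j];
        }
        return checksum;
    }

    public static void main(String[] args) {
        Random random = new Random(0);
        double[] values = new double[10000];
        long[] results = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextDouble();
        }

        long sink = 0;
        long best = Long.MAX_VALUE;
        for (int i = 0; i < 1000; i++) {
            long start = System.nanoTime();
            sink += convertAll(values, results);
            long end = System.nanoTime();
            best = Math.min(best, end - start);
        }
        // Printing the sink turns the computed values into program output.
        System.out.println("best ns per element = " + (1.0 * best / values.length));
        System.out.println("checksum sink = " + sink);
    }
}
```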

          In this code in your example:
          for (int j = 0; j < INNER_LOOP; j++) {
             results[i] = Double.doubleToLongBits(values[i]);
          }
          the inner loop can be removed completely because the computation in the loop is independent of the loop variable j. (I'm not sure if that was intentional?)
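To make the point concrete, here is a small sketch comparing the loop as written with the single store an optimizer is free to reduce it to. It assumes the argument in the posted code was meant to be values[i] (the subscript appears to have been lost in posting); the class and method names are made up for illustration.

```java
import java.util.Arrays;

public class InvariantLoop {
    static final int INNER_LOOP = 10000;

    // The loop as posted: every pass of j stores the same value in the same slot.
    static void asWritten(double[] values, long[] results, int i) {
        for (int j = 0; j < INNER_LOOP; j++) {
            results[i] = Double.doubleToLongBits(values[i]);
        }
    }

    // An equivalent form with the loop removed: one store, same final state.
    static void hoisted(double[] values, long[] results, int i) {
        results[i] = Double.doubleToLongBits(values[i]);
    }

    public static void main(String[] args) {
        double[] values = {0.25, 0.5, 0.75};
        long[] a = new long[values.length];
        long[] b = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            asWritten(values, a, i);
            hoisted(values, b, i);
        }
        System.out.println("same results: " + Arrays.equals(a, b));
    }
}
```

Both versions leave the results array in the same state, so timing the first one says little about the cost of the conversion itself once a compiler notices the equivalence.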
          
          So let's manually delete that inner loop and see what we get (and stop dividing by INNER_LOOP). Here are the results again for client, server and JRTS:
          # /mirrors/j2se-mirrors/5.0u17/solaris-i586/bin/java -client DoubleToFloatBenchmark
          time= 449.000ns
          time= 450.000ns
          time= 317.000ns
          time= 308.000ns
          time= 316.000ns

          # /mirrors/j2se-mirrors/5.0u17/solaris-i586/bin/java -server DoubleToFloatBenchmark
          time= 455.000ns
          time= 455.000ns
          time= 452.000ns
          time= 455.000ns
          time= 452.000ns

          # rtj DoubleToFloatBenchmark
          time= 506.000ns
          time= 506.000ns
          time= 503.000ns
          time= 340.000ns
          time= 340.000ns
          Oh my gosh! JRTS and client become faster than server! ;-) But what are we now measuring ... ?
          
          I hope this clarifies things.
          
          David Holmes
          
          Edited by: davidholmes on Dec 11, 2008 10:08 AM. Added link to Brian Goetz's article.
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                              
          • 2. Re: Java Real time system poor inline performance
            807557
            1) Thank you very much for your useful info, timely response and links.
            2) Microbenchmarks aren't very good if they have a bug :-)
            --- DoubleToFloatBenchmark.java     (revision 64)
            +++ DoubleToFloatBenchmark.java     Thu Dec 11 07:33:34 GMT 2008
            @@ -26,7 +26,7 @@
                     for (int i = 0; i < OUTER_LOOP; i++) {
                         long start = System.nanoTime();
                         for (int j = 0; j < INNER_LOOP; j++) {
            -                results[i] = Double.doubleToLongBits(values[i]);
            +                results[j] = Double.doubleToLongBits(values[j]);
                         }
                         long end = System.nanoTime();
                         time = Math.min(time, end - start);
            The intention was to loop through each element so that the loop could not be optimized out, but an *i* looks a lot like a *j*.
            3) The version without the inner loop is really measuring the System.nanoTime() call, which is better in JRTS.
            4) I think the corrected version still looks like a valid microbenchmark, but I would value your input as to why it is not.
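On point 3, a rough way to see what the timer itself costs is to time two back-to-back System.nanoTime() calls with nothing in between. This is only a sketch; the class name `NanoTimeOverhead` is hypothetical, and the figure it prints depends on the platform's clock resolution as much as on the VM.

```java
public class NanoTimeOverhead {
    // Smallest gap observed between two consecutive System.nanoTime() calls.
    static long minGap(int samples) {
        long best = Long.MAX_VALUE;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            long end = System.nanoTime();
            best = Math.min(best, end - start);
        }
        return best;
    }

    public static void main(String[] args) {
        // Repeat a few times so later rounds run after any JIT compilation.
        for (int round = 0; round < 5; round++) {
            System.out.println("min nanoTime gap = " + minGap(1000000) + "ns");
        }
    }
}
```

If the gap is of the same order as the numbers in the no-inner-loop runs, those runs are mostly measuring the timer.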
            When run with the bug removed:
            bash-3.00$ /usr/jdk/jdk1.5.0_16/bin/java -client -cp . DoubleToFloatBenchmark
            time= 31.497ns
            time= 32.483ns
            time= 32.559ns
            time= 32.724ns
            time= 32.503ns
            time= 32.994ns
            time= 32.498ns
            bash-3.00$ /usr/jdk/jdk1.5.0_16/bin/java -server -cp . DoubleToFloatBenchmark
            time= 7.168ns
            time= 5.481ns
            time= 2.439ns
            time= 2.438ns
            time= 2.440ns
            time= 2.440ns
            time= 2.436ns
            bash-3.00$ /opt/SUNWrtjv/bin/java -cp . DoubleToFloatBenchmark
            time= 45.903ns
            time= 45.856ns
            time= 43.575ns
            time= 43.567ns
            time= 43.572ns
            time= 43.569ns
            time= 43.573ns
            It's still about an 18 times difference. The client VM seems unwilling to inline the native call. Is there any way to force this inlining?
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
            • 3. Re: Java Real time system poor inline performance
              807557
              The microbenchmark may be "valid" in that it will measure how the VM executes the code you've written, but the issue is more how you use the information from the microbenchmark to infer real application behaviour. If your app spends a large portion of its time doing this conversion then this may be an issue for your app's performance; if not then it probably won't be. That's something only you can determine. Ultimately the issue is whether or not you meet your goals.

              There are significant differences between the abilities of the client and server compiler. Whether inlining is the issue in this case I can't say. There are some options for printing out what the compiler generates but I'm not sure to what extent they are applicable to JDK 5.
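Short of dumping generated code, one thing available from inside the program is the java.lang.management API, which was added in Java 5 and can report the name of the JIT compiler in use. A minimal sketch follows; the class name `JitInfo` is made up, and whether Java RTS reports anything useful here is an assumption that would need checking.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitInfo {
    // Describes the JIT compiler of the running VM, if it reports one.
    static String describeJit() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null) {
            // The API returns null when the VM has no compilation system.
            return "no JIT compiler reported";
        }
        StringBuilder sb = new StringBuilder("JIT compiler: ").append(jit.getName());
        if (jit.isCompilationTimeMonitoringSupported()) {
            sb.append(", total compilation time so far: ")
              .append(jit.getTotalCompilationTime())
              .append(" ms");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeJit());
    }
}
```

This does not show inlining decisions, but it does confirm which compiler produced the code being timed.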

              There's also little scope for influencing the compilation policies - one of our compiler folk would have to elaborate on what the available options are.

              David Holmes
              • 4. Re: Java Real time system poor inline performance
                807557
                Thanks David,

                What I'm looking at is an evaluation of JRTS vs. JDK 1.6 for a predictable low-latency messaging app. The app does not exist yet, but I am building up benchmarks as I design and implement it. I don't know yet whether JNI inlining (if that is the problem) will be a significant issue, but it may be. I guess my question relates to what the roadmap is for improving the raw performance of JRTS vs. J2SE?

                As an aside, and my opinion for what it's worth: it's surprising that the JRTS team invested the resources to implement, support and maintain a Linux version. I would have thought investing in compiler improvements would be a bigger win. As a customer, it's a much easier buy decision if the J2SE and JRTS performance is comparable; then I'd just take an S10 migration hit, which would not be a big deal given we are talking Java. Better cross-selling opportunity too :-)