14 Replies · Latest reply: Sep 16, 2009 4:04 PM by 843810

    JIT vs ia32 Assembly routine

    843810
      Hi,

      I compared the performance of a tiny Java program, comprising nested loops and some meaningless calculations, with essentially the same routine written directly in ia32 assembly language and assembled with NASM. I didn't bother optimizing either the Java program or the asm program. To my complete astonishment, I got the following execution times for the two programs:

      Machinecode:
      time ./NestedLoop
      real     1m3.405s

      Java:
      time java NestedLoop
      real     0m37.229s

      I am well aware that the JIT can often outperform native C++ code, since it optimizes for the actual processor on the system, but against a few lines of machine code?? C'mon. The nested loops in asm never even leave the registers to touch memory, so memory access can't explain the observed speedup. My two programs should be essentially identical; I just made the Java version print out the end result to force it to wait for the loop calculations. The Java variable names mirror the CPU registers used in the asm version.

      I guess my question is, can someone please explain this execution time discrepancy? Did JIT perhaps take any 'shortcuts' to avoid actual looping? How would it optimize such a simple algorithm to make it even faster than assembly?

      Thanks in advance.

      Here the sources:
      JAVA:
      public class NestedLoop {
           public static void main(String[] args) {
                int eax = 100;
                int ebx=0;
                int edx;
                for (int i=eax; i>=0; i--) {
                     for (int ecx=100000000; ecx>=0; ecx--) {
                          edx = ecx/2;
                          ebx=ebx+edx;          
                     }
                     ebx++;
                }
                System.out.println(ebx);     // just put in to make the JVM wait for result     
                System.out.println("Done!");
           }     
      }
      NASM:
      global main

      section .text

      main:
           mov eax, 100
           mov ebx, 0

      outerLoop:
           mov ecx, 100000000    ; the inner-loop counter is in ecx

      innerloop:
           mov edx, ecx
           shr edx, 1            ; edx = ecx / 2
           add ebx, edx
           loop innerloop        ; dec ecx, jump while ecx != 0

           inc ebx
           dec eax
           jnz outerLoop

           mov eax, 4            ; sys_write
           mov ebx, 1            ; stdout
           mov ecx, done
           mov edx, len
           int 80h

           mov eax, 1            ; sys_exit
           mov ebx, 0
           int 80h

      section .data
      done: db "Done!", 0xa
      len:  equ $-done
        • 1. Re: JIT vs ia32 Assembly routine
          843810
          I don't know what optimizations the JIT compiler may have made, but on many processors, code alignment can have a large impact on performance. A few NOPs in the right places to word/paragraph align your loop entries could help; there is probably an align directive in NASM for this.

           Also, you've got an off-by-one error. Both "loop" and "jnz" won't execute the iteration where the counter is zero, but your Java code does. Change your Java loop conditions to "> 0" instead of ">= 0" to fix this.
          • 2. Re: JIT vs ia32 Assembly routine
            843810
            paul.miner wrote:
             Also, you've got an off-by-one error. Both "loop" and "jnz" won't execute the iteration where the counter is zero, but your Java code does. Change your Java loop conditions to "> 0" instead of ">= 0" to fix this.
             Yeah, I caught that shortly after I had posted the question. The two programs above are slightly modified, condensed-for-posting versions of what I actually used. I decided to ignore the ">= 0" as it doesn't really change the performance; if anything, it should have given the asm code a slight edge. Thanks though.

             I will definitely look into code alignment to see if it could explain the huge discrepancy. Thanks.
            • 3. Re: JIT vs ia32 Assembly routine
              843810
              DanielX wrote:
               The two programs above are slightly modified, condensed-for-posting versions of what I actually used.
              Careful, this is the sort of thing that will get people fired up. At the least, this should have been noted in your original post.
              • 4. Re: JIT vs ia32 Assembly routine
                843810
                paul.miner wrote:
                DanielX wrote:
                 The two programs above are slightly modified, condensed-for-posting versions of what I actually used.
                Careful, this is the sort of thing that will get people fired up. At the least, this should have been noted in your original post.
                 In my 'original' versions I tested more cases and various (unnecessarily complex) scenarios, so I condensed them into these two short versions and then quickly ran the time comparisons on those reduced 'forum' versions, one of which was slightly flawed because I apparently forgot to change the '>=' to '>' while trimming the programs down to the bare minimum. Anyway, I have now corrected the off-by-one bugs and re-ran both programs on a different Linux machine (3 times each). ASM: 20.331s, 20.152s, 20.213s vs. Java: 11.938s, 11.823s, 12.108s. Pretty significant difference, I think.
                • 5. Re: JIT vs ia32 Assembly routine
                  843810
                  paul.miner wrote:
                  code alignment can have a large impact on performance. A few NOPs in the right places to word/paragraph align your loop entries could help; there is probably an align directive in NASM for this.
                   I tried the align directive in NASM, for 2-, 4- and even 8-byte alignment (just in case). No difference in the performance of the machine-code version. Does anyone have other suggestions as to why the JIT compiler can outperform this short assembly program by ~40%? Is there perhaps any way to see what kind of native code the compiler actually produces? I am frankly a bit puzzled how the Java compiler can speed up these nested loops that dramatically.

                  Thanks.
                  • 6. Re: JIT vs ia32 Assembly routine
                    843810
                    There's a lot of detailed stuff that goes into optimization when it comes to superscalar processors and keeping the pipelines full. The compiler is probably pretty good at this. You could try writing an equivalent C program and timing it.

                    This may not apply now, but on older x86 processors, the LOOP instruction was actually significantly slower than DEC/JNZ. Also, the LEA instruction is often used to do register arithmetic.

                     One obvious optimization candidate is the "ebx++", which a smart compiler might be able to hoist out of the outer loop and simplify to "ebx += 101".
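
                     To make that concrete, here is a rough sketch of what that hoisting would look like if you did it by hand in the Java source (the class name is just for illustration; this is not a claim about what the JIT actually emits). The 101 comes from the outer loop running 101 times with the ">= 0" condition:

                     // Hypothetical hand-hoisted version -- a sketch of the transformation only, not JIT output.
                     public class NestedLoopHoisted {
                         public static void main(String[] args) {
                             int ebx = 0;
                             for (int i = 100; i >= 0; i--) {
                                 for (int ecx = 100000000; ecx >= 0; ecx--) {
                                     int edx = ecx / 2;
                                     ebx = ebx + edx;
                                 }
                             }
                             ebx += 101;                  // all 101 "ebx++" increments folded into one add
                             System.out.println(ebx);     // same wrapped int result as the original
                             System.out.println("Done!");
                         }
                     }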
                    • 7. Re: JIT vs ia32 Assembly routine
                      843810
                      paul.miner wrote:
                      You could try writing an equivalent C program and timing it.
                      Mhm, I was thinking about doing this, too. It should give me a better overall picture of what exactly might be going on here.
                      paul.miner wrote:
                      One plain optimization that seems possible is "ebx++", which a smart compiler might be able to move out of the loop and simplify to "ebx += 101".
                       Very true! When I reduced the Java program to just a simple loop with only "ebx++" in it, I always got an instantaneous result back (i.e. 0 computation time), regardless of whether I looped 1000, 1 million or 2 billion times. Hence the little meaningless calculation and the output of "ebx" at the end; otherwise the compiler seems to just skip the loops more or less. The stripped-down version looked roughly like the sketch below.
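
                       A minimal sketch of that degenerate test (hypothetical class name): since nothing observable depends on ebx, the JIT is free to discard the whole loop, so timing it tells you nothing about loop speed.

                       // Hypothetical stripped-down test: ebx is never used afterwards, so the loop is
                       // dead code and the JIT can remove it entirely.
                       public class EmptyLoopTest {
                           public static void main(String[] args) {
                               int ebx = 0;
                               for (int ecx = 2000000000; ecx > 0; ecx--) {
                                   ebx++;                   // result never observed
                               }
                               System.out.println("Done!"); // note: ebx itself is not printed
                           }
                       }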

                       Thanks for your input. I'll try a few more things and then wrap it all up. The lesson learned, and the good news I guess, is that the JIT compiler's efficiency saves me a lot of headache by letting me keep everything in Java; the bad news is that there probably isn't much to gain by writing the computationally heavy routine directly in assembly.
                      • 8. Re: JIT vs ia32 Assembly routine
                        jschellSomeoneStoleMyAlias
                        DanielX wrote:
                        ... by writing the computationally heavy routine directly in assembly.
                         Timings for benchmarks give you a good indication of how long benchmarks take to run. Since the vast majority of code isn't a benchmark, they are of limited use for predicting the performance of real code.

                         Whether one can get better performance with assembly depends on one's experience with assembly, experience with the target platform, and general skill at optimizing code (assembly included).
                        • 9. Re: JIT vs ia32 Assembly routine
                          843810
                          jschell wrote:
                          DanielX wrote:
                          ... by writing the computationally heavy routine directly in assembly.
                           Whether one can get better performance with assembly depends on one's experience with assembly, experience with the target platform, and general skill at optimizing code (assembly included).
                           True. I programmed in assembly extensively many years ago, so I wouldn't mind getting back into it if necessary. If my little 'experiment' had shown, say, a 2-3x or greater speedup for assembly, I would certainly have pursued this further. But it seems I'd have to think hard(er) just to catch up with the JIT compiler, let alone beat it by enough to justify porting the routine to assembly. Not worth it, I think.
                          • 10. Re: JIT vs ia32 Assembly routine
                            843810
                             Something I've wondered about with programs like this is how far compiler optimizations could be taken. The JLS says that simple things like "int n = 5 * 6 + 7" are evaluated by the compiler as a constant expression, resulting in "int n = 37". A very smart/hard-working compiler could take this a lot further.

                            For example, the entire inner loop could be reduced to "edx = 0; ebx += CONSTANT". Taking this further, all the local variables and the outer loop could be eliminated as well, and the first println would be "System.out.println(ANOTHER_CONSTANT)". The compiler would burn a lot of CPU cycles, but it seems quite possible.
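
                             Just to show what that folding amounts to if you do it by hand (hypothetical class name, and I derived the constant myself rather than claiming any compiler actually does this): the inner loop sums ecx/2 for ecx = 100000000 down to 0, and for an even bound N that sum is (N/2)*(N/2), so the whole program collapses to a few lines of arithmetic.

                             // Hand-folded version of the program above -- a sketch of the idea, not compiler output.
                             public class NestedLoopFolded {
                                 public static void main(String[] args) {
                                     final long N = 100000000L;                // inner loop bound (even)
                                     final long innerSum = (N / 2) * (N / 2);  // sum of ecx/2 for ecx = N down to 0
                                     final long outerIters = 101;              // i = 100 down to 0 inclusive
                                     long total = outerIters * (innerSum + 1); // +1 per pass for the ebx++
                                     // the original accumulates in a 32-bit int, so it prints the low 32 bits
                                     System.out.println((int) total);
                                     System.out.println("Done!");
                                 }
                             }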
                            • 11. Re: JIT vs ia32 Assembly routine
                              843810
                               Just for completeness, I tested my little program with gcc too. Here's the source:
                              #include <stdio.h>
                              int main (void)
                              {
                                   int eax=100;     
                                   int ebx=0;
                                   int edx;
                                   
                                   int i, ecx;
                              
                                   for (i=eax; i>0; i--)
                                   {
                                        for (ecx=100000000; ecx>0; ecx--)
                                        {
                                             edx = ecx/2;
                                             ebx=ebx+edx;          
                                        }
                                        ebx++;
                                   }
                              
                                   printf ("Done!");
                                   return 0;
                              }
                               To wrap up, the average computation times:

                              JIT: 12 sec
                              NASM: 20 sec
                              GCC: 37 sec

                               Certainly not representative as any kind of rigorous benchmark, but good enough for me to get an idea.
                              • 12. Re: JIT vs ia32 Assembly routine
                                EJP
                                 I would expect a good optimizing compiler or JIT to cut the number of iterations in the inner loop in half. If it were really good, it would compute the result and eliminate the loops altogether.
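
                                 To make the halving idea concrete, here is a rough hand-written sketch of that transformation for the inner loop (hypothetical class and method names, not something any particular JIT is claimed to emit): ecx/2 is identical for a consecutive odd/even pair, so two iterations can be merged into one.

                                 // Sketch of halving the inner loop by hand. floor(ecx/2) is the same for ecx and
                                 // ecx - 1, so each pass handles two counter values; the int arithmetic wraps
                                 // exactly as the original ebx does.
                                 public class HalvedInnerLoop {
                                     static int innerLoopHalved() {
                                         int ebx = 100000000 / 2;              // the lone even starting value, ecx = 100000000
                                         for (int ecx = 99999999; ecx >= 1; ecx -= 2) {
                                             int edx = ecx / 2;                // same quotient for ecx and ecx - 1
                                             ebx += 2 * edx;                   // add it for both, then skip ecx - 1
                                         }
                                         return ebx;                           // every value 100000000..0 accounted for once
                                     }

                                     public static void main(String[] args) {
                                         System.out.println(innerLoopHalved()); // one inner-loop pass, half the iterations
                                     }
                                 }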
                                • 13. Re: JIT vs ia32 Assembly routine
                                  843810
                                   I changed the "loop innerloop" instruction in my assembly code to "dec ecx" / "jnz innerloop". That change alone improved the time from 20 secs to 7.2 secs. Two things are very apparent: 1) the JIT seems to be very, very efficient, especially compared to gcc; 2) my evaluation program is obviously way too short and simple, as even tiny changes cause drastic swings in run time. Granted, the loop instruction is executed hundreds of millions of times, but it takes 6 cycles versus 4 (1 + 3) for dec and jnz, yet the computation time shrinks from 20 secs to 7.2 secs. Go figure.
                                  • 14. Re: JIT vs ia32 Assembly routine
                                    843810
                                    DanielX wrote:
                                     I changed the "loop innerloop" instruction in my assembly code to "dec ecx" / "jnz innerloop". That change alone improved the time from 20 secs to 7.2 secs. Two things are very apparent: 1) the JIT seems to be very, very efficient, especially compared to gcc; 2) my evaluation program is obviously way too short and simple, as even tiny changes cause drastic swings in run time. Granted, the loop instruction is executed hundreds of millions of times, but it takes 6 cycles versus 4 (1 + 3) for dec and jnz, yet the computation time shrinks from 20 secs to 7.2 secs. Go figure.
                                     Heh, I thought DEC/JNZ might be faster, but I'm surprised the difference was that big. My guess is it's the RISC-vs-CISC instruction thing: IIRC, simple RISC-style instructions end up faster than multi-step CISC-style instructions like LOOP because they are less disruptive to the pipeline.