One of the things I learned at The Server Side Java Symposium 2008 was a command-line option to print out the assembly code that the JIT is producing. Since I've always been interested in seeing the final assembly code that gets produced from your Java code, I decided to give it a test drive.

First the disclaimers:

1. I'm not a performance expert.
2. Don't try to take this too far, like optimizing your code against what you see here.

The option in question requires a debug build of the JVM. The binary I tested is JDK6 u10 b14.

$ java -fullversion
java full version "1.6.0_10-beta-fastdebug-b14"

First, let's try something trivial:

public class Main {
    public static void main(String[] args) {
        for (int i = 0; i < 10000; i++)  // call foo() repeatedly so the JIT compiles it
            foo();
    }

    private static void foo() {
        for (int i = 0; i < 100; i++)
            bar();
    }

    private static void bar() {
    }
}

I run this like "java -XX:+PrintOptoAssembly -server -cp . Main". The -XX:+PrintOptoAssembly is the magic option, and with this option I get the following, which shows the code of the "foo" method:
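For reference, the whole session looks like this (assuming the class above is saved as Main.java in the current directory):

$ javac Main.java
$ java -XX:+PrintOptoAssembly -server -cp . Main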

000 B1: # N1

You see that the entire bar() function call and the loop were optimized away. So it must have inlined the empty bar() method, then eliminated the now-empty loop.

Now to something more interesting:

private static byte[] foo() {
    byte[] buf = new byte[256];
    for (int i = 0; i < buf.length; i++)
        buf[i] = 0;
    return buf;
}

This produces the following code:

000 B1: # B15 B2

Just to recap, R8-R15 are the additional general-purpose 64-bit registers introduced in amd64.

The first part (00c-027) is allocating an array, and this is already interesting. As the comment indicates, R15 is apparently used as a pointer to the thread-local storage of the current thread, and R15[120] is the pointer to the head of the heap sub-space dedicated to this thread.

So the byte[] is allocated from this thread-local space by simply reserving 256+32 bytes. If there's not enough space (the limit is set at R15[136]), then it falls back to the slower allocation code at B15; that code must involve reserving a new chunk from the eden space and allocating the new object there.
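To make the fast path easier to follow, here is the same logic written back as plain Java. This is only a sketch; tlabTop and tlabEnd are names I made up for the thread-local fields at R15[120] and R15[136]:

class TlabSketch {
    long tlabTop;  // R15[120]: next free address in this thread's heap sub-space
    long tlabEnd;  // R15[136]: end of this thread's heap sub-space

    long allocate(int sizeInBytes) {
        long top = tlabTop;
        long newTop = top + sizeInBytes;          // here, 256 data bytes + 32 bytes of overhead
        if (newTop > tlabEnd)
            return allocateSlowPath(sizeInBytes); // B15: reserve a new chunk from eden
        tlabTop = newTop;                         // bump the pointer
        return top;                               // address of the new object
    }

    long allocateSlowPath(int sizeInBytes) {
        throw new UnsupportedOperationException("outside the scope of this sketch");
    }
}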

Once the pointer to the new array is set to R8 at 00c, the initialization follows (033-071). The first 24 bytes of the newly allocated space are used for metadata (the first 8 bytes are probably lock- or GC-related, followed by a pointer to the class object, then another 8 bytes for the size of the array). 06c zero-clears the array. In theory the zero-clearing shouldn't have been necessary, as we then fill the array with zero again, but the JIT failed to take advantage of that.
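Putting that together, the layout of the new object looks roughly like this (my reading of the dump; the exact header contents are a guess):

// Presumed layout of the freshly allocated byte[256], relative to R8:
//   [R8 +  0]   8 bytes, probably lock- or GC-related
//   [R8 +  8]   pointer to the class object
//   [R8 + 16]   the array length (256)
//   [R8 + 24]   the 256 data bytes, zero-cleared 8 bytes at a time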

But note that the zero-clearing is done 8 bytes at a time, so it did recognize that the array size is a multiple of 8.

I don't quite understand what those prefetch instructions (at 02b, 03a, and 049) are meant for. Presumably they are to make sure that the next time an object allocation happens, that part of the memory is in cache, but why 256, 320, and 384? Does anyone have a clue?

Now, as of 074, R8 is the pointer to 'buf' and R9 is the length of the array. Note that the JIT knows that buf.length is always 256 here, so this is movl R9,256 and not movl R9,[R8+16]. Also note that this computation is outside the for loop. So this tells us that there's no need to explicitly assign the array length to a temporary variable in a tight loop, because the JIT does the equivalent anyway:

int len = buf.length;
for (int i = 0; i < len; i++)
    buf[i] = 0;

Similarly, there's no need to reverse the direction of the loop to avoid the buf.length computation.
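In other words, the old counting-down idiom below buys you nothing here (my example, not from the dumps):

// counting down so that the comparison is against 0 instead of
// buf.length; the JIT already hoists the length, so this gains nothing
for (int i = buf.length - 1; i >= 0; i--)
    buf[i] = 0;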

The way the loop is compiled is very interesting. First there's the 'warm up' part (07c-099) that presumably does the array filling until it reaches an 8-byte boundary, then the 'fast loop' portion (09b-0d3) that zero-fills 8 bytes per iteration by using an MMX register, then the final 'cool down' part (0d5-100) that handles the remaining tail that doesn't fill a full 8 bytes. In this case, in theory it could have figured out that the whole thing fits 8-byte boundaries nicely, so the warm up and cool down were unnecessary, but it appears that the JIT didn't realize this.

I don't know what kind of computation happens behind the scenes here, but overall this loop unrolling is rather clever. The original code was a byte-by-byte assignment to 0, but in the final code, one loop iteration clears 8 bytes at a time.
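Written back as Java, the structure of the generated loop is roughly the following. This is my reconstruction, not the JIT's literal output; the 'misalignment' parameter stands in for the address arithmetic the real code does on the pointer in R8:

static void fill(byte[] buf, int misalignment) {
    int i = 0;
    // 'warm up' (07c-099): one byte at a time until an 8-byte boundary
    while (i < buf.length && (i + misalignment) % 8 != 0)
        buf[i++] = 0;
    // 'fast loop' (09b-0d3): 8 bytes per iteration
    for (; i + 8 <= buf.length; i += 8) {
        // in the generated code this is a single 8-byte MMX-register store
        buf[i] = 0; buf[i + 1] = 0; buf[i + 2] = 0; buf[i + 3] = 0;
        buf[i + 4] = 0; buf[i + 5] = 0; buf[i + 6] = 0; buf[i + 7] = 0;
    }
    // 'cool down' (0d5-100): whatever tail is left over
    while (i < buf.length)
        buf[i++] = 0;
}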

I also noticed that there's no array boundary check in the fast loop portion, which is nice.

OK, most of you have hopefully heard that in JDK6 they do lock coarsening and lock elision. So let's see that in action.
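As a quick refresher before diving in (my own illustration, not from any dump): lock coarsening merges adjacent lock/unlock pairs on the same object into one, and lock elision removes the lock entirely when the object provably never escapes the thread.

static void conceptually(java.util.Vector<String> v) {
    // Coarsening: the three lock/unlock pairs that the three add()
    // calls would each perform internally merge into this single one.
    synchronized (v) {
        v.add("abc");
        v.add("def");
        v.add("ghi");
    }
    // Elision: if v never escapes the current thread, even this
    // remaining lock can be dropped.
}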

For that, I compiled the following code and executed it in the same fashion:

private static void foo() {
    Vector v = new Vector();
    v.add("abc");
    v.add("def");
    v.add("ghi");
}

This gives me the following:

000   B1: #	B10 B2
long
0cc   shrq    R10, #9
0d0   movq    RDX, java/lang/String:exact *	# ptr
0da   movq    R11, 0x00002a959c9da580	# ptr
0e4   movb    [R11 + R10], #0	# byte
0e9   movq    RSI, RBP	# spill
0ec   nop	# 3 bytes pad for loops and calls
0ef   call,static  java.util.Vector::add
      # Main::foo @ bci:11  L[0]=RBP
      # AllocatedObj(0x0000000040b30680)
0f4
0f4   B6: #	B15 B7

The allocation of the Vector object (00c-058) is almost identical to the array allocation code we've seen before (except for the additional field initializations at 048-058). The array allocation for Vector.elementData follows (060-0c0).

Note that the Vector constructors are defined in a highly nested fashion like this:

public Vector(int initialCapacity, int capacityIncrement) {
    super();
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal Capacity: "+ initialCapacity);
    this.elementData = new Object[initialCapacity];
    this.capacityIncrement = capacityIncrement;
}

public Vector(int initialCapacity) {
    this(initialCapacity, 0);
}

public Vector() {
    this(10);
}

... but the whole thing is inlined, so the end result is just as fast as the following code. This is great.

public Vector() {
    this.elementData = new Object[10];
    this.capacityIncrement = 0;
}

But wait: after that, you see that there are three call instructions for Vector.add. So there's neither lock elision nor lock coarsening, despite the fact that this Vector object never escapes the stack.

I thought perhaps that's because Vector.add is too complex to be inlined, so I tried the following code, in the hope of seeing lock elision:

private static void foo() {
    Foo foo = new Foo();
    foo.inc();
    foo.inc();
    foo.inc();
}

private static final class Foo {
    int i = 0;
    public synchronized void inc() {
        i++;
    }
}

This produced the following code:

000 B1: # B6 B2

We are all familiar with the memory allocation by now, so we can skip that.

The 'fastlock' pseudo-instruction (AFAIK there's no such operation in amd64, and a single machine instruction can't possibly occupy 223 bytes!) must be the lock code. Here you see that lock coarsening has indeed happened (yay!), and the three increments happen in a single block (MEMBAR-acquire/release must be another pseudo-instruction, which became a no-op in this scenario; see that the length of those instructions is 0).

Note that the JVM still fails to eliminate the lock here, despite the fact that this object doesn't escape the stack. I tried various things to get escape analysis and lock elision to kick in, but I couldn't find a way to do it. It looks like this feature is not quite in the JDK yet, although it's equally possible that I'm doing something stupid.

Also note that, presumably because of the memory barriers associated with the lock, each increment writes back to memory. This is unfortunate, because in theory the three increments could have been combined into one, given that the lock was coarsened.
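That is, once the lock was coarsened, the three increments could in principle have been folded like this (my illustration, not JIT output; foo is the object from the example above):

// one read-modify-write under the coarsened lock,
// instead of three separate stores back to memory
synchronized (foo) {
    foo.i += 3;
}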

Indeed if I remove the 'synchronized' keyword, I get the following substantially simpler version:

000 B1: # B4 B2

So not only the three inc() calls but also the field initializer got collapsed into a single "movq rax[16],3" instruction. Wow!
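In Java terms, the JIT effectively rewrote foo() into something like this (a sketch of my reading of the dump):

private static void foo() {
    Foo foo = new Foo();  // the allocation itself still happens
    foo.i = 3;            // i=0 plus three increments, folded into one store
}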

All in all, modern JVMs seem pretty good at generating optimal code. In various situations, the resulting assembly code is far from a straightforward instruction-by-instruction translation. OTOH, escape analysis doesn't really seem to do anything useful yet.

This was a long post, but I hope you enjoyed this as much as I did.