Zero and Shark: a Zero-Assembly Port of OpenJDK


    May 9, 2007 was a happy day for the Java group at Red Hat. The release of OpenJDK meant we could stop playing catch-up with the free Java solutions we were maintaining and switch our attention to the real deal.

    There was just one problem. On Linux, OpenJDK only worked on x86 machines, but at Red Hat we needed to support PowerPC, Itanium, and zSeries too. We knew that if we wanted OpenJDK to work on these platforms then we would have to make it happen ourselves.

    Porting OpenJDK to a new platform involves writing and maintaining several thousand lines of assembly language. This causes each platform to become its own codebase, each with its own bugs. Each platform requires a specialized implementation, which makes the platform-specific codebases opaque to the developers working on other platforms. This approach is easier to manage in the proprietary world (where every codebase has its team of assigned developers) but it's less suited to open source software, where developers come and go as their circumstances dictate.

    At Red Hat, we wanted to avoid these problems. We started an experimental port of OpenJDK without assembly language, using free software libraries to bridge the gaps. This experiment evolved into the zero-assembly port of OpenJDK -- Zero -- and its just-in-time compiler, Shark.

    The Problem

    The majority of the work of porting OpenJDK to a new processor is in porting the virtual machine, HotSpot. HotSpot is largely written in C++, but some 10,000 lines of its most critical code are written in assembly language. This has to be recreated for every new processor you wish to support, an intensive task that takes on the order of one work year per platform for a basic port. To make OpenJDK truly portable it was clear that this assembly language core needed to be replaced.

    HotSpot operates by default in what it calls mixed mode. Java code is initially executed using a profiling interpreter, which stores information as it runs so that it can identify the methods in which it is spending the most time. Once identified, these "hot" methods are scheduled for compilation to native code. The compiler, running in a separate thread, has a basic loop in which it takes the hottest method, compiles it to native code, and inserts it into the VM. Execution speed increases over time as more and more hot methods are replaced with compiled code. The important point here is that in this design the compiler is optional: a functional, if slow, port could be written by getting the interpreter alone to work.
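    The mixed-mode loop described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the names `Method`, `MixedModeVM`, `invoke`, and `compile_hottest` are hypothetical, the "profile" is a bare invocation counter, and "compilation" is just a flag. HotSpot's real profiling and scheduling are far more involved.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical per-method record; HotSpot's real data structures differ.
struct Method {
    std::string name;
    int invocations = 0;    // profile gathered while interpreting
    bool compiled = false;  // set once "native code" has been installed
};

class MixedModeVM {
public:
    explicit MixedModeVM(int threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Run a method once. Returns true if it ran as compiled code,
    // false if it was interpreted (and therefore profiled).
    bool invoke(const std::string& name) {
        Method& m = methods_[name];
        m.name = name;
        if (m.compiled) return true;
        ++m.invocations;
        return false;
    }

    // One iteration of the compiler loop: pick the hottest method that is
    // still interpreted and has reached the threshold, then mark it compiled.
    // Returns the method's name, or an empty string if nothing is hot enough.
    std::string compile_hottest() {
        Method* hottest = nullptr;
        for (auto& entry : methods_) {
            Method& m = entry.second;
            if (m.compiled || m.invocations < threshold_) continue;
            if (hottest == nullptr || m.invocations > hottest->invocations) {
                hottest = &m;
            }
        }
        if (hottest == nullptr) return "";
        hottest->compiled = true;
        return hottest->name;
    }

private:
    int threshold_;
    std::unordered_map<std::string, Method> methods_;
};
```

    Note that `invoke` works whether or not `compile_hottest` is ever called, which is the property the port relies on: the interpreter alone is enough for a functional VM.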

    The Interpreter

    I've been writing in terms of "the interpreter," but HotSpot in fact contains two: the template interpreter and the C++ interpreter. In the template interpreter, each bytecode is implemented by a block of native code -- a template -- written in assembly language. These templates are generated at interpreter startup, and are chained together at runtime to execute the method. In the C++ interpreter, bytecodes are implemented in C++, using a simple loop and switch construct. Not everything required by the interpreter can be handled in C++, however, so the C++ code is supported by a thin assembly language layer. The way the interpreters slot into the virtual machine is shown in Figure 1.

    HotSpot's interpreters
    Figure 1. HotSpot's interpreters. Code written in C++ is shown in green, and code written in assembly language is shown in red.
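    To make the "loop and switch" construct of the C++ interpreter concrete, here is a minimal sketch of the idea. The opcodes (`PUSH`, `ADD`, `MUL`, `RET`) and the function `interpret` are invented for illustration and are not real Java bytecodes or HotSpot code. The shape is the point: fetch an opcode, switch on it, repeat.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy opcodes, not real Java bytecodes.
enum Opcode : std::uint8_t { PUSH, ADD, MUL, RET };

// Execute a toy bytecode sequence and return the value on top of the
// operand stack when RET is reached (or 0 if the code runs off the end).
int interpret(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;  // operand stack
    std::size_t pc = 0;      // index of the next byte to execute

    while (pc < code.size()) {
        switch (code[pc++]) {
        case PUSH:  // PUSH <imm8>: push the following byte
            assert(pc < code.size());
            stack.push_back(code[pc++]);
            break;
        case ADD: {
            assert(stack.size() >= 2);
            int b = stack.back(); stack.pop_back();
            int a = stack.back(); stack.pop_back();
            stack.push_back(a + b);
            break;
        }
        case MUL: {
            assert(stack.size() >= 2);
            int b = stack.back(); stack.pop_back();
            int a = stack.back(); stack.pop_back();
            stack.push_back(a * b);
            break;
        }
        case RET:
            assert(!stack.empty());
            return stack.back();
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
    return 0;
}
```

    Nothing in this loop is platform-specific, which is why the C++ interpreter needs so little assembly language: only the operations the language itself cannot express have to live elsewhere.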

    The template interpreter is the default, for the simple reason that it's faster. The difference in speed is not so much because it is written in assembly language but because it is the older of the two; the design of the VM has very much evolved along with the template interpreter, and the C++ interpreter has to jump through hoops in order to accommodate interfaces that don't really suit it.

    The C++ interpreter has one compelling advantage over the template interpreter, however: it contains much less assembly language. Porting it to a new platform can therefore be done much more quickly, and in mixed mode the difference in execution speed is largely mitigated as hot methods are replaced with compiled code over time. Having less assembly language to replace made the C++ interpreter the better choice for Zero.

    The C++ interpreter's assembly language layer has two basic functions that cannot be performed from within C++: manipulation of the native stack, and calls to native functions with arbitrary signatures. Implementing the C++ interpreter without assembly language basically boils down to finding a way of handling these two things. There are other things in the assembly layer, things that could have been written in C++, were they stand-alone functions; but the fact that part of the layer needs to be written in assembly language means that all of it has to be.

    Stack Manipulation

    At the Java level, a Java VM must keep track of which method called the current method, and which method called that method, and so on. This is necessary for a variety of reasons, from the simple (figuring out where a method is returning to when it returns) to the complex (figuring out the access control context of a method to see if it is permitted to perform some action by the current security policy). The straightforward way to handle this is to store the information in a call stack.

    At the machine level, HotSpot is itself an application, and it has a stack of its own where the individual C and C++ functions store their own caller information and other data. The format of this stack is CPU- and OS-specific, its layout defined as part of the particular platform's Application Binary Interface (ABI).

    HotSpot was originally written for i386, a platform notoriously starved of registers, so instead of maintaining two separate stack pointers in two separate registers, everything in HotSpot is stored on the ABI stack. This saves a register, but it requires at least part of the interpreter to be written in assembly language, as the ABI stack cannot be manipulated from within C or C++. Even if it could be accessed by C or C++, the layout of the ABI stack is platform-specific, so the code to create and access the frames would still require a separate implementation for each platform.

    There's no fundamental need for the stacks to be merged like this -- it's merely a nice optimization -- so in Zero, separate stacks are used. The Java stack is simply a block of memory managed by some simple, portable C++, and the ABI stack is left to manage itself. This change eliminated a swathe of assembly language but raised an issue with the object locking code, which allocates locks on the Java stack but tests for locks by looking for pointers into the ABI stack. Zero works around this by allocating the Java stack's block of memory on the ABI stack, using alloca().
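    As a rough sketch of this arrangement, the following shows a Java-style stack that is just a block of memory managed by portable C++, with the block itself obtained from `alloca()` so that it lives on the ABI stack. The class `JavaStack` and the function `run_with_java_stack` are hypothetical names, not Zero's actual code, and `<alloca.h>` is a non-standard header assumed to be available, as it is with glibc on Linux.

```cpp
#include <alloca.h>  // non-standard; assumed available (glibc on Linux)
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a stack of machine words that grows downward
// inside a caller-supplied block of memory.
class JavaStack {
public:
    JavaStack(std::intptr_t* base, std::size_t slots)
        : base_(base), top_(base + slots), sp_(base + slots) {}

    void push(std::intptr_t value) {
        assert(sp_ > base_);  // overflow check
        *--sp_ = value;
    }

    std::intptr_t pop() {
        assert(sp_ < top_);   // underflow check
        return *sp_++;
    }

    std::size_t depth() const { return static_cast<std::size_t>(top_ - sp_); }

    // True if p points into this stack's block of memory.
    bool contains(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        return addr >= reinterpret_cast<std::uintptr_t>(base_) &&
               addr < reinterpret_cast<std::uintptr_t>(top_);
    }

private:
    std::intptr_t* base_;  // lowest address of the block
    std::intptr_t* top_;   // one past the highest address
    std::intptr_t* sp_;    // current stack pointer
};

// alloca() must be called from the function whose ABI frame should own
// the memory; the block is released automatically when that function returns.
std::intptr_t run_with_java_stack(std::size_t slots) {
    assert(slots >= 2);
    auto* block =
        static_cast<std::intptr_t*>(alloca(slots * sizeof(std::intptr_t)));
    JavaStack stack(block, slots);
    stack.push(40);
    stack.push(2);
    std::intptr_t second = stack.pop();
    std::intptr_t first = stack.pop();
    return first + second;
}
```

    Because the block comes from `alloca()`, every slot address is also an address within the ABI stack, which is the property the locking code's pointer test depends on.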

    Native Calls

    Most methods in a Java application will be normal methods, written in the Java programming language and executed using interpretation or JIT compilation. In addition, however, Java also allows for native methods, methods whose code is written in C or C++, with the Java Native Interface (JNI) providing the bridge between the two. The C code for a native method might look something like this:

    JNIEXPORT jboolean JNICALL
    Java_java_lang_Class_isInstance(JNIEnv* env, jobject cls, jobject obj)
    {
        if (obj == NULL)
            return JNI_FALSE;
        return (*env)->IsInstanceOf(env, obj, (jclass) cls);
    }

    Now, JNI allows for methods with any signature that the underlying platform supports: they can have any number of arguments, with any combination of types, and they can have any return type. This poses a problem for HotSpot, or indeed any runtime written in C++, because C++ can only call functions whose signature is known at compile time, whereas with JNI, signatures are known only at runtime. As a result, HotSpot's native calling code has traditionally had to be written in assembly language.
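    The limitation is easy to see in code. The sketch below is a hypothetical dispatcher (`call_native`, `NativeFn`, and the sample functions are invented for illustration) that calls a function pointer according to a JVM-style descriptor string supplied at runtime. In plain C++, each supported signature needs its own hand-written cast and call, so any signature the author did not anticipate cannot be handled.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Generic function pointer type used to carry a native function of
// unknown signature (hypothetical name).
using NativeFn = void (*)();

// Two sample "native methods" with different signatures.
static std::int64_t add2(std::int64_t a, std::int64_t b) { return a + b; }
static std::int64_t neg1(std::int64_t a) { return -a; }

// Call fn according to a descriptor known only at runtime.
// "J" denotes a 64-bit integer, as in JVM method descriptors.
std::int64_t call_native(NativeFn fn, const std::string& descriptor,
                         const std::vector<std::int64_t>& args) {
    if (descriptor == "(J)J") {
        assert(args.size() == 1);
        auto typed = reinterpret_cast<std::int64_t (*)(std::int64_t)>(fn);
        return typed(args[0]);
    }
    if (descriptor == "(JJ)J") {
        assert(args.size() == 2);
        auto typed =
            reinterpret_cast<std::int64_t (*)(std::int64_t, std::int64_t)>(fn);
        return typed(args[0], args[1]);
    }
    // Every further signature would need another branch written in advance.
    throw std::runtime_error("unsupported signature: " + descriptor);
}
```

    A call such as `call_native(reinterpret_cast<NativeFn>(&add2), "(JJ)J", {40, 2})` works, but only because that exact signature was enumerated ahead of time. Since the set of possible JNI signatures is unbounded, this approach cannot cover it.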

    Zero uses a free software library called libffi to handle native calls. libffi was originally written to handle JNI calls for GIJ, the GNU Interpreter for Java. In other words, libffi solved the exact problem we were facing, making it ideal for Zero. Assembly language is still required to perform the call, but it's encapsulated in libffi and the code in Zero is entirely C++.


    Replacing the assembly language for stack manipulation and native calls allows the whole of the C++ interpreter's support layer to be rewritten in C++, and this, combined with some build system changes, was enough to allow interpreter-only builds of OpenJDK on any Linux system with GCC. This is great, but the resulting VM is very slow. To get reasonable performance we needed to move beyond interpretation and find a way to include a JIT compiler.

    To implement a JIT without introducing platform-specific code, we turned to another free software library, LLVM (Low Level Virtual Machine). LLVM has a wide range of applicability -- it's an infrastructure for building both compilers and virtual machines -- but the feature that was interesting for this project is that it includes JITs that generate native functions from code expressed in LLVM's intermediate representation (IR).

    Shark is, in essence, very simple. It uses the same interface as HotSpot's platform-specific compilers, so it slots in with very little modification to HotSpot itself. When running in mixed mode, HotSpot's compiler scheduler locates hot methods and invokes Shark to compile them, one at a time. Shark translates the Java bytecode of these methods to LLVM IR, and invokes LLVM's JIT to generate the native code. The native code is then installed in the VM, where it replaces the interpreted version of the method, and control returns to the compiler scheduler.

    The main difficulty for Shark is that object pointers need to be available to HotSpot's garbage collectors (GC). HotSpot's existing compilers have access to the native code they generate, which allows them a certain flexibility here. They can leave pointers in registers across GC runs, for example, because they can supply the GC with information about which registers contain pointers. Shark can't do this; object pointers need to be dumped to memory across GC runs and restored afterwards. HotSpot's compilers can inline pointers in the generated code, too, annotating the code so the GC can locate and modify them. Again, Shark can't do this; object pointers must be loaded from memory or passed around between functions. These extra memory accesses impose a significant overhead.
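    The spill-and-reload pattern can be illustrated with a toy moving collector. Everything here is a hypothetical sketch: `ToyHeap`, `Object`, and `compiled_method` are invented names, and the "collector" simply copies every object reachable from a root slot and frees the rest. It is not how HotSpot's collectors or Shark's generated code actually work; it only shows why a pointer must sit in memory the GC knows about while a collection runs.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

struct Object { int value; };

// Toy moving collector: objects reachable from a root slot are copied to
// a new location, the slot is updated, and all old objects are freed.
class ToyHeap {
public:
    ToyHeap() = default;
    ToyHeap(const ToyHeap&) = delete;
    ToyHeap& operator=(const ToyHeap&) = delete;
    ~ToyHeap() { for (Object* o : objects_) delete o; }

    Object* allocate(int value) {
        objects_.push_back(new Object{value});
        return objects_.back();
    }

    // roots: addresses of memory slots that hold object pointers.
    void collect(const std::vector<Object**>& roots) {
        std::unordered_map<Object*, Object*> forwarded;  // old -> new
        std::vector<Object*> survivors;
        for (Object** slot : roots) {
            assert(slot != nullptr);
            if (*slot == nullptr) continue;
            auto it = forwarded.find(*slot);
            if (it == forwarded.end()) {
                Object* copy = new Object(**slot);
                survivors.push_back(copy);
                it = forwarded.emplace(*slot, copy).first;
            }
            *slot = it->second;  // the GC rewrites the slot in memory
        }
        for (Object* o : objects_) delete o;
        objects_ = survivors;
    }

    std::size_t live() const { return objects_.size(); }

private:
    std::vector<Object*> objects_;
};

// Stand-in for a compiled method that holds an object across a GC run.
int compiled_method(ToyHeap& heap) {
    Object* obj = heap.allocate(7);  // imagine this lives in a register
    Object* frame_slot = obj;        // dump the pointer to memory
    heap.collect({&frame_slot});     // GC may move the object
    obj = frame_slot;                // restore the (possibly new) pointer
    return obj->value;
}
```

    If `compiled_method` kept using its original `obj` after the collection instead of reloading from `frame_slot`, it would dereference freed memory. The store and reload around each such point are the extra memory accesses referred to above.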


    Zero and Shark have been written following the philosophy of minimal modification to existing HotSpot code, an approach that has had a number of advantages. Zero's development was extremely fast, from the initial concept in December 2007 to a functional, stable VM that could build itself in March 2008. It aids stability, too. A HotSpot build with Zero comprises 6,500 lines of new code from Zero and 450,000 lines from HotSpot: 450,000 lines of code that has enjoyed ten years of extensive use and rigorous testing. This helped enormously when testing Zero builds with the Java Compatibility Kit (JCK): most of the functionality under test was handled by the original HotSpot code, so most of the issues were already taken care of. Finally, this approach means Zero can take advantage of new HotSpot features or optimizations with minimal or no effort. If someone writes a new garbage collector, for example, then Zero and Shark can use it straight away because Zero and Shark present themselves to HotSpot's garbage collectors in exactly the same way as the existing HotSpot code.

    This approach is not without its disadvantages. The interpreter, for example, does not need to be split into two layers for Zero, and a rewrite could make it considerably faster. Shark, too, is limited; the need to keep object pointers visible to HotSpot's garbage collectors requires a lot of memory accesses that could be avoided with a different interface. Zero and Shark were never about extensive HotSpot modifications, however, and to a certain extent they discourage them. Extensive modifications are, by definition, a lot of work, and if you're going to do a lot of work chasing ultimate performance, then why not go the whole way and do a conventional port? Hand-crafted assembly language will always have the edge. The point of Zero and Shark was to deliver a portable and stable VM with reasonable performance. When Shark is ready for production, that's what they'll be.