Skip navigation
I was looking over some old stuff, and found JDistill, a byte-code reduction program I wrote in 1998. Although it won't work unchanged on today's class files, and its copyright status is murky, I thought that the article I wrote might still have some interest. Here it is, with only minor edits. 

JDistill, a program to shrink Java packages
Éamonn McManus <emcmanus@opengroup.org>
3 April 1998

JDistill is a Java program that reduces the size of the class files that make up a Java package. It does this in several ways: 
  • It deletes information that is used for debugging but not actually needed for correct execution of the Java bytecode. This includes line-number information, for example.
  • It traces through all the executable code in the package that is reachable from outside the package in order to determine what is referenced. Anything that is never referenced is deleted. This can include fields and methods of Java classes, or even entire classes themselves.
  • It replaces the names of any methods and fields that have not been deleted and that are not visible outside the package with very short names. This has the side-effect of making the class file much harder to discompile.
  • It detects the frequently very bulky code generated by the current JDK javac compiler (versions up to 1.2beta2 at least) to initialize arrays, and replaces it with code that takes up much less space in the class file.
  • A command-line option tells JDistill that it is being given the whole Java program, i.e., no new classes will be added. In this case many more methods and fields can be deleted or renamed.
This document explains the rationale for JDistill, shows how to use it to distill (shrink) packages, and gives some results for the standard JDK packages in 1.1.5. The final section goes into detail about the implementation. 

Rationale

Java class files serve two distinct purposes. They containbytecode to be executed when the program using them is run; and they contain information used when other Java classes are compiled. For a Java program that is a product, this additional information for the compiler is not needed and can be deleted. 

Deleting the extra information makes the class files smaller, and also makes them much harder to decipher. Tools exist that can reconstitute almost exactly the Java source code that was used to produce a given class file; but if the names have been dramatically shortened (a form of obfuscation) this source code will be of little use.

When the Java compiler compiles a Java package, it does not know if additional classes will subsequently be added to the package. JDistill operates on the assumption that they will not. That means that anything within the package that is not referenced now will never be referenced and can be deleted.

You might imagine that anything JDistill can delete out of a package could also be deleted from the Java source code by a human user. This is not true for a number of reasons.

  • When you declare a constant in Java, like this: 
        static final int MAX_VALUE = 32767;
    
    the compiler must leave this field in the generated class file, so that other classes compiled later will be able to obtain its value. But no compiled class will ever refer to the field by name, because instead of generating a reference toSomeClass.MAX_VALUE the compiler will just copy the constant value 32767. JDistill will see that there are no references to MAX_VALUE in existing class files (because of this replacement) and will know that nothing outside the package can subsequently refer to MAX_VALUE(because it is not public), so it can delete it.
  • A product might exist in several configurations, selected at compile time. It is possible to eliminate code within methods when it belongs to a configuration that is not enabled, by doing something like: 
        if (Config.DEBUG) {
            // debug code...
        }
    
    But this technique cannot eliminate methods themselves or fields or even classes that are only used when Config.DEBUG is true. JDistill can.
  • Renaming methods and fields can save a small amount of space and can render class files very difficult to decipher. But it would be counterproductive to do the renaming in the Java source files!
  • As mentioned above, the Java compiler generates bulky code to initialize constant arrays - the same code as it generates for non-constant arrays, in fact. If you write: 
        static final int[] powers = {1, 5, 25, 125, 625, 3125};
    
    the compiler will generate code almost as if you had written: 
        static final int[] powers;
        static {
            powers = new int[6];
            powers[0] = 1;
            powers[1] = 5;
            ...
            powers[5] = 3125;
        }
    
    (The code is slightly more efficient than that because it can keep a reference to the powers array on the operand stack.) You can write code that takes up less space to initialize the array, by storing the initial values in a constant String and pulling them out in a loop, but it is unnecessary work to do so. JDistill does it for you automatically.

Using JDistill

2011 Éamonn says: you might want to skip to the section on technical details.

The command line for using JDistill looks like this:

java ORG.opengroup.jdistill.Main [-options] directory [...]

At its simplest, all of the .class files in the named directory are read, transformed to make them smaller, and written back over the originals. It is assumed that the class files form a package and that that package is closed, meaning that no new classes will be added to it after distilling, or that it will be redistilled after new classes are added.

Several directories can be named instead of just one; currently the effect is the same as if JDistill had been run on each directory individually. Files can also be named instead of directories, whereupon they are treated separately without the "closed" assumption.

Invoking JDistill without any arguments or with incorrect arguments produces a summary of the available options. The options are as follows. Each option can be abbreviated to any unambiguous prefix.

-output-replace
Followed by a pattern that says where output files should be written. The pattern should have two parts separated by=, and each part should contain exactly one% character. Each input file should match the pattern on the left of the =, with % being a wildcard that matches any sequence of characters. The name of the corresponding output file will be determined by the pattern on the right of the =, with the % being replaced by the same sequence of characters. Thus for example the option 
    -out /tmp/classes/%=/tmp/shrunkclasses/%
will cause a class file read from/tmp/classes/java/lang/String.class to be written to/tmp/shrunkclasses/java/lang/String.class

The % syntax is inspired by GNU make. The more obvious * character was avoided to prevent problems with Unix shell expansion.

If this option is not provided, distilled class files are written over the original undistilled ones. In that case classes that are found not to be referenced are not deleted; it would be more accurate to say that they are not copied when-output-replace is specified.

-option-file
Followed by the name of a file from which further options are to be read. Typically there will be -keep-* options in this file, kept up to date as needed. The syntax of options in the file is the same as on the command line except that they can be provided on several lines, and lines beginning with a# character are ignored.
-verbose
Print a line for each class file that is read to be distilled, each class that is loaded without being distilled (to provide inheritance information, for instance), and each class file that is written.
-no-warnings
Do not print messages about constructs found in the class files that might cause problems with the distillation. Currently the only warnings are associated with classes loaded dynamically usingClass.forName(String). If the argument is not a constant string, or if it is a string naming a class that cannot be found at distill time, a warning is issued.
-class-path
Use the class path immediately following this option instead of the standard class path when looking for classes that are not included in the set to be distilled but are referenced from it. Thus for instance JDistill will almost always want to loadjava.lang.Object to find out if any methods in that class are overridden by the classes being distilled; if those classes belong to a different system than the one where JDistill is being run then it would not be right to load the samejava.lang.Object that is being used for the JDistill code itself.
-keep-class
Do not delete the class following this option even if it is apparently unreferenced. The Java syntax for a class is used, not the name of the class file. So 
    -keep-class java.io.Thing
might cause the file /unshrunk/java/io/Thing.class to be copied to /shrunk/java/io/Thing.class even ifjava.io.Thing is not a public class and there are no references to it within java.io

This option does not affect deletion or renaming of methods and fields within the named class.

-keep-field
Do not delete or rename the field following this option even if it would apparently be all right to do so. This is especially useful for fields that are accessed only from native code. The field is given in its fully-qualified form,i.e., the full name of its class followed by the name of the field, such as java.awt.Font.pData

If a field is accessed from native code but not using JNI, all preceding fields should also be kept, or javah should be run on the distilled classes rather than the originals.

-keep-method
Do not delete or rename the method following this option even if it would apparently be all right to do so. There is no way to differentiate among overloaded methods with the same name using this option: all methods with the given name will be preserved. Again this is useful for methods that are accessed only from native code. The method is given in its fully-qualified form but without parameters, for instance java.io.File.exists.
-keep-constructor
Do not delete any constructors for the class following this option even if they are apparently unreferenced. The class name is spelt out as for -keep-class.
-no-main
Do not consider methods with the signature 
    public static void main(String[])
as entry points, unless they are in public classes, in which case they would be entry points anyway. This option allows main methods introduced for debugging to be automatically deleted from production code.
-closed-program
Assume that no classes other than those being distilled will ever be loaded. This means that anything that is not referenced within the distilled classes is not referenced at all and can be deleted.
-tree
Treat each directory argument as the root of a directory hierarchy and shrink every file in that hierarchy whose name matches *.class.
-make-directories
Make intermediate directories as needed when creating output class files.
-disassemble
Print a disassembly of the structure of each class file as it is read, including a symbolic representation of the bytecode of each method. The information printed is similar to that printed by the standard javap tool, but more detailed.
-show-deletions
Show fields, methods, and classes that are deleted because they are unreferenced. The information displayed can be very instructive. It may reveal parts of a program that should have been deleted during some earlier modification; it may reveal faulty program logic that means some code is never reached; or it may indicate things that are only referenced from native code and so need to be protected using the -keep options. If the program being distilled was compiled with -O, short methods will be inlined where they are called, so if they are not visible outside their package JDistill will legitimately delete them.
-debug-deletions
Also show constants from the constant pool that are deleted. These will typically include the names of deleted fields and methods, and also constant strings or integers that were only referred to from within deleted methods. If fields and methods are renamed, the old names will normally be deleted too.
-show-renames
Show fields and methods that are renamed to shorter names. You may want to keep this information in case you ever have to debug a stack trace through methods that are now called A orB.
-debug-renames
Show the analysis that is done to determine what methods can be renamed. The existence of method overriding and hiding (Java Language Specification §8.4.6) means that this analysis is complex, as described later. With this option, if a method is not renamed then JDistill explains why, and it also shows sets of methods that were found to be required to have the same name.
-no-renames
Do not rename any methods or fields.
-show-array-shrink
Show all array initializations that were considered for shortening, with the savings obtained for those that were indeed shortened and the reason why not for those that were not.
-debug-array-shrink
Show additional information about array shortening. This information is principally of interest for debugging JDistill itself.
-no-array-shrink
Do not shorten any array initializations.
-show-references
Show the results of the JDistill reference analysis before anything is deleted. Essentially, the list of things that are unreferenced is the same as the list of things that will be deleted, so the principal interest of this option is to see how many references there were to certain things. Currently, only constants in the constant pool are counted in this way; for other entities the information is just whether they are referenced or not.
-trace
Print a detailed trace of the flow of execution as references from bytecode are determined. This option produces a great deal of output and is mainly of use for debugging JDistill.

Measurements

These measurements were made on an HP/UX machine running version 1.1.5 of the JDK. All of the standard classes of the JDK were distilled using all available shrinking. A few -keep-*options had to be provided to prevent incorrect deletion of fields and methods used only by native code. With these options in place, the distilled classes get the same results in the JCK as the original classes. 

The overall reduction is a spectacular 32%, but a great deal of that is attributable to two factors: shortened array initializations, and classes compiled without optimization. Some classes, for instance java.lang.Character and almost all of the sun.io package, contain big arrays which the javac compiler turns into huge bytecode. A later section will mention some problems that the shortened array initializations may cause. The standard packages distributed as part of the JDK ought to be compiled with optimization (-O flag), which as a side-effect leaves out line-number information, but some of them are not, which makessun.io even larger than it would be.

The standard packages measured here are not typical of Java applications, in that they export many individual methods but there is less cooperation between classes in individual packages. The fact that most of the package information is exported greatly limits what JDistill can do, since publicly-visible fields and methods cannot be deleted or renamed.

As another example, the JDistill program itself, compiled with optimization (-O flag), was distilled. Here the saving was 19%, a figure probably more typical of applications. There were no large arrays to be shrunk so the figures with and without that optimization were essentially the same.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
PackageOld sizeNew sizeSaved
java.lang1600309779162239 (38.9%)
java.lang.reflect1150410990514 (4.5%)
java.util64858601094749 (7.3%)
java.util.zip37163316145549 (14.9%)
java.io1117831033858398 (7.5%)
java.math21682176224060 (18.7%)
java.text229856102778127078 (55.3%)
java.text.resources37654432344753097 (14.1%)
java.net60327524677860 (13%)
java.security35759291636596 (18.4%)
java.security.acl32332358875 (27.1%)
java.security.interfaces15611015546 (35%)
java.awt25234220843443908 (17.4%)
java.awt.peer976067393021 (31%)
java.awt.image30732241326600 (21.5%)
java.awt.datatransfer616049791181 (19.2%)
java.awt.event31704264295275 (16.6%)
java.applet39313124807 (20.5%)
java.sql36624321364488 (12.3%)
java.rmi16475126013874 (23.5%)
java.rmi.registry30232489534 (17.7%)
java.rmi.dgc25932010583 (22.5%)
java.rmi.server24504201124392 (17.9%)
java.beans541404379010350 (19.1%)
sun.tools.java1508331457795054 (3.4%)
sun.tools.javac67557662301327 (2%)
sun.tools.debug1154999095324546 (21.3%)
sun.tools.asm43774413762398 (5.5%)
sun.tools.tree31094229318917753 (5.7%)
sun.tools.jar31606265365070 (16%)
sun.tools.util982182731548 (15.8%)
sun.tools.javap21708176654043 (18.6%)
sun.tools.javadoc496131524534368 (69.3%)
sun.tools.ttydebug30014242415773 (19.2%)
sun.tools.native2ascii50774228849 (16.7%)
sun.tools.serialver69285969959 (13.8%)
sun.misc541844004814136 (26.1%)
sun.io535026431408142209450 (41.3%)
sun.net13079105152564 (19.6%)
sun.net.ftp859972011398 (16.3%)
sun.net.nntp688056771203 (17.5%)
sun.net.smtp453135291002 (22.1%)
sun.net.www27227215375690 (20.9%)
sun.net.www.content.text14411062379 (26.3%)
sun.net.www.content.image19261485441 (22.9%)
sun.net.www.protocol.file60015160841 (14%)
sun.net.www.protocol.http16302130413261 (20%)
sun.net.www.protocol.doc38713400471 (12.2%)
sun.net.www.protocol.netdoc14841261223 (15%)
sun.net.www.protocol.verbatim34682897571 (16.5%)
sun.net.www.protocol.ftp928580841201 (12.9%)
sun.net.www.protocol.gopher856970851484 (17.3%)
sun.net.www.protocol.mailto45043852652 (14.5%)
sun.net.www.protocol.appletresource29162506410 (14.1%)
sun.net.www.protocol.systemresource749962911208 (16.1%)
sun.net.www.http15538131322406 (15.5%)
sun.net.www.httpd65895685904 (13.7%)
sun.audio1883096009230 (49%)
sun.awt365192644410075 (27.6%)
sun.awt.motif17346813892934539 (19.9%)
sun.awt.image591734587713296 (22.5%)
sun.awt.tiny14766211750030162 (20.4%)
sun.applet923177504517272 (18.7%)
sun.applet.resources14299129831316 (9.2%)
sun.security.acl13307105662741 (20.6%)
sun.security.util22505180724433 (19.7%)
sun.security.x50944805374447361 (16.4%)
sun.security.pkcs25451220403411 (13.4%)
sun.security.provider725785799314585 (20.1%)
sun.rmi.registry12711107331978 (15.6%)
sun.rmi.server19483166662817 (14.5%)
sun.rmi.transport565854591110674 (18.9%)
sun.rmi.transport.proxy40943328518092 (19.8%)
sun.rmi.transport.tcp45573371818392 (18.4%)
sun.rmi.rmic45814382697545 (16.5%)
sun.beans.editors17552150002552 (14.5%)
sun.beans.infos14771235242 (16.4%)
sun.jdbc.odbc15215712568826469 (17.4%)
sunw.util537266271 (50.5%)
sunw.io231100131 (56.7%)
Total (80 packages)909779461600532937741 (32.3%)
Average1137227700136722 (32.3%)

Technical details

JDistill does three essentially independent tasks when working on a package: 
  1. It traces the bytecode to find out what is referenced and deletes whatever is not;
  2. It shortens the names of fields and methods whose existing names are not visible outside the package;
  3. It shortens array initializations.

Finding references

The algorithm for tracing all of the bytecode in a package that can ever be reached is conceptually simple: start from the public entry points and trace through instructions; every time you see an instruction that references a field, mark that field as referenced; every time you see an instruction that invokes a method, mark that method as referenced and trace it too. 

The public entry points of a package are determined as follows:

  • Every public method in a public class is an entry point because it can be called directly by Java code outside the package.
  • Every protected method in a publicclass is an entry point because it can be called by a subclass of the class outside the package (this is not true if the class isfinal but we don't currently check that).
  • Every public or protected method in a class that is not public but that is subclassed by apublic class in the same package is an entry point because it is visible through the subclass.
  • A method that looks like 
        public static void main(String[] args)
    
    is an entry point because a Java program can start there, even if it is in a class that is not public. Since such a method often contains test code for the containing class that is not supposed to be included in the final product, JDistill has a command-line option -no-main to tell it not to consider this kind of entry point unless it is covered by one of the other rules.
  • Every method, even a private one, in a class that contains native method declarations is an entry point. Native code can typically access any field or method of any class regardless of Java-language visibility. We can't know what fields or methods are accessed in this way, and we don't want to keep all fields and methods just in case they might be accessed by native code, so we assume that classes with native methods are the only ones whose methods or fields are accessed in this way. When this assumption is wrong, it can be corrected as explained in the next paragraph.
  • Every method that was explicitly named by a -keep-method or -keep-constructor command-line option is an entry point. You can use this to preserve methods called only from native code and not included in the heuristic described just above.
A constructor is considered to be a method like any other for the purposes of these rules. 

If the -closed-program option is given, only the last two rules are applied, but then if a class containing native methods is determined to be referenced, all of its methods are considered to be entry points.

While we are looking for public entry points, we also look forvisible fields, that is fields that can be accessed directly or indirectly from outside the package. The rules here are the following:

  • Every public or protected field in apublic class or a class that has a publicsubclass in the package is visible. This is essentially the same rule as for methods.
  • Every field in a class that has native methods is visible, for the same reasons as explained for methods above.
  • Every field in a class that implementsjava.io.Serializable (directly or by inheritance) is visible unless it is static or transient, because changing the field in any way would break compatibility with previously-serialized objects of this class.
  • If a class that implements java.io.Serializablehas a field called serialVersionUID then that field also participates in serialization even if it isstatic (in fact it always isstatic).
  • Every field that was explicitly named by a -keep-field command-line option is visible. Again this can be used to prevent the deletion of fields that are only referenced from native code and which are in classes that do not themselves contain native methods.

Here again, the public rule does not come into play when -closed-program is given, and the other rules except the last are only applied to classes that are found to be referenced.

Tracing bytecode

We examine the instructions of every method that is an entry point as explained above, and every method that is called by such a method, and so on recursively. The following sorts of instructions have to be noted: 
  • A checkcast, instanceof,anewarray, getstatic,putstatic, or invokestatic instruction causes the class it references to be visited statically, as explained below.
  • A new, getfield,putfield, invokevirtual,invokespecial, or invokeinterfaceinstruction causes the class it references to be visited dynamically, as explained below.
  • The various invoke* instructions also cause the methods they invoke to be traced if they have not already been.
  • If an invokestatic instruction turns out to invoke the java.lang.Class.forName(String) method then we have detected a dynamically-loaded class. If the class being loaded is known (the argument to forName is a constantString) then it is visited statically. Furthermore, its no-arg constructor is traced, because it might be invoked through the newInstance()method in java.lang.Class. (An earlier version only traced the no-arg constructor if a call tonewInstance() was seen within the package, but this was wrong because the Class reference could easily be passed outside the package and newInstance() invoked there.) 

    If the argument to Class.forName is not a constant String, a warning is issued, because JDistill's assumptions about reachability might be violated.

    You will need to use the various -keep-*command-line options if the package being distilled contains classes, or fields and methods within classes, that are only referenced through dynamic loading and reflection.

  • The field-access instructions (getstatic etc) also cause the fields they name to be marked as referenced.
  • The ldc, ldc_w, andldc2_w instructions cause the constants they reference to be marked as referenced.

A class is visited staticallywhen we know that it will be initialized (in the sense of the Java Language Specification, §12.4.1). In this case the class is marked as referenced if it was not already, and if it has a static initializer method (<clinit>) that method's bytecode will be traced.

A class is visited dynamically when we know that there must be at least one instance of it for the program to be correct. In this case we have further information about the class:

  • If it implements the java.io.Serializableinterface and contains one or both of the methods 
        void readObject(java.io.ObjectInputStream in)
        void writeObject(java.io.ObjectOutputStream out)
    
    then those methods are potentially referenced and must be traced.
  • If it overrides methods in an ancestor class that are referenced then the overriding methods must also be considered to be referenced. This is further explained in the next section.
We do not explicitly visit a parent class dynamically when one if its children is visited dynamically, but if the child is really instantiated then we should end up visiting at least one of its constructors, and this constructor will start with an invocation of one of the parent's constructors, which will provoke a dynamic visit. 

Handling overriding methods

Once all of the public entry points and all of the methods they invoke have been traced recursively, a further important way to reach methods must be considered. A method that is never itself directly invoked, but that overrides a method in an ancestor class that is, must be traced, provided that the class it is contained in is actually instantiated. 

Finding indirectly-referenced overriding methods is not completely straightforward. Here is how we tackle it. We loop over all the classes in the package looking for non-staticmethods that are currently not referenced. For each such method, we recursively scan the ancestors (superclasses and superinterfaces) looking for methods with the same name and parameters, which are therefore overridden; we thus construct a set containing the first overridden method found on each path through the ancestor graph. If we find a referenced overridden method, then we mark the overriding method as referenced too and trace it. After we have scanned all the classes in the package and if we found any overriding methods, we repeat the whole process, until no further overriding methods are found.

In this way we are sure to propagate the referencedness all the way down a possibly long chain of overrides, and we are sure to pick up new chains of overrides that start at a method that is itself only referenced in an overriding method.

A more straightforward approach would be possible if we linked classes to all their subclasses and then methods to all their overriders, but building that information would not be much simpler than what we do here.

In the diagram, Tracing indirect method references   the method "0" is the only one explicitly referenced in the program. The methods labeled "1" are in subclasses that override it, and the methods labeled "2" override those. The first time we loop through all the classes we will find that the methods "1" are unreferenced but override a referenced method, so we will mark them referenced and trace them. Then either later in that scan through the classes or on the next scan, depending on the order in which we visit the classes, we will do the same thing for the methods labeled "2".

Renaming fields and methods

JDistill renames fields and methods in the package being shrunk so that they are shorter. Shortening the name of a field or method reduces space not just in the class where it is defined, but in every class that references it. If a member (a field or method) is referenced in a class there will be a single constant naming the class containing the member and the name of the member itself. This constant will be shared between all the references from the same class, so renaming will have most effect when there are many small classes rather than a few big ones. 

Renaming fields

Renaming fields is straightforward. Any field that is not visible can be renamed. JDistill renames fields in a given class starting at A and counting up toZ, then a to z, then_, then longer names starting with these characters but also including the digits 0 to 9. This means that all new names are valid Java identifiers. The specification of class files does not actually require this, but it seems best to avoid possible problems with incorrect implementations that do. 

The only other constraint on chosen field names is that they must not conflict with existing field names in the same class, specifically those that are visible. This is easily done using a hash table that is needed anyway in order to follow field references to the corresponding field.

Renaming methods

Renaming methods is much harder because of overriding. The field names in a class are independent of those in its ancestors and descendants; class files specify exactly which class's field is being referenced. The names of methods in a class on the other hand are intimately tied to the names of methods in its ancestors and descendants. If a method has the same name as a method in an ancestor, then both names must still be the same after renaming. Contrariwise, if the name of a method is different from the name of another method in the same class or in any ancestor or descendant, then it must remain different after renaming. 

Overloaded methods

Overloading also complicates method renaming slightly. Where we talked about names being different in the previous paragraph, we should really have talked about signatures. That is, the combination of name and parameter types must be the same or different according as there is an overriding relationship between two methods. 

Methods that must be renamed together

Before we can rename any methods, we must group together all methods that have an overriding relationship so that they will all continue to have the same name after renaming. This is straightforward. With every method in every class, we associate a set of other methods. This is usually a set of methods that must have the same name; but if a method must not be renamed, either because it is an entry point or because it has an overriding relationship with an entry point, then it will belong to a special set called the fixed-name set

To construct the sets of methods we proceed as follows. We visit the classes in ancestor-first order. For each method in each class we visit, we first check if it can be renamed at all. If not, it is added to the fixed-name set. If so, it is added to a newly-created set. Then, for every method that the method we are visiting overrides, we merge together the sets for the overriding and overridden methods. In this way, if any method in the group of methods that must have the same name as the current method turns out not to be renameable, all of the methods in the group will end up in the fixed-name set. If all methods in the group can be renamed then they will all end up in the same set.

Finding new names

We can consider finding new method names as a graph-colouring problem. In the graph, the nodes are methods and an edge joins two nodes if the corresponding methods must not have the same name, because they have the same parameters and either they are in the same class or one of them is in an ancestor class of the other. (Methods for which this is true and that already have the same name already have an overriding relationship and can be coalesced into the same node.) The graph is actually a collection of disjoint graphs, one for every possible parameter signature. These graphs can be considered to be coloured when we start, where the colours are method names and no two nodes connected by an edge can have the same colour. What we want to do is to recolour those nodes where this is allowed so as to use more efficient colours (shorter names). 

Building this graph would be expensive and unnecessarily complicated. Instead we use some heuristics to achieve a result that is theoretically non-optimal but will in fact almost always be as good as optimal (not use longer names than necessary). The heuristics are:

  1. We never recolour with a colour that was already in the graph before we started. This means that we only have to check the colours of adjacent nodes we have already coloured. (Translation: we build a big hashtable of every signature of every method and avoiding renaming a method to something that gives it the same signature as one of these.)
  2. We always visit the methods of a class before those of any of its descendants. This means that we can find the adjacent nodes already coloured by searching upwards in the inheritance dag. There are currently no downward links in it.

The reason this is nearly always as good as real colouring is that the short names we will be using as new colours hardly ever exist already as method names, so our first heuristic rarely makes any difference. The exception is when some but not all of the classes being distilled were generated by programs that make short names (such as JDistill itself). In that case we will typically find that we can't make those names any shorter than they were already, and we can shorten names in the non-generated classes less than we would be able to with an optimal algorithm.

Shortening array initializations

As we mentioned earlier, the standard JDK Java compiler from Sun takes source code like: 
    static final int[] powers = {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024};
and generates bytecode almost as if you had written: 
    static final int[] powers;
    static {
    powers = new int[10];
    powers[0] = 1;
    powers[1] = 2;
    ...
    powers[10] = 1024;
    }
More exactly, the code it generates in the static initializer<clinit> looks like this (the comments show what is on the stack after each instruction): 
    bipush 11       // 11  (size of array)
    newarray int    // int[]
    dup         // int[], int[]
    iconst_0        // int[], int[], 0
    iconst_1        // int[], int[], 0, 1
    iastore     // int[]  (now array[0] == 1)
    dup
    iconst_1
    iconst_2
    iastore
    ...
    dup         // int[], int[]
    bipush 10       // int[], int[], 10
    sipush 1024     // int[], int[], 10, 1024
    iastore     // int[]  (now array[10] == 1024)
    putstatic powers    // (empty; now powers == the initialized array)
    return
In cases like this where the range of values is relatively small, we can instead initialize the array as if the code had been this: 
    static final int[] powers;
    static {
    powers = new int[11];
    final String powersString =
        "\u0001\u0002\u0004\u0008\u0010\u0020\u0040\u0080\u0100\u0200\u0400";
    for (int i = 0; i < 11; i++)
        powers[i] = powersString.charAt(i);
    }
That is, we construct a String where each character is a value to be put in the array, and extract those values when the array is created. This operation is slower than the original unrolled bytecode, which is why we only do this operation in static initializers; since they are only executed once presumably the small extra time will make no difference. (It would be useful to have an option to perform this operation everywhere big arrays are seen, notably in constructors, despite the small speed penalty; but currently there is no such option.) 

The bytecode we construct to unpack the values from the String into the array looks like this:

    bipush 11       // 11  (size of array)
    newarray int    // int[]
    bipush 10       // int[], 10 (which is the array index i)
  loop:
    dup2        // int[], i, int[], i
    dup         // int[], i, int[], i, i
    ldc "\u0001..." // int[], i, int[], i, i, "\u0001..."
    swap        // int[], i, int[], i, "\u0001...", i
    invokevirtual String.charAt(int)
            // int[], i, int[], i, <character at i>
    iastore     // int[], i  (now array[i] has been set)
    iconst_1        // int[], i, 1
    isub        // int[], i-1  (new value of i for next iteration)
    dup         // int[], i, i
    ifge loop       // int[], i
    pop         // int[]
    putstatic powers    // (empty; now powers == the initialized array)
    return
We do all the operations on the stack rather than using local variables, which would yield more straightforward bytecode, because we don't want to have to allocate a local variable or try to determine if one is free at the point where we are inserting this code. Similarly, we add a constant to the currently-declared maximum stack depth rather than computing what its real value is after the introduction of the new code, because we have no need to know anything about the stack usage of instructions anywhere else in the distiller. 

If some of the array values are negative or greater than 65535 then they will not fit in the char values of aString. In this case, provided that the spanof values is not more than 65535, we can encode in the same way but add a (possibly negative) constant to each value.

String constants are encoded in class files usingUTF8 encoding, which has the property that a character cwill be encoded using:

                  
1 byte,if 0 < c < 128;
2 bytes,if c = 0 or 127 < c < 2048;
3 bytes,otherwise (2047 < c < 65536)
So in fact we try to determine if adding a constant to each value in the encoded string will make for a string that takes up less space in the class file, even if the span of values is such that this is not strictly necessary. There are only two possible values that we may want to add: either the smallest value in the array (so that values near that smallest value are encoded in the efficient 1 - 128 range), or one less than the smallest value in the array (for the same reason, but additionally the smallest value itself is encoded as 1 rather than the less efficient 0). These are the only two possibilities because we can't add a number biggerthan the smallest value in the array (since characters in theString cannot be negative), and the fact that the cost function does not decrease except at 1 means that there is no point in adding a number smaller than the smallest value in the array minus 1. We try both possibilities mentioned to see if either of them makes the encoded UTF8 sufficiently smaller to compensate for the extra instructions needed to add the constant. 

After computing the optimal String encoding for the array, we estimate how much space it will take up in the class file, taking into account the instructions in the loop and the fact that it may be necessary to add a reference tojava.lang.String.charAt(int). If it looks as if the rewritten code will be more expensive than the original, then no change is made. The calculations here are not completely accurate so sometimes we may fail to save space when we could have, but the loss is small, typically a dozen bytes or so, and at worst roughly 50.

Why you might not want to shrink arrays

The Java Language Specification guarantees (http://www.javasoft.com/docs/books/jls/html/3.doc.html#101083