7 Replies Latest reply on Mar 5, 2009 3:52 PM by pietblok

    Runtime Compilation to Increase Performance

    807588
      Hi everyone,

      I've been struggling with this problem for a few days now and would be glad if you could help. I'm working on a neural networks project. I've written a GUI to edit and train neural networks for a certain task. To be able to play with the network architecture flexibly, I've used OOP concepts generously. But once the network architecture is set I no longer need this flexibility, and I want to compile a specific class that has the same functionality as that network, where all the computation is carried out over local variables (I even made them static) instead of array elements and object references. I've successfully written code that generates a string in which loops are unrolled, all the if and switch statements are resolved, and the variables that are not supposed to vary at this point are inserted as constants at (runtime) compile time. This should speed things up, right? Especially considering all the array bounds checking and other work Java is doing.
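As a rough illustration of the kind of generator described above, here is a minimal sketch that emits straight-line Java source for a single neuron. The class name `UnrolledSourceGenerator`, the `tanh` activation, and the choice to inline the weights as literals are assumptions made for the example, not details taken from the post.

```java
/** Hypothetical generator: emits loop-free Java source for one neuron's weighted sum. */
final class UnrolledSourceGenerator {

    /**
     * Builds the source of a static method computing
     * tanh(bias + w0*x0 + w1*x1 + ...) with bias and weights inlined as literals.
     * Assumes all values are finite (NaN/Infinity are not valid Java literals).
     */
    static String generate(String methodName, double[] weights, double bias) {
        StringBuilder src = new StringBuilder();
        src.append("static double ").append(methodName).append("(");
        for (int i = 0; i < weights.length; i++) {
            if (i > 0) {
                src.append(", ");
            }
            src.append("double x").append(i);
        }
        src.append(") {\n");
        src.append("    double sum = ").append(Double.toString(bias)).append(";\n");
        // The loop runs in the generator; the emitted code has no loop and no array access.
        for (int i = 0; i < weights.length; i++) {
            src.append("    sum += ").append(Double.toString(weights[i]))
               .append(" * x").append(i).append(";\n");
        }
        src.append("    return Math.tanh(sum);\n");
        src.append("}\n");
        return src.toString();
    }
}
```

Calling `UnrolledSourceGenerator.generate("neuron0", new double[] {0.5, -1.25}, 0.1)` yields a method body with one `sum += ...` line per input, which is the shape of code the rest of the thread is about.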

      When I compile this string, everything is fine and the mathematical functionality is perfectly reproduced. The problem is that for small networks the code runs twice as fast, as I would expect, but for large networks it becomes something like 4 times slower! For a typical large network the resulting binary is 80k, so I wouldn't expect it to fill the cache (or should I?). Anyway, even if it did fill the cache, that shouldn't cause a factor-of-8 swing (from twice as fast to 4 times slower). The slowdown looks more like the difference between JITted and interpreted code. I suspect that since that single method is 80k long it might not be JITted (sorry if I annoy you with that crappy term). If you think this is the case, is there any way to bypass that? Or what do you think the reason could be? I already tried decreasing the number of method calls before compilation to 1 (with the command line argument -XX:CompileThreshold=1) just to see, but it didn't help.

      I also converted the resulting string to C++, and it runs 4 times faster than the unrolled original method (even for large networks). (It also serves as a nice test showing that Java is half as fast as C++ on such an arithmetic- and trigonometry-intensive task.)

      I would be so glad if you could help me attain, in the general case, the x2 performance I get with the small networks. And I would be super duper glad if you could suggest a way to compile it at runtime in C++ and attach it to my program through JNI. Runtime compiling C++ code sounds messy, but I keep some hope since the operations are purely mathematical and don't need anything platform-specific.

      Thanks a lot

      Edited by: enobayram on Mar 5, 2009 5:51 AM
        • 1. Re: Runtime Compilation to Increase Performance
          JoachimSauer
          You are trying to out-smart the HotSpot engine of the JVM.

          Since many, many very, very smart people have hacked on that for quite some time, I think that you'll have a very hard time out-smarting it.

          Also: runtime compiling C++ code doesn't sound a lot more messy than runtime-compiling Java code. The only difference is that you'll have to think of different platforms.

          [This list of JVM options|http://blogs.sun.com/watt/resource/jvm-options-list.html] can give you some hints how to help the JVM optimize your code or at least tell you what gets optimized when and how. (Also check the [HotSpot VM Options|http://java.sun.com/javase/technologies/hotspot/vmoptions.jsp]).
          • 2. Re: Runtime Compilation to Increase Performance
            807588
            Thanks for your answer, I was thinking about the same thing, but I am not exactly trying to out-smart it. During that runtime compilation I know much more than the VM does. I know that the neuron axon function family is constant, so I bypass a switch statement. I know that function flatness will not change after that point so I insert it as a constant. I also avoid using arrays since I know how many elements there should be, and I can use individual local variables instead. This also explains the speed increase in small networks and also in the C++ experiment.

            I've checked those options, and the only relevant one I could find is that -XX:CompileThreshold. By the way, why do you think runtime compiling C++ is less messy? With Java everything needed is included in the standard library: the JavaCompiler class and the ClassLoader class are sufficient. With C++ I guess I would have to invoke some C++ compiler through a system call so that it generates a .dll (or .so), then I would have to load it into memory and interface it to my program. It could get even messier if I tried to recompile it when the network changes, as then I would have to unload the dll and rebuild it. If you had an easier way in mind I would be so glad to hear it.
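For readers unfamiliar with the standard-library route mentioned above, the following is a minimal sketch of compiling a source string at runtime with `javax.tools.JavaCompiler` and loading the result through a class loader. It assumes the program runs on a JDK (a bare JRE has no system compiler); the names `RuntimeCompileSketch`, `compileAndLoad`, and `GeneratedNet` are hypothetical, and writing to a temporary directory is just one simple way to do it.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class RuntimeCompileSketch {

    /** Compiles `source`, which must declare top-level class `className` in the default package, and loads it. */
    static Class<?> compileAndLoad(String className, String source) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run on a JDK");
        }
        Path dir = Files.createTempDirectory("generated");
        Path file = dir.resolve(className + ".java");
        Files.write(file, source.getBytes(StandardCharsets.UTF_8));

        // run() returns 0 on success; null streams default to System.in/out/err.
        int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
        if (status != 0) {
            throw new IllegalStateException("Compilation failed with status " + status);
        }
        // A fresh loader per compilation lets a regenerated class replace an older one.
        // The loader is deliberately left open so the class can resolve dependencies later.
        URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() });
        return Class.forName(className, true, loader);
    }

    /** Compiles a tiny stand-in for a generated network and evaluates it once. */
    static double demo() throws Exception {
        String source =
            "public class GeneratedNet {\n" +
            "    public static double eval(double x0, double x1) {\n" +
            "        return 2.0 * x0 + 3.0 * x1;\n" +
            "    }\n" +
            "}\n";
        Class<?> cls = compileAndLoad("GeneratedNet", source);
        Method eval = cls.getMethod("eval", double.class, double.class);
        return (Double) eval.invoke(null, 1.0, 2.0);
    }
}
```

Reflection is used here only to keep the sketch short. In a performance-sensitive loop it would probably be better to have the generated class implement an interface known to the host program and call it through that interface.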
            • 3. Re: Runtime Compilation to Increase Performance
              JoachimSauer
              enobayram wrote:
              Thanks for your answer, I was thinking about the same thing, but I am not exactly trying to out-smart it. During that runtime compilation I know much more than the VM does. I know that the neuron axon function family is constant, so I bypass a switch statement. I know that function flatness will not change after that point so I insert it as a constant. I also avoid using arrays since I know how many elements there should be, and I can use individual local variables instead.
              I think the JVM actually notices some of those facts.
              This also explains the speed increase in small networks and also in the C++ experiment.

              I've checked those options, and the only relevant one I could find is that XX:CompileThreshold.
              I think I saw an option that lets the JVM dump the optimized code it produced. That would probably be useful when trying to find out what the HotSpot compiler actually does to your methods. I can't remember what it was called, though.
              By the way, why do you think runtime compiling C++ is less messy?
              I didn't say less messy. I said about equally messy.
              • 4. Re: Runtime Compilation to Increase Performance
                pietblok
                single method is 80k long
                Sorry that I'm replying on a subject I haven't the faintest idea about. But I remember having read somewhere, sometime, in one of these forums that there is a maximum size of 64 KB for a method.

                If this reply is utter nonsense, please ignore.

                Piet
                • 5. Re: Runtime Compilation to Increase Performance
                  807588
                  Amazing!

                  You were absolutely right. I've just checked whether 64 kB is significant by incrementally growing and compiling the network, and there is a sudden breakpoint around 64 kB (from twice as fast to 4 times slower). I will try to break the method down into a few pieces to ensure a smaller size for each one and see if the same network performs better.

                  Thanks a lot!
                  • 6. Re: Runtime Compilation to Increase Performance
                    807588
                    I've just modified my string generator to start a new method to finish the job whenever a method exceeds a certain size, and the same network that ran at 1/4 speed is back to x2! Thanks a lot pietblok!
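The splitting fix described above might look roughly like the sketch below: the generator cuts the straight-line statements into several methods, and each one calls the next as its last step. The class `ChunkedMethodEmitter`, the `partN` naming, and the use of a statement count as a stand-in for bytecode size are assumptions for illustration; the post does not say how the size was measured. For background, the class file format requires each method's bytecode to stay under 65536 bytes, and HotSpot by default also declines to JIT-compile very large methods (controlled by the `DontCompileHugeMethods` flag), so smaller pieces may help on both counts.

```java
import java.util.List;

/** Hypothetical emitter: splits straight-line statements across chained static methods. */
final class ChunkedMethodEmitter {

    /**
     * Emits methods part0(), part1(), ... holding at most maxStatements statements each.
     * Every part except the last ends by calling the next part.
     * The statements must work on static fields, since locals do not survive across methods.
     */
    static String emit(List<String> statements, int maxStatements) {
        if (maxStatements < 1) {
            throw new IllegalArgumentException("maxStatements must be at least 1");
        }
        // Ceiling division; an empty statement list still yields one (empty) method.
        int parts = Math.max(1, (statements.size() + maxStatements - 1) / maxStatements);
        StringBuilder src = new StringBuilder();
        for (int p = 0; p < parts; p++) {
            int from = p * maxStatements;
            int to = Math.min(statements.size(), from + maxStatements);
            src.append("static void part").append(p).append("() {\n");
            for (String statement : statements.subList(from, to)) {
                src.append("    ").append(statement).append('\n');
            }
            if (p + 1 < parts) {
                // Hand over to the next chunk to finish the job.
                src.append("    part").append(p + 1).append("();\n");
            }
            src.append("}\n");
        }
        return src.toString();
    }
}
```

A caller would then invoke only `part0()`. A real generator would want to pick the threshold by measuring the compiled method size rather than counting statements.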

                    I would normally agree with your arguments, Joachim, but a x8 performance difference calls for a clear explanation. Thanks anyway for your quick answers.

                    Edited by: enobayram on Mar 5, 2009 7:07 AM
                    • 7. Re: Runtime Compilation to Increase Performance
                      pietblok
                      Wow. And I really hesitated to reply at all, having no knowledge whatsoever on the subject.

                      Piet