Just wrapped up my last performance tuning course for this year and, for the second time running, some members of my Parisian group had the opportunity to run the exercises on virtualized hardware. Granted, the underlying hardware wasn't quite top shelf, but then the hardware hadn't been virtualized in past offerings either. And the point is to learn how to diagnose Java performance bottlenecks; in that context, the OS and hardware aren't so important.
I say that some of the group used virtualized hardware because the rest of the group (wisely) chose to use their own laptops. The mix of hardware that has been used for the exercises includes all of the usual suspects: MacBook Pro, Dell, Sony, HP, Lenovo, Siemens, and a bunch of other brands. Operating systems included XP, Windows 7, OSX, and Linux. I've also run the benchmarks on Solaris (just to round out the list). At the end of the exercise, the tradition is to share results.
The first benchmark is single threaded and, as it turns out, it surprisingly stresses the thread scheduler while trying to use the RTC. The results of the bench are very predictable: XP completes the 1 second unit of work in 1.8-1.9 seconds, while OSX and Linux run in about 1.1-1.2 seconds. Out of the box, Solaris runs in about 10.100 seconds; set the clock resolution kernel parameter to 1ms and the unit of work runs in about 1.150 seconds. Windows 7 runs in just over 1.000-1.010 seconds. Note that the results are entirely dependent upon the OS: reboot the same hardware into a different OS and the benchmark results change accordingly.
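The course benchmark itself isn't reproduced here, but a minimal sketch of this style of test (my own reconstruction, not the actual exercise code) shows why the numbers track the OS timer resolution: a "1 second unit of work" built from a thousand 1ms sleeps can only complete as fast as the scheduler tick allows, so a coarse tick inflates the total.

```java
public class SleepBenchmark {
    // Runs 'iterations' sleeps of 1 ms each and returns the elapsed
    // wall-clock time in ms. Each sleep(1) can only wake on a scheduler
    // tick, so the total exposes the OS timer resolution.
    static long run(int iterations) throws InterruptedException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Thread.sleep(1);   // requests 1 ms; actual wait rounds up to a tick
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // A nominal "1 second unit of work": 1000 x 1 ms sleeps.
        System.out.println("elapsed: " + run(1000) + " ms");
    }
}
```

On an OS with a ~1ms tick this lands close to the requested 1000ms; with a coarser tick (the old 10ms Solaris default, or XP's timer granularity) the same loop takes several times longer, which matches the spread above.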
The second benchmark tries to measure the multi-threaded throughput of various Map implementations. As you can imagine, the benchmark is sensitive to the number of cores along with the choice of OS. Again, XP scores poorly, whereas Linux and Mac OSX run neck and neck with Windows 7 leading the way for the synchronized collections. Linux, OSX and Windows 7 run neck and neck when benching the java.util.concurrent or non-blocking (Cliff Click's NonBlockingHashMap) maps.
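Again, this is a sketch of the general technique rather than the course's actual benchmark: hammer a shared map from one thread per core for a fixed interval and count completed operations. Here I compare a synchronized HashMap against ConcurrentHashMap; the NonBlockingHashMap would slot into `measure` the same way.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class MapThroughput {
    // Hammers the map from 'threads' threads for roughly 'millis' ms and
    // returns the total number of put/get pairs completed.
    static long measure(Map<Integer, Integer> map, int threads, long millis)
            throws InterruptedException {
        AtomicLong ops = new AtomicLong();
        long deadline = System.nanoTime() + millis * 1_000_000;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                int i = 0;
                while (System.nanoTime() < deadline) {
                    map.put(i & 1023, i);   // small key range keeps threads contending
                    map.get(i & 1023);
                    i++;
                    ops.incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return ops.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("synchronized: "
                + measure(Collections.synchronizedMap(new HashMap<>()), cores, 1000));
        System.out.println("concurrent:   "
                + measure(new ConcurrentHashMap<>(), cores, 1000));
    }
}
```

The synchronized map serializes every operation behind one lock, so its throughput degrades as cores (and scheduler pressure) go up, while ConcurrentHashMap's striped locking scales much better.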
The fun comes when any of these benches are run virtualized on any of these operating systems: the change in results is astounding. The 1.200 second run time balloons to 2.3-2.8 seconds. Throughput on the other benchmark drops from tens of millions of operations per second to a couple of million for the synchronized collections. The results are not as devastating for the non-blocking implementation.
My guess is that, in the first bench, virtualization is adding additional stress to the already stressed thread scheduler. I've not thought through why the second bench is so adversely affected. Bottom line: running these benches in a virtualized environment doesn't paint a very pretty performance picture. After running the benchmarks, the group started trading their virtualization war stories. My response: you can't virtualize yourself into more hardware, but you can certainly virtualize yourself out of it. That said, I still like virtualization.