By now, you are hopefully well aware that Glassfish 3.1 has been released. Because the performance group has been a little quiet lately, maybe you're thinking there aren't a lot of interesting performance features in this release. In fact, there are two key performance benefits: one which benefits developers, and one which is important for anyone using Glassfish's new clustering and high-availability features.

Let's start with developers. One of our primary goals has always been to make the development experience fast and lightweight. This was of course a key factor driving the modularization of Glassfish V3; in V3, we finally had an architecture that allowed the server to load only those Java EE features that a developer was actually using. And the results were quite satisfactory. Given all our previous progress, what -- if anything -- could we actually do in Glassfish V3.1?

Developer Metrics Improve by 29%

With a lot of hard work and a laser-like focus by our development team, we managed to improve our core metric of a "developer scenario" by 29%. This scenario includes starting the appserver, deploying a JSP-based application and accessing its URL, and then a cycle of three changes to the application, each followed by a test of the application URL. We aggregate the entire time for that scenario as our primary metric, but the table below shows the improvements in each of these areas:

                        
BUILD            STARTUP   DEPLOY   REDEPLOY AVERAGE
Glassfish V3.1   2.96      3.14     0.9
Glassfish V3.0   3.28      4.35     1.59

As you can see, our improvement here is across the board for all the activities that make up the development cycle. Well, most of them: we haven't figured out how to automatically find bugs in your program, so your testing will take the same amount of time. But at least it will be a little more pleasant, since your redeployment cycle will be that much faster in Glassfish V3.1.

Let me mention here too why we test the entire cycle of development and not just startup -- particularly because I've seen some recent blogs touting only startup performance of other now-modularized Java EE servers. In the past, we've made the mistake of focusing solely on startup performance; in the Sun Java System Application Server 9.1, we introduced a quick-start feature which improved startup quite significantly. The problem is that it just deferred necessary work until the first time the server was accessed, and the total time it took to start the server and then load its admin console got worse (in fact, general performance of anything socket-related suffered because the architecture of that old server didn't easily support deferring activities). In the end, pure startup isn't what is important -- what's important is how quickly you can get all of your work done. Otherwise, we'd all do what tomcat-based servers do for startup: return to the command prompt immediately, before the server is up, to make it look like startup is immediate. Of course, if you immediately access the server at that point, you'll get an error because it hasn't finished initializing, but hey, at least it started fast.

HA Performance Improves by 33%

On the other end of the spectrum, Glassfish V3.1 contains some quite impressive improvements in its high-availability architecture. This is somewhat old news: when we did project Sailfin for our SIP server, we re-architected the entire failover stack to make it faster and more scalable. But although Sailfin is based on Glassfish, Glassfish V3 didn't yet support clustering at all. V3.1 is the first time we've been able to bring that architectural work forward into the main Glassfish release.

In Glassfish V3.1, we support in-memory replication: one server is a primary server and holds the session object. Whenever the session object is modified, the data is sent to a secondary server elsewhere in the cluster so that if the primary server fails, the secondary can supply the session information. This is actually a fairly common implementation of high availability, though of course it does not address the situation of multiple failures. Still, the speed benefits you get from replicating to another server (vs. replicating to something like HADB) are quite significant. We introduced in-memory replication in SJSAS 9.1, and at the time had a nice performance gain compared to traditional database replication.

In Glassfish V3.1, we've taken that architecture and optimized it significantly; it is now based on the scalable grizzly adapter, which uses Java NIO for its underpinnings. We've also optimized the session serialization and the general implementation to get a 33% improvement in performance for in-memory replication in Glassfish V3.1 compared to in-memory replication in SJSAS 9.1.1. And again we've tried to pay attention to all aspects of how it might be used. We support full session scope, where the entire HTTP session and stateful session bean (SFSB) state are replicated on each request. We also support modified-attribute scope, where, for HTTP sessions, only those attributes in the session that have been marked as changed are replicated. Clearly the modified-attribute scope will perform better, but it does rely on the application calling setAttribute() to mark the attribute as having been modified (which, while not a standard part of the Java EE specification, is the common technique adopted by virtually all Java EE servers).
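To illustrate that modified-attribute idiom, here is a minimal sketch of the pattern applications typically use: after mutating an object stored in the session, call setAttribute() again so the container knows the attribute is dirty. The servlet, attribute name, and Cart class below are all made up for illustration; they are not from any GlassFish test.

import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class CartServlet extends HttpServlet {

    // Stand-in for whatever serializable state your application keeps in the session.
    public static class Cart implements Serializable {
        private final List<String> items = new ArrayList<String>();
        public void add(String item) { items.add(item); }
        public int size() { return items.size(); }
    }

    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(true);
        Cart cart = (Cart) session.getAttribute("cart");
        if (cart == null) {
            cart = new Cart();
        }
        cart.add(req.getParameter("item"));
        // Re-setting the attribute marks it as modified; under modified-attribute
        // scope, only this attribute is replicated to the backup instance.
        session.setAttribute("cart", cart);
        resp.getWriter().println("Items in cart: " + cart.size());
    }
}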

In practical terms, the improvement we see holds for both kinds of tests. Like all our performance measurements, we take some basic workloads that mimic a typical web or EE application and see how many users we can scale up to such that the response time for the requests stays within some limit (typically 0.8 seconds), often with some CPU limit on the machine (e.g., we don't want to use more than 60% of the CPU because, in the event of a failure, the remaining instances must take on more load). HTTP-only, HTTP and EJB, modified attribute, full session: all see about a 33% increase in the number of users we can support while still meeting that 0.8 second response time.

General Performance

Of course, we haven't neglected general performance either; we've run our usual battery of tests to ensure that Glassfish V3.1 hasn't regressed in performance in any area. The Performance Tuner is back in Glassfish V3.1 to help optimize your production environment so that you get the very best performance Glassfish has to offer. And of course Glassfish V3 remains the only open-source application server to submit performance results on SPECjAppServer 2004.

For most of the year, I've been working on session replication code for Sailfin. When I came back to work with the Glassfish performance team, I found that we had some pretty aggressive goals around performance, particularly considering that Glassfish V3 had a completely new architecture, was a major rewrite of major sections of code, and implements the new Java EE 6 specification. Glassfish V3 in those terms is essentially a .0 release, and I was convinced we'd see major performance regressions from the excellent performance we achieved with Glassfish V2.

Color me surprised; in the end, we met or exceeded all of our goals for V3 performance. For the most part, our performance tests are based on customer applications, industry benchmarks, and other proprietary code that we can't open source (nor share results of). But I can discuss some of those tests, and in this blog we'll look at our first set of sanity tests. These test the basic servlet operations of the web container; we'll look at three such tests:

  1. HelloServlet -- the "Hello, world" of servlets; it simply prints out 4 lines of HTML in response to each request, incrementing a global counter to keep track of the total number of requests (a minimal sketch of this servlet appears after the list).
  2. HelloServletBeanJsp -- that same servlet, but for each call it instantiates a simple JavaBean and then forwards the request (and bean) to a JSP page (i.e., the standard MVC model for servlets).
  3. HelloSessions -- the hello servlet that keeps track of a session counter (in a session attribute) instead of a global counter.
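For a concrete reference point, here is a sketch of what a servlet like the one in test 1 could look like. This is illustrative only -- the actual test code isn't published, and the AtomicInteger counter is my assumption about how the "global counter" might be kept.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.atomic.AtomicInteger;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloServlet extends HttpServlet {
    // Application-wide request counter.
    private static final AtomicInteger count = new AtomicInteger();

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        int n = count.incrementAndGet();
        PrintWriter out = resp.getWriter();
        out.println("<html><body>");
        out.println("<h1>Hello, world</h1>");
        out.println("<p>Request number " + n + "</p>");
        out.println("</body></html>");
    }
}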

Our goal here is that V3 would be at least as fast as V2 on these tests and remain ahead of the pack among open source application servers. The application servers are hosted on a Sun X4100, which is a 4 core (2 chip) AMD box running Solaris 10. The load is driven using the Faban HTTP Benchmarking program (fhb), which can drive load to a single URL from multiple clients (each client running in a separate thread -- an important consideration in a load generator). As a first pass, we run 20 users with no think time to see how much total load we can generate in the server:
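For reference, an fhb invocation for that kind of first pass looks roughly like this. The flag names here are from memory and should be treated as an assumption -- check fhb's usage output before copying them -- and the URL is just a placeholder:

% fhb -c 20 -r 60/300/60 http://appserver:8080/test/HelloServlet

Here -c is the number of client threads (the 20 users) and -r gives ramp-up/steady-state/ramp-down times in seconds; with no think-time option, the clients issue requests back to back.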

I've normalized the chart to V2 performance. And what we see is that even on the simplest test -- the HelloServlet -- V3 manages to increase the total server throughput by a few percentage points. And while I was concerned about the effects of a new architecture, the OSGi classloading architecture and the reworking of the glassfish classloading structure meant that we could take care of a long-standing issue in the V2 classloader -- so now every time we call Beans.instantiate() (or do anything else related to class loading), we can operate much more quickly. When it comes to session management, V2 and V3 come out the same.

The other columns in the chart represent jBoss 5.1 and Tomcat 6.0.20; our goal was to beat those containers on these tests, and we did. However, you might take that with somewhat of a grain of salt, as I am not an expert in those containers, and there are possibly container tunings that I might have missed for those. In fact, these tests are done with a small amount of tuning:

  • JVM options for all products are set to -server -Xmx2000m -Xms2000m -XX:NewRatio=2 -Xss128k. Using Sun's JDK 6 (6U16) means that ergonomics will kick in and use the parallel GC collector with 4 threads on this machine.
  • The thread pool size for all products is set to 10 (both min and max; I'm not a fan of dynamically resizing threadpools).
  • The server will honor all keep alive requests (fhb specifies this automatically) and allow up to 300000 requests from a single client before closing the socket (maxKeepAliveRequests)
  • The server will use 2 acceptor threads and a backlog of 300000 requests (that tuning is really needed only for the scalability test discussed below)
  • For jBoss, I followed the recommendation to use the Tomcat APR connector. As far as I can tell, Netty is not integrated into jBoss 5, though if you know otherwise, I'd love a link to the details.
  • For tomcat, I used the Http11NIOProtocol connector
  • In the default-web.xml for JSPs, genStrAsCharArray is set to true and development is set to false

Happy with a simple throughput test, I proceeded to some scalability tests. For these tests, we also use fhb -- but in this case we run multiple copies of fhb, each with 2000 users and a 1 second think time. This allows us to vary the number of users and test within a pre-defined response time (which is at most 1 second, or the client will fall behind the desired think time). The number of connections that we can run at each test will vary depending on the work -- the HelloServlet test had an initial throughput of almost 42,000 operations per second, and so we were able to test to 56,000 connected users with 28 copies of fhb (which we distributed among 7 x4100 machines; each core essentially running 2000 users). The test involving forwarding to a JSP does almost twice the work, and we can only run 32,000 users within these timing constraints; for the session tests we can run 40,000 users.

Here are the results: despite all my initial qualms, V3 has performed admirably; it handled those 56,000 simultaneous clients without breaking a sweat. [Well, if a CPU can sweat, it might have -- it was quite busy. :-)] There are no results from tomcat or jBoss for this test; both failed in the configurations I had with this large number of users. In fact, they failed with even smaller numbers of users; I didn't test below 10,000, but neither could handle even that much load. Again, this is certainly possibly due to my lack of knowledge about how to configure those products, though I'm not convinced of that -- tomcat failed because it had severe GC problems caused by a finalizer in the org.apache.tomcat.util.net.NioBlockingSelector$KeyReference class, and jBoss failed because of severe lock contention around a lock in the org.apache.tomcat.util.net.AprEndpoint class. Still, there might be a workaround for both issues.

At any rate, I'm a happy camper today: glassfish V3 is going out the door with excellent performance characteristics, thanks to lots of hard work along the way by the engineering community -- thanks guys!

Avid readers of the glassfish aliases know that users frequently ask why their server isn't responding, or why it is slow, or how many requests are being worked on. And the first thing we always say is to look at the jstack output.

So you're running on a Sun 5120 with 128 hardware threads, and so you have 368 request processing threads, and the jstack output is 10K lines. Now what? What I do is use the program attached to this blog entry -- it will parse the jstack output and show how many threads are doing what. At its simplest, it's something like this:

% jstack pid > jstack.out
% java ParseJStack jstack.out
[Partial output...]
Threads in state Running
        8 threads in java.lang.Throwable.getStackTraceElement(Native
Total Running Threads: 8
Threads in state Blocked by Locks
        41 threads running in com.sun.enterprise.loader.EJBClassLoader.getResourceAsStream(EJBClassLoader.java:801)
Total Blocked by Locks Threads: 41
Threads in state Waiting for notify
      39 threads running in com.sun.enterprise.web.connector.grizzly.LinkedListPipeline.getTask(LinkedListPipeline.java:294)
        18 threads running in System Thread
Total Waiting for notify Threads: 74
Threads in state Waiting for I/O read
        14 threads running in com.acme.MyServlet.doGet(MyServlet.java:603)
Total Waiting for I/O read Threads: 14

The parser has aggregated all the threads and shown how many are in various states. 8 threads are currently on the CPU (they happen to be doing a stack trace -- a quite expensive operation which is better to avoid). That's fine -- but we probably want it to be more than that.

41 threads are blocked by a lock. The summary method shown is the first non-JDK method in the stack trace; in this case it happens to be the glassfish EJBClassLoader.getResourceAsStream. Now we need to go actually look at the stack trace, search for that class/method, and see what resource the thread is blocked on. In this example, all the threads were blocked waiting to read the same Zip file (really a Jar file), and the stack traces for those threads show that all the calls came from instantiating a new SAX Parser. If you didn't know, the SAX parser used by a particular application can be defined dynamically by listing the resource in the manifest file of the application's jar files, which means that the JDK must search the entire class path for those entries until it finds the one the application wants to use (or until it doesn't find anything and falls back to the system parser). But since reading the jar file requires a synchronization lock, all those threads trying to create a parser end up contending for the same lock, which is greatly hampering our application's throughput. It happens that you can set a system property to define the parser and hence avoid the dynamic lookup altogether (-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl will always default to the JDK parser).

But the larger point here is that when you see lots of threads blocked on a resource, that's a problem that is throttling your throughput, and hence, whatever the resource is, you need to make changes to your configuration or application to avoid that contention.

What about the threads in notify? Those threads are waiting to be woken up. Usually they are threads in a pool waiting for notification that a task is ready (e.g., the getTask method shows grizzly threads that are waiting for a request). System threads are doing things like RMI distributed GC or JMX monitoring -- you'll see them in the jstack output as threads that have only JDK classes in their stack.

But another problem creeps up in the threads waiting for I/O read -- these are threads that are doing a blocking I/O call (usually socketRead0 -- not accept, which will show up in the jstack parsing as waiting for I/O accept). This is something else that is hampering your throughput -- the thread is waiting for a backend resource to answer its request. Most often this is a database -- again, you can look at the full stack in the output to see exactly what the call is. Maybe your app is making the same call to a database over and over; if that keeps showing up, it's time to optimize the SQL or tune the database so that call happens faster.

There are some other states in the output I haven't shown -- threads doing accept, or sleeping, or GC threads. But 95% of the time, if you aren't getting the throughput out of your server that you expect, it's because the threads are blocked on I/O read or on a lock.

There are two other things you can do with this tool: first, you can list multiple file names on the command line and the tool will aggregate them. So if you take a bunch of jstacks in a row, you can parse all of them and get a better idea of where you are blocked. And second, you can use the -name argument to limit the output to certain threads. For example, in glassfish the grizzly http-request-processing threads are named httpSSLWorkerThread-<PORT#>-<Thread#>. If you run the parser with -name WorkerThread-80, you'll get only those grizzly threads handling requests on port 80. [In glassfish v3, that naming is slightly different; it is presently http-listener-<listener#>-(<Thread#>), though it will likely be changed to the name of the thread pool instead of the name of the listener -- but it's pretty simple to look at the jstack output and figure out the naming convention in use.]
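For example, aggregating several snapshots and filtering to the port-80 worker threads might look like this. The file names are placeholders, and the argument order I show for the parser is an assumption -- adjust to however the attached program expects its arguments:

% jstack pid > jstack.1; sleep 5; jstack pid > jstack.2; sleep 5; jstack pid > jstack.3
% java ParseJStack -name WorkerThread-80 jstack.1 jstack.2 jstack.3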

Finally, two caveats: the first is that jstack output is likely to change, and the parser will not be robust for all versions of the JDK (nor necessarily all JDK options; some GC options may dump GC thread output the tool is unprepared for). So it will likely need to be tweaked if you encounter a parsing error. Second, a jstack is only a snapshot in time -- it is like a sampling profiler but with a very, very large interval. That means that sampling artifacts will affect your analysis, just as a profiler may point you to a parent or child class if the sampling rate is off a little bit. The fact that the JDK pauses threads only at certain locations to get their stack also affects that, particularly for running threads -- so a method that continually shows up in the running state doesn't mean that method is particularly expensive, just that it is a good point for the JDK to get the trace. [It may be expensive as well, though -- we know from other analysis that working with stack traces, including throwing exceptions, is very expensive in glassfish due to its very deep stack depth, so it's not surprising in my example above to see lots of threads running in the getStackTraceElement method.] So this parsing is much more effective at finding things that are blocking your throughput than at finding runtime bottlenecks where you are using too much CPU.

Like all performance tools, jstack gives you one piece of information -- hopefully this will let you place that one piece of information into context with information from your other tools when it comes time to diagnose throughput bottlenecks.

Yesterday, I wrote that I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective. I answered that question in terms of hardware and concluded (as I always do) that the answer depends very much on your needs, but that a machine which appears slower in a single-threaded test will likely be faster in a multi-threaded world. You can't necessarily extrapolate results from a simple test to a complex system.

What about software? Today, I'll look at that question in terms of NIO. You're probably aware that the connection handler of glassfish is based on Grizzly, an NIO framework. Yet in recent weeks, we've read claims from the Mailinator author that traditional I/O is faster than NIO. And a recent blog from Jonathan Campbell shows a traditional I/O-based appserver outperforming glassfish. So what gives?

Let's look more closely at the test Jonathan Campbell ran: even though it simulates multiple clients, the driver runs only a single request at a time. Though it doesn't appear so on the surface, this is exactly an NIO issue; it has to do with how you architect servers to handle single-request streams vs. a conversational stream. A little-known fact about glassfish is that it still contains a blocking, traditional I/O-based connector which is based on the Coyote connector from Tomcat. You can enable it in glassfish by adding the -Dcom.sun.enterprise.web.connector.useCoyoteConnector=true option to your jvm-options -- but read this whole blog before you decide that using that connector is a good thing.

So I enabled this connector, got out my two-CPU Linux machine running Red Hat AS 3.0, and re-ran the benchmark Jonathan ran on glassfish and jBoss (I tried Geronimo, but when it didn't work for me, I abandoned it -- I'm sure I'd just done something stupidly wrong in running it, but I didn't have the time to look into it). I ran each appserver with the same JVM options, but did no other tuning. And now that we're comparing the blocking, traditional I/O connectors, Glassfish comes out well on top (and, by comparison with Jonathan's numbers, it would easily have beaten Geronimo as well):
[Chart: jbench benchmark results comparing Glassfish (blocking connector) and jBoss]

So does this mean that traditional I/O is faster than NIO? For this test, yes. But in general? Not necessarily. So next, I wrote up a little Faban driver that uses the same war file as the original test, but Faban will run the clients simultaneously instead of sequentially and continually pound on the same sessions. In my Faban test, I ran 100 clients, each of which had a 50 ms think time between repeated calls to the session validation servlet of the test. This gave me these calls per second:
  • Glassfish with NIO (grizzly): 8192
  • Glassfish with Std IO: 3344
  • jBoss: 6953
Yes, those calls per second are vastly higher than the original benchmark -- the jRealBench driver is able to drive the CPU usage of my appserver machine only to about 15%. Faban can do better, though since the test is dominated by network traffic, the CPU utilization is still only about 70%. And for glassfish's blocking connector, I had to increase the request-processing thread count to 100 (even so, there's probably something wrong with that result, but since the blocking connector is not really what we recommend you use in production, I'm not going to delve into it).

When scalability matters, NIO is faster than traditional blocking I/O. Which, of course, is why we use grizzly as the connector architecture for glassfish (and why you probably should NOT run out and change your configuration to use the coyote connector, unless your appserver usage pattern is very much dominated by single request/response patterns). The complex is different than the simple.

As always, your mileage will vary -- but the point is, are there tests where traditional I/O is faster than NIO? Of course -- with NIO, you always have the overhead of a select() system call, so when you measure the individual path, traditional I/O will always be faster. But when you need to scale, NIO will generally be faster; the overhead of the select() call is outweighed by having fewer thread context switches, or by having long keep-alive times, or by the other options that architecture opens. Just as we saw with hardware, you can't necessarily extrapolate performance from the single, simple case to the complex system: you must test it to see how it behaves.
As a performance engineer, I'm often asked which X is faster (for a variety of X). The answer to that question always depends on your perspective.

Today, I'll talk about the answer in terms of hardware and application servers. People quite often measure the performance of their appserver on, say, their laptop and a 6-core, 24-thread Sun Fire T1000 and are surprised that the cheaper laptop can serve single requests much faster than the more expensive server.

There are technical reasons for this that I won't delve into -- there are architecture guides that go into all that. Rather I want to explore the question of which of these machines is actually faster, particularly in a Java EE context. In an appserver, you typically want to process multiple requests at the same time. So looking at the speed of a single request isn't really interesting: what is the speed of multiple requests?

To answer this, I took a simple program that does a long-running nonsense calculation. Running this on my laptop and 24-thread T1000, I see the following times (in seconds) to calculate X items:                                           
# Items   Laptop   T1000
1         0.66     1.3
2         1.4      1.5
4         2.8      1.6
8         5.4      2.5
16        10.8     3.7
24        16.6     4.8
As you'd expect, the performance of the laptop degrades linearly, to where it takes 16.6 seconds to perform 24 calculations. The performance of the T1000 doesn't degrade linearly, and even though it takes twice as long as the laptop to perform a single calculation, it can perform 24 calculations in one-third of the time of the laptop.

In the context of an appserver, think of the calculation as the time required for the business methods of your app. I've walked through this explanation a number of times, and often I'm told that the business method is the critical part of the app, and it must be done in .6 seconds for each user -- and hence the throughput of the T1000 isn't important. And that's fine: if you need to calculate a single method in .6 seconds, then you must use the machine with the faster single-thread performance. But if you need to calculate two of those at the same time, then you'll need to get two of those machines, and if you need to calculate 24 of them, you'll need to get 24 machines.

So this brings us back to our question: which machine is faster? And it depends on what you need. If you need to only do one calculation at a time, then the laptop is faster. If you need to do 3 or more calculations at the same time, then the T1000 is faster. Which is faster for you will depend on your application, your traffic model, and many other variables. As always, the best thing is to try your application, but if that's not feasible, be very careful about extrapolating whatever data you do have: you cannot simply extrapolate performance data from a simple (single-threaded) model to a complex system.  
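If you want to run this kind of comparison on your own hardware, the test is easy to sketch. The "nonsense calculation" below is just a stand-in for a CPU-bound business method (the real one I used isn't published), and the timings you get will of course depend entirely on your machines:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScalingTest {
    // Stand-in for a long-running, CPU-bound business method.
    static double nonsenseCalculation() {
        double d = 0;
        for (int i = 1; i < 200000000; i++) {
            d += Math.sqrt(i) / i;
        }
        return d;
    }

    public static void main(String[] args) throws Exception {
        int items = Integer.parseInt(args[0]);   // e.g. 1, 2, 4, 8, 16, 24
        ExecutorService pool = Executors.newFixedThreadPool(items);
        long start = System.nanoTime();
        List<Future<Double>> results = new ArrayList<Future<Double>>();
        for (int i = 0; i < items; i++) {
            results.add(pool.submit(new Callable<Double>() {
                public Double call() {
                    return nonsenseCalculation();
                }
            }));
        }
        for (Future<Double> f : results) {
            f.get();                             // wait for every calculation to finish
        }
        System.out.printf("%d items: %.2f seconds%n",
                items, (System.nanoTime() - start) / 1e9);
        pool.shutdown();
    }
}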
Recently, I've been reading an article entitled The Fallacy of Premature Optimization by Randall Hyde. I urge everyone to go read the full article, but I can't help summarizing some of it here -- it meshes so well with some of my conversations with developers over the past few years.

Most people can quote the line "Premature optimization is the root of all evil" (which was popularized by Donald Knuth, but originally comes from Tony Hoare). Unfortunately, I (and apparently Mr. Hyde) come across too many developers who have taken this to mean that they don't have to care about the performance of their code at all, or at least not until the code is completed. This is just wrong.

To begin, the complete quote is actually 
We should forget about small efficiencies, say about 97% of the time: premature optimization
is the root of all evil.
I agree with the basic premise of what this says, and also with everything it does not say. In particular, this quote is abused in three ways.

First, it is only talking about small efficiencies. If you're designing a multi-tier app that uses the network a lot, you want to pay attention to the number of network calls you make and the data involved in them. Network calls are a large inefficiency. And not to pick on network calls -- experienced developers know which things are inefficient, and know to program them carefully from the start.

Second, Hoare is saying (and Hyde and I agree) that you can safely ignore the small inefficiencies 97% of the time. That means that you should pay attention to small inefficiencies 1 out of every 33 lines of code you write.

Third, and only somewhat relatedly, this quote feeds the perception that 80% of the time an application spends will be in 20% of the code, so we don't have to worry about our code's performance until we find out our code is in that hot 20%.

I'll present one example from glassfish to highlight those last two points. One day, we discovered that a particular test case for glassfish was bottlenecked on calls to Vector.size -- in particular, because of loops like this: 
Vector v;
for (int i = 0; i < v.size(); i++)
     process(v.get(i));
This is a suboptimal way to process a vector, and one of the 3% of cases you need to pay attention to. The key reason is the synchronization around Vector, which turns out to be quite expensive when this loop is the hot loop in your program. I know, you've been told that uncontended access to a synchronized block is almost free, but that's not quite true -- crossing a synchronization boundary means that the JVM must flush all instance variables presently held in registers to main memory. The synchronization boundary also prevents the JVM from performing certain optimizations, because it limits how the JVM can re-order the code. So we got a big performance boost by re-writing this as
ArrayList v;
for (int i = 0, j = v.size(); i < j; i++)
     process(v.get(i));
Perhaps you're thinking that we needed to use a vector because of threading issues, but look at that first loop again: it is not threadsafe. If this code is accessed by multiple threads, then it's buggy in both cases.

What about that 80/20 rule? It's true that we found this case because it was consuming a lot (not 80%, but still a lot) of time in our program. [Which also means that fixing this case is tardy optimization, but there it is.] But the problem is that there wasn't just one loop written like this in the code; there were (and still are...sigh) hundreds. We fixed the few that were the worst offenders, but there are still many, many places in the code where this construct lives on. It's considered "too hard" to go change all the places where this occurs (NetBeans could refactor it all pretty quickly, but there's a risk that subtle differences in the loops would mean they need to be refactored differently).

When we addressed performance in Glassfish V2 in order to get our excellent SPECjAppServer results, we fixed a lot of little things like this, because we spent 80% of our time in about 50% of our code. It's what I call performance death by a thousand cuts: it's great when you can find a simple CPU-intensive set of code to optimize. But it's even better if developers pay some attention to writing good, performant code at the outset, so that you don't have to track down hundreds of small things to fix.

Hyde's full article has some excellent references for further reading, as well as other important points about why, in fact, paying attention to performance as you're developing is a necessary part of coding.  
I've written several times before about how you have to measure performance to understand how you're doing -- and so here's my favorite performance stat of the day: New York 17, New England 14.  
I spent last week working with a customer in Phoenix (only a few weeks before the Giants go there to beat the Patriots), and one of the things we wanted to test was how their application would work with the new in-memory replication feature of the appserver. They brought along one of their apps, we installed it and used their jmeter test, and quickly verified that the in-memory session replication worked as expected in the face of a server failure.

Feeling confident about the functionality test, we did some performance testing using their jmeter script. We got quite good throughput from their test. But as we watched it run, we noticed jmeter reporting that the throughput kept continually decreasing. Since we were pulling the plug on instances in our 6-node cluster all the time, at first I just chalked it up to that. But then we ran a test without failing instances, and the same thing happened: continually decreasing performance.

Nothing is quite as embarrassing as showing off your product to a customer and having the product behave badly. I was ready to blame a host of things: botched installation, network interference, phases of the moon. Secretly, I was willing to blame the customer app: if there's a bug, it must be in their code, not ours.

Eventually, we simplified the test down to a single instance, no failover, and a single URL to a simple JSP: pretty basic stuff, and yet it still showed degradation over time (in fact, things got worse). Now there were two things left to blame: jmeter, or the phases of the moon. Neither seemed likely, until I took a closer look at what jmeter was doing: it turns out that the jmeter script was using an Aggregate Report. That report, in addition to updating the throughput for each request, also updates various statistics, including the 90% response time. It does this in real time, which may seem like a good idea; the problem is that calculating the 90% response time is an O(n) operation: the more requests jmeter made, the longer it took to calculate the 90% time.

I've previously written in other contexts about why tests with 0 think time are subject to misleading results. And it turns out this is another case of that: because there is no think time in the jmeter script, the time to calculate the 90% penalizes the total throughput. As the time to calculate the 90% increases, the time available for jmeter to make requests decreases, and hence the reported throughput decreases over time.

I'm not actually sure if jmeter is smart enough to do this calculation correctly even if there is think time between requests: will it just blindly sleep for the think time, or will it correctly calculate the think time minus its own processing time? For my test, it doesn't matter: the simpler thing is to use a different reporting tool that doesn't have the 90% calculation (which, I'm happy to report, showed glassfish/SJSAS 9.1 performing quite well with in-memory replication across the cluster and no degradation over time).

But what's more important to me is that it reinforces a lesson that I seem to have to relearn a lot: sometimes, your intuition is smarter than your tools. I had a strong intuition from the beginning that the test was flawed, but despite that, we spent a fair amount of time tracking down possible bugs in glassfish or the servlets.

And I also don't mean to limit this to a discussion of this particular bug/design issue with jmeter. When we tested startup for the appserver, a particular engineer was convinced that glassfish was idle for most of its startup time: the UNIX time command reported that the elapsed time to run asadmin start-domain was 30 seconds, but the CPU time used was only 1 or 2 seconds. The conclusion from that was that glassfish sat idle for 28 seconds. But intuitively, we knew that wasn't true (for one thing, the disk was cranking away all that time, and a quick glance at a CPU meter would disprove the theory that the CPU wasn't being used). And of course, it turns out that asadmin was starting processes which started processes, and shell timing code didn't understand all the descendant structure (particularly when intermediate processes exited but the grandchild process -- the appserver -- was still executing). The time command was just not suited to giving the desired answer.

Tools that give you visibility into your applications are invaluable; I'm not suggesting that when a tool gives you a result that you don't expect that you should blindly cling to your hypothesis anyway. But when a tool and your intuition are in conflict, don't be afraid to examine the possibility that the tool isn't measuring what you wanted it to.  

[NOTE: The code in this blog was revised 2/11/08 due to some errors on my part the first time, and some changes as it was integrated into grizzly. And thanks to Erik Svensson for pointing out a few errors, it has been revised again on 2/13/08.]
I'm quite interested these days in parsing performance: much of what a Java appserver does is take bytes from a network stream (usually, but not always, in some 8-bit encoding) and convert them into Java strings (based on 16-bit characters). Because servlet and JSP APIs are written in terms of strings, much of that conversion is unavoidable, but parsing network protocols at the byte level is appropriate in some circumstances.

 

As I prepared to prototype some tests around that, I realized I needed a good framework to test my changes, and of course that framework is grizzly. In fact, the newly-released grizzly 1.7 has a new protocol parser that exactly fit my needs (partly because I joined the grizzly project so that I could modify the parser as I needed; such are the joys of open source!).

 

I'll talk about some of my performance tests with network parsing in later blogs; for now, I wanted to write a quick entry on how to use grizzly 1.7's new protocol parser. In grizzly 1.7, the ProtocolParser interface was reimplemented to make it much easier to deal with the messages that the parser is expected to produce. This means that it is now possible to use standard grizzly filters to handle the data produced by a ProtocolParser, like this:

controller.setProtocolChainInstanceHandler(new DefaultProtocolChainInstanceHandler() {
    public ProtocolChain poll() {
        ProtocolChain protocolChain = protocolChains.poll();
        if (protocolChain == null) {
            protocolChain = new DefaultProtocolChain();
            ((DefaultProtocolChain) protocolChain).setContinuousExecution(true);
            protocolChain.addFilter(new MyParserProtocolFilter());
            protocolChain.addFilter(new MyProcessorFilter());
        }
        return protocolChain;
    }
});

The nice thing about this is that additional filters (like a debugging log filter) can be inserted anywhere along the chain; the protocol use is completely integrated into the standard grizzly design. Note the call to setContinuousExecution -- it should be the default for protocol parsers (and eventually will be), but version 1.7 of grizzly needs that call. [Note that the standard LogFilter in grizzly is not appropriate in this case, since it tries to read directly from the socket as well; it's trivial to write your own if you like.]

 

Now it's a matter of implementing the two filters and the parser itself. The ParserProtocolFilter class will handle reading the requests and calling the parser, but in order for it to know which parser to use, you must extend it and override the newProtocolParser method:

public class MyParserProtocolFilter extends ParserProtocolFilter {
    public ProtocolParser newProtocolParser() {
        return new MyProtocolParser();
    }
}

What about the parser itself? That's the meat of the issue. The new protocol parser interface expects a basic flow like this: start processing a buffer, enumerate the messages in the buffer, and end processing the buffer. The buffer can contain 0 or more complete messages, and it's up to the protocol parser to make sense of that. Here's the outline of a simple protocol parser for a protocol where the first byte gives the number of bytes in a string, followed by the bytes of the string itself:

public class MyProtocolParser implements ProtocolParser {
    byte[] data;
    int position;
    int limit;
    ByteBuffer savedBuffer;
    int origLimit;
    boolean partial;
    String savedString;

    public void startBuffer(ByteBuffer bb) {
        // We begin with a buffer containing data. Save the initial buffer
        // state information. The best thing here is to get the backing store
        // so that the bytes can be parsed directly. We also need to save the
        // original limit so that we can place the buffer in the correct state
        // at the end of parsing.
        savedBuffer = bb;
        savedBuffer.flip();
        partial = false;
        origLimit = savedBuffer.limit();
        if (savedBuffer.hasArray()) {
            data = savedBuffer.array();
            position = savedBuffer.position() + savedBuffer.arrayOffset();
            limit = savedBuffer.limit() + savedBuffer.arrayOffset();
        } else {
            // ...maybe copy out the data, or use put/get when parsing...
        }
    }

    public boolean hasMoreBytesToParse() {
        // Indicate if there is unparsed data in the buffer
        return position < limit;
    }

    public boolean isExpectingMoreData() {
        // If there is a partial message remaining in the buffer, return true
        return partial;
    }

    public String getNextMessage() {
        // We already know this, but other protocols might parse here
        return savedString;
    }

    public boolean hasNextMessage() {
        // In our case, it's easier to parse here
        int length = data[position];
        if (position + 1 + length <= limit) {
            savedString = new String(data, position + 1, length);
            // Set the buffer position/limit to the boundaries of this message
            savedBuffer.limit(position + 1 + length);
            savedBuffer.position(position + 1);
            position += length + 1;
            partial = false;
        } else {
            partial = true;
        }
        return !partial;
    }

    public boolean releaseBuffer() {
        // If there's a partial message, return true; else false
        if (!hasMoreBytesToParse()) {
            savedBuffer.clear();
        } else {
            // You could compact the buffer here if you're
            // concerned that there isn't enough space for
            // further messages, but compacting comes at a
            // performance price -- whether to compact or not
            // depends on your protocol.
            savedBuffer.position(position);
            savedBuffer.limit(origLimit);
        }
        return partial;
    }
}

The point of this is that the ParserProtocolFilter will repeatedly call hasNextMessage/getNextMessage to retrieve messages (Strings in this case) to pass to the next filter. When it's done, it will call releaseBuffer, which is responsible for setting the position and limit in the buffer to reflect the data consumed by the (possibly multiple) messages returned.

 

So what about the downstream filters? You probably noticed that when we parsed the data, we also set the limit/position in the ByteBuffer to reflect the message boundaries. That's because not all grizzly filters will understand that the data is protocol based and has been separated into typed messages. For instance, you could write a LogFilter that just prints out the data received; it doesn't know about the messages (and we wouldn't want it to -- we'd want it to print the raw data anyway, rather than information in the message).

But downstream filters can also understand what a message is and hence they can work like this:

public class MyProcessorFilter implements ProtocolFilter {
    public boolean execute(Context ctx) {
        String s = (String) ctx.getAttribute(ProtocolParser.MESSAGE);
        if (s == null) {
            // no message; just use the bytes in the buffer like a
            // normal filter
            s = getStringFromBuffer(ctx);
        }
        // ... do something with s ...
        return true;
    }
}

So, apart from writing the protocol parser itself (which could be quite complex, depending on the actual protocol and how it breaks into messages), using the new grizzly framework for protocol parsing is quite simple: you set up the parser class, and then have a filter that processes the messages from the parser. And along the way, you can use any other grizzly filter or framework feature you need.

When I reported our recent excellent SPECjAppServer 2004 scores, one glassfish user responded: 
I sure wish you guys were able to come up with a thorough write up
about the SPEC Benchmark architecture, and the techniques you guys
used to get the numbers you get and, more importantly, how those
techniques might apply to every day applications we run in the wild.
While we do have a full performance-tuning chapter in the glassfish/SJSAS docset, I can understand the appeal of a quick cheat-sheet for getting the most out of glassfish in production. Most of this information has appeared in various blogs, particularly by Jeanfrancois, who is so expertly focused on making sure that grizzly and our http path are as fast as possible. Still, I hope that gathering this quick list together will be a good single-source summary.

One thing to note about these guidelines: a lot of glassfish configurations (particularly when you start with a developer profile) are optimized for developers. In development, performance is different: you'll trade off a few seconds here and there to make starting the appserver faster, or deploying something faster. In production, you'll make the opposite trade-offs. So if you wonder why some of the things in this list aren't the default settings, that's probably why.

Tune your JVM

The first step is to tune the JVM, which is of course different for every deployment. These are the options set via the jvm-options element in your domain.xml (or the JVM options page in the admin console). As a general rule, I like to use the throughput collector with a large heap and a moderately sized young generation: that makes young GCs quite fast. That will lead to a periodic full GC, but the impact of that on total throughput is usually quite minimal. If you absolutely cannot tolerate a pause of a few seconds, you can look at the concurrent collector, but be aware that this will impact your total throughput. So a good set of JVM arguments to start with is:
-server -Xmx3500m -Xms3500m -Xmn1500m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:+AggressiveOpts
On a CMT machine like the SunFire T5220 server, you'll want to use large pages of 256m, and a heap that is a multiple of that: 
-server -XX:LargePageSizeInBytes=256m -Xmx2560m -Xms2560m -Xmn1024m -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=16 -XX:+AggressiveOpts
More details of the impact of a CMT machine are available at Sun's Cool Threads website.

Make sure to remove the -client option from your jvm options, to include the -Dcom.sun.enterprise.server.ss.ASQuickStartup=false flag, and -- if you are using CMP 2.1 entity beans -- to include -DAllowMediatedWriteInDefaultFetchGroup=true. 
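In domain.xml, each of those options becomes its own jvm-options element under the java-config section; for the first set of arguments above, the fragment would look roughly like this (other java-config attributes and options omitted for brevity):

<java-config>
  <jvm-options>-server</jvm-options>
  <jvm-options>-Xmx3500m</jvm-options>
  <jvm-options>-Xms3500m</jvm-options>
  <jvm-options>-Xmn1500m</jvm-options>
  <jvm-options>-XX:+UseParallelGC</jvm-options>
  <jvm-options>-XX:+UseParallelOldGC</jvm-options>
  <jvm-options>-XX:+AggressiveOpts</jvm-options>
  <jvm-options>-Dcom.sun.enterprise.server.ss.ASQuickStartup=false</jvm-options>
</java-config>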

Tune the default-web.xml

Settings in the default-web.xml file are overridden by an application's web.xml, but I find it easier to set production-ready values in the default-web.xml file so that all applications will get them. In particular, under the JspServlet definition, add these two parameters: 
<init-param>
  <param-name>development</param-name>
  <param-value>false</param-value>
</init-param>
<init-param>
  <param-name>genStrAsCharArray</param-name>
  <param-value>true</param-value>
</init-param>
That will mean you cannot change JSP pages on your production server without redeploying the application, but that's generally what you want anyway.

One note about this: this file is only consulted when an application is deployed. So make sure you change the file and then deploy your application, or you won't see any benefit from this change.

Tune the HTTP threads

As you know, there are two parameters here: the HTTP acceptor threads, and the request-processing threads. These values have unfortunately had different meanings in a few of our releases, and some confusion about them remains. The acceptor threads are used both to accept new connections to the server and to schedule existing connections when a new request comes over them. In general, you'll need one of these for every 1-4 cores on your machine; no more than that (unlike, say, SJSAS 8.1, where this setting had a completely different meaning). The request-processing threads run the HTTP requests. You want "just enough" of those: enough to keep the machine busy, but not so many that they compete for CPU resources -- if they compete for CPU resources, your throughput will suffer greatly. Having too many request-processing threads is a big performance problem I see on many machines.

How many is "just enough"? It depends, of course -- in a case where HTTP requests don't use any external resource and are hence CPU bound, you want only as many HTTP request-processing threads as you have CPUs on the machine. But if the HTTP request makes a database call (even indirectly, such as by using a JPA entity), the request will block while waiting for the database, and you could profitably run another thread. So this takes some trial and error, but start with the same number of threads as you have CPUs and increase them until you no longer see an improvement in throughput.
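To make that concrete, here's a rough sketch of where those settings live in the SJSAS 9.x / Glassfish V2 domain.xml. The values are just starting points, and I'm quoting the element and attribute names from memory, so verify them against your release's schema (or simply use the admin console):

<http-service>
  <!-- request-processing threads: start near the number of CPUs and tune upward -->
  <request-processing initial-thread-count="8" thread-count="8"/>
  <!-- one acceptor thread per 1-4 cores is enough; other listener attributes omitted -->
  <http-listener id="http-listener-1" port="8080" acceptor-threads="2" enabled="true"/>
</http-service>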

Tune your JDBC drivers

Speaking of databases, it's quite important in glassfish to use JDBC drivers that perform statement caching; this allows the appserver to reuse prepared statements and is a huge performance win. The JDBC drivers that come bundled with the Sun Java System Application Server provide such caching; Oracle's standard JDBC drivers do as well, as do recent drivers for Postgres and MySQL. Whichever driver you use, make sure to configure the properties to use statement caching when you set up the JDBC connection pool -- e.g., for Oracle's JDBC drivers, include the properties
ImplicitCachingEnabled=true
MaxStatements=200
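For example, you can supply those properties when you create the pool with asadmin. This is a sketch: the pool name is a placeholder, and the datasource class shown is for Oracle's driver -- substitute the class appropriate to whatever driver you use:

% asadmin create-jdbc-connection-pool \
    --datasourceclassname oracle.jdbc.pool.OracleDataSource \
    --restype javax.sql.DataSource \
    --property ImplicitCachingEnabled=true:MaxStatements=200 OraclePool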

Use the HTTP file cache

If you serve a lot of static content, make sure to enable the HTTP file cache.
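In domain.xml the file cache is part of the http-service configuration; a minimal sketch looks something like the following. I'm recalling the http-file-cache element and its attribute names from memory, so treat them as assumptions and confirm against your release's schema, or just enable the cache from the admin console:

<http-service>
  <http-file-cache globally-enabled="true" file-caching-enabled="on"
                   max-age-in-seconds="30"/>
</http-service>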



Have I piqued your interest? As I mentioned, there are hundreds of pages of tuning guidelines in our docset. But here at least you have some important first steps.  

Last week, Sun published a new SPECjAppServer 2004 benchmark score: 8439.36 JOPS@Standard [1]. [I'd have written about it sooner, but it was published late Wednesday, and I had to go home and bake a lot of pies.] This is a "big" number, and frankly, it's the one thing that's been missing in our repertoire of submissions. We'd previously shown leading performance on a single chip, but workloads in general (and SPECjAppServer 2004 in particular) don't scale linearly as you increase the load. This number shows that we can scale our appserver across multiple nodes and machines quite well.

I've been asked quite a lot about what scalability actually means for this workload, so let me talk about Java EE scalability for a little bit. The first question I'm invariably asked is, isn't this just a case of throwing lots more hardware at the problem? Clearly, at a certain level the answer is yes: you can't do more work without more hardware. And I don't want to minimize the importance of the amount of hardware that you throw at the problem. There are presently two published SPECjAppServer scores that are higher than ours: HP/Oracle have results of 9459.19 JOPS@Standard [2] and 10519.43 JOPS@Standard [3]. Yet those results require 11 and 12 (respectively) appserver tier machines; our result uses only 6 appserver tier machines. More telling is that the database machine in our submission is a pretty beefy Sun Fire E6900 with 24 CPUs and 96GB of memory. Pretty beefy, that is, until you look at the HP/Oracle submissions that rely on 40 CPUs and 327GB of memory in two Superdome chassis. So yes, if you have millions (and I mean many millions -- ask your HP rep how much those two Superdomes will cost) of dollars to throw at the hardware, you can expect to get a quite high number on the benchmark.

The database, in fact, is one reason why most Java EE benchmarks (and workloads) will not scale linearly -- you can horizontally scale appserver tiers pretty well, but there is still only a single database that must handle an increasing load.

On the appserver side, horizontal scaling is not quite just a matter of throwing more hardware at the problem. SPECjAppServer 2004 is partitioned quite nicely: no failover between J2EE instances is required, connections to a particular instance are sticky, and the instances don't need to communicate with each other. All of that leads to quite nice linear scaling.

But one part of the benchmark doesn't scale linearly, because it is dependent on the size of the database. SPECjAppServer 2004 uses a bigger database for bigger configurations. For example, our previous submission on a single SunFire T2000 achieved a score of 883.66 JOPS@Standard [4]. The benchmark sizing rules meant that the database used for that configuration was only 10% as large as the database we used in our current submission. [More reason why that database scaling is important.] And in particular, it meant that the database in the small submission held 6000 items in the O_item table while our current submission had 60000 items in that table.

For SPECjAppServer 2004, that's important because the benchmark allows the appserver to cache that particular data in read-only, container-managed EJB 2.1 entities. [That's a feature that's explicitly outside of the J2EE 1.3/1.4 specification, so your portable J2EE apps won't use it -- your portable Java EE 5 apps that use JPA can use cached database data, though somewhat differently.] Caching 6K items is something a single instance can do, but caching all 60K items will cause GC issues for the appserver. Hence, in some areas, the appserver will have to do more work as the database size increases, even if the total load per appserver instance is the same.

So a "big" score on this benchmark is a factor of two things: there are things within the appserver architecture that influence how well you will scale, even in a well-partitioned app. But the amount of hardware (and cost of that hardware) remains the key driving factor in just how high that score can go. As I've stressed many times, benchmarks like this are a proof-point: our previous numbers establish that we have quite excellent performance, and this number establishes that we can scale quite well. As always, the only relevant test remains your application: download the appserver now and see how well it responds to your requirements.

Finally, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 11/26/07. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:
[1] Six Sun SPARC Enterprise T5120 (6 chip, 48 cores) appservers and one Sun Fire E6900 (24 chips, 48 cores) database; 8,439.36 JOPS@Standard
[2] Eleven HP BL860c (22 chips, 44 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 9,459.19 JOPS@Standard
[3] Twelve HP BL860c (24 chips, 48 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 10,519.43 JOPS@Standard
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard

You've probably read by now that today Sun released the product version of glassfish V2. Download glassfish V2 yourself, run some rigorous tests on it, and see for yourself all the improvements we've made.

 

Of course, we've already started work on glassfish V3, where we'll be targeting even more performance features, including very rapid startup, particularly for web container developers.

Today, Sun officially announced SPECjAppServer 2004 scores on our Sun Java System Application Server 9.1, which (as you no doubt know) is the productized version of the open-source Glassfish V2 project. We've previously submitted results for SJSAS 9.0 (aka Glassfish V1), which at the time we were quite proud of: they were the only SPECjAppServer scores based on an open-source application server, and that gave us a quite good price/performance story. Considering where we started, I was happy to conclude that those scores were "good enough."

"Good enough" is no longer good enough. Today, we posted the highest ever score for SPECjAppServer 2004 on a single Sun Fire T2000 application server: 883.66 JOPS@Standard [1]. The Sun Fire T2000 in this case has a 1.4ghz CPU; the application also uses a Sun Fire T2000 running at 1.0ghz for its database tier. This result is 10% higher than WebLogic's score of 801.70 JOPS@Standard [2] on the same appserver machine. In addition, this result is almost 70% higher than our previous score of 521.42 JOPS@Standard on a Sun Fire T2000 [3], although that Sun Fire T2000 was running at only 1.2ghz. So that doesn't mean that we are 70% faster than we were, but we are quite substantially faster and are quite pleased to have the highest ever score on the Sun Fire T2000.

This result is personally gratifying to me in many ways, and I am proud of it (and proud of the work by the appserver engineers that it represents) on many, many levels. But it is just a benchmark, so let me touch on two things that this means.

First, vendors and their marketing department love to play leap-frog games with benchmarks. My favorite example of this: some time ago, BEA posted a score of 615.64 JOPS@Standard [4] on the 1.2ghz T2000, only to be outdone a few months later by IBM WebSphere's score of 616.22 JOPS@Standard [5] on the same system. It's good marketing press, but at some point those sort of differences become slightly ridiculous to end users.

So yes, at some point it's conceivable that someone will post a higher score on this machine than we have; it's conceivable that I'll be back touting some improvements on our score (because my protestations about benchmarks aside, I'm not above playing the game either). But don't let any of that keep you from the point: this is a result that fundamentally changes the nature of that game. We used to be content with having a good result in terms of price/performance and watching IBM, Oracle, and BEA leap-frog among themselves in terms of raw performance. Now, we're the raw performance leader. There will be jockeying for position in the future, but we've changed forever the set of contenders. [We're also still quite interested in being price/performance leaders, by the way, which is why we also published a score this week using the free, open-source Postgres database.]

Second, remember that this is just a benchmark. Will you see similar results on your application? It depends. SPECjAppServer 2004 doesn't use EJB 3.0, JPA, WebServices, JSF, or a host of other Java EE technologies (and frankly, I'm pretty happy with our performance in most of those areas; see, for example, this article or this one on our WebServices performance). On the other hand, its performance is significantly affected by improvements we made to read-only EJBs, remote EJB invocation, and co-located JMS consumers and producers. So some of the improvements we've made may be in areas your application doesn't even use. [That's another reason I was happy with our previous scores: they established us as a viable appserver vendor, and I knew that customers who benchmarked their own applications would likely see better relative performance than that displayed by SPECjAppServer.]

Don't get me wrong: we have also made substantial performance improvements across the board: in the servlet connector and container, in JSP processing, in the local EJB container, in connection pooling, in CMP 2.1, and so on. This is a really important performance release for us. But as I have always said: the only realistic benchmark for your environment is your application. So go grab a recent build of Glassfish V2 and see for yourself.

Now, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 07/10/06. The comparison presented is based on application servers run on Sun Fire T2000 1.2 GHz and 1.4 GHz servers. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:
[1] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 SPECjAppServer2004 JOPS@Standard
[2] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 801.70 SPECjAppServer2004 JOPS@Standard
[3] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 521.42 SPECjAppServer2004 JOPS@Standard
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire V490 (4 chips, 8 cores, 2 cores/chip) database; 615.64 SPECjAppServer2004 JOPS@Standard
[5] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire X4200 (2 chips, 4 cores, 2 cores/chip) database; 616.22 SPECjAppServer2004 JOPS@Standard

One of those lesser-known features of Java is that it contains two different bytecodes for switch statements: a generic one (lookupswitch), and an (allegedly more optimal) table-driven one (tableswitch). The compiler automatically generates one or the other depending on the values in the switch statement: the table-driven instruction is used when the switch values are close to sequential (possibly with a few gaps), while the generic instruction is used in all other cases. It's the sort of thing that intrigues performance-oriented developers: is the table-driven instruction really more optimal? Is it worth coercing the variable involved in a switch statement so that the compiler can generate a table-driven instruction? Is there ever a case in a real-world program where this would even matter? Interesting questions, but since I assumed the answer to the last one was "no", I never really thought about the first few.
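
As a quick illustration of my own (not from the appserver): compile the following class and run javap -c on it, and you'll see javac emit a tableswitch for the dense case values and a lookupswitch for the sparse ones.

public class SwitchDemo {
    // Dense, nearly sequential case values: javac emits a tableswitch.
    static int dense(int i) {
        switch (i) {
            case 3:  return 30;
            case 4:  return 40;
            case 5:  return 50;
            case 7:  return 70;   // a small gap still leaves the table cheap enough
            default: return -1;
        }
    }

    // Widely scattered case values: javac falls back to a lookupswitch.
    static int sparse(int i) {
        switch (i) {
            case 300:   return 30;
            case 4000:  return 40;
            case 50000: return 50;
            default:    return -1;
        }
    }
}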

Now, however, I'm looking at some profiles of Glassfish V2, and I find that when running a particular application, we're spending a full 1% of our time in this method:
protected java.util.logging.Level convertLevel(int level) {
    int index = level / 100;
    switch (index) {
        case 3: return Level.FINEST;
        case 4: return Level.FINER;
        case 5: return Level.FINE;
        case 7: return Level.CONFIG;
        case 8: return Level.INFO;
        case 9: return Level.WARNING;
        case 10: return Level.SEVERE;
        default: return Level.FINER;
    }
}
Seems like a pretty simple method to be spending so much time in (and let's face it, sampling profilers may overstate the time attributed to a method like this). So I dug in a little further. The level value passed to this method is always exactly divisible by 100: it's never the case that level can be 300, 305, and 310. So there is a one-to-one correspondence between the integers passed to the method and the Level objects returned. So I was rather impressed that the original author of this code had known enough arcane Java trivia to realize he could coerce the argument to get the table-driven switch statement.

Alas, if only he'd taken the next step to see if the performance difference was worthwhile. It turns out that it wasn't: removing the division from this method and recasting the switch statement to use values of 300, 400, and so on eliminated all the time the profiler attributed to this method and resulted in a 0.5% improvement in the overall performance of the application. I also did some quick micro-benchmarking of the method and discovered that if I didn't need to coerce the argument into the switch statement (that is, if I passed in values of 3, 4, 5, etc. to begin with), the performance of the method was essentially the same, but adding the division statement to coerce the argument slowed down execution of the method quite significantly.
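
To make that concrete, here is a sketch of what the reworked method looks like (the actual change in the Glassfish source may differ in its details):

protected java.util.logging.Level convertLevel(int level) {
    // Switch directly on the raw level value: no division. These sparse
    // values compile to a lookupswitch rather than a tableswitch, but as the
    // micro-benchmark showed, that difference is lost in the noise.
    switch (level) {
        case 300:  return Level.FINEST;
        case 400:  return Level.FINER;
        case 500:  return Level.FINE;
        case 700:  return Level.CONFIG;
        case 800:  return Level.INFO;
        case 900:  return Level.WARNING;
        case 1000: return Level.SEVERE;
        default:   return Level.FINER;
    }
}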

At 0.5% of performance, I'm not sure that this is the real-world example of where this would ever matter -- though when you provide a platform for other people's applications, you worry about your operations being as streamlined as possible. But it is another example of why you should test your code before making assumptions about how it will perform, and particularly before writing code to work around a potential performance issue.

Almost every thread pool implementation takes great pains to make sure that it can dynamically resize the number of threads it uses: you specify the minimum number of threads you want, the maximum number, and the thread pool in its wisdom will automatically configure itself to have the optimal number of threads for your workload. At least, that's the theory...

But what about in practice? I'd argue that its utility is very limited, and that in many cases, a dynamically resizing threadpool will actually harm the performance of your system.

First, a quick review of why we have threadpools. From a performance perspective, the most important task of a threadpool is to throttle the number of simultaneous tasks running on your system. I know that you may think the purpose of a threadpool is to allow you to conveniently run multiple things at once. It does that, but more importantly, it prevents you from running too many things at once. If you need to run 100 CPU-bound tasks on a machine with 4 CPUs, you will get optimal throughput if you run only 4 tasks at a time: each task fully utilizes the CPU while it is running. Since you can't run more than 4 tasks at once, you won't get any better throughput by having more threads -- in fact, if you add more threads to the saturated system, your throughput will go down: the threads will compete with each other for CPU and other system resources, and the operating system will spend more time than necessary managing the competing threads.

In the real world, of course, tasks are never 100% CPU-bound, so you'll usually want more threads than CPUs to get optimal use of your system. How many more is a function of your workload: how much time it waits for external resources like a database, and so on. But there will be an optimal number, usually far smaller than the number of simultaneous tasks you need to handle (particularly if those tasks represent jobs coming in from remote users -- e.g., a web or application server handling thousands of connections). The determining rule is this: if you have more tasks to perform AND you have idle CPU time, then it makes sense to add more threads to the pool. If you have more tasks to perform but no idle CPU time, then it is counter-productive to add threads to the pool. And that's my problem with dynamically resizing threadpools: if they choose to add threads because there are tasks waiting (even though there is no available CPU time), they will hurt your performance rather than help it.
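
As a rough sketch (my own, not anything from the appserver), here's the classic way to turn that rule into a starting pool size: scale the CPU count by the ratio of time a task spends waiting versus computing, and then fix the pool at that size rather than letting it grow whenever tasks queue up. The workload numbers below are hypothetical.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Classic sizing heuristic: threads = cpus * (1 + waitTime / computeTime).
    // A purely CPU-bound workload gets one thread per CPU; the more time a
    // task spends waiting on external resources, the more threads it takes
    // to keep the CPUs busy.
    static int optimalThreads(int cpus, double waitMillis, double computeMillis) {
        return Math.max(1, (int) (cpus * (1 + waitMillis / computeMillis)));
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // Hypothetical workload: each task computes for 20ms and waits 60ms on a database.
        int size = optimalThreads(cpus, 60, 20);
        System.out.println("Sizing a fixed pool of " + size + " threads for " + cpus + " CPUs");
        ExecutorService pool = Executors.newFixedThreadPool(size);
        pool.shutdown();
    }
}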

Conceivably, you could use some native code to figure out the idle CPU time on your system and have a threadpool that takes that information into account. That would be better, but even that is insufficient. Say you have an application server accessing a remote database using JPA. If the database becomes a bottleneck, you'll have idle CPU time on your application server, and it will have tasks that are waiting. But adding threads to run those tasks will again make things worse: it will increase the work sent to the already-saturated database, and your overall throughput will suffer. In the final analysis, you are the only one who has all the necessary information to know whether it is productive to increase the size of your thread pool.

So you are responsible for setting the maximum size of the threadpool to a reasonable value, so that the system never attempts to run too many threads at once. Given that you've done that, is there a point in having a minimum number of threads? The claim is that there is, because it can save on system resources. But I would argue that the impact of that is really minimal. Each thread has a stack and so consumes a certain amount of memory. But if the thread is idle and the machine doesn't have enough physical memory to handle everything on the system, that idle memory will simply be paged out to virtual memory. Even if the thread exits, the memory it used for its stack still belongs to the JVM process -- the JVM might reuse that memory for something else, but in general, the memory cannot be returned to the operating system for use by other processes. So the memory issue doesn't really have much impact. Depending on the application, it's conceivable that fewer idle threads may have a small benefit because when a thread is reused, it might happen to have some important data in the CPU cache (whereas an idle thread selected to run a task won't have any data in the CPU cache), but the effects of that in the real world are pretty much non-existent. So it doesn't hurt to have a minimum number of threads, but you get no real advantage from it either.

One area that can be very subtle in this regard is the ThreadPoolExecutor, which is configured with a core size, an absolute maximum size, and a work queue. In general, threads are added as tasks arrive until the pool is running its core number of threads. Then everything chugs along nicely, even though a certain number of tasks may be waiting in the queue. Now say that the system can't keep up: the task queue fills beyond its configured bound. In response to this, the executor will start adding threads (up to the absolute maximum). But if the system is CPU-bound, or if the system is causing a bottleneck on an external resource, adding those threads is exactly the wrong thing to do. And because this happens only under circumstances such as increased load, it might be something that you fail to catch in normal testing: during normal testing, you'll usually run with the core number of threads and may not even notice that you've misconfigured the maximum number of threads to a value the system cannot handle.

The converse of this argument is that the thread pool executor can add new threads when a burst of traffic comes, and as long as there are resources available to execute those threads, the executor can handle the additional tasks (and then, once the burst is over, the extra threads can exit and reduce system resource usage). But given the minimal-at-best effect that has on system resources, handling a burst like that doesn't make a lot of sense to me, particularly given the potential for increasing load on the system at exactly the wrong time.
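
To see the subtlety in code, here is a hypothetical configuration (the numbers are mine, not anything shipped in the appserver): eight core threads, a 100-element queue, and a maximum of 32 threads. Under normal load the pool runs eight threads and queues the overflow; only when the queue is full does the executor grow toward 32 threads, which is precisely when the system is already struggling to keep up.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BurstyPool {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8,                                        // core threads
                32,                                       // absolute maximum
                60, TimeUnit.SECONDS,                     // idle "extra" threads exit after 60s
                new ArrayBlockingQueue<Runnable>(100));   // bounded work queue
        // Threads beyond the core size are created only after the queue fills,
        // i.e., exactly when the system has fallen behind on its existing work.
        System.out.println("core=" + pool.getCorePoolSize()
                + " max=" + pool.getMaximumPoolSize());
        pool.shutdown();
    }
}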

All of that is why I always choose to ignore dynamic threadpool sizing and just configure all my pools with a static size; a minimal example of that configuration follows.
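
For reference, here's what that static configuration might look like with java.util.concurrent (the pool size of 16 is a placeholder; pick it from your own testing):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StaticPool {
    public static void main(String[] args) {
        int size = 16;   // hypothetical value, determined by load testing your application
        // core == max, so the pool never resizes itself; this is effectively
        // what Executors.newFixedThreadPool(size) constructs under the covers.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                size, size,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        pool.execute(new Runnable() {
            public void run() {
                System.out.println("running on " + Thread.currentThread().getName());
            }
        });
        pool.shutdown();
    }
}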