
By now, you are hopefully well aware that Glassfish 3.1 has been released.  Because the performance group has been a little quiet lately, maybe you're thinking there aren't a lot of interesting performance features in this release. In fact, there are two key performance benefits: one which benefits developers, and one which is important for anyone using Glassfish's new clustering and high-availability features.

Let's start with developers. One of our primary goals has always been to make the development experience fast and lightweight. This was of course a key factor driving the modularization of Glassfish V3; in V3, we finally had an architecture that allowed the server to load only those Java EE features that a developer was actually using. And the results were quite satisfactory. Given all our previous progress, what -- if anything -- could we actually do in Glassfish V3.1?

Developer Metrics Improve by 29%

With a lot of hard work and a laser-like focus by our development team, we managed to improve our core metric of a "developer scenario" by 29%. This scenario includes starting the appserver, deploying a JSP-based application and accessing its URL, and then a cycle of three changes to the application, each followed by a test of the application URL. We aggregate the entire time for that scenario as our primary metric, but the table below shows the improvements in each of these areas:

                    Startup    Deploy    Redeploy
Glassfish V3.1      2.96       3.14      0.9
Glassfish V3.0      3.28       4.35      1.59

As you can see, our improvement is across the board, covering all the activities that make up the development cycle. Well, most of them: we haven't figured out how to automatically find bugs in your program, so your testing will take the same amount of time. But at least it will be a little more pleasant, since your redeployment cycle will be that much faster in Glassfish V3.1.

Let me mention here too why we test the entire cycle of development and not just startup -- particularly because I've seen some recent blogs touting only the startup performance of other now-modularized Java EE servers. In the past, we made the mistake of focusing solely on startup performance; in the Sun Java System Application Server 9.1, we introduced a quick-start feature which improved startup quite significantly. The problem is that it just deferred necessary work until the first time the server was accessed, and the total time it took to start the server and then load its admin console got worse (in fact, general performance of anything socket-related suffered because the architecture of that old server didn't easily support deferring activities). In the end, pure startup isn't what is important -- what's important is how quickly you can get all of your work done. Otherwise, we'd all do what tomcat-based servers do for startup: return to the command prompt immediately, before the server is up, to make it look like startup is instantaneous. Of course, if you immediately access the server at that point, you'll get an error because it hasn't finished initializing, but hey, at least it started fast.

HA Performance Improves by 33%

On the other end of the spectrum, Glassfish V3.1 contains some quite impressive improvements in its high-availability architecture. This is somewhat old news: when we did project Sailfin for our SIP server, we re-architected the entire failover stack to make it faster and more scalable. But although Sailfin is based on Glassfish, Glassfish V3 didn't support clustering at all yet. V3.1 is the first time we've been able to bring that architectural work forward into the main Glassfish release.

In Glassfish V3.1, we support in-memory replication: one server is a primary server and holds the session object. Whenever the session object is modified, the data is sent to a secondary server elsewhere in the cluster so that if the primary server fails, the secondary can supply the session information. This is actually a fairly common implementation of high availability, though of course it does not address the situation of multiple failures. Still, the speed benefits you get from replicating to another server (vs. replicating to something like HADB) are quite significant. We introduced in-memory replication in SJSAS 9.1, and at the time had a nice performance gain compared to traditional database replication.

In Glassfish V3.1, we've taken that architecture and optimized it significantly; it is now based on the scalable grizzly adapter, which uses Java NIO for its underpinnings. We've also optimized session serialization and the general implementation to get a 33% improvement in performance for in-memory replication in Glassfish V3.1 compared to in-memory replication in SJSAS 9.1.1. And again we've tried to pay attention to all aspects of how it might be used. We support full session scope, where the entire HTTP session and all stateful session bean (SFSB) state are replicated on each request. We also support modified-attribute scope, where, for HTTP sessions, only those attributes in the session that have been marked as changed are replicated. Clearly the modified-attribute scope will perform better, but it does rely on the application to call setAttribute() to mark the attribute as having been modified (which, while not a standard in the Java EE specification, is the common technique adopted by virtually all Java EE servers).
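
In code, that contract looks something like the sketch below -- a hypothetical servlet with a made-up "cart" attribute (the class and attribute names are not from any real application). The important part is the final setAttribute() call after the list has been mutated; that call is what marks the attribute as changed so the replication layer will ship it under modified-attribute scope:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Illustrative only: the servlet, the "cart" attribute, and cart.jsp are made up.
public class AddToCartServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(true);
        @SuppressWarnings("unchecked")
        List<String> cart = (List<String>) session.getAttribute("cart");
        if (cart == null) {
            cart = new ArrayList<String>();
        }
        cart.add(req.getParameter("item"));
        // Mutating the list alone is not enough under modified-attribute scope;
        // re-set the attribute so the container marks it dirty and replicates it.
        session.setAttribute("cart", cart);
        resp.sendRedirect("cart.jsp");
    }
}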

In practical terms, the improvement we see holds for both kinds of tests. Like all our performance measurements, we take some basic workloads that mimic a typical web or EE application and see how many users we can scale up to such that the response time for the requests is within some limit (typically 0.8 seconds), often with some CPU limit on the machine (e.g., we don't want to use more than 60% of the CPU because in the event of a failure, the remaining instances must take on more load). For HTTP-only, HTTP and EJB, modified attribute, full session: all see about a 33% increase in the number of users we can support while still meeting that 0.8 second response time.

General Performance

Of course, we haven't neglected general performance either; we've run our usual battery of tests to ensure that Glassfish V3.1 hasn't regressed in performance in any area. The Performance Tuner is back in Glassfish V3.1 to help optimize your production environment so that you get the very best performance Glassfish has to offer. And of course Glassfish V3 remains the only open-source application server to submit performance results on SPECjAppServer 2004.

For most of the year, I've been working on session replication code for Sailfin. When I came back to work with the Glassfish performance team, I found that we had some pretty aggressive goals around performance, particularly considering that Glassfish V3 had a completely new architecture, involved a rewrite of major sections of code, and implements the new Java EE 6 specification. Glassfish V3 in those terms is essentially a .0 release, and I was convinced we'd see major performance regressions from the excellent performance we achieved with Glassfish V2.

Color me surprised; in the end, we met or exceeded all of our goals for V3 performance. For the most part, our performance tests are based on customer applications, industry benchmarks, and other proprietary code that we can't open source (nor share results of). But I can discuss some of those tests, and in this blog we'll look at our first set of sanity tests. These test the basic servlet operations of the web container; we'll look at three such tests:

  1. HelloServlet -- the "Hello, world" of servlets; it simply prints out 4 lines of HTML in response to each request, incrementing a global counter to keep track of the total number of requests (a sketch of a servlet along these lines follows the list).
  2. HelloServletBeanJsp -- the same servlet, but for each call it instantiates a simple JavaBean and then forwards the request (and bean) to a JSP page (i.e., the standard MVC model for servlets).
  3. HelloSessions -- the hello servlet that keeps track of a session counter (in a session attribute) instead of a global counter.
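
The test sources themselves aren't published, but the first test is simple enough that a hypothetical reconstruction gives a good feel for how little work each request does (the class and field names here are made up, not the actual test code):

import java.io.IOException;
import java.io.PrintWriter;
import java.util.concurrent.atomic.AtomicLong;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A sketch of the kind of servlet used in the first test: a few lines of HTML
// per request plus a global request counter.
public class HelloServlet extends HttpServlet {
    private final AtomicLong requestCount = new AtomicLong();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        long count = requestCount.incrementAndGet();
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>");
        out.println("<h1>Hello, world</h1>");
        out.println("<p>Requests served: " + count + "</p>");
        out.println("</body></html>");
    }
}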

Our goal here is that V3 would be at least as fast as V2 on these tests and remain ahead of the pack among open source application servers. The application servers are hosted on a Sun X4100, which is a 4 core (2 chip) AMD box running Solaris 10. The load is driven using the Faban HTTP Benchmarking program (fhb), which can drive load to a single URL from multiple clients (each client running in a separate thread -- an important consideration in a load generator). As a first pass, we run 20 users with no think time to see how much total load we can generate in the server:

I've normalized the chart to V2 performance. And what we see is that even on the simplest test -- the HelloServlet -- V3 manages to increase the total server throughput by a few percentage points. And while I was concerned about the effects of a new architecture, the OSGi classloading architecture and the reworking of the glassfish classloading structure meant that we could take care of a long-standing issue in the V2 classloader -- so now every time we call Beans.instantiate() (or do anything else related to class loading), we can operate much more quickly. When it comes to session management, V2 and V3 come out the same.

The other columns in the chart represent jBoss 5.1 and Tomcat 6.0.20; our goal was to beat those containers on these tests, and we did. However, you might take that with a grain of salt: I am not an expert in those containers, and there may be container tunings that I missed. In fact, these tests are done with a small amount of tuning:

  • JVM options for all products are set to -server -Xmx2000m -Xms2000m -XX:NewRatio=2  -Xss128k. Using Sun's JDK 6 (6U16) means that ergonomics will kick in and use the parallel GC collector with 4 threads on this machine.
  • The thread pool size for all products is set to 10 (both min and max; I'm not a fan of dynamically resizing threadpools).
  • The server will honor all keep alive requests (fhb specifies this automatically) and allow up to 300000 requests from a single client before closing the socket (maxKeepAliveRequests)
  • The server will use 2 acceptor threads and a backlog of 300000 requests (that tuning is really needed only for the scalability test discussed below)
  • For jBoss, I followed the recommendation to use the Tomcat APR connector. As far as I can tell, Netty is not integrated into jBoss 5, though if you know otherwise, I'd love a link to the details.
  • For tomcat, I used the Http11NIOProtocol connector
  • In the default-web.xml for JSPs, genStrAsCharArray is set to true and development is set to false

Happy with a simple throughput test, I proceeded to some scalability tests. For these tests, we also use fhb -- but in this case we run multiple copies of fhb, each with 2000 users and a 1 second think time. This allows us to vary the number of users and test within a pre-defined response time (which is at most 1 second, or the client will fall behind the desired think time). The number of connections that we can run at each test will vary depending on the work -- the HelloServlet test had an initial throughput of almost 42,000 operations per second, and so we were able to test to 56,000 connected users with 28 copies of fhb (which we distributed among 7 x4100 machines; each core essentially running 2000 users). The test involving forwarding to a JSP does almost twice the work, and we can only run 32,000 users within these timing constraints; for the session tests we can run 40,000 users.

Here are the results: despite all my initial qualms, V3 has performed admirably; it handled those 56,000 simultaneous clients without breaking a sweat. [Well, if a CPU can sweat, it might have -- it was quite busy. :-)] There are no results from tomcat or jBoss for this test; both failed in the configurations I had with these large numbers of users. In fact, they failed with even smaller numbers of users; I didn't test below 10000, but neither could handle a load even that high. Again, this is possibly due to my lack of knowledge about how to configure the products. Though I'm not convinced about that -- tomcat failed because it had severe GC problems caused by a finalizer in the org.apache.tomcat.util.net.NioBlockingSelector$KeyReference class, and jBoss failed because of severe lock contention around a lock in the org.apache.tomcat.util.net.AprEndpoint class. Still, there might be a workaround for both issues.

At any rate, I'm a happy camper today: glassfish V3 is going out the door with excellent performance characteristics, thanks to lots of hard work along the way by the engineering community -- thanks guys!


Fun With JStack Blog

Posted by sdo Oct 15, 2009

Avid readers of the glassfish aliases know that we are frequently asked why a server isn't responding, or why it is slow, or how many requests are being worked on. And the first thing we always say is to look at the jstack output.

So you're running on a Sun 5120 with 128 hardware threads, which means you have 368 request processing threads and the jstack output is 10K lines. Now what? What I do is use the program attached to this blog entry -- it will parse the jstack output and show how many threads are doing what. At its simplest, it's something like this:

% jstack pid > jstack.out
% java ParseJStack jstack.out
[Partial output...]
Threads in state Running
        8 threads in java.lang.Throwable.getStackTraceElement(Native Method)
Total Running Threads: 8
Threads in state Blocked by Locks
        41 threads running in com.sun.enterprise.loader.EJBClassLoader.getResourceAsStream(EJBClassLoader.java:801)
Total Blocked by Locks Threads: 41
Threads in state Waiting for notify
        39 threads running in com.sun.enterprise.web.connector.grizzly.LinkedListPipeline.getTask(LinkedListPipeline.java:294)
        18 threads running in System Thread
Total Waiting for notify Threads: 74
Threads in state Waiting for I/O read
        14 threads running in com.acme.MyServlet.doGet(MyServlet.java:603)
Total Waiting for I/O read Threads: 14

The parser has aggregated all the threads and shown how many are in each state. 8 threads are currently on the CPU (they happen to be doing a stack trace -- a quite expensive operation which is better to avoid). That's fine -- but we probably want it to be more than that.
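
The real parser is attached to the original entry; to give a rough idea of how little machinery such a tool needs, here is a minimal, hypothetical sketch. It is not the attached program: it only groups threads by the java.lang.Thread.State line that jstack prints and remembers the first non-JDK frame as a summary, so its states and its notion of a "JDK class" are much cruder than the output shown above.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal jstack-output aggregator: state -> (summary frame -> thread count).
public class MiniJStackParser {
    public static void main(String[] args) throws IOException {
        Map<String, Map<String, Integer>> states = new TreeMap<String, Map<String, Integer>>();
        for (String fileName : args) {
            BufferedReader in = new BufferedReader(new FileReader(fileName));
            String line;
            String state = null;
            boolean needSummary = false;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.startsWith("java.lang.Thread.State:")) {
                    state = line.substring("java.lang.Thread.State:".length()).trim();
                    needSummary = true;
                } else if (needSummary && line.startsWith("at ")) {
                    String frame = line.substring(3);
                    // Use the first frame that isn't an obvious JDK class as the
                    // summary; a real tool needs a more complete package list.
                    if (!frame.startsWith("java.") && !frame.startsWith("javax.")
                            && !frame.startsWith("sun.")) {
                        count(states, state, frame);
                        needSummary = false;
                    }
                } else if (line.length() == 0) {
                    // End of a thread block; threads with only JDK frames are
                    // lumped together as "System Thread".
                    if (needSummary && state != null) {
                        count(states, state, "System Thread");
                    }
                    state = null;
                    needSummary = false;
                }
            }
            in.close();
        }
        for (Map.Entry<String, Map<String, Integer>> e : states.entrySet()) {
            System.out.println("Threads in state " + e.getKey());
            int total = 0;
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                System.out.println("        " + f.getValue() + " threads running in " + f.getKey());
                total += f.getValue();
            }
            System.out.println("Total " + e.getKey() + " threads: " + total);
        }
    }

    private static void count(Map<String, Map<String, Integer>> states, String state, String frame) {
        Map<String, Integer> frames = states.get(state);
        if (frames == null) {
            frames = new HashMap<String, Integer>();
            states.put(state, frames);
        }
        Integer n = frames.get(frame);
        frames.put(frame, n == null ? 1 : n + 1);
    }
}

Run it as java MiniJStackParser jstack.out; like the real tool, you can list several snapshot files on the command line to aggregate them.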

41 threads are blocked by a lock. The summary method shown is the first non-JDK method in the stack trace; in this case it happens to be the glassfish EJBClassLoader.getResourceAsStream. Now we need to go look at the actual stack traces, search for that class/method, and see what resource the threads are blocked on. In this example, all the threads were blocked waiting to read the same Zip file (really a Jar file), and the stack traces for those threads show that all the calls came from instantiating a new SAX parser. If you didn't know, the SAX parser used by a particular application can be defined dynamically by listing the resource in the manifest file of the application's jar files, which means that the JDK must search the entire class path for those entries until it finds the one the application wants to use (or until it doesn't find anything and falls back to the system parser). But since reading the jar file requires a synchronization lock, all those threads trying to create a parser end up contending for the same lock, which is greatly hampering our application's throughput. It happens that you can set a system property to define the parser and hence avoid the dynamic lookup every time (-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl will always default to the JDK parser).
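
For illustration, here is a tiny, self-contained example of the same idea done programmatically rather than on the command line; the factory class named here is the JDK's built-in SAX parser factory, and the rest of the example (class name, main method) is hypothetical:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Pinning the SAX parser implementation up front so that
// SAXParserFactory.newInstance() does not scan the class path (and take the
// jar-file lock) on every call. Equivalent to passing the -D option above.
public class PinSaxParser {
    public static void main(String[] args) throws Exception {
        System.setProperty("javax.xml.parsers.SAXParserFactory",
                "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        System.out.println("Parser class: " + parser.getClass().getName());
    }
}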

But the larger point here is that when you see lots of threads blocked on a resource, that contention is throttling your throughput, and hence, whatever the resource is, you need to make changes to your configuration or application to avoid it.

What about the threads in notify? Those threads are waiting to be woken up. Usually they are threads in a pool waiting for notification that a task is ready (e.g., the getTask method above shows grizzly threads waiting for a request). System threads are doing things like RMI distributed GC or JMX monitoring -- you'll see them in the jstack output as threads that have only JDK classes in their stack.

But another problem creeps up in the threads waiting for I/O read -- these are threads that are doing a blocking I/O call (usually socketRead0 -- not accept, which will show up in the jstack parsing as waiting for I/O accept). This is something else that is hampering your throughput -- the thread is waiting for a backend resource to answer its request. Most often this is a database -- again, you can look at the full stack in the output to see exactly what the call is. Maybe your app is making the same call to a database; if that keeps showing up, it's time to optimize the SQL or tune the database so that call happens faster.

There are some other states in the output I haven't shown -- threads doing accept, or sleeping, or GC threads. But 95% of the time, if you aren't getting the throughput out of your server that you expect, it's because the threads are blocked on I/O read or on a lock.

There are two other things you can do with this tool. First, you can list multiple file names on the command line and the tool will aggregate them; so if you take a bunch of jstacks in a row, you can parse all of them and get a better idea of where you are blocked. Second, you can use the -name argument to limit the output to certain threads. For example, in glassfish the grizzly http-request-processing threads are named httpSSLWorkerThread-<PORT#>-<Thread#>. If you run the parser with -name WorkerThread-80, you'll get only those grizzly threads handling requests on port 80. [In glassfish v3, that naming is slightly different; it is presently http-listener-<listener#>-(<Thread#>), though it will likely be changed to the name of the thread pool instead of the name of the listener -- but it's pretty simple to look at the jstack output and figure out the naming convention in use.]

Finally, two caveats. The first is that jstack output is likely to change, and the parser will not be robust for all versions of the JDK (nor necessarily all JDK options; some GC options may dump GC thread output the tool is unprepared for), so it will likely need to be tweaked if you encounter a parsing error. The second is that a jstack is only a snapshot in time -- it is like a sampling profiler but with a very, very large interval. That means that sampling artifacts will affect your analysis, just as a profiler may point you to a parent or child method if the sampling rate is off a little bit. The fact that the JDK pauses threads only at certain locations to get their stack also affects that, particularly for running threads -- so a method that continually shows up in the running state isn't necessarily expensive; it may just be a convenient point for the JDK to get the trace. [It may be expensive as well, though -- we know from other analysis that working with stack traces, including throwing exceptions, is very expensive in glassfish due to its very deep stack depth, so it's not surprising in my example above to see lots of threads running in the getStackTraceElement method.] So this parsing is much more effective at finding things that are blocking your throughput than at finding runtime bottlenecks where you are using too much CPU.

Like all performance tools, jstack gives you one piece of information -- hopefully this will let you place that one piece of information into context with information from your other tools when it comes time to diagnose throughput bottlenecks.


Performance Stat of the Day Blog

Posted by sdo Feb 3, 2008
I've written several times before about how you have to measure performance to understand how you're doing -- and so here's my favorite performance stat of the day: New York 17, New England 14.

Grizzly Protocol Parsers Blog

Posted by sdo Dec 19, 2007

A Glassfish Tuning Primer Blog

Posted by sdo Dec 2, 2007

Last week, Sun published a new SPECjAppServer 2004 benchmark score: 8439.36 JOPS@Standard [1]. [I'd have written about it sooner, but it was published late Wednesday, and I had to go home and bake a lot of pies.] This is a "big" number, and frankly, it's the one thing that's been missing in our repertoire of submissions. We'd previously shown leading performance on a single chip, but workloads in general (and SPECjAppServer 2004 in particular) don't scale linearly as you increase the load. This number shows that we can scale our appserver across multiple nodes and machines quite well.

I've been asked quite a lot about what scalability actually means for this workload, so let me talk about Java EE scalability for a little bit. The first question I'm invariably asked is, isn't this just a case of throwing lots more hardware at the problem? Clearly, at a certain level the answer is yes: you can't do more work without more hardware. And I don't want to minimize the importance of the amount of hardware that you throw at the problem. There are presently two published SPECjAppServer scores that are higher than ours: HP/Oracle have results of 9459.19 JOPS@Standard [2] and 10519.43 JOPS@Standard [3]. Yet those results require 11 and 12 (respectively) appserver tier machines; our result uses only 6 appserver tier machines. More telling is that the database machine in our submission is a pretty beefy Sun Fire E6900 with 24 CPUs and 96GB of memory. Pretty beefy, that is, until you look at the HP/Oracle submissions that rely on 40 CPUs and 327GB of memory in two Superdome chassis. So yes, if you have millions (and I mean many millions -- ask your HP rep how much those two Superdomes will cost) of dollars to throw at the hardware, you can expect to get a quite high number on the benchmark.

The database, in fact, is one reason why most Java EE benchmarks (and workloads) will not scale linearly -- you can horizontally scale appserver tiers pretty well, but there is still only a single database that must handle an increasing load.

On the appserver side, horizontal scaling is not quite just a matter of throwing more hardware at the problem. SPECjAppServer 2004 is partitioned quite nicely: no failover between J2EE instances is required, connections to a particular instance are sticky, and the instances don't need to communicate with each other. All of that leads to quite nice linear scaling.

But one part of the benchmark doesn't scale linearly, because it is dependent on the size of the database. SPECjAppServer 2004 uses a bigger database for bigger configurations. For example, our previous submission on a single Sun Fire T2000 achieved a score of 883.66 JOPS@Standard [4]. The benchmark sizing rules meant that the database used for that configuration was only 10% as large as the database we used in our current submission. [More reason why that database scaling is important.] And in particular, it meant that the database in the small submission held 6000 items in the O_item table while our current submission had 60000 items in that table.

For SPECjAppServer 2004, that's important because the benchmark allows the appserver to cache that particular data in read-only, container-managed EJB 2.1 entities. [That's a feature that's explicitly outside of the J2EE 1.3/1.4 specification, so your portable J2EE apps won't use it -- your portable Java EE 5 apps that use JPA can use cached database data, though somewhat differently.] Caching 6K items is something a single instance can do, but caching all 60K items will cause GC issues for the appserver. Hence, in some areas, the appserver will have to do more work as the database size increases, even if the total load per appserver instance is the same.

So a "big" score on this benchmark is a function of two things: there are things within the appserver architecture that influence how well you will scale, even in a well-partitioned app; but the amount of hardware (and the cost of that hardware) remains the key driving factor in just how high that score can go. As I've stressed many times, benchmarks like this are a proof point: our previous numbers establish that we have quite excellent performance, and this number establishes that we can scale quite well. As always, the only relevant test remains your application: download the appserver now and see how well it responds to your requirements.

Finally, as always, some disclosures: SPEC and the benchmark name SPECjAppServer 2004 are registered trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of 11/26/07. For the latest SPECjAppServer 2004 benchmark results, visit http://www.spec.org/. Referenced scores:
[1] Six Sun SPARC Enterprise T5120 (6 chip, 48 cores) appservers and one Sun Fire E6900 (24 chips, 48 cores) database; 8,439.36 JOPS@Standard
[2] Eleven HP BL860c (22 chips, 44 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 9,459.19 JOPS@Standard
[3] Twelve HP BL860c (24 chips, 48 cores) appservers and two HP Superdomes (40 chips, 80 cores) database; 10,519.43 JOPS@Standard
[4] One Sun Fire T2000 (1 chip, 8 cores) appserver and one Sun Fire T2000 (1 chip, 6 cores) database; 883.66 JOPS@Standard


Sun Ships Glassfish V2 Blog

Posted by sdo Sep 17, 2007

Switching tracks Blog

Posted by sdo Jul 9, 2007
