Distribute, Detach, and Parallelize in Tomcat Blog

Version 2


    Applications are no longer monolithic, especially on distributed platforms like J2EE. Non-functional requirements (NFRs) compel the architecture to be distributed across three or more tiers, encompassing multiple nodes and the connectors between them. Each connector often adds a point of weakness to the system architecture as a whole; yet reducing the number of nodes eventually reaches a point where it is hard (if not impossible) to implement the system so that it meets its non-functional requirements. Further, there's the question, "How much of the processing can be done later?" The answer determines which pieces of the processing need to be done synchronously and which can be asynchronous. After all, why execute code sequentially when the hardware has multiple processors available for parallel computing?

    This article is going to discuss the above aspects in the context of a highly scalable J2EE architecture. The article accompanies a Reference Implementation (RI) for the architecture, which can be deployed and executed. Even though the RI is designed for and tested in the open source web container Tomcat, the same concepts can be adopted and systems designed for other J2EE containers.

    Getting Started

    Our aim here is to pick a subset of the concerns that arise during software system design and look at how we address them in a particular context. To do so, we first need to understand the system requirements, in just as much detail as the discussion needs.

    The existing system landscape is a typical e-commerce website with a medium to high level of traffic. The traffic pattern varies with the time of day, but the peak transaction requirement is 300 transactions per second (TPS). Each of these transactions ends by displaying a web page, as shown in Figure 1. These web pages consist of a single main section and a subsection. The subsection displays advertisement links that, when clicked, lead to offers, discounts, etc., related to the content of the main section. Currently, these links appear based on a random selection of keywords and/or phrases. When a user clicks one of the links, a separate request goes to a search engine service, which then displays all search results. The user can then click any of the links in these search results. Each click earns revenue for our e-commerce site.

    Figure 1. Existing web page sample representation

    The current e-commerce site is designed for legacy technology, and it is hard to do any new development in that environment. Hence, there is a very high probability that the entire site will be redesigned for a new technology environment in the future. But the immediate requirement is to introduce intelligent logic to display ad links, which are projected to improve revenue substantially. Moreover, any new effort spent in that direction should be completely re-usable in the future when the e-commerce site is redesigned. The "as-is" state of component interactions is shown in Figure 2.

    Figure 2. As-is system components

    Migration Strategy

    To answer the question of how to begin, we have the following constraints to take care of:

    • Changes to the existing system should be minimal.
    • Any major development should use current technology.
    • The new design should seamlessly integrate with the existing system.
    • The new system should be compatible with future changes to the system.
    • During migration, we should have a quick fallback strategy in case of a problem.
    • We should leverage open source tools as much as possible.

    Taking into consideration the above requirements, we made the following decisions:

    • Java EE will be the development/deployment environment, since EE is a proven, distributed development platform.
    • HTTP will be the protocol for the new system/component interface. HTTP is a proven, popular, character-oriented open protocol.
    • Tomcat will be the HTTP protocol interpreter. Tomcat is mature, open source, and provides for distributed architecture.
    • We will also leverage HTTP processor threads in Tomcat to meet the TPS requirements.
    • Initially, we decided to use an open source database, but quickly we realized that we need to rethink this choice due to volume requirements. We ultimately opted for Oracle as our database.
    • We will limit the changes to the existing system to a soft flip-flop switch, as shown in Figure 3. Using this switch, we can continue to use the existing random logic for link generation or we can use the intelligent logic based on revenue aspects.

    Figure 3. Migration strategy
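
    The soft flip-flop switch of Figure 3 could be sketched roughly as follows. This is only an illustration: `AdLinkSwitch`, `LinkGenerator`, and the two sample generators are hypothetical names, not taken from the existing site or from the RI. The idea is that the existing system calls through a single switch, which falls back to the random logic when the flag is off or when the new path fails.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class AdLinkSwitch {

    /** Hypothetical abstraction over "produce ad links for a page". */
    public interface LinkGenerator {
        List<String> linksFor(String pageContent);
    }

    private final LinkGenerator randomLogic;
    private final LinkGenerator intelligentLogic;
    // The "soft" part of the switch: flipped at run time, no redeployment needed
    private final AtomicBoolean useIntelligent = new AtomicBoolean(false);

    public AdLinkSwitch(LinkGenerator randomLogic, LinkGenerator intelligentLogic) {
        this.randomLogic = randomLogic;
        this.intelligentLogic = intelligentLogic;
    }

    public void flip(boolean intelligent) {
        useIntelligent.set(intelligent);
    }

    public List<String> linksFor(String pageContent) {
        if (useIntelligent.get()) {
            try {
                return intelligentLogic.linksFor(pageContent);
            } catch (RuntimeException e) {
                // Quick fallback: a failure in the new system degrades to the old behavior
            }
        }
        return randomLogic.linksFor(pageContent);
    }

    public static void main(String[] args) {
        AdLinkSwitch adLinks = new AdLinkSwitch(
            page -> Arrays.asList("random-offer"),
            page -> Arrays.asList("best-earning-offer"));
        System.out.println(adLinks.linksFor("cameras"));
        adLinks.flip(true);
        System.out.println(adLinks.linksFor("cameras"));
    }
}
```

    Keeping the fallback inside the switch means the existing site needs only one new call site, which is in line with the constraint that changes to the existing system be minimal.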

    The Going Gets Tough

    In order to introduce intelligent logic for ad-link generation, the system needs to keep track of two kinds of data at the minimum:

    • Impressions: The number of times a particular key word/phrase is displayed in the browser.
    • Clicks: The number of times a particular key word/phrase is clicked by the user.

    So the entire problem boils down to having the system keep track of two categories of events. The constraints listed above compel us to develop the new functionality as a block separate from the existing deployment. In other words, we are given the problem of creating a new system that will generate intelligent ad links. This new system is performance-critical: it must not add much latency to the main flow of the transaction, and it must serve 300 TPS. Performance can be measured in different ways; accesses per second and megabytes per second (MBPS) are just a few among them.

    After a proof of concept (POC), the decision was to host the deployment in a replicated farm. The volume of impression and click data collected by the system is so large that we have to aggregate around two million events in a batch, and that many events have to be inserted into and/or updated in the persistent store too.

    Distribute

    Load balancing is one of the goals of application replication. Replication can be done on the same hardware or across multiple pieces of hardware. Which architecture to choose is a matter of what we are trying to achieve. In general, the increasing scale and scope of the use of operating system and hardware resources drives these emerging requirements. Replicated systems designed to meet these requirements are likely to be structured around the answers to a few critical architectural factors:

    • How much memory footprint the application requires
    • How many threads the application components need
    • How many connections each application has to handle
    • How many system resources like file handles are required
    • How much parallelism is attainable in a single process
    • How many CPUs can be leveraged at the same time from within the same hardware
    • What are the synchronization primitives available in case of multiple processes and nodes
    • How we coordinate shared accesses and leases
    • The allowable number of nodes that a unit of application job can cross, taking into consideration the performance criterion

    Replication is a well-understood and very mature architectural choice, so much so that the addition of nodes is considered a no-risk proposition, provided we design for its negative side effects. We decided to have an application server farm with multiple nodes. In each node, we run a Tomcat web server hosting the HTTP component of our application. Most web applications can scale up in capacity and performance by adding more nodes with deployments, but the scaling factor may not be linear, due to LAN dependency and the fact that the load balancer has to distribute multiple TCP/IP connections to multiple nodes. For a 300 TPS requirement, each node has to handle multiple connections, and the Tomcat server at each node needs to be configured accordingly. This is done in the server.xml file, which is the main configuration file for Tomcat; a portion of it is shown below:

     <Server port="8005" shutdown="SHUTDOWN" debug="0">
       <Service name="Catalina">
         <Connector port="8080"
                    maxThreads="10"
                    minSpareThreads="5"
                    maxSpareThreads="5"
                    enableLookups="false"
                    redirectPort="8443"
                    acceptCount="100"
                    debug="0"
                    connectionTimeout="20000"
                    disableUploadTimeout="true" />
       </Service>
     </Server>

    acceptCount and connectionTimeout are the main connection-related attributes, and they are explained further below.

    • acceptCount: When all the HTTP processor threads in Tomcat are busy, client connections are queued up by Tomcat until a thread is available. The acceptCount attribute decides how many such connections can be queued up by Tomcat. If connections are still coming to Tomcat above and beyond this limit, clients will receive error messages.
    • connectionTimeout: Once a socket connection has been established between a client and Tomcat, there is a certain number of milliseconds before which the client has to send the request, which is controlled by the connectionTimeout attribute. After this time period, the connection will be closed.
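
    At the plain-socket level, these two attributes behave roughly like the listen backlog and the read timeout of a server socket. The sketch below is not Tomcat code; `ConnectionTimeoutSketch` and its method are hypothetical, and it assumes the loopback interface is available. It opens a server socket with a bounded backlog, accepts a client that never sends a request, and gives up on that client once the timeout expires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ConnectionTimeoutSketch {

    /**
     * Accepts one connection and waits up to timeoutMillis for its first byte.
     * Returns true if the client stayed silent and the read timed out.
     */
    public static boolean silentClientTimesOut(int backlog, int timeoutMillis) {
        // Port 0 lets the OS pick a free port; backlog bounds the pending-connection queue
        try (ServerSocket server = new ServerSocket(0, backlog, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(timeoutMillis);
            try {
                accepted.getInputStream().read();  // blocks waiting for the request
                return false;
            } catch (SocketTimeoutException e) {
                return true;                       // client was too slow; the socket is closed on exit
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Backlog of 100 mirrors acceptCount above; 200 ms keeps the demo short
        System.out.println(silentClientTimesOut(100, 200));
    }
}
```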

    Parallelize

    Once the application deployed in a single node is able to handle multiple connections, we can look at ways of processing these connections in parallel. This is achieved by configuring Tomcat to have multiple HTTP processor threads. Every single Tomcat instance is a separate Java process or Java virtual machine (JVM). Each Java virtual machine can support many threads of execution at once. These threads independently execute code that operates on values and objects residing in a shared main memory. Threads may be supported by having many hardware processors, by time-slicing a single hardware processor, or by time-slicing many hardware processors. In a Tomcat process, configuring and tuning the number of threads is done in server.xml, as shown in the code listing above. maxThreads, minSpareThreads, and maxSpareThreads are the three attributes through which we can control the amount of parallelism required inside a single Tomcat server, and they are explained below:

    • maxThreads: Tomcat uses a thread pool, and each request will be served by any idle thread in the thread pool. maxThreads decides the maximum number of threads that Tomcat can create to service requests.
    • minSpareThreads: When Tomcat is initially started, it may not create the configured maxThreads number of threads. Instead, it will create minSpareThreads and later, on an as-needed basis, it will create more threads until the number of threads reaches a maximum of maxThreads.
    • maxSpareThreads: During off-load times, Tomcat doesn't require many of the threads in the pool. maxSpareThreads is the maximum number of idle threads Tomcat will retain in the pool. If this number is exceeded, excess threads are de-referenced to allow garbage collection.
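
    The same grow-on-demand behavior can be observed with the JDK's own `ThreadPoolExecutor`. This is only an analogy, not Tomcat's pool: `PoolGrowthSketch` is a hypothetical name, the core size loosely plays the role of the floor the pool shrinks back to, and unlike Tomcat this pool starts with no threads at all. The sketch submits tasks that all block, so the pool has to create one thread per task until it reaches its maximum.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthSketch {

    /** Submits `busy` blocking tasks and returns how many threads the pool created for them. */
    public static int threadsCreatedFor(int minThreads, int maxThreads, int busy) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            minThreads, maxThreads,
            1, TimeUnit.SECONDS,                // idle threads above minThreads retire after 1 s
            new SynchronousQueue<Runnable>());  // no queueing: a fully busy pool must grow
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int n = 0; n < busy; n++) {
                pool.execute(() -> {
                    try {
                        release.await();        // simulate a request that is still in progress
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getPoolSize();
        } finally {
            release.countDown();                // let all simulated requests finish
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(threadsCreatedFor(5, 10, 3));
        System.out.println(threadsCreatedFor(5, 10, 8));
    }
}
```

    Asking for more concurrent tasks than `maxThreads` makes `execute` throw `RejectedExecutionException` in this sketch, which is the pool-level counterpart of the saturation discussed next.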

    For a given transactional requirement, an important decision is the number of threads to use in each Tomcat server. When all of the Tomcat threads are busy, subsequent transactions have to wait in a queue for a free thread. Thus we need to be aware of two aspects when deciding the number of Tomcat threads:

    • Software contention: This refers to the time that a transaction needs to wait for a free thread. This involves the time to wait and also the associated context switching overheads.
    • Physical contention: Refers to the time spent by a transaction waiting to use physical resources (e.g., disk, CPU, etc.).
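
    A first estimate of the thread count can be derived from Little's Law: the number of transactions in flight equals the arrival rate multiplied by the time each transaction spends in the system. The sketch below is a back-of-the-envelope aid, not a substitute for measurement. Only the 300 TPS figure comes from the requirements; the 200 ms response time and the four-node farm are assumed purely for illustration, and `ThreadSizingSketch` is a hypothetical name.

```java
public class ThreadSizingSketch {

    /**
     * Little's Law: in-flight transactions = arrival rate x time in system.
     * Dividing by the node count gives a starting point for maxThreads per node.
     */
    public static int threadsPerNode(int tps, int responseMillis, int nodes) {
        double inFlight = tps * responseMillis / 1000.0;
        return (int) Math.ceil(inFlight / nodes);
    }

    public static void main(String[] args) {
        // 300 TPS x 0.2 s = 60 transactions in flight, spread over 4 nodes
        System.out.println(threadsPerNode(300, 200, 4)); // prints 15
    }
}
```

    The estimate ignores both kinds of contention listed above, so the measured optimum is usually found by load testing around this value.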

    Figure 4. Transaction response time versus number of threads

    As Figure 4 shows, beyond a certain number of threads, adding more no longer increases the effective parallelism and therefore no longer improves the application response time. This is mainly due to multiple dependencies at the Java-runtime, operating-system, and hardware levels. "Programming Java Threads in the Real World" discusses the things you need to know to program threads in the real world. Today's hardware options are vast and have varying levels of capacity to support parallel operations; for example, the Sun Fire E25K Server speaks of supporting up to 72 UltraSPARC IV processors at 1.5GHz, which is really promising. The Sun Fire Capacity on Demand (COD) option provides the ability to acquire and activate CPU/memory resources on demand, on a permanent or temporary basis. We decided to use Dell PowerEdge machines with four CPUs each as our main power center.

    By limiting the number of socket connections between the client and the Tomcat server and the number of Tomcat threads, we can implement a type of congestion or admission control, which will help us to limit the number of transactions allowed into the system. An incoming transaction that finds maxThreads transactions in the system is blocked. The result of blocking is that the incoming transaction is either rejected or placed in a queue waiting to enter the server. Tomcat will indicate this state with the following log message:

    Dec 16, 2005 12:59:46 AM org.apache.tomcat.util.threads.ThreadPool logFull
    SEVERE: All threads (10) are currently busy, waiting. Increase maxThreads (10) or check the servlet status
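
    The admit, queue, or reject behavior described above can be reproduced in miniature with a bounded JDK thread pool. This is an analogy rather than Tomcat's implementation: `AdmissionControlSketch` is a hypothetical name, the pool size stands in for maxThreads, and the bounded queue stands in for acceptCount.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class AdmissionControlSketch {

    /** Offers `offered` long-running requests at once; returns how many were turned away. */
    public static int rejectedCount(int maxThreads, int acceptCount, int offered) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            maxThreads, maxThreads, 0, TimeUnit.SECONDS,
            new ArrayBlockingQueue<Runnable>(acceptCount)); // waiting room of fixed size
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int n = 0; n < offered; n++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await();    // request stays busy until released
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++;                 // all threads busy and the queue is full
                }
            }
        } finally {
            release.countDown();
            pool.shutdown();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 10 threads + 100 queue slots admit 110 of 115 simultaneous requests
        System.out.println(rejectedCount(10, 100, 115)); // prints 5
    }
}
```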

    The aim of parallelizing processing is to do more work in less time. For this, we need to reduce resource contention and the associated resource locking. This is especially true in situations where we need to access shared resources from multiple threads, like a database table row or a global read-write variable. In our architecture, multiple Tomcat threads collect ImpressionEvents and ClickEvents. Every time we need to regenerate ad links, we need to consider the total impressions and clicks for a particular link. This means that every Tomcat thread, in each request-response cycle, has to first aggregate the events for a particular link and, based on the new figures, recalculate the weight factor for each ad link. Every Tomcat request thread therefore has to synchronize in order to do the aggregation. But synchronization defeats one of our intentions behind parallelization. We are now at a point where we need to make some design decisions based on trade-offs.
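
    To make the cost concrete, here is a minimal sketch of the naive approach, in which every request thread goes through one synchronized aggregator. `EventAggregatorSketch` is a hypothetical class, and the article does not give the weight formula; the clicks-per-impression ratio used here is an assumption chosen only to have something to recalculate.

```java
import java.util.HashMap;
import java.util.Map;

public class EventAggregatorSketch {

    /** Running totals for one keyword or phrase. */
    static final class Totals {
        long impressions;
        long clicks;
    }

    private final Map<String, Totals> totalsByKeyword = new HashMap<String, Totals>();

    // synchronized: every request thread funnels through this one shared structure
    public synchronized void record(String keyword, boolean clicked) {
        Totals totals = totalsByKeyword.get(keyword);
        if (totals == null) {
            totals = new Totals();
            totalsByKeyword.put(keyword, totals);
        }
        if (clicked) {
            totals.clicks++;
        } else {
            totals.impressions++;
        }
    }

    /** Assumed weight: clicks per impression, or 0 if the keyword was never shown. */
    public synchronized double weight(String keyword) {
        Totals totals = totalsByKeyword.get(keyword);
        if (totals == null || totals.impressions == 0) {
            return 0.0;
        }
        return (double) totals.clicks / totals.impressions;
    }

    public static void main(String[] args) {
        EventAggregatorSketch aggregator = new EventAggregatorSketch();
        for (int n = 0; n < 4; n++) {
            aggregator.record("camera", false);  // four impressions
        }
        aggregator.record("camera", true);       // one click
        System.out.println(aggregator.weight("camera")); // prints 0.25
    }
}
```

    With one lock per request, the threads serialize on `record` exactly when the load is highest.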

    By relaxing the frequency at which we need to do aggregation of events, we can reduce the effect of synchronization. This is achieved by collecting events at the Tomcat-processor-thread level for a few requests. Once we reach a threshold number of events collected, or once a certain interval of time has elapsed, we can now do the aggregation. We now need a way to collect events at the thread level, and the best way to do this is to leverage ThreadLocal. The ThreadLocal class provides thread-local variables. We define a Map at the ThreadLocal level and then manage multiple thread-local variables (or resources) using this Map.

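     import java.util.ArrayList;
     import java.util.HashMap;
     import java.util.List;
     import java.util.Map;

     public class ThreadLocalCache {
         // The map keys and the flush interval are referenced but not defined in
         // the original listing; the values below are assumed.
         static final String EVENT_LIST = "eventList";
         static final String THREAD_LAST_BUSY_TIME = "threadLastBusyTime";
         static final String WORKER_NAME = "workerName";
         static final String THREADLOCAL_INSTANCE = "threadLocalInstance";
         static final long FLUSH_INTERVAL = 5000L; // milliseconds

         public volatile static int i = 1;

         private static ThreadLocal<Map<String, Object>> perThreadEventPool =
                 new ThreadLocal<Map<String, Object>>() {
             // Runs once per HTTP processor thread; synchronized guards the counter i.
             protected synchronized Map<String, Object> initialValue() {
                 Map<String, Object> perThreadMap = new HashMap<String, Object>();
                 perThreadMap.put(EVENT_LIST, new ArrayList<Object>());
                 PerThreadPushWorker perThreadPushWorker = new PerThreadPushWorker();
                 perThreadPushWorker.setParentsThreadLocalId(i);
                 perThreadPushWorker.start();
                 perThreadMap.put(THREAD_LAST_BUSY_TIME, System.currentTimeMillis());
                 perThreadMap.put(WORKER_NAME, perThreadPushWorker);
                 perThreadMap.put(THREADLOCAL_INSTANCE, Integer.valueOf(i++));
                 return perThreadMap;
             }
         };

         @SuppressWarnings("unchecked")
         private static List<Object> eventList() {
             return (List<Object>) perThreadEventPool.get().get(EVENT_LIST);
         }

         // Assumed helper (called but not shown in the original listing): copies
         // and empties the calling thread's event list.
         private static List<Object> getAndClearEvents() {
             List<Object> copy = new ArrayList<Object>(eventList());
             eventList().clear();
             return copy;
         }

         public static void handleEvent(List<?> events) {
             Map<String, Object> perThreadMap = perThreadEventPool.get();
             eventList().addAll(events);
             long lastBusyTime = (Long) perThreadMap.get(THREAD_LAST_BUSY_TIME);
             long timeNow = System.currentTimeMillis();
             if ((timeNow - lastBusyTime) > FLUSH_INTERVAL) {
                 PerThreadPushWorker perThreadPushWorker =
                         (PerThreadPushWorker) perThreadMap.get(WORKER_NAME);
                 // Hand over only when the worker has finished its previous batch.
                 if (perThreadPushWorker.isAvailable()) {
                     perThreadPushWorker.setEvents(getAndClearEvents());
                     synchronized (perThreadPushWorker) {
                         perThreadPushWorker.notify();
                     }
                     perThreadMap.put(THREAD_LAST_BUSY_TIME, timeNow);
                 }
             }
         }
     }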

    The InheritableThreadLocal class extends ThreadLocal to provide inheritance of values from parent thread to child thread, when a child thread is created. PerThreadPushWorker in the above code is a child thread of the Tomcat HTTP processor thread. Thus, every Tomcat HTTP processor thread will have a separate PerThreadPushWorker associated with it. The functionality of the PerThreadPushWorker is to collect events from a Tomcat HTTP processor thread (or ThreadLocal) and make them available at the Tomcat process level. Since we are not using InheritableThreadLocal, we need some other way to share thread-local variables across multiple threads. As is evident from the code of ThreadLocalCache, each Tomcat HTTP processor thread can obtain a reference to its associated PerThreadPushWorker and check whether the PerThreadPushWorker has finished its previous work. If it has finished, the HTTP processor thread gathers the next batch of events from its ThreadLocal, passes them to the PerThreadPushWorker, and triggers the PerThreadPushWorker to work again. The job of the PerThreadPushWorker is to access the ProcessRegistry and merge the new batch of events into the list in ProcessRegistry.
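
    The difference between the two classes is easy to see in isolation. In this small sketch (the class and variable names are made up for the demonstration), a parent thread sets both kinds of variable and then starts a child thread; only the inheritable value is visible in the child.

```java
public class InheritanceSketch {

    static final ThreadLocal<String> plain = new ThreadLocal<String>();
    static final InheritableThreadLocal<String> inheritable = new InheritableThreadLocal<String>();

    /** Returns what a newly created child thread sees: {plain value, inheritable value}. */
    public static String[] valuesSeenByChild() {
        plain.set("parent-plain");
        inheritable.set("parent-inheritable");
        final String[] seen = new String[2];
        // Inheritable values are copied to the child when the Thread object is created
        Thread child = new Thread(() -> {
            seen[0] = plain.get();
            seen[1] = inheritable.get();
        });
        child.start();
        try {
            child.join();   // join() also makes the child's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] seen = valuesSeenByChild();
        System.out.println(seen[0] + " / " + seen[1]); // prints null / parent-inheritable
    }
}
```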

    The ProcessRegistry is implemented based on the Singleton design pattern so that a maximum of one instance of the class exists per JVM. So any activity in ProcessRegistry will need to be synchronized, since this singleton class is shared between multiple PerThreadPushWorker instances. We don't want the HTTP processor threads of Tomcat to wait until this processing is completed, which is why we need PerThreadPushWorker itself in this architecture, so that the Tomcat HTTP processor thread can just trigger the PerThreadPushWorker to work in the background and return at once.
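
    The listing of ProcessRegistry itself is not reproduced in this article, so the following is only a guess at its minimal shape, deliberately named `ProcessRegistrySketch` to avoid confusion with the RI class. The `push` method matches the call made by PerThreadPushWorker; the `drain` method is hypothetical and stands for whatever hands the merged events on to the next stage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProcessRegistrySketch {

    // Eagerly created, so there is exactly one instance per class loader
    private static final ProcessRegistrySketch INSTANCE = new ProcessRegistrySketch();

    private List<Object> events = new ArrayList<Object>();

    private ProcessRegistrySketch() { }

    public static ProcessRegistrySketch getInstance() {
        return INSTANCE;
    }

    /** Called by every push worker; synchronized because the list is shared. */
    public synchronized void push(List<?> batch) {
        events.addAll(batch);
    }

    /** Hypothetical: hands over everything merged so far and starts a fresh list. */
    public synchronized List<Object> drain() {
        List<Object> drained = events;
        events = new ArrayList<Object>();
        return drained;
    }

    public static void main(String[] args) {
        ProcessRegistrySketch registry = ProcessRegistrySketch.getInstance();
        registry.push(Arrays.asList("impression", "impression"));
        registry.push(Arrays.asList("click"));
        System.out.println(registry.drain().size()); // prints 3
    }
}
```

    Swapping in a fresh list inside `drain` keeps the lock held only briefly, so push workers are not blocked while the drained batch is processed.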

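     import java.util.List;

     // WorkerStatus, Log, and ProcessRegistry are RI classes not shown here.
     public class PerThreadPushWorker extends Thread {
         private volatile WorkerStatus workerStatus = WorkerStatus.IDLE;
         private volatile List events;
         private volatile int threadLocalId;

         public void run() {
             while (true) {
                 doWork();
                 try {
                     synchronized (this) {
                         workerStatus = WorkerStatus.IDLE;
                         // Check for a pending batch under the lock, so a notify()
                         // sent before we start waiting is not lost.
                         while (events == null) {
                             wait();
                         }
                         workerStatus = WorkerStatus.WORKING;
                     }
                 } catch (InterruptedException interruptedException) {
                     interruptedException.printStackTrace();
                 }
             }
         }

         public void setEvents(List events) {
             this.events = events;
         }

         // Called by ThreadLocalCache; the setter is omitted from the original listing.
         public void setParentsThreadLocalId(int threadLocalId) {
             this.threadLocalId = threadLocalId;
         }

         private void doWork() {
             if (null != events) {
                 // Do some processing of events here.
                 ProcessRegistry.getInstance().push(events);
             } else {
                 Log.log("* == * PerThreadPushWorker.doWork : No events promoted to ProcessRegistry");
             }
             events = null;
         }

         public boolean isAvailable() {
             return (workerStatus == WorkerStatus.IDLE);
         }
     }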

    Detach

    The detach technique aims to introduce loose coupling between components. Loose coupling will change real-time transactions to near real-time. A similar concept has traditionally been accomplished by applications using messaging bridges between applications. Messaging makes applications loosely coupled by communicating asynchronously. Messaging applications transmit data through a message channel, a virtual pipe that connects a sender to a receiver. A messaging scenario with a single producer and multiple consumers is depicted in Figure 5.

    Figure 5. Messaging representation

    Even though we don't plan to use a full-fledged messaging infrastructure, we can design components to share data or events with each other loosely, in accordance with the principles of a loosely coupled messaging system. In the place of the virtual pipe, we can utilize a shared data or event sink. When we do that, we need to guard against data corruption, since multiple threads might be accessing the shared data structure concurrently. Java uses the synchronized keyword to indicate that only a single thread at a time can be executing in this or any other synchronized method of the object representing the monitor. Each Java monitor has a single nameless, anonymous condition variable on which a thread can wait(), or signal one waiting thread with notify(), or signal all waiting threads with notifyAll(). This condition variable corresponds to a lock on the object that must be obtained whenever a thread calls a synchronized method in the object. This sequence of steps is better depicted with the help of the diagram in Figure 6.

    Figure 6. Threads sharing resource
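
    A shared event sink guarded by a monitor can be sketched in a few lines. `EventSinkSketch` is a hypothetical class, not part of the RI; it shows a producer and a consumer coordinating through synchronized methods with wait() and notifyAll().

```java
import java.util.ArrayList;
import java.util.List;

public class EventSinkSketch {

    private final List<String> events = new ArrayList<String>();

    /** Producer side: add an event and wake any consumer blocked in take(). */
    public synchronized void put(String event) {
        events.add(event);
        notifyAll();
    }

    /** Consumer side: block until an event is available. */
    public synchronized String take() throws InterruptedException {
        while (events.isEmpty()) {   // loop guards against spurious wakeups
            wait();                  // releases the monitor while waiting
        }
        return events.remove(0);
    }

    /** Runs one consumer thread against a producer and returns what the consumer received. */
    public static List<String> demo(int count) {
        final EventSinkSketch sink = new EventSinkSketch();
        final List<String> received = new ArrayList<String>();
        Thread consumer = new Thread(() -> {
            try {
                for (int n = 0; n < count; n++) {
                    received.add(sink.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        for (int n = 0; n < count; n++) {
            sink.put("event-" + n);
        }
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo(3)); // prints [event-0, event-1, event-2]
    }
}
```

    The producer never waits for the consumer, which is the near-real-time decoupling the detach technique is after.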

    Reference Architecture

    The points discussed above are brought together in a reference architecture that solves the problem described in the "Getting Started" section. Distribution is realized by horizontally scaling the hardware, whereas we parallelize and detach the components at the software layer using appropriate design primitives. This leads us to the deployment architecture highlighted in Figure 7, which addresses the NFRs listed in the "Getting Started" section.

    Figure 7. Deployment architecture

    The reference architecture spans across the HTTP application server farm and JRMP business server blocks, shown with a blue tint in Figure 7. We have stubbed out all database interactions to make the architecture deployment simple. The major components and their structural relationship are shown in Figure 8. This will help the reader to better understand how the classes are arranged at code level; sample code is attached in the References section.

    Figure 8. RI component relationship

    Running the RI

    Deploying and executing the RI is a straightforward process. First download and unzip the .zip file containing the RI source from the "References" section to a convenient location. It will create a folder called DistributeDetachAndParalleliseInTomcatSrc. This folder will have a build.xml file, and this file depends on an environment variable called CATALINA_HOME on your system, which points to the folder where you have Tomcat unzipped. The build.xml file needs the Ant build tool. To execute the RI, follow the steps below:

    1. Open a command prompt, cd DistributeDetachAndParalleliseInTomcatSrc, and enter ant runrmi (Figure 12).
    2. Copy the web.war file from the DistributeDetachAndParalleliseInTomcatSrc\dist folder to the CATALINA_HOME\webapps folder, and start Tomcat (Figure 11).
    3. Open another command prompt, cd DistributeDetachAndParalleliseInTomcatSrc, and enter ant runclient (Figure 9).
    4. You can repeat step 3 above to have multiple HTTP clients send requests to Tomcat (Figure 10).
    5. To rebuild the RI, cd DistributeDetachAndParalleliseInTomcatSrc, and enter ant.

    Observe the console windows shown from Figure 9 through Figure 12 to visualize the RI dynamics. The deployment is horizontally scalable by including multiple Tomcat processes behind a load-balancing router, as shown in the deployment diagram.

    Figure 9. Simulated Web Server Client 1

    Figure 10. Simulated Web Server Client 2

    Figure 11. HTTP application server in Tomcat

    Figure 12. Java RMI business server

    Summary

    This article showcases how developers can leverage Java language primitives and harness the runtime and hardware resources. A scalable architecture can be deployed in a single node or in multiple nodes as and when the need arises. But for scaled-out deployments, we need to be careful to co-ordinate and synchronize shared resource access. Not all applications can be designed in a detached fashion, but wherever possible, this design proves to be a strong alternative to synchronous, real-time processing. Understanding the requirements and planning for a scalable architecture is an indispensable step for the success of scalable deployments, and this article shows how we analyzed and designed for the various pain points one by one.


    Acknowledgments

    I would like to express my gratitude to my technical advisor Bob Rudi, with whom I worked on architecting and designing the solution. Special thanks go to Duane Gearhart and Birenjith Sasidharan for their support in optimizing and performance tuning the application.