Metaphorical Introduction 

 

I find myself trying to multithread my life constantly. I've got so many things to do; surely there's a way I can multiplex the chores to get all of them done faster, right?

For example, I'll be brushing my teeth and realize I also need to comb my hair. I'm only using one hand to hold the toothbrush, so I reach for the comb with the other hand. Then I'll start combing my hair, at which point the other hand with the toothbrush stops moving, or (even worse) keeps moving, but in such as way that toothpaste starts running all over the place.

Now I've gone from two actions that would have been simple to perform serially to one interleaved complex action that's resulted in a toothpaste-and-spittle mess and an unruly mop on my head.

The problem here is that, despite the speed of our processors and the simplicity of our actions, some things in life simply cannot be multithreaded.

This is not to say that multithreading itself is unachievable or not worthwhile for some operations. As an example, consider breathing or blinking; if I had to stop any other action to make my eyes blink or to breathe, I'd never get anything done. Consider multiplexing actions while driving; if we couldn't do many things at once in this environment, we'd all be dead. (Of course, many people assume that this multiplexing-while- driving capability is universal and extends to complex conversations on phones while driving at 80 on the highway; I believe these folks will eventually be evolved out of our society, although they may take a few of us with them along the way).

So there are clearly some operations which are better done on separate threads; the trick is figuring out which ones are which if you don't want to be wiping up toothpaste off your clothes all the time. Or, in the case of your applications, if you don't want to be debugging thread deadlocks or performance problems due to thread abuse.

That reminds me; I was going to write about software in this article, not toothpaste and driving. I knew there was a reason I was posting this to a developer site...

Now, to get Technical

I've spent the better part of the last year thinking about multithreading problems, and I could probably spend the rest of my career doing the same (although it probably wouldn't be a long career since my head would pop off before too long). There are so many issues in this ugly space that to write about all of them would take too long. So for the purposes of a focused blog (although it's probably too late for that goal), let's just focus on one area of trouble for Java client developers: multi-threaded graphics.

When developers first discover the joys of multithreaded programming, it's like opening a new present on Christmas; the power of Java thread creation is that it is so easy to create and use new threads, that you can start taking advantage of multithreaded processing in a much easier fashion than you ever could before in previous programming languages. And the knowledge that your application can perform multiple tasks simultaneously, especially on systems with multiple processors or hyper-threaded cores, is huge. You no longer need to block on IO, waiting for the system to open a file. You no longer need to hang in your display code waiting for an image to downloaded over the network. You no longer need to hang the GUI while calculating a complex set of equations. In all of these cases, it is trivial to spawn a new thread to perform the bottlenecking operation, and use the result of that operation later as appropriate.

In fact, Java builds in much of this functionality for you, so you don't even need to do the work of multithreading some of your operations which may be blocking. For example, the old Toolkit Image loading methods automatically load images on a separate thread. The following code:

    Image img = Toolkit.getImage("duke.gif");

would return without having loaded the image. Instead, the method puts the request to load that image on a separate image-loading thread. Later, when you need to use img, you may need to make sure it's loaded first (see my blog on Image APIs for more information on this asynchronous behavior). 

"Well heck," the Happy Developer says. "If two threads make my application that much faster, imagine how fast it'll be with ten!"

For example, suppose a developer finds that one of the more costly operations in their application is some rendering operation, like drawing some complex Shape. Every frame they have to draw numShapes of these shapes, like so:

    // Assume myShapes[] and numShapes are initialized appropriately elsewhere...
    public void paintComponent(Graphics g) {
        for (int i = 0; i < numShapes; ++i) {
            g.draw(myShapes[i]);
        }
    }

"Golly!", the Happy Developer says, "What if I use that cool Thread mechanism to speed this up?! Then it'll go wayfaster." They might write something like the following:

    class ShapeRenderer implements Runnable {
        Graphics g;
        Shape s;
        ShapeRenderer(Graphics g, Shape s) {
            this.graphics = g;
            this.shape = s;
        }
        public void run() {
            graphics.draw(s);
        }
    }
    public void paintComponent(Graphics g) {
        for (int i = 0; i < numShapes; ++i) {
            Thread t = new Thread(new ShapeRenderer(g, myShapes[i]);
        t.start();
        }
    }

They compile the code, run it hopefully ... and discover that it didn't fix their performance problem. In fact, they discover that the app is actually much slower than the original code.

What gives?

There are actually several different factors that can contribute to the performance of this particular approach, ranging from things that add no benefit to those factors that actually make multi-threading this application slower. Let's go through some of these factors, one by one:

1) Thread/object creation overhead
Okay, so we've all heard the mantra that object creation and destruction (and the associated garbage collection process) is actually pretty fast. Memory allocation and destruction on current garbage collection systems is almost free, in fact; it costs only the instructions involved in moving a pointer around (memory is assigned by reserving space on a pre-allocated heap, and is destroyed by another pointer move. Of course, there are more details here about how garbage collectors, but suffice it to say that simple creation/destruction of temporary objects is very, very cheap).

However, that doesn't mean to say that creating temporary objects is actually free. In particular, it doesn't mean that you want to create and initialize objects in your inner loop if you don't have to. In the above example, we are not only asking for temporary memory for the Thread and ShapeRenderer objects (a cheap operation), but we are also asking that those objects get created and initialized, which may not be so cheap, depending on the complexity of the objects involved and whatever initialization process they need to go through. In this case, the creation of a thread will probably involve a fair amount of processing at either/both the Java and native levels in order to create the underlying thread object.

The Happy Developer, realizing this, will of course take steps to minimize the temporary object creation. In this situation, they may realize that since the same thing happens every time through the loop, there is no reason that the applications needs to create the Thread and ShapeRendering objects every time through; they can just incur the overhead of creation one time and then reuse these objects whenever we need them.

I won't bother with an example here, just picture a variation of the above where the Thread and the ShapeRender objects are created only once. Then, inside the paintComponent() method, we need only update the Graphics object of each ShapeRenderer and then tell each Thread to do its thing.

Once again, the Happy Developer (this time with a slightly less huge smile of anticipation on their face) awaits the stunning results ... but discovers that this new variation is stillworse than the original approach.

Things may be a bit better here than before; at least the application is not going through the contortions of creating and initializing the Thread and ShapeRenderer objects every time through the painting loop. But there's still something amiss in the messy threading details.

2) Thread Swapping

One of the hidden details of multithreaded programming is that the operating system has to go through a fair amount of work in order to run a separate thread. This is not much in the whole scheme of things (less than a millisecond, certainly), but it can add up when there are several threads involved.

For example, if you have ten threads all trying to do similar tasks at the same time and at the same priority, then the system will keep swapping the threads in and out trying to get the work done. This may not be as bad as swapping out each thread after just a couple of instructions in some round-robin fashion; we may get a fair chunk of work done in any given thread before we are swapped out. But the amount of work accomplished on that thread must be weighed against the work done to swap threads to know whether it was worthwhile having multiple threads to accomplish the task.

I wrote a test to see what thread-swapping overhead was. I ran in a tight loop, calling wait/notify to swap the threads back and forth. For 100,000 swaps, it took 1.3 seconds on that particular test system. This doesn't sound like much, but if you can imagine each thread trying to perform something simple like drawing a single line, the fact that we could only do 100,000 of these operations in that 1.3 seconds makes the thread swap overhead seem pretty significant. (For comparison purposes, I also timed calling a function 100,000 on the same system, which took only about 10 ms).

The cost of thread swapping overhead comes into play especially on systems where there are less computing resources available than there are threads than want those resources. This is an excellent, if obvious, segue into my next point...

3) Limited Thread Resources

The ideal case for multithreaded systems is having one processor per thread, or at least one processor available whenever a new thread needs processing power. For example, if you have a four CPU system and there are four threads all trying to run at the same time, they can each have a processor to themselves and get along swimmingly. There need be no thread-swapping overhead, as mentioned in point #2 above, because the threads do not have to be swapped out; they have full control over their CPU (at least while the process is running).

There are now hybrid systems where single chips can have multiple resources available for threads, such as the Hyper Threaded CPUs of various chips today. While these systems cannot dedicate an entire CPU to a particular thread, they have enough resources on any CPU to dedicate some of those resources to separate threads. There is still some overhead of thread swapping and contention here, but it is at least better than the single-CPU model.

The problem with the sample application above is that it does not take into account anything about the system when it creates a thread per Shape. Unless the application is running on a system that has as many CPUs (or at least an many thread resources, in the hyper threading case) as there are threads, then there is going to be contention in thread processing and thus overhead for swapping threads in and out.

Our Happy Developer might realize that the available thread processing resources could be a bottleneck. Suppose they know that, in general, their application will run on systems with at least 2 processors or one hyper-threaded processor, and they would still like to take advantage of that capability in multithreading their application. They may then change their app to use a model of exactly two rendering threads instead of the one-per-Shape model above. Now, instead of sending each Shape rendering operation through its own dedicated thread, it will queue up these operations on two separate threads. The code will be more complex for this approach, but still fairly straightforward. For brevity, I'll skip the example, but hopefully it's easy to picture this thread-sharing approach.

Again, the Happy Developer awaits patiently the results they know will stun the world. There is a slight faltering of their smile this time; they have met defeat too many times in the past. Still, they look forward to ... more failure.

Once again, the application fails to improve upon the original single-threaded approach. This time, they have eliminated much of the overhead in dealing with threads. And when running on a system than can process multiple threads simultaneously, they may even have eliminated thread-swapping overhead (or at least reduced it significantly).

Given the power of doing multiple tasks simultaneously, and the ability of the system to handle this simultaneity, what happened?

4) Graphics is Inherently Single-Threaded

Here's the sad reality of today's computing platforms; graphics hardware is inherently single-threaded.

This single-threaded approach is so ingrained in today's platforms that it is an assumption at the hardware, the driver, and even the API level (although some APIs are written to handle multi-threaded programming, they do not do a good job of compensating for the limitations below them and the hardware and driver level)..

While the hardware architectures, processors, operating systems, and languages have evolved to allow and encourage multi-threaded programming, the underlying graphics engines simply cannot do it.

This means that you may be able to easily write an application that performs graphics operations in multiple threads. And the system you are using (e.g., Java) may turn around and issue those graphics calls in separate threads. And the underlying systems that Java depends upon (e.g., X, DirectX, GDI, whatever) may be able to receive those calls from multiple threads. But when it finally gets down to the hardware, it has all been funneled through one pipe and there is no way to get any advantage by trying to use multiple threads at a higher level.

Just like the chain that is only as strong as its weakest link, an application is only as multithreaded as the systems it depends upon; in this case, if the graphics subsystem is single-threaded, then there is nothing you can do at the upper layers to make that system more multi-threaded-friendly.

Let's take a look at an example. I was working on this one just this week, playing around with various options in double-buffering. I wanted to play with the idea of writing to a buffer on one thread and copying that buffer to the onscreen window on a totally different thread. There were various reasons for this (and various possible gotchas), but it was at least worth an experiment.

I wrote my rendering loop to fill the buffer as fast as it could, with simple calls that mimicked a scrolling operation (copy part of the buffer to itself in a different location, fill in the rest with some color). After each operation, it would update a flag that told the system that the buffer contents were new (and should thus be copied to the onscreen window at some time).

I wrote my buffer-copying thread to occasionally wake up (every few milliseconds) and copy the buffer to the screen if the buffer had been changed since the last copy.

There are a few implementation details here (such as synchronizing on the update flag variable), but I have described the essential bits.

What I saw confused me at first. It ran pretty well on my development system. So I had someone else run it on their system. In that new environment, it basically froze the window for several seconds. It looked like we were not even getting our screen- updating loop, as if the system was not even kicking off the timer I had set.

In digging into it further, I found the artifact was more disturbing. We were being woken up correctly based on the timer, and were then attempting to update the screen. But the buffer-rendering loop had so completely filled the graphics pipeline with scroll/fill calls that we basically froze the system while those were being worked on by the underlying driver and hardware.

Here was a situation where:

  • the language supported the threading approach
  • the underlying subsystem (in this case, DirectDraw) supported rendering to/from multiple threads
  • the operating system supported multiple threads
BUT 
  • the graphics rendering system was created as inherently single-threaded, so the fact that one thread could so completely fill the graphics pipeline meant that future threads that the operating system wanted to swap in would simply have to wait for the original thread.

The upshot of this whole diatribe on the graphics subsystem is that there is basically no gain to be had in changing the original sample application in the way that we did; since the underlying system is inherently single-threaded, we have nothing to gain and everything to lose by introducing potential thread overhead when everything will just end up in a single thread in the final rendering step in the hardware.

Conclusions, Thoughts, Powertool Accidents

The point of this article was to raise some issues to be aware of in multithreaded programming. Note that I am specificallynot saying "Don't Do It!". There are lots of advantages to multithreaded approaches, some of them mentioned in the introduction above. For example, it is still a huge win to do time-consuming non-graphics tasks in a separate thread (such as IO or image loading) if you do not want to block the main thread (a big example in my desktop client world is the canonical "don't block the GUI thread" example; all blocking operations should happen elsewhere so that the user still sees a snappy GUI)..

And there are even cases where graphics operations can be effectively multi-threaded. For example, in a system that is using all software rendering (such as rendering destinations that are not hardware-accelerated, such as BufferedImage objects) running on systems that support multiple threads, there could be a huge win in having separate rendering threads. Imagine a multi-chip server that is producing separate images all in parallel, rendering to each with software loops, running each of those loops on separate processors; this is a major win for parallelism.

But what I am saying is: be aware of the issues in the platform and underlying systems when taking a multithreaded approach. And always test your application to see if you actually got the speedup you were anticipating.

Think of multithreaded programming kind of like a chainsaw. It can be an incredibly powerful tool that can dramatically reduce the time needed to perform some tasks. Or it can chop your hand off and cause a huge mess that's impossible to recover from. You need to know how to use it effectively to determine which one it will do for you.