If you are interested in hardware acceleration for Java2D on Windows, check out the latest bits on the Mustang site (http://mustang.dev.java.net). Dmitri Trembovetski has been working tirelessly to implement functionality similar to what Chris Campbell did with our OpenGL rendering pipeline, and it's pretty stunning. There is now (as of build 33) acceleration for everything from the standard image copies to translucent image operations to lines to transforms to complex clips to text (AA and non-AA).
Note: This rendering pipeline is disabled by default for now; there
are various issues we are working through to make this renderer as good
in quality as the default renderer. That quote from Spiderman comes to
mind: "With great power comes great responsibility." Except in our case,
the quote runs more like this: "With great power comes great driver quality
issues"; as we enable more features in Direct3D, we expose more quality and
robustness issues in graphics hardware and drivers that we need to work
around. This driver quality issue is a ripe topic for another article or more; suffice it to say that the hardware and driver manufacturers tend to have a lower bar for "quality" than people tend to expect for Java. To enable the Direct3D pipeline in the current Mustang builds, use the
I thought it might help to dive into one area of acceleration, to explain
why we're doing this, and what benefits you might expect to see.
In both the OpenGL and Direct3D pipelines, we accelerate text by caching
individual characters (glyphs) as tiles in a texture. The first time
you render a glyph (à la drawString()) to an accelerated
destination (such as the Swing back buffer), we will rasterize that glyph
into the texture and then execute a texture-mapped quad operation
to get that glyph into the right place. The next time you draw that
same glyph, we already have it cached and can simply perform the texturing operation.
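To make the caching idea concrete, here is a minimal sketch of a glyph tile cache. It is not the actual Java2D code: the names GlyphTileCache, Tile, lookup, and texCoords are hypothetical, and it assumes a square texture divided into fixed-size cells, which is simpler than what a real pipeline would do. The counter stands in for the rasterization step that only happens on a cache miss.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a glyph tile cache; not the real Java2D implementation. */
final class GlyphTileCache {
    /** Pixel position of one cached glyph tile inside the cache texture. */
    static final class Tile {
        final int x, y;
        Tile(int x, int y) { this.x = x; this.y = y; }
    }

    private final int atlasSize;  // assumed: texture is atlasSize x atlasSize pixels
    private final int tileSize;   // assumed: every glyph gets a tileSize x tileSize cell
    private final Map<Integer, Tile> tiles = new HashMap<>();
    private int rasterizeCount;   // number of cache misses so far

    GlyphTileCache(int atlasSize, int tileSize) {
        this.atlasSize = atlasSize;
        this.tileSize = tileSize;
    }

    /** Returns the tile for a glyph, assigning a new cell on first use. */
    Tile lookup(int glyphCode) {
        Tile tile = tiles.get(glyphCode);
        if (tile == null) {
            int tilesPerRow = atlasSize / tileSize;
            int index = tiles.size();
            if (index >= tilesPerRow * tilesPerRow) {
                throw new IllegalStateException("glyph cache texture is full");
            }
            tile = new Tile((index % tilesPerRow) * tileSize,
                            (index / tilesPerRow) * tileSize);
            // A real pipeline would rasterize the glyph into the texture here.
            rasterizeCount++;
            tiles.put(glyphCode, tile);
        }
        return tile;
    }

    /** Texture coordinates {u0, v0, u1, v1} in [0, 1] for the glyph's quad. */
    float[] texCoords(Tile t) {
        float s = atlasSize;
        return new float[] { t.x / s, t.y / s,
                             (t.x + tileSize) / s, (t.y + tileSize) / s };
    }

    int rasterizeCount() { return rasterizeCount; }
}
```

Looking up the same glyph twice returns the same tile and bumps the miss counter only once; every later draw of that glyph needs just the tile's texture coordinates and the quad's position on screen.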
It might not be obvious why this is a Good Thing; after all, doesn't it sound like a lot more effort to do a full-on 3D texture-mapped quad operation than to simply draw a few pixels for a character into a buffer? Yes ... and no. In terms of raw instructions executed, that's probably correct; rasterizing a glyph is a pretty simple operation. And we already have a software cache for glyphs, so all we really do on repeat operations is to copy the pixels down from that cache into the destination. Meanwhile, a texture-map operation requires possible setup of the rendering destination in Direct3D, possible transformation setup, creation of appropriate vertex and texture-coordinate information for the glyph quad, passing down the call to Direct3D, then the stuff that the Direct3D driver does before handing it off to hardware, which then rasterizes the textured quad. This definitely sounds like a whole pile of work...
But there are two keys here that make the performance win more understandable: VRAM and parallel processing.
VRAM: Using video memory is all a matter of getting better performance by locality of memory. Basically, things happen faster if they are located more closely together.
Let me try a sports analogy. This is a first for me; anyone that knows me would be shocked that I'd try this. Sports is one of those things that never really "took" with me. I'm apt to start talking about runs and goals and tackles in the same metaphor and the whole analogy would fall flat. But I like to try new things, so here goes:
Imagine a play in baseball (that's the one with hits and runs and outs, right?). Let's say that the batter hits a grounder that the fielders need to get to quickly to try to throw the player out at first. If one of the infielders can manage to get to the ball before it passes out of the infield, then they can wing it over to first base and have a hope of throwing the person out. But if the ball goes into the outfield, then whoever gets the ball has to throw the ball farther, and thus has less chance of throwing out the batter at first. Here we see the dynamic of locality; if the play can be kept completely within the infield, then there is a greater chance of making the out because the ball can travel much quicker to first base.
Whew! Okay, that was a (7th inning?) stretch, but I made it out the other side at least. Let's take this back into more familiar territory of computers.
The screen exists in video memory (that's where the data lives that the monitor inputs read from). The Swing back buffer (as of J2SE 1.4) also lives in video memory (I'm talking about Windows here, since this article is about our Direct3D pipeline; other platforms have different screen/buffer/rendering dynamics). This means fast copies from the back buffer to the screen; if they are both in VRAM, then the operation is going to happen faster. This is partly because the bits don't have to travel as far, but mostly because there is a faster data path from VRAM to VRAM than there is from system memory to VRAM; pixels don't need to go through the CPU or over the PCI/AGP/PCI-Express bus, they just go through the faster/wider video card bus.
(Note: The observant reader may notice that my baseball analogy breaks down here somewhat. VRAM operations are not faster just because of locality, but also because there is a faster path for local data. If I were to overload the analogy to account for this, it would be as if the infield players were the really good players on the team that could throw a whole lot faster than the outfielders. This is maybe not too far off-base; when I played little league it was certainly the case that the person playing right field (that'd be me) was far slower and less capable than the people closer to the batter.)
The dynamic between the back buffer and the screen also applies to operations going to the back buffer itself; anything that can happen from VRAM to that back buffer has the advantages of locality and a faster/wider data path. In the case of texture-mapping operations, it may be that there is more happening to copy each individual pixel into place, but these pixels are being copied from a better location (VRAM) to the back buffer than the previous approach of rasterizing or copying from system memory to the back buffer.
Parallel Processing: Another important factor here that makes all of this possible is that the graphics chip is a completely separate processor. So when we're talking about the work involved in rasterizing a texture-mapped quad, this is all happening on the GPU, not on the CPU with the rest of the Java software stack. In addition to running in parallel, the GPU is also highly tuned for doing these sorts of operations, so it can probably do a much better/faster job of them than the CPU could.
I could try to overextend the strained baseball analogy here, where the fielders operate asynchronously to the pitcher, but that would probably result in the next play starting while the current play was still happening. Baseball is confusing enough without throwing multi-threading into the mix.
Between these two factors, using data in VRAM and using the capabilities of the GPU, it is no longer the case that more complicated operations necessarily result in slower performance.
Another side benefit of this approach is that more interesting text approaches, such as anti-aliasing, can be supported with basically no additional performance hit. Typically, in a software rendering solution, text anti-aliasing causes a significant performance hit. This is because of the increased amount of stuff happening to rasterize these characters; there is now a read from the destination pixel and a blending operation to get the smooth edges of each glyph. Beyond the extra calculations involved here, that simple read can be quite expensive, especially when the destination is in VRAM. Graphics chips are really good at doing things in VRAM. They are pretty good at doing things from the CPU down into VRAM. But they really stink at doing things from VRAM to the CPU; the read speed of VRAM is really abysmal. So if a software rasterizer must read from VRAM in order to draw an anti-aliased glyph, performance will usually suffer.
But with the texture-mapped quad approach to text rendering, there is basically no extra work going on when the glyphs are translucent. The same operations occur under the hood, but now they are all happening on the GPU and in VRAM, which have all the benefits so eloquently and inappropriately laid out in the baseball analogy above.
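For the curious, the read-and-blend step that makes software anti-aliasing expensive looks roughly like the sketch below. This is an illustration only, not the Java2D loops: the class AaBlend and its methods are hypothetical names, and it assumes 8-bit coverage values and packed 0xRRGGBB pixels. Note that every call needs the destination pixel as an input, which is exactly the read that hurts when the destination lives in VRAM.

```java
/** Hypothetical sketch of per-pixel blending for an anti-aliased glyph. */
final class AaBlend {
    /**
     * Blends one 8-bit color channel. coverage is how much of the pixel
     * the glyph covers, from 0 (none) to 255 (all).
     */
    static int blendChannel(int src, int dst, int coverage) {
        // Weighted average of text color and destination, rounded to nearest.
        return (src * coverage + dst * (255 - coverage) + 127) / 255;
    }

    /** Blends a packed 0xRRGGBB text color over a destination pixel. */
    static int blendRgb(int srcRgb, int dstRgb, int coverage) {
        int r = blendChannel((srcRgb >> 16) & 0xFF, (dstRgb >> 16) & 0xFF, coverage);
        int g = blendChannel((srcRgb >> 8) & 0xFF, (dstRgb >> 8) & 0xFF, coverage);
        int b = blendChannel(srcRgb & 0xFF, dstRgb & 0xFF, coverage);
        return (r << 16) | (g << 8) | b;
    }
}
```

A fully covered pixel simply takes the text color and an uncovered pixel keeps the destination; it is only the partially covered edge pixels that mix the two, and those are the ones that force the read.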
So enough about the low-level details. Download the bits, try them out, let
us know what you find. We are continuing to work on it (various performance,
quality, and robustness issues) and will enable Direct3D rendering by default
when we are confident that this rendering pipeline is at least as good
as the default one. In the meantime, you can force it on by using the
Dmitri has just informed me of three bugs that are currently being fixed on our side (not driver issues, actual implementation bugs if you can believe it):
- 6255408: PIT: D3D: Animation freezes when pushing the console to FS mode and restoring it, WinXP
- 6255346: PIT: D3D: VolatileImage is distorted when lowering the color depth at runtime
- 6255836: PIT: ClassCastException thrown when ALT+TABing a FullScreen-page flipping app, Win32
In addition, the pipeline may not get enabled (even when you force it on) in 16-bit color depth; some graphics chips (such as the GeForce MX products from nVidia) have hardware limitations that force us to back off of acceleration in that depth.
If you do see any "issues" on your system, let us know. Be sure to tell us your platform details (especially your OS, graphics chip, resolution, bit depth, and driver version) so that we can chase down and fix the problems.