I am finishing the code samples for my book “Scala for the Impatient”. (Yes, for those of you who are impatiently awaiting it—the end is near. Very near.)

In the XML chapter, I started an example with

val doc = XML.load("http://horstmann.com/index.html")
doc \ "body" \ "_" \ "li"

It took several minutes for the file to load. What gives? My network connection wasn't that slow. And neither is the Scala XML parser—it just calls the SAX parser that comes with the JDK.

The problem is DTD resolution. The file starts out with

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

So, the parser feels compelled to fetch http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd, and rightly so, because it needs to be able to resolve entities such as &auml; in the file.
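To make the mechanism concrete, here is a minimal sketch of what "resolving the DTD" means at the SAX level. The class name LocalDtdDemo and the one-line stand-in DTD are my own inventions for illustration; a real cache would hand back the complete xhtml1-strict.dtd. The parser asks the handler for the external DTD, and the handler answers from memory, so nothing is fetched from the W3C.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: supplies the DTD from memory instead of the network.
class LocalDtdDemo {
    // Stand-in for a cached DTD (assumption: only the one entity we need).
    static final String TINY_DTD = "<!ENTITY auml \"&#228;\">";

    static final String SAMPLE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
        + "<html><body>K&auml;se</body></html>";

    // Parses the document and returns all of its character data.
    static String parseText(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // The parser calls this when it wants the external DTD.
                InputSource local = new InputSource(new StringReader(TINY_DTD));
                local.setSystemId(systemId);
                return local;
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseText(SAMPLE));
    }
}
```

The &auml; reference only expands because the parser saw the entity declaration in the DTD it was handed. Take the resolver away and the parser goes to the network for it.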

Except, the W3C hates it when people fetch that file, and rightly so—they shouldn't have to serve it up by the billions. It should be up to the platform to cache commonly used DTDs.

My platform, Ubuntu Linux, happens to have a perfectly good infrastructure for caching DTDs. Schema files too. There is a file /etc/xml/catalog that maps public ID prefixes to other catalog files. For example, the prefix "-//W3C//DTD XHTML 1.0" is mapped to /etc/xml/w3c-dtd-xhtml.xml, which maps "-//W3C//DTD XHTML 1.0 Strict//EN" to /usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which maps to the final destination, xhtml1-strict.dtd. I am pretty sure this is the same on other Linux systems too.
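The delegation chain is easier to follow as a lookup loop. The following is a toy model only: the class CatalogChainSketch is hypothetical, the catalogs are hard-coded maps rather than parsed files, and every level uses plain prefix matching, which is a simplification of what real catalog files do. The paths are the ones from my Ubuntu system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of catalog delegation; it does not read real catalog files.
class CatalogChainSketch {
    static final String ROOT = "/etc/xml/catalog";

    // catalog file -> (public ID or prefix -> next catalog or final target)
    static final Map<String, Map<String, String>> CATALOGS = new LinkedHashMap<>();

    static {
        Map<String, String> root = new LinkedHashMap<>();
        root.put("-//W3C//DTD XHTML 1.0", "/etc/xml/w3c-dtd-xhtml.xml");
        CATALOGS.put(ROOT, root);

        Map<String, String> w3c = new LinkedHashMap<>();
        w3c.put("-//W3C//DTD XHTML 1.0 Strict//EN",
                "/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml");
        CATALOGS.put("/etc/xml/w3c-dtd-xhtml.xml", w3c);

        Map<String, String> xhtml = new LinkedHashMap<>();
        xhtml.put("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd");
        CATALOGS.put("/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml", xhtml);
    }

    // Follows the chain from the root catalog; returns null if no entry matches.
    static String resolve(String publicId) {
        String current = ROOT;
        while (CATALOGS.containsKey(current)) {
            String next = null;
            for (Map.Entry<String, String> e : CATALOGS.get(current).entrySet()) {
                if (publicId.startsWith(e.getKey())) {
                    next = e.getValue();
                    break;
                }
            }
            if (next == null) return null; // dead end: nothing cached for this ID
            current = next;                // either another catalog or the DTD itself
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(resolve("-//W3C//DTD XHTML 1.0 Strict//EN"));
    }
}
```

The point is that the public ID alone is enough to find the cached file. The system ID, with its URL, never needs to be touched.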

So, of course the JDK takes advantage of this infrastructure, right? No—or I wouldn't have had the problem that I described.  Here is what I had to do to make it work.

The JDK takes its SAX implementation from Apache, and Apache has a CatalogResolver class. The JDK has it too, well-hidden at com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver. OK, let's use it and delegate to it in the regular SAX handler.

import java.net.*;
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import com.sun.org.apache.xml.internal.resolver.tools.*;

public class SAXTest {
   public static void main(String[] args) throws Exception {
      final CatalogResolver catalogResolver = new CatalogResolver();
      DefaultHandler handler = new DefaultHandler() {
         public InputSource resolveEntity(String publicId, String systemId) {
            return catalogResolver.resolveEntity(publicId, systemId);
         }
         public void startElement(String namespaceURI, String lname, String qname,
               Attributes attrs) { // the stuff you'd normally do
            if (lname.equals("a") && attrs != null) {
               for (int i = 0; i < attrs.getLength(); i++) {
                  String aname = attrs.getLocalName(i);
                  if (aname.equals("href")) System.out.println(attrs.getValue(i));
               }
            }
         }
      };

      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      SAXParser saxParser = factory.newSAXParser();
      String url = args.length == 0 ? "http://horstmann.com/index.html" : args[0];
      saxParser.parse(new URL(url).openStream(), handler);
   }
}

Does it work? No. The compiler complains that there is no package com.sun.org.apache.xml.internal.resolver.tools. That's bull:

jar tvf /path/to/jdk1.7.0/jre/lib/rt.jar | grep /CatalogResolver
  6757 Mon Jun 27 00:45:14 PDT 2011 com/sun/org/apache/xml/internal/resolver/tools/CatalogResolver.class

Take this, Java:

javac -cp .:/path/to/jdk1.7.0/jre/lib/rt.jar SAXTest.java

It compiles. It runs. (As an aside, this is pretty weird. I didn't realize that the compiler excludes some classes from rt.jar.)

Does it work? No. But there is a useful warning: Cannot find CatalogManager.properties. That's the final missing step. Create a file CatalogManager.properties with the entry

catalogs=/etc/xml/catalog

and put it somewhere on the class path. (No, /path/to/jdk/jre/lib/ext doesn't work, which probably isn't a bad thing.) Or start your app with

java -Dxml.catalog.files=/etc/xml/catalog SAXTest

Did it work? No. It turns out that Linux isn't all that perfect in its XML catalog infrastructure. The catalog.xml file itself has a DTD, like this:

<!DOCTYPE catalog PUBLIC "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
    "http://globaltranscorp.org/oasis/catalog/xml/tr9401.dtd">

globaltranscorp.org is no longer, so downloading the DTD is futile. But wait—don't we have a perfectly good mechanism for using the public ID and locating the cached copy? The Ubuntu folks put the blame on Apache, and I am inclined to agree with them.

Anyway, the fix is to replace the system ID with "/usr/share/xml/schema/xml-core/tr9401.dtd".

Now it works. But it's ugly. Why can't it work by default? Or at least by default when -Dxml.catalog.files is set?

BTW, I am aware that I can get a CatalogManager implementation from Apache, and that it will likely work fine when mixed with the Java XML implementation. I just feel that I shouldn't have to do that.

What about other platforms? On the Mac, I found a catalog file at /opt/local/etc/xml. It only had a few Docbook DTDs, not XHTML. I don't know how you add to it (except, of course, manually). In Ubuntu, it's sudo apt-get install w3c-dtd-xhtml. How about Windows? I hope that some of you can tell me.

In Scala, it's a little messier to use the catalog resolver since the parser installs its own SAX handler.  The following works:

import xml._
import java.net._

object Main extends App {
  System.setProperty("xml.catalog.files", "/etc/xml/catalog")

  val res = new com.sun.org.apache.xml.internal.resolver.tools.CatalogResolver

  val loader = new factory.XMLLoader[Elem] {
    override def adapter = new parsing.NoBindingFactoryAdapter() {
      override def resolveEntity(publicId: String, systemId: String) = {
        res.resolveEntity(publicId, systemId) 
      }
    }
  }

  val doc = loader.load(new URL("http://horstmann.com/index.html"))
  println(doc);
}

Don't ask. This doesn't use the documented API, just what I gleaned from reading the source.

Scala users have an alternative parser, ConstructingParser. Does it resolve entities? Nope. It replaces them with useless comments <!-- unknown entity auml; -->. Don't ask.

Overall, this is enough to make grown men cry. In my Google searches, I ran across a good number of apps that maintained their own catalog infrastructure. Caching these DTDs isn't something that every app should have to reinvent. The blame falls squarely on the Java platform here. (In Linux, there are C++ based tools that have no trouble with any of this.) Java should support the catalog infrastructure where it exists, and allow users to manually manage the catalogs and communicate the location with a global setting, not something on the classpath or the command line.

Java has no operator overloading. I always thought that was a shame. For example, BigDecimal would be a lot more popular if you could write a * b instead of a.multiply(b).
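For anyone who hasn't had the pleasure, here is a small sketch of what BigDecimal arithmetic looks like. The class and method names are made up for the illustration. The expression being computed is just price * quantity + shipping.

```java
import java.math.BigDecimal;

// Hypothetical helper showing method-call arithmetic with BigDecimal.
class BigDecimalVerbosity {
    // With operators this would read: price * quantity + shipping
    static BigDecimal total(BigDecimal price, BigDecimal quantity, BigDecimal shipping) {
        return price.multiply(quantity).add(shipping);
    }

    public static void main(String[] args) {
        System.out.println(total(new BigDecimal("1.10"),
                                 new BigDecimal("3"),
                                 new BigDecimal("0.05")));
    }
}
```

Note that the method chain hides operator precedence. With longer formulas, you have to work out the nesting by hand.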

Why doesn't Java have operator overloading? Well, C++ has it, and people kept saying that it makes code hard to read. Actually, in C++, you can have a library class vector and write v[i], thanks to operator overloading. In Java, we have unsightly v.get(i) and v.set(i, newValue). Easier to read? I think not.
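Here is a sketch of the difference, with a hypothetical helper method of my own. In C++, the body would be the single statement v[i] = v[i] + v[j].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: element access in Java goes through get and set.
class ListAccessSketch {
    // Java spelling of the C++ statement v[i] = v[i] + v[j]
    static void addInto(List<Integer> v, int i, int j) {
        v.set(i, v.get(i) + v.get(j));
    }

    public static void main(String[] args) {
        List<Integer> v = new ArrayList<>(Arrays.asList(1, 2, 3));
        addInto(v, 0, 2);
        System.out.println(v);
    }
}
```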

Of course, someone somewhere out there abused operator overloading in C++, doing something silly like overloading % for computing percentages. The horror. It's actually pretty hard to do much abuse in C++ because you can only overload the standard operators. Personally, I never ran into anything scary.

In Scala, on the other hand, you can define any operators you like. If you want to check for five star hotels, you can define a predicate *****. Unicode is fair game too: fred ♥ wilma.

Scala detractors foam at the mouth when they see

(1 /: (2 to 10)) (_ * _)

Actually, that's unfair. Let's write it without operators:

2.to(10).foldLeft(1)((x, y) => x * y)

It still looks like magic if you don't know foldLeft. And if you do, the operator version is simpler.

(BTW, this computes 1 * 2 * 3 * 4 * ... * 10.)
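For comparison, here is roughly the same fold in Java, assuming Java 8 streams. The class and method names are mine. reduce is not strictly a left fold, but for integer multiplication the result is the same.

```java
import java.util.stream.IntStream;

// Hypothetical helper: the product from..to as a reduction starting at 1.
class FoldSketch {
    static int product(int from, int to) {
        // 1 is the starting value, (x, y) -> x * y combines it with each element
        return IntStream.rangeClosed(from, to).reduce(1, (x, y) -> x * y);
    }

    public static void main(String[] args) {
        System.out.println(product(2, 10)); // 3628800, i.e. 10!
    }
}
```

No operators in sight, and you still need to know what reduce does.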

So, I firmly believed that operators are a good thing when used with restraint.

I still believe that, but I came to realize that restraint is harder than I thought.

Some fellow had kvetched how the Scala collections library has too many crazy operators. And I remembered some discomfort when I wrote the chapter on collections for “Scala for the Impatient”. I put together a table with all operators for adding or removing elements, and it did seem a bit of a mess. I pointed this out on the Scala mailing list, and Martin Odersky asked what I suggested to fix it.

Well, fixing inconsistencies happens to be my forte, so I went right at it.

For starters, we have the following:

coll :+ elem // Makes a new collection, appending elem after coll
elem +: coll // The same, but elem gets prepended

It's a nifty trick in Scala that an operator ending in a colon is right-associative, and elem +: coll is really the same as invoking the +: method on coll.

You can also insert elements in bulk:

coll ++ coll2
coll2 ++: coll

Did you notice the asymmetry? Why isn't it :++ for the first one? I pointed that out and was told that ++ is prettier and shorter.

What if the collection is a set? Then there is no intrinsic ordering, so it seems silly to distinguish between appending and prepending. You just write

set + elem

It gets a bit confusing when elem happens to be a string.

Set("Fred") + "Wilma" // Set("Fred", "Wilma")
Set(42) + "Wilma" // "Set(42)Wilma"
Set(42) ++ Set("Wilma") // Set(42, "Wilma")

In the second case, it doesn't form a Set[Any] with 42 and "Wilma", but it coerces Set(42) to a string and concatenates the two. Ouch.
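Java's + does the same kind of thing whenever one operand is a string, which is presumably the behavior that the Scala example mirrors. A minimal sketch, with a made-up class name:

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical demo: + with a string operand converts the other side to a string.
class PlusCoercionSketch {
    static String concat() {
        Set<Integer> set = Collections.singleton(42);
        // The set is converted with toString(), then the two strings are joined.
        return set + "Wilma";
    }

    public static void main(String[] args) {
        System.out.println(concat()); // [42]Wilma
    }
}
```

Nothing is added to the set. The compiler accepts it without complaint.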

So far, these operators return a new collection, leaving the original unchanged. That's the functional way, and it's often good. But sometimes you want to mutate a collection. For example,

buffer += elem
buffer ++= coll

Did you notice the inconsistency? To append without mutating, it's buffer :+ elem, so for consistency's sake, it should be

buffer :+= elem

I asked why it wasn't so and was told that += is prettier and shorter. Yes, it is, but actually there is a :+=, because Scala always synthesizes an op= from any operator. And it's subtly different from +=.

What about prepend-and-mutate? The operators must have a colon at the back, so they are +=: and ++=:. Why not =+: and =++:? Gentle reader, if at this point you lost the will to live, I hardly blame you. 

Just one more thing before I come to my conclusion.

For lists, the prepend operator is ::, because, well, it's always been so. For example, 1 :: 2 :: 3 :: Nil makes List(1, 2, 3). And we don't want to change it to +: because, well, it's never been that. And we don't want to replace +: with :: because then we lose the beautiful symmetry with :+, even though we don't care about that symmetry when it comes to :++ or :+=.

No, I couldn't come up with a fix that was consistent, pretty, and compatible with the past. But I learned something from the process.

  1. Don't mess with +. String concatenation has ruined it for the rest of us.
  2. When your operator starts looking like Morse code, give up. You don't have to have an operator for everything.
  3. Use asymmetric operators for asymmetric operations. :: for cons, or | for shell pipes, are bad role models.
  4. You have one chance. Operators are powerful stuff. Once people are used to :: or | or !=, they will refuse to switch.

Is Scala wrong to have operator overloading? No, on the contrary. Operators are incredibly useful. They are just really difficult to get right.

We know this from mathematics, where of course operators abound because they are so useful. New operators get created all the time, and many of them sink into the obscurity that they richly deserve (such as Newton's fluxion notation). Nevertheless, some awful operators survive. Consider derivatives. Input: A function. Output: Another function. We have two operators: f' and df/dx. The first is inadequate and the second is cumbersome. As an undergraduate, I had a textbook that bravely soldiered on with D_x f, which made a lot more sense, but it was too far from the mainstream.

In summary, operator overloading is neither a mistake nor a panacea. I want it and I want people to use it wisely. As I learned, that's harder than it appears.