I've been too busy to blog for quite some time, but today something happened that seemed strange enough to break my silence. A student came to me with a Java source file that the grading script rejected. We looked at it and couldn't figure out why. I unearthed the error message:
MergeSorter.java:1: error: illegal character: \65279
import java.util.Random;
1 error

Huh? What's \65279? Why the backslash? I didn't even know what notation that is. I looked at the file with Emacs hexl-mode and saw that the first three bytes were hex EF BB BF. In all these years, I had never seen that, but Google set me straight. It's the Unicode byte order mark or BOM. I asked the student what editor he had used to produce this file. Sure enough, it was Notepad. Of course. If I had the power to eradicate one program from the face of the earth, it surely would be Notepad.


Just in case you haven't been down this particular rathole before, here's a refresher on the BOM. At one point in time, Unicode fit into 16 bits, and it seemed attractive to encode it with fixed-width 16-bit quantities. For example, an uppercase A is hexadecimal 0041, so you have one byte of 00 and one byte of 41. Or do you? On a little-endian platform such as Intel, it would be more convenient to have a byte of 41 followed by a byte of 00. Rather than lamely settling on either little-endian or big-endian encoding, Unicode gives a much more interesting choice. Your file can start out with the byte order mark, hexadecimal FEFF. If it shows up as FE FF when reading a byte at a time, the data is big-endian, and if it shows up as FF FE, it's little-endian.
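Here is a minimal sketch of that byte-order check in Java. The class and method names (Utf16BomSniffer, detect) are made up for illustration, and the sketch only looks at the two UTF-16 marks described above. It ignores everything else, including the fact that FF FE followed by 00 00 could also be a UTF-32 mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16BomSniffer {
    /** Returns the UTF-16 flavor indicated by a leading BOM, or null if there is none. */
    static Charset detect(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF; // mask to get an unsigned byte value
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return null; // no UTF-16 BOM; the caller has to guess some other way
    }

    public static void main(String[] args) {
        // An uppercase A (hex 0041) preceded by a BOM, in both byte orders
        byte[] big = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        byte[] little = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
        System.out.println(detect(big));    // UTF-16BE
        System.out.println(detect(little)); // UTF-16LE
    }
}
```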


But UTF-16 is so last millennium. Now Unicode has grown to 21 bits. While one could theoretically encode it fixed-length with 3-byte or 4-byte values, just about everyone uses the more economical UTF-8 instead. That's a variable-length encoding. 7-bit ASCII is embedded as 0bbbbbbb, where each b is a bit. Then we have a bunch of two-byte codes of the form 110bbbbb 10bbbbbb, followed by three-byte codes 1110bbbb 10bbbbbb 10bbbbbb, and so on. EF BB BF happens to be the three-byte encoding of the BOM. Work it out for yourself as an exercise! And, by the way, the decimal value is 65279.
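If you'd rather let the computer do the exercise, here is a small sketch that packs a code point into the three-byte pattern by hand and compares the result with what Java's own UTF-8 encoder produces. The names Utf8ByHand, encodeThreeBytes, and hex are invented for this example, and the method deliberately handles only the three-byte range.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByHand {
    /** Encodes a code point in the three-byte range (hex 0800 to FFFF) as UTF-8. */
    static byte[] encodeThreeBytes(int cp) {
        if (cp < 0x800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("not in the three-byte range: " + cp);
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110bbbb: top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10bbbbbb: middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10bbbbbb: low 6 bits
        };
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int bom = 0xFEFF;
        System.out.println(bom);                        // 65279
        System.out.println(hex(encodeThreeBytes(bom))); // EF BB BF
        // Cross-check against the built-in encoder
        System.out.println(hex("\uFEFF".getBytes(StandardCharsets.UTF_8))); // EF BB BF
    }
}
```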

But who needs a byte order mark for UTF-8? There are no two ways of ordering the bytes. The first byte is always the one starting with something other than 10, and the others always start with 10. Why would Notepad put a BOM into a UTF-8 document? That's actual work. Usually, Notepad is stupid, not evil. So I checked the Unicode spec. They say it's perfectly OK to add a BOM in front of a file, and it might actually be useful because it allows a guess that this is a UTF-8 encoded file. If you open the file, knowing that it is UTF-8, you should ignore the BOM.

That's fair. So Java, which, as we all know, loves Unicode, will surely do the right thing, read the BOM and ignore it in a file that's opened with UTF-8 encoding. Umm, no. There are two bug reports about exactly this. The folks at Sun wrung their hands and wailed that fixing this bug would break a whole bunch of "customer" tools. Which turned out to be the Sun app server.
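To see the behavior for yourself, and one way a program of your own might cope, here is a hedged sketch. It decodes bytes that start with EF BB BF, shows that the decoded text still begins with the character FEFF, and then skips that character by hand. BomSkipper and firstLine are hypothetical names. This is not what javac does internally, just a workaround for code that reads such files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class BomSkipper {
    /** Reads the first line of UTF-8 data, dropping a leading BOM character if present. */
    static String firstLine(byte[] data) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            in.mark(1);                           // remember the start of the text
            if (in.read() != 0xFEFF) in.reset();  // not a BOM: rewind and keep it
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] source = "import java.util.Random;\n".getBytes(StandardCharsets.UTF_8);
        byte[] data = new byte[source.length + 3];
        data[0] = (byte) 0xEF; // the UTF-8 BOM, as Notepad would write it
        data[1] = (byte) 0xBB;
        data[2] = (byte) 0xBF;
        System.arraycopy(source, 0, data, 3, source.length);

        // The plain UTF-8 decoder hands the BOM through as an ordinary character
        String decoded = new String(data, StandardCharsets.UTF_8);
        System.out.println((int) decoded.charAt(0)); // 65279

        // Skipping it by hand gives the line we actually wanted
        System.out.println(firstLine(data)); // import java.util.Random;
    }
}
```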

Well, guess what. Not fixing the bug breaks javac, which now rejects perfectly valid UTF-8 source files.

Why didn't I notice this earlier? I guess I have finally reached the point where students configure Windows to use UTF-8 and not some archaic Microsoft-specific 8-bit encoding. That's good. Now we just need javac to read those UTF-8 files. If Notepad can, surely javac can too.