I've been too busy to blog for quite some time, but today something happened that seemed strange enough to break my silence. A student came to me with a Java source file that the grading script rejected. We looked at it and couldn't figure out why. I unearthed the error message:
MergeSorter.java:1: error: illegal character: \65279
import java.util.Random;
1 error

Huh? What's \65279? Why the backslash? I didn't even know what notation that was. I looked at the file with Emacs hexl-mode and saw that the first three bytes were hex EF BB BF. In all these years, I had never seen that, but Google set me straight. It's the Unicode byte order mark, or BOM. I asked the student what editor he had used to produce this file. Sure enough, it was Notepad. Of course. If I had the power to eradicate one program from the face of the earth, it surely would be Notepad.
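If you want to check a suspicious file yourself and don't have hexl-mode handy, a few lines of Java do the same thing: dump the first bytes in hex. The class name is mine, and the file name is just a placeholder for whatever file your grading script choked on.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FirstBytes {
    public static void main(String[] args) throws IOException {
        // Read the raw bytes; a file saved by Notepad as "UTF-8" starts with EF BB BF.
        byte[] bytes = Files.readAllBytes(Path.of("MergeSorter.java"));
        for (int i = 0; i < Math.min(8, bytes.length); i++)
            System.out.printf("%02X ", bytes[i] & 0xFF);
        System.out.println();
    }
}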


Just in case you haven't been down this particular rathole before, here's a refresher on the BOM. At one point in time, Unicode fit into 16 bits, and it seemed attractive to encode it with fixed-width 16-bit quantities. For example, an uppercase A is hexadecimal 0041, so you have one byte of 00 and one byte of 41. Or do you? On a little-endian platform such as Intel, it would be more convenient to have a byte of 41 followed by a byte of 00. Rather than lamely settling on either little-endian or big-endian encoding, Unicode gives a much more interesting choice. Your file can start out with the byte order mark, hexadecimal FEFF. If it shows up as FE FF when reading a byte at a time, the data is big-endian, and if it shows up as FF FE, it's little-endian.
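In code, that guess amounts to peeking at the first two bytes. Here is a minimal sketch (class and method names are my own) that maps the two possible BOM orderings to the standard UTF-16 charsets, falling back to big-endian when there is no BOM:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomGuess {
    static Charset guessUtf16(byte[] data) {
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE; // FE FF: big-endian
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE; // FF FE: little-endian
        return StandardCharsets.UTF_16BE;     // no BOM: assume big-endian
    }

    public static void main(String[] args) {
        byte[] bigEndianA = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 }; // BOM followed by 'A'
        System.out.println(guessUtf16(bigEndianA));                    // UTF-16BE
        System.out.println(new String(bigEndianA, 2, 2, guessUtf16(bigEndianA))); // A
    }
}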


But UTF-16 is so last millennium. Now Unicode has grown to 21 bits. While one could theoretically encode it fixed-length with 3-byte or 4-byte values, just about everyone uses the more economical UTF-8 instead. That's a variable-length encoding. 7-bit ASCII is embedded as 0bbbbbbb, where each b is a bit. Then we have a bunch of two-byte codes of the form 110bbbbb 10bbbbbb, followed by three-byte codes 1110bbbb 10bbbbbb 10bbbbbb, and so on. EF BB BF happens to be the three-byte encoding of the BOM. Work it out for yourself as an exercise! And, by the way, the decimal value is 65279.
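If you'd rather not do the bit-fiddling by hand, you can let Java do the exercise for you: encode the character U+FEFF as UTF-8 and look at the bytes that come out.

import java.nio.charset.StandardCharsets;

public class BomBytes {
    public static void main(String[] args) {
        // U+FEFF, encoded as UTF-8, comes out as the three bytes EF BB BF.
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        for (byte b : bom)
            System.out.printf("%02X ", b & 0xFF); // prints EF BB BF
        System.out.println();
        System.out.println((int) '\uFEFF');       // prints 65279, the value from the error message
    }
}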

But who needs a byte order mark for UTF-8? There are no two ways of ordering the bytes. The first byte is always the one starting with something other than 10, and the others always start with 10. Why would Notepad put a BOM into a UTF-8 document? That's actual work. Usually, Notepad is stupid, not evil. So I checked the Unicode spec here. It says it's perfectly OK to add a BOM in front of a file, and it might actually be useful because it allows a reader to guess that this is a UTF-8 encoded file. If you open the file, knowing that it is UTF-8, you should ignore the BOM.
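Ignoring it is not hard. After decoding as UTF-8, a leading BOM shows up as the character \uFEFF, so you can simply skip it. Here is a sketch of what that might look like; the helper is mine, not something the JDK does for you, as the next paragraph explains.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadIgnoringBom {
    // Read a UTF-8 text file and drop a leading BOM if there is one.
    static String readUtf8(Path path) throws IOException {
        String content = Files.readString(path, StandardCharsets.UTF_8);
        return content.startsWith("\uFEFF") ? content.substring(1) : content;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readUtf8(Path.of("MergeSorter.java")));
    }
}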

That's fair. So Java, which, as we all know, loves Unicode, will surely do the right thing, read the BOM, and ignore it in a file that's opened with UTF-8 encoding. Umm, no. Check out this and this bug report. The folks at Sun were wringing their hands and wailed that fixing this bug would break a whole bunch of "customer" tools. Which turned out to be the Sun app server.

Well, guess what. Not fixing the bug breaks javac, which now rejects perfectly valid UTF-8 source files.

Why didn't I notice this earlier? I guess I have finally reached the point where students configure Windows to use UTF-8 and not some archaic Microsoft-specific 8-bit encoding. That's good. Now we just need javac to read those UTF-8 files. If Notepad can, surely javac can too.
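Until that happens, the workaround on our end is to strip the BOM before handing the file to javac. A minimal sketch, again my own code, that rewrites a file in place if its first three bytes are EF BB BF:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class StripBom {
    public static void main(String[] args) throws IOException {
        Path path = Path.of(args[0]); // e.g. MergeSorter.java
        byte[] bytes = Files.readAllBytes(path);
        // If the file starts with the UTF-8 BOM, write it back without those three bytes.
        if (bytes.length >= 3 && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB && (bytes[2] & 0xFF) == 0xBF) {
            Files.write(path, Arrays.copyOfRange(bytes, 3, bytes.length));
        }
    }
}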
