This discussion is archived
12 Replies Latest reply: Mar 12, 2011 8:57 AM by 843441

What every developer should know about character encoding

DavidThi808 Newbie
This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.

If you write code that touches a text file, you probably need this.

Let's start off with two key items:

1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. That's because the first 127 byte values in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you carry those same assumptions into an HTML or XML file that has characters outside the first 127 – that's when the trouble starts.
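That shared lower range is easy to demonstrate; a minimal Java sketch (class name mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// ASCII-only text encodes to identical bytes in ISO-8859-1 and UTF-8,
// which is why pure A-Z content survives an encoding mix-up.
public class AsciiOverlap {
    public static void main(String[] args) {
        byte[] latin1 = "Hello".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8   = "Hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(latin1, utf8)); // true

        // Outside that range the overlap ends: é (U+00E9) is one byte in
        // ISO-8859-1 but two bytes in UTF-8.
        System.out.println("\u00E9".getBytes(StandardCharsets.ISO_8859_1).length); // 1
        System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);      // 2
    }
}
```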

The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 byte values were identical across all of them and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.

And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.

And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their own country's settings – that broke the paradigm.

Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.

Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
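A sketch of Point 1 in Java (file name and content are hypothetical): the trick is to source the declaration text and the output bytes from the same charset constant, so they cannot disagree:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Writes an XML file whose declared encoding matches the bytes actually
// written, because both come from the same StandardCharsets constant.
public class DeclaredEncoding {
    public static void main(String[] args) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("greeting.xml"), StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\""
                      + StandardCharsets.UTF_8.name() + "\"?>\n");
            out.write("<greeting>gr\u00FC\u00DF gott</greeting>\n"); // grüß gott
        }
    }
}
```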

Now let's look at UTF-8, because as the standard, the way it works gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matched the standard codepages for the first 127 characters, so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.

UTF-8 borrowed from the DBCS designs of the Asian codepages. The first 128 values are all single-byte representations of characters. Then for the next most common set, it uses a block in the second 128 values as the start of a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte, and those three bytes define the character. This originally went up to 6-byte sequences (modern UTF-8 caps out at 4). Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character – and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
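The variable widths are easy to see from Java (example characters chosen for illustration):

```java
import java.nio.charset.StandardCharsets;

// How many bytes UTF-8 spends per character, by code-point range.
public class Utf8Widths {
    static int width(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
    public static void main(String[] args) {
        System.out.println(width("A"));            // 1 byte  (ASCII, U+0041)
        System.out.println(width("\u00DF"));       // 2 bytes (ß, U+00DF)
        System.out.println(width("\u20AC"));       // 3 bytes (€, U+20AC)
        System.out.println(width("\uD834\uDD1E")); // 4 bytes (𝄞, U+1D11E)
    }
}
```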

But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. Then, using the codepage for their region, they insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
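That ß scenario can be reproduced in a couple of lines of Java (a contrived sketch): the editor saves ß as the single ISO-8859-1 byte 0xDF, and a UTF-8 reader sees a lead byte with no valid continuation:

```java
import java.nio.charset.StandardCharsets;

// ß saved as ISO-8859-1 is the single byte 0xDF. Read as UTF-8, 0xDF
// announces a 2-byte sequence, but no legal continuation byte follows,
// so the decoder substitutes U+FFFD (the replacement character).
public class Mojibake {
    public static void main(String[] args) {
        byte[] savedByEditor = "\u00DF".getBytes(StandardCharsets.ISO_8859_1);
        String readAsUtf8 = new String(savedByEditor, StandardCharsets.UTF_8);
        System.out.println(readAsUtf8.equals("\uFFFD")); // true
    }
}
```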

Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.

Now, what about when the code you are writing will read or write a file? We are not talking binary/data files, where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
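Those encoders can also be made strict, so a wrong encoding guess fails loudly instead of silently corrupting text; a hedged sketch in Java (class name and sample bytes mine):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// A decoder configured to REPORT malformed input throws instead of
// quietly substituting replacement characters.
public class StrictDecode {
    public static void main(String[] args) {
        // "café" as ISO-8859-1 bytes: the trailing 0xE9 is not valid UTF-8.
        byte[] latin1Bytes = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            CharBuffer text = utf8.decode(ByteBuffer.wrap(latin1Bytes));
            System.out.println("decoded: " + text);
        } catch (CharacterCodingException e) {
            System.out.println("not valid UTF-8: " + e); // this branch runs
        }
    }
}
```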

Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.

Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
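A minimal Java sketch of Point 3 (file name and content are made up here): name the charset on both the write and the read, and never lean on the platform default:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Round-trips non-ASCII text safely because the same explicit charset
// is used to write the bytes and to read them back.
public class ExplicitCharset {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("notes.txt");
        Files.write(path, "na\u00EFve".getBytes(StandardCharsets.UTF_8)); // "naïve"
        try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            System.out.println(in.readLine()); // naïve
        }
    }
}
```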

Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (It may also add a byte-order mark preamble to the file.)
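For XML specifically, the StAX writer is one such encoder: one call supplies both the declared and the actual encoding, so they stay in agreement. A small sketch (element name and content mine):

```java
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// The same "UTF-8" string configures both the byte encoder and the
// <?xml ... encoding="UTF-8"?> declaration, so they can't disagree.
public class XmlEncoderSketch {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        XMLStreamWriter w = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(buf, "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("greeting");
        w.writeCharacters("gr\u00FC\u00DF gott"); // grüß gott
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        System.out.println(buf.toString("UTF-8"));
    }
}
```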

Ok, you're reading & writing files correctly, but what about inside your code? This is where it's easy – Unicode. That's what the encoders in the Java & .NET runtimes are designed to produce. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type for characters. This you probably have right, because languages today don't give you much choice in the matter.
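The 16-bit char does come with one wrinkle worth knowing: Java and .NET strings are UTF-16, so characters above U+FFFF occupy two chars (a surrogate pair). A quick illustration in Java:

```java
// Most characters fit one 16-bit char, but supplementary characters
// (above U+FFFF) take a surrogate pair: two chars, one code point.
public class Utf16Chars {
    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // 𝄞, U+1D11E, musical G clef
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
    }
}
```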

Point 5 – (For developers on languages that have been around a while) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.

Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.

Edited by: Darryl Burke -- link removed
  • 1. Re: What every developer should know about character encoding
    802316 Pro
A small point: I thought the first 7-bit characters were 0 to 127, which makes 128 in total. (You mention this in the later half, but refer to the first characters as being 1 – 127 in the first half.) I had assumed that the null character is the same in every encoding as well. BTW: if you use C (which uses \0-terminated strings) you might not share this view.

Americans aren't the only ones who use English characters; the English have been known to use them too. Admittedly they have the £ pound sign on the keyboard. Australians use $ for currency and are generally happy with the 7-bit ASCII characters, along with most English colonies like America.

I agree with Point 5. Memory is not only cheap but getting cheaper all the time. 1 GB of memory adds about £100 to the cost of a server from a major named vendor, and it is reusable; your time is not. ;) Disk space is so mind-bogglingly cheap that the value of your time to press one key can be worth more than 10 MB of disk space, i.e. it's not worth pressing one key if it saves less than 10 MB.
  • 2. Re: What every developer should know about character encoding
    DarrylBurke Guru Moderator
    Moderator advice: Kindly stop promoting your blog via this forum. This is the second time a link has been removed from a post made by you.

    db
  • 3. Re: What every developer should know about character encoding
    DavidThi808 Newbie
    I've had several people thank me for posting the previous one as it answered questions for them.

    Can you please tell me why it is bad to proactively answer questions by providing people an intro to a subject rather than waiting for them to get stuck and then answering?

    thanks - dave
  • 4. Re: What every developer should know about character encoding
    796440 Guru
    DavidThi808 wrote:
    Can you please tell me why it is bad to proactively answer questions by providing people an intro to a subject rather than waiting for them to get stuck and then answering?
    It's not "bad" per se, but it's not the purpose of this site. This site is not intended to be a dumping ground for all the myriad java tutorials du jour that individuals feel like holding court on. This site is intended to be a place where people ask specific questions and receive help in response.
  • 5. Re: What every developer should know about character encoding
    DavidThi808 Newbie
    Ok. So if someone asks for an intro to something on topic here like when to use what classes for I/O, character encoding, etc., then it's ok to answer with something like this?

    But I need to wait for the question?

    thanks - dave
  • 6. Re: What every developer should know about character encoding
    796440 Guru
    DavidThi808 wrote:
    Ok. So if someone asks for an intro to something on topic here like when to use what classes for I/O, character encoding, etc., then it's ok to answer with something like this?

    But I need to wait for the question?
    That would be my take on it, yes, assuming the post is a valid useful answer to what they're asking, but I'm not a moderator or an admin.
  • 7. Re: What every developer should know about character encoding
    DarrylBurke Guru Moderator
    DavidThi808 wrote:
    I've had several people thank me for posting the previous one as it answered questions for them.
Really? All I see is one response, which is mainly critical of the material you posted.
    What every developer should know about bitmaps

    db
  • 8. Re: What every developer should know about character encoding
    DavidThi808 Newbie
    Hi Darryl;

    Yes, I was going to respond to the points raised, but could not as you had locked the thread. I would like to because the reply raised good points worth discussing.

    I also posted a message to you at Message to Darryl Burke as your email address is not in your profile, and there was a reply to that too.

I received 6 emails about my post thanking me. My guess is that, with your locking it, they figured that was the only way to respond.

    thanks - dave
  • 9. Re: What every developer should know about character encoding
    800330 Explorer
I had no intention to praise the contents of your blog, merely to comment on your effort to explain your reasoning (which I still feel was very polite and well-mannered).

    I haven't got a strong opinion on whether or not you can start a thread not containing a question or at least ask for a discussion. However, you're giving the answer to a yet unraised issue. How is someone having the question after you posted the answer going to find your answer? My answer: through Google and with that your blog post or tutorial probably already shows up. So, why post and advertise it again here?
  • 10. Re: What every developer should know about character encoding
    DavidThi808 Newbie
That's a fair point. I generally come here and search, then go to Google if that fails. But I asked around at work and everyone else goes straight to Google. So I guess proactive is not needed.

    thanks - dave
  • 11. Re: What every developer should know about character encoding
    jschellSomeoneStoleMyAlias Expert
    DavidThi808 wrote:
    This was originally posted (with better formatting) at Moderator edit: link removed/what-every-developer-should-know-about-character-encoding.html. I'm posting because lots of people trip over this.

    If you write code that touches a text file, you probably need this.

Let's start off with two key items:

1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this – most Americans can get by without having to take this into account – most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs). And because we only use A-Z without any other characters, accents, etc. – we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127 – then the trouble starts.
Pretty sure most Americans do not use character sets that only have a range of 0-127. I don't think I have ever used a desktop OS that did. I might have used some big iron boxes before that, but at that time I wasn't even aware that character sets existed.

They might only use that range, but that is a different issue, especially since that range is exactly the same as the UTF-8 character set anyway.

    >
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There of course were numerous character sets (or codepages) developed early on. But we ended up with most everyone using a standard set of codepages where the first 127 bytes were identical on all and the second were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.

    And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.

And for a while this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country – that broke the paradigm.
The above is only true for small-volume sets. If I am targeting a processing rate of 2000 txns/sec with a requirement to hold data active for seven years, then a column with a size of 8 bytes is significantly different from one with 16 bytes.
Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, is HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong – the file will be misread.
    The above is out of place. It would be best to address this as part of Point 1.
    Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.

Now let's look at UTF-8 because as the standard and the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First it matched the standard codepages for the first 127 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible which mattered a lot back when it was designed and many people were still using dial-up modems.

UTF-8 borrowed from the DBCS designs from the Asian codepages. The first 128 bytes are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 bytes to be a double byte sequence giving us more characters. But wait, there's more. For the less common there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. This goes up to 6 byte sequences. Using the MBCS (multi-byte character set) you can write the equivalent of every unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, do it in fewer bytes.
The first part of that paragraph is odd. The first 128 characters of Unicode – all of Unicode – are based on ASCII. The representational format of UTF-8 is required to implement Unicode, thus it must represent those characters. It uses the idiom supported by variable-width encodings to do that.
But here is what everyone trips over – they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character that in their text editor, using the codepage for their region, insert a character like ß and save the file. Of course it must be correct – their text editor shows it correctly. But feed it to any program that reads according to the encoding and that is now the first byte of a 2-byte sequence. You either get a different character or if the second byte is not a legal value for that first byte – an error.
Not sure what you are saying here. If a file is supposed to be in one encoding and you insert invalid characters into it, then it is invalid. End of story. It has nothing to do with HTML/XML.
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create with a text editor, then view the final file in a browser.
    The browser still needs to support the encoding.
Now, what about when the code you are writing will read or write a file? We are not talking binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example – your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
    I know java files have a default encoding - the specification defines it. And I am certain C# does as well.
    Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
    It is important to define it. Whether you set it is another matter.
    Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)

    Ok, you're reading & writing files correctly but what about inside your code. What there? This is where it's easy – unicode. That's what those encoders created in the Java & .NET runtime are designed to do. You read in and get unicode. You write unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
Unicode character escapes are replaced prior to actual code compilation. Thus it is possible to create strings in Java with escaped Unicode characters which will fail to compile.
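That pre-lexing replacement can be shown with a contrived sketch that compiles only because the escapes resolve to legal tokens before tokenization (the flip side is the classic trap: a \u000A escape inside a comment ends the comment line):

```java
// Unicode escapes are translated before the compiler tokenizes the source,
// so even the quote characters of a literal can be written as escapes.
public class EscapesFirst {
    public static void main(String[] args) {
        char a = \u0027A\u0027;            // becomes 'A' before lexing
        String s = "\u0041";               // indistinguishable from "A"
        System.out.println(a == 'A');      // true
        System.out.println(s.equals("A")); // true
    }
}
```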
    Point 5 – (For developers on languages that have been around awhile) – Always use unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes, memory is cheap and you have more important things to do.
No. A developer should understand the problem domain represented by the requirements and the business, and create solutions appropriate to that. Thus there is absolutely no point for someone who is creating an inventory system for a standalone store to craft a solution that supports multiple languages.

And another example: with high-volume systems, moving/storing bytes is relevant. As such, one must carefully consider each text element as to whether it is customer-consumable or internally consumable. Saving bytes in such cases will impact the total load of the system. In such systems incremental savings impact operating costs, and speed is a marketing advantage.
  • 12. Re: What every developer should know about character encoding
    843441 Newbie
    I think this is a worthy subject and I'm still sorting it out. I'm not sure your explanation clarifies everything for me.

One chicken-and-egg problem is: even if you put the encoding in a file (such as an HTML or XML file), you still need to assume or deduce a character encoding to read the file to read the encoding! (Unless you have a special file format where the first N bytes always identify the encoding.)

I've seen better explanations that distinguish between codepoints (related to your term "glyph", that is, a way to identify a "unique character"), Unicode characters (16-bit codes used to represent codepoints, where sometimes a codepoint is represented by a single Unicode character, and sometimes it takes more than one), and character "transformation/transmission formats" (e.g. UTF-8, etc.) which determine the byte sequences that end up in files or streams to represent sequences of codepoints.

    The real horror to me is the failure of Unicode to handle all the codepoints neatly, so you still have to worry about where characters start and end in a Java string.

I am hoping a future version of Java will include codepoint sequences, arrays, etc., or even integer read/write methods for easier programming (though not as memory frugal). It would be nice to have something like:

    InputStreamReader.read(CodePoint[] cbuf, int offset, int length)

    or simply

    InputStreamReader.read(int[] cbuf, int offset, int length)

    would be acceptable for some applications since codepoints currently fit into an int with room to spare.
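For what it's worth, later Java versions did grow code-point level access; a sketch of what that looks like with String.codePoints() (Java 8+), which yields an IntStream of code points, close in spirit to the int[]-based read proposed above:

```java
// Code-point level access to a string containing a surrogate pair.
public class CodePointAccess {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb";    // a, 𝄞 (U+1D11E), b
        System.out.println(s.length()); // 4 chars (the clef is a surrogate pair)
        int[] cps = s.codePoints().toArray();
        System.out.println(cps.length); // 3 code points
        for (int cp : cps) {
            System.out.printf("U+%04X%n", cp); // U+0061, U+1D11E, U+0062
        }
    }
}
```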

    Edited by: sb4 on Mar 12, 2011 8:57 AM