9 Replies Latest reply: Sep 4, 2010 4:56 PM by 653909

    UTF-8 vs UTF-16 string encoding

    653909
      I am doing an internationalized web app.

      Let's say that we need to choose one of those encodings over the other. (For those who don't know, and there are such: UTF-8 can encode any character in the Unicode set and is a variable-length encoding, i.e. it saves space when the majority of the text is ASCII-like.)

      So...

      How can we express the boolean choice as a function of the density of characters in the text that need 2 or more bytes?

      What if the majority of the characters are English text? What if they are Japanese?

      What if 30% of the text is in Japanese and 70% in English?



      How can we express the boolean choice as a function of the user locale only?


      Thank you.
        • 1. Re: UTF-8 vs UTF-16 string encoding
          843853
          You do realise that, when representing Unicode, UTF-16 is also variable length?

          P.S. I don't think this is a binary choice, since one should also be able to consider the guaranteed single-byte character encodings such as ISO-8859-x, which can be used in some locales to guarantee 1 byte per character.
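
          A minimal sketch of the variable-length point, not taken from the thread: it encodes three single code points and reports their size in each encoding. The class and method names are hypothetical. UTF-16BE is used instead of UTF-16 because Java's "UTF-16" encoder prepends a 2-byte byte order mark, which would distort the count.

          ```java
          import java.nio.charset.Charset;

          class Utf16WidthDemo {

              // Number of bytes the string occupies in the named charset.
              static int encodedLength(String s, String charsetName) {
                  return s.getBytes(Charset.forName(charsetName)).length;
              }

              public static void main(String[] args) {
                  String[] samples = {
                      "A",            // U+0041, ASCII: 1 byte in UTF-8, 2 in UTF-16
                      "\u3042",       // U+3042, hiragana: 3 bytes in UTF-8, 2 in UTF-16
                      "\uD834\uDD1E"  // U+1D11E, outside the BMP: 4 bytes in both (surrogate pair)
                  };
                  for (String s : samples) {
                      System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                              + ": UTF-8 " + encodedLength(s, "UTF-8") + " bytes, UTF-16 "
                              + encodedLength(s, "UTF-16BE") + " bytes");
                  }
              }
          }
          ```

          So UTF-16 only needs more than 2 bytes for characters outside the Basic Multilingual Plane, which ordinary English or Japanese text rarely contains.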
          • 2. Re: UTF-8 vs UTF-16 string encoding
            800025
            Just for the fun of it:
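            import java.nio.charset.Charset;
            import java.util.Enumeration;
            import java.util.Locale;
            import java.util.ResourceBundle;

            public class CharSetSelector {

                // An arbitrary JDK-internal bundle that is translated into several languages.
                // Being a sun.* class, it is not guaranteed to be present on every JDK.
                private static final String BUNDLE_NAME = "sun.applet.resources.MsgAppletViewer";

                private static final Charset UTF_8 = Charset.forName("UTF-8");
                // UTF-16BE rather than UTF-16: Java's "UTF-16" encoder prepends a
                // 2-byte byte order mark, which would inflate the measured length.
                private static final Charset UTF_16 = Charset.forName("UTF-16BE");

                public static void main(String[] args) {
                    test(Locale.JAPAN);
                    test(Locale.US);
                    test(Locale.FRANCE);
                }

                // Returns true if the locale's sample text is smaller in UTF-8 than in UTF-16.
                public static boolean selectUTF8(Locale locale) {
                    ResourceBundle bundle = ResourceBundle.getBundle(BUNDLE_NAME, locale);
                    // Concatenate every message of the bundle into one sample text.
                    StringBuilder sb = new StringBuilder();
                    Enumeration<String> keys = bundle.getKeys();
                    while (keys.hasMoreElements()) {
                        sb.append(bundle.getObject(keys.nextElement()));
                    }
                    String sample = sb.toString();
                    int utf8Length = sample.getBytes(UTF_8).length;
                    int utf16Length = sample.getBytes(UTF_16).length;
                    System.out.println("UTF-8 \t" + utf8Length);
                    System.out.println("UTF-16 \t" + utf16Length);
                    return utf8Length < utf16Length;
                }

                public static void test(Locale locale) {
                    System.out.println("Use UTF-8 " + selectUTF8(locale) + " for "
                            + locale.getDisplayName(Locale.ROOT));
                }

                // Utility class, not meant to be instantiated.
                private CharSetSelector() {
                }

            }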
            • 3. Re: UTF-8 vs UTF-16 string encoding
              653909
              We are talking about Unicode here, not ISO-88-whatever.
              For example, one may decide to quote 2 lines of Chinese in a 2 MB HTML file of English text.

              I know that UTF-16 is variable length too, yes.


              Interesting sample of code btw, very nice :)

              Why was that particular bundle name chosen?
              • 4. Re: UTF-8 vs UTF-16 string encoding
                800025
                I just searched for some arbitrary bundle that contains a reasonable amount of text and is in fact translated into various languages. If you decide to take a similar approach, you should probably create some sample text in the various languages yourself and not rely on a sun package.
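
                A rough sketch of that suggestion, under stated assumptions: the sample strings, the class name, the keying by language code and the fallback to UTF-8 for unknown languages are all made up for illustration. Real samples should be much longer and representative of the actual pages.

                ```java
                import java.nio.charset.Charset;
                import java.util.HashMap;
                import java.util.Locale;
                import java.util.Map;

                class SampleTextSelector {

                    private static final Charset UTF_8 = Charset.forName("UTF-8");
                    // UTF-16BE avoids the byte order mark that Java's "UTF-16" encoder adds.
                    private static final Charset UTF_16BE = Charset.forName("UTF-16BE");

                    // Hypothetical hand-written samples, keyed by ISO 639 language code.
                    private static final Map<String, String> SAMPLES = new HashMap<String, String>();
                    static {
                        SAMPLES.put("en", "The quick brown fox");
                        SAMPLES.put("fr", "\u00C9t\u00E9 fran\u00E7ais");
                        SAMPLES.put("ja", "\u3053\u3093\u306B\u3061\u306F\u4E16\u754C");
                    }

                    // True if the sample for the locale's language is smaller in UTF-8.
                    // Assumption: languages without a sample default to UTF-8.
                    static boolean selectUtf8(Locale locale) {
                        String sample = SAMPLES.get(locale.getLanguage());
                        if (sample == null) {
                            return true;
                        }
                        return sample.getBytes(UTF_8).length < sample.getBytes(UTF_16BE).length;
                    }

                    public static void main(String[] args) {
                        for (Locale locale : new Locale[] {Locale.US, Locale.FRANCE, Locale.JAPAN}) {
                            System.out.println("Use UTF-8 " + selectUtf8(locale) + " for "
                                    + locale.getDisplayName(Locale.ROOT));
                        }
                    }
                }
                ```

                This keeps the idea of the original snippet but removes the dependency on a JDK-internal bundle.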
                • 5. Re: UTF-8 vs UTF-16 string encoding
                  653909
                  Yeah I know... Not portable...
                  • 6. Re: UTF-8 vs UTF-16 string encoding
                    843853
                    javaUserMuser wrote:
                    We are talking about Unicode here, not ISO-88-whatever.
                    The point I was trying, but obviously failing, to make is that if you are going to go to all that trouble to select an encoding based on Locale, then surely one should consider choosing a single-byte-per-character encoding where appropriate to a Locale.

                    Of course, if you are just trying to make sure that, whatever the Locale, you can cover the whole of the Unicode set, then your approach is possibly right, but I have reservations. For example, if I have correctly read between the lines, I would have thought that the dominant language of the page being supplied would be the discriminant rather than the Locale of the client.
                    • 7. Re: UTF-8 vs UTF-16 string encoding
                      653909
                      Yes, the last statement of yours is right.

                      I'm not sure I understand everything you wrote, but UTF-8 covers the whole Unicode set, even Japanese.
                      It's just that, depending on the discriminant value, performance penalties can arise.

                      For alphanumeric/Latin characters only, no performance penalties arise, thus there is no need for other encodings.

                      Our friend who gave us the code snippet made a rough estimate/heuristic in his code, which is sufficient.

                      I.e. there is (some) correlation between the locale and the dominant language. No particularly precise measurement is needed (it would be expensive, for one thing).
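
                      For the density question, a rough estimate can be made without encoding anything, by summing the per-code-point widths in a single pass. For a mix of ASCII (1 byte in UTF-8) and Japanese characters from the Basic Multilingual Plane (3 bytes in UTF-8), with p the fraction of Japanese characters, UTF-8 averages 1 + 2p bytes per character against a flat 2 for UTF-16, so UTF-8 is smaller whenever p is below 0.5. At 30% Japanese that is 1.6 bytes against 2. The sketch below uses hypothetical names and assumes well-formed text without unpaired surrogates.

                      ```java
                      class EncodingSizeEstimator {

                          // Bytes needed to encode one code point in UTF-8.
                          static int utf8Bytes(int codePoint) {
                              if (codePoint < 0x80) return 1;
                              if (codePoint < 0x800) return 2;
                              if (codePoint < 0x10000) return 3;
                              return 4;
                          }

                          // Bytes needed in UTF-16: 2, or 4 when a surrogate pair is required.
                          static int utf16Bytes(int codePoint) {
                              return 2 * Character.charCount(codePoint);
                          }

                          static long utf8Length(String text) {
                              long total = 0;
                              for (int i = 0; i < text.length(); ) {
                                  int cp = text.codePointAt(i);
                                  total += utf8Bytes(cp);
                                  i += Character.charCount(cp);
                              }
                              return total;
                          }

                          static long utf16Length(String text) {
                              long total = 0;
                              for (int i = 0; i < text.length(); ) {
                                  int cp = text.codePointAt(i);
                                  total += utf16Bytes(cp);
                                  i += Character.charCount(cp);
                              }
                              return total;
                          }

                          // Builds a test text of ASCII letters followed by hiragana letters.
                          static String mix(int asciiCount, int kanaCount) {
                              StringBuilder sb = new StringBuilder();
                              for (int i = 0; i < asciiCount; i++) sb.append('a');
                              for (int i = 0; i < kanaCount; i++) sb.append('\u3042');
                              return sb.toString();
                          }

                          public static void main(String[] args) {
                              String text = mix(70, 30); // 70% English, 30% Japanese
                              System.out.println("UTF-8 \t" + utf8Length(text));
                              System.out.println("UTF-16 \t" + utf16Length(text));
                          }
                      }
                      ```

                      The same loop could be run over a sample of the page content instead of the whole document if even one pass is considered too costly.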
                      • 8. Re: UTF-8 vs UTF-16 string encoding
                        jschellSomeoneStoleMyAlias
                        javaUserMuser wrote:
                        I am doing an internationalized web app.
                        From this and the other responses I believe this is too broad.
                        Let's say that we need to choose one of those encodings over the other. (For those who don't know, and there are such: UTF-8 can encode any character in the Unicode set and is a variable-length encoding, i.e. it saves space when the majority of the text is ASCII-like.)

                        So...

                        How can we express the boolean choice as a function of the density of characters in the text that need 2 or more bytes?

                        What if the majority of the characters are English text? What if they are Japanese?

                        What if 30% of the text is in Japanese and 70% in English?
                        What does that mean exactly?

                        If I pop up an order site with English content and links to the original Japanese site, it doesn't have anything to do with the server.

                        What performance problems are you suggesting? Processing and volume are the performance hit, not display. Performance during display depends on the client, and in that regard a web page on a client is not worth considering.
                        • 9. Re: UTF-8 vs UTF-16 string encoding
                          653909
                          Nobody is talking about display, strictly about I/O.
                          Nobody is talking about hyperlinks either. We are talking about UTF encodings and their respective tradeoffs.


                          Our pal above gave a very elegant solution to my question. So if you are confused about its meaning, you can work backwards from the solution to what was meant.