31 Replies Latest reply: Nov 28, 2002 6:55 AM by 843810

    How to detect what character encoding a file is in

    843810
      Hi all Java gurus,

      How can I use the Java API to detect what character encoding a file is in?
      I would highly appreciate it if you could also send your reply to cwso7@netscape.net.

      Thanks & regards,
      fox
        • 1. Re: How to detect what character encoding a file is in
          843810
          Hi all Java gurus,

          How can I use the Java API to detect what character
          encoding a file is in?
          I would highly appreciate it if you could also send your reply to
          cwso7@netscape.net.
          It shouldn't really matter... once a String is read into Java from an outside source (such as a file), it's automatically in UTF-8 encoding. The original encoding doesn't matter anymore, because the String is no longer where it originated.
          Mind you, I suspect this produces new questions for you! ;-)
          Why do you need to know the encoding? Does it have an impact on how you want to display the information?

          Hope that helps,
          Martin Hughes
          • 2. Re: How to detect what character encoding a file is in
            843810
            once a String is read into Java from an outside source (such as a file), it's automatically in UTF-8 encoding.

            Can you clarify the above statement please?

            1) Set the locale to SJIS.
            2) Read a Japanese string and store it in a String (you are saying that the String now holds it in UTF-8).
            3) Call System.out.println(). It will use the default encoding, which is SJIS, so the UTF-8 will be converted to SJIS and you will see the Japanese string.
            4) But if you call Runtime.exec("echo " + str), where str is the String variable holding the text, you can also see the Japanese string. Who is converting UTF-8 to SJIS here, if the String is stored in UTF-8?
            • 3. Re: How to detect what character encoding a file is in
              843810
              If the file contains a byte order mark (BOM), I read the BOM (it should be the first few bytes) to determine the file's encoding, and then read the entire file in the proper encoding using InputStreamReader(fis, encoding).
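A minimal sketch of the BOM approach, assuming only the UTF-8 and UTF-16 marks need to be recognised. BomSniffer and its methods are hypothetical names, not part of the Java API, and UTF-32 is deliberately ignored (its little-endian mark starts with the same two bytes as UTF-16LE, so it would be misread here). When no BOM is present the caller-supplied fallback encoding is used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not a standard class.
final class BomSniffer {

    /** Returns the charset implied by a BOM in head[0..n), or null if there is none. */
    static String charsetFromBom(byte[] head, int n) {
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) return "UTF-16BE";
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) return "UTF-16LE";
        return null;
    }

    /** Wraps raw in a Reader using the BOM's charset if present, else the fallback. */
    static Reader open(InputStream raw, String fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {               // read up to three leading bytes
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        String bom = charsetFromBom(head, n);
        int skip = (bom == null) ? 0 : (bom.equals("UTF-8") ? 3 : 2);
        in.unread(head, skip, n - skip);        // push back everything except the BOM
        return new InputStreamReader(in, bom != null ? bom : fallback);
    }

    /** Convenience wrapper for data that is already in memory. */
    static String decode(byte[] data, String fallback) {
        try (Reader r = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file, something like BomSniffer.open(new FileInputStream(path), "ISO-8859-1") should work the same way; the fallback name is the caller's guess, not something the sketch can verify.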
              • 4. Re: How to detect what character encoding a file is in
                843810
                once a String is read into Java from an outside
                source (such as a file), it's automatically in UTF-8
                encoding.
                Strings are in UTF-16 format only. Strings and characters are processed internally as 16-bit Unicode entities. Encodings come into play only when text is written to or read from an external source. http://java.sun.com/docs/books/tutorial/i18n/text/stream.html details how file I/O works with file encodings.

                You can convert a String to an array of bytes in a specified encoding using the str.getBytes(String encoding) method.
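To make the point concrete, here is a small sketch showing that the same String yields different bytes depending on the encoding chosen, and that decoding with the matching encoding recovers it. EncodingRoundTrip is a made-up class name, and the sketch uses the Charset overloads of getBytes and InputStreamReader (rather than the String-name overloads) only to avoid checked exceptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class EncodingRoundTrip {

    /** Encodes the String's UTF-16 content into bytes of the given encoding. */
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    /** Decodes bytes through an InputStreamReader, as one would for a file. */
    static String decode(byte[] bytes, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            StringBuilder sb = new StringBuilder();
            for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u65e5\u672c\u8a9e"; // three CJK characters, written as escapes
        Charset[] encodings = {StandardCharsets.UTF_8, StandardCharsets.UTF_16BE};
        for (Charset cs : encodings) {
            byte[] bytes = encode(text, cs);
            // UTF-8 needs 9 bytes for these characters, UTF-16BE needs 6
            System.out.println(cs + ": " + bytes.length + " bytes, round trip ok = "
                    + text.equals(decode(bytes, cs)));
        }
    }
}
```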
                • 5. Re: How to detect what character encoding a file is in
                  843810
                  Hmmm, didn't know that UTF-16 bit... though it makes sense. It enables Java to handle CJK (Chinese, Japanese and Korean) characters without having to split them across two bytes within the String. Nice.
                  For those reading this wondering what the rest of us are on about: UTF-8 and UTF-16 are both encodings of the same Unicode character set, and they differ in the size of their code units. UTF-8 uses one to four 8-bit bytes per character (typically three for a CJK character), whereas UTF-16 uses a single 16-bit unit for most characters, CJK included, and two units for the rest. Long story. ;-)

                  Anyway, Java will always hold a String in Unicode, in UTF-16 as we've discovered!
                  The System.out.println line mentioned earlier displays in SHIFT_JIS because the data is converted back to an encoding that can be displayed in the DOS box (or whatever is displaying the println results). Whatever it's outputting to is outside of Java. In the case of a DOS box, sure, you're in the middle of running a Java program, but the window itself is not a part of it, and you can bet that it's not a Unicode application like Java is (i.e. it doesn't store its data in a Unicode encoding). So when Java hands its output to the window, either Java or the window has to convert it to an encoding the window knows how to display. I'd suspect it's Java doing this, but I could be wrong.

                  Hope that helps!
                  Martin Hughes
                  • 6. Re: How to detect what character encoding a file is in
                    843810
                    As far as I know, there is no Java API that detects the character encoding used in a file. It is doable, but it requires custom software.

                    To Martin:
                    Correct me if I am wrong, but don't you have to know the correct encoding of the file before reading it into a Java app? Otherwise, Java will think it is in Unicode and the file may not be read in correctly.
                    • 7. Re: How to detect what character encoding a file is in
                      843810
                      To Martin / Shadow

                      The last reply gets exactly to the heart of my question. At the outset, when you read the file, you must specify the encoding. Otherwise, the Java I/O stream can't correctly convert the bytes into a UTF-16 String.
                      If I don't specify the encoding when reading a file, Java will treat the file as being encoded in the Java default encoding. However, if I set the encoding to "UTF-8" but my file is, let's say, in "Big5" encoding, garbage will be read in. So, are there any APIs or open-source Java classes for determining what character encoding a file is in?
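The "garbage" effect can be reproduced in a few lines. This sketch uses ISO-8859-1 as a stand-in for Big5 (an assumption made purely because ISO-8859-1 is guaranteed to exist on every Java runtime); MismatchDemo is a made-up name. Decoding with the wrong charset does not fail, it silently substitutes the replacement character U+FFFD for bytes it cannot interpret.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
final class MismatchDemo {

    /** Decodes file bytes with whatever encoding the caller assumes. */
    static String decode(byte[] fileBytes, Charset assumed) {
        return new String(fileBytes, assumed);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";                       // ends in e-acute
        byte[] fileBytes = original.getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(fileBytes, StandardCharsets.ISO_8859_1);
        String wrong = decode(fileBytes, StandardCharsets.UTF_8);

        System.out.println("matching encoding ok:  " + original.equals(right));
        System.out.println("wrong encoding ok:     " + original.equals(wrong));
        // The lone byte 0xE9 is not valid UTF-8, so it decodes to U+FFFD
        System.out.println("contains U+FFFD:       " + (wrong.indexOf('\uFFFD') >= 0));
    }
}
```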


                      • 8. Re: How to detect what character encoding a file is in
                        843810
                        Exactly, you'd have to take liberties with Java, and hope that it was able to decipher your encoding for you.
                        AFAIK, there's no way of telling what encoding a file is in. Java certainly doesn't provide the APIs for it.
                        About the only way I can think of doing this is by looking at the bit orders of each encoding. Absolutely horrible stuff to code, I know, but it'd work.
                        No doubt Big-5, SHIFT_JIS etc. all store their characters in certain bit patterns, all of which differ (or we wouldn't have differing encodings). What I do know is that most of the encodings differ only in their allowable character ranges, which should make things easier (i.e. Big-5 is only going to allow Chinese characters, SHIFT_JIS Japanese ones etc. - they don't provide a way of describing Latin alphabet characters AFAIK). So it would all come down to reading in the characters byte by byte, and being able to pick which encoding we're in purely off bit order and character range.
                        http://www.iana.org/assignments/character-sets is a good place to start. From there, there are a heap of RFCs that define each encoding. Horrible, I know, but...
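One way to approximate the byte-pattern idea without hand-coding each encoding's rules is to let a strict decoder do the checking: ask each candidate charset to decode the bytes and discard the candidates that report malformed input. This is only a sketch under that assumption; CandidateFilter and its methods are hypothetical names. It narrows the field rather than identifying the encoding, because single-byte encodings such as ISO-8859-1 accept every byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a standard class.
final class CandidateFilter {

    /** True if every byte sequence in data is legal in the given charset. */
    static boolean decodesCleanly(byte[] data, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;   // some byte pattern is not allowed in this encoding
        }
    }

    /** Returns the candidate names, in order, whose decoders accept the data. */
    static List<String> plausible(byte[] data, String... candidateNames) {
        List<String> result = new ArrayList<>();
        for (String name : candidateNames) {
            // Skip charsets this runtime does not ship (e.g. Big5 on a minimal JRE)
            if (Charset.isSupported(name) && decodesCleanly(data, Charset.forName(name))) {
                result.add(name);
            }
        }
        return result;
    }
}
```

A call such as CandidateFilter.plausible(bytes, "US-ASCII", "UTF-8", "Big5", "Shift_JIS", "ISO-8859-1") returns whichever of those the bytes are legal in; choosing among the survivors still needs the kind of frequency analysis discussed in this thread.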

                        Hope that helps,
                        Martin Hughes
                        • 9. Re: How to detect what character encoding a file is in
                          843810
                          Quick correction: Big5 encoding can represent normal ASCII characters.
                          What Martin described is a possible solution, and it is what I spent a year doing (part time) for my thesis. It is definitely doable; it is not 100% accurate, but it is pretty reliable.
                          Some people over at Mozilla have written a brief article/essay on this (detecting character encoding), but I don't have the link handy at the moment.
                          • 10. Re: How to detect what character encoding a file is in
                            843810
                            shadowO

                            Please kindly provide the link to me! Thanks a ton!

                            Danny
                            • 11. Re: How to detect what character encoding a file is in
                              843810
                              Been very busy.
                              I misplaced the document. Do a Google search on "Mozilla encoding character" and you should be able to find it.
                              • 12. Re: How to detect what character encoding a file is in
                                843810
                                I've spent a little bit of time these past two days playing around with this problem, and for the encodings I've looked at so far, I've found it relatively straightforward to distinguish between them. I've managed to detect the following encodings reasonably reliably:

                                UTF-8 (with BOM)
                                UCS-2LE or UCS-4LE or UCS-16LE (by 'or' I mean I haven't yet distinguished between them)
                                UTF-16 or UTF-2
                                UCS-4
                                ASCII
                                UTF-8 (without BOM)
                                EUC-JP
                                SHIFT-JIS
                                ISO-2022-JP
                                ISO-2022-KR

                                All of the above were achieved without heuristics.

                                The last major one I want to support is ISO Latin-1, which I suspect is going to be much trickier, and will need some form of analysis of usage patterns for accented characters.

                                The following sites have been instrumental:

                                http://czyborra.com/utf/
                                http://zsigri.tripod.com/fontboard/cjk/charsets.html
                                http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
                                http://lfw.org/text/jp-www.html
                                http://developer.apple.com/techpubs/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/TEC.9e.html
                                http://www.faqs.org/rfcs/rfc1468.html
                                http://www.faqs.org/rfcs/rfc2237.html
                                • 13. Re: How to detect what character encoding a file is in
                                  843810
                                  If you have achieved that in two days, you are much smarter than me :) Thanks for the links. I will have a look at them when I have time. This is still a topic I am greatly interested in.
                                  ISO Latin-1 (ISO8859-1) is not too hard. IIRC, it is a single-byte encoding that just uses a 256-entry matrix.
                                  • 14. Re: How to detect what character encoding a file is in
                                    843810
                                    I consider my initial attempt fairly hacky and inefficient... but I've included below a sample method from my first attempt. There are no doubt more efficient ways of doing this, but the following runs in a millisecond or two for 2 to 4 KB of text.

                                    Other points I'm looking into now are where in a document, and just how much of it, one has to verify before being satisfied with its encoding. I've found that 2 KB is nowhere near enough for web pages (due to large amounts of ASCII HTML header info and JavaScript). I've thought of taking samples throughout the document, but one runs the risk of cutting double-byte characters in half... perhaps one can make the algorithm more lenient in such a case?
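On the leniency question, one possible approach (a sketch, not something tested on real web pages) is to run a java.nio.charset.CharsetDecoder over the sample with endOfInput set to false. In that mode an incomplete character at the end of the sample is reported as underflow rather than as malformed input, so a sample cut mid-character still passes. SampleCheck is a hypothetical name, and the sketch assumes the sample starts on a character boundary; a sample taken from the middle of a document would also need its start aligned.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration; not a standard class.
final class SampleCheck {

    /**
     * True if the sample contains no malformed byte sequences for the charset,
     * ignoring a character that was cut in half at the very end of the sample.
     */
    static boolean validIgnoringCutTail(byte[] sample, Charset cs) {
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(sample);
        // Large enough that the decoder never stops for lack of output space
        int capacity = (int) Math.ceil(sample.length * decoder.maxCharsPerByte()) + 1;
        CharBuffer out = CharBuffer.allocate(capacity);
        // endOfInput = false: more bytes may follow, so a truncated tail is not an error
        CoderResult result = decoder.decode(in, out, false);
        return !result.isError();
    }
}
```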

                                    I'm interested to hear your feedback.

                                         /**
                                          * <p>
                                          * Tests whether a given byte array could represent ISO-2022-JP encoded text. The detection relies on two
                                          * factors:
                                          * <ul>
                                          *      <li> The presence of JIS escape sequences as defined by RFC1468 and RFC2237.</li>
                                          *      <li> That the number of double byte characters present is equal to the remainder of bytes that are not used
                                          * for ASCII characters, Roman characters (see RFC1468) or escape sequences divided by two.</li>
                                          * </ul>
                                          * </p>
                                          *
                                          * <pre>
                                          * Esc Seq    Character Set                  ISOREG
                                          *
                                          * ESC ( B    ASCII                             6
                                          * ESC ( J    JIS X 0201-1976 ("Roman" set)    14
                                          * ESC $ @    JIS X 0208-1978                  42
                                          * ESC $ B    JIS X 0208-1983                  87
                                          * ESC $ ( D  JIS X 0212-1990                 159
                                          * </pre>
                                          *
                                          * <p>
                                          * For more information, refer to:<br />
                                          * <a href="http://www.faqs.org/rfcs/rfc1468.html">http://www.faqs.org/rfcs/rfc1468.html</a><br />
                                          * <a href="http://www.faqs.org/rfcs/rfc2237.html">http://www.faqs.org/rfcs/rfc2237.html</a><br />
                                          * </p>
                                          * @param bytes An array of bytes to examine.
                                          * @return boolean Whether or not the byte array could be ISO-2022-JP encoded.
                                          */
                                         static boolean isValidISO2022JP(byte[] bytes)
                                         {
                                              int dbcsCount = 0; //Number of valid double-byte chars encountered
                                              int asciiCount = 0; //Number of ASCII chars encountered
                                              int romanCount = 0; //Number of Roman chars encountered
                                              int escCount = 0; //Number of Esc sequences encountered
                                    
                                              int len = bytes.length;
                                              int pos = 0;
                                              int aa = 0x00;
                                    
                                              boolean isDBCSMode = false;
                                              boolean isRomanMode = false;
                                              boolean lastWasFirstDBC = false;
                                    
                                              while (pos < len)
                                              {
                                                   aa = bytes[pos] & 0xFF;
                                    
                                                   if (aa == 0x1B)
                                                   {
                                                        //Read the bytes after ESC only where they exist; a missing byte is 0, which matches no sequence
                                                        int bb = (pos + 1 < len) ? (bytes[pos + 1] & 0xFF) : 0;
                                                        int cc = (pos + 2 < len) ? (bytes[pos + 2] & 0xFF) : 0;
                                                        int dd = (pos + 3 < len) ? (bytes[pos + 3] & 0xFF) : 0;

                                                        //ESC ( B
                                                        if (bb == 0x28 && cc == 0x42)
                                                        {
                                                             escCount = escCount + 3;
                                                             pos = pos + 3;
                                                             isDBCSMode = false;
                                                             isRomanMode = false;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        //ESC ( J
                                                        else if (bb == 0x28 && cc == 0x4A)
                                                        {
                                                             escCount = escCount + 3;
                                                             pos = pos + 3;
                                                             isDBCSMode = false;
                                                             isRomanMode = true;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        //ESC $ @
                                                        else if (bb == 0x24 && cc == 0x40)
                                                        {
                                                             escCount = escCount + 3;
                                                             pos = pos + 3;
                                                             isDBCSMode = true;
                                                             isRomanMode = false;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        //ESC $ B
                                                        else if (bb == 0x24 && cc == 0x42)
                                                        {
                                                             escCount = escCount + 3;
                                                             pos = pos + 3;
                                                             isDBCSMode = true;
                                                             isRomanMode = false;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        //ESC $ ( D
                                                        else if (bb == 0x24 && cc == 0x28 && dd == 0x44)
                                                        {
                                                             escCount = escCount + 4;
                                                             pos = pos + 4;
                                                             isDBCSMode = true;
                                                             isRomanMode = false;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        else
                                                        {
                                                             //Unrecognised or truncated escape sequence: step over the ESC byte
                                                             //so the loop always advances
                                                             pos++;
                                                        }
                                                   }
                                                   else
                                                   {
                                                        if (isDBCSMode && aa > 0x20 && aa < 0x80)
                                                        {
                                                             if (lastWasFirstDBC == true)
                                                             {
                                                                  dbcsCount++;
                                                                  lastWasFirstDBC = false;
                                                             }
                                                             else
                                                             {
                                                                  lastWasFirstDBC = true;
                                                             }
                                                        }
                                                        else if (aa < 0x80)
                                                        {
                                                             if (isRomanMode)
                                                                  romanCount++;
                                                             else
                                                                  asciiCount++;
                                                             lastWasFirstDBC = false;
                                                        }
                                                        pos++;
                                                   }
                                              }
                                    
                                              return (escCount > 0 && dbcsCount == (len - asciiCount - romanCount - (escCount)) / 2);
                                         }