This discussion is archived
31 Replies · Latest reply: Nov 28, 2002 4:55 AM by 843810

How to detect what character encoding a file is in

843810 Newbie
Hi all Java gurus,

How can I use the Java API to detect what character encoding a file is in?
It would be highly appreciated if you could send your reply to cwso7@netscape.net as well!

thx & regards
fox
  • 1. Re: How to detect what character encoding a file is in
    843810 Newbie
    Hi all Java gurus

    How can I use the Java API to detect what character
    encoding a file is in?
    It would be highly appreciated if you could send your
    reply to cwso7@netscape.net as well!
    It shouldn't really matter... once a String is read into Java from an outside source (such as a file), it's automatically in UTF-8 encoding. The original encoding doesn't matter anymore, because the String is no longer where it originated.
    Mind you, I suspect this produces new questions for you! ;-)
    Why do you need to know the encoding? Does it have an impact on how you want to display the information?

    Hope that helps,
    Martin Hughes
  • 2. Re: How to detect what character encoding a file is in
    843810 Newbie
    once a String is read into Java from an outside source (such as a file), it's automatically in UTF-8 encoding.

    Can you clarify the above statement please?

    1) Set the locale to SJIS.
    2) Read a Japanese string and store it in a String (now you are saying that the String stores it in UTF-8).
    3) Do System.out.println(); it will use the default encoding, which is SJIS, so the UTF-8 will be converted to SJIS and you will see the Japanese string.
    4) But if you do Runtime.exec("echo " + theStringVariable), you can also see the Japanese string. Who is converting UTF-8 to SJIS here, if the String is stored in UTF-8?
  • 3. Re: How to detect what character encoding a file is in
    843810 Newbie
    If the file contains a byte order mark (BOM), I read in the BOM (the first few bytes) to determine the file's encoding, and then read the entire file in the proper encoding using InputStreamReader(fis, encoding).
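    The BOM check described above can be sketched roughly as follows. The class and method names are just for illustration; note also that a UTF-32LE BOM (FF FE 00 00) starts with the same two bytes as UTF-16LE, which this minimal version does not distinguish.

    ```java
    public class BomSniffer {
        /**
         * Returns the charset name implied by a leading byte order mark,
         * or null if no recognized BOM is present.
         */
        static String charsetFromBom(byte[] head) {
            if (head.length >= 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF) {
                return "UTF-8";
            }
            if (head.length >= 2) {
                int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
                if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
                if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
            }
            return null; // no BOM: the encoding must be guessed some other way
        }

        public static void main(String[] args) {
            byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
            System.out.println(charsetFromBom(utf8)); // prints "UTF-8"
        }
    }
    ```

    The BOM bytes are then skipped before handing the rest of the stream to an InputStreamReader constructed with the detected charset.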
  • 4. Re: How to detect what character encoding a file is in
    843810 Newbie
    once a String is read into Java from an outside
    source (such as a file), it's automatically in UTF-8
    encoding.
    Strings are in UTF-16 format only. Strings and characters are processed internally as Unicode 16-bit entities. Encodings come into play only when writing to or reading from external sources. http://java.sun.com/docs/books/tutorial/i18n/text/stream.html details how file I/O works with file encodings.

    You can convert a string to an array of bytes in a specified encoding using the str.getBytes(String encoding) method.
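    For example, a quick sketch showing that the same String yields different byte arrays under different encodings (the class name EncodeDemo is just for illustration):

    ```java
    public class EncodeDemo {
        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            String s = "caf\u00E9";                   // 4 chars internally (UTF-16)
            byte[] utf8   = s.getBytes("UTF-8");      // the e-acute becomes two bytes: 0xC3 0xA9
            byte[] latin1 = s.getBytes("ISO-8859-1"); // the e-acute becomes one byte: 0xE9
            System.out.println(utf8.length + " " + latin1.length); // prints "5 4"
        }
    }
    ```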
  • 5. Re: How to detect what character encoding a file is in
    843810 Newbie
    Hmmm, didn't know that UTF-16 bit... though it makes sense. It enables Java to handle
    CJK (Chinese, Japanese and Korean) characters without having to split them across two bytes within the String. Nice.
    For those reading this and wondering what the rest of us are on about: UTF-16 is similar in spirit to UTF-8, except that it uses 16-bit units to store characters rather than 8-bit ones. CJK characters, due to the huge number that exist, need 16 bits to hold. Reading them from UTF-8 can sometimes prove tricky, as they end up being stored as multiple bytes. Long story. ;-)

    Anyway, Java will always store strings in Unicode, UTF-16 as we've discovered!
    The System.out.println line mentioned earlier displays in SHIFT_JIS because the data's encoding is converted back to one that can be displayed in the DOS box (or whatever is displaying the println results). Whatever it's outputting to is outside of Java: in the case of a DOS box, sure, you're in the middle of running a Java program, but the window itself is not part of it, and it's not going to be storing its data in a Unicode encoding the way Java does. So when Java hands its output to the window, the characters have to be converted into bytes in an encoding the window knows how to display. It's Java doing this: System.out encodes its output using the platform's default charset.
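    To illustrate: System.out is a PrintStream, and a PrintStream encodes characters into bytes using a charset, the platform default unless you specify one. A small sketch writing to a buffer instead of a console (the class name ConsoleEncoding is just for demonstration):

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.PrintStream;

    public class ConsoleEncoding {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            // Build a PrintStream with an explicit charset instead of the platform default.
            PrintStream out = new PrintStream(buf, true, "Shift_JIS");
            out.print("\u65E5\u672C"); // "Nihon": two chars in the String, four bytes in Shift_JIS
            byte[] b = buf.toByteArray();
            System.out.println(b.length); // prints "4"
        }
    }
    ```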

    Hope that helps!
    Martin Hughes
  • 6. Re: How to detect what character encoding a file is in
    843810 Newbie
    As far as I know, there is no Java API that detects the character encoding used in a file. However, it is doable; it just requires custom software.

    To Martin:
    Correct me if I am wrong, but don't you have to know the correct encoding of the file before reading it into a Java app? Otherwise, Java will fall back on its default encoding and the file may not be read in correctly.
  • 7. Re: How to detect what character encoding a file is in
    843810 Newbie
    To Martin / Shadow

    The last reply gets exactly to the heart of my question. At the outset, when you read the file, you must specify the encoding; otherwise, the Java I/O stream can't correctly convert the bytes into a String in UTF-16.
    If I don't specify the encoding when reading a file, Java will treat the file as being in the default encoding. However, if I set the encoding to "UTF-8" but my file is, let's say, in "Big5" encoding, garbage will be read in. Thus, are there any APIs / open-source Java classes for determining what character encoding a file is in?
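    That "garbage" effect is easy to demonstrate: decode bytes with the wrong charset and you get mojibake. A minimal sketch (the class name WrongEncoding is illustrative, and the bytes stand in for a file's contents):

    ```java
    public class WrongEncoding {
        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            byte[] fileBytes = "caf\u00E9".getBytes("UTF-8"); // pretend these bytes came from a file

            String right = new String(fileBytes, "UTF-8");      // decoded with the correct charset
            String wrong = new String(fileBytes, "ISO-8859-1"); // decoded with the wrong charset

            System.out.println(right.equals("caf\u00E9")); // prints "true"
            System.out.println(wrong);                     // "caf" + '\u00C3' + '\u00A9' -- mojibake
        }
    }
    ```

    The same thing happens in the other direction: Big5 bytes pushed through a UTF-8 decoder come out as nonsense, because the decoder has no way of knowing it was handed the wrong charset.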


  • 8. Re: How to detect what character encoding a file is in
    843810 Newbie
    Exactly, you'd have to take liberties with Java, and hope that it was able to decipher your encoding for you.
    AFAIK, there's no way of telling what encoding a file is in. Java certainly doesn't provide the APIs for it.
    About the only way I can think of doing this is by looking at the bit orders of each encoding. Absolutely horrible stuff to code, I know, but it'd work.
    No doubt Big-5, SHIFT_JIS etc. all store their characters in certain bit patterns, all of which differ (or we wouldn't have differing encodings). What I do know is that most of the encodings differ only in their allowable character ranges, which should make things easier (i.e. Big-5 is only going to allow Chinese characters, SHIFT_JIS Japanese ones etc. - they don't provide a way of describing Latin alphabet characters AFAIK). So it would all come down to reading in the characters byte by byte, and being able to pick which encoding we're in purely off bit order and character range.
    http://www.iana.org/assignments/character-sets is a good place to start. From there, there are a heap of RFCs that define each encoding. Horrible, I know, but...
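    As a sketch of the byte-range approach described above: the ranges below are the standard Shift_JIS lead-byte and trail-byte ranges. The class name is hypothetical, and passing this check is necessary but not sufficient (some EUC-JP text also passes); a real detector would run several such validators and pick the encoding that survives.

    ```java
    public class RangeSniff {
        /** Rough range-based check: could these bytes be Shift_JIS text? */
        static boolean couldBeShiftJis(byte[] bytes) {
            int i = 0;
            while (i < bytes.length) {
                int b = bytes[i] & 0xFF;
                if (b < 0x80 || (b >= 0xA1 && b <= 0xDF)) {
                    i++; // ASCII or half-width katakana: single byte
                } else if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF)) {
                    // Lead byte of a double-byte character; the trail byte
                    // must be in 0x40-0xFC, excluding 0x7F.
                    if (i + 1 >= bytes.length) return false;
                    int t = bytes[i + 1] & 0xFF;
                    if (t < 0x40 || t > 0xFC || t == 0x7F) return false;
                    i += 2;
                } else {
                    return false; // byte outside every Shift_JIS range
                }
            }
            return true;
        }

        public static void main(String[] args) {
            byte[] nihonSjis = {(byte) 0x93, (byte) 0xFA, (byte) 0x96, (byte) 0x7B}; // "Nihon"
            System.out.println(couldBeShiftJis(nihonSjis)); // prints "true"
        }
    }
    ```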

    Hope that helps,
    Martin Hughes
  • 9. Re: How to detect what character encoding a file is in
    843810 Newbie
    Quick correction: Big5 can also encode normal ASCII characters.
    What Martin described is a possible solution, and that is what I spent a year doing (part time) as my thesis. It is definitely doable; it is not 100% accurate, but it is pretty reliable.
    Some people over at Mozilla have written a brief article/essay on this (detecting character encodings), but I don't have the link handy at the moment.
  • 10. Re: How to detect what character encoding a file is in
    843810 Newbie
    shadowO

    Please kindly provide the link to me ! Thanks a ton!

    Danny
  • 11. Re: How to detect what character encoding a file is in
    843810 Newbie
    Sorry, I've been very busy, and I misplaced the document. Do a Google search on "Mozilla encoding character" and you should be able to find it.
  • 12. Re: How to detect what character encoding a file is in
    843810 Newbie
    I've spent a little bit of time these past two days playing around with this problem, and for the encodings I've looked at so far, I've found it relatively straightforward to distinguish between them. I've managed to detect the following encodings reasonably reliably:

    UTF-8 (with BOM)
    UCS-2LE or UCS-4LE or UCS-16LE (by 'or' I mean I haven't yet distinguished between them)
    UTF-16 or UTF-2
    UCS-4
    ASCII
    UTF-8 (without BOM)
    EUC-JP
    SHIFT-JIS
    ISO-2022-JP
    ISO-2022-KR

    All of the above were achieved without heuristics.

    The last major one I want to support is ISO Latin-1, which I suspect is going to be much trickier, and will need some form of analysis of usage patterns for accented characters.

    The following sites have been instrumental:

    http://czyborra.com/utf/
    http://zsigri.tripod.com/fontboard/cjk/charsets.html
    http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
    http://lfw.org/text/jp-www.html
    http://developer.apple.com/techpubs/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/TEC.9e.html
    http://www.faqs.org/rfcs/rfc1468.html
    http://www.faqs.org/rfcs/rfc2237.html
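    For what it's worth, the UTF-8 (without BOM) case in the list above can be handled without heuristics because UTF-8's lead/continuation byte structure is self-describing. A minimal sketch of such a check (the class name is illustrative; it validates only the bit patterns, while a stricter validator would also reject overlong forms and surrogate ranges):

    ```java
    public class Utf8Check {
        /** Returns true if the byte array is structurally well-formed UTF-8. */
        static boolean looksLikeUtf8(byte[] bytes) {
            int i = 0;
            while (i < bytes.length) {
                int b = bytes[i] & 0xFF;
                int trailing;
                if (b < 0x80)              trailing = 0; // ASCII
                else if ((b >> 5) == 0x6)  trailing = 1; // 110xxxxx
                else if ((b >> 4) == 0xE)  trailing = 2; // 1110xxxx
                else if ((b >> 3) == 0x1E) trailing = 3; // 11110xxx
                else return false;                       // stray continuation or invalid lead byte
                if (i + trailing >= bytes.length) return false; // truncated sequence
                for (int j = 1; j <= trailing; j++) {
                    if (((bytes[i + j] & 0xFF) >> 6) != 0x2)    // continuation must be 10xxxxxx
                        return false;
                }
                i += trailing + 1;
            }
            return true;
        }

        public static void main(String[] args) throws java.io.UnsupportedEncodingException {
            System.out.println(looksLikeUtf8("caf\u00E9".getBytes("UTF-8"))); // prints "true"
            System.out.println(looksLikeUtf8(new byte[]{(byte) 0xE9}));       // prints "false"
        }
    }
    ```

    Text in a single-byte encoding such as Latin-1 almost never passes this check once it contains accented characters, which is why UTF-8 can be confirmed rather than merely guessed.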
  • 13. Re: How to detect what character encoding a file is in
    843810 Newbie
    If you have achieved that in two days, you are much smarter than me :) Thanks for the links. I will have a look at them when I have time. This is still a topic I am greatly interested in.
    ISO Latin-1 (ISO8859-1) is not too hard. IIRC, it just uses a single 256-entry table.
  • 14. Re: How to detect what character encoding a file is in
    843810 Newbie
    I consider my initial attempt fairly hacky and inefficient... but I've included below a sample method from that first attempt. There are no doubt more efficient ways of doing this, but the following runs in a millisecond or two for 2 to 4 KB of text.

    Other points I'm looking into now are where to sample a document, and just how much of it one has to verify before being satisfied with its encoding. I've found that 2 KB is nowhere near enough for web pages, due to large amounts of ASCII HTML header info and JavaScript. I've thought of taking samples throughout the document, but one runs the risk of cutting double-byte characters in half... perhaps the algorithm could be made more lenient in such a case?

    I'm interested to hear your feedback.

         /**
          * <p>
          * Tests whether a given byte array could represent ISO-2022-JP encoded text. The detection relies on two
          * factors:
          * <ul>
          *      <li> The presence of JIS escape sequences as defined by RFC1468 and RFC2237.</li>
          *      <li> That the number of double-byte characters present equals half of the bytes left over once ASCII
          * characters, Roman characters (see RFC1468) and escape sequences are accounted for.</li>
          * </ul>
          * </p>
          *
          * <pre>
          * Esc Seq    Character Set                  ISOREG
          *
          * ESC ( B    ASCII                             6
          * ESC ( J    JIS X 0201-1976 ("Roman" set)    14
          * ESC $ @    JIS X 0208-1978                  42
          * ESC $ B    JIS X 0208-1983                  87
          * ESC $ ( D  JIS X 0212-1990                 159
          * </pre>
          *
          * <p>
          * For more information, refer to:<br />
          * <a href="http://www.faqs.org/rfcs/rfc1468.html">http://www.faqs.org/rfcs/rfc1468.html</a><br />
          * <a href="http://www.faqs.org/rfcs/rfc2237.html">http://www.faqs.org/rfcs/rfc2237.html</a><br />
          * </p>
          * @param bytes An array of bytes to examine.
          * @return boolean Whether or not the byte array could be ISO-2022-JP encoded.
          */
         static boolean isValidISO2022JP(byte[] bytes)
         {
              int dbcsCount = 0;  //Number of valid double-byte chars encountered
              int asciiCount = 0; //Number of ASCII chars encountered
              int romanCount = 0; //Number of Roman chars encountered
              int escCount = 0;   //Number of escape sequences encountered

              int len = bytes.length;
              int pos = 0;

              boolean isDBCSMode = false;
              boolean isRomanMode = false;
              boolean lastWasFirstDBC = false;

              while (pos < len)
              {
                   int aa = bytes[pos] & 0xFF;

                   //Check the bounds BEFORE reading ahead, so a trailing ESC
                   //cannot cause an ArrayIndexOutOfBoundsException.
                   if (aa == 0x1B && pos + 2 < len)
                   {
                        int bb = bytes[pos + 1] & 0xFF;
                        int cc = bytes[pos + 2] & 0xFF;

                        //ESC ( B : ASCII
                        if (bb == 0x28 && cc == 0x42)
                        {
                             escCount += 3;
                             pos += 3;
                             isDBCSMode = false;
                             isRomanMode = false;
                             lastWasFirstDBC = false;
                        }
                        //ESC ( J : JIS X 0201-1976 "Roman" set
                        else if (bb == 0x28 && cc == 0x4A)
                        {
                             escCount += 3;
                             pos += 3;
                             isDBCSMode = false;
                             isRomanMode = true;
                             lastWasFirstDBC = false;
                        }
                        //ESC $ @ (JIS X 0208-1978) or ESC $ B (JIS X 0208-1983)
                        else if (bb == 0x24 && (cc == 0x40 || cc == 0x42))
                        {
                             escCount += 3;
                             pos += 3;
                             isDBCSMode = true;
                             isRomanMode = false;
                             lastWasFirstDBC = false;
                        }
                        //ESC $ ( D : JIS X 0212-1990
                        else if (bb == 0x24 && cc == 0x28
                                  && pos + 3 < len && (bytes[pos + 3] & 0xFF) == 0x44)
                        {
                             escCount += 4;
                             pos += 4;
                             isDBCSMode = true;
                             isRomanMode = false;
                             lastWasFirstDBC = false;
                        }
                        else
                        {
                             //Unrecognised escape sequence: always advance, so the
                             //loop cannot spin forever on a stray ESC byte.
                             pos++;
                        }
                   }
                   else
                   {
                        if (isDBCSMode && aa > 0x20 && aa < 0x80)
                        {
                             if (lastWasFirstDBC)
                             {
                                  dbcsCount++;
                                  lastWasFirstDBC = false;
                             }
                             else
                             {
                                  lastWasFirstDBC = true;
                             }
                        }
                        else if (aa < 0x80)
                        {
                             if (isRomanMode)
                                  romanCount++;
                             else
                                  asciiCount++;
                             lastWasFirstDBC = false;
                        }
                        pos++;
                   }
              }

              return escCount > 0
                        && dbcsCount == (len - asciiCount - romanCount - escCount) / 2;
         }