This discussion is archived
31 Replies. Latest reply: Nov 28, 2002 4:55 AM by 843810
  • 15. Re: How to detect what character encoding a file is in
    843810 Newbie

    That's very informative, guys. Good discussion!

    But are we then concluding that:

    1) while reading a file, we have to know the encoding of the file?
    2) while reading a file, we have to explicitly convert the data to the encoding that we want, even if it is a conversion to, say, UTF-8?
    3) while writing data to the file, we have to go through a similar process of converting the data to, say, UTF-8 before putting it into the file?

    I am confused, but more than eager to learn about it.

    Are we recommending that when data comes into a Java program it is better to convert it to, say, UTF-8, and that when it goes out to storage such as a database or a file it is also better to convert it to UTF-8?

    Please clarify. Thanks in advance.

    Manesh
  • 16. Re: How to detect what character encoding a file is in
    843810 Newbie
    Nice :) really similar to what I have done.

    Manesh,
    Firstly, please read this thread. A very nice intro:
    http://forum.java.sun.com/thread.jsp?forum=16&thread=299456

    Now to your questions:
    1) This depends upon what you want to do. If you do know the encoding of the file, then it is not a problem. If you do not, you can always read the file into a byte array, then try to figure out the encoding from it.

    2) If you do not specify the encoding of the file, Java will use the platform's default encoding (not UTF-16, which is only how Java holds strings in memory). However, as I said in 1), you can read the file in as a byte array and then construct a String from it once you have figured out the encoding.

    3) Again if you do not specify the encoding, the default is used. You can specify what encoding you want however. I would recommend UTF-8.
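
    As a rough illustration of points 2) and 3), the sketch below writes text to a temporary file and reads it back, naming the charset explicitly in both directions. The class and method names are made up for this example, and the temporary file is only there to keep the sketch self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingIo {
    // Writes text to a temporary file in the given charset,
    // then reads it back using the same charset.
    static String roundTrip(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                // Encode on the way out: String -> bytes in the chosen charset.
                Files.write(file, text.getBytes(charset));
                // Decode on the way in: bytes -> String, naming the charset again.
                return new String(Files.readAllBytes(file), charset);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Survives the round trip because UTF-8 can represent every character.
        System.out.println(roundTrip("caf\u00E9", StandardCharsets.UTF_8));
        // The accented letter is replaced by '?' because US-ASCII cannot represent it.
        System.out.println(roundTrip("caf\u00E9", StandardCharsets.US_ASCII));
    }
}
```

    The second call shows why the choice of output encoding matters: characters the charset cannot represent are silently replaced.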

    Hope this helps.
  • 17. Re: How to detect what character encoding a file is in
    843810 Newbie
    Hi,

    two items to add to this good discussion:

    1) Despite the name, UTF-8 is not a fixed 8-bit encoding. Characters are encoded with variable length, from 1 byte up to 6 bytes (at most 4 bytes under the current RFC 3629 definition).
    For more information see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

    2) If no encoding is specified, Java uses the platform's default character encoding. For Western European locales, for example, this will typically be ISO-8859-1 (Latin-1), or Cp1252 on Windows.
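
    Both points can be checked with a few lines of Java. This is only a sketch; the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The charset Java falls back to when none is specified; varies by platform.
        System.out.println("platform default: " + Charset.defaultCharset());

        System.out.println(utf8Length("A"));            // U+0041, plain ASCII
        System.out.println(utf8Length("\u00E9"));       // U+00E9, e with acute accent
        System.out.println(utf8Length("\u20AC"));       // U+20AC, euro sign
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP
    }
}
```

    The four sample characters need 1, 2, 3 and 4 bytes respectively.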

    Regards

    Jan
  • 18. Re: How to detect what character encoding a file is in
    843810 Newbie
    Refer to:

    http://www.oreilly.com/catalog/javaint/
  • 19. Re: How to detect what character encoding a file is in
    843810 Newbie
    "1) UTF-8 is despite the name not a 8-bit encoding. Characters are encoded with variable length from 1 byte up to 6 bytes."

    Not to be rude, but the reason UTF-8 is called an 8-bit encoding is that you process it 8 bits (1 byte) at a time: the first byte of a character tells you how many bytes the character occupies. With an encoding like UTF-16, you have to read two bytes at a time to decipher a character.
    Of course I could still be wrong :) but this is how I understand it.
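
    To make the "first byte tells you the length" idea concrete, here is a small sketch. The class and method names are invented, and it stops at 4-byte sequences, which is the limit in the current UTF-8 definition (RFC 3629); the older 5- and 6-byte forms are treated as invalid here.

```java
class Utf8LeadByte {
    // Returns the total sequence length announced by a lead byte,
    // or -1 if the byte is a continuation byte (10xxxxxx) or not a valid lead byte.
    static int sequenceLength(int b) {
        b &= 0xFF;                          // treat the value as an unsigned byte
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx: single byte (ASCII)
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx: one continuation byte follows
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx: two continuation bytes follow
        if ((b & 0xF8) == 0xF0) return 4;   // 11110xxx: three continuation bytes follow
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x41)); // 'A'
        System.out.println(sequenceLength(0xC3)); // lead byte of a 2-byte sequence
        System.out.println(sequenceLength(0xA9)); // continuation byte, not a lead byte
    }
}
```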
  • 20. Re: How to detect what character encoding a file is in
    843810 Newbie
    I need some help. I'm trying to determine the encoding of a text file so that I can use the right encoding to read text in. The file has no BOM and is in either ANSI or UTF-8 format. If I erroneously use UTF-8 to read in an ANSI file, the text will appear corrupted with a lot of square boxes.

    Notepad of Win2K can correctly detect the file encoding even when there is no BOM. How can I do the same, preferably without having to read in the entire file first? Thanks.
  • 21. Re: How to detect what character encoding a file is in
    843810 Newbie
    I do not know what ANSI is referring to. Do you mean ASCII?
    If the file is proper ASCII, reading it as UTF-8 will not cause any problem. However, I have seen instances where Windows puts characters that are not ASCII, but look like ASCII, into files.
  • 22. Re: How to detect what character encoding a file is in
    843810 Newbie
    ANSI is an 8-bit extension of ASCII, namely Cp1252 (Windows Latin-1), which is similar to ISO-8859-1. Since ANSI and UTF-8 are both sequences of 8-bit units, Java cannot tell which of the two encodings a file is in when its content contains bytes above 0x7F.
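
    The corruption described earlier can be reproduced in a few lines. This sketch uses ISO-8859-1 as a stand-in for Cp1252, because it is guaranteed to be present on every JVM and encodes the sample character to the same byte (0xE9); the class and method names are invented.

```java
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Decodes bytes as UTF-8 in the lenient way new String(...) does:
    // malformed input is replaced by U+FFFD instead of raising an error.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, encoded the single-byte way: 63 61 66 E9
        byte[] singleByte = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        String wrong = decodeAsUtf8(singleByte);
        // 0xE9 is not a complete UTF-8 sequence, so a replacement character appears,
        // which many fonts draw as a box or a question mark.
        System.out.println(wrong.indexOf('\uFFFD') >= 0);
    }
}
```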

    I think I can read in the entire file as bytes and then determine if they are valid UTF-8 sequences (an invalid one will generate an exception), but it would be very inefficient to do so. Any other ideas?
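
    For reference, the exception-based check mentioned here could look roughly like the following. It relies on the standard `CharsetDecoder` with `CodingErrorAction.REPORT`; the wrapper class and method name are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true if the whole array decodes as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;                                            // at least one invalid sequence
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("caf\u00E9".getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8("caf\u00E9".getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```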
  • 23. Re: How to detect what character encoding a file is in
    843810 Newbie
    You don't have to read in the whole file. You can read it chunk by chunk if you want. You will know it is ANSI as soon as you detect a sequence that is not valid UTF-8 :)
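
    A chunked scan with early exit might be sketched as follows. The names are invented, and the check is purely structural (lead byte followed by the right number of continuation bytes, up to 4-byte sequences); it does not reject overlong forms or other finer violations.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ChunkedUtf8Scan {
    // Reads the stream chunk by chunk and returns false at the first byte
    // that cannot be part of a structurally valid UTF-8 sequence.
    static boolean looksLikeUtf8(InputStream in, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        byte[] chunk = new byte[chunkSize];
        int pending = 0; // continuation bytes still expected; carried across chunk boundaries
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                for (int i = 0; i < n; i++) {
                    int b = chunk[i] & 0xFF;
                    if (pending > 0) {
                        if ((b & 0xC0) != 0x80) return false; // expected 10xxxxxx
                        pending--;
                    } else if (b < 0x80) {
                        // plain ASCII, nothing to track
                    } else if ((b & 0xE0) == 0xC0) {
                        pending = 1;
                    } else if ((b & 0xF0) == 0xE0) {
                        pending = 2;
                    } else if ((b & 0xF8) == 0xF0) {
                        pending = 3;
                    } else {
                        return false; // stray continuation byte or invalid lead byte
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pending == 0; // a sequence cut off at end of input is also invalid
    }
}
```

    Because the pending count survives from one chunk to the next, a multi-byte character that straddles a chunk boundary is still handled correctly.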
  • 24. Re: How to detect what character encoding a file is in
    843810 Newbie
    So, is there a specific algorithm to help auto-detect the file encoding, between ANSI and UTF-8, before reading it into a JTextArea?
  • 25. Re: How to detect what character encoding a file is in
    843810 Newbie
    You would have to write it yourself. However, I can give you some pointers.

    The valid UTF-8 sequences in binary are:
    0???????
    110????? 10??????
    1110???? 10?????? 10??????
    11110??? 10?????? 10?????? 10??????
    111110?? 10?????? 10?????? 10?????? 10??????
    1111110? 10?????? 10?????? 10?????? 10?????? 10??????

    So if you know that the file is either ANSI or UTF-8, with no other possibilities, you can read the file and match it against the six patterns above. As soon as a match fails, the file is ANSI. If everything matches, it SHOULD be UTF-8, although it may still be ANSI. Also check for byte values that are invalid in ANSI.
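
    One possible shape for that matcher is sketched below. The names are invented, "ANSI" is assumed to mean Cp1252 as stated earlier in the thread, and the bytes treated as invalid ANSI are the five values Cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).

```java
class AnsiOrUtf8 {
    // Number of continuation bytes announced by a lead byte according to
    // the six patterns listed above, or -1 if no pattern matches.
    static int continuationCount(int b) {
        if ((b & 0x80) == 0x00) return 0; // 0???????
        if ((b & 0xE0) == 0xC0) return 1; // 110?????
        if ((b & 0xF0) == 0xE0) return 2; // 1110????
        if ((b & 0xF8) == 0xF0) return 3; // 11110???
        if ((b & 0xFC) == 0xF8) return 4; // 111110??
        if ((b & 0xFE) == 0xFC) return 5; // 1111110?
        return -1;
    }

    static boolean matchesUtf8Patterns(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int extra = continuationCount(data[i] & 0xFF);
            // Fail on an unknown lead byte or a sequence that runs past the end.
            if (extra < 0 || i + extra >= data.length) return false;
            for (int k = 1; k <= extra; k++) {
                if ((data[i + k] & 0xC0) != 0x80) return false; // must be 10??????
            }
            i += extra + 1;
        }
        return true;
    }

    static boolean hasUndefinedCp1252Byte(byte[] data) {
        for (byte raw : data) {
            int b = raw & 0xFF;
            if (b == 0x81 || b == 0x8D || b == 0x8F || b == 0x90 || b == 0x9D) return true;
        }
        return false;
    }

    // Returns "UTF-8", "ANSI", or "unknown" when the data fits neither.
    static String guess(byte[] data) {
        if (matchesUtf8Patterns(data)) return "UTF-8";
        return hasUndefinedCp1252Byte(data) ? "unknown" : "ANSI";
    }
}
```

    Pure ASCII input matches the first pattern throughout and is reported as UTF-8, which is harmless because ASCII decodes identically under both encodings.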
  • 26. Re: How to detect what character encoding a file is in
    843810 Newbie
    I was looking at that pattern too, but have yet to come up with a way to break it down. Actually, I only need to deal with the first 3 patterns for my files, because the Unicode characters I'm interested in have UTF-8 sequences at most three bytes long. I want to read in a dozen bytes or so and then somehow match them against those 3 UTF-8 patterns. Thanks.
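
    A sketch of such a prefix probe, limited to the first three patterns, is given below. The names and the `moreDataFollows` flag are invented for the example; the flag exists because a short sample can end in the middle of a character, which should not count against UTF-8 when the file continues.

```java
class Utf8PrefixProbe {
    // Checks a short sample against the 1-, 2- and 3-byte UTF-8 patterns only.
    // moreDataFollows says whether the file continues beyond the sample.
    static boolean sampleLooksLikeUtf8(byte[] sample, boolean moreDataFollows) {
        int i = 0;
        while (i < sample.length) {
            int b = sample[i] & 0xFF;
            int extra;
            if (b < 0x80) {
                extra = 0;                       // 0???????
            } else if ((b & 0xE0) == 0xC0) {
                extra = 1;                       // 110????? 10??????
            } else if ((b & 0xF0) == 0xE0) {
                extra = 2;                       // 1110???? 10?????? 10??????
            } else {
                return false;                    // longer forms are out of scope here
            }
            for (int k = 1; k <= extra; k++) {
                // Sequence cut off by the end of the sample: acceptable only
                // if the rest of it could still be in the unread part of the file.
                if (i + k >= sample.length) return moreDataFollows;
                if ((sample[i + k] & 0xC0) != 0x80) return false;
            }
            i += extra + 1;
        }
        return true;
    }
}
```

    Note that an all-ASCII sample passes without proving anything, so a dozen bytes may be too few for files whose first non-ASCII character appears late.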
  • 27. Re: How to detect what character encoding a file is in
    843810 Newbie
    You can find some ideas at http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/charset_detection.html

    Jaric Sng
  • 28. Re: How to detect what character encoding a file is in
    843810 Newbie

    I have a question for you: can you have DBCS characters in a variable name?
    I am not referring to the variable's value here.
  • 29. Re: How to detect what character encoding a file is in
    843810 Newbie

    A Java port of Mozilla's automatic charset detection algorithm is now available at:

    http://www.i18nfaq.com/chardet.html