That's very informative, guys. Good discussion!
But are we then concluding that:
1) while reading a file, we have to know the encoding of the file?
2) while reading a file, we have to explicitly convert the data to the encoding that we want, even if it's a conversion to, say, UTF-8?
3) while writing data to a file, we have to go through a similar process of converting the data to, say, UTF-8 before putting it into the file?
I am confused, but more than eager to learn about it.
Are we recommending that when data comes into a Java program it's better to convert it to, say, UTF-8, and that when it goes out to storage such as a database or a file it's also better to convert it to, say, UTF-8?
Please clarify. Thanks in advance.
Nice :) really similar to what I have done.
Firstly, please read this thread. A very nice intro:
Now to your questions:
1) This depends upon what you want to do. If you do know the encoding of the file, then it is not a problem. If you do not, you can always read the file into a byte array, then try to figure out the encoding from it.
2) If you do not specify the encoding of the file, Java will use the platform's default charset (not UTF-16; that is only how a String is held in memory). However, as I said in 1), you can read the file in as a byte array and then construct a String from it once you figure out the encoding.
3) Again, if you do not specify the encoding, the default is used. You can, however, specify the encoding you want. I would recommend UTF-8.
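A minimal sketch of points 2) and 3), assuming Java 7 or later for java.nio.file. The class name ExplicitEncodingDemo and the helper roundTrip are made up for illustration; the point is only that the charset is passed explicitly on both the write and the read, so the platform default never comes into play.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingDemo {
    // Encodes the text as UTF-8 on the way out, regardless of the platform default.
    static void writeUtf8(Path file, String text) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Decodes the file as UTF-8 on the way in.
    static String readUtf8(Path file) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: writes the text to a temp file and reads it back.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                writeUtf8(tmp, text);
                return readUtf8(tmp);
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 \u20ac";
        System.out.println(original.equals(roundTrip(original)));
    }
}
```

If the reader were opened with a different charset than the writer used, the non-ASCII characters would come back wrong, which is the corruption described later in this thread.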
Hope this helps.
Two items to add to this good discussion:
1) UTF-8 is, despite the name, not an 8-bit encoding. Characters are encoded with a variable length: 1 to 4 bytes under the current definition (RFC 3629); the original design allowed up to 6 bytes.
For more information see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
2) If no encoding is specified, Java uses the platform's default character encoding. For Western Europeans, for example, this will typically be ISO-Latin-1 (or the closely related Cp1252 on Windows).
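Both items can be checked from a small program. This is only a sketch; the class name Utf8LengthDemo and the helper utf8Length are invented for the example. The default charset it prints depends on the machine and JDK version, so no output is shown for that line.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // What Java falls back to when no encoding is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        System.out.println(utf8Length("A"));            // U+0041, prints 1
        System.out.println(utf8Length("\u00e9"));       // U+00E9, prints 2
        System.out.println(utf8Length("\u20ac"));       // U+20AC, prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, prints 4
    }
}
```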
1) UTF-8 is despite the name not a 8-bit encoding.
Characters are encoded with variable length from 1
byte up to 6 bytes.
Not to be rude...
The reason UTF-8 is called an 8-bit encoding is that you can decode it by reading 8 bits (1 byte) at a time. The first byte tells you how long the character is. With an encoding like UTF-16, you have to read two bytes at a time to decipher the character.
Of course I could still be wrong :) but this is how I understand it.
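To make the byte-at-a-time point concrete, here is a small sketch that prints the bits of one character in both encodings. The class name CodeUnitDemo and the helper toBinary are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Renders each byte as eight binary digits, separated by spaces.
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int pad = bits.length(); pad < 8; pad++) {
                sb.append('0');
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // U+20AC EURO SIGN

        // UTF-8: the leading 1110 of the first byte announces a 3-byte sequence.
        System.out.println(toBinary(euro.getBytes(StandardCharsets.UTF_8)));
        // prints 11100010 10000010 10101100

        // UTF-16 (big-endian, no BOM): one 16-bit unit, read two bytes at a time.
        System.out.println(toBinary(euro.getBytes(StandardCharsets.UTF_16BE)));
        // prints 00100000 10101100
    }
}
```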
I need some help. I'm trying to determine the encoding of a text file so that I can use the right encoding to read text in. The file has no BOM and is in either ANSI or UTF-8 format. If I erroneously use UTF-8 to read in an ANSI file, the text will appear corrupted with a lot of square boxes.
Notepad on Win2K can correctly detect the file encoding even when there is no BOM. How can I do the same, preferably without having to read in the entire file first? Thanks.
I do not know what ANSI is referring to. Do you mean ASCII?
If the file is in proper ASCII, reading the file as UTF-8 will not cause any problem. However, I have seen instances where Windows puts characters that are not ASCII, but look like ASCII, into files.
ANSI is an 8-bit extension of ASCII, namely Cp1252 (Windows Latin-1), which is similar to ISO-8859-1. Since ANSI and UTF-8 are both sequences of 8-bit units, Java cannot tell which encoding a file is in when its content contains bytes above 0x7F.
I think I can read in the entire file as bytes and then determine if they are valid UTF-8 sequences (an invalid one will generate an exception), but it would be very inefficient to do so. Any other ideas?
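The exception-based idea could look roughly like this. It is a sketch, not a tuned solution: the class name StrictUtf8Check is invented, and it works on a byte array that has already been read. The one detail that matters is setting the decoder to REPORT, since new String(bytes, charset) silently replaces bad bytes instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Check {
    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown on the first invalid or truncated sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] ansi = {0x63, 0x61, 0x66, (byte) 0xE9}; // "café" in Cp1252
        System.out.println(isValidUtf8(utf8)); // prints true
        System.out.println(isValidUtf8(ansi)); // prints false
    }
}
```

Note that a pure ASCII file passes this check too, since ASCII is valid UTF-8; in that case the two encodings give the same text anyway.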
You would have to write it yourself. However, I can give you some pointers.
The valid UTF-8 sequences in binary are:
0???????
110????? 10??????
1110???? 10?????? 10??????
11110??? 10?????? 10?????? 10??????
111110?? 10?????? 10?????? 10?????? 10??????
1111110? 10?????? 10?????? 10?????? 10?????? 10??????
So if you know that the file is either ANSI or UTF-8, with no other choices, you can read the file and match it against the six patterns above. As soon as a match fails, the file is in ANSI. If everything matches, it SHOULD be in UTF-8, although it may still be in ANSI. Check for invalid ANSI sequences too.
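A rough sketch of that matching, following the patterns as posted (one lead byte, then the right number of 10?????? bytes). The class and method names are made up. It only checks the bit structure; it does not reject overlong forms or surrogates, and it still accepts the 5- and 6-byte forms, which RFC 3629 no longer allows.

```java
class Utf8PatternMatcher {
    // Number of continuation bytes the lead byte calls for, or -1 if the
    // byte cannot start a sequence (10?????? or 1111111?).
    static int continuationCount(int lead) {
        if ((lead & 0x80) == 0x00) return 0; // 0???????
        if ((lead & 0xE0) == 0xC0) return 1; // 110?????
        if ((lead & 0xF0) == 0xE0) return 2; // 1110????
        if ((lead & 0xF8) == 0xF0) return 3; // 11110???
        if ((lead & 0xFC) == 0xF8) return 4; // 111110??
        if ((lead & 0xFE) == 0xFC) return 5; // 1111110?
        return -1;
    }

    // Returns false as soon as a byte fails to fit the patterns.
    static boolean matchesUtf8Patterns(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int n = continuationCount(data[i] & 0xFF);
            if (n < 0 || i + n >= data.length) {
                return false; // bad lead byte, or sequence cut off at the end
            }
            for (int k = 1; k <= n; k++) {
                if ((data[i + k] & 0xC0) != 0x80) {
                    return false; // continuation byte is not 10??????
                }
            }
            i += n + 1;
        }
        return true;
    }
}
```

For example, the Cp1252 bytes 63 61 66 E9 ("café") fail because E9 looks like the start of a 3-byte sequence with nothing after it.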
I was looking at that pattern, too, but have yet to come up with any way to break it down. Actually, I only need to deal with the first 3 sequences for my files, because the Unicode characters I'm interested in have UTF-8 sequences that are three bytes long at most. I want to read in a dozen bytes or so and then somehow match them against the 3 UTF-8 sequences. Thanks.
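One possible way to check only a short sample is to hand the first few bytes to a strict decoder and tell it that more input may follow, so a sequence cut off at the end of the sample is not counted as an error. This is a sketch with invented names (Utf8PrefixSniffer, prefixLooksLikeUtf8), and it assumes the sample has already been read into a byte array. The JDK decoder accepts sequences up to four bytes, which is a superset of the three patterns needed here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8PrefixSniffer {
    // Returns false if the sample contains a byte sequence that cannot be UTF-8.
    // A multi-byte sequence truncated at the end of the sample is tolerated.
    static boolean prefixLooksLikeUtf8(byte[] prefix) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(prefix);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(Math.max(1, prefix.length));
        // endOfInput = false: an incomplete trailing sequence is left
        // unread instead of being reported as malformed.
        CoderResult result = decoder.decode(in, out, false);
        return !result.isError();
    }
}
```

A sample that is entirely ASCII passes but proves nothing either way, so a small sample only helps if it happens to contain bytes above 0x7F; reading a larger block from the file reduces that risk.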