I need to find out the encoding of some arbitrary files (more than a thousand source-code files) downloaded from the internet, so I don't know their real encodings. Is there a Java class/method that guesses the encoding of a file? Are there any tutorials on writing such a program? (It's still fine if the program/method/class can find the encoding of a normal file rather than source code.) Any help is appreciated, thanks.
One thing to watch out for: if there don't happen to be any non-ASCII characters in the first 4KB or 8KB (whatever you set the buffer amount to), a UTF-8 file can be mis-identified as being in a single-byte encoding. To be safe, you may want to force it to check the whole file.
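To illustrate why a short buffer can mislead the detector: an ASCII-only prefix is byte-for-byte identical under UTF-8 and under any ASCII-compatible single-byte encoding such as ISO-8859-1, so no detector can tell them apart from that prefix alone. A minimal demonstration using only the JDK (no CharsetToolkit involved):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiPrefixDemo {
    public static void main(String[] args) {
        String asciiPrefix = "public class Foo { // nothing but ASCII here";
        byte[] asUtf8  = asciiPrefix.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin = asciiPrefix.getBytes(StandardCharsets.ISO_8859_1);
        // Identical bytes: a detector that only sees this prefix cannot
        // distinguish UTF-8 from a single-byte encoding.
        System.out.println(Arrays.equals(asUtf8, asLatin)); // true
    }
}
```

If the only multi-byte character sits past the buffer boundary, the detector has literally nothing to go on, which is why scanning the whole file is the safe option.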
Many many thanks for pointing this out. This will most probably save me a lot of work. Sorry it took me so long to reply, but I had a lot of things to do.
Now could you please help me with a few points (I think you have used this before):
1-) Is there a policy about using/modifying this code? I'm thinking of including it in my project.
2-) Thanks for your warning about:
"One thing to watch out for: if there don't happen to be any non-ASCII characters in the first 4KB or 8KB (whatever you set the buffer amount to), a UTF-8 file can be mis-identified as being in a single-byte encoding. To be safe, you may want to force it to check the whole file. "
What should I change to make this happen? Is it enough to specify it in the constructor, e.g. by making the buffer length equal to the whole file length? And what about the "enforce8Bit" boolean? Do I need it?
3-) Can these classes work with raw files (direct HTML source code), or do I need to extract the text part first?
4-) If you have used this before, what success rate did you get?
1-) It's used in Groovy, and Laforge is Groovy's project manager, so I assume the license is either the same as Groovy's or compatible with it.
2-) Using the file length for the buffer length should work fine. enforce8Bit just means that if the file contains only 7-bit ASCII characters, it should be treated as an 8-bit encoding anyway. The default is true, and I've never seen any reason to change it.
3-) I'm not sure I understand this question. HTML files are just text, so there's no reason why you can't use this tool on them. If you suspect part of the page is in a different encoding, I suppose you could extract that part and test it separately.
4-) I use it in a text editor (programming and viewsourcing mainly), and it works so well that I never notice it. I can't think of any higher praise than that. ^_^
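Regarding point 2, a sketch of feeding the whole file to the detector. The actual CharsetToolkit call is left as a comment because its exact constructor signature depends on which version you grab, so treat those lines as an assumption to verify against the source:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WholeFileBuffer {
    // Read the entire file so the detector sees every byte,
    // not just the first 4KB/8KB.
    public static byte[] readAll(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        byte[] buffer = readAll(Path.of(args[0]));
        // Hypothetical usage -- check the constructor of the version you have:
        // CharsetToolkit toolkit = new CharsetToolkit(buffer);
        // toolkit.setEnforce8Bit(true); // the default, per the discussion above
        // Charset guessed = toolkit.guessEncoding();
        System.out.println(buffer.length + " bytes buffered");
    }
}
```

For a thousand-odd downloaded source files this is cheap; they're rarely large enough for the full read to matter.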
Thanks for explanations...
I will try it as soon as I have the time for it. If it is really as good as you say, then I can't measure how much good you and the developer have done me.
About the third question: the raw source code of HTML files contains extra HTML-specific markup (like <a, <p). So if the program uses a method that checks the frequency of characters, the source code will contain more "a" and "p" characters, for example. (I don't know which method it uses to detect the charset.) That was the problem.
Many many thanks again...
Counting characters is an advanced technique, used to identify the encoding by first identifying the language the text is written in. CharsetToolkit is much more rudimentary than that. It can detect UTF-16 or UTF-8 from a BOM, and it can identify UTF-8 by detecting byte patterns. Failing that, if it sees any bytes with the high-order bit set, it declares the encoding to be the system default, or whatever other 8-bit encoding you've told it to expect. That means if I, on my English-language Windows system, open an Arabic document that's encoded in windows-1256, CharsetToolkit will identify it as windows-1252. That has never been a problem for me; if it's a problem for you, you'll need something much more complex, like jchardet. But don't ask me how to use that one; I'm sure it's a brilliant piece of work, but it violates every rule I know concerning API design and coding standards. ^_^
The theory: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
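The decision chain described above (BOM first, then UTF-8 byte patterns, then fall back to a default 8-bit charset) can be sketched roughly like this. This is my own simplified reimplementation for illustration, not CharsetToolkit's actual code, and it ignores UTF-32 and various corner cases:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SimpleCharsetGuesser {
    /** Guess a buffer's charset: BOM, then UTF-8 validity, then fallback. */
    public static Charset guess(byte[] b, Charset fallback8Bit) {
        // 1. Byte-order marks are unambiguous.
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
            return StandardCharsets.UTF_8;
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;
        // 2. Well-formed UTF-8 multi-byte sequences are a strong hint.
        boolean highBitSeen = false;
        boolean validUtf8 = true;
        int i = 0;
        while (i < b.length) {
            int c = b[i] & 0xFF;
            if (c < 0x80) { i++; continue; }      // plain ASCII byte
            highBitSeen = true;
            // Lead byte tells us the sequence length; 0x80-0xBF alone is invalid.
            int len = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : c >= 0xC0 ? 2 : -1;
            if (len == -1 || i + len > b.length) { validUtf8 = false; break; }
            for (int j = 1; j < len; j++) {
                if ((b[i + j] & 0xC0) != 0x80) { validUtf8 = false; break; }
            }
            if (!validUtf8) break;
            i += len;
        }
        if (highBitSeen && validUtf8) return StandardCharsets.UTF_8;
        // 3. Otherwise: pure ASCII, or some 8-bit encoding we can't identify,
        //    so report the caller's expected 8-bit charset.
        return fallback8Bit;
    }
}
```

Note how step 3 reproduces the limitation discussed above: any windows-125x document looks the same to this logic, which is exactly why it reports whatever 8-bit fallback you configured.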
First of all, I'm starting to think that you are a real angel who came to rescue me. :-) Thanks a lot for the info. Because beyond understanding the encoding system, I also had to understand which language the text is written in! And if there is also a pre-made tool that does this, or at least information about it, it will help me a lot.
Secondly, I understand that CharsetToolkit is more than enough for my problem of finding the encoding of the text (even given your windows-1256 example), and that I don't need jchardet if I only need the encoding type in a precise way. Am I right?
Then again, I will most probably need jchardet too in my project (although I don't want to deal with fuzzy stuff). Even if I don't need it, many thanks for your help. :-)