Oh, it doesn't appear correctly on this board. When I convert "�E" to UTF-8 using this online converter,
it gives 
When I convert "�E" to Unicode using this online converter,
it gives 
I googled and found that somebody encountered the same problem (with a different rare character) in this forum in 2005, and he didn't get a good solution.
I hope that after 3 years it can be solved smoothly.
I hope I have good luck!
Edited by: aaron9979215 on Jul 3, 2008 6:10 AM
Your best bet is probably to find out what the Unicode Codepoint of that character is (from your description that's not really visible, your example comes across as a crossed-out "o" and a capital "E" for me). Then use the Unicode escape "\uxxxx" to represent it in a String constant in your Java code.
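A minimal sketch of the escape approach, assuming the code point is U+E5E5 (the value reported elsewhere in this thread; substitute whatever your character really is). The class and method names are made up for illustration.

```java
class UnicodeEscapeDemo {
    // U+E5E5 is the code point quoted in this thread; replace it with your own.
    static final String RARE = "\uE5E5";

    // True if the text contains the rare character.
    static boolean containsRare(String text) {
        return text.indexOf(RARE) >= 0;
    }

    public static void main(String[] args) {
        String sample = "abc" + RARE + "def";
        System.out.println(containsRare(sample));                      // true
        System.out.println(Integer.toHexString(RARE.codePointAt(0)));  // e5e5
    }
}
```

Because the escape is resolved by the compiler, the source file itself stays plain ASCII and no copy-and-paste of the character is needed.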
I found it doesn't appear correctly and re-edited my second post.
Please take a second look.
These two converted codes do not seem to agree with each other.
Can anyone show me how to get the correct Unicode value?
Edited by: aaron9979215 on Jul 3, 2008 6:14 AM
Your best bet is probably to avoid any copy-and-pasting and to add an option to your code to dump the Unicode values of all characters it reads ...
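Such a dump option could look roughly like the following sketch. The class and method names are hypothetical; the idea is only to print every code point of the text your program has actually read, so that you see the real values instead of whatever a forum or editor displays.

```java
class CodePointDump {
    // Returns one "U+XXXX" entry per code point, separated by spaces.
    static String dump(CharSequence text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = Character.codePointAt(text, i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);  // 2 for surrogate pairs, else 1
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("A\u4E2D"));  // U+0041 U+4E2D
    }
}
```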
Thanks for your reply. But I don't quite understand your suggestion. Can you explain a bit more?
The rare character's Unicode value is "\ue5e5".
But this character cannot be read from a file in which it was saved in its original charset.
I am writing a project to crawl a series of webpages and extract some specific information on it. I don't save the webpage on my local disk but just open them online and extract the information that I am interested in. Then close the connection.
If you're going to read data from a web page and look for specific Unicode codepoints, you need to translate the stream to Unicode using the encoding specified by the particular server. I would recommend replacing your "webPage2InputStream" by "webPage2Reader", and not using "wpEncoding" (unless you got it from the webserver's response, which doesn't appear to be the case).
InputStream wpInStream = webPage2InputStream(threadHplink);
ThreadAnalyzer.Analyze(wpInStream, wpEncoding, threadBuffer);
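A "webPage2Reader" along the lines suggested here might be sketched as follows. This is not the poster's code: the method bodies, the `readAll` helper, and the in-memory test data are assumptions made for illustration. The charset name is meant to come from the web server's response rather than from a hard-coded guess.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WebPageReaderSketch {
    // Wraps the raw byte stream in a Reader that decodes it with the given charset.
    static Reader webPage2Reader(InputStream in, String charsetName) {
        return new InputStreamReader(in, Charset.forName(charsetName));
    }

    // Hypothetical helper: drains a Reader into a String and closes it.
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for bytes received from a server: two CJK characters as UTF-8.
        byte[] pageBytes = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        Reader reader = webPage2Reader(new ByteArrayInputStream(pageBytes), "UTF-8");
        System.out.println(readAll(reader).length());  // 2
    }
}
```

From the Reader onward everything is Java characters, so a parser that accepts a Reader no longer needs an encoding argument at all.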
kdgregory, I understand what you mean.
Reading the InputStream according to its original charset is done in ThreadAnalyzer.Analyze(), which is automatically generated by a compiler generator tool named JavaCC. Once a reader has read the InputStream with a specific encoding, I think the InputStream should have been translated into Unicode. Do you think so?
The webpages are all encoded in gb2312, as stated in their meta information.
The rare character's Unicode value is \uE5E5.
JoachimSaucer, I kind of understand what you mean. I modified the program to first dump the InputStream into a StringBuffer using its original charset, then write the StringBuffer to a temp file as UTF-8, and then open a FileInputStream on that temp file to analyze it. But the problem was not solved. It seems this character exists only in gb2312 and not in UTF-8. Even changing to UTF-16 still doesn't work.
Then why can it be converted to the Unicode value \uE5E5?
Edited by: aaron9979215 on Jul 3, 2008 9:49 AM
Reading the InputStream according to its original charset is done in ThreadAnalyzer.Analyze(), which is automatically generated by a compiler generator tool named JavaCC.
How does that have any bearing on encoding? The website will tell you what encoding it is using for the page. You need to get that encoding at the time you read the page. Perhaps that happens in webPage2InputStream(), but it's not apparent from your code.
Once a reader has read the InputStream with a specific encoding, I think the InputStream should have been translated into Unicode.
Why would you think that? An InputStream deals with bytes, a Reader deals with characters. Unicode refers to characters, which must be encoded to be presented as bytes.
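To make the byte/character distinction concrete, here is a small sketch (the class name and the sample character U+4E2D are chosen only for illustration). The same single character becomes different byte sequences under different encodings, and decoding bytes with the wrong charset yields different characters.

```java
import java.nio.charset.StandardCharsets;

class BytesVersusChars {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4E2D";  // one character
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));     // E4 B8 AD
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));  // 4E 2D

        // Decoding the UTF-8 bytes as ISO-8859-1 gives three unrelated characters.
        String wrong = new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());  // 3
    }
}
```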
The webpages are all encoded in gb2312 which is stated in its meta information.
Do you mean a "<meta>" tag within the page? Have you verified that the pages, as delivered by the webserver, actually use that encoding? And that there isn't a response header that says something different?
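One way to check the response header is to look at the charset parameter of the Content-Type value, which `java.net.URLConnection.getContentType()` returns as a plain string. The parser below is only a simplified sketch with made-up names; it does not handle every legal header form.

```java
import java.util.Locale;
import java.util.Optional;

class ContentTypeCharset {
    // Extracts the charset parameter from a value such as "text/html; charset=GB2312".
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GB2312"));  // Optional[GB2312]
        System.out.println(charsetOf("text/html"));                  // Optional.empty
    }
}
```

If the header carries no charset, the "<meta>" tag is the next thing to look at, but as noted above it is only a claim by the page author and may not match the bytes actually delivered.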
Thanks for your reply, kdgregory.
Your suggestion is right. I am not very familiar with web programming.
Those webpages are encoded in gb2312. I record a lot of information from each page. Most of it is successfully recorded to the database. Only the parts containing the rare character fail. I think that is because the rare character cannot be recognized by the program, even after I added what seems to be its Unicode value.
That's very bad, because those parts are important.
I would like to thank you again for your kind suggestions.
It is very late where I am.
I have to go to bed.
See you tomorrow, my friend~
Those webpages are encoded in gb2312. I record a lot of information from each page. Most of it is successfully recorded to the database. Only the parts containing the rare character fail. I think that is because the rare character cannot be recognized by the program, even after I added what seems to be its Unicode value.
I took a look at the Wikipedia entry for [GB2312|http://en.wikipedia.org/wiki/Gb2312], and one of the things that it mentioned was that the encoding was superseded by GBK and GB18030, which contain more characters. Is it possible that your "rare" character is not actually representable in GB2312?
Then I realized that the "Unicode" character that you mentioned, \uE5E5, is in the "private use" area established by the Unicode spec, so either (1) you're actually giving the EUC-CN encoded value, which isn't Unicode and is therefore senseless to search for in Java character data; (2) it isn't a character that has a Unicode value and is being mapped into the private-use area by the encoding implementation that you're using, and that implementation might have a bug; (3) it isn't the character you think it is; or (4) it isn't encoded using GB2312. Personally, I think #2 is most likely.
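The private-use claim is easy to check from Java itself, since `Character.getType` reports the Unicode general category of a code point. A minimal sketch (class and method names are illustrative):

```java
class PrivateUseCheck {
    // True if the code point's general category is "private use".
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        System.out.println(isPrivateUse(0xE5E5));  // true
        System.out.println(isPrivateUse(0x4E2D));  // false
    }
}
```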
It would be helpful for you to go to www.unicode.org, and familiarize yourself with how Unicode is structured. There's a FAQ that describes some of the issues with conversion between CJK character sets and Unicode, and you'll also find [code charts|http://www.unicode.org/charts/] and information about [locating a character by name|http://www.unicode.org/standard/where/].
I have to go to bed.
See you tomorrow, my friend~
Tomorrow is a holiday where I live, and I'm planning on staying away from computers :-)
Plus, I don't think I can give you any more help. You'll need to trace the data, in binary form, through each step of your process to find out where it's being changed.
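For tracing the binary data, one option is a pass-through stream that remembers every byte read from the server, so the raw values can be compared with what the later steps produce. This wrapper is a hypothetical sketch, not part of the poster's code, and it ignores skip/mark/reset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class TracingInputStream extends FilterInputStream {
    private final ByteArrayOutputStream seen = new ByteArrayOutputStream();

    TracingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            seen.write(b);  // record the byte that was handed to the caller
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            seen.write(buf, off, n);
        }
        return n;
    }

    // Hex dump of all bytes read through this stream so far.
    String hexSoFar() {
        StringBuilder sb = new StringBuilder();
        for (byte b : seen.toByteArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xD6, (byte) 0xD0};  // stand-in for bytes from a server
        TracingInputStream in = new TracingInputStream(new ByteArrayInputStream(raw));
        while (in.read() != -1) {
            // consume the stream as the real parser would
        }
        System.out.println(in.hexSoFar());  // D6 D0
    }
}
```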
Hi, kdgregory! Thanks for your detailed reply and suggestion.
I can see you have done all you could to offer me the best help.
This rare character just appears as a wider blank. Maybe it can be seen as
Oh, BINGO! I can't believe it! I wrote the text above this line minutes ago. Before I wrote those words, I had read kdgregory's suggestion and the related hyperlinks. Before that, I had also read a thread in this forum from 3 years ago in which a similar problem was encountered but which ended without a satisfying solution.
Just as I was about to give up all my efforts and was writing the text above as a final thank-you, I suddenly got the idea of changing the gb2312 that appears in the meta tag of the webpage to GBK or GB18030 and giving that a try. With GBK, it still did not work. But after I changed it to GB18030, IT'S DONE!!!
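The trial described here, decoding the same bytes under several candidate charsets and comparing the results, can be sketched like this. The sample bytes are a stand-in, not the rare character from the actual pages, and whether "GB2312", "GBK", and "GB18030" are available depends on the Java runtime, so the sketch checks `Charset.isSupported` first.

```java
import java.nio.charset.Charset;
import java.util.Optional;

class TryCharsets {
    // Decodes the bytes with the named charset and lists the resulting code points.
    // Returns empty if this Java runtime does not provide the charset.
    static Optional<String> decodeToCodePoints(byte[] bytes, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return Optional.empty();
        }
        // Malformed or unmappable input shows up as U+FFFD in the result.
        String text = new String(bytes, Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return Optional.of(sb.toString());
    }

    public static void main(String[] args) {
        // Stand-in bytes; replace with the raw bytes captured from the page.
        byte[] sample = {(byte) 0xD6, (byte) 0xD0};
        for (String name : new String[] {"GB2312", "GBK", "GB18030"}) {
            System.out.println(name + ": " + decodeToCodePoints(sample, name).orElse("(not supported)"));
        }
    }
}
```

Running this on the bytes of the problem character should show which charset produces a real code point rather than a private-use or replacement value.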
kdgregory, I must show my highest thanks and respect to you!!!
I regret very much that I forgot to mark this thread as a question, because at that time I was very annoyed and frustrated. So I can't give you a "correct" tag. But you absolutely deserve it! I hope you have a great holiday!