13 Replies Latest reply: Jul 3, 2008 8:31 PM by 807589

    An annoying problem with a rare character in gb2312 (Chinese charset)

    807589
      Hi, everyone! I've run into a very annoying problem with a rare character in the gb2312 charset and would really appreciate your help.
      I am writing a project that crawls a series of webpages and extracts some specific information from them. I don't save the webpages to my local disk; I just open them online, extract the information I am interested in, and then close the connection.
      InputStream wpInStream = webPage2InputStream(threadHplink);
      ThreadAnalyzer.Analyze(wpInStream, wpEncoding, threadBuffer);
      I read the webpage via webPage2InputStream. Then I use ThreadAnalyzer.Analyze to extract the information I need using the charset wpEncoding (gb2312 in this case) and store it in threadBuffer.
      However, a rare character (this one: "�E") in gb2312 often appears in the information I am interested in. It looks like a blank that is a little wider than a normal space " ". When I paste it into a Java program, it shows up as a rectangle (paste this "�E" into the Eclipse editor and you'll see). I want to match this symbol in my code, but using something like ("...|"�E"|...") (where it appears as a rectangle in the Java code) doesn't work. I don't know how to write a regular expression that matches it.
      Strangely, if I copy the .html file (including this damned symbol, of course) and save it as a .txt file in UTF-8, then it matches.
      This makes me wonder whether I should convert the InputStream to UTF-8 before extracting the information. Can anyone show me how to deal with this problem?
      I really need your help~
      It's really annoying because this is just the beginning. I don't know how many more rare characters lie ahead of me~~~ >_<
        • 1. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
          807589
          Oh, it doesn't display correctly on this board. I converted it ("�E") to UTF-8 using this online converter
          http://www.saodao.com/shiyongxinxi/shiyonggongju/20080131/29.html
          and it gives &#xE5E5
          If I convert it "�E" to unicode using this online converter
          http://www.chinaue.com/tool/uni.htm
          it gives &#58853
          Googling, I found that somebody ran into the same problem (with a different rare character) on this forum in 2005 and never got a good solution.
          I hope that after 3 years it can be solved smoothly.
          I hope I have good luck!

          Edited by: aaron9979215 on Jul 3, 2008 6:10 AM
          • 2. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
            JoachimSauer
            Your best bet is probably to find out what the Unicode Codepoint of that character is (from your description that's not really visible, your example comes across as a crossed-out "o" and a capital "E" for me). Then use the Unicode escape "\uxxxx" to represent it in a String constant in your Java code.
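A minimal sketch of that suggestion, assuming the code point turns out to be U+E5E5 (the value reported further down this thread). The class and method names are hypothetical, and replacing the character with an ordinary space is only a guess at what the crawler would want to do with it.

```java
import java.util.regex.Pattern;

// Hypothetical helper: the names are illustrative, not from the original project.
class RareCharMatcher {
    // Assumption: the troublesome character decodes to U+E5E5.
    // The regex escape (written with a doubled backslash in Java source)
    // keeps the source file pure ASCII, so no copy-and-paste is needed.
    static final Pattern RARE = Pattern.compile("\\uE5E5");

    // Reports whether the text contains the character at all.
    static boolean containsRare(String text) {
        return RARE.matcher(text).find();
    }

    // Replaces each occurrence with an ordinary space.
    static String normalize(String text) {
        return RARE.matcher(text).replaceAll(" ");
    }
}
```

The same escape can be used as one alternative inside a larger pattern. It can only match if the input was decoded to characters in a way that actually produces that code point.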
            • 3. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
              807589
              I found that it doesn't display correctly, so I re-edited my second post.
              Please take a second look.
              The two converted codes don't seem to agree with each other.
              Can anyone show me how to get the correct Unicode value?

              Edited by: aaron9979215 on Jul 3, 2008 6:14 AM
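The two converter outputs may not actually disagree: &#xE5E5 is written in hexadecimal and &#58853 in decimal. A small sketch that compares them numerically (the helper name is hypothetical):

```java
// Hypothetical helper for comparing HTML numeric character references.
class NumericEntity {
    // Parses "&#xE5E5" or "&#58853" (with or without a trailing ';') into a code point.
    static int toCodePoint(String entity) {
        String body = entity.trim();
        if (!body.startsWith("&#")) {
            throw new IllegalArgumentException("not a numeric character reference: " + entity);
        }
        body = body.substring(2);
        if (body.endsWith(";")) {
            body = body.substring(0, body.length() - 1);
        }
        if (body.startsWith("x") || body.startsWith("X")) {
            return Integer.parseInt(body.substring(1), 16); // hexadecimal form
        }
        return Integer.parseInt(body, 10); // decimal form
    }

    public static void main(String[] args) {
        int hex = toCodePoint("&#xE5E5");
        int dec = toCodePoint("&#58853");
        System.out.println(hex == dec); // prints "true": both denote U+E5E5
    }
}
```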
              • 4. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                JoachimSauer
                Your best bet is probably to avoid any copy-and-pasting and to add an option to your code to dump the Unicode values of all characters it reads ...
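One way such a dump option could look, as a rough sketch. It assumes the page is already available as a Reader, that is, as decoded characters; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical debugging aid: prints what the program really read, code point by code point.
class CodePointDump {
    // Reads everything from the Reader and returns one "U+XXXX" entry per code point.
    static String dump(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder out = new StringBuilder();
        // codePoints() joins surrogate pairs, so supplementary characters show up once.
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump(new StringReader("A B"))); // prints "U+0041 U+0020 U+0042"
    }
}
```

Running this over the text around the "wide blank" shows which code point the decoder produced, with no editor or forum software in between to mangle it.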
                • 5. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                  807589
                  Sorry, JoachimSauer.
                  Thanks for your reply, but I don't quite understand your suggestion. Could you explain a bit more?
                  Many thanks!
                  • 6. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                    807589
                    The rare character's Unicode value is "\ue5e5".
                    But the character cannot be read back from a file in which it was saved in its original charset.
                    • 7. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                      807589
                      I am writing a project that crawls a series of webpages and extracts some specific information from them. I don't save the webpages to my local disk; I just open them online, extract the information I am interested in, and then close the connection.
                      InputStream wpInStream = webPage2InputStream(threadHplink);
                      ThreadAnalyzer.Analyze(wpInStream, wpEncoding, threadBuffer);
                      If you're going to read data from a web page and look for specific Unicode codepoints, you need to translate the stream to Unicode using the encoding specified by the particular server. I would recommend replacing your "webPage2InputStream" by "webPage2Reader", and not using "wpEncoding" (unless you got it from the webserver's response, which doesn't appear to be the case).
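A rough sketch of what a "webPage2Reader" could look like. The original method is not shown in the thread, so the signature, the fallback parameter, and the header parsing below are all assumptions; a "<meta>" tag inside the page is not consulted here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical replacement for webPage2InputStream; not the poster's actual code.
class WebPageReader {
    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=GB2312". Falls back if it is missing or unknown.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        return fallback; // illegal or unsupported charset name
                    }
                }
            }
        }
        return fallback;
    }

    // Opens the page and decodes it with the charset the server announced.
    static Reader webPage2Reader(String url, Charset fallback) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        Charset charset = charsetFromContentType(conn.getContentType(), fallback);
        return new BufferedReader(new InputStreamReader(conn.getInputStream(), charset));
    }
}
```

The point of the design is that bytes are turned into characters exactly once, at the boundary, using the encoding the server reported; everything after that works on Unicode characters.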
                      • 8. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                        807589
                        kdgregory, I understand what you mean.
                        Reading InputStream according to its original charset will be done in ThreadAnalyzer.Analyze() which is automatically generated by a compiler generator tool named JavaCC. When a reader has read the inputstream with a specific encoding, I think the inputstream should have been translated into Unicode. Do you agree?
                        The webpages are all encoded in gb2312 which is stated in its meta information.
                        The rare character's Unicode value is \uE5E5.
                        • 9. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                          807589
                          JoachimSauer, I think I understand what you mean. I modified the program to first read the InputStream into a StringBuffer using its original charset, then write the StringBuffer to a temp file as UTF-8, and then open a FileInputStream on that temp file for analysis. But the problem was not solved. It seems this character exists only in gb2312 and not in UTF-8. Even changing to UTF-16 doesn't work.
                          Then why can it be converted to the Unicode value \uE5E5?

                          Edited by: aaron9979215 on Jul 3, 2008 9:49 AM
                          • 10. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                            807589
                            Reading InputStream according to its original charset will be done in ThreadAnalyzer.Analyze() which is automatically generated by a compiler generator tool named JavaCC.
                            How does that have any bearing on encoding? The website will tell you what encoding it is using for the page. You need to get that encoding at the time you read the page. Perhaps that happens in webPage2InputStream(), but it's not apparent from your code.
                            When a reader has read the inputstream with a specific encoding, I think the inputstream should have been translated into Unicode.
                            Why would you think that? An InputStream deals with bytes, a Reader deals with characters. Unicode refers to characters, which must be encoded to be presented as bytes.
                            The webpages are all encoded in gb2312 which is stated in its meta information.
                            Do you mean a "<meta>" tag within the page? Have you verified that the pages, as delivered by the webserver, actually use that encoding? And that there isn't a response header that says something different?
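To make the bytes-versus-characters distinction concrete, here is a small sketch. The class name and the sample character are arbitrary choices for illustration; the idea is only that the same character becomes different bytes under different encodings, and that decoding with the wrong charset yields different characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: one character, several byte representations.
class BytesVersusChars {
    // Encodes text with one charset, then decodes the resulting bytes with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] raw = text.getBytes(writtenAs);
        return new String(raw, readAs);
    }

    public static void main(String[] args) {
        String original = "\u4E2D"; // a single CJK character
        System.out.println(original.getBytes(StandardCharsets.UTF_8).length);    // prints 3
        System.out.println(original.getBytes(StandardCharsets.UTF_16BE).length); // prints 2
        // Decoding with the charset that was used for encoding restores the text.
        System.out.println(reinterpret(original, StandardCharsets.UTF_8,
                StandardCharsets.UTF_8).equals(original));      // prints true
        // Decoding with a different charset does not.
        System.out.println(reinterpret(original, StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1).equals(original)); // prints false
    }
}
```

This is why the charset used at the moment the InputStream is wrapped in a Reader has to match what the server actually sent.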
                            • 11. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                              807589
                              Thanks for your reply, kdgregory.
                              Your suggestion is right. I am not very familiar with web programming.
                              Those webpages are encoded in gb2312. I record a lot of information from each page. Most of it is successfully written to the database; only the parts containing the rare character fail. I think that is because the program cannot recognize the rare character, even after I added what seems to be its Unicode value.
                              That's very bad, because those parts are important.
                              I would like to thank you again for your heartfelt suggestions.
                              It is very late here.
                              I have to go to bed.
                              See you tomorrow, my friend~
                              • 12. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                                807589
                                Those webpages are encoded in gb2312. I record a lot of information from each page. Most of it is successfully written to the database; only the parts containing the rare character fail. I think that is because the program cannot recognize the rare character, even after I added what seems to be its Unicode value.
                                I took a look at the Wikipedia entry for [GB2312|http://en.wikipedia.org/wiki/Gb2312], and one of the things that it mentioned was that the encoding was superseded by GBK and GB18030, which contained more characters. Is it possible that your "rare" character is not actually representable in GB2312?

                                Then I realized that the "Unicode" character that you mentioned, \uE5E5, is in the "private use" area established by the Unicode spec, so either (1) you're actually giving the EUC-CN encoded value, which isn't Unicode and is therefore senseless to search for in Java character data; (2) the character has no Unicode value and is being mapped into the private-use space by the encoding implementation that you're using, and that implementation might have a bug; (3) it isn't the character you think it is; or (4) it isn't encoded using GB2312. Personally, I think #2 is most likely.
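Whether a code point falls in a private-use area can be checked directly in Java. The helper below is a hypothetical sketch built on the standard Character.getType call; it could be used to flag decoded text that contains such characters.

```java
// Hypothetical helper for spotting private-use characters in decoded text.
class PrivateUseCheck {
    // True if the code point belongs to the Unicode general category "private use".
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // Returns the char index of the first private-use character, or -1 if there is none.
    static int firstPrivateUseIndex(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isPrivateUse(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isPrivateUse(0xE5E5)); // prints "true"
        System.out.println(isPrivateUse(0x3000)); // prints "false" (ideographic space)
    }
}
```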

                                It would be helpful for you to go to www.unicode.org, and familiarize yourself with how Unicode is structured. There's a FAQ that describes some of the issues with conversion between CJK character sets and Unicode, and you'll also find [code charts|http://www.unicode.org/charts/] and information about [locating a character by name|http://www.unicode.org/standard/where/].

                                I have to go to bed.
                                See you tomorrow, my friend~
                                Tomorrow is a holiday where I live, and I'm planning on staying away from computers :-)

                                Plus, I don't think I can give you any more help. You'll need to trace the data, in binary form, through each step of your process to find out where it's being changed.
                                • 13. Re: An annoying problem with a rare character in gb2312 (Chinese charset)
                                  807589
                                  Hi, kdgregory! Thanks for your detailed reply and suggestion.
                                  I can see that you have done all you could to give me the best help.
                                  This rare character just appears as a wider blank. Maybe it can be seen as

                                  ------------------------------------------------------------------------------------------------------------
                                  Oh, BINGO! I can't believe it! I wrote the text above the line a few minutes ago. Before writing it, I had read kdgregory's suggestion and the pages it links to. Before that, I had also read a thread on this forum from 3 years ago in which a similar problem came up but ended without a satisfying solution.
                                  http://forum.java.sun.com/thread.jspa?messageID=3819280&tstart=0
                                  Just as I was about to give up and was writing the text above as a final thank-you, I suddenly had the idea of changing the gb2312 that appears in the webpage's meta tag to GBK or GB18030 as a try. With GBK it still did not work. But after I changed it to GB18030, IT'S DONE!!!
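The fix above was found by trial and error. A hedged sketch of how the same check could be automated: decode the raw bytes strictly, so that a charset which cannot represent some byte sequence fails loudly instead of silently substituting a replacement character. This is not what the poster did; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: find a charset that decodes the page without errors.
class StrictDecode {
    // Returns the name of the first candidate that decodes all bytes cleanly,
    // or null if every candidate reports malformed or unmappable input.
    static String firstCleanCharset(byte[] raw, String... candidates) {
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this runtime does not ship the charset
            }
            CharsetDecoder decoder = Charset.forName(name).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                decoder.decode(ByteBuffer.wrap(raw));
                return name;
            } catch (CharacterCodingException e) {
                // This charset cannot decode the bytes; try the next candidate.
            }
        }
        return null;
    }
}
```

For the pages in this thread one might call firstCleanCharset(pageBytes, "GB2312", "GBK", "GB18030"), where pageBytes is the undecoded page content. Which candidate wins depends on the runtime's mapping tables, so no result is claimed here. As far as I know, the GB18030 tables assign U+E5E5 to the byte pair A3 A0, a position GB2312 leaves undefined, which would fit the "wide blank" described in the first post.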
                                  kdgregory, I must show my highest thanks and respect to you!!!
                                  I very much regret that I forgot to mark this thread as a question, because at the time I was very annoyed and frustrated. So I can't mark your answer as correct, but you absolutely deserve it! I hope you have a great holiday!
                                  Best regards!
                                  Aaron