9 Replies. Latest reply: May 30, 2014 12:29 AM by nitul.kukadia

    UTF-8 Encoding

    nitul.kukadia

      Hi All,

         I have an input string in Japanese, and I am trying to convert it to UTF-8 encoding.

       

         String name = new String(japaneseString.getBytes(), "UTF-8");

       

          The following Japanese strings were converted successfully:

          1) アップル

           2) 赤

           3) 世丕且且世两上与丑万丣丕且丗丕

           4) 世世丗丈

       

      But it fails in the following cases:

      1) Input:  ひほわれよう        Output : �?��?��?れよ�?�

      2) Input: 存在する               Output: 存在�?�る

       

      From the above, it seems that some of the Japanese characters are not converted properly.

      Thanks

        • 1. Re: UTF-8 Encoding
          jschellSomeoneStoleMyAlias

          Java Strings ALWAYS contain UTF8.  It doesn't matter how you created it or what you think it should have in it; it is always UTF8.

           

          So if "japaneseString"  is a String then that is the source of your problem.

          • 2. Re: UTF-8 Encoding
            jtahlborn

            jschellSomeoneStoleMyAlias wrote:

             

            Java Strings ALWAYS contain UTF8.  It doesn't matter how you created it or what you think it should have in it; it is always UTF8.

             

            By "UTF-8", you meant "UTF-16", right?

            • 3. Re: UTF-8 Encoding
              jtahlborn

              Since "japaneseString" is a Java String instance (presumably), it has already been converted into Unicode by Java (through some other operation, probably while reading a file or stream).  If it is broken, the breakage happened _before_ the input data became "japaneseString".  Any charset conversion you do after that point is just massaging already broken data.  You need to show how you are creating japaneseString, because that is where the issue is.
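
The point about decoding happening before the data becomes a String can be sketched as follows. This is only an illustration, assuming the incoming bytes really are UTF-8; the class name DecodeAtBoundary and the helper readLineUtf8 are made up for the example and do not come from the original poster's code.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class DecodeAtBoundary {
    // Hypothetical helper: reads one line from a byte stream and names the
    // charset explicitly, instead of relying on the platform default.
    static String readLineUtf8(InputStream in) {
        try {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u5B58\u5728\u3059\u308B" is the failing input "存在する" from the first post,
        // written with escapes so the source file encoding does not matter.
        String expected = "\u5B58\u5728\u3059\u308B";
        byte[] wire = expected.getBytes(StandardCharsets.UTF_8); // stand-in for network bytes
        String decoded = readLineUtf8(new ByteArrayInputStream(wire));
        System.out.println("decoded correctly: " + expected.equals(decoded));
    }
}
```

Once the bytes are decoded with the right charset at this boundary, no later re-encoding step is needed.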

              • 4. Re: UTF-8 Encoding
                nitul.kukadia

                I am getting data from network stream.

                I have converted some of the Japanese strings properly. But somehow some characters are not converted properly.

                • 5. Re: UTF-8 Encoding
                  jtahlborn

                  nitul.kukadia wrote:

                   

                  But somehow some characters are not converted properly.

                  Clearly.  Which is why I said you should show the relevant code so that we can help you find the error.

                  • 6. Re: UTF-8 Encoding
                    nitul.kukadia

                    Thanks. But I am getting the value as a parameter from a Jersey web service, and it is in Japanese characters.

                     

                    You asked for more code, but it is just as above:

                       

                      String name = new String(japaneseString.getBytes(), "UTF-8");


                    Here 'japaneseString' comes from the web service parameter.

                    • 7. Re: UTF-8 Encoding
                      jtahlborn

                      So, let's take a step back.  Why do you think you need to re-encode the string in the first place?

                      • 8. Re: UTF-8 Encoding
                        jschellSomeoneStoleMyAlias

                        nitul.kukadia wrote:

                         

                        I am getting data from network stream.

                        I have converted some of the Japanese strings properly. But somehow some characters are not converted properly.

                         

                        Again, as I already said, your code indicates that you are NOT doing what you think you are doing.  You cannot create a string the way you are doing it.  The "japaneseString" has UTF16, the "name" will have UTF16.   Both have UTF16 because that is the only way Java strings exist.

                         

                        Your code is encoding the string into bytes using the platform default charset (that is what getBytes() with no argument does) and then decoding those bytes as UTF8.  Java does NOT 'convert' the character representation to UTF8.  What it does is take the bytes and attempt to read them as UTF8.  Any byte sequences that are not valid UTF8 are replaced with a replacement character.  The decoded value is then put into the String "name" as UTF16 (again not UTF8).

                         

                        So either "japaneseString" is already correct, or the way you created "japaneseString" is wrong and NOTHING you do after that will fix it.  If it is wrong you must fix how you created it.
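
A possible reason why only some characters break can be sketched like this. It rests on an assumption nobody in the thread confirmed: that the UTF-8 bytes from the service were first decoded with a single-byte platform default charset such as windows-1252. The class RoundTripDemo, the constant ASSUMED_DEFAULT and the method roundTrip are invented for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // ASSUMPTION: stands in for the platform default charset on the
    // original poster's machine; the thread never states what it was.
    static final Charset ASSUMED_DEFAULT = Charset.forName("windows-1252");

    // Simulates: UTF-8 bytes wrongly decoded with the default charset,
    // then "repaired" with new String(s.getBytes(), "UTF-8").
    static String roundTrip(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(wire, ASSUMED_DEFAULT);
        byte[] reencoded = misdecoded.getBytes(ASSUMED_DEFAULT);
        return new String(reencoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String apple = "\u30A2\u30C3\u30D7\u30EB"; // アップル, a working case from the first post
        String hi = "\u3072";                      // ひ, first character of a failing case
        System.out.println("apple survives: " + apple.equals(roundTrip(apple)));
        System.out.println("hi survives:    " + hi.equals(roundTrip(hi)));
    }
}
```

Under that assumption the repair only works when every UTF-8 byte of a character has a mapping in the single-byte charset. The UTF-8 form of ひ contains the byte 0x81, which windows-1252 leaves undefined, so that byte is lost in the first decode and cannot be recovered afterwards.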

                        • 9. Re: UTF-8 Encoding
                          nitul.kukadia

                          Thanks,

                          I found the solution: the file.encoding JVM parameter.

                          With this parameter set, there is no need for any code change. I added the parameter -Dfile.encoding=UTF-8.

                          String name = new String(japaneseString.getBytes(), "UTF-8");

                          Now I am getting 'japaneseString' already in Japanese characters.

                           

                          I got the solution, but there is still a question: why did the previous solution not work for some characters?
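
To see why the JVM parameter matters, here is a small sketch of how getBytes() with no argument follows the default charset, which -Dfile.encoding=UTF-8 sets at startup on the JVMs of that era. The class and method names are hypothetical and used only for this example.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class DefaultCharsetDemo {
    // Hypothetical check: the no-argument getBytes() gives the same bytes
    // as encoding explicitly with the JVM's default charset.
    static boolean noArgGetBytesUsesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        // Run once normally and once with -Dfile.encoding=UTF-8 to compare.
        System.out.println("default charset: " + Charset.defaultCharset());
        // "\u8D64" is 赤 from the first post.
        System.out.println("getBytes() follows it: " + noArgGetBytesUsesDefault("\u8D64"));
    }
}
```

Passing the charset explicitly, for example getBytes(StandardCharsets.UTF_8), removes the dependency on how the JVM was started.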