10 Replies Latest reply on Aug 1, 2009 5:38 PM by 843810

    Problems with Unicode webapplication

    843810
      Hi,
      I am developing a small web application which is supposed to support Unicode (various non-ASCII characters).

      I have a simple form which calls a JSP page.
      The JSP page prints the received parameter to the web browser.
      The JSP page also prints the parameter to a file.

      If I use:
           String textIn= request.getParameter("unicodeText");
           byte d[]=new byte[textIn.length()];
           textIn.getBytes(0, textIn.length(), d, 0);
           String unicodeText = (new String(d,"UTF-8"));
      Everything works fine but if I use:
           byte[] bytes = request.getParameter("unicodeText").getBytes();
           String unicodeText= new String(bytes, "UTF-8");
      OR:
           byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
           String unicodeText= new String(bytes, "UTF-8");
      OR:
           byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
           String unicodeText= new String(bytes);
      The parameter retrieved by the JSP page contains strange characters (such as a box followed by a question mark) when I submit non-ASCII characters.

      The problem is that the method:
      getBytes(0, textIn.length(), d, 0);
      used in the working example is deprecated. The API documentation says:

      Deprecated. This method does not properly convert characters into bytes. As of JDK 1.1, the preferred way to do this is via the getBytes() method, which uses the platform's default charset.


      I really do not know why I can not get it to work!

      /Fredrik

        • 1. Re: Problems with Unicode webapplication
          843810
          The client is submitting the parameter using an encoding different from UTF-8. This encoding just happens to be the same as your JVM's default encoding, so when you use the getBytes variant which does not require an encoding name, it coincidentally works. Under different circumstances it may fail.

          Regarding your other attempts, you can NOT reliably translate a String which was constructed with the wrong encoding using new String(oldString.getBytes(), encoding) -- read the javadoc to learn why.

          Instead, you should set the encoding of the html or jsp page which contains the form or url (I suggest UTF-8), then the browser should send the parameters using the proper encoding.

          At the top of your jsp page:
          <%@ page contentType="text/html; charset=UTF-8" %>
          ... and/or within the <head> tag of your html:
          <meta http-equiv="content-type" content="text/html; charset=utf-8">
          • 2. Re: Problems with Unicode webapplication
            843810
            Hi JN,
            Thanks for your reply. Unfortunately, I had already included the tag:

            <%@ page contentType="text/html; charset=UTF-8" %>

            at the top of my jsp page AND the

            <meta http-equiv="content-type" content="text/html; charset=utf-8">

            within the html.

            The client is submitting the parameter using a different encoding than UTF-8 even though the mentioned tags are used.

            When I try the following
            byte[] bytes = request.getParameter("unicodeText").getBytes("ISO-8859-1");     
            String unicodeText= new String(bytes, "UTF-8");
            it works fine.
            To me this indicates that the client sends the parameter encoded in ISO-8859-1.
            I extract the bytes using the same encoding and then decode the byte array as UTF-8.
            I thought that the tags mentioned above would make the client send the parameter encoded in UTF-8 (and that the method
            getBytes("UTF-8")
            should work), but it seems they don't.
            Furthermore, the client and Tomcat seem to use different default encodings, since the method getBytes() (with no encoding argument) does not work.

            The problem is that when this application is installed on a machine in Asia, the default encoding of the client will NOT be ISO-8859-1 and it will not work.

            How can I make the client encode the parameters in UTF-8?
            • 3. Re: Problems with Unicode webapplication
              843810
              When I try the following
              byte[] bytes = request.getParameter("unicodeText").getBytes("ISO-8859-1");
              String unicodeText= new String(bytes, "UTF-8");
              it works fine.
              To me this is an indication of that the client sends the parameter encoded in ISO 8859-1.
              Hmm are you sure? I could be wrong, but to me this would seem to indicate that the client is correctly sending with UTF-8 encoding, and your jsp container is incorrectly decoding it to String with ISO-8859-1. Then, you are retrieving the String, re-encoding it to binary with ISO-8859-1 to reverse the jsp container's mistake, and then correctly decoding it back to String using UTF-8 (which was the correct encoding used by the browser).

              Which jsp container are you using? Are you using the most recent version? Does it properly handle the Content-Type (i.e., charset) HTTP header on the request? Is there some configuration option which is forcing ISO-8859-1 decoding? Could you try testing with another jsp container (I use resin, see http://www.caucho.com )?
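The decode/re-encode sequence described above can be reproduced without a container. The following is only a sketch using the standard library: the class and method names are made up for illustration, and it models the container's suspected mistake rather than anything Tomcat is confirmed to do. It also shows why the deprecated getBytes(int, int, byte[], int) from the original question appears to work: that method keeps the low-order 8 bits of each char, which for chars below U+0100 gives the same bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Models a container that wrongly assumes ISO-8859-1 for input that is really UTF-8.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);   // bytes as sent by the browser
        return new String(wire, StandardCharsets.ISO_8859_1);      // wrong charset applied
    }

    // The workaround from the thread: undo the ISO-8859-1 step, then decode as UTF-8.
    static String repair(String garbled) {
        byte[] wire = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    // Same idea as the deprecated getBytes(int, int, byte[], int):
    // keep only the low-order 8 bits of each char.
    static String repairWithLowBytes(String garbled) {
        byte[] wire = new byte[garbled.length()];
        for (int i = 0; i < wire.length; i++) {
            wire[i] = (byte) garbled.charAt(i);
        }
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e5\u00e4\u00f6";   // the three letters a-ring, a-umlaut, o-umlaut
        String garbled = misdecode(original);
        System.out.println(garbled.length());                              // 6: each 2-byte sequence became 2 chars
        System.out.println(repair(garbled).equals(original));              // true
        System.out.println(repairWithLowBytes(garbled).equals(original));  // true
    }
}
```

The repair only succeeds because ISO-8859-1 maps every byte value to a char, so nothing is lost in the wrong decode. It is a workaround for a misconfigured decode step, not a substitute for fixing it.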
              • 4. Re: Problems with Unicode webapplication
                843810
                Hi JN,
                Thank you for taking time to think about my problem.
                You gave me some ideas...
                You are probably right that the JSP container decodes the parameter as ISO-8859-1 and that getBytes("ISO-8859-1") recovers the original UTF-8 bytes.

                I reconfigured my jakarta-tomcat 4.1.29 so that it is forced to start with:
                JAVA_OPTS=-Dfile.encoding=UTF-8
                and now it works.
                Before I did this, the JSP container probably used the default platform charset, which is ISO-8859-1.

                I tried this before but then I typed in:
                JAVA_OPTS=-Dfile.encoding=utf8
                which did not work.

                So, thanks a lot.
                /Fredrik





                • 5. Re: Problems with Unicode webapplication
                  843810
                  Noooo, I guess I was too fast when writing the previous reply!
                  Even though I start tomcat with:
                  JAVA_OPTS=-Dfile.encoding=UTF-8
                  it does not work :-(

                  Maybe I should try caucho....
                  /Fredrik
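A -Dfile.encoding setting is generally reflected in the JVM's default charset, which is what no-argument calls such as getBytes() use. Whether a container consults that default when decoding request parameters is a separate matter, so changing it may have no effect on getParameter(). The sketch below (standard library only, class and method names invented for illustration) prints what the JVM actually picked up. It also shows that charset names are resolved through aliases, so "utf8" and "UTF-8" name the same charset when looked up via Charset.forName.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class DefaultCharsetDemo {
    // The no-argument getBytes() encodes with the JVM default charset.
    static boolean noArgUsesDefault(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(Charset.defaultCharset()));
    }

    // Resolves an alias to the canonical charset name.
    static String canonicalName(String alias) {
        return Charset.forName(alias).name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println(canonicalName("utf8"));                   // UTF-8
        System.out.println(noArgUsesDefault("\u00e5\u00e4\u00f6"));  // true
    }
}
```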
                  • 6. Re: Problems with Unicode webapplication
                    843810
                    Hi again,
                    I installed resin and started it with the default configuration and changed my jsp file as follows:
                    byte[] bytes =  request.getParameter("unicodeText").getBytes("UTF-8");
                    String unicodeText= new String(bytes, "UTF-8");
                    and this gives the wanted result (i.e. it works).
                    I have tried around 20 Chinese characters as well as å, ä and ö.
                    There is only one small thing... all combinations of å, ä and ö work except åäö in this order.

                    Anyway, this is a minor problem.

                    Now I have to figure out why it works with resin but not with jakarta-tomcat....

                    Does anyone have any idea?
                    • 7. Re: Problems with Unicode webapplication
                      843810
                      Hi again,
                      I installed resin and started it with the default configuration and changed my jsp file as follows:
                      byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
                      String unicodeText= new String(bytes, "UTF-8");
                      What you are doing is unnecessary (but should still work). Instead, simply write:
                      String unicodeText = request.getParameter("unicodeText");
                      as the unicode string has already been decoded by the jsp container.
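To illustrate why the extra conversion is unnecessary, here is a small sketch using only the standard library (the class and method names are made up). Encoding a String to UTF-8 and decoding those bytes as UTF-8 returns an equal String, assuming the input contains no unpaired surrogates. A correctly decoded parameter therefore stays correct, and a garbled one stays garbled.

```java
import java.nio.charset.StandardCharsets;

final class SameCharsetRoundTrip {
    // Encode and decode with the same charset.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "\u00e5\u00e4\u00f6";  // a-ring, a-umlaut, o-umlaut, decoded correctly
        String garbled = "\u00c3\u00a5";        // what a-ring looks like after a wrong ISO-8859-1 decode
        System.out.println(roundTrip(correct).equals(correct));  // true
        System.out.println(roundTrip(garbled).equals(garbled));  // true: still garbled
    }
}
```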

                      and this gives the wanted result (i.e. it works).
                      I have tried around 20 Chinese characters as well as å, ä and ö.
                      There is only one small thing... all combinations of å, ä and ö work except åäö in this order.
                      Really? I have tried the same character sequence (with your above code, and with my simplified version) and both work. Here is my test page:
                      <%@ page contentType="text/html; charset=UTF-8" %>
                      <%@ page session="false" %>
                      
                      <html>
                       <head>
                        <title>Test encodings</title>
                        <meta http-equiv="content-type" content="text/html; charset=utf-8">
                       </head>
                       <br />
                       <br />
                      
                       Request parameters:<br />
                       <%
                         java.util.Map params = request.getParameterMap();
                         for (java.util.Iterator i = params.keySet().iterator(); i.hasNext(); ) {
                           String param = (String) i.next();
                           String[] values = request.getParameterValues(param);
                           out.write("<b>" + param + "=[</b>");
                           for (int j = 0; j < values.length; j++) {
                             if (j > 0) {
                               out.write("<b>,</b>");
                             }
                             out.write(values[j]);
                           }
                           out.write("<b>]</b> ");
                         }
                      
                       %>
                       
                       <body bgcolor="#ffffff" text="#000000"
                        link="#006699" vlink="#006699" alink="#006699" leftmargin="0" topmargin="0"
                        marginheight="0" marginwidth="0">
                      
                        <table border="0" align="left">
                         <tr><th colspan="2">Enter some values</th></tr>
                         <form method="post"
                          action="testenc2.jsp"
                          <%-- enctype="multipart/form-data" --%> 
                          >
                          <input type="hidden" name="hidden" value="hidden_val" />
                          <tr>
                           <td>name</td>
                           <td>
                            <input type="text" name="name" size="50" maxlength="100"/></td>
                          </tr> 
                          <tr>
                           <td>address</td>
                           <td>
                            <input type="text" name="address" size="15" maxlength="20"/>
                           </td>
                          </tr>
                      
                          <tr>
                           <td colspan="2">
                           <input type="submit" name="cmd" value="send" />
                           <input type="submit" name="cmd" value="cancel" />
                           </td>
                          </tr>
                         </form>
                        </table>
                        
                       </body>
                      </html>
                      Anyway, this is a minor problem.

                      Now I have to figure out why it works with resin but not with jakarta-tomcat....
                      Doesn't tomcat have a mailing list? That's probably a good place to ask or search.
                      • 8. Re: Problems with Unicode webapplication
                        843810
                        OK, thanks.
                        So... now I know that there is nothing wrong with my jsp page but the problem has to do with the configuration of the environment.
                        I will try to find the jakarta mailing list...

                        /Fredrik
                        • 9. Re: Problems with Unicode webapplication
                          843810
                          byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
                          String unicodeText= new String(bytes, "UTF-8");
                          FYI
                          - This does nothing. Encoding the string to bytes and decoding those bytes with the same charset simply yields the original string again.

                          - I also watched the network packets sent by an (IE) browser, and there is no reference to the character set in the header. The content is correctly encoded in UTF-8, but that's it.

                          This is the content type part of a post request.
                          Content-Type: multipart/form-data; boundary=---------------------------7d42661a20186

                          It all depends on the default character set the container uses, I guess.
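Because the request carries no charset, the same bytes can turn into different Strings depending on what the receiver assumes. The sketch below models this with java.net.URLEncoder and java.net.URLDecoder from the standard library. It assumes an application/x-www-form-urlencoded body (the default for a plain form post) rather than the multipart request shown above, and the class and method names are invented for illustration. As far as I know, the Servlet specification has containers fall back to ISO-8859-1 when the request names no encoding, and ServletRequest.setCharacterEncoding("UTF-8"), called before any parameter is read, is the standard way to override that.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class FormDecodingDemo {
    // What a browser puts on the wire for a page served as UTF-8.
    static String encodeAsBrowser(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);  // UTF-8 is always available
        }
    }

    // What the receiving side produces, depending on the charset it assumes.
    static String decodeAs(String wire, String charsetName) {
        try {
            return URLDecoder.decode(wire, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String wire = encodeAsBrowser("\u00e5");  // a-ring
        System.out.println(wire);                                           // %C3%A5
        System.out.println(decodeAs(wire, "UTF-8").equals("\u00e5"));       // true: one char, as typed
        System.out.println(decodeAs(wire, "ISO-8859-1").length());          // 2: two garbled chars
    }
}
```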
                          • 10. Re: Problems with Unicode webapplication
                            843810
                            My observation is:

                            Suppose you have a byte array byteArrOrg.
                            Now you do the following:

                            String enc = ""; // to be explained below

                            String s = new String(byteArrOrg, enc);
                            byte[] resultByteArr = s.getBytes(enc);

                            In the above code, if enc is "ISO-8859-1", resultByteArr and byteArrOrg will always be the same.
                            If enc is not "ISO-8859-1", resultByteArr and byteArrOrg will NOT always be the same.

                            This is because the "ISO-8859-1" conversion does not replace the original bytes with 0x3f if a byte is not found. This is my assumption based on observing the program, but I would like to find proof in the spec to support it.
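This observation can be checked with a short program. The sketch below (standard library only, names invented for illustration) pushes all 256 byte values through a decode/encode round trip. ISO-8859-1 assigns a char to every byte value, so nothing is replaced. With UTF-8 or US-ASCII, bytes that are not valid input are replaced during decoding, and the String and getBytes Javadoc state that malformed or unmappable sequences are always replaced with the charset's default replacement, so the original bytes cannot be recovered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteRoundTrip {
    // All 256 possible byte values, in order.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // Decode with cs, encode again with cs, and compare with the input.
    static boolean survives(byte[] original, Charset cs) {
        byte[] result = new String(original, cs).getBytes(cs);
        return Arrays.equals(original, result);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(survives(all, StandardCharsets.ISO_8859_1));  // true
        System.out.println(survives(all, StandardCharsets.UTF_8));       // false
        System.out.println(survives(all, StandardCharsets.US_ASCII));    // false
    }
}
```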