This discussion is archived
10 Replies Latest reply: Aug 1, 2009 10:38 AM by 843810

Problems with Unicode webapplication

843810 Newbie
Hi,
I am developing a small web application that is supposed to support Unicode (different characters such as �S�\ etc.).

I have a simple form which calls a JSP page.
The JSP page prints the received parameter to the web browser.
The JSP page also prints the parameter to a file.

If I use:
     String textIn= request.getParameter("unicodeText");
     byte d[]=new byte[textIn.length()];
     textIn.getBytes(0, textIn.length(), d, 0);
     String unicodeText = (new String(d,"UTF-8"));
Everything works fine but if I use:
     byte[] bytes = request.getParameter("unicodeText").getBytes();
     String unicodeText= new String(bytes, "UTF-8");
OR:
     byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
     String unicodeText= new String(bytes, "UTF-8");
OR:
     byte[] bytes = request.getParameter("unicodeText").getBytes("UTF-8");
     String unicodeText= new String(bytes);
The parameter retrieved by the JSP page contains strange characters (such as a box followed by a question mark) when I submit, for example, '�\' and '�S'.

The problem is that the method:
getBytes(0, textIn.length(), d, 0);
used in the working example is deprecated. The API documentation for this getBytes() variant says:

Deprecated. This method does not properly convert characters into bytes. As of JDK 1.1, the preferred way to do this is via the getBytes() method, which uses the platform's default charset.
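A small sketch of why the deprecated variant behaves as it does: its documentation says each byte receives the 8 low-order bits of the corresponding character. For characters up to U+00FF that is the same as encoding with ISO-8859-1; for anything above, the high byte is silently dropped. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LowByteCopyDemo {

    // Wraps the deprecated String.getBytes(int, int, byte[], int),
    // which copies only the low 8 bits of each char.
    @SuppressWarnings("deprecation")
    static byte[] lowBytes(String s) {
        byte[] out = new byte[s.length()];
        s.getBytes(0, s.length(), out, 0);
        return out;
    }

    public static void main(String[] args) {
        // Chars up to U+00FF: same result as encoding with ISO-8859-1.
        String latin = "\u00e5\u00e4\u00f6";
        byte[] expected = latin.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(lowBytes(latin), expected)); // true

        // Chars above U+00FF: the high byte is lost, U+4E2D becomes 0x2D.
        System.out.println(lowBytes("\u4e2d")[0] == 0x2d); // true
    }
}
```

So the first snippet in the question is, in effect, a getBytes("ISO-8859-1") followed by a UTF-8 decode, which only helps if the parameter String holds UTF-8 bytes that were decoded one byte per character.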


I really do not know why I can not get it to work!

/Fredrik

  • 1. Re: Problems with Unicode webapplication
    843810 Newbie
    The client is submitting the parameter using an encoding different from UTF-8. This encoding just happens to be the same as your JVM's default encoding, so when you use the getBytes variant which does not require an encoding name, it coincidentally works. Under different circumstances it may fail.

    Regarding your other attempts, you can NOT reliably translate a String which was constructed with the wrong encoding using new String(oldString.getBytes(), encoding) -- read the javadoc to learn why.
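To illustrate the point about Strings constructed with the wrong encoding, here is a minimal sketch (class and method names are hypothetical). It takes the ISO-8859-1 bytes of "åäö", decodes them as UTF-8, and checks whether the original bytes can be recovered. They cannot, because the decoder replaces malformed input with U+FFFD and the original byte values are gone.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {

    // Decodes the bytes as UTF-8, re-encodes as UTF-8, and reports
    // whether the original bytes came back unchanged.
    static boolean survivesUtf8RoundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return Arrays.equals(original, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // "åäö" encoded as ISO-8859-1: E5 E4 F6, which is not valid UTF-8.
        byte[] latin1Bytes = "\u00e5\u00e4\u00f6".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = new String(latin1Bytes, StandardCharsets.UTF_8);
        System.out.println(wrong.indexOf('\uFFFD') >= 0);        // true: replacement chars
        System.out.println(survivesUtf8RoundTrip(latin1Bytes));  // false: data was lost

        // Valid UTF-8 input survives the same round trip.
        byte[] utf8Bytes = "\u00e5\u00e4\u00f6".getBytes(StandardCharsets.UTF_8);
        System.out.println(survivesUtf8RoundTrip(utf8Bytes));    // true
    }
}
```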

    Instead, you should set the encoding of the html or jsp page which contains the form or url (I suggest UTF-8), then the browser should send the parameters using the proper encoding.

    At the top of your jsp page:
    <%@ page contentType="text/html; charset=UTF-8" %>
    ... and/or within the <head> tag of your html:
    <meta http-equiv="content-type" content="text/html; charset=utf-8">
  • 2. Re: Problems with Unicode webapplication
    843810 Newbie
    Hi JN,
    Thanks for your reply. Unfortunately, I already included the tag:

    <%@ page contentType="text/html; charset=UTF-8" %>

    at the top of my jsp page AND the

    <meta http-equiv="content-type" content="text/html; charset=utf-8">

    within the html.

    The client is submitting the parameter using a different encoding than UTF-8 even though the mentioned tags are used.

    When I try the following
    byte[] bytes = request.getParameter("unicodeText").getBytes("ISO-8859-1");     
    String unicodeText= new String(bytes, "UTF-8");
    it works fine.
    To me this indicates that the client sends the parameter encoded in ISO-8859-1.
    I read the bytes out in the same encoding and decode the byte array as UTF-8.
    I thought that the tags mentioned above would make the client send the parameter encoded in UTF-8 (and that the method
    getBytes("UTF-8")
    would work), but it seems they don't.
    Further, the client and Tomcat seem to use different default encodings, since the method getBytes() (no encoding) does not work.

    The problem is that when this application is installed on a machine in Asia, the default encoding of the client will NOT be ISO-8859-1 and it will not work.

    How can I make the client encode the parameters in UTF-8?
  • 3. Re: Problems with Unicode webapplication
    843810 Newbie
    When I try the following
    byte[] bytes =
    request.getParameter("unicodeText").getBytes("ISO-8859-1");     
    String unicodeText= new String(bytes, "UTF-8");
    it works fine.
    To me this is an indication of that the client sends
    the parameter encoded in ISO 8859-1.
    Hmm are you sure? I could be wrong, but to me this would seem to indicate that the client is correctly sending with UTF-8 encoding, and your jsp container is incorrectly decoding it to String with ISO-8859-1. Then, you are retrieving the String, re-encoding it to binary with ISO-8859-1 to reverse the jsp container's mistake, and then correctly decoding it back to String using UTF-8 (which was the correct encoding used by the browser).
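The sequence described above can be reproduced outside a container. The sketch below simulates it with plain Strings and byte arrays; the method names containerDecode and repair are invented for the illustration, and the assumption that the container decodes with ISO-8859-1 is exactly the hypothesis being tested, not something confirmed about any particular server.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {

    // Simulates a container that decodes the request bytes as ISO-8859-1.
    static String containerDecode(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes that decoding: ISO-8859-1 maps every byte to one char,
    // so re-encoding with it returns the exact bytes the browser sent.
    static String repair(String misdecoded) {
        byte[] wireBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String typed = "\u00e5\u00e4\u00f6 \u4e2d\u6587";            // what the user entered
        byte[] wireBytes = typed.getBytes(StandardCharsets.UTF_8);  // browser sends UTF-8

        String param = containerDecode(wireBytes);
        System.out.println(param.equals(typed));         // false: mojibake
        System.out.println(repair(param).equals(typed)); // true: original text recovered
    }
}
```

If this matches what you see, the browser is already sending UTF-8 and the remaining question is how the container picks its decoding charset.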

    Which jsp container are you using? Are you using the most recent version? Does it properly handle the Content-Type (i.e., charset) HTTP header on the request? Is there some configuration option which is forcing ISO-8859-1 decoding? Could you try testing with another jsp container (I use resin, see http://www.caucho.com )?
  • 4. Re: Problems with Unicode webapplication
    843810 Newbie
    Hi JN,
    Thank you for taking time to think about my problem.
    You gave me some ideas...
    You are probably right that the jsp container decodes the parameter with ISO-8859-1 and that getBytes("ISO-8859-1") recovers the original UTF-8 bytes.

    I reconfigured my jakarta-tomcat 4.1.29 so that it is forced to start with:
    JAVA_OPTS=-Dfile.encoding=UTF-8
    and now it works.
    Before I did this, the jsp container probably used the default platform charset, which is ISO-8859-1.

    I tried this before but then I typed in:
    JAVA_OPTS=-Dfile.encoding=utf8
    which did not work.

    So, thanks a lot.
    /Fredrik





  • 5. Re: Problems with Unicode webapplication
    843810 Newbie
    Noooo, I guess I was too fast when writing the previous reply!
    Even though I start tomcat with:
    JAVA_OPTS=-Dfile.encoding=UTF-8
    it does not work :-(

    Maybe I should try caucho....
    /Fredrik
  • 6. Re: Problems with Unicode webapplication
    843810 Newbie
    Hi again,
    I installed resin and started it with the default configuration and changed my jsp file as follows:
    byte[] bytes =  request.getParameter("unicodeText").getBytes("UTF-8");
    String unicodeText= new String(bytes, "UTF-8");
    and this gives the wanted result (i.e. it works).
    I have tried around 20 Chinese characters and å, ä and ö.
    There is only one small thing... all combinations of å, ä and ö work except åäö in this order.

    Anyway, this is a minor problem.

    Now I have to figure out why it works with resin but not with jakarta-tomcat....

    Does anyone have any idea?
  • 7. Re: Problems with Unicode webapplication
    843810 Newbie
    Hi again,
    I installed resin and started it with the default
    configuration and changed my jsp file as follows:
    byte[] bytes =
    request.getParameter("unicodeText").getBytes("UTF-8");
    
    String unicodeText= new String(bytes, "UTF-8");
    What you are doing is unnecessary (but should still work). Instead, simply write:
    String unicodeText = request.getParameter("unicodeText");
    as the unicode string has already been decoded by the jsp container.
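A quick way to convince yourself that the extra conversion is redundant: encoding a String to bytes and decoding those bytes with the same charset gives back an equal String. The sketch below (hypothetical class name) checks this for a correct String and for a mis-decoded one, which shows the same-charset round trip neither harms good data nor repairs bad data.

```java
import java.nio.charset.StandardCharsets;

public class SameCharsetRoundTripDemo {

    // Encode to UTF-8 and decode from UTF-8 again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String good = "\u00e5\u00e4\u00f6 \u4e2d\u6587";
        System.out.println(roundTrip(good).equals(good)); // true: nothing changed

        // A String that was already decoded with the wrong charset stays wrong.
        String bad = new String(good.getBytes(StandardCharsets.UTF_8),
                                StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip(bad).equals(bad));   // true: still mojibake
        System.out.println(roundTrip(bad).equals(good));  // false: not repaired
    }
}
```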

    and this gives the wanted result (i.e. it works).
    I have tried around 20 Chinese characters and å,
    ä and ö.
    There is only one small thing... all combinations of
    å, ä and ö work except
    åäö in this order.
    Really? I have tried the same character sequence (with your above code, and with my simplified version) and both work. Here is my test page:
    <%@ page contentType="text/html; charset=UTF-8" %>
    <%@ page session="false" %>
    
    <html>
     <head>
      <title>Test encodings</title>
      <meta http-equiv="content-type" content="text/html; charset=utf-8">
     </head>
     <br />
     <br />
    
     Request parameters:<br />
     <%
       java.util.Map params = request.getParameterMap();
       for (java.util.Iterator i = params.keySet().iterator(); i.hasNext(); ) {
         String param = (String) i.next();
         String[] values = request.getParameterValues(param);
         out.write("<b>" + param + "=[</b>");
         for (int j = 0; j < values.length; j++) {
           if (j > 0) {
             out.write("<b>,</b>");
           }
           out.write(values[j]);
         }
         out.write("<b>]</b> ");
       }
    
     %>
     
     <body bgcolor="#ffffff" text="#000000"
      link="#006699" vlink="#006699" alink="#006699" leftmargin="0" topmargin="0"
      marginheight="0" marginwidth="0">
    
      <table border="0" align="left">
       <tr><th colspan="2">Enter some values</th></tr>
       <form method="post"
        action="testenc2.jsp"
        <%-- enctype="multipart/form-data" --%> 
        >
        <input type="hidden" name="hidden" value="hidden_val" />
        <tr>
         <td>name</td>
         <td>
          <input type="text" name="name" size="50" maxlength="100"/></td>
        </tr> 
        <tr>
         <td>address</td>
         <td>
          <input type="text" name="address" size="15" maxlength="20"/>
         </td>
        </tr>
    
        <tr>
         <td colspan="2">
         <input type="submit" name="cmd" value="send" />
         <input type="submit" name="cmd" value="cancel" />
         </td>
        </tr>
       </form>
      </table>
      
     </body>
    </html>
    Anyway, this is a minor problem.

    Now I have to figure out why it works with resin but
    not with jakarta-tomcat....
    Doesn't tomcat have a mailing list? That's probably a good place to ask or search.
  • 8. Re: Problems with Unicode webapplication
    843810 Newbie
    OK, thanks.
    So... now I know that there is nothing wrong with my jsp page but the problem has to do with the configuration of the environment.
    I will try to find the jakarta mailing list...

    /Fredrik
  • 9. Re: Problems with Unicode webapplication
    843810 Newbie
    byte[] bytes =
    request.getParameter("unicodeText").getBytes("UTF-8");

    String unicodeText= new String(bytes, "UTF-8");
    FYI
    - This does nothing. Encoding to bytes and decoding them again with the same charset just reverses the first step, so you get the same String back.

    - I also watched the network packets sent by an (IE) browser, and there is no reference to the character set in the header. The content is correctly encoded in UTF-8, but that's it.

    This is the content type part of a post request.
    Content-Type: multipart/form-data; boundary=---------------------------7d42661a20186

    It all depends on the default character set the container uses, I guess.
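To make the dependence on the decoding charset concrete, here is a sketch using java.net.URLEncoder and URLDecoder. It assumes a form submitted as application/x-www-form-urlencoded rather than the multipart request shown above, and the class and method names are made up. The same percent-encoded bytes produce different text depending on which charset the receiving side picks.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class FormCharsetDemo {

    // Percent-encodes a value the way a browser would for the given charset.
    static String encodeAs(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    // Decodes a percent-encoded value using the charset the server assumes.
    static String decodeAs(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String wire = encodeAs("\u00e5", "UTF-8");
        System.out.println(wire); // %C3%A5, with no hint of the charset used

        // Server assumes UTF-8: the original character comes back.
        System.out.println(decodeAs(wire, "UTF-8").equals("\u00e5"));            // true
        // Server assumes ISO-8859-1: two characters, U+00C3 and U+00A5.
        System.out.println(decodeAs(wire, "ISO-8859-1").equals("\u00c3\u00a5")); // true
    }
}
```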
  • 10. Re: Problems with Unicode webapplication
    843810 Newbie
    My observation is:

    Suppose you have a byteArrOrg.
    Now you do the following:

    String enc = ""; // to be explained below

    String s = new String(byteArrOrg, enc);
    byte[] resultByteArr = s.getBytes(enc);

    In the above code, if enc is "ISO-8859-1", resultByteArr and byteArrOrg will always be the same.
    If enc is not "ISO-8859-1", resultByteArr and byteArrOrg will NOT always be the same.

    This is because the "ISO-8859-1" conversion never has to replace an original byte with 0x3f: every byte value maps to a character. This is my assumption based on observing the program's behavior, but I would like a reference to the spec that supports it.
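The observation can be checked over every possible byte value. As far as the String javadoc goes, the byte[] constructor replaces malformed input with the charset's default replacement string and getBytes replaces unmappable characters with the charset's default replacement bytes. ISO-8859-1 assigns a character to each of the 256 byte values, so neither replacement ever triggers. The sketch below uses hypothetical names and compares ISO-8859-1 with UTF-8 and US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteRoundTripDemo {

    // Returns an array holding every byte value from 0x00 to 0xFF.
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    // bytes -> String -> bytes, using the same charset both ways.
    static byte[] roundTrip(byte[] original, Charset enc) {
        String s = new String(original, enc);
        return s.getBytes(enc);
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.ISO_8859_1))); // true
        System.out.println(Arrays.equals(all, roundTrip(all, StandardCharsets.UTF_8)));      // false

        // With US-ASCII, a byte outside the charset comes back as 0x3f ('?').
        byte[] one = roundTrip(new byte[] {(byte) 0xE5}, StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(one[0] & 0xff)); // 3f
    }
}
```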