20 Replies Latest reply: Feb 15, 2011 8:48 AM by abillconsl

    Unicode File movement from Windows to Unix adding Special Characters

    836688
      Hi,
      In our application, a Unicode file containing German and Japanese characters is submitted and moved to a Unix directory using the MultipartRequest Java API. Oracle PL/SQL later processes the file and makes entries in the database.

      We have observed that this load fails because the file contains some extra special characters after it is transferred to Unix. The file is untouched if it contains only English characters. To confirm this, we created a file containing German/Japanese characters directly on Unix and called the Oracle stored procedure, and it worked fine. When this same file was moved back to Windows using WinSCP, the file was different again.

      So overall it looks like moving a Unicode file between Windows and Unix changes the file in some way, for some reason. Please let me know if any Java API can avoid this issue.

      I searched the Net for close to a week but couldn't find anything related. Any help will be greatly appreciated.

      If we can't find a solution, we are considering using POI so that Java can update the database directly.

      Rgds,
      Raghu
        • 1. Re: Unicode File movement from Windows to Unix adding Special Characters
          DrClap
          It's certainly possible to move a file from one environment to another while preserving the data byte-for-byte. That's done every day and it's the default for a lot of software. However it's possible for badly-written software to damage data by making bad assumptions.

          So you should examine the software which is doing the file-moving. If it's software which you wrote then examine it for sins like assuming the data is characters in the system's default encoding and fix that. If it's somebody else's software then they should be asked to fix it, or you should stop using their software.
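
          The "sin" described above can be illustrated with a minimal sketch. The class and method names here are hypothetical, and US-ASCII merely stands in for whatever default encoding a text-oriented transfer might assume. It shows that pure-ASCII data survives a round trip through a String, while non-ASCII bytes do not, which would match the symptom that only files with English characters arrive untouched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not taken from the thread.
class EncodingAssumptionDemo {

    // Round-trips bytes through a String using the assumed charset,
    // which is what text-oriented code (Readers/Writers) does implicitly.
    static byte[] roundTripAsText(byte[] data, Charset assumed) {
        return new String(data, assumed).getBytes(assumed);
    }

    public static void main(String[] args) {
        // "Grüße" encoded as UTF-8; the ü and ß each take two bytes.
        byte[] original = "Gr\u00fc\u00dfe".getBytes(StandardCharsets.UTF_8);

        // Treating the bytes as US-ASCII replaces every non-ASCII byte.
        byte[] damaged = roundTripAsText(original, StandardCharsets.US_ASCII);

        // Copying the bytes untouched preserves the data exactly.
        byte[] copied = Arrays.copyOf(original, original.length);

        System.out.println("byte copy intact: " + Arrays.equals(original, copied));
        System.out.println("text round trip intact: " + Arrays.equals(original, damaged));
    }
}
```

          The point of the sketch is only that any step which decodes and re-encodes the data can change it, whereas a step that moves bytes cannot.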
          • 2. Re: Unicode File movement from Windows to Unix adding Special Characters
            handat
            When you transfer your files using WinSCP, make sure you use the binary option instead of text. Text means ASCII characters, so your Unicode gets mangled.
            • 3. Re: Unicode File movement from Windows to Unix adding Special Characters
              836688
              DrClap,
              I agree that unaltered byte transfer should be possible. It's just that I can't figure out what other changes would be needed in the code to achieve this, which is the primary reason for my post. By the way, this is an existing piece of code which I have started working on.

              I read that MultipartRequest and Apache FileUpload are the two common APIs used to perform file uploads in Java. So I also tried the Apache API, but the result was exactly the same.

              Hence I suspect that some encoding-related setting is missing in either the JSP or the servlet code. Below is my code snippet. Please suggest.

              -- JSP
              <meta http-equiv="Content-Language" content="en-us">
              <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
              <meta http-equiv="Expires" content="Tue, 20 Aug 1996 14:25:27 GMT">
              <meta http-equiv="Cache-Control" content="no-cache">

              <script language="javascript" src="../js/stylesheet.js"></script>
              <script language="JavaScript" src="../js/datePicker.js"></script>
              <script language="JavaScript" src="../js/validate.js"></script>

              </head>

              <body>

              <form ENCTYPE="multipart/form-data" name="frmUpload" method="POST" action="<%= request.getContextPath() %>/servlet/TestServlet">

              Select File: <input type="file" name="file" size="30">
              <input value="Upload" name="cmdUpload" type="submit">

              -- Servlet
              public void doPost(HttpServletRequest request, HttpServletResponse response)
                      throws ServletException, IOException
              {
                  request.setCharacterEncoding("UTF-8");

                  MultipartRequest multi = new MultipartRequest(request,
                                                                "/tmp",
                                                                20000000,
                                                                "UTF-8");
              }

              Let me know if you have any suggestions.
              • 4. Re: Unicode File movement from Windows to Unix adding Special Characters
                jschellSomeoneStoleMyAlias
                user10307445 wrote:
                Hi,
                In our application, a Unicode file containing German and Japanese characters is submitted and moved to a Unix directory using the MultipartRequest Java API. Oracle PL/SQL later processes the file and makes entries in the database.

                We have observed that this load fails because the file contains some extra special characters after it is transferred to Unix. The file is untouched if it contains only English characters. To confirm this, we created a file containing German/Japanese characters directly on Unix and called the Oracle stored procedure, and it worked fine. When this same file was moved back to Windows using WinSCP, the file was different again.

                So overall it looks like moving a Unicode file between Windows and Unix changes the file in some way, for some reason. Please let me know if any Java API can avoid this issue.

                I searched the Net for close to a week but couldn't find anything related. Any help will be greatly appreciated.

                If we can't find a solution, we are considering using POI so that Java can update the database directly.
                In the above I don't see any mention that you actually verified that the content is correct on the originating machine.

                It is of course pointless to look at FTP as the problem source if the original production is the problem.
                • 5. Re: Unicode File movement from Windows to Unix adding Special Characters
                  836688
                  Hi,
                  The source was always verified; I forgot to mention that. Since it is a file that we create and then submit from the front end, it is implicit that the source contains the right content.

                  As mentioned before, the source file is a Unicode Text (.csv) file. It is opened using MS Excel and edited.

                  Please let me know if you have more questions.
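
                  If the file is what Excel saves as "Unicode Text", it is most likely UTF-16 with a byte order mark rather than UTF-8. Assuming the database load expects UTF-8 (nothing in the thread confirms this), one option would be to convert the file after upload. The following is a rough sketch under those assumptions; the class name and command-line usage are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original application.
class Utf16ToUtf8 {

    // Decodes the input as UTF-16 (a leading byte order mark selects the
    // byte order and is consumed) and re-encodes the text as UTF-8.
    static void transcode(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[8192];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, n);
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: java Utf16ToUtf8.java <utf16-input> <utf8-output>
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            transcode(in, out);
        }
    }
}
```

                  Unlike the upload itself, this step deliberately decodes characters, which is safe only because the source encoding is stated explicitly instead of being left to the platform default.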
                  • 6. Re: Unicode File movement from Windows to Unix adding Special Characters
                    abillconsl
                    handat wrote:
                    When you transfer your files using winscp make sure you use binary option instead of text. Text means ASCII characters so your unicode get mangled.
                    If you go here (http://winscp.net/eng/docs/ui_pref_transfer), it tells you that txt, html, etc. files should be transferred as text. Always transfer text in text mode.

                    But that is likely not the problem here.

                    Edited by: abillconsl on Feb 10, 2011 4:25 PM
                    • 7. Re: Unicode File movement from Windows to Unix adding Special Characters
                      abillconsl
                      user10307445 wrote:
                      We have observed that this load fails because the file contains some extra special characters after it is transferred to Unix. The file is untouched if it contains only English characters. To confirm this, we created a file containing German/Japanese characters directly on Unix and called the Oracle stored procedure, and it worked fine.
                      You sure it's not what you're viewing the file with that is the culprit?
                      • 8. Re: Unicode File movement from Windows to Unix adding Special Characters
                        836688
                        Yes, I am pretty sure. I am viewing the file using PuTTY.

                        When the file is transferred using the Java upload, it looks completely different: the foreign characters look totally garbled, and some characters (like \377, \366) are consistently added at the beginning of the file and in some other places.

                        When the file is created using vi in PuTTY and the contents are copied from another file within PuTTY, the file looks OK. This file is processed correctly when fed to the Oracle stored procedure; the earlier file is not.

                        Hope this helps.
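
                        Bytes shown as octal escapes at the very start of a file are worth checking against the known byte order marks: \377 \376 (0xFF 0xFE) is the UTF-16 little-endian mark. Whether that is what appears here is an assumption, since the second value quoted above is \366. A small sketch for checking the first bytes of a file follows; the class and method names are hypothetical.

```java
// Hypothetical helper for inspecting the first bytes of an uploaded file.
class BomSniffer {

    // Returns the encoding implied by a leading byte order mark, or "none".
    // UTF-32 marks are not checked in this sketch.
    static String detectBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return "none";
    }

    public static void main(String[] args) {
        // Octal 0377 0376 is the same as hex FF FE, followed here by "H" in UTF-16LE.
        byte[] head = {(byte) 0377, (byte) 0376, 0x48, 0x00};
        System.out.println(detectBom(head));
    }
}
```

                        If the uploaded file starts with such a mark, the transfer may have preserved the file exactly, and the difference from the vi-created file would then lie in the encoding of the source file itself.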
                        • 9. Re: Unicode File movement from Windows to Unix adding Special Characters
                          EJP
                          If you go here (http://winscp.net/eng/docs/ui_pref_transfer), it tells you that txt, html, etc. files should be transferred as text.
                          No it doesn't. That page doesn't 'tell you' any such thing. It just shows a dialog box that has that setting as an option.
                          Always transfer text in text mode.
                          Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations.
                          • 10. Re: Unicode File movement from Windows to Unix adding Special Characters
                            805622
                            Hi, you can work directly with the request's input stream: read the input stream as bytes and write it out as bytes. These are some of the steps to do what you want. I hope it helps.


                            ServletInputStream sis = request.getInputStream();
                            String linea = sis.readLine(byte[] b, int off, int len)
                            //remember to set the length limit because when you get to the file it doesn't have a line break until you reach the end of the file

                            //Remove the header lines of the multi-part
                            System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length)

                            RandomAccessFile raf = new RandomAccessFile("fileName", "rw");
                            raf.write(dest, 0, offset -2);
                            raf.close();
                            • 11. Re: Unicode File movement from Windows to Unix adding Special Characters
                              EJP
                              String linea = sis.readLine(byte[] b, int off, int len)
                              That won't solve the problem. If there is binary data present this step will corrupt it. And why read lines when you could just read into a byte array?
                              //remember to set the length limit because when you get to the file it doesn't have a line break until you reach the end of the file
                              Remember to set the length limit because it is required by the API and you don't want to overflow the buffer. It has nothing to do with the behaviour at EOS whatsoever.
                              //Remove the header lines of the multi-part
                              System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length)
                              There is no connection between this statement and the previous code. In fact it isn't a statement, it is a method signature. What you read was a String. System.arraycopy() doesn't operate on Strings. Of course if you had read into a byte[] as you should have, this might make some sense.
                              raf.write(dest, 0, offset -2);
                              That makes no sense either. Why would you pass an offset value as the length parameter?

                              Having said all that, the kernel of this suggestion, to use Streams instead of Readers, is sound.
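
                              A minimal sketch of that kernel, copying a stream through a byte array without ever decoding characters, might look like the following. The class name is hypothetical, and the sketch deliberately leaves out parsing of the multipart boundaries, which an upload library would still have to handle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical helper, not taken from the thread.
class ByteCopy {

    // Copies every byte from in to out unchanged and returns the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            // Write only the bytes actually read on this pass.
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Sample data with a UTF-16LE byte order mark, "Gü" and a CRLF pair.
        byte[] sample = {(byte) 0xFF, (byte) 0xFE, 'G', 0, (byte) 0xFC, 0, '\r', 0, '\n', 0};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(sample), sink);
        System.out.println(copied + " bytes copied, identical: "
                + Arrays.equals(sample, sink.toByteArray()));
    }
}
```

                              The length passed to write() is the count returned by read(), not an offset, which addresses the last objection above.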
                              • 12. Re: Unicode File movement from Windows to Unix adding Special Characters
                                abillconsl
                                EJP wrote:
                                If you go here (http://winscp.net/eng/docs/ui_pref_transfer), it tells you that txt, html, etc. files should be transferred as text.
                                No it doesn't. That page doesn't 'tell you' any such thing. It just shows a dialog box that has that setting as an option.
                                You're right technically; it does NOT "say" or "state" it. It does show a group of radio buttons indicating "Transfer Mode" choices, with labels that specify Text (plain text, html ...). So I took the liberty of paraphrasing! I don't see any other way of interpreting that other than "it tells you ... ", because it bloody well IS telling you that.
                                Always transfer text in text mode.
                                Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations.
                                Rubbish. You're technically right again. Most of us transfer files all the time back and forth from Unix and Windows and in my case the vast majority of these are text based files. If I transfer them in binary mode they get messed up. You said that . So the only times I use binary mode is for binary files that are not some kind of plain text, such as images, class and executable files, etc.

                                It seems clear to me that this OP knows what kind of file is being transferred here because the operation starts in Windows, which almost always means there is an extension to the file.

                                Edited by: abillconsl on Feb 11, 2011 9:46 AM
                                • 13. Re: Unicode File movement from Windows to Unix adding Special Characters
                                  YoungWinston
                                  Raghu wrote:
                                  In our application, a Unicode file containing German and Japanese characters is submitted and moved to a Unix directory using the MultipartRequest Java API. Oracle PL/SQL later processes the file and makes entries in the database.
                                  I'm no expert in this field, but have you ever thought of simply putting these files on a Unix Samba share? Nothing to do with Java, I know, but then both sides could access the file without the need for a "transfer".

                                  Winston
                                  • 14. Re: Unicode File movement from Windows to Unix adding Special Characters
                                    EJP
                                    You're right technically
                                    In other words, right.
                                    it bloody well IS telling you that.
                                    No it isn't. You (the user) are telling it how to transfer the files. There is no 'should' about it. It is giving you the option, in accordance with the suggestion I made.

                                    You are also contradicting yourself here.
                                    If I transfer them in binary mode they get messed up. You said that .
                                    I said what? The part you crossed out? No.

                                    And what's with all this crossing out stuff? Erase key stuck again? Or some special semantics we don't know about? If you don't want to post it, don't post it. Or is the crossing-out part what you wanted to say but don't want to be held responsible for? Like the prior 'rubbish'? It doesn't work like that. You post it, you said it.

                                    Conversely I didn't post anything you've quoted here so I didn't say it. Got it?
                                    It seems clear to me that this OP knows what kind of file is being transferred here because the operation starts in Windows, which almost always means there is an extension to the file.
                                    Good, so he can apply the rule I gave.