20 Replies. Latest reply: Feb 15, 2011 8:48 AM by abillconsl
      • 15. Re: Unicode File movement from Windows to Unix adding Special Characters
        796440
        EJP wrote:
        Always transfer text in text mode.
        Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations.
        And even then, only use text mode for certain "text" files.

        The OP mentions Unicode in his subject. I'm a far cry from a Unicode expert, but it seems likely to me that \r and \n could show up as one byte of a multibyte character. If the text mode in question is a traditional ASCII text handler, without Unicode knowledge, it will blindly convert between \r, \n, and \r\n, depending on the source and target platforms. Certainly not what one would want in this case.

        Additionally, the actual need for text mode is becoming more and more rare as text viewing and editing tools on various platforms are generally able to handle any of the common line-end conventions.
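The concern above about \r and \n appearing as one byte of a multibyte character can be checked directly. The following is a minimal sketch, not anything posted in the thread: the class and method names are hypothetical, and the sample character U+300D (the Japanese right corner bracket) was picked only because its code point happens to contain the byte 0x0D. It suggests that the risk is real for UTF-16 but not for UTF-8, where every byte of a multibyte sequence is 0x80 or higher.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class NewlineBytesDemo {

    /** Returns true if any byte of the encoded text equals CR (0x0D) or LF (0x0A). */
    static boolean containsCrOrLfByte(String text, Charset charset) {
        for (byte b : text.getBytes(charset)) {
            if (b == 0x0D || b == 0x0A) {
                return true;
            }
        }
        return false;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+300D, a common Japanese punctuation mark; contains no real newline.
        String bracket = "\u300D";

        System.out.println("UTF-16LE: " + toHex(bracket.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println("UTF-16BE: " + toHex(bracket.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-8:    " + toHex(bracket.getBytes(StandardCharsets.UTF_8)));

        // A byte-oriented text-mode transfer would see a CR in the UTF-16 forms only.
        System.out.println("CR/LF byte in UTF-16LE? "
                + containsCrOrLfByte(bracket, StandardCharsets.UTF_16LE));
        System.out.println("CR/LF byte in UTF-8?    "
                + containsCrOrLfByte(bracket, StandardCharsets.UTF_8));
    }
}
```

If the file in question is UTF-16, this is one plausible way an ASCII-mode transfer could alter it even when the text itself has no line breaks at that position.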
        • 16. Re: Unicode File movement from Windows to Unix adding Special Characters
          jschellSomeoneStoleMyAlias
          Raghu wrote:
          Hi,
          In our application a Unicode file with German and Japanese characters is submitted, which is moved to a Unix directory by using the MultipartRequest JAVA API. Later Oracle PL/SQL processes the file and makes entries in the database.

          We have observed that this load fails because the file contains some special characters after it is transferred to Unix. The file is untouched if it contains only English characters. To confirm this, we created a file containing German/Japanese characters directly on Unix and called the Oracle stored procedure, and it worked fine. When this same file was moved back to Windows using WinSCP, the file was different again.

          Hence overall it looks like Unicode file movement between Windows and Unix changes the file in some way for some reason. Please let me know if any JAVA API can avoid this issue.

          I scanned the Net for close to a week but couldn't find anything related. Any help will be greatly appreciated.
          The real problem is that you don't actually know what is happening to the file.

          You say you have a file in "unicode". That means nothing.

          You have a file which has a specific character set which represents unicode.

          The character set has a very specific byte representation and you have not looked into exactly how the byte representation is being changed. You know it is changed but not how.

          So you need to
          1. Figure out exactly which character set you are using, certainly to the extent of whether it is variable byte or fixed byte and if the latter what the size and byte order is
          2. You need a hex editor - something that displays the bytes of a file in hex code.

          With those you then open the two files using the hex editor and determine exactly what is different. For a file with Japanese, probably the first 16 bytes would be sufficient to determine some difference. However, you should also look at the end-of-line characters.
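If no hex editor is at hand, the same comparison can be approximated in Java. This is a rough sketch under stated assumptions: the class name, method names, and command-line arguments are hypothetical, and the byte-order-mark check is only a heuristic (a file without a BOM can still be UTF-8 or UTF-16, and FF FE can also begin a UTF-32LE file). It prints the first 16 bytes of each file in hex, a BOM-based guess at the encoding, and the offset of the first differing byte.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class HexCompare {

    /** Formats up to count leading bytes as space-separated hex pairs. */
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(count, data.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Heuristic only: names the encoding suggested by a leading byte order mark. */
    static String guessBom(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB
                && (d[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return "none";
    }

    /** Returns the offset of the first differing byte, or -1 if the arrays are identical. */
    static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // If one array is a prefix of the other, they differ where the shorter one ends.
        return a.length == b.length ? -1 : n;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java HexCompare <original-file> <transferred-file>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] transferred = Files.readAllBytes(Paths.get(args[1]));

        System.out.println("original:    " + hexPrefix(original, 16)
                + "  (BOM: " + guessBom(original) + ")");
        System.out.println("transferred: " + hexPrefix(transferred, 16)
                + "  (BOM: " + guessBom(transferred) + ")");

        int offset = firstDifference(original, transferred);
        System.out.println(offset < 0
                ? "files are byte-for-byte identical"
                : "first difference at byte offset " + offset);
    }
}
```

Reading both files with Files.readAllBytes keeps the comparison at the byte level, so no character decoding can hide or introduce a difference.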
          Windows using WinSCP, the file was different again.
          The following certainly seems to say that WinSCP doesn't support 16-bit Unicode at all.

          http://winscp.net/tracker/show_bug.cgi?id=521
          • 17. Re: Unicode File movement from Windows to Unix adding Special Characters
            abillconsl
            EJP wrote:
            You're right technically
            In other words, right.
            No, in other words, right, it does not come out and tell you that as an instruction. It does provide it as a hint on the label; as in "Text Mode (.txt, .html, ... )". You and I disagree on that, that much is clear.
            it bloody well IS telling you that.
            No it isn't. You (the user) are telling +it+ how to transfer the files. There is no 'should' about it. It is giving you the +option,+ in accordance with the suggestion I made.
            Pissing contest ... I'm too old to win that. or to care if I do.

            You are also contradicting yourself here.
            No, I'm not. I was just being honest. I wrote that, and rather than erase it - which I certainly could have done - I left it in so it would be evident that I realized, upon re-reading what I wrote and your response, I needed to correct myself - that's it.

            --If I transfer them in binary mode they get messed up.-- _You said that_ .
            I said what? The part you crossed out? No.
            You said: "Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations."

            If you "need" the newline transformations and don't get them, the files will "get messed up".
            And what's with all this crossing out stuff? Erase key stuck again? Or some special semantics we don't know about? If you don't want to post it, don't post it. Or is the crossing-out part what you wanted to say but don't want to be held responsible for? Like the prior 'rubbish'? It doesn't work like that. You post it, you said it.
            Exactly why I crossed it out - I said it and then changed my mind and corrected myself. You don't like that? I could care less.

            What I was in the end trying to say was that I don't disagree with most of what you say, but do disagree in your dogmatic approach. I realize that what I said was dogmatic as well. Saying that you should FTP/SFTP in binary unless you know it's text based I don't disagree with. However, as I said, based on my own experience I usually end up using ASCII because I mostly FTP text based stuff. Also, that the OP ought to know what kind of file he's moving and therefore should not need to guess.
            Conversely I +didn't+ post anything you've quoted here so I +didn't+ say it. Got it?
            It seems clear to me that this OP knows what kind of file is being transferred here because the operation starts in Windows, which almost always means there is an extension to the file.
            Good, so he can apply the rule I gave.
            Edited by: abillconsl on Feb 14, 2011 10:26 AM
            • 18. Re: Unicode File movement from Windows to Unix adding Special Characters
              abillconsl
              I just did the following. See if any of it mirrors what you are doing or want to do.

              I created a simple Oracle table that holds just one column - varchar2(50).

              In Windows XP:
              Using an AWT TextArea, I pasted in some Hebrew text I had copied from a web site. Then in the back end, after the program - using getText() - obtained the UTF8 String - it inserted this text into the table. So in other words, I loaded the DB table from a Java app in the WinXP environment.

              Back over on Unix:
              I wrote a Q&D non GUI app that connects to the same DB table and selects the text and writes it out to a file on Unix using BufferedWriter.

              Using Secure Shell:
              I SFTP this file back over to WinXP, first in binary mode, and then again in ASCII mode.

              Finally I opened the text file in MS Word, which asked me to select an encoding; so in 'Other encoding' I chose Unicode (UTF-8).

              It didn't matter which way I transferred the file - binary or ASCII - it opened the file to display the exact character string I had pasted into the TextArea at the start of this exercise, as desired.
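The experiment above does not say which charset the BufferedWriter used. A writer created without an explicit charset falls back to the platform default, so the result can vary from machine to machine. The sketch below is an illustration under that assumption, not the code from the experiment: the class name, method name, and sample Hebrew word are hypothetical, and it writes to an in-memory stream rather than a file to stay self-contained. It shows the same text surviving a UTF-8 writer and being reduced to question marks by a US-ASCII one.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /** Writes text through a BufferedWriter using the given charset and returns the bytes produced. */
    static byte[] writeToBytes(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedWriter writer =
                new BufferedWriter(new OutputStreamWriter(out, charset))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) before the bytes are read.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Four Hebrew letters (shin, lamed, vav, final mem); sample data only.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD";

        byte[] utf8 = writeToBytes(hebrew, StandardCharsets.UTF_8);
        byte[] ascii = writeToBytes(hebrew, StandardCharsets.US_ASCII);

        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes written:      " + utf8.length);
        System.out.println("UTF-8 round trip intact:  "
                + new String(utf8, StandardCharsets.UTF_8).equals(hebrew));
        System.out.println("US-ASCII output:          "
                + new String(ascii, StandardCharsets.US_ASCII));
    }
}
```

For a real file, Files.newBufferedWriter(path, StandardCharsets.UTF_8) takes the charset the same way, which removes the dependence on the platform default.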
              • 19. Re: Unicode File movement from Windows to Unix adding Special Characters
                EJP
                What you so elegantly describe as a 'pissing contest' is in reality a discussion about a matter of fact.
                You said: "Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations."

                If you "need" the newline transformations and don't get them, the files will "get messed up".
                A file will not get 'messed up' by any binary transfer, and, contrary to your assertion, nowhere did I say that it would. Don't put words into my mouth. Not getting newline transformations because you didn't ask for them does not constitute 'messing up'.
                why I crossed it out - I said it and then changed my mind and corrected myself. You don't like that? I could care less.
                Your actual degree of caring less is illustrated by the care you have taken to post a reply here. Nobody is interested in your errors and retractions. Just post what you mean to say, like everybody else. Otherwise you just run a needless risk of being misunderstood.
                disagree in your dogmatic approach. I realize that what I said was dogmatic as well.
                Exactly. It's OK when you do it but not when I do it?

                Give us a break, please.
                • 20. Re: Unicode File movement from Windows to Unix adding Special Characters
                  abillconsl
                  EJP wrote:
                  What you so elegantly describe as a 'pissing contest' is in reality a discussion about a matter of fact.
                  No, it is a fruitless argument revolving solely around semantics.
                  You said: "Always transfer all files in binary mode unless you know that you have a text file and you need the newline transformations."

                  If you "need" the newline transformations and don't get them, the files will "get messed up".
                  A file will not get 'messed up' by any binary transfer, and, contrary to your assertion, nowhere did I say that it would. Don't put words into my mouth. Not getting newline transformations because you didn't ask for them does not constitute 'messing up'.
                  I am not putting words in your mouth. I am constituting a lack of proper end of line characters "when they are needed" as being "messed up". Granted, "messed up" is hardly eloquent. However, I have heard far less eloquent versions of "this is messed up", when someone gets the file after it was transferred improperly. You don't consider that "messed up". Fine, I do.
                  why I crossed it out - I said it and then changed my mind and corrected myself. You don't like that? I could care less.
                  Your actual degree of caring less is illustrated by the care you have taken to post a reply here. Nobody is interested in your errors and retractions. Just post what you mean to say, like everybody else. Otherwise you just run a needless risk of being misunderstood.
                  No, my degree of not caring relates to whether or not you agree with me. I've read many give-and-take arguments and debates here on these and the former Sun forums, and many times I've seen others - and perhaps myself - write that they could care less. Yet the debate continued. In most cases I thought I understood (and understand) why ... and it's not always for the same reason, apparently. Posting a rebuttal does not necessarily constitute caring what the other fella thinks; oftentimes it simply qualifies as a disagreement.
                  disagree in your dogmatic approach. I realize that what I said was dogmatic as well.
                  Exactly. It's OK when you do it but not when I do it?
                  So you agree you did/do it then, huh? You agree you did and I will agree that it's okay.
                  Give us a break, please.
                  You have been assigned as the spokesperson for "us", huh. Okaaaaaaaaaaaay.