9 Replies. Latest reply: Dec 8, 2011 10:46 AM by DrClap

UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)

903940 Newbie
Hi there,


I wrote an application that encrypts text files, and I want to run it on several operating systems. The text processing looks like this:

JTextArea -> String -> base64 (with UTF-16BE or UTF-32BE) -> DES/MD5/other -> bytes array -> file

The problem is that when I move the encrypted file to another OS and open it there, I get invalid symbols instead of the proper characters.

For example:
[Windows 7 x64 (everything OK)|http://ifotos.pl/zobacz/Windows7-_rhpxhpq.jpg/] -> [Windows XP SP3 x32/x64 (almost all special characters are invalid)|http://ifotos.pl/zobacz/WindowsXP_rhpxhrq.JPG/]

Windows 7 x64 -> Linux (ubuntu) (some characters are invalid, but not many of them)

Windows 7 x64 -> MacOS (Leopard) (everything is OK)


I think the problem involves character encoding. I don't know why, but under Windows XP almost all special characters are invalid. Can someone tell me why, and how to fix it?

I checked the default charset of the Java virtual machine under Windows XP SP3 x32, using Charset.defaultCharset() and Charset.availableCharsets().

Default charset is: windows-1250


Available charsets are: 

{Big5=Big5, Big5-HKSCS=Big5-HKSCS, EUC-JP=EUC-JP, EUC-KR=EUC-KR, GB18030=GB18030, GB2312=GB2312,
 GBK=GBK, IBM-Thai=IBM-Thai, IBM00858=IBM00858, IBM01140=IBM01140, IBM01141=IBM01141, 
IBM01142=IBM01142, IBM01143=IBM01143, IBM01144=IBM01144, IBM01145=IBM01145, IBM01146=IBM01146,
 IBM01147=IBM01147, IBM01148=IBM01148, IBM01149=IBM01149, IBM037=IBM037, IBM1026=IBM1026, 
IBM1047=IBM1047, IBM273=IBM273, IBM277=IBM277, IBM278=IBM278, IBM280=IBM280, IBM284=IBM284, 
IBM285=IBM285, IBM297=IBM297, IBM420=IBM420, IBM424=IBM424, IBM437=IBM437, IBM500=IBM500, 
IBM775=IBM775, IBM850=IBM850, IBM852=IBM852, IBM855=IBM855, IBM857=IBM857, IBM860=IBM860, 
IBM861=IBM861, IBM862=IBM862, IBM863=IBM863, IBM864=IBM864, IBM865=IBM865, IBM866=IBM866, 
IBM868=IBM868, IBM869=IBM869, IBM870=IBM870, IBM871=IBM871, IBM918=IBM918, ISO-2022-CN=ISO-2022-CN,
 ISO-2022-JP=ISO-2022-JP, ISO-2022-JP-2=ISO-2022-JP-2, ISO-2022-KR=ISO-2022-KR, ISO-8859-1=ISO-8859-1,
 ISO-8859-13=ISO-8859-13, ISO-8859-15=ISO-8859-15, ISO-8859-2=ISO-8859-2, ISO-8859-3=ISO-8859-3,
 ISO-8859-4=ISO-8859-4, ISO-8859-5=ISO-8859-5, ISO-8859-6=ISO-8859-6, ISO-8859-7=ISO-8859-7, 
ISO-8859-8=ISO-8859-8, ISO-8859-9=ISO-8859-9, JIS_X0201=JIS_X0201, JIS_X0212-1990=JIS_X0212-1990,
 KOI8-R=KOI8-R, KOI8-U=KOI8-U, Shift_JIS=Shift_JIS, TIS-620=TIS-620, US-ASCII=US-ASCII, UTF-16=UTF-16,
 UTF-16BE=UTF-16BE, UTF-16LE=UTF-16LE, UTF-32=UTF-32, UTF-32BE=UTF-32BE, UTF-32LE=UTF-32LE,
 UTF-8=UTF-8, windows-1250=windows-1250, windows-1251=windows-1251, windows-1252=windows-1252,
 windows-1253=windows-1253, windows-1254=windows-1254, windows-1255=windows-1255, windows-1256=windows-1256,
 windows-1257=windows-1257, windows-1258=windows-1258, windows-31j=windows-31j, x-Big5-Solaris=x-Big5-Solaris,
 x-euc-jp-linux=x-euc-jp-linux, x-EUC-TW=x-EUC-TW, x-eucJP-Open=x-eucJP-Open, x-IBM1006=x-IBM1006,
 x-IBM1025=x-IBM1025, x-IBM1046=x-IBM1046, x-IBM1097=x-IBM1097, x-IBM1098=x-IBM1098,
 x-IBM1112=x-IBM1112, x-IBM1122=x-IBM1122, x-IBM1123=x-IBM1123, x-IBM1124=x-IBM1124, 
x-IBM1381=x-IBM1381, x-IBM1383=x-IBM1383, x-IBM33722=x-IBM33722, x-IBM737=x-IBM737, 
x-IBM834=x-IBM834, x-IBM856=x-IBM856, x-IBM874=x-IBM874, x-IBM875=x-IBM875, x-IBM921=x-IBM921, 
x-IBM922=x-IBM922, x-IBM930=x-IBM930, x-IBM933=x-IBM933, x-IBM935=x-IBM935, x-IBM937=x-IBM937, 
x-IBM939=x-IBM939, x-IBM942=x-IBM942, x-IBM942C=x-IBM942C, x-IBM943=x-IBM943, x-IBM943C=x-IBM943C,
 x-IBM948=x-IBM948, x-IBM949=x-IBM949, x-IBM949C=x-IBM949C, x-IBM950=x-IBM950, x-IBM964=x-IBM964,
 x-IBM970=x-IBM970, x-ISCII91=x-ISCII91, x-ISO-2022-CN-CNS=x-ISO-2022-CN-CNS, 
x-ISO-2022-CN-GB=x-ISO-2022-CN-GB, x-iso-8859-11=x-iso-8859-11, x-JIS0208=x-JIS0208, 
x-JISAutoDetect=x-JISAutoDetect, x-Johab=x-Johab, x-MacArabic=x-MacArabic, x-MacCentralEurope=x-MacCentralEurope, 
x-MacCroatian=x-MacCroatian, x-MacCyrillic=x-MacCyrillic, x-MacDingbat=x-MacDingbat, x-MacGreek=x-MacGreek, 
x-MacHebrew=x-MacHebrew, x-MacIceland=x-MacIceland, x-MacRoman=x-MacRoman, x-MacRomania=x-MacRomania, 
x-MacSymbol=x-MacSymbol, x-MacThai=x-MacThai, x-MacTurkish=x-MacTurkish, x-MacUkraine=x-MacUkraine, 
x-MS932_0213=x-MS932_0213, x-MS950-HKSCS=x-MS950-HKSCS, x-mswin-936=x-mswin-936, x-PCK=x-PCK, 
x-SJIS_0213=x-SJIS_0213, x-UTF-16LE-BOM=x-UTF-16LE-BOM, X-UTF-32BE-BOM=X-UTF-32BE-BOM, 
X-UTF-32LE-BOM=X-UTF-32LE-BOM, x-windows-50220=x-windows-50220, x-windows-50221=x-windows-50221, 
x-windows-874=x-windows-874, x-windows-949=x-windows-949, x-windows-950=x-windows-950, 
x-windows-iso2022jp=x-windows-iso2022jp}
UTF-16 and UTF-32 are on the list.

Edited by: 900937 on 2011-12-06 08:24
  • 1. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    sabre150 Expert
    This seems wrong!

    String -> base64 (with UTF-16BE or UTF-32BE) -> DES/MD5/other

    I would have expected :-

    String -> bytes (using some encoding say utf-8) -> encryption(ciphertext) -> base64 encoding -> file.

    There is certainly no need to base64-encode the String and then convert it to UTF-16BE or UTF-32BE, and the transformation String -> base64 (with UTF-16BE or UTF-32BE) seems to be nonsense. I would love to see the code that does it. If you are just writing the ciphertext to a file, then the base64 encoding step is not needed; you can write the ciphertext bytes straight to the file.

    I don't understand what you mean by DES/MD5/other. DES is a deprecated encryption algorithm, MD5 is a digest and 'other' is meaningless.
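    The expected flow above (String -> bytes -> encryption -> optional base64 -> file) could be sketched roughly as follows. This is only a minimal illustration, not the poster's code: the class name TextCrypto, the fixed salt, and the iteration count are hypothetical, and PBEWithMD5AndDES is used only because it is the algorithm named later in this thread, not because it is a good choice.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.PBEParameterSpec;

public class TextCrypto {
    // Hypothetical parameters for illustration; a real application
    // would use a random salt and a stronger algorithm.
    private static final String ALGORITHM = "PBEWithMD5AndDES";
    private static final byte[] SALT = {1, 2, 3, 4, 5, 6, 7, 8};
    private static final int ITERATIONS = 1000;

    // Runs the password-based cipher in the given mode over the input bytes.
    private static byte[] run(int mode, byte[] input, char[] password) {
        try {
            SecretKey key = SecretKeyFactory.getInstance(ALGORITHM)
                    .generateSecret(new PBEKeySpec(password));
            Cipher cipher = Cipher.getInstance(ALGORITHM);
            cipher.init(mode, key, new PBEParameterSpec(SALT, ITERATIONS));
            return cipher.doFinal(input);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // String -> bytes (explicit UTF-8) -> ciphertext bytes
    static byte[] encrypt(String plainText, char[] password) {
        return run(Cipher.ENCRYPT_MODE,
                plainText.getBytes(StandardCharsets.UTF_8), password);
    }

    // ciphertext bytes -> plain bytes -> String (same explicit charset)
    static String decrypt(byte[] cipherText, char[] password) {
        return new String(run(Cipher.DECRYPT_MODE, cipherText, password),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        char[] password = "secret".toCharArray();
        String original = "za\u017C\u00F3\u0142\u0107 \u263A"; // Polish letters plus a smiley
        byte[] cipherText = encrypt(original, password);
        // Base64 is only needed if the ciphertext has to travel as text.
        System.out.println(Base64.getEncoder().encodeToString(cipherText));
        System.out.println(decrypt(cipherText, password).equals(original));
    }
}
```

    The ciphertext bytes can be written to the file as they are; the base64 step in main is there only to make them printable.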
  • 2. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    903940 Newbie
    Well, you're right. I fixed it, and now the flow is:

    JTextArea -> encryption -> file

    "DES/MD5/other" means some cipher algorithm, in this case "PBEWithMD5AndDES",

    but that's not the problem, because I'm going to use another algorithm in the future.

    The problem is that when I open the encrypted file under another system (in this case Windows XP Professional SP3 PL), I get several invalid characters.

    Please look at the screenshot with invalid characters:
    "Windows XP sp3 x32 utf-16BE"
    http://zapodaj.net/bf4d75bdca0a.jpg.html

    Here is a screenshot with all characters correct:
    "Windows 7 Ultimate sp1 x64 utf-16BE"
    http://zapodaj.net/29db17684a5a.jpg.html


    Maybe the problem is not with Java, the encoding, or my application. Maybe it is a font problem or something else?
  • 3. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    DrClap Expert
    I haven't followed your links. However if you are seeing rectangular boxes, that means your font can't render the character. If you are seeing question marks, that means the data went through an encoding phase which could not encode the character.
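    The question-mark case can be reproduced with a short sketch like the one below. The class and method names are made up for illustration, and ISO-8859-1 stands in for any charset that lacks the characters in question (it is not the charset from this thread). The rectangular-box case cannot be shown this way, because there the String is intact and only the font lacks a glyph.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Encodes the text with the given charset and decodes it again with the same one.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "\u0142\u263A"; // Polish l-with-stroke and a smiley

        // ISO-8859-1 cannot encode either character, so getBytes()
        // silently substitutes '?' for each of them.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));

        // UTF-16BE can encode both, so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_16BE).equals(text));
    }
}
```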
  • 4. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    sabre150 Expert
    900937 wrote:
    > Well, you're right. I fixed it, and now the flow is:
    >
    > JTextArea -> encryption -> file
    >
    > "DES/MD5/other" means some cipher algorithm, in this case "PBEWithMD5AndDES",
    >
    > but that's not the problem, because I'm going to use another algorithm in the future.
    >
    > The problem is that when I open the encrypted file under another system (in this case Windows XP Professional SP3 PL), I get several invalid characters.

    You still have not shown any code, so we are guessing.

    > Please look at the screenshot with invalid characters:
    > "Windows XP sp3 x32 utf-16BE"
    > http://zapodaj.net/bf4d75bdca0a.jpg.html

    Irrelevant if we cannot see what Java code was used to do the encryption and decryption.

    > Here is a screenshot with all characters correct:
    > "Windows 7 Ultimate sp1 x64 utf-16BE"
    > http://zapodaj.net/29db17684a5a.jpg.html
    >
    > Maybe the problem is not with Java, the encoding, or my application. Maybe it is a font problem or something else?

    DrClap probably has the correct answer, but without seeing your code and without knowing exactly how you are displaying the result, it is difficult to be certain.
  • 5. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    DrClap Expert
    My bet is that when we see the code it's going to be something basic like converting the chars (from the text field) to bytes (which are what the encryption wants) using the system's default charset.
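    For illustration, here is roughly what that kind of mistake looks like. This is a sketch with made-up names, and the two "machines" are simulated by two explicit charsets rather than by real platform defaults: text encoded with one charset and decoded with another does not come back intact, which is what happens when both sides call getBytes() and new String(bytes) without naming a charset and their defaults differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetPitfall {
    // True if text written with one charset is read back intact with another.
    static boolean survives(String text, Charset writer, Charset reader) {
        byte[] bytes = text.getBytes(writer);
        return new String(bytes, reader).equals(text);
    }

    public static void main(String[] args) {
        // text.getBytes() with no argument uses this value, which differs per machine.
        System.out.println("Platform default: " + Charset.defaultCharset());

        String text = "\u0142"; // Polish l-with-stroke
        // Same explicit charset on both sides: safe on every platform.
        System.out.println(survives(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Different charsets on each side: the character is corrupted.
        System.out.println(survives(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```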
  • 6. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    903940 Newbie
    [s'il vous plaît|http://en.wiktionary.org/wiki/s%27il_vous_pla%C3%AEt] :
        public static String TEXT_ENCODING = "UTF-16BE";

        public static byte[] encryptingString(String plain_text) throws Exception {
            return encryptingBytes(plain_text.getBytes(TEXT_ENCODING));
        }

        private static byte[] encryptingBytes(byte[] plain_text) throws Exception {
            // Initialize PBE Cipher with key and parameters
            pbeCipher.init(Cipher.ENCRYPT_MODE, pbeKey, pbeParamSpec);
            byte[] cipherText = pbeCipher.doFinal(plain_text);
            return cipherText;
        }

        public static String decrypting(byte[] crypted_bytes) throws Exception {
            byte[] plainTextBytes = decryptingBytes(crypted_bytes);

            int ByteNumberPeerCharacter = 2; // for UTF-16 is 2, UTF-32 is 4
            char c;

    //      StringBuffer s = new StringBuffer();
    //      other solution to convert char[] -> String
    //        for (int i = 0; i < (plainTextBytes.length - 1); i = i + ByteNumberPeerCharacter) {
    //            c = (char) (((plainTextBytes[i]&0x00FF)<<8) + (plainTextBytes[i+1]&0x00FF));
    //            s.append(c);
    //        }

            String s2 = new String(plainTextBytes, TEXT_ENCODING);
            return s2.toString();
        }

        private static byte[] decryptingBytes(byte[] crypted_text) throws Exception {
            // Initialize PBE Cipher with key and parameters
            try {
                pbeCipher.init(Cipher.DECRYPT_MODE, pbeKey, pbeParamSpec);
                byte[] cipherText = pbeCipher.doFinal(crypted_text);
                return cipherText;
            }
            catch (BadPaddingException ex) {
                throw new MyDecryptingException();
            }
        }



    I think the code is correct, because it encrypts and decrypts text correctly on the same OS and across different ones (Windows 7 x64, Windows XP x32 and x64, Ubuntu Linux 11.04, Mac OS X Lion 10.7.2). The problem occurs only under Windows XP and Linux, and only with some characters (a rectangular box instead of the proper character). By the way, that is why I tried to use base64 (before encrypting): to rule out a character-encoding problem. But of course I don't need base64.

    Oh, and I almost forgot about the font:
            text_area = new JTextArea();
    
            Font regularFont = new Font("Century Schoolbook", Font.PLAIN, 15);
            text_area.setFont(regularFont.deriveFont(Font.PLAIN));
    Under Windows XP and Linux there are rectangular boxes instead of the proper characters, so maybe it is a font problem, just as DrClap said.
  • 7. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    sabre150 Expert
    Apart from some very nasty redundancy, your encryption/decryption code looks reasonable. We cannot really diagnose your display problem without knowing the content of the Strings you are encrypting, but I suspect that on some platforms the font ' "Century Schoolbook", Font.PLAIN ' does not have glyphs for some of the characters you are trying to display.

    Out of interest, why are you using UTF-16BE rather than UTF-8?
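    One way to test the missing-glyph suspicion is to ask the font directly. The sketch below is only an illustration: the GlyphCheck class and its missingGlyphs helper are hypothetical names, while Font.canDisplay(int) is the standard java.awt call. Note that when the requested font is not installed, Java substitutes a default font, so printing getFamily() shows which font actually answered.

```java
import java.awt.Font;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class GlyphCheck {
    // Returns the code points of text (as "U+XXXX") that the predicate rejects.
    static List<String> missingGlyphs(String text, IntPredicate canDisplay) {
        List<String> missing = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (!canDisplay.test(cp)) {
                missing.add(String.format("U+%04X", cp));
            }
        });
        return missing;
    }

    public static void main(String[] args) {
        // Same font request as in the thread; the result depends on the machine.
        Font font = new Font("Century Schoolbook", Font.PLAIN, 15);
        String sample = "a\u263A\u2665"; // letter, smiley, heart
        System.out.println(font.getFamily() + " cannot display: "
                + missingGlyphs(sample, font::canDisplay));
    }
}
```

    Running this on each OS with the characters that show up as boxes should tell you whether the font is the culprit there.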
  • 8. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    903940 Newbie
    Yes, there is nasty redundancy, but I think I will remove it in the final version.

    I copy-pasted some characters from text-symbols.com [http://text-symbols.com/emoticons/] into the JTextArea.

    I looked at which charset (charset=UTF-8) and fonts (Sans-Serif, Arial, Arial Unicode MS, Lucida Grande, Tahoma, Verdana) that web page uses and set those fonts as the JTextArea font. No success yet. I also tried other fonts (Lucida Bright Demi Bold, Century Schoolbook, Serif, Bitstream Cyberbit), but with no success on Windows XP either. I don't know how to find a font that works on every OS, but I think it doesn't matter, because nobody uses those kinds of characters in simple text files. So maybe I will just leave this issue.

    In the current beta versions I'm using UTF-8 again.

    Update:
    Or maybe I should find a free font with all the proper glyphs and bundle it with my application.
  • 9. Re: UTF-16 UTF-32 Characters encoding under different OS (Win7 / WinXP / Linux)
    DrClap Expert
    The charset you use for encoding and decoding during the encryption process has nothing to do with the fonts you're using to display the unencrypted characters. So if your data is mostly from the Latin alphabet, UTF-8 will give you fewer bytes to encrypt than UTF-16 will, but since both of them can handle all characters, it doesn't really matter unless the size of the encrypted data is an issue.
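    To put numbers on the size difference, here is a small sketch (the class and method names are just for illustration). It counts the bytes each charset produces before encryption; the encrypted size will be somewhat larger because of cipher padding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSize {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String latin = "Hello";
        String symbol = "\u263A"; // smiley

        // Latin text: UTF-8 needs 1 byte per character, UTF-16BE needs 2.
        System.out.println(encodedLength(latin, StandardCharsets.UTF_8));     // 5
        System.out.println(encodedLength(latin, StandardCharsets.UTF_16BE));  // 10

        // A symbol like this one goes the other way: 3 bytes versus 2.
        System.out.println(encodedLength(symbol, StandardCharsets.UTF_8));    // 3
        System.out.println(encodedLength(symbol, StandardCharsets.UTF_16BE)); // 2
    }
}
```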

    As for fonts which can and can't render certain characters, that has nothing to do with the fact that the characters at some earlier point in time went through an encryption and decryption process. (As long as the result of those processes produced the original clear-text without alteration, that is.)

    If you choose a font with "Unicode" in its name then you have the best chance of being able to render the most characters.
