5 Replies Latest reply: Jan 29, 2013 3:10 PM by user13427480

    character set issue we8 to UTF8

    user13427480
      Having datatype length issues due to character set differences between two databases: one is in the Western European character set (WE8) and the other database is in UTF8. When we bring in data from text fields, the values are getting truncated. We have to increase the column lengths on the FITS side to bring in the values without truncating, but we are not sure how much to increase the column lengths by. Is there a DBA rule of thumb that says this many bytes in WE8 translate to this many bytes in UTF8, or something like that? In other words, is there a multiplication factor for converting WE8 text with special characters to UTF8 text?
        • 1. Re: character set issue we8 to UTF8
          JustinCave
          If the source database character set is Windows-1252 (WE8MSWIN1252), which is what I'm guessing you mean by "WE8", one character in the Windows-1252 database could require up to three bytes of storage (though the vast majority will require only 1 byte and most of the rest will require 2 bytes).
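
           For example, you can see the expansion directly on the UTF-8 side (just a quick sketch, assuming an AL32UTF8 or UTF8 database and a client that passes the Euro sign through cleanly):
           SELECT LENGTH('€')  AS char_count,
                  LENGTHB('€') AS byte_count
             FROM dual;
           The Euro sign is a single byte in Windows-1252, but LENGTHB should report 3 bytes here.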

          You can either triple the size of the column in bytes on the UTF-8 database or you can use character length semantics. If you declare a table
          CREATE TABLE foo (
            col1 VARCHAR2(10 byte),
            col2 VARCHAR2(10 char)
          )
           COL1 will allocate up to 10 bytes of space, which may be enough for as little as 2 characters in a UTF-8 database. COL2 will allocate space for up to 10 characters regardless of the number of bytes that requires (though the maximum length of a VARCHAR2 field remains 4000 bytes). Most of the time, you're better off using character length semantics rather than tripling the size of the field in bytes and potentially allowing much longer strings to get saved.
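
           To see the difference with the table above (an untested sketch, assuming an AL32UTF8 database, where the accented character 'é' takes 2 bytes):
           INSERT INTO foo (col1) VALUES ('éééééééééé');  -- 10 characters = 20 bytes, fails with ORA-12899
           INSERT INTO foo (col2) VALUES ('éééééééééé');  -- 10 characters, fits no matter how many bytes they take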

           Ideally, you would specify character length semantics in the DDL for every table. You can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR to make character length semantics the default, but some of the folks over in the Globalization forum who are much smarter than I am have concerns about scripts that haven't been tested with character length semantics. Personally, I've never encountered any issues setting NLS_LENGTH_SEMANTICS to CHAR, but they've got a lot more experience with it than I do.
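
           If you do want to try the parameter, the session-level form looks something like this (a sketch only, and "bar" is just a throwaway table name; test it against your own scripts first):
           ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;
           CREATE TABLE bar (
             col1 VARCHAR2(10)
           );
           With the parameter set, COL1 behaves as VARCHAR2(10 CHAR) because no BYTE/CHAR qualifier was given.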

          Justin
          • 2. Re: character set issue we8 to UTF8
            user13427480
            Thanks for the reply.

             source character set: WE8MSWIN1252
             destination character set: UTF8
            • 3. Re: character set issue we8 to UTF8
              user13427480
               We have VARCHAR2(2000 BYTE) on the source and VARCHAR2(4000 BYTE) defined in our destination database, and it is still failing.
               Even if we change the length semantics from BYTE to CHAR, the size might be greater than 4000 bytes in the destination and then it will fail again.
               Is there anything else we can do?
              • 4. Re: character set issue we8 to UTF8
                JustinCave
                A VARCHAR2 is limited to 4000 bytes. If you need more than 4000 bytes of storage, you'd need to move to a CLOB.

                 That being said, it doesn't make sense that data properly stored in a VARCHAR2(2000 BYTE) column in a Windows-1252 database couldn't fit in a VARCHAR2(4000 BYTE) column in a UTF-8 database unless the data isn't getting converted correctly or the data itself is really weird. The first 128 characters of Windows-1252 require only 1 byte of storage in UTF-8; that covers all the English letters, numbers, basic punctuation, etc. You'd only need 2 (and occasionally 3) bytes for some of the symbols, accented characters, etc. Unless you have really, really unusual text, doubling the amount of storage required would be odd.
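
                 One thing worth doing is measuring the actual expansion on the source database before moving the data (a rough sketch; my_table and my_col are placeholders for your real table and column, and this CONVERT-based estimate should be sanity-checked against your data):
                 SELECT MAX(LENGTHB(my_col))                  AS max_bytes_we8,
                        MAX(LENGTHB(CONVERT(my_col, 'UTF8'))) AS max_bytes_utf8
                   FROM my_table;
                 If max_bytes_utf8 comes back over 4000, those rows genuinely need a CLOB; if it stays under 4000, the truncation is happening somewhere in the transfer rather than because of the character set conversion.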

                Justin

                Edited by: Justin Cave on Jan 29, 2013 3:47 PM
                • 5. Re: character set issue we8 to UTF8
                  user13427480
                  thanks Justin