This discussion is archived
5 Replies Latest reply: Jan 29, 2013 1:10 PM by user13427480 RSS

character set issue we8 to UTF8

user13427480 Newbie
Currently Being Moderated
Having datatype length issues due to character set differences between two databases one is in Western European char set (WE8) and another database is in UTF8. we bring in data from text fields, the values are getting truncated. If we have to increase the column lengths in FITS side to bring in the values without truncating, but we are not sure how much we should increase the column lengths to. is there a DBA rule of thumb, that says this many bytes in WEC translate to this many bytes in UTF 8 or something like that? in other words, is there a multiplication factor for converting WE8 text with special characters to UTF8 text
  • 1. Re: character set issue we8 to UTF8
    Justin Cave Oracle ACE
    Currently Being Moderated
    If the source database character set is Windows-1252 (WE8MSWIN1252), which is what I'm guessing you mean by "WE8", one character in the Windows-1252 database could require up to three bytes of storage (though the vast majority will require only 1 byte and most of the rest will require 2 bytes).

    You can either triple the size of the column in bytes on the UTF-8 database or you can use character length semantics. If you declare a table
    CREATE TABLE foo (
      col1 VARCHAR2(10 byte),
      col2 VARCHAR2(10 char)
    )
    COL1 will allocate up to 10 bytes of space which may be enough for as little as 2 characters in a UTF-8 database. COL2 will allocate space for up to 10 characters regardless of the number of bytes that requires (though the maximum length of a VARCHAR2 field remains 4000 bytes). Most of the time, you're better off using character length semantics rather than tripling the size of the field in bytes and potentially allowing much longer strings to get saved.

    Ideally, you would specify character length semantics in the DDL for every table. You can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR to make character length semantics the default but some of the folks over in the Globalization forum that are much smarter than I am have concerns about scripts that haven't been tested with character length semantics. Personally, I've never encountered any issues in setting NLS_LENGTH_SEMANTICS to CHAR but they've got a lot more experience with it than I do.

    Justin
  • 2. Re: character set issue we8 to UTF8
    user13427480 Newbie
    Currently Being Moderated
    Thanks for the reply.

    source characterset : WE8MSWIN1252
    destination charaterset: UTF8
  • 3. Re: character set issue we8 to UTF8
    user13427480 Newbie
    Currently Being Moderated
    we have VARCHAR2 (2000 bytes) from source and VARCHAR2 (4000 bytes) described in our destination database and still it is failing.
    If we even change the datatype from BYTES to CHAR and size might be greater than 4000 bytes in destination then it will fail again.
    Is there anything else we can do.
  • 4. Re: character set issue we8 to UTF8
    Justin Cave Oracle ACE
    Currently Being Moderated
    A VARCHAR2 is limited to 4000 bytes. If you need more than 4000 bytes of storage, you'd need to move to a CLOB.

    That being said, it doesn't make sense that you could have properly stored Windows-1252 data in a VARCHAR2(2000 byte) that can't fit in a VARCHAR2(4000 byte) column in a UTF-8 database unless the data isn't getting converted correctly or the data itself is really weird. The first 127 characters from Windows-1252 will only require 1 byte of storage in UTF-8-- that covers all the English letters, numbers, basic punctuation, etc. You'd only need to (and occasionally 3) bytes for some of the symbols, accented characters, etc. Unless you have really, really unusual text, doubling the amount of data required would be odd.

    Justin

    Edited by: Justin Cave on Jan 29, 2013 3:47 PM
  • 5. Re: character set issue we8 to UTF8
    user13427480 Newbie
    Currently Being Moderated
    thanks Justin

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points