This discussion is archived
6 Replies Latest reply: Feb 12, 2013 10:32 AM by Sergiusz Wolicki (Oracle) RSS

Character Encoding Problem - LATIN-1 Character in UTF-8 DB

990108 Newbie
Currently Being Moderated
I am using C++ OCI LIB, to insert some report data from remote OCI client to oracle 11 server. This data is read by another process to create the report.

The DB CHARSET is UTF-8. But the report tool expects the data to be ISO08859-1 encoded. So while inserting the data into the database i specify the following LANG and CHARSET for my table colulmn in client:

The TARGET DB CHARSET is UTF-8

NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1

size_t csid = 871; // UTF-8

OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND,
*(void *)&csid*,
(ub4) 0,
(ub4)OCI_ATTR_CHARSET_ID, errhp);

This solution works for almost every case of ASCII and Extended ASCII Charest but we are facing issues if we have few specific characters to be inserted.

If we are trying to insert single beta character [β] through client, the data goes empty to the column.

Beta Character details:
DEC     OCT     HEX     BIN     Symbol           Description
223     337     DF     11011111     ß     Latin small letter sharp s - ess-zed

DB Output after insert single β:
select rawtohex(NAME) from PERSONS where EID=333;

RAWTOHEX(NAME)
---------------------------


But if the string is *"ββ"* everything work fine:
DB Output for "ββ":
select rawtohex(NAME) from PERSONS where EID=333;

RAWTOHEX(NAME)
---------------------------
DFDF

Please help me out in understanding where I am going wrong and how to resolve this.
  • 1. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    Zoltan Kecskemethy Expert
    Currently Being Moderated
    You need to understand when you insert in ASCII characters to a UTF database OCI automatically converts ascii to utf when you insert and utf to ascii when you read data. I guess something goes wrong in this auto conversion.

    IMHO Best would be to insert in UTF characters to the db so they would go in without any conversion.
    And you can read them in any charset you wish...

    One more thing make sure that you have beta character in your display character set.

    BTW use oracle [url http://docs.oracle.com/cd/E11882_01/server.112/e26088/functions055.htm#SQLRF00635]dump() fucntion to see how oracle stores varchar2 data type.

    Edited by: Zoltan Kecskemethy on Feb 11, 2013 12:25 PM
    added dump func
  • 2. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    990108 Newbie
    Currently Being Moderated
    Thanks Zoltan...

    The problem is that the ASCII as the application that reads the data my application is inserting expects the value ONLY with ASCII encoding [The HEX value of the inserted data has to be DF for them to decode it properly].

    But when we are trying to insert beta character *[β]* with ASCII encoding so that it goes as DF, OCI client library is inserting the data as EMPTY STRING.

    But we can not currently change the implementation of the application that reads the data as it is 3rd party.

    Please help...

    The DUMP output shows the following:

    SELECT DUMP('β', 1016) FROM DUAL;
    Typ=96 Len=2 CharacterSet=UTF8: c3,9f

    Edited by: 987105 on Feb 11, 2013 6:22 PM
  • 3. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    Zoltan Kecskemethy Expert
    Currently Being Moderated
    I see your issue. I tried to reproduce this issue using sqlplus.
    Setup NLS_LANG as you mentioned using WE8ISO8859P1 and connect to a UTF8 db.
    [oracle@server ~]$ locale -a | grep en_US
    en_US
    en_US.iso88591
    en_US.iso885915
    en_US.utf8
    [oracle@server ~]$ export LC_ALL=en_US.iso88591
    [oracle@server ~]$ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.iso88591"
    LC_NUMERIC="en_US.iso88591"
    LC_TIME="en_US.iso88591"
    ...
    LC_MEASUREMENT="en_US.iso88591"
    LC_IDENTIFICATION="en_US.iso88591"
    LC_ALL=en_US.iso88591
    [oracle@server ~]$ echo $NLS_LANG
    AMERICAN_AMERICA.WE8ISO8859P1
    
    SQL> select value from nls_database_parameters where parameter='NLS_CHARACTERSET';
    AL32UTF8
    
    SQL> select nchr(223) from dual;
    
    NCHR
    ----
    ß
    
    SQL> select chr(223) from dual;
    ERROR:
    ORA-29275: partial multibyte character
    
    SQL> SELECT CHR (223 USING NCHAR_CS) from dual;
    
    C
    -
    ß
    
    SQL>  select  CONVERT('ß', 'WE8ISO8859P1','UTF8') FROM DUAL;
    CO
    --
    ??
    SQL> select  CONVERT(CHR(223 USING NCHAR_CS), 'WE8ISO8859P1','UTF8') FROM DUAL;
    select  CONVERT(CHR (223 USING NCHAR_CS), 'WE8ISO8859P1','UTF8') FROM DUAL
                                                                          *
    ERROR at line 1:
    ORA-12703: this character set conversion is not supported
    I found that beta is in WE8ISO8859P1 character set
    here[url http://docs.oracle.com/cd/E11882_01/server.112/e10729/ch2charset.htm#NLSPG164]What Characters Does a Character Set Support?

    But I don't really understand whats the problem in the conversion.
  • 4. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    Sergiusz Wolicki (Oracle) Expert
    Currently Being Moderated
    Atanu,

    You should realize that the character set information that you give in the inserting client must always match the encoding of data that you put into the buffers. The character set of the database is not relevant here (though it is relevant to ensure that data can be stored without using replacement characters; UTF-8 is OK here.) The NLS_LANG encoding will be used for the INSERT statement itself and all literals included in it. The OCI_ATTR_CHARSET_ID value will be used for the content of the bind buffer. If you set 817 for OCI_ATTR_CHARSET_ID, then the data in the buffer must be UTF-8 encoded. If you set NLS_LANG to .WE8ISO8859P1, all text literals in the INSERT statement must be LATIN-1 encoded. Therefore, you must either convert the data to the configured client character sets (not a good idea, in general) or you have to configure the character sets (NLS_LANG and/or OCI_ATTR_CHARSET_ID) to match the input data.

    Configuration of the report generator is separate. If the generator expects LATIN-1 encoded data, you should make sure that NLS_LANG is set to .WE8ISO8859P1 -- this assumes that this is an OCI client that does no special NLS settings itself. Otherwise, further investigation may be necessary.

    Note that LATIN-1 non-ASCII characters, such as the German sharp-s, expand in conversion from one byte to two bytes each. You must make sure the target column is large enough for the converted value.

    -- Sergiusz
  • 5. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    990108 Newbie
    Currently Being Moderated
    Sergiusz,

    I understand your point.

    But when I set
    NLS_LANG = WE8ISO8859P1
    OCI_ATTR_CHARSET_ID = 31 // WE8859-1

    which i understand from you should be the correct configuration in client, The data is still encoded as UTF-8. Please refer below:
    SQL> select rawtohex(NAME) from persons;

    RAWTOHEX(NAME)
    --------------------------------------------------------------------------------
    61 // *"a"*
    1EC39F // *"β"*
    C39FC39F // *"ββ"*

    AND

    When I set
    NLS_LANG = WE8ISO8859P1
    OCI_ATTR_CHARSET_ID = 871 // UTF-8

    which i understand is not a correct configuration in client, The data is encoded as expected [ASCII] for input with 2 characters. Please refer below:
    SQL> select rawtohex(NAME) from persons;

    RAWTOHEX(NAME)
    --------------------------------------------------------------------------------
    DFDF // *"ββ" with CHARSET_ID = 871*

    Please let me know where I am going wrong
  • 6. Re: Character Encoding Problem - LATIN-1 Character in UTF-8 DB
    Sergiusz Wolicki (Oracle) Expert
    Currently Being Moderated
    Atanu,


    If your database is UTF-8, i.e. SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET'; returns AL32UTF8 (or the similar UTF8), then data in VARCHAR2/CHAR/LONG columns must be encoded in UTF-8. This is the whole sense of the database character set declaration. What you need to assure is that this data, when selected by the reporting program, is converted back to Latin-1. This can be achieved by setting NLS_LANG to .WE8ISO8859P1 in the environment of the reporting program. If this does not help, we need to take a closer look at this program. What Oracle client API it uses, in which programming language it is written, on which platform it runs, etc.


    -- S5sz

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points