UTF-8 vs. UTF-16 vs. CESU-8

471965, Nov 25 2005 (edited Dec 12 2005)
According to Oracle documentation, the Oracle character set UTF8 follows the CESU-8 encoding scheme rather than the UTF-8 standard.

According to Unicode.org, the CESU-8 encoding scheme for Unicode is identical to UTF-8 except for its representation of supplementary characters. As a result, a binary collation of data encoded in CESU-8 is identical to the binary collation of the same data encoded in UTF-16, so for binary sorting purposes CESU-8 and UTF-16 yield the same results. Yet despite this assurance, Unicode.org states, and I quote: "This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange."

Given the Unicode.org position and Oracle's statement that "Starting with the next major functional release after Oracle Database 10g Release 2, the choice for the database character set will be limited to this list of recommended character sets for new system deployment", the only "universal character set" on the list is AL32UTF8 (Unicode 4.0 UTF-8 Universal character set).

The questions are three:
1-Do you expect Oracle's UTF-8 to remain as CESU-8?
2-Since we must support some 12 different languages and want to do so in a single database, UTF-8 is our only option. However, we must disseminate our content to various exchanges. Must we label our data as CESU-8, or can we allow it to be auto-detected?
3-We assume that Oracle uses UTF-8 as its database character set within its own internal databases as well as within the Oracle application suite. In those cases, when content is disseminated, is that content labeled CESU-8? What is Oracle's position?

While this may seem a trivial issue, we believe that AL32UTF8 is the database character set we must use to meet our needs, but we are concerned about possible long-term implications. Hence we are asking for your opinion of the long-term viability of AL32UTF8, given the unicode.org statement that Oracle UTF-8 is not really "unicode".

Thanks.

Comments

50379
You can find a white paper on Oracle Unicode support here: http://www.oracle.com/technology/tech/globalization/pdf/TWP_AppDev_Unicode_10gR2.pdf
Oracle's recommendation for Unicode support, especially when dealing with supplementary characters, is AL32UTF8. Note that AL32UTF8 does not use the CESU-8 encoding scheme for supplementary characters and is the UTF-8 character set that Oracle updates to comply with new versions of the UTF-8 standard. If your database currently uses the UTF8 character set, I would recommend migrating to AL32UTF8 to ensure the best compatibility.
319622
Just curious. What is the meaning of "AL32" and "AL16"?
AL32 = All Languages, 32 bits maximum character width
AL16 = All Languages, 16 bits maximum character width


Best regards,
Sergiusz
## 1-Do you expect Oracle's UTF-8 to remain as CESU-8?

Do not mix UTF-8 with UTF8. UTF-8 is a term defined by Unicode. UTF8 is the character set name in Oracle.

Oracle's UTF8 will remain Unicode's CESU-8 with the Unicode 3.0 repertoire of characters. There are no plans to change it.

Oracle's AL32UTF8 is Unicode's UTF-8 and will be enhanced if the character repertoire of Unicode is enhanced (Oracle10gR2 uses the Unicode 4.0 repertoire).
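
To make the difference concrete, here is a minimal Python sketch of how the two schemes encode a supplementary character. Python has no built-in CESU-8 codec, so `cesu8_encode` is a hypothetical helper written for illustration only: it splits each supplementary code point into its UTF-16 surrogate pair and encodes each surrogate as a 3-byte sequence. It is not Oracle code and makes no claim about how the database implements its UTF8 character set.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative helper, not a library API)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00010400"  # a character outside the Basic Multilingual Plane
bmp_high = "\uFFFD"           # a BMP character above the surrogate range

print(supplementary.encode("utf-8").hex())   # f0909080     (4 bytes, UTF-8)
print(cesu8_encode(supplementary).hex())     # eda081edb080 (6 bytes, CESU-8)

# Binary ordering: CESU-8 agrees with UTF-16, UTF-8 does not.
print(cesu8_encode(supplementary) < cesu8_encode(bmp_high))             # True
print(supplementary.encode("utf-16-be") < bmp_high.encode("utf-16-be")) # True
print(supplementary.encode("utf-8") < bmp_high.encode("utf-8"))         # False
```

For text that contains only BMP characters, the two encodings produce identical bytes, which is why the distinction matters only when supplementary characters are stored.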

## 2-Since we must support some 12 different languages and we want to do so
## in a single database UTF-8 is our only option, however, we must disseminate
## our content to various exchanges and so must we label our data as CESU-8
## or can we allow it to be auto-detected?

If you use AL32UTF8 as the database character set (recommended for all environments that use only Oracle9i or newer software), then you should mark the data as 'utf-8' (in terms of MIME charset tags).

If you use UTF8 as the database character set (recommended only if 8.0 or 8i clients or databases exist in the environment), you should use either 'utf-8' or 'cesu-8'. If your database contains no surrogate pairs, which is usually the case, use 'utf-8'. If you have surrogates, then theoretically you should use 'cesu-8'. But, as your receiving applications may not recognize this MIME tag (it is not widely known), you may have to use 'utf-8' instead.
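
If you want to check exported data before choosing a tag, a rough Python sketch follows. The function name `suggest_charset_label` and the pattern name are hypothetical, chosen for this example. The check relies on the fact that in valid UTF-8 the lead byte 0xED is only ever followed by a byte in the range 0x80 to 0x9F, so 0xED followed by 0xA0 to 0xBF indicates a surrogate encoded CESU-8 style. It assumes the input is otherwise well-formed and does not validate it.

```python
import re

# 0xED followed by 0xA0..0xBF is a 3-byte encoded surrogate (U+D800..U+DFFF),
# which cannot occur in valid UTF-8 but does occur in CESU-8.
_ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def suggest_charset_label(data: bytes) -> str:
    """Return 'cesu-8' if the bytes contain encoded surrogates, else 'utf-8'."""
    return "cesu-8" if _ENCODED_SURROGATE.search(data) else "utf-8"


print(suggest_charset_label("plain text, no surrogates".encode("utf-8")))  # utf-8
print(suggest_charset_label(b"\xed\xa0\x81\xed\xb0\x80"))                  # cesu-8
```

Whether to actually send 'cesu-8' still depends on the receiving applications, as noted above; the check only tells you whether the distinction matters for a given piece of data.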

## 3-We assume that Oracle uses UTF-8 as its database character set
## within its own internal databases as well as within the Oracle application
## suite. In those cases when content is disseminated is that content labeled
## CESU-8? What is Oracle's position?

As far as I know, we usually assume that there are no surrogates in the database and we use 'utf-8'. But strictly speaking, if the database is UTF8 and not AL32UTF8, 'cesu-8' would be the correct tag. Unfortunately, many applications may be unable to recognize it.


Best regards,

Sergiusz
Post Details

Locked on Jan 9 2006
Added on Nov 25 2005
4 comments
13,496 views