UTF-8 vs. UTF-16 vs. CESU-8
471965 | Nov 25 2005 (edited Dec 12 2005)
According to the Oracle documentation, the Oracle character set UTF8 follows the CESU-8 encoding scheme rather than the UTF-8 standard.
According to Unicode.org, the CESU-8 encoding scheme for Unicode is identical to UTF-8 except for its representation of supplementary characters: CESU-8 encodes each of the two UTF-16 surrogate code units as its own three-byte sequence (six bytes in total), where UTF-8 uses a single four-byte sequence. As a consequence, a binary collation of data encoded in CESU-8 is identical to the binary collation of the same data encoded in UTF-16; thus, for all practical purposes, UTF-8 and UTF-16 yield comparable results. Yet, despite this assurance, Unicode.org states, and I quote: "This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode in order to provide an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is not intended nor recommended as an encoding used for open information exchange."
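To make the difference concrete, here is a minimal Python sketch. It is an illustration only, not Oracle code: Python ships no CESU-8 codec, so cesu8_encode is a hypothetical helper that builds CESU-8 by encoding each UTF-16 code unit separately. It uses U+10400 as the sample supplementary character and U+FFFD as a BMP character to show the collation difference.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8.

    Each UTF-16 code unit (including each half of a surrogate pair)
    is written out as its own UTF-8-style byte sequence.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00010400"  # outside the BMP, needs a surrogate pair in UTF-16
bmp_high = "\uFFFD"           # inside the BMP, above the surrogate range

print(supplementary.encode("utf-8").hex(" "))   # f0 90 90 80
print(cesu8_encode(supplementary).hex(" "))     # ed a0 81 ed b0 80

# Binary sort order: UTF-8 follows code point order, CESU-8 follows UTF-16.
pair = [bmp_high, supplementary]
by_utf8 = sorted(pair, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(pair, key=lambda s: s.encode("utf-16-be"))
by_cesu8 = sorted(pair, key=cesu8_encode)
print(by_cesu8 == by_utf16, by_utf8 == by_utf16)  # True False
```

For text that contains no supplementary characters, the two encodings produce the same bytes, so the difference only shows up once such characters are stored.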
Given the Unicode.org position, consider also Oracle's statement: "Starting with the next major functional release after Oracle Database 10g Release 2, the choice for the database character set will be limited to this list of recommended character sets for new system deployment." The only "universal" character set on that list is AL32UTF8 (Unicode 4.0 UTF-8 Universal character set).
The questions are three:
1-Do you expect Oracle's UTF-8 to remain as CESU-8?
2-Since we must support some 12 different languages and want to do so in a single database, UTF-8 is our only option. However, we must disseminate our content to various exchanges, so must we label our data as CESU-8, or can we allow it to be auto-detected?
3-We assume that Oracle uses UTF-8 as its database character set within its own internal databases as well as within the Oracle application suite. In those cases, when content is disseminated, is that content labeled CESU-8? What is Oracle's position?
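Regarding auto-detection in question 2, the following Python sketch shows why we are unsure it can be relied on. It is a rough heuristic under our own assumptions, not a description of what any exchange actually does: classify is a hypothetical function, and the only signal it uses is that a strict UTF-8 decoder rejects the encoded surrogates that CESU-8 uses for supplementary characters.

```python
import re

# A CESU-8 supplementary character: an encoded high surrogate (ED A0..AF xx)
# immediately followed by an encoded low surrogate (ED B0..BF xx).
_CESU8_PAIR = re.compile(rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]")


def classify(data: bytes) -> str:
    """Hypothetical heuristic: return 'utf-8', 'cesu-8' or 'unknown'."""
    try:
        # Strict decoding rejects encoded surrogates.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    if _CESU8_PAIR.search(data):
        try:
            # Tolerate encoded surrogates, but still reject other malformed bytes.
            # Limitation: data mixing 4-byte UTF-8 and surrogate pairs also lands here.
            data.decode("utf-8", "surrogatepass")
            return "cesu-8"
        except UnicodeDecodeError:
            pass
    return "unknown"


print(classify("plain text".encode("utf-8")))   # utf-8
print(classify(b"\xf0\x90\x90\x80"))            # utf-8  (U+10400 as UTF-8)
print(classify(b"\xed\xa0\x81\xed\xb0\x80"))    # cesu-8 (U+10400 as CESU-8)
```

Note that data without supplementary characters is byte-for-byte the same in both encodings and is reported as UTF-8, so a receiver can only tell the two apart once a supplementary character appears.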
While this may seem a trivial issue, we believe that AL32UTF8 is the database character set we must use to meet our needs, but we are concerned about possible long-term implications. Hence we are asking for your opinion on the long-term viability of AL32UTF8, given the Unicode.org statement that Oracle UTF-8 is not really "Unicode".
Thanks.