
UTF-8 stored in VARCHAR2 on a non-Unicode DB

snmdla, Jan 29 2014 (edited Jan 31 2014)

Hi there,

We work with a company that stores Unicode data in Oracle in the following way:

A plain VARCHAR2 on a non-Unicode DB (charset is actually WE8MSWIN1252) receives UTF-8 coded data.

As the client's NLS_LANG character set matches the database character set, no conversion takes place, and the app runs fine.

(In my eyes, a cleaner way to set this up would be to use NVARCHAR2 columns, but that is not an option.)

But how can I query these columns without getting garbage for every non-ASCII character?

I imagine setting up views for that purpose, but I need the syntax for re-interpreting the UTF-8 data coming from a VARCHAR2 column.

I tried the following:

SELECT CONVERT(column, 'WE8MSWIN1252', 'AL32UTF8') FROM table where ...

This gives me the right data on a client set up for cp1252, with the restriction that the output is limited to 8-bit characters.
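For reference, the view I have in mind would look roughly like this (t and col stand in for the real table and column names); the output remains limited to characters that exist in Windows-1252:

-- Hypothetical view: treat the stored bytes as AL32UTF8 and re-encode
-- them as WE8MSWIN1252 so a cp1252 client displays them correctly.
-- Note that CONVERT takes (value, destination charset, source charset).
CREATE OR REPLACE VIEW t_cp1252 AS
SELECT CONVERT(col, 'WE8MSWIN1252', 'AL32UTF8') AS col_cp1252
FROM t;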

Now I would like a Unicode-capable application like SQL Developer to deal fully with the Unicode data, but I guess that for that to work, the DB would need to deliver NVARCHAR2 output from the above query?

Any help and comments appreciated.

Tom


Comments

JustinCave

This is a classic example of an incorrect NLS_LANG setup.

Realistically, if you're going to misconfigure things this way, *every single application* would need to be misconfigured in exactly the same way.  That would generally mean that tools that have inherent support for Unicode (SQL Developer, any JDBC application, etc.) would be unable to work with the database.  That's one of the very many reasons that this sort of configuration should never be used.

In theory, I suppose, you could convert the VARCHAR2 data to RAW in your view, load it into a binary data type in your application, and then convert the binary data to a string in your application using the UTF-8 character set.  That's going to be a non-trivial amount of effort, though.  And if you want to modify data, you'd have even more problems.
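A rough sketch of that approach, with placeholder names (t, col); the view merely exposes the bytes, and the UTF-8 decoding happens in the client:

-- Hypothetical view: hand the raw bytes to the application untouched.
-- UTL_RAW.CAST_TO_RAW performs no character set conversion; it only
-- relabels the VARCHAR2 bytes as RAW.
CREATE OR REPLACE VIEW t_raw AS
SELECT UTL_RAW.CAST_TO_RAW(col) AS col_raw
FROM t;

The application would then read col_raw as a byte array and decode it with a UTF-8 decoder (in Java, for instance, new String(bytes, StandardCharsets.UTF_8)).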

Justin

unknown-7404
we have a company that implements storing Unicode data in Oracle in the following way:

No - they don't. They are NOT storing Unicode data - they are storing individual one-byte characters and using that VARCHAR2 column as a BLOB. Ask them how, or if, they query the data.

A plain VARCHAR2 on a non-Unicode DB (charset is actually WE8MSWIN1252) receives UTF-8 coded data.

No - it doesn't. It receives a string of one-byte characters in the WE8MSWIN1252 character set. It does not know, or care, what those one-byte characters represent. All you are doing is storing BINARY data in that VARCHAR2 column one byte at a time. When you query it you will get one or several bytes back - but since Oracle thinks it is really character data, when it is actually binary, you can only match it by matching those one-byte characters.
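You can see this for yourself with DUMP (t and col are placeholder names): it shows the bytes Oracle actually stores. A UTF-8 encoded 'é' (bytes C3 A9) that was passed through unconverted shows up as the two WE8MSWIN1252 characters 'Ã©'.

-- Diagnostic sketch: DUMP(col, 16) prints the stored bytes in hex.
SELECT col, DUMP(col, 16) AS stored_bytes FROM t WHERE ROWNUM <= 5;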

I imagine setting up views for that purpose, but I need the syntax on how to re-interpret the UTF-8 data coming from a VARCHAR2 field.

You don't have UTF-8 data - you have a BLOB that you need to convert to UTF-8 data. You can use the DBMS_LOB.CONVERTTOCLOB procedure to do the conversion and specify the character set to use. See the DBMS_LOB API:

http://docs.oracle.com/cd/B28359_01/appdev.111/b28419/d_lob.htm#i1020356

CONVERTTOCLOB Procedure

This procedure takes a source BLOB instance, converts the binary data in the source instance to character data using the character set you specify, writes the character data to a destination CLOB or NCLOB instance, and returns the new offsets.

See this AskTom article for further review

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:3575852900346063772
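A minimal sketch of that conversion (t and col are placeholder names): relabel the VARCHAR2 bytes as a BLOB, then let CONVERTTOCLOB decode them as AL32UTF8.

DECLARE
  v_raw       RAW(32767);
  v_blob      BLOB;
  v_clob      CLOB;
  v_dest_off  INTEGER := 1;
  v_src_off   INTEGER := 1;
  v_lang_ctx  INTEGER := DBMS_LOB.DEFAULT_LANG_CTX;
  v_warning   INTEGER;
BEGIN
  -- grab the mislabeled bytes; CAST_TO_RAW does no conversion
  SELECT UTL_RAW.CAST_TO_RAW(col) INTO v_raw FROM t WHERE ROWNUM = 1;

  DBMS_LOB.CREATETEMPORARY(v_blob, TRUE);
  DBMS_LOB.WRITEAPPEND(v_blob, UTL_RAW.LENGTH(v_raw), v_raw);

  -- decode the binary data as AL32UTF8 into a CLOB
  DBMS_LOB.CREATETEMPORARY(v_clob, TRUE);
  DBMS_LOB.CONVERTTOCLOB(
    dest_lob     => v_clob,
    src_blob     => v_blob,
    amount       => DBMS_LOB.LOBMAXSIZE,
    dest_offset  => v_dest_off,
    src_offset   => v_src_off,
    blob_csid    => NLS_CHARSET_ID('AL32UTF8'),
    lang_context => v_lang_ctx,
    warning      => v_warning);

  DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(v_clob, 2000, 1));
END;
/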

snmdla

Thanks to both of you. Would you agree that storing the data in an NVARCHAR2 column would be a good approach, in the absence of the option to migrate the whole (multi-purpose) DB to a Unicode character set like AL32UTF8?

I suspect that software vendors quite often opt for such solutions in an effort to avoid database-specific implementations (PostgreSQL, for instance, appears to have no NVARCHAR2 datatype).

Thanks, Tom

JustinCave

If you really, really can't migrate the database to the proper character set, using an NVARCHAR2 for any column that needs to store Unicode data would be a reasonable alternative. Be aware of the space implications of that: NVARCHAR2 will generally use the UTF-16 character set, which requires at least 2 bytes of storage for every character (UTF-8 uses only 1 byte for any ASCII character). You'll also need to ensure that your NLS setup is correct; the existing setup, where the clients are lying about what they are sending and retrieving and bypassing character set conversion, is unlikely to work once you start storing data properly.
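To illustrate (the table name is made up): an NVARCHAR2 column stores data in the national character set, typically AL16UTF16, independent of the database character set.

-- Minimal sketch: NVARCHAR2 holds Unicode even on a WE8MSWIN1252 database.
-- UNISTR builds a Unicode string from escape sequences.
CREATE TABLE nls_demo (id NUMBER, txt NVARCHAR2(200));
INSERT INTO nls_demo VALUES (1, UNISTR('Se\00F1or \20AC'));  -- 'Señor €'
SELECT id, txt FROM nls_demo;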

The right answer, though, is almost always to migrate the database.  If an application doesn't need Unicode data, changing the database character set should have no impact on the client.  The client can still request data in the Windows-1252 character set and can still send data in the Windows-1252 character set.
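For example, after migrating the database to AL32UTF8, a legacy Windows client would simply keep declaring its real character set in NLS_LANG and let the server convert on the fly (the language and territory parts here are just placeholders):

set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252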

Justin
