
Java EE (Java Enterprise Edition) General Discussion

Parsing xml with unicode characters

843834, Sep 13 2002 (edited Sep 20 2002)
Hi,

I'm building applications that use both Latin and Cyrillic characters. One application creates the XML file; another reads the file and presents it to the user.

First I created the XML files by hand, using character entities to insert Cyrillic characters and Latin characters with accents. When I loaded such a file into my application, everything worked fine. But this takes a lot of time, so I wrote an editor to create the files (I can insert Cyrillic characters easily with a keyboard mapping).

So I used my editor application to create the files encoded as UTF-8. When I loaded such a file into the other application, all Cyrillic characters were replaced by question marks (I used the same font). The same happens when I view the file in a text editor.
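One possible cause of the question marks, offered as a guess rather than a diagnosis: if the text passes through a charset that has no Cyrillic letters at any point (for example a Writer or a String.getBytes() call that uses a Latin platform default), Java silently substitutes '?'. A minimal sketch of that effect follows; the class name EncodingRoundTrip and the sample string are made up for illustration and are not part of either application.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    // Encode the text to bytes and decode it again with the same charset.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: Cyrillic capital Zhe followed by e-acute.
        String sample = "\u0416\u00e9";

        // UTF-8 can represent both characters, so the text survives.
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));

        // ISO-8859-1 has e-acute but no Cyrillic, so Zhe is lost.
        System.out.println(roundTrip(sample, StandardCharsets.ISO_8859_1).equals(sample));
    }
}
```

If the second line prints false on the real data path, the loss happens before the parser ever sees the file, so the place to look is where the editor turns characters into bytes.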

Then I used the same editor application to create the files encoded as ISO-8859-1. Now the Cyrillic characters show up fine, but the Latin characters with accents do not: they show up as squares. When I view that file in a text editor, the Cyrillic characters are replaced by character entities, but Latin characters with accents are not replaced by entities (or by question marks).
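The behaviour described here, where characters missing from the output encoding are written as character references, suggests one way to get every non-ASCII character escaped: serialize as US-ASCII. The sketch below uses the standard JAXP Transformer rather than the Xerces serializer mentioned in this post, so it is a swapped-in technique, and it relies on the JDK's built-in serializer escaping unrepresentable characters. The class name AsciiXmlWriter, the element name text, and the sample string are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AsciiXmlWriter {
    // Build a tiny document holding one Cyrillic and one accented Latin character.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("text"); // hypothetical element name
        root.setTextContent("\u0416\u00e9");
        doc.appendChild(root);
        return doc;
    }

    // Serialize to bytes in the requested encoding and return them as a String.
    static String serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return new String(out.toByteArray(), encoding);
    }

    // Parse the serialized XML again and return the root element's text.
    static String readBack(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = serialize(sampleDocument(), "US-ASCII");
        System.out.println(xml);
        System.out.println(readBack(xml).equals("\u0416\u00e9"));
    }
}
```

A file written this way contains only ASCII bytes, so it should look the same in any text editor, and a parser turns the references back into the original characters.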

I also tried the UTF-16 encoding, but then I get an exception:
org.xml.sax.SAXParseException: The encoding "UTF-16" is not supported.

How can I solve this problem?
Ideally all Latin characters with accents should also be replaced by character entities, as the Cyrillic characters are when using ISO-8859-1.
Or should I change the SAX parser I use to load the file? Should I set the encoding for the SAX parser, and if so, how? When I save the file with my editor application, I set the encoding with
		OutputFormat format = new OutputFormat(document,"ISO-8859-1",true);
Does a similar method exist to change the encoding when parsing the file?
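On the parsing side, the standard SAX API offers InputSource.setEncoding, which tells the parser which encoding to assume for a byte stream. A minimal sketch, assuming the file is handed to the parser as raw bytes (an InputStream, not a Reader); the class name EncodedSaxRead and the method readText are hypothetical. Passing null leaves the decision to the encoding named in the file's own XML declaration, which is usually enough when the declaration matches the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EncodedSaxRead {
    // Parse a byte stream and collect all character data.
    // If encoding is non-null, it is passed to the parser via the InputSource.
    static String readText(InputStream bytes, String encoding) throws Exception {
        InputSource source = new InputSource(bytes);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample file content, encoded as UTF-8 bytes.
        byte[] file = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><text>\u0416\u00e9</text>"
                .getBytes(StandardCharsets.UTF_8);
        String parsed = readText(new ByteArrayInputStream(file), "UTF-8");
        System.out.println(parsed.equals("\u0416\u00e9"));
    }
}
```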

I use the Xerces parser.

Thank you,
Don

Post Details

Locked on Oct 18 2002
Added on Sep 13 2002
11 comments
2,134 views