This is my biggest annoyance with xml. there are some characters which are completely invalid in xml 1.0 even if you encode the character using an entity. most of the control characters below 32 are invalid (except for the few obvious exceptions like newlines and tabs).
compounded with this issue is the longstanding bug in the java xml serializer which happily encodes invalid characters into an xml document. the parser correctly fails when reading these characters, so you ultimately can generate xml files that you cannot read. (i've been bitten by this bug in two separate jobs now so, yes, i'm bitter).
fixing this problem is not trivial. we ended up grabbing a copy of the XMLChar.java file from apache xerces and using the "isInvalidChar" method to check character validity. there is not an easy way to plug this into the xml serialization process however, so you generally have to pre-check your data.
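For what it's worth, a pre-check like that can be sketched without pulling in Xerces. The version below is only an illustration: it assumes the XML 1.0 `Char` production (tab, newline, carriage return, then #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF), and the class and method names (`XmlCharFilter`, `isValidXml10`, `stripInvalid`) are made up for this example, not taken from any library.

```java
/** Hypothetical helper: filters out code points that XML 1.0 forbids outright. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of s with every invalid code point dropped. */
    static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // Iterate by code point so surrogate pairs are checked as one character;
            // an unpaired surrogate falls in the forbidden range and is dropped.
            int cp = s.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

You would run every database string through `XmlCharFilter.stripInvalid(...)` before handing it to the serializer. Dropping the characters is just one policy; replacing them with a space or rejecting the record would work the same way.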
1. completely invalid characters: this is what i talked about in my original reply. these are characters which are illegal in any xml document, in any form of encoding
2. illegally encoded xml characters: this is a case where the creator of an xml document did not correctly encode characters which are otherwise legal
The problem you are seeing is 2. this has nothing to do with the characters themselves, and everything to do with how the xml document is generated. the characters "è, À, ì" are perfectly legal in xml documents as long as they are encoded correctly. if you are trying to generate files from database strings, my guess would be that you are not writing the files with the correct character encoding. the character encoding on the files you write should match the encoding specified in the xml header (in your example, utf-8). if you show some example code, maybe someone could give you some more pointers.
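To make the encoding point concrete, here is a rough sketch of writing with an explicit charset so that the bytes match the header. It assumes the file is produced by hand through a `Writer`; the class name `Utf8XmlOutput` and its `write` method are hypothetical. The idea is simply to wrap the stream in an `OutputStreamWriter` with `StandardCharsets.UTF_8` instead of relying on the platform default encoding.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: writes an XML header and body as UTF-8 bytes. */
final class Utf8XmlOutput {
    private Utf8XmlOutput() {}

    /**
     * Writes a UTF-8 declaration followed by body to out.
     * The body is assumed to be already well-formed, escaped XML.
     */
    static void write(OutputStream out, String body) {
        try {
            // Explicit charset: the bytes on disk now agree with encoding="UTF-8".
            Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            w.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            w.write(body);
            w.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file you would pass a `FileOutputStream` (inside try-with-resources) as `out`. If the writer is created without a charset, "è" may be written in the platform default encoding (for example as a single byte), and a parser that trusts the UTF-8 header will then reject it.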
I didn't understand what you mean by:
my guess would be that you are not writing the files with the correct character encoding. the character encoding on the files you write should match the encoding specified in the xml header (in your example, utf-8).
When I create the XML file from the database records, I specify <?xml version="1.0" encoding="UTF-8"?> in the XML file. Shouldn't that take care of the encoding of whatever characters or strings we write into the XML file?
My problem is that some of the XML files I created from database records do not open in IE. It shows the following error:
An invalid character was found in text content. Error processing resource line number .....
I need to find out which characters or strings are causing this.
I need some filter method that finds the characters causing the above error.
The error indicates that there are some invalid XML characters which are not allowed in an XML file (I am using UTF-8 encoding).
I have written code like:
private static String replaceInvalidXmlCharacter(String s)
Does this method satisfy the first point you mentioned? 1. completely invalid characters: this is what i talked about in my original reply. these are characters which are illegal in any xml document, in any form of encoding
But this method is not finding or replacing the characters which caused the error I mentioned above.
You need to write the XML with the standard APIs. They will take care of escaping etc. If you write the XML directly with I/O primitives you need to do the escaping yourself, in accordance with the rules of XML. Seemingly you aren't doing that correctly.
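As an illustration of what "standard APIs" can look like, here is a minimal sketch using the StAX `XMLStreamWriter` that ships with the JDK (`javax.xml.stream`). The wrapper class `EscapingXmlWriter` and its `toXml` method are hypothetical names for this example, and it writes to a `StringWriter` only to keep the sketch short; a real program would write to a file stream instead.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper: builds a one-element document and lets StAX do the escaping. */
final class EscapingXmlWriter {
    private EscapingXmlWriter() {}

    static String toXml(String elementName, String text) {
        StringWriter sink = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(sink);
            xml.writeStartDocument();
            xml.writeStartElement(elementName);
            // writeCharacters escapes markup characters such as '<' and '&'.
            xml.writeCharacters(text);
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return sink.toString();
    }
}
```

Calling `EscapingXmlWriter.toXml("record", "a < b & c")` yields a document whose text content is written as `a &lt; b &amp; c`. Note that escaping is a separate matter from the completely invalid characters discussed earlier in the thread, so the data may still need a pre-check.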