This discussion is archived
5 Replies Latest reply: Dec 24, 2010 12:57 AM by EJP

Invalid XML characters

user72848 - oracle Newbie
Hi,

How can we check whether a given string contains only valid XML characters?

Some Unicode characters are not allowed in XML.

If a string contains invalid XML characters, I need to find that string.


Thanks
  • 1. Re: Invalid XML characters
    jtahlborn Expert
    This is my biggest annoyance with xml. there are some characters which are completely invalid in xml 1.0 even if you encode the character using an entity. most of the control characters below 32 are invalid (except for the few obvious exceptions like newlines and tabs).

    compounded with this issue is the longstanding bug in the java xml serializer which happily encodes invalid characters into an xml document. the parser correctly fails when reading these characters, so you ultimately can generate xml files that you cannot read. (i've been bitten by this bug in two separate jobs now so, yes, i'm bitter).

    fixing this problem is not trivial. we ended up grabbing a copy of the XMLChar.java file from apache xerces and using the "isInvalid" method to check character validity. there is not an easy way to plug this into the xml serialization process however, so you generally have to pre-check your data.
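If copying XMLChar.java is not an option, the pre-check described above can be approximated by hand. The following is a minimal sketch, assuming the XML 1.0 `Char` production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF); the class and method names are hypothetical and are not part of Xerces.

```java
/** Sketch: checks strings against the XML 1.0 "Char" production. */
class XmlCharCheck {

    /** True if the code point may appear in an XML 1.0 document. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the char index of the first invalid code point, or -1 if there is none. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // Read whole code points so that surrogate pairs count as one character;
            // an unpaired surrogate falls outside every range above and is rejected.
            int cp = s.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("caf\u00E8"));      // accented letter, valid
        System.out.println(firstInvalidIndex("bad\u0001value")); // control character, invalid
    }
}
```

A check like this only catches characters that are illegal in every XML 1.0 document. It says nothing about whether a file's bytes match its declared encoding.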
  • 2. Re: Invalid XML characters
    user72848 - oracle Newbie
    Actually, I have to clean up my database records: if a record's string contains any invalid XML character, I need to find that record.

    I have already tried the XMLChar.java file from Apache Xerces and its "isInvalid" and "isValid" methods, but I am not clear on what these methods do.

    When I run isInvalid or isValid, they do not identify the invalid XML characters in a string; everything is reported as valid, even characters that are not valid.

    As far as I know, these 3 characters è, À, ì are invalid XML characters, but when we pass them to the isInvalid method it does not identify them as invalid.

    For example, I have an XML file:
    <?xml version="1.0" encoding="UTF-8" ?>
    <foo>
    è, À, ì
    </foo>

    When I try to open this file in IE, I get:


    The XML page cannot be displayed
    Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.


    --------------------------------------------------------------------------------

    An invalid character was found in text content. Error processing resource 'file:///C:/Documents and Settings/vishwa/Deskt...


    Any ideas? Please help.
  • 3. Re: Invalid XML characters
    jtahlborn Expert
    you are confusing 2 different situations.

    1. completely invalid characters: this is what i talked about in my original reply. these are characters which are illegal in any xml document, in any form of encoding
    2. illegally encoded xml characters: this is a case where the creator of an xml document did not correctly encode characters which are otherwise legal

    The problem you are seeing is 2. this has nothing to do with the characters themselves, and everything to do with how the xml document is generated. the characters "è, À, ì" are perfectly legal in xml documents as long as they are encoded correctly. if you are trying to generate files from database strings, my guess would be that you are not writing the files with the correct character encoding. the character encoding on the files you write should match the encoding specified in the xml header (in your example, utf-8). if you show some example code, maybe someone could give you some more pointers.
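To illustrate the point about matching encodings: a minimal sketch, assuming the file is written by hand with java.io, in which the writer is given UTF-8 explicitly so that the bytes on disk agree with the `encoding="UTF-8"` header. The class name, the `foo` element and the use of a temp file are illustrative assumptions, and the text is assumed to contain no markup characters, since nothing is escaped here.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: write an XML file whose bytes really are UTF-8, as its header claims. */
class Utf8XmlFile {

    static void write(Path file, String text) throws IOException {
        // Naming the charset explicitly avoids the platform default (for example
        // windows-1252), which would contradict the UTF-8 declaration below.
        try (Writer out = new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<foo>" + text + "</foo>\n"); // assumes text needs no escaping
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("foo", ".xml");
        write(file, "\u00E8, \u00C0, \u00EC");
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

A `FileWriter` created without a charset uses the platform default encoding, which is one common way to end up with a header that says UTF-8 over bytes that are not.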
  • 4. Re: Invalid XML characters
    user72848 - oracle Newbie
    Thanks for your reply.

    I didn't understand what you mean by:
    my guess would be that you are not writing the files with the correct character encoding. the character encoding on the files you write should match the encoding specified in the xml header (in your example, utf-8).

    When I create the XML file from the database records, I specify <?xml version="1.0" encoding="UTF-8"?> in the XML file. That should take care of the encoding of whatever characters or strings we write into the file, right?

    My problem is that some of the XML files I have created from database records do not open in IE; it shows the following error:

    An invalid character was found in text content. Error processing resource line number .....

    I need to find the characters or strings that are causing this.

    I need some filter method that finds the characters causing the above error.

    The error says there are some invalid XML characters that are not allowed in an XML file (I am using UTF-8 encoding).

    I have written code like this:
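    private static String replaceInvalidXmlCharacter(String s)
    {
        // Negated XML 1.0 Char class: matches everything that is NOT a legal character.
        // In Java regex \xhh reads only two hex digits, so BMP values use \\uhhhh and
        // supplementary code points use \\x{h...h} (which requires Java 7 or later).
        String invalid = "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]";

        return s.replaceAll(invalid, "");
    }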

    Does this method handle the first point you mentioned? "1. completely invalid characters: this is what i talked about in my original reply. these are characters which are illegal in any xml document, in any form of encoding"

    BUT THIS METHOD IS NOT FINDING OR REPLACING THE CHARACTERS WHICH CAUSED THE ERROR I MENTIONED ABOVE.

    PLEASE REPLY WITH YOUR INPUT.
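If the real cause is situation 2 from the earlier reply (bytes that do not match the declared encoding), a character filter on the Java string will not find anything. One way to locate the problem is to decode the generated file strictly as UTF-8 and report where decoding fails. The following is a minimal sketch under that assumption; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Sketch: find the first byte sequence that is not valid UTF-8. */
class Utf8Check {

    /** Returns the byte offset of the first malformed sequence, or -1 if the input is clean. */
    static int firstMalformedOffset(byte[] bytes) {
        // REPORT makes the decoder stop at bad input instead of silently
        // substituting the replacement character.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never decodes to more chars than it has bytes.
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        CoderResult result = decoder.decode(in, out, true);
        // On error, the input buffer is positioned at the start of the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        byte[] good = "<foo>\u00E8</foo>".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "<foo>\u00E8, \u00C0</foo>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstMalformedOffset(good));
        System.out.println(firstMalformedOffset(bad));
    }
}
```

Running this over the bytes of a file that IE rejects (read with `Files.readAllBytes`) should point at the first offending byte, which can then be traced back to the database record that produced it.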
  • 5. Re: Invalid XML characters
    EJP Guru
    Cut out the SHOUTING thanks.

    You need to write the XML with the standard APIs. They will take care of escaping etc. If you write the XML directly with I/O primitives you need to do the escaping yourself, in accordance with the rules of XML. Seemingly you aren't doing that correctly.
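As an illustration of writing XML through a standard API: a minimal sketch using StAX (`javax.xml.stream`). Choosing StAX is an assumption, since the reply does not name a particular API, and the class name and `foo` element are hypothetical. The writer is created on a byte stream with an explicit UTF-8 encoding, so the declared and actual encodings agree, and `writeCharacters` escapes markup characters such as `<` and `&`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Sketch: let the StAX writer handle encoding and escaping. */
class StaxEscapeDemo {

    static String toXml(String text) throws XMLStreamException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The same encoding is given to the factory and to the XML declaration.
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(bytes, "UTF-8");
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("foo");
        xml.writeCharacters(text); // escapes markup characters in the text
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        // Decode only to show the result; a real program would write the bytes to a file.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(toXml("\u00E8 < \u00C0 & \u00EC"));
    }
}
```

As noted in the first reply, the serializer may still let through characters that are illegal in XML 1.0 (such as most control characters below 32), so a separate pre-check of the data may still be needed.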
