5 Replies. Latest reply: Dec 24, 2010 2:57 AM by EJP

    Invalid XML characters

    vishwa
      Hi,

      How can we tell whether a given string contains only valid XML characters?

      Some Unicode characters are not allowed in XML.

      If a string contains invalid XML characters, I need to find that string.


      Thanks
        • 1. Re: Invalid XML characters
          jtahlborn
          This is my biggest annoyance with XML. Some characters are completely invalid in XML 1.0, even if you encode the character as a numeric character reference (e.g. &#1;). Most of the control characters below 32 are invalid; the only exceptions are tab, line feed, and carriage return.

          Compounding this is the longstanding bug in the Java XML serializer, which happily writes invalid characters into an XML document. The parser correctly fails when reading these characters, so you can end up generating XML files that you cannot read. (I've been bitten by this bug in two separate jobs now so, yes, I'm bitter.)
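
          A quick way to see the parser side of this, as a rough sketch using the JDK's built-in DOM parser (ParseCheck and parses are names made up for illustration): a character reference to tab is accepted, while U+0001 is rejected whether it appears raw or as a character reference.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ParseCheck {
    /** Returns true if the JDK's default parser accepts the text as well-formed XML. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an invalid character
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<foo>&#9;</foo>"));    // reference to tab
        System.out.println(parses("<foo>&#1;</foo>"));    // reference to U+0001
        System.out.println(parses("<foo>\u0001</foo>"));  // raw U+0001
    }
}
```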

          Fixing this problem is not trivial. We ended up grabbing a copy of the XMLChar.java file from Apache Xerces and using its "isInvalid" method to check character validity. There is no easy way to plug this into the XML serialization process, however, so you generally have to pre-check your data.
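
          If you would rather not copy the Xerces class, the check is small enough to write by hand from the Char production in the XML 1.0 spec. This is only a sketch of such a pre-check (XmlCharCheck and its methods are made-up names, not the Xerces API):

```java
class XmlCharCheck {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Index of the first character not allowed in XML 1.0, or -1 if there is none. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate comes back
            // as itself and falls outside every allowed range above.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

          Run each string through firstInvalidIndex before serializing; any result other than -1 points at a character the parser will reject.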
          • 2. Re: Invalid XML characters
            vishwa
            Actually, I have to clean up my database records: if a record's string contains any invalid XML character, I need to find that record.

            I have already tried the XMLChar.java file from Apache Xerces and its "isInvalid" and "isValid" methods, but I am not clear on what these methods do.

            When I run isInvalid or isValid, they do not identify the invalid XML characters in a string; everything is reported as valid, even characters that are not.

            As far as I know, these 3 characters, è, À, ì, are invalid XML characters, but when I pass them to the isInvalid method it does not identify them as invalid.

            For example, I have an XML file:
            <?xml version="1.0" encoding="UTF-8" ?>
            <foo>
            è, À, ì
            </foo>

            When I try to open this file in IE, I get:


            The XML page cannot be displayed
            Cannot view XML input using XSL style sheet. Please correct the error and then click the Refresh button, or try again later.


            --------------------------------------------------------------------------------

            An invalid character was found in text content. Error processing resource 'file:///C:/Documents and Settings/vishwa/Deskt...


            Any ideas? Please help.
            • 3. Re: Invalid XML characters
              jtahlborn
              You are confusing two different situations.

              1. Completely invalid characters: this is what I talked about in my original reply. These are characters which are illegal in any XML document, in any encoding.
              2. Illegally encoded XML characters: this is the case where the creator of an XML document did not correctly encode characters which are otherwise legal.

              The problem you are seeing is 2. This has nothing to do with the characters themselves, and everything to do with how the XML document is generated. The characters "è, À, ì" are perfectly legal in XML documents as long as they are encoded correctly. If you are generating files from database strings, my guess would be that you are not writing the files with the correct character encoding. The character encoding of the files you write should match the encoding specified in the XML header (in your example, UTF-8). If you show some example code, maybe someone can give you some more pointers.
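
              To make the difference concrete, here is a rough sketch (EncodingMismatchDemo and its methods are hypothetical names, and it assumes the JDK's default parser). The same text under the same UTF-8 header parses when the bytes really are UTF-8 and fails when they were written as ISO-8859-1:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EncodingMismatchDemo {
    /** Builds a document whose header always claims UTF-8, encoded with the given charset. */
    static byte[] encode(String text, Charset actual) {
        // The text is pasted in unescaped; fine for this demo, not for real data.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" + text + "</foo>";
        return xml.getBytes(actual);
    }

    /** Returns true if the parser can read the bytes, trusting the declared encoding. */
    static boolean parses(byte[] bytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // keep stderr quiet
            builder.parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            return false; // malformed byte sequence or other fatal error
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E8, \u00C0, \u00EC"; // the three characters from the question
        System.out.println(parses(encode(text, StandardCharsets.UTF_8)));
        System.out.println(parses(encode(text, StandardCharsets.ISO_8859_1)));
    }
}
```

              In the second case the characters are legal; only the bytes on disk disagree with the header.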
              • 4. Re: Invalid XML characters
                vishwa
                Thanks for your reply.

                I didn't understand what you mean by:
                "my guess would be that you are not writing the files with the correct character encoding. the character encoding on the files you write should match the encoding specified in the xml header (in your example, utf-8)."

                When I create the XML file from the database records, I specify <?xml version="1.0" encoding="UTF-8"?> in the XML file. Shouldn't that take care of encoding whatever characters or strings we write into the XML file?

                My problem is that some of the XML files I have created from database records do not open in IE; it shows the following error:

                An invalid character was found in text content. Error processing resource line number .....

                I need to find the characters or strings that are causing this.

                I need some filter method that finds the characters causing the above error.

                The error indicates that there are some invalid XML characters which are not allowed in an XML file (I am using UTF-8 encoding).

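                I have written code like this:
                private static String replaceInvalidXmlCharacter(String s) {
                    // Character class of everything outside the XML 1.0 Char production.
                    // A plain \\x escape reads only two hex digits in a Java regex, so the
                    // ranges use \\uXXXX and, for supplementary characters, \\x{...} (Java 7+).
                    String invalid = "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]";
                    return s.replaceAll(invalid, "");
                }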

                Does this method handle the first case you mentioned, "1. completely invalid characters", i.e. characters which are illegal in any XML document, in any form of encoding?



                BUT THIS METHOD IS NOT FINDING OR REPLACING THE CHARACTERS WHICH CAUSED THE ERROR I MENTIONED ABOVE.

                PLEASE REPLY WITH YOUR INPUT.
                • 5. Re: Invalid XML characters
                  EJP
                  Cut out the SHOUTING, thanks.

                  You need to write the XML with the standard APIs. They will take care of escaping etc. If you write the XML directly with I/O primitives you need to do the escaping yourself, in accordance with the rules of XML. Seemingly you aren't doing that correctly.
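
                  As a rough sketch of what "standard APIs" could look like here, assuming StAX (javax.xml.stream) is acceptable; the class name is made up and the foo element is just the one from the example earlier in the thread. The writer is given the same encoding that goes into the header, and writeCharacters escapes markup characters such as < and &:

```java
import java.io.ByteArrayOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StaxWriteDemo {
    /** Writes <foo>text</foo> as UTF-8 bytes, letting the writer do the escaping. */
    static byte[] toXmlBytes(String text) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // The stream encoding and the declared encoding are set to the same value.
            XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("foo");
            writer.writeCharacters(text); // escapes <, > and & in the text
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
            return out.toByteArray();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

                  As reply 1 points out, characters that are invalid in XML 1.0 still need to be pre-checked; don't count on the writer to reject them.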