Forum Stats

  • 3,874,521 Users
  • 2,266,749 Discussions
  • 7,911,877 Comments

Discussions

Error LPX-00245: extra data after end of document

daphne_121
daphne_121 Member Posts: 4 Green Ribbon
edited Oct 14, 2022 4:58AM in SQL & PL/SQL

Hello everyone,

I have a problem inserting xml file. I have to create directory and passing from it. But my xml file like this:

<?xml version = "1.0" encoding="utf-8"?>
<a> <date>20221011</date></a>
<b>
<b1>
<field1>092010</field1>
</b1>
<b1>
<field1>093456</field1>
</b1>
....
</b>

I want to import <field1>'s data into my table, not <a>'s. But of course there's an error for multiple roots. I want to remove or delete <a> tag or add <a> and <b> tags into new root.

What should I do?

Thanks,

Daphne.

Best Answers

  • Paulzip
    Paulzip Member Posts: 8,802 Blue Diamond
    edited Oct 11, 2022 11:34PM Answer ✓

    Single root = XML Document, processing instruction is allowed <?xml ... ?>

    Multiple roots = XML Content, processing instruction not allowed

    So yours is neither valid XML document or content. Your best choice is to strip out the processing instruction, treat as content and process with XML DB functions.

    select *
    from   XMLTable(
             '/b/b1/field1'
             passing XMLParse(content regexp_replace(
            '<?xml version="1.0" encoding="utf-8"?>
            <a> <date>20221011</date></a>
            <b>
              <b1>
                <field1>092010</field1>
              </b1>
              <b1>
                <field1>093456</field1>
              </b1>
            </b>'
            , '<\?xml[^?]+\?>', null, 1, 1, 'i'))
             columns
               order_id for ordinality,
               field1   varchar2(10) path '.'
           )
    / 
    
      ORDER_ID FIELD1    
    ---------- ----------
             1 092010    
             2 093456    
    
    
    Regex explanation :
    <\?xml[^?]+\?>
    --------------
    
    Match the character “<” literally «<»
    Match the question mark character «\?»
    Match the character string “xml” literally
    Match any single character that is NOT present in the list below «[^?]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The question mark character «?»
    Match the question mark character «\?»
    Match the character “>” literally «>»
    


  • BluShadow
    BluShadow Member, Moderator Posts: 42,561 Red Diamond
    Answer ✓

    So, the other option, assuming there is just one <b> element in the XML, is to remove the <a> elements(s)... (also assuming there aren't multiple <a> elements around the <b> element, otherwise <b> in the middle of them would also end up being stripped out by the following regular expression, what with ".*" being greedy)

    with t(txt) as (select '<?xml version = "1.0" encoding="utf-8"?>
    <a>
      <date>20221011</date>
    </a>
    <b>
      <b1>
        <field1>092010</field1>
      </b1>
      <b1>
        <field1>093456</field1>
      </b1>
    </b>' from dual
    )
    --
    -- end of sample text
    --
    select x.field1
    from   t
           cross join
           xmltable('/b/b1'
                   passing xmltype(regexp_replace(txt, '<a>.*</a>', '', 1, 0, 'in'))
                   columns field1 varchar2(6) path './field1'
                   ) x
    /
    
    FIELD1
    ------
    092010
    093456
    


    That shouldn't matter if there's a DOCTYPE declaration or not, as we're not trying to find the right place to add a root element, but assuming there is just one <b> element that will act as the root once the <a> element(s) are removed.

  • daphne_121
    daphne_121 Member Posts: 4 Green Ribbon
    edited Oct 13, 2022 6:49PM Answer ✓

    @Paulzip @BluShadow @mathguy Thank you all!

    I tried your suggestions and they worked perfectly:

    declare
    begin 
    insert into xml_tab (
           field1)
    select x.*
    from  (select to_clob(BFILENAME ('TEST_DIR', 'test.xml')) from dual)
        cross join
        xmltable('/b/b1'
            passing xmltype(regexp_replace((select to_clob(BFILENAME ('TEST_DIR', 'test.xml')) from dual), '<a>.*</a>', '', 1, 0, 'in'))
            columns 
             field1 number(8) path './field1'
            ) x;
    commit;        
    end;  
    

    I tried to use WITH clause instead of "SELECT" part but it didn't work. It's alright, it works at least!

    Thanks a lot!

«1

Answers

  • mathguy
    mathguy Member Posts: 10,900 Black Diamond

    You are giving two different (and contradictory) descriptions to your problem. In one place you wrote "... problem inserting xml file" but in another you wrote "... import <field1>'s data into my table".

    Which is it? Do you need to insert xml file (whatever that means)? Or do you need to extract data from the <field1> elements and import that to an existing table?

    Where does the XML document exist? Is it in a text file on the server? On your client computer? Or does it exist as a text column (possibly CLOB) in the database already?

    What is the exact structure of the table you must insert into or import to?

    In any case, if you have text like that, it is probably easiest to pre-process the text, to add a <root> element to the whole thing. The string '<root>' must be added (concatenated) right after the first occurrence of '?>' in the string, and the string '</root>' must be added at the end of the string. However, this assumes there are no other things at the top of the document (such as DOCTYPE declarations) - you will need to be 100% sure you understand your document before using such imprecise tools.

    Best solution though? An XML "document" shouldn't look like that to begin with. Where did it come from? No "source" that produces ill-formed XML should be tolerated; see where the document came from, and complain about its not being well-formed. You shouldn't have to fix such things on the back-end.

    daphne_121
  • daphne_121
    daphne_121 Member Posts: 4 Green Ribbon

    First of all, thank you for your comprehensive answer!

    I couldn't explain myself for my bad English, sorry. I googled this problem for 3 days, but I couldn't find exact solution. Only solution I could find was add '<root>' manually and it worked. But there's so many .xml files on the server. So I have to create procedure or function that include directory to add file from server and add '<root>' element automatically to parse '<field1>'s data to table that I create for this xml file. But i don't know how to add, if you know please help me.

    And I complained for this source problem but they said there's so many files so they can fix for next files not for old ones.

    I hope I've explained myself.

  • Paulzip
    Paulzip Member Posts: 8,802 Blue Diamond
    edited Oct 11, 2022 11:34PM Answer ✓

    Single root = XML Document, processing instruction is allowed <?xml ... ?>

    Multiple roots = XML Content, processing instruction not allowed

    So yours is neither valid XML document or content. Your best choice is to strip out the processing instruction, treat as content and process with XML DB functions.

    select *
    from   XMLTable(
             '/b/b1/field1'
             passing XMLParse(content regexp_replace(
            '<?xml version="1.0" encoding="utf-8"?>
            <a> <date>20221011</date></a>
            <b>
              <b1>
                <field1>092010</field1>
              </b1>
              <b1>
                <field1>093456</field1>
              </b1>
            </b>'
            , '<\?xml[^?]+\?>', null, 1, 1, 'i'))
             columns
               order_id for ordinality,
               field1   varchar2(10) path '.'
           )
    / 
    
      ORDER_ID FIELD1    
    ---------- ----------
             1 092010    
             2 093456    
    
    
    Regex explanation :
    <\?xml[^?]+\?>
    --------------
    
    Match the character “<” literally «<»
    Match the question mark character «\?»
    Match the character string “xml” literally
    Match any single character that is NOT present in the list below «[^?]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      The question mark character «?»
    Match the question mark character «\?»
    Match the character “>” literally «>»
    


  • mathguy
    mathguy Member Posts: 10,900 Black Diamond

    @Paulzip - I don't think that will work if the document includes a DOCTYPE declaration in addition to the XML declaration. I didn't spend much time testing, but I believe the "content" option to XMLParse will cause it to error out on DOCTYPE declarations.

    I mentioned this as a potential complication in my reply - it makes adding <root>...</root> harder too.

    daphne_121
  • BluShadow
    BluShadow Member, Moderator Posts: 42,561 Red Diamond
    Answer ✓

    So, the other option, assuming there is just one <b> element in the XML, is to remove the <a> elements(s)... (also assuming there aren't multiple <a> elements around the <b> element, otherwise <b> in the middle of them would also end up being stripped out by the following regular expression, what with ".*" being greedy)

    with t(txt) as (select '<?xml version = "1.0" encoding="utf-8"?>
    <a>
      <date>20221011</date>
    </a>
    <b>
      <b1>
        <field1>092010</field1>
      </b1>
      <b1>
        <field1>093456</field1>
      </b1>
    </b>' from dual
    )
    --
    -- end of sample text
    --
    select x.field1
    from   t
           cross join
           xmltable('/b/b1'
                   passing xmltype(regexp_replace(txt, '<a>.*</a>', '', 1, 0, 'in'))
                   columns field1 varchar2(6) path './field1'
                   ) x
    /
    
    FIELD1
    ------
    092010
    093456
    


    That shouldn't matter if there's a DOCTYPE declaration or not, as we're not trying to find the right place to add a root element, but assuming there is just one <b> element that will act as the root once the <a> element(s) are removed.

  • Paulzip
    Paulzip Member Posts: 8,802 Blue Diamond

    @mathguy

    Any DOCTYPE would be invalid too, as they are also for documents not content. Also, any DTD in the DOCTYPE would not be able to validate as these won't be valid for multi root. The same regex approach could be taken...

    select *
    from   XMLTable(
             '/b/b1/field1'
             passing XMLParse(content regexp_replace(
            '<?XML version="1.0" encoding="utf-8"?>
             <!DOCTYPE test PUBLIC "/blah/blah" [
             <!ELEMENT b (b1+)>
             <!ELEMENT b1 (field1)>
             <!ELEMENT field1 (#PCDATA)>
             ]>
             <a> <date>20221011</date></a>
             <b>
               <b1>
                 <field1>092010</field1>
               </b1>
               <b1>
                 <field1>093456</field1>
               </b1>
             </b>'
            , '<\?xml[^?]+\?>|<!doctype[^]]+\]>', null, 1, 0, 'i'))
             columns
               order_id for ordinality,
               field1   varchar2(10) path '.'
           )
    / 
    
      ORDER_ID FIELD1    
    ---------- ----------
             1 092010    
             2 093456   
    

    But personally, if I were OP, I'd be asking the question why invalid junk XML is occurring.

  • mathguy
    mathguy Member Posts: 10,900 Black Diamond

    @Paulzip

    The problem with such solutions (remove non-content from the input, or add root tags in just the right places) will always have further exceptions that eventually will have to be handled. In the end one will need to write a quasi-XML parser just to get this task right!

    For example, for your solution, suppose the DTD includes the following entity declaration:

             <!ENTITY right_bracket "]">
    

    The regexp search will not match anymore - the first occurrence of right-bracket is not immediately followed by a right angled bracket, so there is no match, so the entire DOCTYPE declaration remains in place, and then XMLParse errors out.

    = = = = =

    The OP already said they made noise about the broken documents and were told that this may be fixed in the future, but they must deal with legacy documents that no one will fix for them.

    I am just not convinced that the "fixing" should be done in the database. There may be better tools that one could use - at least ones that support more advanced regexp features. This may be a good question on a different forum (perhaps even here on Oracle's platform, but a forum dealing with XML issues): are there any software tools that can help "fix" broken XML documents? The question is, I would guess, not uncommon.

    daphne_121
  • BluShadow
    BluShadow Member, Moderator Posts: 42,561 Red Diamond

    are there any software tools that can help "fix" broken XML documents?

    Aside from common things like missing end tags or suchlike, you'd be hard pressed to find any tool that can deal with the numerous ways people break XML formats.

    Fortunately, most XML content is generated by code/tools that are designed to generate XML, and will do so correctly in most cases. and not well formed documents are usually the result of manual concatenation of text, rather than built in XML functionality... though we could even do it with Oracle XML functionality if we try...

    SQL> with t(a, b) as (
      2    select 'Here is A valid XML'
      3          ,'Here is B valid XML'
      4    from dual
      5   )
      6  select xmlroot(
      7           xmlforest(a,b)
      8           , version '1.0'
      9           , standalone yes
     10         ) as result
     11  from   t
     12  /
    
    RESULT
    ----------------------------------------
    <?xml version="1.0" standalone="yes"?>
    <A>Here is A valid XML</A>
    <B>Here is B valid XML</B>
    
    
    

    So, even Oracle will happily create what appears to be two root elements with a prolog. And we can break it even further...

    SQL> with t(a, b) as (
      2    select 'Here is A valid XML'
      3          ,'Here is B valid XML'
      4    from dual
      5   )
      6  select xmlelement("X",
      7           xmlroot(
      8             xmlforest(a,b)
      9             , version '1.0'
     10             , standalone yes
     11           )
     12         ) as result
     13  from   t
     14  /
    
    RESULT
    -------------------------------------------------
    <X><?xml version="1.0" standalone="yes"?>
    <A>Here is A valid XML</A>
    <B>Here is B valid XML</B>
    </X>
    

    Of course, I completely agree with the "go back to source and get it fixed there".

    Other than that, some manual processing with an o/s script (not even sure we could process it with XSLT?), or regexp in the database would be required... assuming the issue is consistent across all the required documents.

    daphne_121
  • daphne_121
    daphne_121 Member Posts: 4 Green Ribbon
    edited Oct 13, 2022 6:49PM Answer ✓

    @Paulzip @BluShadow @mathguy Thank you all!

    I tried your suggestions and they worked perfectly:

    declare
    begin 
    insert into xml_tab (
           field1)
    select x.*
    from  (select to_clob(BFILENAME ('TEST_DIR', 'test.xml')) from dual)
        cross join
        xmltable('/b/b1'
            passing xmltype(regexp_replace((select to_clob(BFILENAME ('TEST_DIR', 'test.xml')) from dual), '<a>.*</a>', '', 1, 0, 'in'))
            columns 
             field1 number(8) path './field1'
            ) x;
    commit;        
    end;  
    

    I tried to use WITH clause instead of "SELECT" part but it didn't work. It's alright, it works at least!

    Thanks a lot!

  • Paulzip
    Paulzip Member Posts: 8,802 Blue Diamond

    @daphne_121

    So you get help and mark your own reply as the correct answer? Mmm ok.

    BluShadow