13 Replies Latest reply: Nov 30, 2012 8:27 AM by user2280597 RSS

    bfilename based load of xml table results in garbled data

    user2280597
      When I load an xml file using bfilename, the values for some attributes do not match the xml file as I expected.

      Here is the minimal code that demonstrates the problem:
      create table xxln.t_xml_5a
      (
          xdata  XMLTYPE
      ) 
      
      -- load xml from file
      insert into xxln.t_xml_5a
      SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta.xml'), NLS_CHARSET_ID ('AL32UTF8'))  FROM   DUAL
      commit
      
      select xdata from xxln.t_xml_5a
      
       -- file contents, .... indicates attr omitted for brevity
       <PCS PCSI="379424" ACTION="DEL" NAME="Energy Design Update"  .....
        LICENSOR_NUM="105836" LICENSOR_NAME="Wolters Kluwer US Corporation" 
        LEGACY_HCSI="379424"/>
      
       -- actually loaded. Note LEGACY_HCSI field does not match input
       <PCS PCSI="379424" ACTION="DEL" NAME="Energy Design Update"  .....
           LICENSOR_NUM="105836" LICENSOR_NAME="Wolters Kluwer US Corporation" 
           LEGACY_HCSI="BNA (ET Version for E&amp;Y, D&amp;T &amp; KPMG)"/>
      
      -- the value for LEGACY_HCSI above is actually the LICENSOR_NAME from the previous record, 
      -- although this pattern is not always repeatable (the data actually loaded appears random in many cases) 
      LICENSOR_NAME="BNA (ET Version for E&amp;Y, D&amp;T &amp; KPMG)" 
      There are many more attributes in the xml file than shown, however I have reduced them down to the minimum that
      demonstrates the problem for posting purposes. The xml is in attribute normal form.

      One things I notice is that the data corruption is strongly (but not completely) correlated to attributes
      missing. In other words, for many of the corrupted attributes, another optional attribute is missing from the
      xml element. For testing purposes, I manually put in this missing attribute and the LEGACY_HCSI field was
      no longer corrupted but other populated attributes in the file were loaded as blank. I don't have control of the data
      feed xml, so I can't assume all the optional attributes are there for production.

      I must be missing something here or perhaps exceeding some limit perhaps for attribute text size. Why would the xml loaded not match what is in the file?

      Thanks for your help isolating this issue.

      Here is my version info:
      Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
      PL/SQL Release 11.2.0.2.0 - Production
      CORE     11.2.0.2.0     Production
      TNS for Linux: Version 11.2.0.2.0 - Production
      NLSRTL Version 11.2.0.2.0 - Production

      I've seen the same issue in TOAD and sqldeveloper.
        • 1. Re: bfilename based load of xml table results in garbled data
          odie_63
          Hi,

          Could you post something we can use to reproduce the issue? i.e. a clear test case, not something with ellipses.

          I've tried with the sample XML you gave, but it works, even with missing attributes.

          Thanks.
          • 2. Re: bfilename based load of xml table results in garbled data
            user2280597
            I tried to post the whole test file that recreates the problem, but I ran into the 30k post size limit. The xml file is about 75k. So, I deleted records until the file + post text was under 30k and tested to ensure the problem was still evident. And it wasn't :-( Should I break the xml up into 3 posts? Is there another way?

            This does tell me the problem might be file size or data content related.

            BTW, when I say "data corruption" I mean I'm doing something wrong to mess up the data, not that I think there is something wrong with Oracle XML DB itself. Or xml I'm given is not valid for some reason.

            I'll keep digging.
            • 3. Re: bfilename based load of xml table results in garbled data
              odie_63
              So, I deleted records until the file + post text was under 30k and tested to ensure the problem was still evident.
              Could you post this version then? The one you gave is far for 30k :)
              Should I break the xml up into 3 posts? Is there another way?
              You can do that too, or make it available at an online shared location.
              BTW, when I say "data corruption" I mean I'm doing something wrong to mess up the data
              Unless you've not told us everything, for now, I don't think of anything you're doing wrong.

              Is the target table really like the DDL you gave, hence with Binary XML storage?
              Are you using compression? Indexes?
              • 4. Re: bfilename based load of xml table results in garbled data
                user2280597
                I've uploaded the test xml data to the following link It contains the full test data file that reproduces the issue, for me at least. The delta0.xml test file is actually ~ 157K and contains 102 records.

                https://docs.google.com/open?id=0BzrfOQLNyN7seFF6ZGdCSzNEQ0U

                >
                Is the target table really like the DDL you gave, hence with Binary XML storage?
                >
                yes, thats exactly it. Of course, this is the simplest code that demonstrates the problem. The complete design has many more columns and more complexity. But you did jog my memory to try the CLOB storage, just in case :-) (I know it should not matter, but I'm desperate here)
                >
                Are you using compression? Indexes?
                >
                No compression or indexes. Once I get the load working, then there will be some indexes for sure. Full feeds from this source are ~ 20M files, so I won't keep many

                Edited by: user2280597 on Sep 27, 2012 12:29 PM
                • 5. Re: bfilename based load of xml table results in garbled data
                  odie_63
                  I've uploaded the test xml data to the following link It contains the full test data file that reproduces the issue, for me at least.
                  Thanks, that's very helpful.

                  Which records don't appear correctly after loading?
                  • 6. Re: bfilename based load of xml table results in garbled data
                    user2280597
                    First of all, thanks so much for your help and patience with me!

                    >
                    Which records don't appear correctly after loading?
                    >

                    Here are a few PCSIs where legacy_hcsi is not numeric as it is in the xml file. PCSI is the key and typically the first attribute in the record.
                    165089
                    165127
                    165124
                    373563
                    165133
                    165076
                    165101
                    230838

                    and here is one way I use besides eyeballing the data to find the bad ones:
                    select 
                            xtbl.pcsi,
                            xtbl.action,
                            xtbl.item_name,
                            xtbl.licensor_num
                            ,xtbl.licensor_name
                            ,xtbl.coverage_clarify
                            ,xtbl.legacy_hcsi
                    FROM   (select xdata  from xxln.t_xml_5a ) T,
                                      xmltable  ('/mds_response/mds_response_data/PCS'
                                              passing T.xdata
                                              columns
                                                "PCSI" VARCHAR2(80) path '@PCSI'
                                                ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
                                                ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'               
                                                ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
                                                ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'     
                                                ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'                                                
                                                ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'   
                                ) xtbl
                        where  not regexp_like(xtbl.legacy_hcsi, '\d+') 
                    in addition, for PCSI 251195, I'm getting licensor_name = "mic & Library Solutions" for the following XML where the attribute is "UNDEFINED"
                              <PCS PCSI="251195" ACTION="UPD" NAME="Ethnic News" CONTENT_TYPE="UNDEFINED" CONTENT_TYPE_PGUID="UNDEFINED" LIB="NEWS" FILE="ETHNLN" PUBLISHER_NUM="251285" PUBLISHER_NAME="zzz Library Solutions" PUBID="UNDEFINED" GEO="////"
                                 SHORT_DESC="UNDEFINED" LONG_DESC="UNDEFINED" COVERAGE_START="UNDEFINED" COVERAGE_END="UNDEFINED" UPD_FREQ="UNDEFINED" CREATE_DATE="26-MAR-2002" CREATED_BY="dbo" UPD_FREQ_CLARIFY="UNDEFINED" COVERAGE_CLARIFY="UNDEFINED"
                                 UPDATE_SCHEDULE="UNDEFINED" DOC_COUNT="0" CSR_READY="1" LICENSOR_NUM="UNDEFINED" LICENSOR_NAME="UNDEFINED" LEGACY_HCSI="251195"/>
                    I do see & amp ; (spaces added) in the xml, which I believe is a valid character entity for ampersand. I'll go see if there is any correlation.

                    Further testing indicates that just the bare read of the xml file exhibits the issue. In other words, I don't have to insert into the table to see the legacy_hcsi get "corrupted". For instance, I just run the following and looking at the results I see the problem:
                    SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))  FROM   DUAL
                    It may be easier to see if filtered as follows. This makes the missing coverage_clarify tag very visible, for this test file, every bad legacy_hcsi tag is from a row with a missing coverage_clarify
                    select 
                            xtbl.pcsi,     xtbl.action,     xtbl.item_name,     xtbl.licensor_num     ,xtbl.licensor_name     ,xtbl.coverage_clarify     ,xtbl.legacy_hcsi
                    FROM   (SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))  xdata FROM   DUAL  ) T,
                                      xmltable  ('/mds_response/mds_response_data/PCS'
                                              passing T.xdata
                                              columns
                                                "PCSI" VARCHAR2(80) path '@PCSI'
                                                ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
                                                ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'               
                                                ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
                                                ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'     
                                                ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'                                                
                                                ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'   
                                ) xtbl
                    where not  regexp_like(xtbl.legacy_hcsi, '\d+')
                    Thanks again!
                    • 7. Re: bfilename based load of xml table results in garbled data
                      odie_63
                      Sorry, I don't reproducing the issue, I'm getting all values at their correct places :
                      SQL> select count(*)
                        2  FROM xmltable  ('/mds_response/mds_response_data/PCS'
                        3                   passing XMLTYPE (bfilename ('TEST_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))
                        4                   columns
                        5                     "PCSI" VARCHAR2(80) path '@PCSI'
                        6                     ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
                        7                     ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'
                        8                     ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
                        9                     ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'
                       10                     ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'
                       11                     ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'
                       12     ) xtbl
                       13  where not  regexp_like(xtbl.legacy_hcsi, '\d+');
                       
                        COUNT(*)
                      ----------
                               0
                       
                      SQL> select *
                        2  FROM xmltable  ('/mds_response/mds_response_data/PCS'
                        3                   passing XMLTYPE (bfilename ('TEST_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))
                        4                   columns
                        5                     "PCSI" VARCHAR2(15) path '@PCSI'
                        6                   , "LEGACY_HCSI" varchar2(15) path '@LEGACY_HCSI'
                        7                   , "COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'
                        8          ) xtbl
                        9  where xtbl.pcsi in (
                       10  '165089',
                       11  '165127',
                       12  '165124',
                       13  '373563',
                       14  '165133',
                       15  '165076',
                       16  '165101',
                       17  '230838'
                       18  )
                       19  ;
                       
                      PCSI            LEGACY_HCSI     COVERAGE_CLARIFY
                      --------------- --------------- --------------------------------------------------------------------------------
                      165089          165089          
                      165127          165127          
                      165124          165124          
                      373563          373563          
                      165133          165133          
                      165076          165076          
                      165101          165101          
                      230838          230838          
                       
                      8 rows selected
                       
                      BTW, my test database is :
                      SQL> select * from v$version;
                       
                      BANNER
                      --------------------------------------------------------------------------------
                      Oracle Database 11g Express Edition Release 11.2.0.2.0 - Production
                      PL/SQL Release 11.2.0.2.0 - Production
                      CORE     11.2.0.2.0     Production
                      TNS for 32-bit Windows: Version 11.2.0.2.0 - Production
                      NLSRTL Version 11.2.0.2.0 - Production
                       
                      So not exactly yours, but the core functionalities should be the same.
                      • 8. Re: bfilename based load of xml table results in garbled data
                        user2280597
                        Thanks so much odie_63 for running the test, that gives me a good data point for comparison.

                        My takeaway is that the approach is valid and there is possibly something wrong with our 11G config. At least it seems environmental, not something wrong with the method of loading/querying the xml? I have tested this on multiple 11G servers, but they are all clones of the production system (its an EBS environment).

                        So my next steps are listed below, I'd appreciate any other ideas:
                        1. try on an 11G server not cloned from the same source and a 10G server
                        2. Oracle support ticket
                        3. convert xml to element normal and reload (it does not look too bad using perl or xslt but I hate to change data to workaround a technical problem)
                        4. see if the problem is type related. The data corruption I know about seems restricted to integer fields.
                        5. use a schema where possible. XMLTable does not seem to support a schema parameter. (largely unknown to me, but perhaps being more explicit with the xml will help -- pointers here especially appreciated)

                        I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere. I confirmed this by running the following statemeent and verifying the attribute in question is correct.
                        SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0_noschema.xml'), NLS_CHARSET_ID ('AL32UTF8'))  as XML_COL FROM   DUAL
                        I ran your last query and here are the results I get:
                        PCSI          LEGACY_HCSI
                        -----------   ---------------
                        165089     CCH Ct Blue Sky     
                        165127     CCH Kentucky Bl     
                        165124     KSDEC contains      
                        373563          
                        165133     In the fast-cha     
                        165076     In the fast-cha     
                        165101     In the fast-cha     
                        230838          
                        here is some perl code to extract one of the problem fields for easy verification that all the input values are integers
                        perl -nae 'if ( ($lhcsi) = $_ =~ /.*LEGACY_HCSI="(\d*)".*/ ) { print $lhcsi . "\n"; }' delta0.xml
                        • 9. Re: bfilename based load of xml table results in garbled data
                          odie_63
                          5. use a schema where possible. XMLTable does not seem to support a schema parameter. (largely unknown to me, but perhaps being more explicit with the xml will help -- pointers here especially appreciated)
                          Worth a try indeed.

                          I'd first focus on solving the actual issue with the help of Oracle Support though.

                          In order to use the schema-based approach, you have to register the schema first, and create a schema-based XMLType table :
                          http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb05sto.htm#g1070409
                          I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere.
                          Doesn't that contradict your previous statement :
                          Further testing indicates that just the bare read of the xml file exhibits the issue.
                          or, did you mean you were using XMLTable all the time?

                          BTW, does CLOB storage work better?
                          • 10. Re: bfilename based load of xml table results in garbled data
                            user2280597
                            I've submitted a support ticket, I'm starting to work through that.

                            I tried the schema approach, but I keep getting "object out of sync with definition" errors when I query the schema and tables won't create with a schema. But I've shelved this in light of the SR and converting the XML to element normal format as a test.

                            >
                            I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere.
                            Doesn't that contradict your previous statement :
                            Further testing indicates that just the bare read of the xml file exhibits the issue.
                            or, did you mean you were using XMLTable all the time?
                            >
                            I'm sorry, my statement was not clear at all. I meant that I eliminated storing the output of XMLTYPE(bfilename in a table. In other words, the problem is with xmltable construct.

                            >
                            BTW, does CLOB storage work better?
                            >
                            I will try that later today, production issues permitting :-)
                            • 11. Re: bfilename based load of xml table results in garbled data
                              user2280597
                              The XMLTABLE contruct works for element based XML. I used the exact same input file, I just wrote a little perl to convert
                              all the attributes to sub-elements and removed the @ qualifyer from the path.

                              Since the XML from the upstream system is attribute based, I'll have to productiionize the attribute to element conversion or wait for a fix from Oracle.

                              just for completeness, here is the working code:
                              select *
                                FROM xmltable  ('/mds_response/mds_response_data/PCS'
                                                 passing XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0_ele.xml'), NLS_CHARSET_ID ('AL32UTF8'))
                                                 columns
                                                   "PCSI" VARCHAR2(15) path 'PCSI'
                                                 , "LEGACY_HCSI" varchar2(15) path 'LEGACY_HCSI'
                                                 , "COVERAGE_CLARIFY" VARCHAR2(4000) path 'COVERAGE_CLARIFY'
                                        ) xtbl
                                where xtbl.pcsi in (  '165089',  '165127',  '165124',  '373563',  '165133',  '165076',  '165101',  '230838'  )
                              • 12. Re: bfilename based load of xml table results in garbled data
                                odie_63
                                user2280597 wrote:
                                The XMLTABLE contruct works for element based XML. I used the exact same input file, I just wrote a little perl to convert
                                all the attributes to sub-elements and removed the @ qualifyer from the path.
                                Thanks for the feedback. I'd be interested in knowing what Oracle Support says about it too.
                                Since the XML from the upstream system is attribute based, I'll have to productiionize the attribute to element conversion or wait for a fix from Oracle.
                                You can also perform an "on-the-fly" XSL transformation at insert time.
                                insert into t_xml_5a
                                select xmltransform(
                                         xmltype(bfilename('TEST_DIR', 'delta0.xml'), nls_charset_id ('AL32UTF8'))
                                       , xmlparse(document
                                '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                                 <xsl:output method="xml"/>
                                  <xsl:template match="@*|node()">
                                    <xsl:copy>
                                      <xsl:apply-templates select="@*|node()"/>
                                    </xsl:copy>
                                  </xsl:template>
                                  <xsl:template match="PCS/@*">
                                    <xsl:element name="{local-name(.)}">
                                      <xsl:value-of select="."/>
                                    </xsl:element>
                                  </xsl:template>
                                </xsl:stylesheet>')
                                )
                                from dual;
                                SQL> select x.*
                                  2  from t_xml_5a t
                                  3     , xmltable(
                                  4         '/mds_response/mds_response_data/PCS'
                                  5         passing t.xdata
                                  6         columns
                                  7           "PCSI"             varchar2(15)   path 'PCSI'
                                  8         --, "ACTION"           varchar2(80)   path 'ACTION'
                                  9         --, "ITEM_NAME"        varchar2(1000) path 'NAME'
                                 10         --, "LICENSOR_NUM"     varchar2(255)  path 'LICENSOR_NUM'
                                 11         --, "LICENSOR_NAME"    varchar2(1000) path 'LICENSOR_NAME' DEFAULT 'ZZZZZZZ'
                                 12         , "LEGACY_HCSI"      varchar2(15)   path 'LEGACY_HCSI'
                                 13         , "COVERAGE_CLARIFY" varchar2(4000) path 'COVERAGE_CLARIFY'
                                 14       ) x
                                 15  where x.pcsi in (
                                 16   '165089', '165127', '165124', '373563', '165133', '165076', '165101', '230838'
                                 17  );
                                 
                                PCSI            LEGACY_HCSI     COVERAGE_CLARIFY
                                --------------- --------------- --------------------------------------------------------------------------------
                                165089          165089          
                                165127          165127          
                                165124          165124          
                                373563          373563          
                                165133          165133          
                                165076          165076          
                                165101          165101          
                                230838          230838          
                                 
                                8 rows selected
                                 
                                Edited by: odie_63 on 3 oct. 2012 17:37
                                • 13. Re: bfilename based load of xml table results in garbled data
                                  user2280597
                                  Oracle support confirned this is a bug that is fixed in 11.2.0.3 and I'm on 11.2.0.2 currently.

                                  There are actually several other fixes we want from 11.2.0.3, so we will be going there after the next functionality release in 1Q13.