This discussion is archived
13 Replies Latest reply: Nov 30, 2012 6:27 AM by user2280597 RSS

bfilename based load of xml table results in garbled data

user2280597 Newbie
Currently Being Moderated
When I load an xml file using bfilename, the values for some attributes do not match the xml file as I expected.

Here is the minimal code that demonstrates the problem:
create table xxln.t_xml_5a
(
    xdata  XMLTYPE
) 

-- load xml from file
insert into xxln.t_xml_5a
SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta.xml'), NLS_CHARSET_ID ('AL32UTF8'))  FROM   DUAL
commit

select xdata from xxln.t_xml_5a

 -- file contents, .... indicates attr omitted for brevity
 <PCS PCSI="379424" ACTION="DEL" NAME="Energy Design Update"  .....
  LICENSOR_NUM="105836" LICENSOR_NAME="Wolters Kluwer US Corporation" 
  LEGACY_HCSI="379424"/>

 -- actually loaded. Note LEGACY_HCSI field does not match input
 <PCS PCSI="379424" ACTION="DEL" NAME="Energy Design Update"  .....
     LICENSOR_NUM="105836" LICENSOR_NAME="Wolters Kluwer US Corporation" 
     LEGACY_HCSI="BNA (ET Version for E&amp;Y, D&amp;T &amp; KPMG)"/>

-- the value for LEGACY_HCSI above is actually the LICENSOR_NAME from the previous record, 
-- although this pattern is not always repeatable (the data actually loaded appears random in many cases) 
LICENSOR_NAME="BNA (ET Version for E&amp;Y, D&amp;T &amp; KPMG)" 
There are many more attributes in the xml file than shown, however I have reduced them down to the minimum that
demonstrates the problem for posting purposes. The xml is in attribute normal form.

One things I notice is that the data corruption is strongly (but not completely) correlated to attributes
missing. In other words, for many of the corrupted attributes, another optional attribute is missing from the
xml element. For testing purposes, I manually put in this missing attribute and the LEGACY_HCSI field was
no longer corrupted but other populated attributes in the file were loaded as blank. I don't have control of the data
feed xml, so I can't assume all the optional attributes are there for production.

I must be missing something here or perhaps exceeding some limit perhaps for attribute text size. Why would the xml loaded not match what is in the file?

Thanks for your help isolating this issue.

Here is my version info:
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
PL/SQL Release 11.2.0.2.0 - Production
CORE     11.2.0.2.0     Production
TNS for Linux: Version 11.2.0.2.0 - Production
NLSRTL Version 11.2.0.2.0 - Production

I've seen the same issue in TOAD and sqldeveloper.
  • 1. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    Hi,

    Could you post something we can use to reproduce the issue? i.e. a clear test case, not something with ellipses.

    I've tried with the sample XML you gave, but it works, even with missing attributes.

    Thanks.
  • 2. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    I tried to post the whole test file that recreates the problem, but I ran into the 30k post size limit. The xml file is about 75k. So, I deleted records until the file + post text was under 30k and tested to ensure the problem was still evident. And it wasn't :-( Should I break the xml up into 3 posts? Is there another way?

    This does tell me the problem might be file size or data content related.

    BTW, when I say "data corruption" I mean I'm doing something wrong to mess up the data, not that I think there is something wrong with Oracle XML DB itself. Or xml I'm given is not valid for some reason.

    I'll keep digging.
  • 3. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    So, I deleted records until the file + post text was under 30k and tested to ensure the problem was still evident.
    Could you post this version then? The one you gave is far for 30k :)
    Should I break the xml up into 3 posts? Is there another way?
    You can do that too, or make it available at an online shared location.
    BTW, when I say "data corruption" I mean I'm doing something wrong to mess up the data
    Unless you've not told us everything, for now, I don't think of anything you're doing wrong.

    Is the target table really like the DDL you gave, hence with Binary XML storage?
    Are you using compression? Indexes?
  • 4. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    I've uploaded the test xml data to the following link It contains the full test data file that reproduces the issue, for me at least. The delta0.xml test file is actually ~ 157K and contains 102 records.

    https://docs.google.com/open?id=0BzrfOQLNyN7seFF6ZGdCSzNEQ0U

    >
    Is the target table really like the DDL you gave, hence with Binary XML storage?
    >
    yes, thats exactly it. Of course, this is the simplest code that demonstrates the problem. The complete design has many more columns and more complexity. But you did jog my memory to try the CLOB storage, just in case :-) (I know it should not matter, but I'm desperate here)
    >
    Are you using compression? Indexes?
    >
    No compression or indexes. Once I get the load working, then there will be some indexes for sure. Full feeds from this source are ~ 20M files, so I won't keep many

    Edited by: user2280597 on Sep 27, 2012 12:29 PM
  • 5. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    I've uploaded the test xml data to the following link It contains the full test data file that reproduces the issue, for me at least.
    Thanks, that's very helpful.

    Which records don't appear correctly after loading?
  • 6. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    First of all, thanks so much for your help and patience with me!

    >
    Which records don't appear correctly after loading?
    >

    Here are a few PCSIs where legacy_hcsi is not numeric as it is in the xml file. PCSI is the key and typically the first attribute in the record.
    165089
    165127
    165124
    373563
    165133
    165076
    165101
    230838

    and here is one way I use besides eyeballing the data to find the bad ones:
    select 
            xtbl.pcsi,
            xtbl.action,
            xtbl.item_name,
            xtbl.licensor_num
            ,xtbl.licensor_name
            ,xtbl.coverage_clarify
            ,xtbl.legacy_hcsi
    FROM   (select xdata  from xxln.t_xml_5a ) T,
                      xmltable  ('/mds_response/mds_response_data/PCS'
                              passing T.xdata
                              columns
                                "PCSI" VARCHAR2(80) path '@PCSI'
                                ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
                                ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'               
                                ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
                                ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'     
                                ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'                                                
                                ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'   
                ) xtbl
        where  not regexp_like(xtbl.legacy_hcsi, '\d+') 
    in addition, for PCSI 251195, I'm getting licensor_name = "mic & Library Solutions" for the following XML where the attribute is "UNDEFINED"
              <PCS PCSI="251195" ACTION="UPD" NAME="Ethnic News" CONTENT_TYPE="UNDEFINED" CONTENT_TYPE_PGUID="UNDEFINED" LIB="NEWS" FILE="ETHNLN" PUBLISHER_NUM="251285" PUBLISHER_NAME="zzz Library Solutions" PUBID="UNDEFINED" GEO="////"
                 SHORT_DESC="UNDEFINED" LONG_DESC="UNDEFINED" COVERAGE_START="UNDEFINED" COVERAGE_END="UNDEFINED" UPD_FREQ="UNDEFINED" CREATE_DATE="26-MAR-2002" CREATED_BY="dbo" UPD_FREQ_CLARIFY="UNDEFINED" COVERAGE_CLARIFY="UNDEFINED"
                 UPDATE_SCHEDULE="UNDEFINED" DOC_COUNT="0" CSR_READY="1" LICENSOR_NUM="UNDEFINED" LICENSOR_NAME="UNDEFINED" LEGACY_HCSI="251195"/>
    I do see & amp ; (spaces added) in the xml, which I believe is a valid character entity for ampersand. I'll go see if there is any correlation.

    Further testing indicates that just the bare read of the xml file exhibits the issue. In other words, I don't have to insert into the table to see the legacy_hcsi get "corrupted". For instance, I just run the following and looking at the results I see the problem:
    SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))  FROM   DUAL
    It may be easier to see if filtered as follows. This makes the missing coverage_clarify tag very visible, for this test file, every bad legacy_hcsi tag is from a row with a missing coverage_clarify
    select 
            xtbl.pcsi,     xtbl.action,     xtbl.item_name,     xtbl.licensor_num     ,xtbl.licensor_name     ,xtbl.coverage_clarify     ,xtbl.legacy_hcsi
    FROM   (SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))  xdata FROM   DUAL  ) T,
                      xmltable  ('/mds_response/mds_response_data/PCS'
                              passing T.xdata
                              columns
                                "PCSI" VARCHAR2(80) path '@PCSI'
                                ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
                                ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'               
                                ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
                                ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'     
                                ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'                                                
                                ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'   
                ) xtbl
    where not  regexp_like(xtbl.legacy_hcsi, '\d+')
    Thanks again!
  • 7. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    Sorry, I don't reproducing the issue, I'm getting all values at their correct places :
    SQL> select count(*)
      2  FROM xmltable  ('/mds_response/mds_response_data/PCS'
      3                   passing XMLTYPE (bfilename ('TEST_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))
      4                   columns
      5                     "PCSI" VARCHAR2(80) path '@PCSI'
      6                     ,"ACTION" VARCHAR2(80) path '@ACTION' -- will be null for full feeds as it aint there
      7                     ,"ITEM_NAME" VARCHAR2(1000) path '@NAME'
      8                     ,"LICENSOR_NUM" VARCHAR2(255) path '@LICENSOR_NUM'
      9                     ,"LICENSOR_NAME" VARCHAR2(1000) path '@LICENSOR_NAME' DEFAULT 'ZZZZZZZ'
     10                     ,"COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'
     11                     ,"LEGACY_HCSI" varchar2(80) path '@LEGACY_HCSI'
     12     ) xtbl
     13  where not  regexp_like(xtbl.legacy_hcsi, '\d+');
     
      COUNT(*)
    ----------
             0
     
    SQL> select *
      2  FROM xmltable  ('/mds_response/mds_response_data/PCS'
      3                   passing XMLTYPE (bfilename ('TEST_DIR', 'delta0.xml'), NLS_CHARSET_ID ('AL32UTF8'))
      4                   columns
      5                     "PCSI" VARCHAR2(15) path '@PCSI'
      6                   , "LEGACY_HCSI" varchar2(15) path '@LEGACY_HCSI'
      7                   , "COVERAGE_CLARIFY" VARCHAR2(4000) path '@COVERAGE_CLARIFY'
      8          ) xtbl
      9  where xtbl.pcsi in (
     10  '165089',
     11  '165127',
     12  '165124',
     13  '373563',
     14  '165133',
     15  '165076',
     16  '165101',
     17  '230838'
     18  )
     19  ;
     
    PCSI            LEGACY_HCSI     COVERAGE_CLARIFY
    --------------- --------------- --------------------------------------------------------------------------------
    165089          165089          
    165127          165127          
    165124          165124          
    373563          373563          
    165133          165133          
    165076          165076          
    165101          165101          
    230838          230838          
     
    8 rows selected
     
    BTW, my test database is :
    SQL> select * from v$version;
     
    BANNER
    --------------------------------------------------------------------------------
    Oracle Database 11g Express Edition Release 11.2.0.2.0 - Production
    PL/SQL Release 11.2.0.2.0 - Production
    CORE     11.2.0.2.0     Production
    TNS for 32-bit Windows: Version 11.2.0.2.0 - Production
    NLSRTL Version 11.2.0.2.0 - Production
     
    So not exactly yours, but the core functionalities should be the same.
  • 8. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    Thanks so much odie_63 for running the test, that gives me a good data point for comparison.

    My takeaway is that the approach is valid and there is possibly something wrong with our 11G config. At least it seems environmental, not something wrong with the method of loading/querying the xml? I have tested this on multiple 11G servers, but they are all clones of the production system (its an EBS environment).

    So my next steps are listed below, I'd appreciate any other ideas:
    1. try on an 11G server not cloned from the same source and a 10G server
    2. Oracle support ticket
    3. convert xml to element normal and reload (it does not look too bad using perl or xslt but I hate to change data to workaround a technical problem)
    4. see if the problem is type related. The data corruption I know about seems restricted to integer fields.
    5. use a schema where possible. XMLTable does not seem to support a schema parameter. (largely unknown to me, but perhaps being more explicit with the xml will help -- pointers here especially appreciated)

    I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere. I confirmed this by running the following statemeent and verifying the attribute in question is correct.
    SELECT XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0_noschema.xml'), NLS_CHARSET_ID ('AL32UTF8'))  as XML_COL FROM   DUAL
    I ran your last query and here are the results I get:
    PCSI          LEGACY_HCSI
    -----------   ---------------
    165089     CCH Ct Blue Sky     
    165127     CCH Kentucky Bl     
    165124     KSDEC contains      
    373563          
    165133     In the fast-cha     
    165076     In the fast-cha     
    165101     In the fast-cha     
    230838          
    here is some perl code to extract one of the problem fields for easy verification that all the input values are integers
    perl -nae 'if ( ($lhcsi) = $_ =~ /.*LEGACY_HCSI="(\d*)".*/ ) { print $lhcsi . "\n"; }' delta0.xml
  • 9. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    5. use a schema where possible. XMLTable does not seem to support a schema parameter. (largely unknown to me, but perhaps being more explicit with the xml will help -- pointers here especially appreciated)
    Worth a try indeed.

    I'd first focus on solving the actual issue with the help of Oracle Support though.

    In order to use the schema-based approach, you have to register the schema first, and create a schema-based XMLType table :
    http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb05sto.htm#g1070409
    I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere.
    Doesn't that contradict your previous statement :
    Further testing indicates that just the bare read of the xml file exhibits the issue.
    or, did you mean you were using XMLTable all the time?

    BTW, does CLOB storage work better?
  • 10. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    I've submitted a support ticket, I'm starting to work through that.

    I tried the schema approach, but I keep getting "object out of sync with definition" errors when I query the schema and tables won't create with a schema. But I've shelved this in light of the SR and converting the XML to element normal format as a test.

    >
    I have narrowed the problem down somewhat by determining that the XMLTYPE(bfilename part is correct, so the problem is in the XMLTABLE clause somwhere.
    Doesn't that contradict your previous statement :
    Further testing indicates that just the bare read of the xml file exhibits the issue.
    or, did you mean you were using XMLTable all the time?
    >
    I'm sorry, my statement was not clear at all. I meant that I eliminated storing the output of XMLTYPE(bfilename in a table. In other words, the problem is with xmltable construct.

    >
    BTW, does CLOB storage work better?
    >
    I will try that later today, production issues permitting :-)
  • 11. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    The XMLTABLE contruct works for element based XML. I used the exact same input file, I just wrote a little perl to convert
    all the attributes to sub-elements and removed the @ qualifyer from the path.

    Since the XML from the upstream system is attribute based, I'll have to productiionize the attribute to element conversion or wait for a fix from Oracle.

    just for completeness, here is the working code:
    select *
      FROM xmltable  ('/mds_response/mds_response_data/PCS'
                       passing XMLTYPE (bfilename ('XXLN_PIM2LMD_DIR', 'delta0_ele.xml'), NLS_CHARSET_ID ('AL32UTF8'))
                       columns
                         "PCSI" VARCHAR2(15) path 'PCSI'
                       , "LEGACY_HCSI" varchar2(15) path 'LEGACY_HCSI'
                       , "COVERAGE_CLARIFY" VARCHAR2(4000) path 'COVERAGE_CLARIFY'
              ) xtbl
      where xtbl.pcsi in (  '165089',  '165127',  '165124',  '373563',  '165133',  '165076',  '165101',  '230838'  )
  • 12. Re: bfilename based load of xml table results in garbled data
    odie_63 Guru
    Currently Being Moderated
    user2280597 wrote:
    The XMLTABLE contruct works for element based XML. I used the exact same input file, I just wrote a little perl to convert
    all the attributes to sub-elements and removed the @ qualifyer from the path.
    Thanks for the feedback. I'd be interested in knowing what Oracle Support says about it too.
    Since the XML from the upstream system is attribute based, I'll have to productiionize the attribute to element conversion or wait for a fix from Oracle.
    You can also perform an "on-the-fly" XSL transformation at insert time.
    insert into t_xml_5a
    select xmltransform(
             xmltype(bfilename('TEST_DIR', 'delta0.xml'), nls_charset_id ('AL32UTF8'))
           , xmlparse(document
    '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml"/>
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
      <xsl:template match="PCS/@*">
        <xsl:element name="{local-name(.)}">
          <xsl:value-of select="."/>
        </xsl:element>
      </xsl:template>
    </xsl:stylesheet>')
    )
    from dual;
    SQL> select x.*
      2  from t_xml_5a t
      3     , xmltable(
      4         '/mds_response/mds_response_data/PCS'
      5         passing t.xdata
      6         columns
      7           "PCSI"             varchar2(15)   path 'PCSI'
      8         --, "ACTION"           varchar2(80)   path 'ACTION'
      9         --, "ITEM_NAME"        varchar2(1000) path 'NAME'
     10         --, "LICENSOR_NUM"     varchar2(255)  path 'LICENSOR_NUM'
     11         --, "LICENSOR_NAME"    varchar2(1000) path 'LICENSOR_NAME' DEFAULT 'ZZZZZZZ'
     12         , "LEGACY_HCSI"      varchar2(15)   path 'LEGACY_HCSI'
     13         , "COVERAGE_CLARIFY" varchar2(4000) path 'COVERAGE_CLARIFY'
     14       ) x
     15  where x.pcsi in (
     16   '165089', '165127', '165124', '373563', '165133', '165076', '165101', '230838'
     17  );
     
    PCSI            LEGACY_HCSI     COVERAGE_CLARIFY
    --------------- --------------- --------------------------------------------------------------------------------
    165089          165089          
    165127          165127          
    165124          165124          
    373563          373563          
    165133          165133          
    165076          165076          
    165101          165101          
    230838          230838          
     
    8 rows selected
     
    Edited by: odie_63 on 3 oct. 2012 17:37
  • 13. Re: bfilename based load of xml table results in garbled data
    user2280597 Newbie
    Currently Being Moderated
    Oracle support confirned this is a bug that is fixed in 11.2.0.3 and I'm on 11.2.0.2 currently.

    There are actually several other fixes we want from 11.2.0.3, so we will be going there after the next functionality release in 1Q13.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points