
Processing hierarchical XML in DWH type setup

978225 Newbie

Hello everyone,

 

My environment is Oracle DB 11.2.0.3.

 

I have some XML documents in a CLOB column that have a three-level hierarchy.

The cardinality for Level1:Level2 is 1:n, and for Level2:Level3 it is also 1:n.

 

 

The XML looks something like this:

<l1_root>
  <l1_id>1</l1_id>
  <l1_attrib1>AAAA</l1_attrib1>
  <l1_attrib2>BBBB</l1_attrib2>
  <l2>
    <l2_id>2</l2_id>
    <l2_attrib1>CCCC</l2_attrib1>
    <l2_attrib2>DDDD</l2_attrib2>
    <l3>
      <l3_id>3</l3_id>
      <l3_attrib1>EEEE</l3_attrib1>
      <l3_attrib2>FFFF</l3_attrib2>
    </l3>
    <l3>
      <l3_id>4</l3_id>
      <l3_attrib1>EEEE</l3_attrib1>
      <l3_attrib2>FFFF</l3_attrib2>
    </l3>
    ...
  </l2>
  <l2>
    <l2_id>5</l2_id>
    <l2_attrib1>GGGG</l2_attrib1>
    <l2_attrib2>HHHH</l2_attrib2>
    <l3>
      <l3_id>6</l3_id>
      <l3_attrib1>IIII</l3_attrib1>
      <l3_attrib2>JJJJ</l3_attrib2>
    </l3>
    <l3>
      <l3_id>7</l3_id>
      <l3_attrib1>KKKK</l3_attrib1>
      <l3_attrib2>LLLL</l3_attrib2>
    </l3>
    ...
  </l2>
  ...
</l1_root>

 

 

I would like to convert this into a normalized relational table structure:

 

 

L1
=========
L1_ID
Attrib1
Attrib2

L2
=========
L1_ID
L2_ID
Attrib1
Attrib2

L3
=========
L1_ID
L2_ID
L3_ID
Attrib1
Attrib2
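
For concreteness, the corresponding DDL would be roughly the following (only a sketch - the data types, lengths and key constraints are my assumptions, not a final design):

create table l1 (
  l1_id   number        not null,
  attrib1 varchar2(30),
  attrib2 varchar2(30),
  constraint l1_pk primary key (l1_id)
);

create table l2 (
  l1_id   number        not null references l1,
  l2_id   number        not null,
  attrib1 varchar2(30),
  attrib2 varchar2(30),
  constraint l2_pk primary key (l1_id, l2_id)
);

create table l3 (
  l1_id   number        not null,
  l2_id   number        not null,
  l3_id   number        not null,
  attrib1 varchar2(30),
  attrib2 varchar2(30),
  constraint l3_pk primary key (l1_id, l2_id, l3_id),
  constraint l3_l2_fk foreign key (l1_id, l2_id) references l2 (l1_id, l2_id)
);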

 

 

This is an incremental-load DWH type of scenario, so I will have to do this initially for all previously existing records (approx. 200 Mio) and then daily for the additional records (approx. 1 Mio per day). I am using MERGE statements for the incremental loads.

Some of the documents may have 100 or more elements on level 2, with about 1-5 level-3 elements each.

 

 

What is the most efficient way of doing this?

My first idea was to convert the CLOB into an XMLTYPE and then use 3 nested XMLTABLEs to extract the values via XPath expressions. However, it feels like, by doing this, I am denormalizing everything (flattening it) and then regrouping it into the 3 normalized levels...
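
To illustrate, the kind of chained XMLTABLE query I have in mind is roughly the following (untested sketch; src_docs and xml_clob are only placeholder names for my source table and CLOB column):

select x1.l1_id, x1.l1_attrib1, x1.l1_attrib2
     , x2.l2_id, x2.l2_attrib1, x2.l2_attrib2
     , x3.l3_id, x3.l3_attrib1, x3.l3_attrib2
from src_docs d
   , xmltable('/l1_root' passing xmltype(d.xml_clob)
       columns
         l1_id      integer      path 'l1_id'
       , l1_attrib1 varchar2(30) path 'l1_attrib1'
       , l1_attrib2 varchar2(30) path 'l1_attrib2'
       , l2_frag    xmltype      path 'l2'
     ) x1
   , xmltable('/l2' passing x1.l2_frag
       columns
         l2_id      integer      path 'l2_id'
       , l2_attrib1 varchar2(30) path 'l2_attrib1'
       , l2_attrib2 varchar2(30) path 'l2_attrib2'
       , l3_frag    xmltype      path 'l3'
     ) x2
   , xmltable('/l3' passing x2.l3_frag
       columns
         l3_id      integer      path 'l3_id'
       , l3_attrib1 varchar2(30) path 'l3_attrib1'
       , l3_attrib2 varchar2(30) path 'l3_attrib2'
     ) x3;

This returns one fully flattened row per level-3 element, which is exactly the denormalize-then-regroup effect I am worried about.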

 

 

Does it make sense to apply an XMLIndex to the XMLTYPE data? I am not sure whether the overhead of creating the index and adding the new records to it would outweigh the gain during the parsing in the next step. Since this is a DWH incremental-load situation, each record is only processed once, so I am thinking that not creating an XMLIndex might be faster.

I can see how an XMLIndex would be helpful for extracting very small fragments from relatively large documents. But in this case, pretty much the whole document is relevant, which adds to my doubts about an index making sense here...

 

 

Any pointers/ideas/comments?

 

Thank you in advance...

  • 1. Re: Processing hierarchical XML in DWH type setup
    odie_63 Guru

    Hi,

    My first idea was to convert the CLOB into an XMLTYPE

    Yes, you must first convert the column to XMLTYPE with binary XML storage in order to have any chance for this to perform decently.
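
    For example (only a sketch - the staging table and source names below are made up, adapt them to your schema):

    create table tmp_xml of xmltype
      xmltype store as securefile binary xml;

    insert into tmp_xml
    select xmltype(s.xml_clob)
    from source_table s;

    An XMLType table exposes each document through the OBJECT_VALUE pseudo-column, which is what the queries below pass to XMLTable.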

     

    and then use 3 nested XMLTABLEs to extract the values via XPath expressions. However, it feels like, by doing this, I am denormalizing everything (flattening it) and then regrouping it into the 3 normalized levels...

    You could have used a single query like you describe, combined with a multitable INSERT statement, but since you're in a data-merging scenario it is probably simpler to use 3 different MERGE statements, each selecting and targeting the relevant portion of the XML data and of the target tables.
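
    (For completeness, the single-query variant would be a multitable insert over the fully flattened row set - an untested sketch, usable for the initial load only; it assumes every l1 has at least one l2 and every l2 at least one l3, and the FOR ORDINALITY columns keep the parent rows from being inserted more than once:)

    insert all
      when l2_ord = 1 and l3_ord = 1 then
        into l1 (l1_id, attrib1, attrib2)
        values (l1_id, l1_attrib1, l1_attrib2)
      when l3_ord = 1 then
        into l2 (l1_id, l2_id, attrib1, attrib2)
        values (l1_id, l2_id, l2_attrib1, l2_attrib2)
      when 1 = 1 then
        into l3 (l1_id, l2_id, l3_id, attrib1, attrib2)
        values (l1_id, l2_id, l3_id, l3_attrib1, l3_attrib2)
    select x1.l1_id, x1.l1_attrib1, x1.l1_attrib2
         , x2.l2_ord, x2.l2_id, x2.l2_attrib1, x2.l2_attrib2
         , x3.l3_ord, x3.l3_id, x3.l3_attrib1, x3.l3_attrib2
    from tmp_xml t
       , xmltable('/l1_root' passing t.object_value
           columns
             l1_id      integer      path 'l1_id'
           , l1_attrib1 varchar2(30) path 'l1_attrib1'
           , l1_attrib2 varchar2(30) path 'l1_attrib2'
           , l2_frag    xmltype      path 'l2'
         ) x1
       , xmltable('/l2' passing x1.l2_frag
           columns
             l2_ord     for ordinality
           , l2_id      integer      path 'l2_id'
           , l2_attrib1 varchar2(30) path 'l2_attrib1'
           , l2_attrib2 varchar2(30) path 'l2_attrib2'
           , l3_frag    xmltype      path 'l3'
         ) x2
       , xmltable('/l3' passing x2.l3_frag
           columns
             l3_ord     for ordinality
           , l3_id      integer      path 'l3_id'
           , l3_attrib1 varchar2(30) path 'l3_attrib1'
           , l3_attrib2 varchar2(30) path 'l3_attrib2'
         ) x3;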

    The three MERGE statements could look something like this:

    -- Level 1: one row per l1_root document
    merge into l1
    using (
        select x.*
        from tmp_xml t
           , xmltable(
               '/l1_root'
               passing t.object_value
               columns
                 L1_ID   integer      path 'l1_id'
               , Attrib1 varchar2(30) path 'l1_attrib1'
               , Attrib2 varchar2(30) path 'l1_attrib2'
             ) x
    ) v
    on (l1.l1_id = v.l1_id)
    when matched then update
    set l1.attrib1 = v.attrib1
      , l1.attrib2 = v.attrib2
    when not matched then insert (l1_id, attrib1, attrib2)
    values (v.l1_id, v.attrib1, v.attrib2)
    ;

     

     

    -- Level 2: the XQuery for/return propagates the parent l1_id to each l2 element
    merge into l2
    using (
        select x.*
        from tmp_xml t
           , xmltable(
               'for $i1 in /l1_root
                  , $i2 in $i1/l2
                return element r {
                  $i1/l1_id
                , $i2
                }'
               passing t.object_value
               columns
                 L1_ID   integer      path 'l1_id'
               , L2_ID   integer      path 'l2/l2_id'
               , Attrib1 varchar2(30) path 'l2/l2_attrib1'
               , Attrib2 varchar2(30) path 'l2/l2_attrib2'
             ) x
    ) v
    on (    l2.l1_id = v.l1_id
        and l2.l2_id = v.l2_id )
    when matched then update
    set l2.attrib1 = v.attrib1
      , l2.attrib2 = v.attrib2
    when not matched then insert (l1_id, l2_id, attrib1, attrib2)
    values (v.l1_id, v.l2_id, v.attrib1, v.attrib2)
    ;

     

     

    -- Level 3: both ancestor ids (l1_id, l2_id) are propagated to each l3 element
    merge into l3
    using (
        select x.*
        from tmp_xml t
           , xmltable(
               'for $i1 in /l1_root
                  , $i2 in $i1/l2
                  , $i3 in $i2/l3
                return element r {
                  $i1/l1_id
                , $i2/l2_id
                , $i3
                }'
               passing t.object_value
               columns
                 L1_ID   integer      path 'l1_id'
               , L2_ID   integer      path 'l2_id'
               , L3_ID   integer      path 'l3/l3_id'
               , Attrib1 varchar2(30) path 'l3/l3_attrib1'
               , Attrib2 varchar2(30) path 'l3/l3_attrib2'
             ) x
    ) v
    on (    l3.l1_id = v.l1_id
        and l3.l2_id = v.l2_id
        and l3.l3_id = v.l3_id )
    when matched then update
    set l3.attrib1 = v.attrib1
      , l3.attrib2 = v.attrib2
    when not matched then insert (l1_id, l2_id, l3_id, attrib1, attrib2)
    values (v.l1_id, v.l2_id, v.l3_id, v.attrib1, v.attrib2)
    ;

     

    Does it make sense to apply an XMLIndex to the XMLTYPE data? I am not sure whether the overhead of creating the index and adding the new records to it would outweigh the gain during the parsing in the next step.

    Your assumption is correct.

    If you create a (structured) XMLIndex, Oracle will actually run similar XMLTable queries to maintain the underlying storage structure of the index.
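
    To make this concrete, a structured XMLIndex on this document would itself be defined through XMLTable projections, along these lines (illustrative only, not tested):

    create index tmp_xml_sxi on tmp_xml (object_value)
      indextype is xdb.xmlindex
      parameters ('XMLTABLE tmp_xml_sxi_tab ''/l1_root''
                     COLUMNS l1_id   NUMBER       PATH ''l1_id''
                           , attrib1 VARCHAR2(30) PATH ''l1_attrib1''');

    Maintaining such an index means shredding the same documents a second time, on top of the load into L1, L2 and L3.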

    So, in this case, it's better to directly load your target relational tables, without any XML index.

     

    I will have to do this initially for all previously existing records (approx. 200 Mio) and then daily for the additional records (approx. 1 Mio per day)

    "Mio" as in "million" (of rows) or "mebioctet"?

  • 2. Re: Processing hierarchical XML in DWH type setup
    978225 Newbie

    Thanks for this reply, odie!

     

    Yes, that second option with the "for $i ..." syntax looks like what I had in mind - I didn't know this syntax existed.

     

    And "Mio" as in "million" of records.

     

    Thanks for these pointers; I will try this out.
