8 Replies Latest reply: Sep 1, 2013 4:47 AM by odie_63 RSS

    Optimizing performance when querying XML data

    Erik_NL

      I have a table in my database containing information about persons. The table has a xmltype column with a lot of data about that person.

      One of the things in there is a telephone number. What I now need to figure out is whether there are any duplicate phone numbers in there.

       

      The xml basically looks like this (simplefied example):

       

      <DATAGROUP>

          <PERSON>

              <BUSINESS_ID>123</BUSINESS_ID>

              <INITIALS>M.</INITIALS>

              <NAME>Testperson</NAME>

              <BIRTHDATE>1977-12-12T00:00:00</BIRTHDATE>

              <GENDER>F</GENDER>

              <TELEPHONE>

                  <COUNTRYCODE>34</COUNTRYCODE>

                  <AREACODE>06</AREACODE>

                  <LOCALCODE>4318527235</LOCALCODE>

              </TELEPHONE>

          </PERSON>

      </DATAGROUP>

       

      As a result I would need the pk_id of the table with the xmltype column in it and a id that's unique for the person (the business_id that's also somewhere in the XML)

       

      I've conducted this query which will give me all telephone numbers and the number of times they occur.

       

        SELECT   OD.pk_ID,

                 tel.business_id  ,

             COUNT ( * ) OVER (PARTITION BY tel.COUNTRYCODE, tel.AREACODE, tel.LOCALCODE) totalcount

             FROM   xml_data od,

             XMLTABLE ('/DATAGROUP/PERSON' PASSING OD.DATAGROUP

                       COLUMNS "COUNTRYCODE" NUMBER PATH '/PERSON/TELEPHONE/COUNTRYCODE',

                               "AREACODE" NUMBER PATH '/PERSON/TELEPHONE/AREACODE',

                               "LOCALCODE" NUMBER PATH '/PERSON/TELEPHONE/LOCALCODE',

                               "BUSINESS_ID"  NUMBER PATH '/PERSON/BUSINESS_ID'

                   ) tel

             WHERE  tel.LOCALCODE is not null --ignore persons without a tel nr

          

      Since I am only interested in the telephone number that occur more than once, I used the above query as a subquery:

       

      WITH q as (

        SELECT   OD.pk_ID,

                 tel.business_id  ,

             COUNT ( * ) OVER (PARTITION BY tel.COUNTRYCODE, tel.AREACODE, tel.LOCALCODE) totalcount

             FROM   xml_data od,

             XMLTABLE ('/DATAGROUP/PERSON' PASSING OD.DATAGROUP

                       COLUMNS "COUNTRYCODE" NUMBER PATH '/PERSON/TELEPHONE/COUNTRYCODE',

                               "AREACODE" NUMBER PATH '/PERSON/TELEPHONE/AREACODE',

                               "LOCALCODE" NUMBER PATH '/PERSON/TELEPHONE/LOCALCODE',

                               "BUSINESS_ID"  NUMBER PATH '/PERSON/BUSINESS_ID'

                   ) tel

             WHERE  tel.LOCALCODE is not null) --ignore persons without a tel nr

      SELECT   OD.pk_ID,  tel.business_id

        FROM   q

      WHERE   totalcount > 1

       

      Now this is working and is giving me the right results, but the performance is dreadful with larger sets of data and will even go into errors like "LPX-00651 VM Stack overflow.".

      What I see is when I do a explain plan for the query is that there are things happening like "COLLECTION ITERATOR PICKLER FETCH PROCEDURE SYS.XQSEQUENCEFROMXMLTYPE" which seems to be something like a equivalent of a full table scan if I google on it.

       

      Any ideas how I can speed up this query? are there maybe smarter ways to do this?

       

      One thing to note is that the XMLTYPE data is not indexed in any way. Is there a possibility to do this? and how? I read about it in the oracle docs, but they where not very clear to me.     

        • 1. Re: Optimizing performance when querying XML data
          padders

          > the performance is dreadful with lager

           

          How is it when sober?

          • 2. Re: Optimizing performance when querying XML data
            BluShadow

            What is your database version?

             

            You may also be better asking your question over in the XML DB as that forum specializes in XML issues.

            The pickler fetch would kind of indicate you're using a pre 10.2.0.3 version (See Mark Drake's examples on this thread: Re: XML file processing into oracle post 7 onwards, and the fact your data is not indexed probably doesn't help.

            • 3. Re: Optimizing performance when querying XML data
              Erik_NL

              Thanks, I've moved it into the XML DB forum.

               

              My database version is actually Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production

               

              Some extra info:  I do also have a XSD for the xml that is in the column, I've registred that into the database (with DBMS_XMLSCHEMA.registerSchema), not sure if that's a good or a bad thing..

              • 4. Re: Optimizing performance when querying XML data
                BluShadow

                It's ok registering the schema, but was the XML loaded into the database with reference to that schema so that it was shredded and stored with appropriate indexes as Mark demonstrates on that thread I linked to?

                 

                Anyway, I don't tend to hang around in XML DB much, so I'll leave you in the capable hands of the XML specialists. 

                • 5. Re: Optimizing performance when querying XML data
                  Jason_(A_Non)

                  How is table xml_data defined, specifically the DATAGROUP column?  Was the table created in the 11.2.0.2 DB or a previous version of Oracle? What did your DBMS_XMLSCHEMA.registerSchema statement look like.  Based on your example, I'm not seeing any references from the XML to a schema.  Does any exist in the XML itself or is the sample you provided consistent with your real XML?

                  • 6. Re: Optimizing performance when querying XML data
                    odie_63

                    The "COLLECTION ITERATOR PICKLER FETCH" operation means that most likely the XMLType storage is BASICFILE CLOB, therefore greatly limiting the range of optimization techniques that Oracle could apply.

                    You can confirm what the current storage is by looking at the table DDL, as Jason asked.

                     

                    CLOB storage is deprecated now in favor of SECUREFILE BINARY XML (the default in 11.2.0.2).

                    Migrating the column to BINARY XML should give you a first significant improvement in the query.

                     

                    If the query is actually a recurring task, then it may further benefit from a structured XML index.

                    Here's a small test case :

                    create table xml_data nologging as

                    select level as pk_id, xmlparse(document '<DATAGROUP>

                        <PERSON>

                            <BUSINESS_ID>'||to_char(level)||'</BUSINESS_ID>

                            <INITIALS>M.</INITIALS>

                            <NAME>Testperson</NAME>

                            <BIRTHDATE>1977-12-12T00:00:00</BIRTHDATE>

                            <GENDER>F</GENDER>

                            <TELEPHONE>

                                <COUNTRYCODE>34</COUNTRYCODE>

                                <AREACODE>06</AREACODE>

                                <LOCALCODE>'||to_char(trunc(dbms_random.value(1,10000)))||'</LOCALCODE>

                            </TELEPHONE>

                        </PERSON>

                    </DATAGROUP>' wellformed) as datagroup

                    from dual

                    connect by level <= 100000 ;

                     

                    create index xml_data_sxi on xml_data (datagroup) indextype is xdb.xmlindex

                    parameters (q'{

                    XMLTABLE xml_data_xtab '/DATAGROUP/PERSON'

                    COLUMNS countrycode number path 'TELEPHONE/COUNTRYCODE',

                            areacode    number path 'TELEPHONE/AREACODE',

                            localcode   number path 'TELEPHONE/LOCALCODE',

                            business_id number path 'BUSINESS_ID'

                    }');

                     

                    call dbms_stats.gather_table_stats(user, 'XML_DATA');

                     

                    SQL> set autotrace traceonly

                    SQL> set timing on

                    SQL> select pk_id

                      2       , business_id

                      3       , totalcount

                      4  from (

                      5    select t.pk_id

                      6         , x.business_id

                      7         , count(*) over (partition by x.countrycode, x.areacode, x.localcode) totalcount

                      8    from xml_data t

                      9       , xmltable(

                    10           '/DATAGROUP/PERSON'

                    11           passing t.datagroup

                    12           columns countrycode number path 'TELEPHONE/COUNTRYCODE'

                    13                 , areacode    number path 'TELEPHONE/AREACODE'

                    14                 , localcode   number path 'TELEPHONE/LOCALCODE'

                    15                 , business_id number path 'BUSINESS_ID'

                    16         ) x

                    17    where x.localcode is not null

                    18  ) v

                    19  where v.totalcount > 1 ;

                     

                    99998 rows selected.

                     

                    Elapsed: 00:00:03.79

                     

                    Execution Plan

                    ----------------------------------------------------------

                    Plan hash value: 3200397756

                     

                    ----------------------------------------------------------------------------------------------

                    | Id  | Operation            | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |

                    ----------------------------------------------------------------------------------------------

                    |   0 | SELECT STATEMENT     |               |   100K|  3808K|       |  2068   (1)| 00:00:25 |

                    |*  1 |  VIEW                |               |   100K|  3808K|       |  2068   (1)| 00:00:25 |

                    |   2 |   WINDOW SORT        |               |   100K|  4101K|  5528K|  2068   (1)| 00:00:25 |

                    |*  3 |    HASH JOIN         |               |   100K|  4101K|  2840K|   985   (1)| 00:00:12 |

                    |   4 |     TABLE ACCESS FULL| XML_DATA      |   100K|  1660K|       |   533   (1)| 00:00:07 |

                    |*  5 |     TABLE ACCESS FULL| XML_DATA_XTAB |   107K|  2616K|       |   123   (1)| 00:00:02 |

                    ----------------------------------------------------------------------------------------------

                     

                    Predicate Information (identified by operation id):

                    ---------------------------------------------------

                     

                       1 - filter("V"."TOTALCOUNT">1)

                       3 - access("T".ROWID="SYS_SXI_0"."RID")

                       5 - filter("SYS_SXI_0"."LOCALCODE" IS NOT NULL)

                     

                     

                    Statistics

                    ----------------------------------------------------------

                              0  recursive calls

                              1  db block gets

                           2359  consistent gets

                            485  physical reads

                            168  redo size

                        2352128  bytes sent via SQL*Net to client

                          73746  bytes received via SQL*Net from client

                           6668  SQL*Net roundtrips to/from client

                              1  sorts (memory)

                              0  sorts (disk)

                          99998  rows processed

                     

                     

                    If the above is still not satisfying then you can try structured storage (schema-based).

                    • 7. Re: Optimizing performance when querying XML data
                      Erik_NL

                      Thanks for all the help and suggestions.

                       

                      to answer some questions first:

                       

                      The DATAGROUP column is stored as a binary XML column, not as BASICFILE CLOB.

                       

                      The XML schema was registered with  DBMS_XMLSCHEMA.REGISTERSCHEMA, but since there no reference to it in the XMLDATA table, that probably didn't make too much sense. The data is composed by queries using functions like XMLELEMENT, XMLATTRIBUTE, etc, after which the composed XMLTYPE variable is inserted into the table. I don't think it's now shredded in any way, how would I have to "tell"  oracle to do that? Is adding the schemalocation (with xsi:noNamespaceSchemaLocation) to the root element the only option?

                       

                      To give you some background information: The data in xml_data is generated from relational data, the table is actually a temporary table, used to gather the data before it's saved out to files. The queries I'm writing are performing checks on this data. (it sound like it would make sense to do the checks on the relational data, but the checks are done on the XML data since the relational data is constantly changing and might be changed after checks have been performed)

                       

                      With the sample from Odie I was able to create a xmlindex, which greatly improved the performance of the query.

                       

                      I actually rebuild both my query and the index script a little, since I realized that one person could have multiple telephone numbers, and I want to check each of them.

                      with q as (
                      SELECT OD.pk_ID,
                             tel.business_id,
                             COUNT ( * ) OVER (PARTITION BY tel1.COUNTRYCODE, tel1.AREACODE, tel1.LOCALCODE) totalcount
                             FROM xml_data od,
                             XMLTABLE ('/DATAGROUP/PERSON[1]' PASSING OD.DATAGROUP
                                       COLUMNS "BUSINESS_ID"  NUMBER PATH '/PERSON/BUSINESS_ID',
                                               "PERSON"       XMLTYPE PATH '.'
                                   ) tel,
                             XMLTABLE ('/PERSON/TELEPHONE' PASSING PERSON
                                       COLUMNS "COUNTRYCODE" VARCHAR2(10) PATH '/TELEPHONE/COUNTRYCODE',
                                               "AREACODE" VARCHAR2(10) PATH '/TELEPHONE/AREACODE',
                                               "LOCALCODE" VARCHAR2(10) PATH '/TELEPHONE/LOCALCODE'
                                   ) tel1
                             WHERE  tel1.LOCALCODE is not null
                             )
                      SELECT   od.PK_ID, business_id
                        FROM   q
                      WHERE   totalcount > 1;
                      
                      

                       

                       

                      and I created this index for it:

                       

                       

                      create  index xml_data_sxi on xml_data (datagroup) indextype is xdb.xmlindex
                      parameters (q'{
                      XMLTABLE xml_data_xtab '/DATAGROUP/PERSON[1]'
                      COLUMNS                        "BUSINESS_ID"  NUMBER PATH '/PERSON/BUSINESS_ID',
                                                                 "PERSON"       XMLTYPE PATH '.' VIRTUAL    
                      XMLTABLE xml_data_xtab1 '/PERSON/TELEPHONE' passing PERSON
                      COLUMNS                  "COUNTRYCODE" VARCHAR2(10) PATH '/TELEPHONE/COUNTRYCODE',
                                                           "AREACODE" VARCHAR2(10) PATH '/TELEPHONE/AREACODE',    
                                                           "LOCALCODE" VARCHAR2(10) PATH '/TELEPHONE/LOCALCODE'         
                                               }');
                      
                      

                       

                       

                      PLAN_TABLE_OUTPUT
                      
                      ------------------------------------------------------------------------------------------------
                      | Id  | Operation               | Name                    | Rows  | Bytes |TempSpc| Cost (%CPU)|
                      ------------------------------------------------------------------------------------------------
                      |   0 | SELECT STATEMENT        |                         | 11520 |   641K|       |   279   (3)|
                      |*  1 |  VIEW                   |                         | 11520 |   641K|       |   279   (3)|
                      |   2 |   WINDOW SORT           |                         | 11520 |  1237K|  1368K|   279   (3)|
                      |*  3 |    HASH JOIN            |                         | 11520 |  1237K|       |    31  (10)|
                      |   4 |     INDEX FAST FULL SCAN| XML_DATA_PK  | 12323 |   216K|       |     4   (0)|
                      |*  5 |     HASH JOIN           |                         | 11520 |  1035K|       |    26   (8)|
                      |   6 |      TABLE ACCESS FULL  | XML_DATA_XTAB           | 11520 |   517K|       |    12   (0)|
                      |*  7 |      TABLE ACCESS FULL  | XML_DATA_XTAB1          | 11555 |   519K|       |    13   (8)|
                      ------------------------------------------------------------------------------------------------
                      
                      Predicate Information (identified by operation id):
                      ---------------------------------------------------
                      
                         1 - filter("TOTALCOUNT">1)
                         3 - access("OD".ROWID="SYS_SXI_0"."RID")
                         5 - access("SYS_SXI_0"."KEY"="SYS_SXI_1"."PKEY")
                         7 - filter("SYS_SXI_1"."LOCALCODE" IS NOT NULL)
                      

                       

                      The problem I'm facing now is that I can only define one xmltype index on the DATAGROUP column. If I would try to create a second one, I trigger a ORA-29879: cannot create multiple domain indexes on a column list using same indextype error.

                       

                      My queries would need multiple paths through the XML, my first query for example was for checking duplicate telephone numbers, but what if I want to do the same for addresses?

                       

                      for instance, let's say my XML looks like this:

                       

                      <DATAGROUP>
                          <PERSON>
                              <BUSINESS_ID>123</BUSINESS_ID>
                              <INITIALS>M.</INITIALS>
                              <NAME>Testperson</NAME>
                              <BIRTHDATE>1977-12-12T00:00:00</BIRTHDATE>
                              <GENDER>F</GENDER>
                              <TELEPHONE>
                                  <COUNTRYCODE>34</COUNTRYCODE>
                                  <AREACODE>06</AREACODE>
                                  <LOCALCODE>4318527235</LOCALCODE>
                              </TELEPHONE>
                              <ADDRESS>
                                  <ZIP>123444</ZIP>
                                  <STREET>somestreet</STREET>
                                  <HOUSENUMBER>12</HOUSENUMBER>
                              </ADDRESS>      
                          </PERSON>
                      </DATAGROUP>
                      
                      

                      Is there a way to extend the index? I saw some examples using  DBMS_XMLINDEX.registerParameter, but I'm not sure how I should use that functionality in this context.

                      • 8. Re: Optimizing performance when querying XML data
                        odie_63

                        Is there a way to extend the index? I saw some examples using  DBMS_XMLINDEX.registerParameter, but I'm not sure how I should use that functionality in this context.

                         

                        See Indexing XMLType Data :

                        A structured_clause specifies the structured islands that you want to index. You use the keyword GROUP to specify each structured island: an island thus corresponds syntactically to a structure group. If you specify no group explicitly, then the predefined group DEFAULT_GROUP is used. For ALTER INDEX, you precede the GROUP keyword with the modification operation keyword: ADD_GROUP specifies a new group (island); DROP_GROUP deletes a group.

                         

                        Why have multiple groups within a single index, instead of simply using multiple XMLIndex indexes? The reason is that XMLIndex is a domain index, and you can create only one domain index of a given type on a given database column.

                         

                        Example :

                        BEGIN

                          DBMS_XMLINDEX.registerParameter(

                            'xml_data_param1',

                            q'{ADD_GROUP GROUP address_group

                        XMLTABLE xml_data_xtab2 '/DATAGROUP/PERSON'

                        COLUMNS business_id NUMBER  PATH 'BUSINESS_ID',

                                addresses   XMLTYPE PATH 'ADDRESS' VIRTUAL    

                        XMLTABLE xml_data_xtab3 '/ADDRESS' PASSING addresses

                        COLUMNS zip          VARCHAR2(10) PATH 'ZIP',

                                street       VARCHAR2(30) PATH 'STREET',    

                                housenumber  VARCHAR2(10) PATH 'HOUSENUMBER' }');

                        END;

                        /

                         

                        ALTER INDEX xml_data_sxi PARAMETERS('PARAM xml_data_param1');

                         

                        SQL> set autotrace on

                        SQL> set lines 200

                        SQL> set pages 100

                        SQL> select x1.business_id

                          2       , x2.*

                          3  from xml_data t

                          4     , XMLTABLE('/DATAGROUP/PERSON'

                          5         PASSING t.datagroup

                          6         COLUMNS business_id NUMBER  PATH 'BUSINESS_ID',

                          7                 addresses   XMLTYPE PATH 'ADDRESS' ) x1,

                          8       XMLTABLE('/ADDRESS'

                          9         PASSING x1.addresses

                        10         COLUMNS zip          VARCHAR2(10) PATH 'ZIP',

                        11                 street       VARCHAR2(30) PATH 'STREET',

                        12                 housenumber  VARCHAR2(10) PATH 'HOUSENUMBER' ) x2 ;

                         

                        BUSINESS_ID ZIP        STREET                         HOUSENUMBE

                        ----------- ---------- ------------------------------ ----------

                                123 123444     somestreet                     12

                         

                         

                        Execution Plan

                        ----------------------------------------------------------

                        Plan hash value: 2187959680

                         

                        --------------------------------------------------------------------------------------------------------

                        | Id  | Operation                     | Name                   | Rows  | Bytes | Cost (%CPU)| Time     |

                        --------------------------------------------------------------------------------------------------------

                        |   0 | SELECT STATEMENT              |                        |     1 |  1072 |     4   (0)| 00:00:01 |

                        |   1 |  NESTED LOOPS                 |                        |     1 |  1072 |     4   (0)| 00:00:01 |

                        |   2 |   NESTED LOOPS                |                        |     1 |  1060 |     3   (0)| 00:00:01 |

                        |   3 |    TABLE ACCESS FULL          | XML_DATA_XTAB3         |     1 |   533 |     3   (0)| 00:00:01 |

                        |   4 |    TABLE ACCESS BY INDEX ROWID| XML_DATA_XTAB2         |     1 |   527 |     0   (0)| 00:00:01 |

                        |*  5 |     INDEX RANGE SCAN          | SYS38402_38423_KEY_IDX |     1 |       |     0   (0)| 00:00:01 |

                        |   6 |   TABLE ACCESS BY USER ROWID  | XML_DATA               |     1 |    12 |     1   (0)| 00:00:01 |

                        --------------------------------------------------------------------------------------------------------

                         

                        Predicate Information (identified by operation id):

                        ---------------------------------------------------

                         

                           5 - access("SYS_SXI_0"."KEY"="SYS_SXI_1"."PKEY")

                         

                        Note

                        -----

                           - dynamic sampling used for this statement (level=2)

                         

                         

                        Statistics

                        ----------------------------------------------------------

                                 61  recursive calls

                                  2  db block gets

                                430  consistent gets

                                  0  physical reads

                                148  redo size

                                629  bytes sent via SQL*Net to client

                                419  bytes received via SQL*Net from client

                                  2  SQL*Net roundtrips to/from client

                                  0  sorts (memory)

                                  0  sorts (disk)

                                  1  rows processed

                         

                         

                        You may also be interested in using an XMLType view instead : XMLType Views

                        XQuery-based queries on the view are rewritten to use the relational storage, furthermore on real-time data.