12 Replies Latest reply: Jan 31, 2013 4:12 AM by odie_63 RSS

    Approach to parse large number of XML files into the relational table.

    225452
      We are exploring the option of XML DB for processing a large number of files coming same day.
      The objective is to parse the XML file and store in multiple relational tables. Once in relational table we do not care about the XML file.
      The file can not be stored on the file server and need to be stored in a table before parsing due to security issues. A third party system will send the file and will store it in the XML DB.
      File size can be between 1MB to 50MB and high performance is very much expected other wise the solution will be tossed.
      Although we do not have XSD, the XML file is well structured. We are on 11g Release 2.

      Based on the reading this is what my approach.
      1. CREATE TABLE XML_DATA
      (xml_col XMLTYPE)
      XMLTYPE xml_col STORE AS SECUREFILE BINARY XML;

      2. Third party will store the data in XML_DATA table.
      3. Create XMLINDEX on the unique XML element
      4. Create views on XMLTYPE
      CREATE OR REPLACE FORCE VIEW V_XML_DATA(
         Stype,
         Mtype,
         MNAME,
         OIDT
      )
      AS
         SELECT x."Stype",
                x."Mtype",
                x."Mname",
                x."OIDT"
         FROM   data_table t,
                XMLTABLE (
                   '/SectionMain'
                   PASSING t.data
                   COLUMNS Stype VARCHAR2 (30) PATH 'Stype',
                           Mtype VARCHAR2 (3) PATH 'Mtype',
                           MNAME VARCHAR2 (30) PATH 'MNAME',
                           OIDT VARCHAR2 (30) PATH 'OID') x;

      5. Bulk load the parse data in the staging table based on the index column.

      Please comment on the above approach any suggestion that can improve the performance.

      Thanks
      AnuragT
        • 1. Re: Approach to parse large number of XML files into the relational table.
          odie_63
          What's your exact version? (The XQuery engine changed on 11.2.0.2)

          Given the details you gave so far, that's the correct approach.

          If the staging table has a single XMLType column, maybe you can just create an XMLType object table instead :
          CREATE TABLE xml_data OF XMLType
          XMLType STORE AS SECUREFILE BINARY XML ;
          What kind of XML index are you using? Maintaining an XML index has a cost too.
          As an alternative, you can also try adding a virtual column that target the unique value in the XML and index it with a classic B-tree index :
          http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb_indexing.htm#CHDBHADG
          • 2. Re: Approach to parse large number of XML files into the relational table.
            225452
            Thanks for your response. It givies more confidence.

            Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
            TNS for Linux: Version 11.2.0.3.0 - Production

            Example XML
            <SectionMain>
            <SectionState>Closed</SectionState>
            <FunctionalState>CP FINISHED</FunctionalState>
            <CreatedTime>2012-08</CreatedTime>
            <Number>106</Number>
            <SectionType>Reel</SectionType>
            <MachineType>CP</MachineType>
            <MachineName>CP_225</MachineName>
            <OID>99dd48cf-fd1b-46cf-9983-0026c04963d2</OID>
            </SectionMain>
            <SectionEvent>
            <SectionOID>99dd48cf-2</SectionOID>
            <EventName>CP.CP_225.Shredder</EventName>
            <OID>b3dd48cf-532d-4126-92d2</OID>
            </SectionEvent>
            <SectionAddData>
            <SectionOID>99dd48cf2</SectionOID>
            <AttributeName>ReelVersion</AttributeName>
            <AttributeValue>4</AttributeValue>
            <OID>b3dd48cf</OID>
            </SectionAddData>
            - <SectionAddData>
            <SectionOID>99dd48cf-fd1b-46cf-9983</SectionOID>
            <AttributeName>ReelNr</AttributeName>
            <AttributeValue>38</AttributeValue>
            <OID>b3dd48cf</OID>
            ...
            ...
            <BNCounter>
            <SectionID>99dd48cf-fd1b-46cf-9983-0026c04963d2</SectionID>
            <Run>CPFirstRun</Run>
            <SortingClass>84</SortingClass>
            <OutputStacker>D2</OutputStacker>
            ...
            <BNCounter>54605</BNCounter>
            </BNCounter>
            ...
            ...

            I was not aware of Virtual column but looks like we can use it and avoid creating views by just inserting directly into
            the staging table using virtual column.

            Suppose OID id is the unique identifier of each XML FILE and I created virtual column

            CREATE TABLE po_Virtual OF XMLTYPE
            XMLTYPE STORE AS BINARY XML
            VIRTUAL COLUMNS
            (OID_1 AS (XMLCAST(XMLQUERY('/SectionMain/OID'
            PASSING OBJECT_VALUE RETURNING CONTENT)
            AS VARCHAR2(30))));

            1. My question is how then I will write this query by NOT USING COLMUN XML_COL
            SELECT x."SECTIONTYPE",
            x."MACHINETYPE",
            x."MACHINENAME",
            x."OIDT"
            FROM po_Virtual t,
            XMLTABLE (
            '/SectionMain'
            PASSING t.xml_col                          <--WHAT WILL PASSING HERE SINCE NO XML_COL
            COLUMNS SectionType VARCHAR2 (30) PATH 'SectionType',
            MachineType VARCHAR2 (3) PATH 'MachineType',
            MachineName VARCHAR2 (30) PATH 'MachineName',
            OIDT VARCHAR2 (30) PATH 'OID') x;

            2. Insetead of creating the view then Can I do

            insert into STAGING_table_yyy ( col1 ,col2,col3,col4,
            )
            SELECT x."SECTIONTYPE",
            x."MACHINETYPE",
            x."MACHINENAME",
            x."OIDT"
            FROM xml_data t,
            XMLTABLE (
            '/SectionMain'
            PASSING t.xml_col                         <--WHAT WILL PASSING HERE SINCE NO XML_COL
            COLUMNS SectionType VARCHAR2 (30) PATH 'SectionType',
            MachineType VARCHAR2 (3) PATH 'MachineType',
            MachineName VARCHAR2 (30) PATH 'MachineName',
            OIDT VARCHAR2 (30) PATH 'OID') x
            where oid_1 = '99dd48cf-fd1b-46cf-9983';<--VIRTUAL COLUMN


            insert into STAGING_table_yyy ( col1 ,col2,col3

            )
            SELECT x."SectionOID",
            x."EventName",
            x."OIDT"
            FROM xml_data t,
            XMLTABLE (
            '/SectionMain'
            PASSING t.xml_col                         <--WHAT WILL PASSING HERE SINCE NO XML_COL
            COLUMNS SectionOID PATH 'SectionOID',
            EventName VARCHAR2 (30) PATH 'EventName',
            OID VARCHAR2 (30) PATH 'OID',
            ) x
            where oid_1 = '99dd48cf-fd1b-46cf-9983';<--VIRTUAL COLUMN


            Same insert for other tables usind the OID_1 virtual coulmn


            3. Finaly Once done how can I delete the XML document from XML.

            If I am using virtual column then I beleive it will be easy
            DELETE table po_Virtual where oid_1 = '99dd48cf-fd1b-46cf-9983';
            But in case we can not use the Virtual column how we can delete the data

            Thanks in advance
            AnuragT
            • 3. Re: Approach to parse large number of XML files into the relational table.
              Marco Gralike
              PASSING t.xml_col <--WHAT WILL PASSING HERE SINCE NO XML_COL
              In case of using an XMLType Table, instead of an XMLType column reference, you refer to the XMLType content while using the PSEUDO column "OBJECT_VALUE"

              so --> t.object_value
              • 4. Re: Approach to parse large number of XML files into the relational table.
                odie_63
                anuragtandon wrote:
                Example XML
                Difficult to make sense out of the sample XML. Isn't there any root element?

                Looks like you want to load multilevel collections of data into different tables, this might help :
                http://odieweblog.wordpress.com/2012/05/10/how-to-load-xml-data-into-multiple-tables/
                But in case we can not use the Virtual column how we can delete the data
                For example :
                DELETE my_xmltype_table t
                WHERE XMLExists(
                       '/SectionMain[OID=$myoid]' 
                       passing t.object_value
                            , '99dd48cf-fd1b-46cf-9983' as "myoid"
                      ) ;
                • 5. Re: Approach to parse large number of XML files into the relational table.
                  225452
                  Thanks for everone response. So far we did a good progress and will also look for multi insert.
                  At present we create 9 views for 9 tables and performance is good except 1 view. That view is taking much more time and yes
                  that view have most of the records (around 8000).

                  My question is, anyway we can do indexing on that collection(<SectionAllCU>) of XML FILE. It is a simple collection


                  - <DataCollectorJobIDLog>
                       - <DataCollectorStatus>
                            max 1 record
                       </DataCollectorStatus>
                       - <SectionMain>
                            1 record
                       </SectionMain>
                       - <SectionEvent>
                            3000 records
                       </SectionEvent>
                       -SectionAllCU>
                            8000 records --taking 4 to 5 sec to select
                       <SectionAllCU>
                       <AllCUBNCounter>
                            2000 records
                       </AllCUBNCounter>


                  </DataCollectorJobIDLog>
                  total 70,000 Maximum records.

                  Definition of the XML Table is

                  CREATE TABLE XML_DATA (file_nm varchar2(100) primary key,xmldoc XMLTYPE)
                  XMLTYPE xmldoc STORE AS SECUREFILE BINARY XML;

                  Thanks in advance.
                  • 6. Re: Approach to parse large number of XML files into the relational table.
                    odie_63
                    At present we create 9 views for 9 tables and performance is good except 1 view. That view is taking much more time and yes
                    that view have most of the records (around 8000).
                    Could you post view definition and explain plan?

                    What's the performance requirement?
                    My question is, anyway we can do indexing on that collection(<SectionAllCU>) of XML FILE. It is a simple collection
                    As said earlier, I'm not sure you'll gain much by indexing this fragment. The cost of maintaining the index (depending on its type, structured?) is probably close to the cost of running the query without index in the first place (assuming a streaming XPath operation is used).
                    • 7. Re: Approach to parse large number of XML files into the relational table.
                      225452
                      View definition :
                      CREATE OR REPLACE FORCE VIEW L1AXT01.V_ALCU
                      (
                      FILE_NM,
                      SECTIONOID,
                      ROOMNUMBER,
                      OID,
                      ...
                      ...,
                      MACHINETYPE,
                      MACHINENAME,
                      SECTIONINDICATOR,
                      OPMODE
                      )
                      AS
                      SELECT "FILE_NM",
                      "SECTIONOID",
                      "ROOMNUMBER",
                      "OID",
                      ...
                           ...
                      "MACHINETYPE",
                      "MACHINENAME",
                      "SECTIONINDICATOR",
                      "OPMODE"
                      FROM (SELECT T.FILE_NM,
                      ' ' AS SECTIONOID,
                      ' ' AS ROOMNUMBER,
                      OID,
                      ...
                                ...
                      MACHINETYPE,
                      MACHINENAME,
                      SECTIONINDICATOR,
                      OPMODE
                      FROM SIM_XMLFILES T,
                      XMLTABLE (
                      'DataCollectorJobID_AutoStrapLog/SectionAllCU'
                      PASSING T.XMLDOC
                      COLUMNS OID VARCHAR2 (80) PATH 'OID',
                      ...
                                     ...
                      MACHINETYPE VARCHAR2 (80) PATH 'MachineType',
                      MACHINENAME VARCHAR2 (80) PATH 'MachineName',
                      -- SECTIONINDICATOR NUMBER(4,0) PATH 'OID'
                      SECTIONINDICATOR VARCHAR2 (4) PATH 'OID' ) X,
                      XMLTABLE (
                      'DataCollectorJobID_AutoStrapLog/SectionAllCU/Opmode'
                      PASSING T.XMLDOC
                      COLUMNS OPMODE VARCHAR2 (10) PATH 'InputMode') (+) OP);

                      Plan :
                      SELECT STATEMENT ALL_ROWSCost: 4,170,621 Bytes: 3,197,174,886,528 Cardinality: 1,401,040,704
                      5 NESTED LOOPS OUTER Cost: 4,170,621 Bytes: 3,197,174,886,528 Cardinality: 1,401,040,704
                      3 NESTED LOOPS Cost: 518 Bytes: 391,083,840 Cardinality: 171,528
                      1 TABLE ACCESS FULL TABLE L1AXT01.SIM_XMLFILES Cost: 3 Bytes: 47,334 Cardinality: 21
                      2 XPATH EVALUATION
                      4 XPATH EVALUATION

                      All 9 tables should be inserted in less then 2 secs.
                      • 8. Re: Approach to parse large number of XML files into the relational table.
                        odie_63
                        Please use &#x7B;code} tags when posting formatted code snippets, otherwise it's not readable.

                        Because of the lack of useful information, I can only give general directions.
                        In this case I'd probably try this approach instead :
                        SELECT T.FILE_NM
                             , x.OID
                             , x.MACHINETYPE
                             , x.MACHINENAME
                             , op.OPMODE
                        FROM SIM_XMLFILES T
                           , XMLTABLE (
                               '/DataCollectorJobID_AutoStrapLog/SectionAllCU'
                               PASSING T.XMLDOC
                               COLUMNS OID              VARCHAR2(80) PATH 'OID'
                                     , MACHINETYPE      VARCHAR2(80) PATH 'MachineType'
                                     , MACHINENAME      VARCHAR2(80) PATH 'MachineName'
                                     , opmodes          xmltype      PATH 'Opmode' 
                             ) X
                           , XMLTABLE (
                               '/Opmode'
                               PASSING x.opmodes
                               COLUMNS OPMODE           VARCHAR2(10) PATH 'InputMode'
                             ) (+) OP
                        ;
                        • 9. Re: Approach to parse large number of XML files into the relational table.
                          225452
                          Thank you, That helped a lot and it is taking no time any more (<1 sec). If you have any tip then please let me know for better performance. Thanks.
                          • 10. Re: Approach to parse large number of XML files into the relational table.
                            225452
                            Thanks everone help. Approach of XML DB is working in POC with no performance iss and expected SLA. Thnaks
                            • 11. Re: Approach to parse large number of XML files into the relational table.
                              225452
                              I am running with the simple problem. How can I get ONLY the first <SectionMain> only when sometimes we have more then one <SectionMain>.
                              <Node>
                              <SectionMain>
                                  <MachineType>CP</MachineType> 
                                  <MachineName>CP_225</MachineName> 
                                </SectionMain>
                              <SectionMain>
                                       <MachineType>CP</MachineType> 
                                  <MachineName>CP_444</MachineName> 
                              </SectionMain>
                                   <SectionEvent>
                                  <SectionOID>99dd48cf-2</SectionOID> 
                                  <EventName>CP.CP_225.Shredder</EventName> 
                                  <OID>b3dd48cf-532d-4126-92d2</OID> 
                              </SectionEvent>
                              </node>
                              Existing code is similar to below
                              SELECT 
                                     x.MACHINETYPE
                                   , x.MACHINENAME
                              FROM XMLFILES T
                                 , XMLTABLE (
                                     '/Node/SectionMain'
                                     PASSING T.XMLDOC
                                     COLUMNS 
                                           MACHINETYPE      VARCHAR2(80) PATH 'MachineType'
                                           , MACHINENAME      VARCHAR2(80) PATH 'MachineName'
                                           
                                   ) X;
                              • 12. Re: Approach to parse large number of XML files into the relational table.
                                odie_63
                                With a positional predicate :
                                 XMLTABLE (
                                       '/Node/SectionMain[1]'
                                 ...