2 Replies Latest reply on Jun 6, 2017 6:14 PM by 3054700

    Parallel loading of semantic data

    3054700

      Hello,

       

      I wrote an application that automates the process of generating an RDF model from relational data.

       

      Here are the basic steps it does:

       

      - The whole process is divided into incremental steps, let's say 1..N

      - For 1..N, do:

      1) Generate R2RML

      2) Create a virtual model (i.e. sem_apis.create_rdfview_model) from R2RML mapping

      3) Export the virtual model (i.e. sem_apis.export_rdfview_model) into a staging table

      4) Bulk load (i.e. sem_apis.bulk_load_from_staging_table) into RDF model

       

      The whole process (1..N) takes almost 10 hours and uses 100% of one CPU. The current machine has 8 CPUs available, so I intend to use them to speed up the generation. As far as I understand from the documentation, only step 4 (bulk loading) can use parallel processing.

       

      Thus I would like to use threads to process the increments 1..N in parallel (each thread would perform steps 1, 2, 3 and 4 for one increment and then finish).

       

      It seems to me that there wouldn't be a problem for steps 1, 2 and 3, since each thread would use a separate staging table. For step 4 (bulk load), I'm not sure it is safe to run simultaneous bulk loads into the same model. Can anyone confirm?

       

      Thank you!

      Fred

       


        • 1. Re: Parallel loading of semantic data
          Sdas-Oracle

          Hi Fred,

           

          A few questions for you:

          1) Roughly how much data (in total number of triples) do you expect to generate using the series of sem_apis.export_rdfview_model calls?

          2) Do you have any CLOBs or geospatial data?

          3) Are there quads or just triples?

          4) Do you expect lots of duplicate triples (more than a million or so) if we combine all the exported data under one database view object and provide that view as the input row source to sem_apis.bulk_load_from_staging_table?

           

           

          I would suggest doing steps 3 and 4 as follows:

           

          Step 3:

           

          ---------

          alter session force parallel dml parallel <degree>;

          alter session force parallel ddl parallel <degree>;

          alter session force parallel query parallel <degree>;

           

          exec sem_apis.export_rdfview_model(...);

           

          Step 4 (do it outside the loop: a single bulk load of the combined data avoids the overhead of incremental index maintenance):

          ---------

          create a view, say stage_view, as UNION ALL of all the staging tables where the RDF triples were exported to.
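
          A minimal sketch of such a view, assuming three staging tables with the hypothetical names stage_1, stage_2 and stage_3, and assuming they use the usual staging-table columns RDF$STC_sub, RDF$STC_pred and RDF$STC_obj (check the tables that export_rdfview_model actually produced before relying on this):

          ```sql
          -- Hypothetical staging table names; one SELECT per exported staging table.
          -- Column names are assumed to follow the standard staging-table layout.
          CREATE OR REPLACE VIEW stage_view AS
            SELECT RDF$STC_sub, RDF$STC_pred, RDF$STC_obj FROM stage_1
            UNION ALL
            SELECT RDF$STC_sub, RDF$STC_pred, RDF$STC_obj FROM stage_2
            UNION ALL
            SELECT RDF$STC_sub, RDF$STC_pred, RDF$STC_obj FROM stage_3;
          ```

          UNION ALL is used rather than UNION so that the view does no duplicate elimination of its own; duplicates are left to the bulk load.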

           

          alter session force parallel dml parallel <degree>;

          alter session force parallel ddl parallel <degree>;

          alter session force parallel query parallel <degree>;

           

          -- load from stage_view

          -- if you expect lots of duplicate triples in the stage_view row source, say more than a million duplicates,

          -- then add the following in the flags string below: DEL_BATCH_DUPS=USE_INSERT

          exec sem_apis.bulk_load_from_staging_table(..., flags=>' PARSE MBV_METHOD=SHADOW PARALLEL=<degree> ');

           

          If you have any problems, feel free to contact me directly: souripriya dot das at oracle dot com.

           

          Thanks,

          - Souri.

          • 2. Re: Parallel loading of semantic data
            3054700

            Hi Souri, thank you for your help.

             

            1) We generate a total of about 140 million triples

            2) There are no CLOBs or geospatial data in this dataset (but there might be in the future)

            3) Just triples

            4) I'm not sure about the number of duplicates but I'd say far less than 1 million.

             

            I'm running some tests here. I got an ORA-13199: Insufficient privilege for using MBV_METHOD=SHADOW option. What privilege is needed?