5 Replies Latest reply on Aug 1, 2008 3:18 PM by alwu-Oracle

    Difference between RDF loading methods

      I just started loading some RDF data into Oracle 11g. Initially I have tried the Jena Adaptor to load my data. These are the Java commands I use:

      Model model = ModelOracleSem.createOracleSemModel(oracle, "graph1");
      model.read(new InputStreamReader(fis), "", getFileType(filename));

      When I execute this code, Oracle creates 2 tables: graph1_tpl and graph1_ns. Obviously the tpl table contains the triples and the ns table contains the namespace information.

      When I follow the SQL*Loader instructions outlined in the Oracle-By-Example (OBE) instructions, Oracle only creates one table to hold the triple data.

      Why do these two methods produce difference results? It looks like the Oracle-By-Example method does not create the ns table.  Are there stored procedures I should run after using the SQL*Loader to create the ns table? Are there any other steps I need to perform after using SQL*Loader? Do I need to manually create an index?

      One further question, what indices should I create on the RDF triple tables?

        • 1. Re: Difference between RDF loading methods
          When a semantic network is created, the indexes on the underlying storage are created as part of that operation. Afterwards, during bulk-load and other load operations, the indexes are automatically maintained.

          In order to create a model using Oracle SEM_APIS, you need to do first create an application table with an SDO_RDF_TRIPLE_S type column and then create an RDF model associated with that column of the application table. For example,

          create table nci_rdf_data
          (id number,
          triple sdo_rdf_triple_s);

          execute sem_apis.create_sem_model(NCI, 'nci_rdf_data', 'triple');

          The Jena adaptor was created later and provided a slightly different interface.

          Invoking sqlldr is just for loading the staging table. Once the staging table has been loaded, you need to invoke the SEM_APIS.BULK_LOAD_FROM_STAGING_TABLE routine to load the data from the staging table into the Semantic Store and the application table.

          The bulk-load API automatically populates the indexes on the internal storage tables. User may optionally create custom indexes on the application table, if necessary. (Please note that presence of indexes on application table during large bulk-load slows down the bulk-load operation due to incremental index maintenance.)
          • 2. Re: Difference between RDF loading methods
            Thanks for the information. Next question- what is the fastest method for loading RDF data into Oracle 11g? I've used Jena and the SQL*Loader approach. Is there a faster method? If the SQL*Loader is the fastest, can I load data in parallel by using several SQL*Loader staging tables for each model I am loading?
            • 3. Re: Difference between RDF loading methods
              Using SQL*Loader to load the staging table and then using the bulk-load API is the fastest method when you are loading large number of triples.

              Loading into N different target models, in parallel, from N different staging tables is going to be faster than loading them one after another.

              Loading into a single target model, in parallel, from N different staging tables is not always going to be faster, particularly if the target model is empty to start with. This is caused by the performance difference between bottom-up index build and incremental index maintenance.

              Not sure if this answers your question.
              • 4. Re: Difference between RDF loading methods
                Actually that does answer my question. Thanks.

                I've been running a process that loads data into a model. I have used the SQL*Loader and bulk-load API. I've loaded 60 million triples in about 20 hours. I don't think that is very fast.

                The Oracle database is a "vanilla" installation. I just used the Oracle Database Configuration Assistant. I found your earlier post about speeding up the loading process by changing the sga_target, pga_aggregate_target, etc. I will modify those values and see if that helps. Please let me know if there are any other "standard" settings to use. I'm not a DBA so I just went with the default settings.

                Another wrinkle with this process is I'm running the Oracle server on a virtual Linux server. My virtual machine has 4 GB of memory. I am going to restart the virtual machine with 8GB. We don't have an extra machine to host Linux. If you think virtualization is leading to poor performance, we could justify buying another machine. What should be the "average" load time for loading triples into Oracle?

                • 5. Re: Difference between RDF loading methods
                  To answer your question on the differences in using Oracle Jena Adaptor vs sql approach.

                  Oracle Jena Adaptor will create a few additional tables when a new RDF model is created using the provided Java API. Also, an additional index on the application table will be created to facilitate triple deletes. These are mainly for addressing common application development needs.

                  For example, the additional index on the application table is necessary for Top Braid Composer to remove a class definition triple from your ontology.

                  Some default names for those additional tables and the index are used, if not specified.
                  Javadoc has detailed information.

                  Regular SQL approach will not do the above additional steps. You will have to do it by yourself if there is a need.