9 Replies Latest reply on Jul 29, 2010 8:56 PM by Matperry-Oracle

    Queries on Large Datasets Not Returning

    784525
      Hi,
      I'm running a bunch of queries against models in Oracle and they are taking over an hour to return. I've had to kill the processes, so I don't even know if they return. I was hoping you could take a look at the queries and explain if I'm doing something wrong or why Oracle might be taking so long.

      The technical details:
      I've installed Oracle 11.2.0 Enterprise on a VirtualBox machine running Ubuntu 10.04 64-bit server edition. It's able to handle 8 other queries I've thrown at it just fine. It will even return the correct results for these queries against a 10,000 triple model without too much of a delay.

      I'm using the latest Oracle Jena, and submitting the queries using Jena and the ARQ apis. (Source included below).

      I've run these queries against models with 50,000 triples, 250,000 triples, 1 million triples, and 5 million triples. All of them take longer than 1 hour to complete. I let the 1 million and 5 million run for 2 hours before quitting.

      The model is populated with data from the SP2 benchmark. [http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B] The benchmark comes with a rdf generator, which produces n-triples files which I loaded into my model by hand using SQL*Loader to get the triples into a staging table and SEM_APIS.BULK_LOAD_FROM_STAGING_TABLE to populate the model.

      Any advice you could offer as to why this is taking so long or what I can do to diagnose would be greatly appreciated, thanks!



      Here are the SPARQL queries (also included in my source code below):

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX dcterms: <http://purl.org/dc/terms/>

      SELECT ?yr ?name ?document
      WHERE {
      ?class rdfs:subClassOf foaf:Document .
      ?document rdf:type ?class .
      ?document dcterms:issued ?yr .
      ?document dc:creator ?author .
      ?author foaf:name ?name
      OPTIONAL {
      ?class2 rdfs:subClassOf foaf:Document .
      ?document2 rdf:type ?class2 .
      ?document2 dcterms:issued ?yr2 .
      ?document2 dc:creator ?author2
      FILTER (?author=?author2 && ?yr2<?yr)
      } FILTER (!bound(?author2))
      }

      and

      PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      PREFIX foaf: <http://xmlns.com/foaf/0.1/>
      PREFIX dc: <http://purl.org/dc/elements/1.1/>
      PREFIX dcterms: <http://purl.org/dc/terms/>

      SELECT DISTINCT ?title
      WHERE {
      ?class rdfs:subClassOf foaf:Document .
      ?doc rdf:type ?class .
      ?doc dc:title ?title .
      ?bag2 ?member2 ?doc .
      ?doc2 dcterms:references ?bag2
      OPTIONAL {
      ?class3 rdfs:subClassOf foaf:Document .
      ?doc3 rdf:type ?class3 .
      ?doc3 dcterms:references ?bag3 .
      ?bag3 ?member3 ?doc
      OPTIONAL {
      ?class4 rdfs:subClassOf foaf:Document .
      ?doc4 rdf:type ?class4 .
      ?doc4 dcterms:references ?bag4 .
      ?bag4 ?member4 ?doc3
      } FILTER (!bound(?doc4))
      } FILTER (!bound(?doc3))
      }

      Here's the code I am using:


      import com.hp.hpl.jena.query.*;
      import java.util.concurrent.*;
      import oracle.spatial.rdf.client.jena.*;

      /**
      * javac -cp .:lib/sdordfclient-11.2.jar:lib/arq-2.8.1.jar:lib/jena-2.6.2.jar:lib/ojdbc6-11.2.0.1.0.jar SparqlQueryOracle.java
      *
      * java -cp .:lib/sdordfclient-11.2.jar:lib/arq-2.8.1.jar:lib/jena-2.6.2.jar:lib/ojdbc6-11.2.0.1.0.jar:lib/slf4j-api-1.5.6.jar:lib/slf4j-log4j-12-1.5.6.jar:lib/log4j-1.2.13.jar:lib/xercesImpl-1.2.jar:lib/iri-0.7.jar:lib/icu4j-3.4.4.jar SparqlQueryOracle 192.168.16.11 rdfuser rdfuser spsquared 0
      */

      public class SparqlQueryOracle {

      public static String[] queries = new String[] {
           "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>" +
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/>" +
                "PREFIX dc: <http://purl.org/dc/elements/1.1/>" +
                "PREFIX dcterms: <http://purl.org/dc/terms/>" +
                "" +
                "SELECT ?yr ?name ?document" +
                "WHERE {" +
                " ?class rdfs:subClassOf foaf:Document ." +
                " ?document rdf:type ?class ." +
                " ?document dcterms:issued ?yr ." +
                " ?document dc:creator ?author ." +
                " ?author foaf:name ?name" +
                " OPTIONAL {" +
                " ?class2 rdfs:subClassOf foaf:Document ." +
                " ?document2 rdf:type ?class2 ." +
                " ?document2 dcterms:issued ?yr2 ." +
                " ?document2 dc:creator ?author2 " +
                " FILTER (?author=?author2 && ?yr2<?yr)" +
                " } FILTER (!bound(?author2))" +
                "}",

                "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>" +
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/>" +
                "PREFIX dc: <http://purl.org/dc/elements/1.1/>" +
                "PREFIX dcterms: <http://purl.org/dc/terms/>" +
                "" +
                "SELECT DISTINCT ?title" +
                "WHERE {" +
                " ?class rdfs:subClassOf foaf:Document ." +
                " ?doc rdf:type ?class ." +
                " ?doc dc:title ?title ." +
                " ?bag2 ?member2 ?doc ." +
                " ?doc2 dcterms:references ?bag2" +
                " OPTIONAL {" +
                " ?class3 rdfs:subClassOf foaf:Document ." +
                " ?doc3 rdf:type ?class3 ." +
                " ?doc3 dcterms:references ?bag3 ." +
                " ?bag3 ?member3 ?doc" +
                " OPTIONAL {" +
                " ?class4 rdfs:subClassOf foaf:Document ." +
                " ?doc4 rdf:type ?class4 ." +
                " ?doc4 dcterms:references ?bag4 ." +
                " ?bag4 ?member4 ?doc3" +
                " } FILTER (!bound(?doc4))" +
                " } FILTER (!bound(?doc3))" +
                "}"
           };

      public static void main(String[] args) throws Exception {
           if(args.length != 5) {
           System.out.println("usage: SparqlQueryJava host user pass graph query-index");
           System.exit(1);
           }

           String host = args[0];
           String user = args[1];
           String password = args[2];
           
           String graphName = args[3];
           int index = Integer.parseInt(args[4]);     

           Oracle oracle = new Oracle("jdbc:oracle:thin:@//" + host + ":1521/tseval", user, password);
           GraphOracleSem g = new GraphOracleSem(oracle, graphName);
           ModelOracleSem model = new ModelOracleSem(g);

           Query query = QueryFactory.create(queries[index]);
           QueryExecution queryExecution = QueryExecutionFactory.create(query, model);
           ResultSet results = queryExecution.execSelect();

           int count = 0;
           while(results.hasNext()) {
           count++;
           results.next();
           }

           System.out.println("Query " + index + " returned " + count + " results");
      }

      }
        • 1. Re: Queries on Large Datasets Not Returning
          Matperry-Oracle
          Hi,

          Have you applied the latest Semantics patch for 11g Release2? It has some performance improvements.

          Patch 9819833: Semantic Technologies 11G R2 Fix Bundle 2

          This patch is available on My Oracle Support. It was released in early June 2010.

          Thanks,
          Matt
          • 2. Re: Queries on Large Datasets Not Returning
            784525
            I have installed a patch that we received from Oracle. I should have mentioned that.

            running this query (from the verification section of the patch installation instructions):

            select * from MDSYS.RDF_PARAMETER where namespace='MDSYS';

            returns

            NAMESPACE ATTRIBUTE VALUE DESCRIPTION
            --------------------------------------------------------------
            MDSYS 112_P2 SEM_VERSION VALID
            • 3. Re: Queries on Large Datasets Not Returning
              Matperry-Oracle
              Please make sure to gather statistics for the RDF Network.

              Usually multi-column statistics for P,S and P,C column groups helps. Please run the following PL/SQL procedures for your models.

              -- PC
              exec sem_apis.analyze_model('my_model', METHOD_OPT =>'FOR COLUMNS (P_VALUE_ID,CANON_END_NODE_ID) SIZE AUTO');
              -- PS
              exec sem_apis.analyze_model('my_model', METHOD_OPT =>'FOR COLUMNS (P_VALUE_ID,START_NODE_ID) SIZE AUTO');

              Then, be sure to run gather_stats for the whole network:

              exec sem_perf.gather_stats;

              Also, it seems that the queries can be modified a bit to help with performance.

              For the following query

              PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
              PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
              PREFIX foaf: <http://xmlns.com/foaf/0.1/>
              PREFIX dc: <http://purl.org/dc/elements/1.1/>
              PREFIX dcterms: <http://purl.org/dc/terms/>

              SELECT ?yr ?name ?document
              WHERE {
              ?class rdfs:subClassOf foaf:Document .
              ?document rdf:type ?class .
              ?document dcterms:issued ?yr .
              ?document dc:creator ?author .
              ?author foaf:name ?name
              OPTIONAL {
              ?class2 rdfs:subClassOf foaf:Document .
              ?document2 rdf:type ?class2 .
              ?document2 dcterms:issued ?yr2 .
              ?document2 dc:creator ?author2
              FILTER (?author=?author2 && ?yr2<?yr)
              } FILTER (!bound(?author2))
              }

              There are no common variables between the root BGP and the OPTIONAL BGP. The only join condition comes from the FILTER clause. This will give a cartesian join followed by a filter. It would be better to use ?author in the OPTIONAL BGP instead of FILTER (?author=?author2) and then use a different variable for the !bound FILTER. For example,

              PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
              PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
              PREFIX foaf: <http://xmlns.com/foaf/0.1/>
              PREFIX dc: <http://purl.org/dc/elements/1.1/>
              PREFIX dcterms: <http://purl.org/dc/terms/>

              SELECT ?yr ?name ?document
              WHERE {
              ?class rdfs:subClassOf foaf:Document .
              ?document rdf:type ?class .
              ?document dcterms:issued ?yr .
              ?document dc:creator ?author .
              ?author foaf:name ?name
              OPTIONAL {
              ?class2 rdfs:subClassOf foaf:Document .
              ?document2 rdf:type ?class2 .
              ?document2 dcterms:issued ?yr2 .
              ?document2 dc:creator ?author
              FILTER (?yr2<?yr)
              } FILTER (!bound(?document2))
              }
              • 4. Re: Queries on Large Datasets Not Returning
                766393
                Hi,

                Sorry to take advantage of this thread to ask further questions, but, why did you specifically advise nyoung to gather stats on the PC and PS column groups instead of using the default "FOR ALL COLUMNS SIZE AUTO" ?
                Another related question : How would you run these from Jena GraphOracleSem.analyze() ? If I want to specify METHOD_OPT, GraphOracleSem.analyze asks for a 'float estimatePercent' which prevents me from letting Oracle determine it by using the default DBMS_STATS.AUTO_SAMPLE_SIZE.

                Regards,
                Julien
                • 5. Re: Queries on Large Datasets Not Returning
                  Matperry-Oracle
                  Hi Julien,

                  Multi-column statistics provide more detailed information to the query optimizer. You can read about them here: http://download.oracle.com/docs/cd/E11882_01/server.112/e10821/stats.htm#PFGRF10131

                  When executing a SPARQL query, we often access our indexes (i.e. PCSM) with 2 columns. For example, give me all the subjects for a fixed predicate and object. Given the default index setup (PCSM and PSCM) and typical access patterns, it makes sense to gather statistics for the PC and PS column groups. We have seen that the query optimizer usually does a better job when we explicitly create these column groups.

                  Regards,
                  Matt
                  • 6. Re: Queries on Large Datasets Not Returning
                    715399
                    Hi Julien,

                    To answer your second question: the value of the DBMS_STATS.AUTO_SAMPLE_SIZE constant is 0, so you can just pass in that.

                    Regards,
                    Vladimir
                    • 7. Re: Queries on Large Datasets Not Returning
                      766393
                      Hi,

                      Thanks for these details.
                      Regarding the default value at 0, I would have tried something like this but the javadoc specifies a range of [0.000001,100], so it seemed that the method needed a more meaningful value. I will try with 0.

                      To come back to nyoung first question regarding performance, it seems that the idea is to run the SP² benchmark on an Oracle RDF database. I don't think that modifying the queries to optimize them really is an option in this case.

                      Regards,
                      Julien
                      • 8. Re: Queries on Large Datasets Not Returning
                        784525
                        Thanks for the advice running sem_apis.analyze_model and sem_perf.gather_stats. I've run all three of the procedures you suggested and the times only decreased incrementally. Against 50,000 triples, the queries run pretty consistently under a half hour, but against 250,000 and higher the queries are still taking longer than an hour to complete.

                        Julien is right in that I am running the SP2 benchmark, so in this particular case, modifying the queries isn't really an option, but it is certainly instructive to know what types of queries Oracle handles best and what SPARQL constructs improve (or decrease) query performance.

                        I was really hoping that there was something obvious in my setup that I had missed that would yield dramatic performance improvements for these two queries.

                        If that's not the case, that's certainly fine, but I wanted to check with people who knew first. If you have any other suggestions, I'd love to try them.
                        • 9. Re: Queries on Large Datasets Not Returning
                          Matperry-Oracle
                          Please email me directly at matthew dot perry at oracle dot com. I would like to know a little more about your setup (hardware, database settings, number of models in the semantic network, etc.).