This discussion is archived
4 Replies Latest reply: Nov 16, 2011 1:06 PM by matperry RSS

TextContains Optimization

900576 Newbie
Currently Being Moderated
Hello,

I am looking into the sparql function orardf:textContains, and I see it's optimization and use of indexes over the RDF_VALUE$ table over regex. However, I am looking at using textContains for it's search capabilities over all of the values in the model.

The sparql/query that I am attempting to execute is taking about 3 minutes to return (we have about 65million triples in the model). I am attempting to get the results from the orardf:textContains to drive the triple t0, and then get the displayLabel of ?s with the triple t1. I also have PSCM, PCSM, and CM (thought being the CANON_END_NODE_ID is coming back in ?o) indexes on the model that we are querying against.


select s, l
FROM TABLE (SEM_MATCH(
'{
?s ?p ?o .
?s :displayLabel ?l .
filter( orardf:textContains(?o, "Bob") )
}',
SEM_MODELS('test_model'),
null,
SEM_ALIASES(
SEM_ALIAS('','http://ontology#namespace')
),null,null,
'HINT0={ LEADING(?o t0 t1) }'
));

The current size of the RDF_VALUE$ table is around 26million, and we have the CONTEXT index on that table.

Running the following query only takes milliseconds (granted, i'm not joining to the rdf_link$ table and back to the rdf_value$ table to get ?s and ?l):

select * from mdsys.rdf_value$ where contains(vname_prefix, 'Bob', 1) > 0 and value_type = 'TL';


Is there an index that we are missing? or a HINT0 option that we need to specify? Any guidance would be appreciated.


Thanks
-MichaelB

Edited by: 897573 on Nov 16, 2011 10:48 AM
  • 1. Re: TextContains Optimization
    matperry Journeyer
    Currently Being Moderated
    Hi Michael,

    I think the problem is with the ?s ?p ?o triple pattern.

    The default PCSM and PSCM indexes are not the best for an unbound predicate, so the join with t0 after getting values for ?o from RDF_VALUE$ is probably a hash join instead of a nested loop join. You have the right idea with the CM index, but we will need all the columns to make the index-based NLJ possible (otherwise we have to do a table access to pick up values for S and P), so I would suggest that you try a CPSM index. Also, remember to rebuild the index with SEM_APIS.ALTER_SEM_INDEX_ON_MODEL after you create it so that it is usable.

    The HINT0 hint is exactly what it should be, so you can leave this as it is.

    Hope this helps,
    - Matt
  • 2. Re: TextContains Optimization
    900576 Newbie
    Currently Being Moderated
    Thank Matt,

    I updated our model to have a usable CPSM index. However the query still took the same amount of time. Here is the explain plan that was generated from the execution. Any other options?

    | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart | Pstop |
    -----------------------------------------------------------------------------------------------------------------------------------
    |0|SELECT STATEMENT| 2900 | 549K| | 187K (3)| 00:37:35 | | |
    |1|NESTED LOOPS| | 2900 | 549K| | 187K (3)| 00:37:35 | | |
    |2|NESTED LOOPS| | 2900 | 430K| | 182K (3)| 00:36:26 | | |
    |* 3|HASH JOIN| | 2900 | 311K| | 176K (3)| 00:35:16 | | |
    | 4|TABLE ACCESS BY INDEX ROWID| RDF_VALUE$ | 279 | 13950 | | 113 (0)| 00:00:02 | | |
    |* 5|DOMAIN INDEX | RDF_V$TEXT_IDX | | | | 53 (0)| 00:00:01 | | |
    |* 6 | HASH JOIN | | 4973K| 284M| 2187M| 176K (3)| 00:35:14 | | |
    | 7 |PARTITION LIST SINGLE | | 63M| 1458M| | 64559 (5)| 00:12:55 | KEY | KEY |
    |* 8 | TABLE ACCESS FULL | RDF_LINK$ | 63M| 1458M| | 64559 (5)| 00:12:55 | 8 | 8 |
    | 9 | PARTITION LIST SINGLE | | 255K| 8994K| | 1306 (1)| 00:00:16 | KEY | KEY |
    |* 10 | INDEX RANGE SCAN | RDF_LNK_PVIDCENSNMID_IDX | 255K| 8994K| | 1306 (1)| 00:00:16 | 8 | 8 |
    | 11 | TABLE ACCESS BY INDEX ROWID | RDF_VALUE$ | 1 | 42 | | 2 (0)| 00:00:01 | | |
    |* 12 | INDEX UNIQUE SCAN | C_PK_VID | 1 | | | 1 (0)| 00:00:01 | | |
    | 13 | TABLE ACCESS BY INDEX ROWID | RDF_VALUE$ | 1 | 42 | | 2 (0)| 00:00:01 | | |
    |* 14 | INDEX UNIQUE SCAN | C_PK_VID | 1 | | | 1 (0)| 00:00:01 | | |



    Predicate Information (identified by operation id):
    ---------------------------------------------------

    3 - access("V3"."VALUE_ID"="CANON_END_NODE_ID")
    5 - access("CTXSYS"."CONTAINS"("V3"."VNAME_PREFIX",'Bob')>0)
    6 - access("START_NODE_ID"="START_NODE_ID")
    8 - filter("CANON_END_NODE_ID">0 AND "START_NODE_ID">0)
    10 - access("P_VALUE_ID"=6080591446605470747 AND "CANON_END_NODE_ID">0 AND "START_NODE_ID">0 AND "CANON_END_NODE_ID" IS
    NOT NULL)
    filter("START_NODE_ID">0)
    12 - access("START_NODE_ID"="V0"."VALUE_ID")
    14 - access("CANON_END_NODE_ID"="V1"."VALUE_ID")
  • 3. Re: TextContains Optimization
    900576 Newbie
    Currently Being Moderated
    So I was playing around with the query a little more and removed the HINT0, and it started using the CPSM index and ran at a much better speed.

    I agree that the CPSM index makes the most sense, however it looks like the optimizer was able to make it faster than specifying the order with the LEADING hint.
  • 4. Re: TextContains Optimization
    matperry Journeyer
    Currently Being Moderated
    Hi Michael,

    I'm glad the CPSM index eventually worked.

    - Matt

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points