For simplicity, think of the result set as a simple table:

Term  | doc id | snippet (if possible)
Hello | doc1   | <b>Hello</b> <b>World</b>
What I have tried to date:
Trial 1: raw XML exactly as it is in the database.
Trial 2: XML flattened to raw text, removing unnecessary nodes to make the doc text as small as possible (one CLOB of text).
I use two PL/SQL tables of records, one for the 300 terms and one for the 600 flattened XML docs, populated with bulk collect.
Two loops that do nothing but iterate through the two collections run in 7 seconds.
As soon as I add the policy_snippet call, even doing nothing with the results, it takes three and a half minutes.
Replacing the policy_snippet call with policy_highlight takes even longer.
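For reference, the baseline I'm timing looks roughly like this (table and column names here are my own placeholders, not a definitive implementation):

```sql
declare
  type term_tab is table of varchar2(200);
  type doc_tab  is table of clob;
  l_terms term_tab;
  l_docs  doc_tab;
begin
  select search_term bulk collect into l_terms from term_table;     -- ~300 rows
  select flat_text   bulk collect into l_docs  from flattened_docs; -- ~600 rows

  for t in 1 .. l_terms.count loop
    for d in 1 .. l_docs.count loop
      null;  -- bare iteration: ~7s; a ctx_doc.policy_snippet call here: ~3.5 min
    end loop;
  end loop;
end;
/
```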
Best results so far:
-Create a GUID-keyed table to store the hit results.
-Create a table to hold the flattened text from all the docs (if I move forward with this, I'm thinking a fast-refresh materialized view).
-Pre-stage the terms in a global temp table (or an in-memory PL/SQL table; no real difference in performance).
-Create a snippet procedure that loops through the 600 flattened docs from the flattened table above and looks for hits using policy_snippet. If a hit is found, insert it into the GUID table for selecting later.
-Loop through the 300 terms and for each one use dbms_scheduler to call the snippet procedure.
-Wait for all jobs to complete, pull results from the guid-based results table.
-Clean up the guid table for this run.
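A sketch of the scheduler fan-out described above; snippet_proc and the run-GUID plumbing are hypothetical names, and the wait loop is deliberately crude:

```sql
declare
  l_running number;
begin
  for t in (select search_term from term_table) loop
    dbms_scheduler.create_job(
      job_name   => dbms_scheduler.generate_job_name('SNIP'),
      job_type   => 'PLSQL_BLOCK',
      job_action => 'begin snippet_proc(''' ||
                    replace(t.search_term, '''', '''''') || '''); end;',
      enabled    => true,
      auto_drop  => true);   -- job disappears once it completes
  end loop;

  loop  -- wait for all jobs to finish before reading the GUID results table
    select count(*) into l_running
      from user_scheduler_jobs
     where job_name like 'SNIP%';
    exit when l_running = 0;
    dbms_session.sleep(1);   -- dbms_lock.sleep on older releases
  end loop;
end;
/
```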
Next best performance:
-Create one HUGE XML doc using the 600 individual docs as fragments. Make 300 calls to ctx_doc.markup (once for each term), then use XMLTABLE/XQUERY to go looking for the hits in the marked up result.
-Separate table (using nologging SECUREFILE CLOBS CACHED) with its own text index (set up nologging) to stage/load the 600 docs in and run the 300 terms against it. Thinking the smaller text index would make it faster (it does… just not fast enough).
Tried with raw XML and flattened text.
-Combinations of pl/sql tables and/or global temporary tables and the text policy functions/procedures.
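The markup-then-XQuery variant above would look something like this. All names are hypothetical, and it assumes each of the 600 fragments is wrapped in an identifiable element (here <doc id="...">) inside the one big indexed document:

```sql
declare
  l_marked clob;
begin
  for t in (select search_term from term_table) loop
    ctx_doc.markup(index_name => 'BIG_DOC_IDX',   -- index on the combined doc
                   textkey    => 'BIGDOC1',
                   text_query => t.search_term,
                   restab     => l_marked,
                   starttag   => '<hit>',
                   endtag     => '</hit>');
    -- pull out the ids of the fragments that contain at least one hit
    for h in (select x.doc_id
                from xmltable('//doc[.//hit]'
                       passing xmltype(l_marked)
                       columns doc_id varchar2(30) path '@id') x) loop
      null;  -- record the hit for this term/doc pair
    end loop;
  end loop;
end;
/
```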
Unfortunately, they are not pre-defined. My users manually group the documents and add search terms to something like a rolodex; that is the glue that binds the docs together.
For example: Read doc1, extract terms (index it). Store the 'important' terms. Use those terms to find more related docs. Repeat until there are no more related documents.
Now, I need to take all the terms and all the related documents and produce a nice 'master' document showing all the relationships.
To make it worse, the terms can have double-ended wildcards, e.g. %world%. That rules CTXRULE out. I have an enhancement request in to at least allow right-side wildcards on a CTXRULE index (I believe it has been approved for a future release).
Trust me, CTXRULE would be a really good fit for me. It's just not flexible enough yet.
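For context, the CTXRULE shape that would fit this problem is routing each incoming document against the stored terms with MATCHES (hypothetical names below); it just can't express rules like %world% today:

```sql
create table term_rules (term_id number, rule varchar2(4000));

create index term_rules_idx on term_rules(rule)
  indextype is ctxsys.ctxrule;

-- find every stored term that matches an incoming document
select term_id
  from term_rules
 where matches(rule, :flattened_doc_text) > 0;
```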
For others that might have the same issue here is what I ended up with (Thanks to Roger Ford):
I flatten my target XML docs as before, removing all unnecessary nodes/text making the text to be indexed as small as possible.
I insert them into a staging table keyed by a GUID and doc id for the run. This staging table has a Text index (nologging) on it.
Then I loop through all my terms using a contains query, and for the rows that match, call snippet:
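Roughly, the staging table and its index look like this (names chosen to match the code below; check the SecureFile LOB options against your release):

```sql
create table staged_docs (
  guid_key varchar2(32) not null,   -- one GUID per run
  docid    varchar2(30) not null,
  doc_col  clob,
  constraint staged_docs_pk primary key (guid_key, docid)
) nologging
  lob (doc_col) store as securefile (cache nologging);

create index stage_table_idx on staged_docs(doc_col)
  indextype is ctxsys.context;

-- a CONTEXT index only sees rows inserted after the last sync
exec ctx_ddl.sync_index('STAGE_TABLE_IDX');
```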
for i in (select search_term from term_table) loop
  for q in (select docid from staged_docs
            where guid_key = generated_guid and contains(doc_col, i.search_term) > 0) loop
    snippet := ctx_doc.snippet(index_name => 'STAGE_TABLE_IDX',
                               textkey    => ctx_doc.pkencode(generated_guid, q.docid),
                               text_query => i.search_term,
                               starttag => '--!MY-HIT!--', endtag => '--!MY-HIT!--',
                               separator => '...');
    -- store snippet for later processing
  end loop;
end loop;