Environment: Oracle Enterprise 18.104.22.168
Have been planning/testing 22.214.171.124 and can likely move to that soon if necessary.
Overall performance is the key here. A web page is the end recipient of this information, so it can't wait forever.
I have tried every piece of magic I know and my best time is just over two and a half minutes against the test database.
What is the world’s opinion on the absolute fastest way to pull this off?
I have a table with about 3 million XML docs in it (Current test database has 19,000 rows). There is a Text index on the XML column. Average XML doc size 6K.
I have a known subset of 600 XML docs in a group.
I have 300 individual search terms that I need to run against those known 600 docs.
These counts will vary based on pre-defined groups: one group can have 1000 docs with 500 terms, another 500 docs with 25 terms, another 10 docs with 200 terms, etc.
Sticking with 600 docs and 300 terms: I need to know which of those 600 docs each term is in.
In a perfect world, I would also like the snippets for the docs (I can live without the snippet if there is a faster way to get the individual doc ids).
Doc1: <a>Hello World</a>
Doc2: <a>Goodbye World</a>
Doc3: <a> Goodbye World</a>
Doc4: <a>Fred Flintstone</a>
Term1: Hello: near((Hello,World),2,TRUE)
Term2: Goodbye: near((Goodbye,World),2,TRUE)
For simplicity, think of the result as a simple table:
Term:doc id:snippet (if possible)
Hello: doc1:<b>Hello</b> <b>World</b>
What I have tried to date:
Trial 1: raw XML as it is in the database.
Trial 2: flatten the XML to raw text, removing unnecessary nodes to make the doc text as small as possible (one CLOB of text).
Using two PL/SQL tables of records (one for the 300 terms, one for the 600 extracted and flattened XML docs), loaded with BULK COLLECT, two loops that do nothing but iterate through the two collections run in 7 seconds.
As soon as I add the policy_snippet call (and do nothing with the results), it takes three and a half minutes. Replacing the policy_snippet call with policy_highlight takes even longer.
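For context, the policy calls referred to above look roughly like this. The policy name, lexer defaults, and the my_flattened_clob variable are placeholders of mine, not the original code; the point is that policy_snippet works against a document passed in directly, with no index lookup:

```sql
-- one-time setup: a default policy (placeholder name)
begin
  ctx_ddl.create_policy(policy_name => 'my_snippet_policy');
end;
/
declare
  l_snip varchar2(32767);
begin
  -- my_flattened_clob stands in for one of the 600 flattened docs
  l_snip := ctx_doc.policy_snippet(
      policy_name => 'my_snippet_policy',
      document    => my_flattened_clob,
      text_query  => 'near((Hello,World),2,TRUE)',
      starttag    => '<b>', endtag => '</b>');
end;
```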
Best results so far:
-Create a GUID-driven table to store the hit results.
-Create a table to hold the flattened text from all the docs (if I move forward with this, I'm thinking a fast-refresh materialized view).
-Pre-stage the terms in a global temp table (or an in-memory PL/SQL table... no real difference in performance).
-Create a snippet procedure that loops through the 600 flattened docs from the flattened table and looks for hits using policy_snippet. If a hit is found, insert it into the GUID table for selecting later.
-Loop through the 300 terms and, for each one, use dbms_scheduler to call the snippet procedure.
-Wait for all jobs to complete, then pull results from the GUID-based results table.
-Clean up the GUID table for this run.
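The per-term fan-out in the steps above can be sketched like this. snippet_proc, term_gtt, and the job-naming scheme are illustrative assumptions, not the exact code, and the completion-wait is only outlined:

```sql
declare
  l_guid varchar2(32) := rawtohex(sys_guid());  -- key for this run
begin
  for t in (select rownum as rn, search_term from term_gtt) loop
    dbms_scheduler.create_job(
      job_name   => 'SNIP_JOB_' || t.rn,
      job_type   => 'PLSQL_BLOCK',
      job_action => 'begin snippet_proc(''' || l_guid || ''', '''
                    || replace(t.search_term, '''', '''''') || '''); end;',
      enabled    => true,
      auto_drop  => true);
  end loop;
  -- poll user_scheduler_jobs until no SNIP_JOB_% rows remain,
  -- then select the hits from the GUID results table
end;
```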
Next best performance:
-Create one HUGE XML doc using the 600 individual docs as fragments. Make 300 calls to ctx_doc.markup (once for each term), then use XMLTABLE/XQuery to go looking for the hits in the marked-up result.
-Separate table (using nologging SECUREFILE CLOBs, cached) with its own Text index (set up nologging) to stage/load the 600 docs and run the 300 terms against them. The thinking was that the smaller Text index would make it faster (it does... just not fast enough). Tried with both raw XML and flattened text.
-Combinations of PL/SQL tables and/or global temporary tables and the Text policy functions/procedures.
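A minimal sketch of what that staging table and its Text index might look like. Names, datatypes, and storage/index parameters are my assumptions (the composite primary key matches the ctx_doc.pkencode call later in the thread; the original post doesn't give the exact DDL):

```sql
create table staged_docs (
  guid_key varchar2(32) not null,   -- run identifier
  docid    number       not null,   -- doc id within the run
  doc_col  clob,                    -- flattened doc text
  constraint staged_docs_pk primary key (guid_key, docid)
) nologging
  lob (doc_col) store as securefile (cache nologging);

-- CONTEXT index; TRANSACTIONAL plus SYNC (ON COMMIT) so newly
-- staged rows are searchable without a manual index sync
create index stage_table_idx on staged_docs (doc_col)
  indextype is ctxsys.context
  parameters ('transactional sync (on commit)');
```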
Did I miss any ‘magic’?
Edited by: user10152425 on May 6, 2013 12:46 PM
Are the hundreds of search terms pre-defined, or are they variable? If they are pre-defined, then you can use document classification.
Unfortunately, they are not pre-defined. My users sort of manually group the documents and add search terms to something like a rolodex. That is sort of the glue that binds the docs together.
For example: read doc1 and extract its terms (index it). Store the 'important' terms. Use those terms to find more related docs. Repeat until there are no more related documents.
Now, I need to take all the terms and all the related documents and produce a nice 'master' document showing all the relationships.
To make it worse, the terms can have double-ended wildcards, e.g. %world%. That rules CTXRULE out. I have an enhancement request in to at least allow right-side wildcards on a CTXRULE index (I believe it has been approved for a future release).
Trust me, CTXRULE would be a really good fit for me. It's just not flexible enough yet.
For others that might have the same issue, here is what I ended up with (thanks to Roger Ford):
I flatten my target XML docs as before, removing all unnecessary nodes/text to make the text to be indexed as small as possible.
I insert them into a staging table keyed by a GUID and a doc id for the run. This staging table has a Text index (nologging) on it.
Then I loop through all my terms using a CONTAINS query and, for the rows that match, call snippet:
declare
  l_snippet varchar2(32767);
begin
  for i in (select search_term from term_table) loop
    for q in (select docid from staged_docs
              where guid_key = generated_guid
                and contains(doc_col, i.search_term) > 0) loop
      l_snippet := ctx_doc.snippet(index_name => 'STAGE_TABLE_IDX',
                                   textkey    => ctx_doc.pkencode(generated_guid, q.docid),
                                   text_query => i.search_term,
                                   starttag   => '--!MY-HIT!--', endtag => '--!MY-HIT!--',
                                   separator  => '...');
      -- store l_snippet for later processing
    end loop;
  end loop;
end;