2 Replies Latest reply: May 30, 2012 8:55 AM by ant7 RSS

    Return every instance of a word in a document

    ant7
      I have a 1400 page PDF document that contains an ID in the format of WARxxxxxx, a total of nine characters, every two pages or so. I am looking for a way to query out all of the instances of these IDs. For example:

      WAR123123
      WAR010101
      WAR987654

      The table the document is stored in is using standard text index. I tried to use the ctx_doc.highlight and store the results in another table, but I just got PDF gibberish instead of readable English. Is this type of query and result possible through the ctx functions?

      Thanks, Tony
        • 1. Re: Return every instance of a word in a document
          Barbara Boehmer
          If you try to use the lengths and offsets obtained from ctx_doc.highlight in dbms_lob.substr on the unfiltered column, then you get pdf gibberish. In the example below, I have used ctx_ddl.create_policy to create a policy that uses ctxsys.auto_filter, then used that policy in ctx_doc.policy_filter to convert the blob containing a pdf to a filtered clob. I have then used the lengths and offsets from ctx_doc.highlight in dbms_lob.substr on the fitlered clob. In my example, I searched for all words starting with "stor".

          -- test environment:
          SCOTT@orcl_11gR2> -- create table:
          SCOTT@orcl_11gR2> create table test_tab
            2    (id       number,
            3       test_col  blob)
            4  /
          
          Table created.
          
          SCOTT@orcl_11gR2> -- load data:
          SCOTT@orcl_11gR2> create or replace directory my_dir as 'c:\my_oracle_files'
            2  /
          
          Directory created.
          
          SCOTT@orcl_11gR2> declare
            2    v_bfile     bfile := bfilename ('MYDIR', 'banana.pdf');
            3    v_blob     blob;
            4  begin
            5    insert into test_tab (id, test_col) values (1, empty_blob())
            6    returning test_col into v_blob;
            7    dbms_lob.fileopen (v_bfile, dbms_lob.lob_readonly);
            8    dbms_lob.loadfromfile (v_blob, v_bfile, dbms_lob.getlength (v_bfile));
            9    dbms_lob.fileclose (v_bfile);
           10  end;
           11  /
          
          PL/SQL procedure successfully completed.
          
          SCOTT@orcl_11gR2> -- create index:
          SCOTT@orcl_11gR2> create index test_idx
            2  on test_tab (test_col)
            3  indextype is ctxsys.context
            4  /
          
          Index created.
          -- policy, function, and query:
          SCOTT@orcl_11gR2> -- create policy that uses ctxsys.auto_filter:
          SCOTT@orcl_11gR2> begin
            2    ctx_ddl.create_policy ('test_pol', 'ctxsys.auto_filter');
            3  end;
            4  /
          
          PL/SQL procedure successfully completed.
          
          SCOTT@orcl_11gR2> -- create function that uses policy:
          SCOTT@orcl_11gR2> create or replace function get_words
            2    (p_idx     in varchar2,
            3       p_rid     in rowid,
            4       p_query in varchar2,
            5       p_blob     in blob)
            6    return        sys.odcivarchar2list
            7    authid        definer
            8    pipelined
            9  as
           10    v_clob       clob;
           11    v_hightab  ctx_doc.highlight_tab;
           12  begin
           13    dbms_lob.createtemporary (v_clob, true);
           14    ctx_doc.policy_filter ('test_pol', p_blob, v_clob);
           15    ctx_doc.set_key_type ('rowid');
           16    ctx_doc.highlight (p_idx, p_rid, p_query, v_hightab);
           17    for i in 1 .. v_hightab.count loop
           18        pipe row
           19          (dbms_lob.substr
           20             (v_clob,
           21              v_hightab(i).length,
           22              v_hightab(i).offset));
           23    end loop;
           24    dbms_lob.freetemporary (v_clob);
           25    return;
           26  end get_words;
           27  /
          
          Function created.
          
          SCOTT@orcl_11gR2> show errors
          No errors.
          SCOTT@orcl_11gR2> -- query that uses function:
          SCOTT@orcl_11gR2> column words format a30
          SCOTT@orcl_11gR2> select distinct t.id, x.column_value as words
            2  from   test_tab t,
            3           table (get_words ('test_idx', t.rowid, 'stor%', t.test_col)) x
            4  where  contains (t.test_col, 'stor%') > 0
            5  /
          
                  ID WORDS
          ---------- ------------------------------
                   1 store
                   1 stores
                   1 Storage
          
          3 rows selected.
          • 2. Re: Return every instance of a word in a document
            ant7
            Barbara -

            Thank you for your very thorough and well executed example. It provided me just what I needed. I appreciate the time and effort you put into your answer.

            Regards, Tony