Forum Stats

  • 3,825,902 Users
  • 2,260,573 Discussions
  • 7,896,727 Comments

Discussions

Query Template Documentation

tarfu_dba
tarfu_dba Member Posts: 134
edited Jul 4, 2017 7:04AM in Text

We had a requirement to get a count of how many times a particular token appeared in a document, and I found a post by Barbara Boehmer that led us to this solution:

select id, score(1) as token_count from docs where CONTAINS (text,

'<query>

   <textquery lang="ENGLISH" grammar="CONTEXT">blah</textquery>

  <score datatype="INTEGER" algorithm="COUNT"/>

</query>',1)>0;

This works really well for us.  Is there any more documention on Query Templates than what's in the Text Application Developer's Guide? 

Answers

  • Barbara Boehmer
    Barbara Boehmer Member Posts: 4,625 Gold Trophy
    edited Jun 20, 2017 7:55PM

    In newer versions, there is also DEFINESCORE with OCCURRENCE, as shown in the second query below.

    [email protected]_12.1.0.2.0> create table docs as
      2  select 1 id, 'blah blah blah' text from dual union all
      3  select 2, 'blah blah' from dual union all
      4  select 3, 'blah' from dual
      5  /

    Table created.

    [email protected]_12.1.0.2.0> create index docs_text_idx on docs (text) indextype is ctxsys.context
      2  /

    Index created.

    [email protected]_12.1.0.2.0> select id, score(1) as token_count from docs where CONTAINS (text,
      2  '<query>
      3  <textquery lang="ENGLISH" grammar="CONTEXT">blah</textquery>
      4  <score datatype="INTEGER" algorithm="COUNT"/>
      5  </query>'
      6  ,1)>0
      7  /

            ID TOKEN_COUNT
    ---------- -----------
            1          3
            2          2
            3          1

    3 rows selected.

    [email protected]_12.1.0.2.0> select id, score(1) as token_count from docs where CONTAINS (text,
      2  'definescore (blah, occurrence)'
      3  ,1)>0
      4  /

            ID TOKEN_COUNT
    ---------- -----------
            1          3
            2          2
            3          1

    3 rows selected.

    Here are some links to documentation:

    DEFINESCORE
    http://docs.oracle.com/database/121/CCREF/cqoper.htm#GUID-BEF0570D-8B11-4893-AEFF-0AD7DED185BB

    ALTERNATIVE AND USER-DEFINED SCORING
    http://docs.oracle.com/database/121/CCAPP/GUID-89EAF7C5-AFB6-469E-8948-7F7A66DBBBB4.htm#CCAPP9168

    USING QUERY TEMPLATES
    http://docs.oracle.com/database/121/CCAPP/GUID-57BA56DA-CDEC-442A-BF2B-CB8BFEA7C001.htm#CCAPP9164

    Is there any particular question that you have about templates or scoring?

    tarfu_dba
  • tarfu_dba
    tarfu_dba Member Posts: 134
    edited Jun 21, 2017 8:33AM

    Thanks, Barbara.  I read about DEFINESCORE, etc. in the 11gR2 documentation, but your example above clarifies it (it takes me about 3 reads for stuff to sink in).  It looks simpler to code than the <query> XML.

    The recent need by our application team to get a simple count of a token in a document made me search for an answer, and obviously I found your very helpful post.  I had not looked closely at query templates before this.  The documentation isn't terribly elaborate so I was wondering if there was more to it than that.

    We've had a long standing "problem" with trying to search for IP and email addresses with Text queries, compounded by the fact that we use the WORLD_LEXER which has no options to help us (a printjoin of "@" might have been nice to have), and the normal parsing of the CONTAINS search criteria plays havoc with the periods in an IP address ('192.168.%.1' appears to become 192.168 % 1' or '192.168 % .1' which returns everything but what we want).  I'm looking closer at query templates to see if they could help.

  • Barbara Boehmer
    Barbara Boehmer Member Posts: 4,625 Gold Trophy
    edited Jun 21, 2017 8:25PM

    Unfortunately, I do not have a solution for you.  All I can do is narrow down the cause.  The problem is that the period is, by default, an endjoin character.  So, when you search for "192.168." it believes that is the end of the token.  Since you are using the world_lexer, you cannot set attributes in order to either not have the period as an endjoin or make it a skipjoin to separate the strings into smaller tokens.  Perhaps, with this information, someone else may have some further ideas.  Please see the demonstration that shows the behavior.

    [email protected]_12.1.0.2.0> create table docs as
      2  select '192.168.0.1' text from dual union all
      3  select '192.1682.0.1' from dual union all
      4  select '192.168.0.21' from dual union all
      5  select '192.168.' from dual union all
      6  select '192.168' from dual
      7  /

    Table created.

    [email protected]_12.1.0.2.0> begin
      2    ctx_ddl.create_preference ('docs_lex', 'world_lexer');
      3  end;
      4  /

    PL/SQL procedure successfully completed.

    [email protected]_12.1.0.2.0> create index docs_text_idx on docs (text)
      2  indextype is ctxsys.context
      3  parameters ('lexer docs_lex')
      4  /

    Index created.

    [email protected]_12.1.0.2.0> select token_text from dr$docs_text_idx$i
      2  /

    TOKEN_TEXT
    ----------------------------------------------------------------
    192.168
    192.168.0.1
    192.168.0.21
    192.1682.0.1

    4 rows selected.

    [email protected]_12.1.0.2.0> select text from docs where CONTAINS (text, '192.168.%.1', 1) > 0
      2  /

    no rows selected

    [email protected]_12.1.0.2.0> select text from docs where CONTAINS (text, '192.168.%', 1) > 0
      2  /

    TEXT
    ------------
    192.168.
    192.168

    2 rows selected.

    [email protected]_12.1.0.2.0> select text from docs where CONTAINS (text, '192.168%', 1) > 0
      2  /

    TEXT
    ------------
    192.168.0.1
    192.1682.0.1
    192.168.0.21
    192.168.
    192.168

    5 rows selected.

    [email protected]_12.1.0.2.0> select text from docs where CONTAINS (text, '192.168%.1', 1) > 0
      2  /

    TEXT
    ------------
    192.168.0.1
    192.1682.0.1

    2 rows selected.

    [email protected]_12.1.0.2.0> select text from docs where CONTAINS (text, '192.168.0.1', 1) > 0
      2  /

    TEXT
    ------------
    192.168.0.1

    1 row selected.

    [email protected]_12.1.0.2.0>

  • Herald ten Dam
    Herald ten Dam Member Posts: 1,254
    edited Jul 4, 2017 7:04AM

    Hi,

    a solution I once choose for email and IP is to use regular expressions. I used a bunch of regular expressions, used these to extract email/ip/phone number out of a text, put these in a new table with a link to the document, and indexed the new table with a text index, this worked very well. So a kind of entity extracting, but made it in Oracle 10g that time. It is still in use that way. Maybe also a solution for you.

    Herald ten Dam

This discussion has been closed.