14 Replies Latest reply on Apr 30, 2012 4:34 PM by Barbara Boehmer

    Index scan

    sperkmandl
      Hi, as a developer helper - I wonder whether there is any way to scan through a text index, reporting terms and associated info (score, source rowid, etc.).
      Thanks.
        • 1. Re: Index scan
          Herald ten Dam
          Hi,

          a source for the index is to use the package CTX_REPORT, there are procedures which can give inside about an index (see http://docs.oracle.com/cd/E11882_01/text.112/e24436/crptpkg.htm#CHDHHEDC). Of course you can query the tables which belogn to an index or query some metadata views which start with CTX_ (see http://docs.oracle.com/cd/E11882_01/text.112/e24436/aviews.htm#CHDEIJFC).

          Herald ten Dam
          http://htendam.wordpress.com
          • 2. Re: Index scan
            sperkmandl
            Thanks, but I couldn't find any table/view where a "select" would return index "terms" and associated info.
            There are plenty of data describing an index as a whole.
            • 3. Re: Index scan
              Barbara Boehmer
              You need to be more specific as to what information you want and for what purpose. Thus far, you have only mentioned term, rowid, and score. You don't get the score from a data dictionary view or an index or a domain index table. You get a score from a query. You can get the terms and rowid's and other information in various ways. I have provided a demonstration below that shows some objects that are created when you create a text index and how you can extract data from them. I have focused on terms and rowid's, since that is all you have mentioned.

              -- Suppose you have a table and some data like this:
              SCOTT@orcl_11gR2> create table text_tab
                2    (text_col  varchar2 (30))
                3  /
              
              Table created.
              
              SCOTT@orcl_11gR2> insert all
                2  into text_tab (text_col) values ('some text data')
                3  into text_tab (text_col) values ('more text data')
                4  select * from dual
                5  /
              
              2 rows created.
              
              SCOTT@orcl_11gR2> select text_col, rowid
                2  from   text_tab
                3  /
              
              TEXT_COL                       ROWID
              ------------------------------ ------------------
              some text data                 AABE4tAAEAAAXSzAAA
              more text data                 AABE4tAAEAAAXSzAAB
              
              2 rows selected.
              -- When you create a text index:
              SCOTT@orcl_11gR2> create index text_idx on text_tab (text_col)
                2  indextype is ctxsys.context
                3  parameters ('stoplist ctxsys.empty_stoplist')
                4  /
              
              Index created.
              -- it creates some domain index tables of the format dr$your_index_name$...:
              SCOTT@orcl_11gR2> select object_name from user_objects
                2  where  object_name like '%TEXT_IDX%'
                3  /
              
              OBJECT_NAME
              --------------------------------------------------------------------------------
              TEXT_IDX
              DR$TEXT_IDX$X
              DR$TEXT_IDX$R
              DR$TEXT_IDX$N
              DR$TEXT_IDX$K
              DR$TEXT_IDX$I
              
              6 rows selected.
              -- You can describe those tables to see what data is available:
              SCOTT@orcl_11gR2> desc dr$text_idx$i
               Name                                      Null?    Type
               ----------------------------------------- -------- ----------------------------
               TOKEN_TEXT                                NOT NULL VARCHAR2(64)
               TOKEN_TYPE                                NOT NULL NUMBER(3)
               TOKEN_FIRST                               NOT NULL NUMBER(10)
               TOKEN_LAST                                NOT NULL NUMBER(10)
               TOKEN_COUNT                               NOT NULL NUMBER(10)
               TOKEN_INFO                                         BLOB
              
              SCOTT@orcl_11gR2> desc dr$text_idx$k
               Name                                      Null?    Type
               ----------------------------------------- -------- ----------------------------
               DOCID                                              NUMBER(38)
               TEXTKEY                                   NOT NULL ROWID
              -- You can select from those tables:
              SCOTT@orcl_11gR2> column token_text format a10
              SCOTT@orcl_11gR2> select token_text, token_type, token_first, token_last, token_count
                2  from   dr$text_idx$i
                3  /
              
              TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT
              ---------- ---------- ----------- ---------- -----------
              DATA                0           1          2           2
              MORE                0           2          2           1
              SOME                0           1          1           1
              TEXT                0           1          2           2
              
              4 rows selected.
              
              SCOTT@orcl_11gR2> select docid, textkey
                2  from   dr$text_idx$k
                3  /
              
                   DOCID TEXTKEY
              ---------- ------------------
                       1 AABE4tAAEAAAXSzAAA
                       2 AABE4tAAEAAAXSzAAB
              
              2 rows selected.
              -- You can use queries that join the base table to a domain index table:
              SCOTT@orcl_11gR2> column term format a10
              SCOTT@orcl_11gR2> select i.token_text term, t.rowid
                2  from   dr$text_idx$i i, text_tab t
                3  where  contains (t.text_col, i.token_text) > 0
                4  order  by term
                5  /
              
              TERM       ROWID
              ---------- ------------------
              DATA       AABE4tAAEAAAXSzAAA
              DATA       AABE4tAAEAAAXSzAAB
              MORE       AABE4tAAEAAAXSzAAB
              SOME       AABE4tAAEAAAXSzAAA
              TEXT       AABE4tAAEAAAXSzAAB
              TEXT       AABE4tAAEAAAXSzAAA
              
              6 rows selected.
              -- Some of the columns such as dr$your_index_name$i.token_info (blob) require a utility to extract the data in a readable manner. You can extract the data either as a lengthy report or in xml format. The following demonstrates how to use ctx_report.token_info to get that information in xml format, then extract the rowid. You could use the same method to extract other data.
              SCOTT@orcl_11gR2> column row_id format a20
              SCOTT@orcl_11gR2> select t.term,
                2           extractvalue (value (x), '//@BASE_ROWID') row_id
                3  from   (select token_text term,
                4                ctx_report.token_info
                5                  ('text_idx',
                6                   token_text,
                7                   token_type,
                8                   report_format => 'XML') xml_info
                9            from   dr$text_idx$i) t,
               10           table
               11             (xmlsequence
               12            (extract
               13              (xmltype (t.xml_info),
               14               '//TOKEN_DATA_DOCID'))) x
               15  /
              
              TERM       ROW_ID
              ---------- --------------------
              DATA       AABE4tAAEAAAXSzAAA
              DATA       AABE4tAAEAAAXSzAAB
              MORE       AABE4tAAEAAAXSzAAB
              SOME       AABE4tAAEAAAXSzAAA
              TEXT       AABE4tAAEAAAXSzAAA
              TEXT       AABE4tAAEAAAXSzAAB
              
              6 rows selected.
              • 4. Re: Index scan
                sperkmandl
                Thanks, I think I found what I was looking for in CTX_QUERY.BROWSE_WORDS.
                • 5. Re: Index scan
                  sperkmandl
                  I tried to override the case constraint as reported in another thread while using BROWSE_WORDS, through queries kindly reported by Barbara.
                  Unfortunately it seems not working:

                  select i.token_text term, t.rowid
                  from dr$URL_FULLTEXT_IDX$i i, FULLTEXT_TABLE t
                  where contains (t.url, i.token_text) > 0
                  order by term;

                  SQL developer raises

                  Errore SQL: ORA-20000: Oracle Text error:
                  DRG-50901: text query parser syntax error on line 1, column 1
                  20000. 00000 - "%s"
                  *Cause:    The stored procedure 'raise_application_error'
                  was called which causes this error to be generated.
                  *Action:   Correct the problem as described in the error message or contact
                  the application administrator or DBA for more information.

                  It seems that Contains needs a literal query.
                  • 6. Re: Index scan
                    Barbara Boehmer
                    You need to post a copy and paste of a complete run of what you did from SQL*Plus, not SQL Developer, as I did. We need to see the table creation and index creation, any other related objects, and some sample data. The example that I posted was for a context index, not a ctxcat index. Contains works with variables or columns from other tables and does not require a string. I have already proven that in the example above.
                    • 7. Re: Index scan
                      sperkmandl
                      I use a context index too, from an url data source, operating in a Jdbc environment.
                      It's there that I first found that error, then I tried again in SQL dev. and got the same result.
                      The base table was populated by just a single pdf document: the Oracle Text app. dev. guide.
                      Its definition is:

                      CREATE OR REPLACE TYPE FullText_type AS OBJECT (
                      version RAW(16),
                      path VARCHAR2(200 char),
                      url VARCHAR2(400 char),
                      flags NUMBER(10))

                      CREATE TABLE STORE1.FullText_table OF FullText_type (
                      FOREIGN KEY (version) REFERENCES STORE1.Version_table ON DELETE CASCADE,
                      UNIQUE (version,path),
                      flags DEFAULT 0)

                      Index definition is:

                      begin
                      ctx_ddl.create_preference('MYLEX', 'BASIC_LEXER');
                      ctx_ddl.set_attribute('MYLEX', 'mixed_case', 'YES');
                      ctx_ddl.set_attribute('MYLEX', 'index_stems', 'GERMAN');
                      end;

                      CREATE INDEX STORE1.URL_FULLTEXT_IDX ON STORE1.FULLTEXT_TABLE (URL)
                      INDEXTYPE IS CTXSYS.CONTEXT parameters ('
                      DATASTORE CTXSYS.URL_DATASTORE
                      FILTER CTXSYS.AUTO_FILTER
                      LEXER MYLEX
                      SYNC (ON COMMIT)')

                      Failing query in Jdbc was:

                      SELECT i.token_text term, t.version, t.path
                      FROM STORE1.dr$URL_FULLTEXT_IDX$i i, STORE1.FULLTEXT_TABLE t
                      WHERE i.token_text >= '?' AND
                      CONTAINS (t.url, i.token_text) > 0
                      ORDER BY term

                      DRG-50901: text query parser syntax error on line 1, column 6

                      A little simpler statement in SQL dev. raises the same error:

                      select i.token_text term, t.rowid
                      from dr$URL_FULLTEXT_IDX$i i, FULLTEXT_TABLE t
                      where contains (t.url, i.token_text) > 0
                      order by term;

                      I noticed that if I enclose the query like in single quote such as 'i.token_text', the syntax error disappears but there is no result. This occurs in SQL dev. and in java.

                      Thanks and regards.
                      • 8. Re: Index scan
                        Barbara Boehmer
                        There are some terms that have special meaning to Oracle Text, so they need to be escaped. To do this, just concatenate '{' and '}' around i.token_text in the query. Please see the complete demonstration below.
                        SCOTT@orcl_11gR2> CREATE TABLE version_table
                          2    (version  RAW(16) PRIMARY KEY)
                          3  /
                        
                        Table created.
                        
                        SCOTT@orcl_11gR2> INSERT INTO version_table VALUES
                          2    (UTL_RAW.CAST_TO_RAW ('11.2'))
                          3  /
                        
                        1 row created.
                        
                        SCOTT@orcl_11gR2> CREATE OR REPLACE TYPE FullText_type AS OBJECT
                          2    (version  RAW(16),
                          3       path      VARCHAR2(200 char),
                          4       url      VARCHAR2(400 char),
                          5       flags      NUMBER(10));
                          6  /
                        
                        Type created.
                        
                        SCOTT@orcl_11gR2> CREATE TABLE FullText_table OF FullText_type
                          2    (FOREIGN KEY (version) REFERENCES Version_table ON DELETE CASCADE,
                          3       UNIQUE (version, path),
                          4       flags DEFAULT 0)
                          5  /
                        
                        Table created.
                        
                        SCOTT@orcl_11gR2> INSERT INTO fulltext_table VALUES
                          2    (UTL_RAW.CAST_TO_RAW ('11.2'),
                          3       null,
                          4       'http://docs.oracle.com/cd/E11882_01/text.112/e24435/toc.htm',
                          5       0)
                          6  /
                        
                        1 row created.
                        
                        SCOTT@orcl_11gR2> BEGIN
                          2    CTX_DDL.CREATE_PREFERENCE ('MYLEX', 'BASIC_LEXER');
                          3    CTX_DDL.SET_ATTRIBUTE ('MYLEX', 'mixed_case', 'YES');
                          4    CTX_DDL.SET_ATTRIBUTE ('MYLEX', 'index_stems', 'GERMAN');
                          5  END;
                          6  /
                        
                        PL/SQL procedure successfully completed.
                        
                        SCOTT@orcl_11gR2> CREATE INDEX url_fulltext_idx
                          2  ON fulltext_table (url)
                          3  INDEXTYPE IS CTXSYS.CONTEXT
                          4  PARAMETERS
                          5    ('DATASTORE CTXSYS.URL_DATASTORE
                          6        FILTER CTXSYS.AUTO_FILTER
                          7        LEXER mylex
                          8        SYNC (ON COMMIT)')
                          9  /
                        
                        Index created.
                        
                        SCOTT@orcl_11gR2> COLUMN term FORMAT A30
                        SCOTT@orcl_11gR2> SELECT i.token_text term, t.ROWID
                          2  FROM   dr$url_fulltext_idx$i i,
                          3           fulltext_table t
                          4  WHERE  CONTAINS (t.url, '{' || i.token_text || '}') > 0
                          5  ORDER  by term
                          6  /
                        
                        TERM                           ROWID
                        ------------------------------ ------------------
                        01                             AABFSdAAEAAAYGUAAA
                        1                              AABFSdAAEAAAYGUAAA
                        1.1                            AABFSdAAEAAAYGUAAA
                        1.2                            AABFSdAAEAAAYGUAAA
                        1.2.1                          AABFSdAAEAAAYGUAAA
                        1.3                            AABFSdAAEAAAYGUAAA
                        ... -- output truncated to save space
                        themes                         AABFSdAAEAAAYGUAAA
                        type                           AABFSdAAEAAAYGUAAA
                        updated                        AABFSdAAEAAAYGUAAA
                        use                            AABFSdAAEAAAYGUAAA
                        using                          AABFSdAAEAAAYGUAAA
                        versus                         AABFSdAAEAAAYGUAAA
                        way                            AABFSdAAEAAAYGUAAA
                        wide                           AABFSdAAEAAAYGUAAA
                        
                        924 rows selected.
                        • 9. Re: Index scan
                          sperkmandl
                          Very tricky, but it works in this way.
                          Thank you very much.
                          • 10. Re: Index scan
                            sperkmandl
                            inconsistent index contents
                            • 11. Re: Index scan
                              sperkmandl
                              Hi, in the context of browsing the index I'm experiencing some inconsistencies.
                              I set ALTER SESSION SET NLS_SORT=BINARY to stay away from language issues. Same setting for NLS_COMP.
                              The statement (index only):

                              SELECT i.token_text term
                              FROM STORE1.dr$URL_FULLTEXT_IDX$i i
                              WHERE i.token_text >= 'ä'
                              ORDER BY term

                              reports (see the initial double entry):

                              ä
                              ä
                              ächten
                              äsen
                              å
                              æ
                              ç
                              è
                              é
                              ...

                              while the statement (mixed query):

                              SELECT i.token_text term
                              FROM STORE1.dr$URL_FULLTEXT_IDX$i i, STORE1.FULLTEXT_TABLE t
                              WHERE i.token_text >= 'ä' AND
                              CONTAINS (t.url, '{' || i.token_text || '}') > 0
                              ORDER BY term

                              reports just a single entry twice:

                              ä
                              ä
                              • 12. Re: Index scan
                                Roger Ford-Oracle
                                I don't think that will work. The search value in the CONTAINS clause needs to evaluate to a constant string - you can't use a column value which changes as you iterate a join.

                                This limitation comes from the extensibility framework on which Oracle Text indexes are built, but I'm not sure it's documented anywhere.
                                • 13. Re: Index scan
                                  sperkmandl
                                  In general its seems working well, provided that the search value is properly escaped as explained by Barbara.
                                  It's only that case of those final accented words that seem skipped.

                                  As a further detail - skipped words have a token type = 9, while included words have a token_type = 0.
                                  The value of 9 seems to imply that such word is in the index but not in the document.
                                  As an exception, the word 'ä' is reported twice: once for 0, once for 9, even in the mixed query.

                                  Edited by: sperkmandl on Apr 30, 2012 6:40 AM
                                  • 14. Re: Index scan
                                    Barbara Boehmer
                                    The values of 9 for token_type are stem words that are created when you use index_stems in your lexer. Some of those stem words might also be in your document, in which case they would also be in the domain index table with a token_type of 0. The query that I provided will only provide terms that are actually in the table. Please see the simplified demonstration below.
                                    SCOTT@orcl_11gR2> ALTER SESSION SET NLS_LANGUAGE='GERMAN'
                                      2  /
                                    
                                    Session altered.
                                    
                                    SCOTT@orcl_11gR2> CREATE TABLE demo_tab
                                      2    (demo_col  VARCHAR2(30))
                                      3  /
                                    
                                    Table created.
                                    
                                    SCOTT@orcl_11gR2> INSERT ALL
                                      2  INTO demo_tab (demo_col) VALUES ('SPRICHT')
                                      3  SELECT * FROM DUAL
                                      4  /
                                    
                                    1 row created.
                                    
                                    SCOTT@orcl_11gR2> BEGIN
                                      2    CTX_DDL.CREATE_PREFERENCE ('mylex', 'BASIC_LEXER');
                                      3    CTX_DDL.SET_ATTRIBUTE ('mylex', 'INDEX_STEMS', 'GERMAN');
                                      4  END;
                                      5  /
                                    
                                    PL/SQL procedure successfully completed.
                                    
                                    SCOTT@orcl_11gR2> CREATE INDEX demo_idx
                                      2  ON demo_tab (demo_col)
                                      3  INDEXTYPE IS CTXSYS.CONTEXT
                                      4  PARAMETERS ('LEXER mylex')
                                      5  /
                                    
                                    Index created.
                                    
                                    SCOTT@orcl_11gR2> COLUMN token_text FORMAT A30
                                    SCOTT@orcl_11gR2> SELECT token_type, token_text
                                      2  FROM   dr$demo_idx$i
                                      3  /
                                    
                                    TOKEN_TYPE TOKEN_TEXT
                                    ---------- ------------------------------
                                             9 SPRECHEN
                                             0 SPRICHT
                                    
                                    2 rows selected.
                                    
                                    SCOTT@orcl_11gR2> COLUMN term FORMAT A30
                                    SCOTT@orcl_11gR2> SELECT i.token_text term, t.ROWID
                                      2  FROM   dr$demo_idx$i i,
                                      3           demo_tab t
                                      4  WHERE  CONTAINS (t.demo_col, '{' || i.token_text || '}') > 0
                                      5  ORDER  by term
                                      6  /
                                    
                                    TERM                           ROWID
                                    ------------------------------ ------------------
                                    SPRICHT                        AABFu7AAEAAAZmsAAA
                                    
                                    1 row selected.
                                    As a side note, if you plan to do stemming searches, I would use a wordlist with a German stemmer, instead of a lexer with index_stems set to German. I have provided a separate demonstration for that. If you use the wordlist with the stemmer, it is important not to use the index_stems with the lexer.
                                    SCOTT@orcl_11gR2> ALTER SESSION SET NLS_LANGUAGE='GERMAN'
                                      2  /
                                    
                                    Session altered.
                                    
                                    SCOTT@orcl_11gR2> CREATE TABLE demo_tab
                                      2    (demo_col  VARCHAR2(30))
                                      3  /
                                    
                                    Table created.
                                    
                                    SCOTT@orcl_11gR2> INSERT ALL
                                      2  INTO demo_tab (demo_col) VALUES ('SPRICHT')
                                      3  SELECT * FROM DUAL
                                      4  /
                                    
                                    1 row created.
                                    
                                    SCOTT@orcl_11gR2> BEGIN
                                      2    CTX_DDL.CREATE_PREFERENCE ('mywordlist', 'BASIC_WORDLIST');
                                      3    CTX_DDL.SET_ATTRIBUTE ('mywordlist', 'STEMMER', 'GERMAN');
                                      4  END;
                                      5  /
                                    
                                    PL/SQL procedure successfully completed.
                                    
                                    SCOTT@orcl_11gR2> CREATE INDEX demo_idx
                                      2  ON demo_tab (demo_col)
                                      3  INDEXTYPE IS CTXSYS.CONTEXT
                                      4  PARAMETERS ('WORDLIST mywordlist')
                                      5  /
                                    
                                    Index created.
                                    
                                    SCOTT@orcl_11gR2> SELECT demo_col
                                      2  FROM   demo_tab
                                      3  WHERE  CONTAINS (demo_col, '$sprechen') > 0
                                      4  /
                                    
                                    DEMO_COL
                                    ------------------------------
                                    SPRICHT
                                    
                                    1 row selected.