7 Replies Latest reply on Jan 12, 2012 4:14 PM by Barbara Boehmer

    Oracle Text - Top N Tokens

    Dileep
      Hi All,

      I am working on a prototype for my "word count application" using Oracle Text. My requirement is to take a bunch of customer comments(text column) and come up with top n most frequent words from those comments. I retrieve those customer comments/sub data set depending upon other dimension values(star schema). From Oracle Text documentation I did not find a way to do it.All examples that I have seen so far were related to text query applications but I do not want to query by using any particular word instead I want to retrieve top n words. Can I do this using Oracle text?

      Thanks
      Dileep.
        • 1. Re: Oracle Text - Top N Tokens
          Barbara Boehmer
          When you create an Oracle Text index, it creates a dr$your_index_name$i domain index table, that includes each token (word) and a token_count of how many of each token (word) are in the index. You can use a standard top n query to get the top n tokens, limiting your data either by token_count or number of rows. Please see the demonstration below, using the sentences of your post as data.
          SCOTT@orcl_11gR2> create table comments
            2    (id      number,
            3       commcol  clob)
            4  /
          
          Table created.
          
          SCOTT@orcl_11gR2> insert all
            2  into comments (id, commcol) values
            3  (1, 'I am working on a prototype for my "word count application" using Oracle Text.')
            4  into comments (id, commcol) values
            5  (2, 'My requirement is to take a bunch of customer comments(text column)
            6  and come up with top n most frequent words from those comments.')
            7  into comments (id, commcol) values
            8  (3, 'I retrieve those customer comments/sub data set depending upon other
            9  dimension values(star schema).')
           10  into comments (id, commcol) values
           11  (4, 'From Oracle Text documentation I did not find a way to do it.')
           12  into comments (id, commcol) values
           13  (5, 'All examples that I have seen so far were related to text query applications
           14  but I do not want to query by using any particular word instead I want to retrieve
           15  top n words.')
           16  into comments (id, commcol) values
           17  (6, 'Can I do this using Oracle text?')
           18  select * from dual
           19  /
          
          6 rows created.
          
          SCOTT@orcl_11gR2> create index comments_commcol_idx
            2  on comments (commcol)
            3  indextype is ctxsys.context
            4  /
          
          Index created.
          
          SCOTT@orcl_11gR2> select *
            2  from   (select sum (token_count) token_count, token_text
            3            from   dr$comments_commcol_idx$i
            4            group  by token_text
            5            order  by token_count desc, token_text)
            6  where  token_count > 2
            7  /
          
          TOKEN_COUNT TOKEN_TEXT
          ----------- ----------------------------------------------------------------
                    5 TEXT
                    3 ORACLE
                    3 USING
          
          3 rows selected.
          
          SCOTT@orcl_11gR2> select *
            2  from   (select sum (token_count) token_count, token_text
            3            from   dr$comments_commcol_idx$i
            4            group  by token_text
            5            order  by token_count desc, token_text)
            6  where  rownum <= 3
            7  /
          
          TOKEN_COUNT TOKEN_TEXT
          ----------- ----------------------------------------------------------------
                    5 TEXT
                    3 ORACLE
                    3 USING
          
          3 rows selected.
          1 person found this helpful
          • 2. Re: Oracle Text - Top N Tokens
            Dileep
            Thank you. Only problem that I have with above solution was my data set is not static for all my users as it varies with values of other columns. For ex each user can see only subset of all customer comments depending upon whether he is a manager or employee. How can I solve above problem where my data set is not static and varies depending upon who you are.
            • 3. Re: Oracle Text - Top N Tokens
              Roger Ford-Oracle
              How many different groups of users? If it's only a small number you might manage it using field sections, with each type of section containing the text that should be visible to a particular set of users. If it's more dynamic than that, I think you're out of luck. Barbara's solution is already going beyond the design parameters of Oracle Text.
              1 person found this helpful
              • 4. Re: Oracle Text - Top N Tokens
                Barbara Boehmer
                Perhaps you could use an XML solution. You could use whatever join condition determines the accessible rows in place of the comments table in the query below. Note that it does not filter out the default Oracle Text stopwords, such as "I" and "to" and "a" and "do", but you could add that to your query if desired.
                SCOTT@orcl_11gR2> create table comments
                  2    (id      number,
                  3       commcol  clob)
                  4  /
                
                Table created.
                
                SCOTT@orcl_11gR2> insert all
                  2  into comments (id, commcol) values
                  3  (1, 'I am working on a prototype for my "word count application" using Oracle Text.')
                  4  into comments (id, commcol) values
                  5  (2, 'My requirement is to take a bunch of customer comments(text column)
                  6  and come up with top n most frequent words from those comments.')
                  7  into comments (id, commcol) values
                  8  (3, 'I retrieve those customer comments/sub data set depending upon other
                  9  dimension values(star schema).')
                 10  into comments (id, commcol) values
                 11  (4, 'From Oracle Text documentation I did not find a way to do it.')
                 12  into comments (id, commcol) values
                 13  (5, 'All examples that I have seen so far were related to text query applications
                 14  but I do not want to query by using any particular word instead I want to retrieve
                 15  top n words.')
                 16  into comments (id, commcol) values
                 17  (6, 'Can I do this using Oracle text?')
                 18  select * from dual
                 19  /
                
                6 rows created.
                
                SCOTT@orcl_11gR2> COLUMN word FORMAT A30
                SCOTT@orcl_11gR2> SELECT *
                  2  FROM   (SELECT COUNT (*) word_count,
                  3                EXTRACTVALUE (xt.column_value, 'e') word
                  4            FROM   comments c,
                  5                TABLE
                  6                  (XMLSEQUENCE
                  7                 (EXTRACT
                  8                   (XMLTYPE
                  9                     ('<coll><e>' ||
                 10                      REPLACE
                 11                        (REPLACE
                 12                       (TRANSLATE (c.commcol, '/().,?"', '        '),
                 13                        CHR(10), ' '),
                 14                      ' ', '</e><e>') ||
                 15                      '</e></coll>'),
                 16                    '/coll/e'))) xt
                 17            WHERE  EXTRACTVALUE (xt.column_value, 'e') IS NOT NULL
                 18            GROUP  BY EXTRACTVALUE (xt.column_value, 'e')
                 19            ORDER  BY word_count DESC, word)
                 20  WHERE  word_count > 2
                 21  /
                
                WORD_COUNT WORD
                ---------- ------------------------------
                         7 I
                         5 to
                         3 Oracle
                         3 a
                         3 comments
                         3 do
                         3 text
                         3 using
                
                8 rows selected.
                1 person found this helpful
                • 5. Re: Oracle Text - Top N Tokens
                  Dileep
                  Thanks Barbara.
                  • 6. Re: Oracle Text - Top N Tokens
                    Dileep
                    Is there anyway to get all tokens for each my customer comment using Oracle text. for ex:

                    Comment_id tokens

                    1 Oracle
                    1 Text
                    1 Using
                    2 Test
                    2 Query
                    3 Problem

                    If I can do something like above then I can write a plsql package myself and count top n words.
                    • 7. Re: Oracle Text - Top N Tokens
                      Barbara Boehmer
                      Ctx_doc.tokens can be used to retrieve the tokens for an individual row. The following example uses a packaged pipelined table function to return those results in tabular format. The query below that joins this to the comments table, so that results are retrieved for every row. For your purposes, you could substitute whatever join condition determines the rows that are accessible for the comments table in the from clause of the query.
                      SCOTT@orcl_11gR2> create table comments
                        2    (comment_id   number,
                        3       commcol      clob)
                        4  /
                      
                      Table created.
                      
                      SCOTT@orcl_11gR2> insert all
                        2  into comments (comment_id, commcol) values
                        3  (1, 'I am working on a prototype for my "word count application" using Oracle Text.')
                        4  into comments (comment_id, commcol) values
                        5  (2, 'My requirement is to take a bunch of customer comments(text column)
                        6  and come up with top n most frequent words from those comments.')
                        7  into comments (comment_id, commcol) values
                        8  (3, 'I retrieve those customer comments/sub data set depending upon other
                        9  dimension values(star schema).')
                       10  into comments (comment_id, commcol) values
                       11  (4, 'From Oracle Text documentation I did not find a way to do it.')
                       12  into comments (comment_id, commcol) values
                       13  (5, 'All examples that I have seen so far were related to text query applications
                       14  but I do not want to query by using any particular word instead I want to retrieve
                       15  top n words.')
                       16  into comments (comment_id, commcol) values
                       17  (6, 'Can I do this using Oracle text?')
                       18  select * from dual
                       19  /
                      
                      6 rows created.
                      
                      SCOTT@orcl_11gR2> create index test_idx
                        2  on comments (commcol)
                        3  indextype is ctxsys.context
                        4  /
                      
                      Index created.
                      
                      SCOTT@orcl_11gR2> create or replace package my_tokens_pkg
                        2  as
                        3    type outrec_typ is record
                        4        (token      varchar2 (64));
                        5    type outrecset is table of outrec_typ;
                        6    function my_tokens_func
                        7        (p_index in varchar2,
                        8         p_rowid in rowid)
                        9        return      outrecset pipelined;
                       10  end my_tokens_pkg;
                       11  /
                      
                      Package created.
                      
                      SCOTT@orcl_11gR2> show errors
                      No errors.
                      SCOTT@orcl_11gR2> create or replace package body my_tokens_pkg
                        2  as
                        3    function my_tokens_func
                        4        (p_index in varchar2,
                        5         p_rowid in rowid)
                        6        return      outrecset pipelined
                        7    is
                        8        v_token_tab ctx_doc.token_tab;
                        9        out_rec     outrec_typ;
                       10    begin
                       11        ctx_doc.set_key_type ('ROWID');
                       12        ctx_doc.tokens (p_index, p_rowid, v_token_tab);
                       13        for i in 1 .. v_token_tab.count loop
                       14          out_rec.token := v_token_tab(i).token;
                       15          pipe row (out_rec);
                       16        end loop;
                       17        return;
                       18    end my_tokens_func;
                       19  end my_tokens_pkg;
                       20  /
                      
                      Package body created.
                      
                      SCOTT@orcl_11gR2> show errors
                      No errors.
                      SCOTT@orcl_11gR2> column  tokens format a30
                      SCOTT@orcl_11gR2> select c.comment_id, count (*) as token_count, t.token as tokens
                        2  from   comments c,
                        3           table (my_tokens_pkg.my_tokens_func ('test_idx', c.rowid)) t
                        4  group  by c.comment_id, t.token
                        5  order  by comment_id, tokens
                        6  /
                      
                      COMMENT_ID TOKEN_COUNT TOKENS
                      ---------- ----------- ------------------------------
                               1           1 AM
                               1           1 APPLICATION
                               1           1 COUNT
                               1           1 ORACLE
                               1           1 PROTOTYPE
                               1           1 TEXT
                               1           1 USING
                               1           1 WORD
                               1           1 WORKING
                               2           1 BUNCH
                               2           1 COLUMN
                               2           1 COME
                               2           2 COMMENTS
                               2           1 CUSTOMER
                               2           1 FREQUENT
                               2           1 MOST
                               2           1 N
                               2           1 REQUIREMENT
                               2           1 TAKE
                               2           1 TEXT
                               2           1 TOP
                               2           1 UP
                               2           1 WORDS
                               3           1 COMMENTS
                               3           1 CUSTOMER
                               3           1 DATA
                               3           1 DEPENDING
                               3           1 DIMENSION
                               3           1 OTHER
                               3           1 RETRIEVE
                               3           1 SCHEMA
                               3           1 SET
                               3           1 STAR
                               3           1 SUB
                               3           1 UPON
                               3           1 VALUES
                               4           1 DOCUMENTATION
                               4           1 FIND
                               4           1 ORACLE
                               4           1 TEXT
                               4           1 WAY
                               5           1 APPLICATIONS
                               5           1 EXAMPLES
                               5           1 FAR
                               5           1 INSTEAD
                               5           1 N
                               5           1 PARTICULAR
                               5           2 QUERY
                               5           1 RELATED
                               5           1 RETRIEVE
                               5           1 SEEN
                               5           1 TEXT
                               5           1 TOP
                               5           1 USING
                               5           2 WANT
                               5           1 WORD
                               5           1 WORDS
                               6           1 ORACLE
                               6           1 TEXT
                               6           1 USING
                      
                      60 rows selected.