This discussion is archived
4 Replies Latest reply: Dec 14, 2012 1:05 PM by Roger Ford RSS

clustering only a subset of documents

quadzilla Newbie
Currently Being Moderated
I am using unsupervised classification with CTX_CLS.CLUSTERING. Our document table has several million rows and the user might only want to cluster a few thousand. Because this method of clustering uses the text index on the documents table is it possible to only cluster those few thousand without copying the text off to another table and creating a separate text index for the clustering? When the user wants to cluster 10-20K of content this method is too resource and time consuming.


Here is the example given in the documentation

/* set the preference */
exec ctx_ddl.drop_preference('my_cluster');
exec ctx_ddl.create_preference('my_cluster','KMEAN_CLUSTERING');
exec ctx_ddl.set_attribute('my_cluster','CLUSTER_NUM','3');

/* do the clustering */
exec ctx_output.start_log('my_log');
exec ctx_cls.clustering('collectionx','id','restab','clusters','my_cluster');
exec ctx_output.end_log;

Thanks
  • 1. Re: clustering only a subset of documents
    Ayham Newbie
    Currently Being Moderated
    goog question, i want to know the how many records Oracle clustering can be cluster.
  • 2. Re: clustering only a subset of documents
    Roger Ford Expert
    Currently Being Moderated
    Sorry, I missed this first time round.

    There is no direct way to create a cluster on a subset of the table.

    However, although you have to create an index to base the clustering on, you do NOT need to populate that index. As in the example at
    http://docs.oracle.com/cd/B28359_01/text.111/b28303/classify.htm#i1007174
    create index collectionx on collection(text) 
       indextype is ctxsys.context parameters('nopopulate');
    So this is effectively instant. But you still need to copy the subset of the data. Can we avoid that?

    Yes, it seems we can. The CONTEXT index is really there just to tell clustering how to get and process the data. If we use a USER_DATASTORE procedure for the index, we can specify which rows should be processed and which should not. I've adapted the example from the doc to add a column to the table called USE_THIS_ROW. If set to 1, then the row is used. If set to 0, then it isn't. And you can see that the clustering results do not include any rows with USE_THIS_ROW = 0.
    -- Clustering example from the docs, adapted to use a user_datastore to decide which rows to process
    
    /* collect document into a table */
    drop table collection;
    
    create table collection (id number primary key, text varchar2(4000), use_this_row number);
    insert into collection values (1, 'Oracle Text can index any document or textual content.', 1);
    insert into collection values (2, 'Ultra Search uses a crawler to access documents.', 0);
    insert into collection values (3, 'XML is a tag-based markup language.', 1);
    insert into collection values (4, 'Oracle Database 11g XML DB treats XML as a native datatype in the database.', 1);
    insert into collection values (5, 'There are three Text index types to cover all text search needs.', 0);
    insert into collection values (6, 'Ultra Search also provides API for content management solutions.', 1);
    
    create or replace procedure my_proc 
         (rid in rowid, tlob in out nocopy clob) is 
    begin 
         -- this "for loop" will only execute once but it's easier this way than declaring a 
         -- separate cursor
         for c in ( select text, use_this_row from collection
                    where rowid = rid ) loop
              if c.use_this_row = 1 then
                    tlob := c.text;
              else
                    tlob := '';
              end if;
         end loop;
    end; 
    /
    list
    show errors
    
    exec ctx_ddl.drop_preference('my_datastore')
    
    exec ctx_ddl.create_preference('my_datastore', 'user_datastore')
    exec ctx_ddl.set_attribute('my_datastore', 'procedure', 'my_proc')
    
    create index collectionx on collection(text) 
       indextype is ctxsys.context parameters('datastore my_datastore nopopulate');
    
    drop table restab;
    
    /* prepare result tables, if you omit this step, procedure will create table automatically */
    create table restab (       
           docid NUMBER,
           clusterid NUMBER,
           score NUMBER);
    
    drop table clusters;
    
    create table clusters (
           clusterid NUMBER,
           descript varchar2(4000),
           label varchar2(200),
           sze   number,
           quality_score number,
           parent number);
    
    /* set the preference */
    exec ctx_ddl.drop_preference('my_cluster');
    exec ctx_ddl.create_preference('my_cluster','KMEAN_CLUSTERING');
    exec ctx_ddl.set_attribute('my_cluster','CLUSTER_NUM','3');
    
    /* do the clustering */
    exec ctx_output.start_log('my_log');
    exec ctx_cls.clustering('collectionx','id','restab','clusters','my_cluster');
    exec ctx_output.end_log;
    
    select docid, clusterid, score from restab order by clusterid, docid;
    You could probably get the same results by using a FILTER column, with some rows set to IGNORE. I've not tried that.
  • 3. Re: clustering only a subset of documents
    quadzilla Newbie
    Currently Being Moderated
    The column is currently indexed with a context index. So this is helpful to understand your user datastore idea I don't think I can use it due to that. Currently the data is copied off to a temp heap table , index created nonpopulate then clustered , results recorded then table dropped.
  • 4. Re: clustering only a subset of documents
    Roger Ford Expert
    Currently Being Moderated
    You could add a dummy column to the table and create the index on that.

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points