This site is currently read-only as we are migrating to Oracle Forums for an improved community experience. You will not be able to initiate activity until January 30th, when you will be able to use this site as normal.

    Forum Stats

  • 3,890,056 Users
  • 2,269,775 Discussions
  • 7,916,823 Comments

Discussions

Text Search skiping HTML tags

ratina
ratina Member Posts: 35
edited Apr 4, 2010 5:43PM in Text
I have a table containing clob column.

select code, details from search order by code;

CODE DETAILS
---------- -------------------------------------------------------
4 just a <b>test </b>insert
5 just a <b>test</b> insert
9 <HTML>just a <i>test</i> insert</HTML>
10 checking test insert

I have created a context index and add html tags in the stop list.

exec ctx_ddl.create_stoplist('mystop', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('mystop', '<b>');
exec ctx_ddl.add_stopword('mystop', '</b>');

CREATE INDEX searchi ON search(details)
INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS
('FILTER CTXSYS.AUTO_FILTER SECTION GROUP CTXSYS.AUTO_SECTION_GROUP STOPLIST MYSTOP');

But when I search 'test insert' it only shows the following rows

SQL> SELECT score(1), code, details FROM search WHERE CONTAINS(details, 'test insert', 1) > 0 ORDER BY score(1);

SCORE(1) CODE DETAILS
---------- ---------- --------------------------------------------------------------------------------
5 10 checking test insert
5 9 <HTML>just a <i>test</i> insert</HTML>

I would like to define a text index which skips the html keywords and returns all the rows contain the searching phrase
Tagged:

Answers

  • Herald ten Dam
    Herald ten Dam Member Posts: 1,254
    Hi,

    I see you created the index with the AUTO_FILTER? Oracle recommends for indexing HTML document to use the NULL-filter, see http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/cdatadic.htm#i1007244.

    There is also a HTML SECTION GROUP, maybe you get better result if you use that one instead of the AUTO_SECTION_GROUP, see http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/cdatadic.htm#i1009342.

    Herald ten Dam
    Superconsult.nl
    Herald ten Dam
  • ratina
    ratina Member Posts: 35
    thanks Herald

    but I have already using NULL_FILTER with CTXSYS.HTML_SECTION_GROUP section group

    SQL > CREATE INDEX searchi ON search(details) INDEXTYPE IS CTXSYS.CONTEXT PARAMETERS ('FILTER CTXSYS.NULL_FILTER SECTION GROUP CTXSYS.HTML_SECTION_GROUP');

    it works fine if the column values has no tag or started with html tags
    for example :
    checking test insert
    <HTML>just a <i>test</i> insert</HTML>
    </br>just a <i>test</i> insert

    But if I have column values like
    just a <b>test</b> insert

    Then the search result does not inclues this row
  • Barbara Boehmer
    Barbara Boehmer Member Posts: 4,625 Gold Trophy
    Since you did not use code tags in your post, most of your html does not show, so it is difficult to tell what html is in your data or what values you set for your stopwords. One problem with stopwords is that, although the word is not indexed, it still expects some word where the stopword was, so searching for "word1 word2" will not find "word1 removed_stopword word2". How about using a procedure_filter as demonstrated below? I only removed a few tags, so you would need to either expand it to include others or searching for starting and ending tags and remove what is inbetween.
    [email protected]_11g> CREATE TABLE search
      2    (code	 NUMBER,
      3  	details  CLOB)
      4  /
    
    Table created.
    
    [email protected]_11g> INSERT ALL
      2  INTO search VALUES (4, 'just a <b>test</b> insert')
      3  INTO search VALUES (5, 'just a <i>test</i> insert')
      4  INTO search VALUES (9, '<HTML>just a test insert</HTML>')
      5  INTO search VALUES (10, 'checking test insert')
      6  SELECT * FROM DUAL
      7  /
    
    4 rows created.
    
    [email protected]_11g> CREATE OR REPLACE PROCEDURE myproc
      2    (p_rowid    IN ROWID,
      3  	p_in_clob  IN CLOB,
      4  	p_out_clob IN OUT NOCOPY CLOB)
      5  AS
      6  BEGIN
      7    p_out_clob := REPLACE (p_in_clob, '<html>', '');
      8    p_out_clob := REPLACE (p_out_clob, '</html>', '');
      9    p_out_clob := REPLACE (p_out_clob, '<HTML>', '');
     10    p_out_clob := REPLACE (p_out_clob, '</HTML>', '');
     11    p_out_clob := REPLACE (p_out_clob, '<b>', '');
     12    p_out_clob := REPLACE (p_out_clob, '</b>', '');
     13    p_out_clob := REPLACE (p_out_clob, '<B>', '');
     14    p_out_clob := REPLACE (p_out_clob, '</B>', '');
     15    p_out_clob := REPLACE (p_out_clob, '<i>', '');
     16    p_out_clob := REPLACE (p_out_clob, '</i>', '');
     17    p_out_clob := REPLACE (p_out_clob, '<I>', '');
     18    p_out_clob := REPLACE (p_out_clob, '</I>', '');
     19  END myproc;
     20  /
    
    Procedure created.
    
    [email protected]_11g> SHOW ERRORS
    No errors.
    [email protected]_11g> BEGIN
      2    CTX_DDL.CREATE_PREFERENCE ('myfilter', 'PROCEDURE_FILTER');
      3    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'PROCEDURE', 'myproc');
      4    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'ROWID_PARAMETER', 'TRUE');
      5    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'INPUT_TYPE', 'CLOB');
      6    CTX_DDL.SET_ATTRIBUTE ('myfilter', 'OUTPUT_TYPE', 'CLOB');
      7  END;
      8  /
    
    PL/SQL procedure successfully completed.
    
    [email protected]_11g> CREATE INDEX searchi
      2  ON search (details)
      3  INDEXTYPE IS CTXSYS.CONTEXT
      4  PARAMETERS ('FILTER myfilter')
      5  /
    
    Index created.
    
    [email protected]_11g> SELECT token_text FROM dr$searchi$i
      2  /
    
    TOKEN_TEXT
    ----------------------------------------------------------------
    CHECKING
    INSERT
    TEST
    
    3 rows selected.
    
    [email protected]_11g> COLUMN details FORMAT A35
    [email protected]_11g> SELECT score (1), code, details
      2  FROM   search
      3  WHERE  CONTAINS (details, 'test insert', 1) > 0
      4  ORDER  BY score (1)
      5  /
    
      SCORE(1)       CODE DETAILS
    ---------- ---------- -----------------------------------
             3          4 just a <b>test</b> insert
             3          5 just a <i>test</i> insert
             3          9 <HTML>just a test insert</HTML>
             3         10 checking test insert
    
    4 rows selected.
    
    [email protected]_11g> 
    Barbara Boehmer
This discussion has been closed.