This discussion is archived
4 Replies Latest reply: Aug 13, 2013 11:50 PM by flavioc RSS

how to index words containing letters as html entities?

flavioc Pro
Currently Being Moderated

The title says it all.

I am currently replacing known HTML entities with their Unicode counterparts upfront, but I was wondering if some built-in feature in Oracle Text could do the same and save an additional headache.

 

Indexed html entities anyone?

 

Thank you

Flavio

 

----

http://oraclequirks.blogspot.com

http://www.yocoya.com

  • 1. Re: how to index words containing letters as html entities?
    Barbara Boehmer Oracle ACE
    Currently Being Moderated

    flavioc wrote:

     

    The title says it all....

    Sorry, but I don't get it.  Please provide an example with some data containing "html entities" and what tokens you want indexed and what you would search on expecting to get that row.
  • 2. Re: how to index words containing letters as html entities?
    flavioc Pro
    Currently Being Moderated

    Hi Barbara,

    here we go:

     

     

    set scan off

    begin

    ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

    ctx_ddl.set_attribute ( 'mylex', 'base_letter', 'YES');

    end;

    /

     

    create table example (t varchar2(4000));

     

    insert into example values('crónicas y relatos');

    insert into example values('crónicas y relatos');

     

    create index myindex on example(t) indextype is ctxsys.context

    parameters('FILTER ctxsys.null_filter LEXER mylex');

     

    select * from example where contains(t, 'cronicas') > 0;

     

    T                

    -----------------------

    crónicas y relatos

     

    As you see the second row is not returned by the query because the entity "ó" is not indexed as "ó", hence my question about what alternatives are there, if any, other than manually search and replace all the possible html entities before indexing the table.

     

    Thank you

    Flavio

     

    ------

    http://oraclequirks.blogspot.com

    http://www.yocoya.com

  • 3. Re: how to index words containing letters as html entities?
    Barbara Boehmer Oracle ACE
    Currently Being Moderated

    You can create your own procedure that uses whatever method you like, then use that procedure in a procedure filter and use that procedure filter in your index parameters.  In the example below, I have borrowed a strip_html function from

    http://www.supermanhamuerto.com/doku.php?id=oracle:fixhtml

    and used that in the procedure.


    SCOTT@orcl12c_11gR2> set scan off

    SCOTT@orcl12c_11gR2> -- table, data, and lexer:

    SCOTT@orcl12c_11gR2> create table example (t varchar2(4000))

      2  /

     

    Table created.

     

    SCOTT@orcl12c_11gR2> insert all

      2  into example values ('crónicas y relatos')

      3  into example values ('crónicas y relatos')

      4  into example values ('CRÓnicas y Relatos de México')

      5  into example values ('Crónicas y relatos de México')

      6  select * from dual

      7  /

     

    4 rows created.

     

    SCOTT@orcl12c_11gR2> begin

      2    ctx_ddl.create_preference ('mylex', 'BASIC_LEXER');

      3    ctx_ddl.set_attribute ( 'mylex', 'base_letter', 'YES');

      4  end;

      5  /

     

    PL/SQL procedure successfully completed.

     

    SCOTT@orcl12c_11gR2> -- function from http://www.supermanhamuerto.com/doku.php?id=oracle:fixhtml

    SCOTT@orcl12c_11gR2> CREATE OR REPLACE FUNCTION strip_html(dirty IN clob,

      2                         to_cvs IN NUMBER DEFAULT 0)

      3    RETURN clob IS OUT clob ;

      4 

      5    TYPE arr_string IS varray (200) OF VARCHAR2(64);

      6 

      7    entities_search_for arr_string;

      8    entities_replace arr_string;

      9    cont NUMBER;

    10 

    11  BEGIN

    12 

    13 

    14  -- to accelerate the issue

    15  IF dirty IS NULL THEN

    16      RETURN dirty;

    17  END IF; -- isnull(dirty)

    18 

    19  IF LENGTH( dirty ) = 0 THEN

    20      RETURN dirty;

    21  END IF; -- length(dirty)

    22 

    23  entities_search_for := arr_string(

    24  '!',

    25  '#',

    26  '$',

    27  '%',

    28  '&',

    29  '"',

    30  '(',

    31  ')',

    32  '*',

    33  '+',

    34  ',',

    35  '‐',

    36  '.',

    37  '/',

    38  ':',

    39  ';',

    40  '<',

    41  '=',

    42  '>',

    43  '?',

    44  '@',

    45  '[',

    46  '\',

    47  ']',

    48  'ˆ',

    49  '_',

    50  '`',

    51  '{',

    52  '|',

    53  '}',

    54  '˜',

    55  ' ',

    56  '¡',

    57  '¢',

    58  '£',

    59  '¤',

    60  '¥',

    61  '¦',

    62  '§',

    63  '¨',

    64  '©',

    65  'ª',

    66  '«',

    67  '¬',

    68  '­',

    69  '®',

    70  '¯',

    71  '°',

    72  '±',

    73  '²',

    74  '³',

    75  '´',

    76  'µ',

    77  '¶',

    78  '·',

    79  '¸',

    80  '¹',

    81  'º',

    82  '»',

    83  '&fr;',

    84  '&fr;',

    85  '&fr;',

    86  '¿',

    87  'À',

    88  'Á',

    89  'Â',

    90  'Ã',

    91  'Ä',

    92  'Å',

    93  'Æ',

    94  '&il;',

    95  'È',

    96  'É',

    97  'Ê',

    98  'Ë',

    99  'Ì',

    100  'Í',

    101  'Î',

    102  'Ï',

    103  'Ð',

    104  'Ñ',

    105  'Ò',

    106  'Ó',

    107  'Ô',

    108  'Õ',

    109  'Ö',

    110  '×',

    111  'Ø',

    112  'Ù',

    113  'Ú',

    114  'Û',

    115  'Ü',

    116  'Ý',

    117  'Þ',

    118  'ß',

    119  'à',

    120  'á',

    121  'â',

    122  'ã',

    123  'ä',

    124  'è',

    125  'é',

    126  'ê',

    127  '&etilde;',

    128  'ë',

    129  'ì',

    130  'í',

    131  'î',

    132  'ĩ',

    133  'ï',

    134  'ò',

    135  'ó',

    136  'ô',

    137  'õ',

    138  'ö',

    139  'ù',

    140  'ú',

    141  'û',

    142  'ũ',

    143  'ü');

    144 

    145  entities_replace := arr_string(

    146  '¡',

    147  'º',

    148  '$',

    149  '%',

    150  '&',

    151  '"',

    152  '(',

    153  ')',

    154  '*',

    155  '+',

    156  ',',

    157  '-',

    158  '.',

    159  'Sol',

    160  'Colon',

    161  '*',

    162  '<',

    163  '=',

    164  '>',

    165  '?',

    166  ',',

    167  '*',

    168  '*',

    169  '*',

    170  '*',

    171  '_',

    172  '''',

    173  '*',

    174  '*',

    175  '*',

    176  '''',

    177  ' ',

    178  '¡',

    179  'cent',

    180  'L',

    181  '*',

    182  'Y',

    183  '*',

    184  '*',

    185  '.',

    186  '(c)',

    187  '*',

    188  '*',

    189  '!',

    190  '*',

    191  '(r)',

    192  '*',

    193  '*',

    194  '*',

    195  '*',

    196  '*',

    197  'á',

    198  'u',

    199  '*',

    200  '·',

    201  'ç',

    202  '*',

    203  '*',

    204  '*',

    205  '*',

    206  '*',

    207  '*',

    208  '¿',

    209  'È',

    210  'Á',

    211  'Ä',

    212  'Á',

    213  '*',

    214  '*',

    215  'AE',

    216  '*',

    217  'È',

    218  'É',

    219  '*',

    220  '*',

    221  'Ì',

    222  'Í',

    223  'Î',

    224  '*',

    225  '*',

    226  'N',

    227  'Ò',

    228  'Ó',

    229  'Ô',

    230  'O',

    231  '*',

    232  '*',

    233  'O',

    234  'Ù',

    235  'Ú',

    236  'Û',

    237  '*',

    238  '*',

    239  '*',

    240  '*',

    241  'à',

    242  'á',

    243  'â',

    244  'a',

    245  '*',

    246  'è',

    247  'é',

    248  'ê',

    249  'e',

    250  '*',

    251  'ì',

    252  'í',

    253  'î',

    254  'i',

    255  '*',

    256  'ò',

    257  'ó',

    258  'ô',

    259  'o',

    260  '*',

    261  'ù',

    262  'ú',

    263  'û',

    264  'u',

    265  '*');

    266 

    267    OUT := dirty;

    268 

    269    -- replace what is enclosed between <xml> and </xml>

    270    -- *? -> lazy star (catches the minimum possible)

    271    OUT := regexp_replace(OUT, '<xml>.*?</xml>', '', 1, 0, 'ni' );

    272    -- clean what it is inside the style tags

    273    OUT := regexp_replace(OUT, '<style>.*?</style>', '', 1, 0, 'ni' );

    274 

    275    IF to_cvs = 2 THEN

    276        -- sanitize (not clean) the html

    277 

    278        -- clean the tag <?xml:whatever>

    279        OUT := regexp_replace(OUT, '<\?xml:.*?>', '', 1, 0, 'ni');

    280        -- clean the tags <img whatever>

    281        OUT := regexp_replace(OUT, '<img.*?>', '', 1, 0, 'ni');

    282        -- clean comments

    283        OUT := regexp_replace(OUT,'<!--.*?-->','', 1, 0, 'ni');

    284        -- clean meta

    285        OUT := regexp_replace(OUT,'<meta.*?>','', 1, 0, 'ni');

    286        -- clean link

    287        OUT := regexp_replace(OUT,'<link.*?>','', 1, 0, 'ni');

    288        -- clean DIV

    289        OUT := regexp_replace(OUT,'</?div.*?>','', 1, 0, 'ni');

    290        -- clean SPAN

    291        OUT := regexp_replace(OUT,'</?span.*?>','', 1, 0, 'ni');

    292        -- clean "class inside tags"

    293        OUT := regexp_replace(OUT,'(<.*?)class="?[a-zA-Z0-9-_]*"?(.*?>)', '\1\2', 1, 0, 'ni');

    294        -- clean "style" inside the following tags: i b p

    295        OUT := regexp_replace(OUT,'(<[ibp] .*?)style=".*?"(.*?>)', '\1\2', 1, 0, 'ni');

    296        -- clean namespaces <o:p> </o:p>

    297        OUT := regexp_replace(OUT, '(<)[a-zA-Z0-9-_]*:(.*?>)', '\1\2', 1, 0, 'ni');

    298        OUT := regexp_replace(OUT, '(</)[a-zA-Z0-9-_]*:(.*?>)', '\1\2', 1, 0, 'ni');

    299 

    300        -- clean empty opening and closing tags: it has to be

    301        -- passed twice or three times to clean things like this:

    302        -- <strong><u></u></strong>

    303        -- TWEAK: <p></p> must be replaced by <br/>

    304        OUT := regexp_replace(OUT,'<p></p>','<br/>', 1, 0, 'ni');

    305        OUT := regexp_replace(OUT,'<([a-zA-Z0-9-_]*)></\1>','', 1, 0, 'ni');

    306        -- TWEAK: <p></p> must be replaced by <br/>

    307        OUT := regexp_replace(OUT,'<p></p>','<br/>', 1, 0, 'ni');

    308        OUT := regexp_replace(OUT,'<([a-zA-Z0-9-_]*)></\1>','', 1, 0, 'ni');

    309 

    310    ELSE

    311       -- clean html

    312 

    313       -- replace all the stuff that is similar to a carriage return

    314       OUT := regexp_replace(OUT, '</p[^>]*>',CHR(10)||CHR(13));

    315       OUT := regexp_replace(OUT, '</br[^>]*>',CHR(10)||CHR(13));

    316       OUT := regexp_replace(OUT, '</tr[^>]*>',CHR(10)||CHR(13));

    317 

    318       -- replace all the remaining html stuff

    319       OUT := regexp_replace(OUT,'<[^>]*>','', 1, 0, 'ni');

    320 

    321       -- replace all the entities

    322       FOR cont IN 1..119 LOOP

    323         OUT := REPLACE( OUT, entities_search_for(cont), entities_replace(cont) );

    324       END LOOP;

    325 

    326       -- cleaning for export to cvs

    327       IF to_cvs = 1 THEN

    328          OUT := REPLACE( OUT, CHR(10), '' );

    329          OUT := REPLACE( OUT, CHR(13), '' );

    330          OUT := REPLACE( OUT, CHR(9), '' );

    331          OUT := REPLACE( OUT, ';', ',' );

    332          OUT := REPLACE( OUT, '"', '''' );

    333       END IF;

    334 

    335 

    336    END IF;

    337 

    338 

    339    RETURN(OUT);

    340  END strip_html;

    341  /

     

    Function created.

     

    SCOTT@orcl12c_11gR2> -- procedure that uses function:

    SCOTT@orcl12c_11gR2> create or replace procedure normalize

      2    (p_input  in           clob,

      3      p_output in out nocopy clob)

      4  as

      5  begin

      6    p_output := strip_html (p_input);

      7  end normalize;

      8  /

     

    Procedure created.

     

    SCOTT@orcl12c_11gR2> -- filter that uses procedure:

    SCOTT@orcl12c_11gR2> begin

      2    ctx_ddl.create_preference ('myfilt', 'procedure_filter');

      3    ctx_ddl.set_attribute ('myfilt', 'procedure', 'normalize');

      4    ctx_ddl.set_attribute ('myfilt', 'input_type', 'clob');

      5    ctx_ddl.set_attribute ('myfilt', 'output_type', 'clob');

      6  end;

      7  /

     

    PL/SQL procedure successfully completed.

     

    SCOTT@orcl12c_11gR2> -- index that uses filter:

    SCOTT@orcl12c_11gR2> create index myindex on example(t) indextype is ctxsys.context

      2  parameters ('FILTER myfilt LEXER mylex')

      3  /

     

    Index created.

     

    SCOTT@orcl12c_11gR2> -- tokens indexed:

    SCOTT@orcl12c_11gR2> select token_text from dr$myindex$i

      2  /

     

    TOKEN_TEXT

    ----------------------------------------------------------------

    CRONICAS

    DE

    MEXICO

    RELATOS

    Y

     

    5 rows selected.

     

    SCOTT@orcl12c_11gR2> -- searches:

    SCOTT@orcl12c_11gR2> select * from example where contains (t, 'cronicas') > 0

      2  /

     

    T

    --------------------------------------------------------------------------------

    crónicas y relatos

    cr&oacute;nicas y relatos

    CR&Oacute;nicas y Relatos de M&eacute;xico

    Crónicas y relatos de México

     

    4 rows selected.

     

    SCOTT@orcl12c_11gR2> select * from example where contains (t, 'Mexico') > 0

      2  /

     

    T

    --------------------------------------------------------------------------------

    CR&Oacute;nicas y Relatos de M&eacute;xico

    Crónicas y relatos de México

     

    2 rows selected.

  • 4. Re: how to index words containing letters as html entities?
    flavioc Pro
    Currently Being Moderated

    Barbara,

    thank you for the comprehensive reply, as I said I was doing similar cleansing work before inserting the column (currently the list of entities is slightly longer, I am also checking for hex-coded entities, but certainly this approach is better if one is not allowed to alter the original value.

     

    Flavio

     

    ----

    http://oraclequirks.blogspot.com

    http://www.yocoya.com

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points