4 Replies Latest reply: Aug 14, 2013 1:50 AM by flavioc RSS

    how to index words containing letters as html entities?

    flavioc

      The title says it all.

      I am currently replacing known HTML entities with their Unicode counterparts upfront, but I was wondering if some built-in feature in Oracle Text could do the same and save an additional headache.

       

      Indexed html entities anyone?

       

      Thank you

      Flavio

       

      ----

      http://oraclequirks.blogspot.com

      http://www.yocoya.com

        • 1. Re: how to index words containing letters as html entities?
          Barbara Boehmer

          flavioc wrote:

           

          The title says it all....

          Sorry, but I don't get it.  Please provide an example with some data containing "html entities" and what tokens you want indexed and what you would search on expecting to get that row.
          • 2. Re: how to index words containing letters as html entities?
            flavioc

            Hi Barbara,

            here we go:

             

             

            set scan off

            begin

            ctx_ddl.create_preference('mylex', 'BASIC_LEXER');

            ctx_ddl.set_attribute ( 'mylex', 'base_letter', 'YES');

            end;

            /

             

            create table example (t varchar2(4000));

             

            insert into example values('crónicas y relatos');

            insert into example values('crónicas y relatos');

             

            create index myindex on example(t) indextype is ctxsys.context

            parameters('FILTER ctxsys.null_filter LEXER mylex');

             

            select * from example where contains(t, 'cronicas') > 0;

             

            T                

            -----------------------

            crónicas y relatos

             

            As you see the second row is not returned by the query because the entity "ó" is not indexed as "ó", hence my question about what alternatives are there, if any, other than manually search and replace all the possible html entities before indexing the table.

             

            Thank you

            Flavio

             

            ------

            http://oraclequirks.blogspot.com

            http://www.yocoya.com

            • 3. Re: how to index words containing letters as html entities?
              Barbara Boehmer

              You can create your own procedure that uses whatever method you like, then use that procedure in a procedure filter and use that procedure filter in your index parameters.  In the example below, I have borrowed a strip_html function from

              http://www.supermanhamuerto.com/doku.php?id=oracle:fixhtml

              and used that in the procedure.


              SCOTT@orcl12c_11gR2> set scan off

              SCOTT@orcl12c_11gR2> -- table, data, and lexer:

              SCOTT@orcl12c_11gR2> create table example (t varchar2(4000))

                2  /

               

              Table created.

               

              SCOTT@orcl12c_11gR2> insert all

                2  into example values ('crónicas y relatos')

                3  into example values ('crónicas y relatos')

                4  into example values ('CRÓnicas y Relatos de México')

                5  into example values ('Crónicas y relatos de México')

                6  select * from dual

                7  /

               

              4 rows created.

               

              SCOTT@orcl12c_11gR2> begin

                2    ctx_ddl.create_preference ('mylex', 'BASIC_LEXER');

                3    ctx_ddl.set_attribute ( 'mylex', 'base_letter', 'YES');

                4  end;

                5  /

               

              PL/SQL procedure successfully completed.

               

              SCOTT@orcl12c_11gR2> -- function from http://www.supermanhamuerto.com/doku.php?id=oracle:fixhtml

              SCOTT@orcl12c_11gR2> CREATE OR REPLACE FUNCTION strip_html(dirty IN clob,

                2                         to_cvs IN NUMBER DEFAULT 0)

                3    RETURN clob IS OUT clob ;

                4 

                5    TYPE arr_string IS varray (200) OF VARCHAR2(64);

                6 

                7    entities_search_for arr_string;

                8    entities_replace arr_string;

                9    cont NUMBER;

              10 

              11  BEGIN

              12 

              13 

              14  -- to accelerate the issue

              15  IF dirty IS NULL THEN

              16      RETURN dirty;

              17  END IF; -- isnull(dirty)

              18 

              19  IF LENGTH( dirty ) = 0 THEN

              20      RETURN dirty;

              21  END IF; -- length(dirty)

              22 

              23  entities_search_for := arr_string(

              24  '!',

              25  '#',

              26  '$',

              27  '%',

              28  '&',

              29  '"',

              30  '(',

              31  ')',

              32  '*',

              33  '+',

              34  ',',

              35  '‐',

              36  '.',

              37  '/',

              38  ':',

              39  ';',

              40  '<',

              41  '=',

              42  '>',

              43  '?',

              44  '@',

              45  '[',

              46  '\',

              47  ']',

              48  'ˆ',

              49  '_',

              50  '`',

              51  '{',

              52  '|',

              53  '}',

              54  '˜',

              55  ' ',

              56  '¡',

              57  '¢',

              58  '£',

              59  '¤',

              60  '¥',

              61  '¦',

              62  '§',

              63  '¨',

              64  '©',

              65  'ª',

              66  '«',

              67  '¬',

              68  '­',

              69  '®',

              70  '¯',

              71  '°',

              72  '±',

              73  '²',

              74  '³',

              75  '´',

              76  'µ',

              77  '¶',

              78  '·',

              79  '¸',

              80  '¹',

              81  'º',

              82  '»',

              83  '&fr;',

              84  '&fr;',

              85  '&fr;',

              86  '¿',

              87  'À',

              88  'Á',

              89  'Â',

              90  'Ã',

              91  'Ä',

              92  'Å',

              93  'Æ',

              94  '&il;',

              95  'È',

              96  'É',

              97  'Ê',

              98  'Ë',

              99  'Ì',

              100  'Í',

              101  'Î',

              102  'Ï',

              103  'Ð',

              104  'Ñ',

              105  'Ò',

              106  'Ó',

              107  'Ô',

              108  'Õ',

              109  'Ö',

              110  '×',

              111  'Ø',

              112  'Ù',

              113  'Ú',

              114  'Û',

              115  'Ü',

              116  'Ý',

              117  'Þ',

              118  'ß',

              119  'à',

              120  'á',

              121  'â',

              122  'ã',

              123  'ä',

              124  'è',

              125  'é',

              126  'ê',

              127  '&etilde;',

              128  'ë',

              129  'ì',

              130  'í',

              131  'î',

              132  'ĩ',

              133  'ï',

              134  'ò',

              135  'ó',

              136  'ô',

              137  'õ',

              138  'ö',

              139  'ù',

              140  'ú',

              141  'û',

              142  'ũ',

              143  'ü');

              144 

              145  entities_replace := arr_string(

              146  '¡',

              147  'º',

              148  '$',

              149  '%',

              150  '&',

              151  '"',

              152  '(',

              153  ')',

              154  '*',

              155  '+',

              156  ',',

              157  '-',

              158  '.',

              159  'Sol',

              160  'Colon',

              161  '*',

              162  '<',

              163  '=',

              164  '>',

              165  '?',

              166  ',',

              167  '*',

              168  '*',

              169  '*',

              170  '*',

              171  '_',

              172  '''',

              173  '*',

              174  '*',

              175  '*',

              176  '''',

              177  ' ',

              178  '¡',

              179  'cent',

              180  'L',

              181  '*',

              182  'Y',

              183  '*',

              184  '*',

              185  '.',

              186  '(c)',

              187  '*',

              188  '*',

              189  '!',

              190  '*',

              191  '(r)',

              192  '*',

              193  '*',

              194  '*',

              195  '*',

              196  '*',

              197  'á',

              198  'u',

              199  '*',

              200  '·',

              201  'ç',

              202  '*',

              203  '*',

              204  '*',

              205  '*',

              206  '*',

              207  '*',

              208  '¿',

              209  'È',

              210  'Á',

              211  'Ä',

              212  'Á',

              213  '*',

              214  '*',

              215  'AE',

              216  '*',

              217  'È',

              218  'É',

              219  '*',

              220  '*',

              221  'Ì',

              222  'Í',

              223  'Î',

              224  '*',

              225  '*',

              226  'N',

              227  'Ò',

              228  'Ó',

              229  'Ô',

              230  'O',

              231  '*',

              232  '*',

              233  'O',

              234  'Ù',

              235  'Ú',

              236  'Û',

              237  '*',

              238  '*',

              239  '*',

              240  '*',

              241  'à',

              242  'á',

              243  'â',

              244  'a',

              245  '*',

              246  'è',

              247  'é',

              248  'ê',

              249  'e',

              250  '*',

              251  'ì',

              252  'í',

              253  'î',

              254  'i',

              255  '*',

              256  'ò',

              257  'ó',

              258  'ô',

              259  'o',

              260  '*',

              261  'ù',

              262  'ú',

              263  'û',

              264  'u',

              265  '*');

              266 

              267    OUT := dirty;

              268 

              269    -- replace what is enclosed between <xml> and </xml>

              270    -- *? -> lazy star (catches the minimum possible)

              271    OUT := regexp_replace(OUT, '<xml>.*?</xml>', '', 1, 0, 'ni' );

              272    -- clean what it is inside the style tags

              273    OUT := regexp_replace(OUT, '<style>.*?</style>', '', 1, 0, 'ni' );

              274 

              275    IF to_cvs = 2 THEN

              276        -- sanitize (not clean) the html

              277 

              278        -- clean the tag <?xml:whatever>

              279        OUT := regexp_replace(OUT, '<\?xml:.*?>', '', 1, 0, 'ni');

              280        -- clean the tags <img whatever>

              281        OUT := regexp_replace(OUT, '<img.*?>', '', 1, 0, 'ni');

              282        -- clean comments

              283        OUT := regexp_replace(OUT,'<!--.*?-->','', 1, 0, 'ni');

              284        -- clean meta

              285        OUT := regexp_replace(OUT,'<meta.*?>','', 1, 0, 'ni');

              286        -- clean link

              287        OUT := regexp_replace(OUT,'<link.*?>','', 1, 0, 'ni');

              288        -- clean DIV

              289        OUT := regexp_replace(OUT,'</?div.*?>','', 1, 0, 'ni');

              290        -- clean SPAN

              291        OUT := regexp_replace(OUT,'</?span.*?>','', 1, 0, 'ni');

              292        -- clean "class inside tags"

              293        OUT := regexp_replace(OUT,'(<.*?)class="?[a-zA-Z0-9-_]*"?(.*?>)', '\1\2', 1, 0, 'ni');

              294        -- clean "style" inside the following tags: i b p

              295        OUT := regexp_replace(OUT,'(<[ibp] .*?)style=".*?"(.*?>)', '\1\2', 1, 0, 'ni');

              296        -- clean namespaces <o:p> </o:p>

              297        OUT := regexp_replace(OUT, '(<)[a-zA-Z0-9-_]*:(.*?>)', '\1\2', 1, 0, 'ni');

              298        OUT := regexp_replace(OUT, '(</)[a-zA-Z0-9-_]*:(.*?>)', '\1\2', 1, 0, 'ni');

              299 

              300        -- clean empty opening and closing tags: it has to be

              301        -- passed twice or three times to clean things like this:

              302        -- <strong><u></u></strong>

              303        -- TWEAK: <p></p> must be replaced by <br/>

              304        OUT := regexp_replace(OUT,'<p></p>','<br/>', 1, 0, 'ni');

              305        OUT := regexp_replace(OUT,'<([a-zA-Z0-9-_]*)></\1>','', 1, 0, 'ni');

              306        -- TWEAK: <p></p> must be replaced by <br/>

              307        OUT := regexp_replace(OUT,'<p></p>','<br/>', 1, 0, 'ni');

              308        OUT := regexp_replace(OUT,'<([a-zA-Z0-9-_]*)></\1>','', 1, 0, 'ni');

              309 

              310    ELSE

              311       -- clean html

              312 

              313       -- replace all the stuff that is similar to a carriage return

              314       OUT := regexp_replace(OUT, '</p[^>]*>',CHR(10)||CHR(13));

              315       OUT := regexp_replace(OUT, '</br[^>]*>',CHR(10)||CHR(13));

              316       OUT := regexp_replace(OUT, '</tr[^>]*>',CHR(10)||CHR(13));

              317 

              318       -- replace all the remaining html stuff

              319       OUT := regexp_replace(OUT,'<[^>]*>','', 1, 0, 'ni');

              320 

              321       -- replace all the entities

              322       FOR cont IN 1..119 LOOP

              323         OUT := REPLACE( OUT, entities_search_for(cont), entities_replace(cont) );

              324       END LOOP;

              325 

              326       -- cleaning for export to cvs

              327       IF to_cvs = 1 THEN

              328          OUT := REPLACE( OUT, CHR(10), '' );

              329          OUT := REPLACE( OUT, CHR(13), '' );

              330          OUT := REPLACE( OUT, CHR(9), '' );

              331          OUT := REPLACE( OUT, ';', ',' );

              332          OUT := REPLACE( OUT, '"', '''' );

              333       END IF;

              334 

              335 

              336    END IF;

              337 

              338 

              339    RETURN(OUT);

              340  END strip_html;

              341  /

               

              Function created.

               

              SCOTT@orcl12c_11gR2> -- procedure that uses function:

              SCOTT@orcl12c_11gR2> create or replace procedure normalize

                2    (p_input  in           clob,

                3      p_output in out nocopy clob)

                4  as

                5  begin

                6    p_output := strip_html (p_input);

                7  end normalize;

                8  /

               

              Procedure created.

               

              SCOTT@orcl12c_11gR2> -- filter that uses procedure:

              SCOTT@orcl12c_11gR2> begin

                2    ctx_ddl.create_preference ('myfilt', 'procedure_filter');

                3    ctx_ddl.set_attribute ('myfilt', 'procedure', 'normalize');

                4    ctx_ddl.set_attribute ('myfilt', 'input_type', 'clob');

                5    ctx_ddl.set_attribute ('myfilt', 'output_type', 'clob');

                6  end;

                7  /

               

              PL/SQL procedure successfully completed.

               

              SCOTT@orcl12c_11gR2> -- index that uses filter:

              SCOTT@orcl12c_11gR2> create index myindex on example(t) indextype is ctxsys.context

                2  parameters ('FILTER myfilt LEXER mylex')

                3  /

               

              Index created.

               

              SCOTT@orcl12c_11gR2> -- tokens indexed:

              SCOTT@orcl12c_11gR2> select token_text from dr$myindex$i

                2  /

               

              TOKEN_TEXT

              ----------------------------------------------------------------

              CRONICAS

              DE

              MEXICO

              RELATOS

              Y

               

              5 rows selected.

               

              SCOTT@orcl12c_11gR2> -- searches:

              SCOTT@orcl12c_11gR2> select * from example where contains (t, 'cronicas') > 0

                2  /

               

              T

              --------------------------------------------------------------------------------

              crónicas y relatos

              cr&oacute;nicas y relatos

              CR&Oacute;nicas y Relatos de M&eacute;xico

              Crónicas y relatos de México

               

              4 rows selected.

               

              SCOTT@orcl12c_11gR2> select * from example where contains (t, 'Mexico') > 0

                2  /

               

              T

              --------------------------------------------------------------------------------

              CR&Oacute;nicas y Relatos de M&eacute;xico

              Crónicas y relatos de México

               

              2 rows selected.

              • 4. Re: how to index words containing letters as html entities?
                flavioc

                Barbara,

                thank you for the comprehensive reply, as I said I was doing similar cleansing work before inserting the column (currently the list of entities is slightly longer, I am also checking for hex-coded entities, but certainly this approach is better if one is not allowed to alter the original value.

                 

                Flavio

                 

                ----

                http://oraclequirks.blogspot.com

                http://www.yocoya.com