HTML content not handled as expected

GB_CHUV Member Posts: 4 Red Ribbon
edited Jun 30, 2020 8:37AM in Text

Dear community,

My document is an HTML document, but its content is split across multiple rows in a table.

Most often the text content contains HTML tags, but sometimes it has no tags at all, only HTML character entities (mostly for accented characters).

For example: "Coronarographie &#233;lective" (which stands for "Coronarographie élective").

It appears that my index treats this content as if it were not HTML.

Therefore, the following CONTAINS query is unable to find this record: CONTAINS(PARAGRAPH_CONTENT, 'élective') > 0

However, if I add an opening and closing html tag ("Coronarographie &#233;lective" becoming "<html>Coronarographie &#233;lective</html>"), then it is handled correctly by the index.
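
As a rough sketch of that observation, the statement below wraps every untagged row of the MyTEST table (from the test code further down) in html tags. It assumes that modifying the stored text is acceptable and that a lowercase '<html>' tag is the only marker worth checking; both are assumptions for illustration, not a confirmed fix.

```sql
-- Sketch only: wrap rows that contain no <html> tag so that the HTML
-- section group sees them as HTML. Assumes a lowercase '<html>' tag is
-- the only marker to look for.
UPDATE MyTEST
SET    PARAGRAPH_CONTENT = '<html>' || PARAGRAPH_CONTENT || '</html>'
WHERE  DBMS_LOB.INSTR(PARAGRAPH_CONTENT, '<html>') = 0;
COMMIT;

-- Run this line only if TEST_HTML_IDX already exists: a CONTEXT index
-- picks up the changed rows only after it has been synchronized.
EXEC CTX_DDL.SYNC_INDEX('TEST_HTML_IDX')
```

If the update runs before CREATE INDEX in the test code, the sync step is not needed.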

Is there a way to force this content to be treated as HTML?

Thanks for the help.

TEST CODE:

-- In SQL*Plus, stop '&' from being read as a substitution variable
SET DEFINE OFF

CREATE TABLE MyTEST (PARAGRAPH_CONTENT CLOB);

-- Row 1: entity only, no tags. Row 2: same text wrapped in html tags.
INSERT INTO MyTEST VALUES ('Coronarographie &#233;lective');
INSERT INTO MyTEST VALUES ('<html>Coronarographie &#233;lective</html>');

EXEC CTX_DDL.CREATE_PREFERENCE('TEST_HTML_LXR', 'BASIC_LEXER')
EXEC CTX_DDL.SET_ATTRIBUTE('TEST_HTML_LXR', 'PRINTJOINS', '_')

CREATE INDEX TEST_HTML_IDX ON MyTEST(PARAGRAPH_CONTENT) INDEXTYPE IS ctxsys.context
    PARAMETERS ('
        datastore       CTXSYS.DEFAULT_DATASTORE
        filter          CTXSYS.NULL_FILTER
        lexer           TEST_HTML_LXR
        section group   CTXSYS.HTML_SECTION_GROUP
        ') PARALLEL 16;

SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT,'Coronarographie')>0; --> 2 rows
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT,'Coronarographie AND élective')>0; --> 1 row

-- Cleanup
EXEC CTX_DDL.DROP_PREFERENCE('TEST_HTML_LXR')
DROP INDEX TEST_HTML_IDX;
DROP TABLE MyTEST;
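
To see how each row was actually tokenized, one option is to look at the index's token table. The query below is a minimal sketch that assumes the default Oracle Text naming (DR$<index name>$I) for TEST_HTML_IDX and must be run before the cleanup statements drop the index.

```sql
-- Sketch: list the tokens stored for the test index.
-- Assumes the default token table name DR$TEST_HTML_IDX$I.
SELECT token_text, token_type, token_count
FROM   DR$TEST_HTML_IDX$I
ORDER  BY token_text;
```

Comparing the tokens listed here with the two inserted rows should show whether the entity in the untagged row was decoded into an accented character or split into separate fragments.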

EDIT 14:35: My first example had additional specificities. I made the example simpler and the test code better.