Dear community,
My document is an HTML document, but its content is split across multiple rows of a table.
Most often the text content contains HTML tags, but sometimes it contains no HTML tags at all, only character entities (mostly for accented characters).
For example: "Coronarographie &eacute;lective" (which stands for "Coronarographie élective").
It appears that my index does not treat such content as HTML.
Therefore, the following CONTAINS query is unable to find this record: CONTAINS(PARAGRAPH_CONTENT, 'élective') > 0
However, if I add an opening and a closing HTML tag ("Coronarographie &eacute;lective" becoming "<html>Coronarographie &eacute;lective</html>"), then the index handles it correctly.
Is there a way to force this content to be treated as HTML?
Thanks for the help.
TEST CODE:
CREATE TABLE MyTEST (PARAGRAPH_CONTENT CLOB);
-- Stop SQL*Plus from treating & as a substitution variable prefix.
SET DEFINE OFF
INSERT INTO MyTEST VALUES ('Coronarographie &eacute;lective');
INSERT INTO MyTEST VALUES ('<html>Coronarographie &eacute;lective</html>');
EXEC CTX_DDL.CREATE_PREFERENCE('TEST_HTML_LXR', 'BASIC_LEXER')
EXEC CTX_DDL.SET_ATTRIBUTE('TEST_HTML_LXR', 'PRINTJOINS', '_')
CREATE INDEX TEST_HTML_IDX on MyTEST(PARAGRAPH_CONTENT)
INDEXTYPE is ctxsys.context
PARAMETERS ('
datastore CTXSYS.DEFAULT_DATASTORE
filter CTXSYS.NULL_FILTER
lexer TEST_HTML_LXR
section group CTXSYS.HTML_SECTION_GROUP
')
parallel 16;
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT,'Coronarographie')>0; -- returns 2 rows
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT,'Coronarographie AND élective')>0; -- returns 1 row (only the one wrapped in <html> tags)
EXEC CTX_DDL.DROP_PREFERENCE('TEST_HTML_LXR')
DROP INDEX TEST_HTML_IDX;
DROP TABLE MyTEST;
EDIT 14:35: My first example had additional specifics. I made the example simpler and the test code better.