HTML content not handled as expected

GB_CHUV, Jun 30 2020 (edited Jun 30 2020)

Dear community,

My document is an HTML document, but its content is split across multiple rows in a table.

Most often the text content contains HTML tags, but sometimes a row has no HTML tag at all, only character references (mostly for accented characters).

For example: "Coronarographie &#233;lective" (which stands for "Coronarographie élective").

It appears that my index treats such content as if it were not HTML.

As a result, the following CONTAINS condition does not find this record: CONTAINS(PARAGRAPH_CONTENT, 'élective') > 0

However, if the text has an opening and a closing HTML tag ("Coronarographie &#233;lective" becoming "<html>Coronarographie &#233;lective</html>"), then it is handled correctly by the index.

Is there a way to force this content to be treated as HTML?
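One way to check what the index actually stored is to look at its token table. The query below is a minimal sketch, assuming the TEST_HTML_IDX index from the test code further down exists in the current schema and has not been dropped yet. Oracle Text keeps the indexed tokens in a table named DR$<index_name>$I.

```sql
-- Assumption: TEST_HTML_IDX (from the test code) exists in the current schema.
-- List the tokens the CONTEXT index stored, with the number of documents for each.
SELECT token_text, token_count
FROM   DR$TEST_HTML_IDX$I
ORDER  BY token_text;
```

If the character reference is not decoded for the untagged row, fragments of it (for instance a token like LECTIVE) may show up next to the accented word.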
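The observation above can be reproduced on existing rows with a sketch like the following. It assumes the MyTEST table and the TEST_HTML_IDX index from the test code, and it only illustrates the behaviour I see. It is not meant as the fix, since it changes the stored data.

```sql
-- Assumption: MyTEST and TEST_HTML_IDX from the test code exist.
-- Wrap rows that do not already start with an <html> tag.
UPDATE MyTEST
SET    PARAGRAPH_CONTENT = '<html>' || PARAGRAPH_CONTENT || '</html>'
WHERE  PARAGRAPH_CONTENT NOT LIKE '<html>%';
COMMIT;

-- A CONTEXT index is not updated on commit by default, so synchronize it.
EXEC CTX_DDL.SYNC_INDEX('TEST_HTML_IDX')

-- Re-run the search that previously missed the untagged row.
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT, 'Coronarographie AND élective') > 0;
```

Given the behaviour described above, I would expect both rows to match after the update.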

Thanks for the help.

TEST CODE:

-- SQL*Plus: keep "&" in the literals from being read as a substitution variable
SET DEFINE OFF

-- Two rows: character reference only, and the same text wrapped in <html> tags
CREATE TABLE MyTEST (PARAGRAPH_CONTENT CLOB);
INSERT INTO MyTEST VALUES ('Coronarographie &#233;lective');
INSERT INTO MyTEST VALUES ('<html>Coronarographie &#233;lective</html>');

-- Lexer preference: "_" is kept as part of a token
EXEC CTX_DDL.CREATE_PREFERENCE('TEST_HTML_LXR', 'BASIC_LEXER')
EXEC CTX_DDL.SET_ATTRIBUTE('TEST_HTML_LXR', 'PRINTJOINS', '_')

CREATE INDEX TEST_HTML_IDX ON MyTEST(PARAGRAPH_CONTENT)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('
  datastore       CTXSYS.DEFAULT_DATASTORE
  filter          CTXSYS.NULL_FILTER
  lexer           TEST_HTML_LXR
  section group   CTXSYS.HTML_SECTION_GROUP
')
PARALLEL 16;

-- Returns 2 rows
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT, 'Coronarographie') > 0;
-- Returns 1 row (only the row wrapped in <html> tags)
SELECT * FROM MyTEST WHERE CONTAINS(PARAGRAPH_CONTENT, 'Coronarographie AND élective') > 0;

-- Clean up: index first, then the preference it was built with
DROP INDEX TEST_HTML_IDX;
EXEC CTX_DDL.DROP_PREFERENCE('TEST_HTML_LXR')
DROP TABLE MyTEST;

EDIT 14:35: My first example had additional specificities. I made the example simpler and improved the test code.

Added on Jun 30 2020

#database-key-features, #text
