Is it possible to get tags correctly with snippet when indexing HTML documents?

Good day sir,
I’m building a search application using Oracle Text. The application searches HTML documents.
I want to obtain an extract containing the queried word and the words close to it, to give users some context.
The documents use custom fonts to display custom characters, so ideally I want to keep the tagging in the extract.
To achieve this, I wanted to use CTX_DOC.SNIPPET, which seems to play this role in Oracle Text.
After several tests with my documents, snippet does not seem to be up to the task. It often returns almost nothing but tags and the query word. Worse, the tags are incomplete, or are not even opened and closed correctly.
Sometimes there are no tags.
The following image shows one of the results:
As it stands, this can’t be used in the application.
Later, I discovered that I can force snippet to ignore the tags by using the section group mechanism in my index.
So my question is: is there a way to get both the words and the tags correctly with snippet?
By the way, I work with Oracle XE 11g.
I use the following code:
/* Lexer */
EXEC ctx_ddl.drop_preference('lexerTB');
EXEC ctx_ddl.create_preference('lexerTB', 'BASIC_LEXER');
EXEC ctx_ddl.set_attribute('lexerTB', 'BASE_LETTER', 'YES');
EXEC ctx_ddl.set_attribute('lexerTB', 'MIXED_CASE', 'NO');
EXEC ctx_ddl.set_attribute('lexerTB', 'INDEX_THEMES', 'NO');
EXEC ctx_ddl.set_attribute('lexerTB', 'INDEX_TEXT', 'YES');
/* Base index (no section group) */
create index indexTB on articles (article)
indextype is ctxsys.context
parameters ('DATASTORE ctxsys.default_datastore
LEXER lexerTB
FILTER ctxsys.AUTO_FILTER');
/* Section group for HTML, with a zone section on the span tag */
EXEC ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
EXEC ctx_ddl.add_zone_section('htmgroup', 'span', 'span');
/* Second index (uses the section group) */
create index indexTB2 on articles (article)
indextype is ctxsys.context
parameters ('DATASTORE ctxsys.default_datastore
LEXER lexerTB
FILTER ctxsys.null_filter
SECTION GROUP htmgroup');
Best Answer
-
The snippet mechanism is not "markup aware". Ensuring that it produces valid HTML or XML would require a high degree of intelligence, and it might be impossible to make that work in every circumstance.
The best bet is to use a section group, as you suggest, so that it outputs plain text, and to use only plain text in your snippets.
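For illustration, a plain-text snippet call against the section-group index might look like the sketch below. It is not taken from the thread: it assumes the articles table has a primary key, and the key value '1' and the query term 'oracle' are placeholders. Only the first five parameters of CTX_DOC.SNIPPET are used, so it should also apply to 11g.

```sql
-- Sketch only: '1' (primary key value) and 'oracle' (query) are placeholders.
SET SERVEROUTPUT ON
DECLARE
  l_snippet VARCHAR2(4000);
BEGIN
  -- indexTB2 is the index created with the HTML section group,
  -- so the snippet is expected to come back without the document's own tags.
  l_snippet := CTX_DOC.SNIPPET(
    index_name => 'indexTB2',
    textkey    => '1',
    text_query => 'oracle',
    starttag   => '<b>',     -- wraps each matched term
    endtag     => '</b>');
  DBMS_OUTPUT.PUT_LINE(l_snippet);
END;
/
```

The only markup in the result should then be the start and end tags chosen for the hit, which is safe to render as is.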
Answers
-
First, thank you for your quick answer.
You wrote that it "would require a high degree of intelligence, and would perhaps be impossible to make work in every circumstance".
That makes me wonder whether developing my own function to meet my objective is realistic, or just a pipe dream.
Such a function would need to tell the tags apart from the text, and then move left and right from the queried word in the document, no?
-
I had the same problem on Oracle XE 11.
If you are on a full Oracle release, version 12 or later, CTX_DOC.SNIPPET has new parameters that let you specify the "radius" of the text around the match and adjust the total length of the snippet, which may give you better results in general.
However, I realized I was better off dropping CTX_DOC.SNIPPET and using CTX_DOC.MARKUP instead; you can then apply your own "intelligence" to the marked-up document if you need a shorter result.
Flavio
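A minimal sketch of the CTX_DOC.MARKUP approach, using the in-memory (CLOB) variant, could look like this. The key value '1', the query 'oracle', the <mark> tags and the 200-character radius are all illustrative assumptions, and the window logic is deliberately naive: it cuts at fixed offsets and can still split a tag, so any real "intelligence" would have to be added on top of it.

```sql
-- Sketch only: key value, query, tags and radius are illustrative assumptions.
SET SERVEROUTPUT ON
DECLARE
  l_marked CLOB;
  l_pos    INTEGER;
  l_from   INTEGER;
  c_radius CONSTANT PLS_INTEGER := 200;  -- characters kept on each side of the hit
BEGIN
  -- Return the whole document, original tags included, with hits wrapped in <mark>.
  CTX_DOC.MARKUP(
    index_name => 'indexTB2',
    textkey    => '1',
    text_query => 'oracle',
    restab     => l_marked,
    plaintext  => FALSE,
    starttag   => '<mark>',
    endtag     => '</mark>');

  -- Locate the first hit and print a fixed-size window around it.
  l_pos := DBMS_LOB.INSTR(l_marked, '<mark>');
  IF l_pos > 0 THEN
    l_from := GREATEST(1, l_pos - c_radius);
    DBMS_OUTPUT.PUT_LINE(DBMS_LOB.SUBSTR(l_marked, 2 * c_radius, l_from));
  END IF;

  -- The in-memory result is a temporary LOB and has to be released by the caller.
  DBMS_LOB.FREETEMPORARY(l_marked);
END;
/
```

To keep the extract well formed, the window boundaries would need to be moved outward to the nearest tag boundaries, and any tags left open inside the window closed, before displaying it.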