Is it possible to get tags correctly with snippet when indexing HTML documents?

thibault daucourt
thibault daucourt Member Posts: 2
edited Sep 4, 2017 4:20AM in Text

Good day sir,

I'm building a search application with Oracle Text. The application searches HTML documents.

I want to obtain an extract containing the queried word and the words close to it, in order to give users some context.

The documents use custom fonts to display custom characters, so ideally I want to find a way to keep the tagging in the extract.

To achieve this, I wanted to use CTX_DOC.SNIPPET, which seems to play this role in Oracle Text.

After several tests with my documents, snippet does not seem to be up to the task. It often returns little more than tags and the query word. Worse, the tags are often incomplete, or not opened and closed correctly.

Sometimes there are no tags at all.

The following image shows one of the results:

pastedImage_0.png

Objectively, it can't be used like this in the application.

Later, I discovered that I can force snippet to ignore the tags by using the section group mechanism in my index.

So my question is the following: is there a way to get both the words and the tags correctly with snippet?

By the way, I work with Oracle XE 11g.

I use the following code:

/* Lexer (preference names must not contain stray spaces inside the quotes) */
EXEC ctx_ddl.drop_preference('lexerTB');
EXEC ctx_ddl.create_preference('lexerTB', 'BASIC_LEXER');
EXEC ctx_ddl.set_attribute('lexerTB', 'BASE_LETTER', 'YES');
EXEC ctx_ddl.set_attribute('lexerTB', 'MIXED_CASE', 'NO');
EXEC ctx_ddl.set_attribute('lexerTB', 'INDEX_THEMES', 'NO');
EXEC ctx_ddl.set_attribute('lexerTB', 'INDEX_TEXT', 'YES');

/* Base index */
create index indexTB on articles (article)
  indextype is ctxsys.context
  parameters ('DATASTORE ctxsys.default_datastore
               LEXER lexerTB
               FILTER ctxsys.AUTO_FILTER');

/* Section group: treat <span> as a zone section */
EXEC ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
EXEC ctx_ddl.add_zone_section('htmgroup', 'span', 'span');

/* Second index, using the section group */
create index indexTB2 on articles (article)
  indextype is ctxsys.context
  parameters ('DATASTORE ctxsys.default_datastore
               LEXER lexerTB
               FILTER ctxsys.null_filter
               SECTION GROUP htmgroup');


Best Answer

  • Roger Ford-Oracle
    Roger Ford-Oracle Member Posts: 1,132 Employee
    edited Jun 27, 2017 9:19AM Answer ✓

    The snippet mechanism is not "markup aware".  To ensure that it produced valid HTML or XML would require a high degree of intelligence, and would perhaps be impossible to make work in every circumstance.

    The best bet is to use a section group, as you suggest, so that it outputs plain text, and to just use plain text in your snippets.
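
    A minimal sketch of that suggestion, for illustration only. It assumes the ARTICLES table has a primary key, that a row with key 1 exists, and that 'oracle' stands in for the real query word; none of these come from the original post. Only the index name indexTB2 is taken from the question.

    ```sql
    -- Sketch: ask CTX_DOC.SNIPPET for an extract from the index that was
    -- created with the HTML section group.
    -- Assumed: primary key value 1 and the query word 'oracle'.
    SET SERVEROUTPUT ON
    BEGIN
      DBMS_OUTPUT.PUT_LINE(
        CTX_DOC.SNIPPET(
          index_name => 'indexTB2',  -- index created with "section group htmgroup"
          textkey    => '1',         -- primary key of the document (assumed)
          text_query => 'oracle',    -- same syntax as a CONTAINS query (assumed word)
          starttag   => '<b>',       -- placed before each hit
          endtag     => '</b>'       -- placed after each hit
        )
      );
    END;
    /
    ```

    The start and end tags are chosen by the caller, so the application can restyle the hits itself even though the document's own markup is left out of the extract.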


Answers

  • thibault daucourt
    thibault daucourt Member Posts: 2
    edited Jun 27, 2017 10:35AM

    @Roger Ford-Oracle

    First, thank you for your quick answer.

    When you say it "would require a high degree of intelligence, and would perhaps be impossible to make work in every circumstance", you make me wonder whether developing my own function to meet my objective is realistic or a pipe dream.

    Such a function would have to tell tags and text apart, and then move left and right from the queried word in the document, wouldn't it?
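
    The idea described here can be sketched roughly as follows. This is only an illustration under stated assumptions: html_context is a hypothetical helper (not part of Oracle Text), it works on VARCHAR2 input only, it assumes that literal < and > inside the text are escaped as entities, and it matches the word case-insensitively as a plain substring.

    ```sql
    -- Hypothetical helper: return the word plus p_radius visible characters
    -- on each side, never cutting through the middle of a tag.
    CREATE OR REPLACE FUNCTION html_context (
      p_html   IN VARCHAR2,
      p_word   IN VARCHAR2,
      p_radius IN PLS_INTEGER DEFAULT 40
    ) RETURN VARCHAR2
    IS
      l_len    PLS_INTEGER := NVL(LENGTH(p_html), 0);
      l_hit    PLS_INTEGER := 0;   -- position of the word in the visible text
      l_pos    PLS_INTEGER := 1;   -- where the next search starts
      l_from   PLS_INTEGER;
      l_to     PLS_INTEGER;
      l_seen   PLS_INTEGER;        -- visible characters collected so far
      l_in_tag BOOLEAN;
      l_ch     VARCHAR2(1 CHAR);
      l_before VARCHAR2(32767);
    BEGIN
      IF p_html IS NULL OR p_word IS NULL THEN
        RETURN NULL;
      END IF;

      -- Step 1: find the first occurrence that is not inside a tag, i.e. the
      -- nearest '<' to its left is not later than the nearest '>'.
      LOOP
        l_hit := INSTR(LOWER(p_html), LOWER(p_word), l_pos);
        EXIT WHEN l_hit = 0;
        l_before := SUBSTR(p_html, 1, l_hit - 1);
        EXIT WHEN NVL(INSTR(l_before, '<', -1), 0) <= NVL(INSTR(l_before, '>', -1), 0);
        l_pos := l_hit + 1;
      END LOOP;
      IF l_hit = 0 THEN
        RETURN NULL;
      END IF;

      -- Step 2: move left; a '>' opens a tag (seen from the right), a '<' ends it.
      l_from := l_hit;  l_seen := 0;  l_in_tag := FALSE;
      WHILE l_from > 1 AND (l_seen < p_radius OR l_in_tag) LOOP
        l_ch := SUBSTR(p_html, l_from - 1, 1);
        IF l_ch = '>' THEN l_in_tag := TRUE;
        ELSIF l_ch = '<' THEN l_in_tag := FALSE;
        ELSIF NOT l_in_tag THEN l_seen := l_seen + 1;
        END IF;
        l_from := l_from - 1;
      END LOOP;

      -- Step 3: move right; here a '<' opens a tag and a '>' ends it.
      l_to := l_hit + LENGTH(p_word) - 1;  l_seen := 0;  l_in_tag := FALSE;
      WHILE l_to < l_len AND (l_seen < p_radius OR l_in_tag) LOOP
        l_ch := SUBSTR(p_html, l_to + 1, 1);
        IF l_ch = '<' THEN l_in_tag := TRUE;
        ELSIF l_ch = '>' THEN l_in_tag := FALSE;
        ELSIF NOT l_in_tag THEN l_seen := l_seen + 1;
        END IF;
        l_to := l_to + 1;
      END LOOP;

      RETURN SUBSTR(p_html, l_from, l_to - l_from + 1);
    END html_context;
    /

    -- Example call with made-up input:
    SET SERVEROUTPUT ON
    BEGIN
      DBMS_OUTPUT.PUT_LINE(
        html_context('<p>The <span class="x">quick</span> brown fox</p>', 'brown', 5));
    END;
    /
    ```

    Every tag in the result is complete, but the extract can still contain a closing tag whose opening tag lies outside the window (or the reverse). Balancing those would need a stack of open tags, which is the kind of extra "intelligence" the answer refers to.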

  • flavioc
    flavioc Member Posts: 1,128 Silver Badge
    edited Sep 4, 2017 4:20AM

    I had the same problem on Oracle XE 11.

    If you are on a full-fledged Oracle version 12 or later, CTX_DOC.SNIPPET has new parameters that let you specify the "radius" of the text around the matched text and adjust the total length of the snippet, which may give you better results in general.
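
    As a rough sketch of what that could look like: the parameter names radius and max_length are given here as I recall them from the 12c documentation, so check them against the reference for your release. The key value 1, the query word and the two numbers are illustrative assumptions.

    ```sql
    -- Sketch only: needs Oracle Database 12c or later, not XE 11g.
    -- Assumed: primary key value 1, query word 'oracle', illustrative sizes.
    SET SERVEROUTPUT ON
    BEGIN
      DBMS_OUTPUT.PUT_LINE(
        CTX_DOC.SNIPPET(
          index_name => 'indexTB2',
          textkey    => '1',
          text_query => 'oracle',
          radius     => 40,    -- approximate context on each side of a hit
          max_length => 200    -- approximate upper bound for the whole snippet
        )
      );
    END;
    /
    ```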

    However, I realized I was better off dropping CTX_DOC.SNIPPET and using CTX_DOC.MARKUP instead; after that, you can apply your own "intelligence" to the document if you need a shorter result.
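
    A minimal sketch of that approach, using the in-memory form of CTX_DOC.MARKUP. The key value 1, the query word 'oracle', the <mark> tags and the window size are assumptions chosen for illustration; the final SUBSTR is a deliberately naive placeholder for the caller's own trimming logic.

    ```sql
    -- Sketch: fetch the whole document with hits highlighted, then trim it.
    SET SERVEROUTPUT ON
    DECLARE
      l_doc CLOB;
      l_hit INTEGER;
    BEGIN
      CTX_DOC.MARKUP(
        index_name => 'indexTB2',
        textkey    => '1',        -- primary key of the document (assumed)
        text_query => 'oracle',   -- assumed query word
        restab     => l_doc,      -- result is returned as a temporary CLOB
        plaintext  => FALSE,      -- ask for the HTML form rather than plain text
        starttag   => '<mark>',   -- placed before each hit
        endtag     => '</mark>'   -- placed after each hit
      );

      -- Placeholder for the custom logic: print up to 200 characters starting
      -- about 100 characters before the first highlighted hit. A real version
      -- would avoid cutting through tags.
      l_hit := DBMS_LOB.INSTR(l_doc, '<mark>');
      DBMS_OUTPUT.PUT_LINE(
        DBMS_LOB.SUBSTR(l_doc, 200, GREATEST(l_hit - 100, 1)));

      DBMS_LOB.FREETEMPORARY(l_doc);
    END;
    /
    ```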

    Flavio

This discussion has been closed.