Text Index Lexer

Tom Claffy
edited Jun 28, 2019 12:55PM in General Database Discussions

I am testing text search for implementation in our Windows application by creating and querying some data through Developer, loaded from actual documents in our system. Although I have years of SQL query experience in both MSSQL and Oracle, full-text indexing is completely new to me. We currently store some file attachments in an NCLOB as a base64 string, which works because the data can be any type of file. Going forward we plan to use varbinary(max) in MSSQL and a BLOB in Oracle.

MSSQL is fairly straightforward here: it detects the language of the PDF stored in the database and returns the results I expect for Arabic, Chinese, English, and French. Oracle also returns results, but I am confused about which lexer I need. If I create the index with the world or auto lexer, I only get results for the single-byte languages when I query in the language of the document for a word known to be in it; if I use the Chinese lexer, I get results for all four languages. Note that I have tried this with and without a language column and a charset column specified in the index parameters, and I seem to get the correct results without these columns.

I expected the name "World" to imply a larger character set than "Chinese". Is this result from the Chinese lexer expected, or am I doing something wrong with the World lexer?

exec ctx_ddl.create_preference('MYLEXER', 'world_lexer');

-- RETURNS ONLY ENGLISH AND FRENCH RESULTS
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER MYLEXER');
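
The variant with a language column and a charset column mentioned above can be sketched as follows. This is only an illustration, not the exact statement that was run: the column names doc_lang and doc_charset are hypothetical, and it is an alternative form of the CREATE INDEX above rather than something to run alongside it. Whether these columns have any effect depends on the filter and lexer in use.

```sql
-- Alternative to the CREATE INDEX above, shown for illustration only.
-- doc_lang and doc_charset are hypothetical columns on my_docs.
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER MYLEXER LANGUAGE COLUMN doc_lang CHARSET COLUMN doc_charset');
```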

exec ctx_ddl.create_preference('CHINESE', 'CHINESE_LEXER');

-- Drop the previous index before re-creating it with a different lexer
DROP INDEX my_docs_doc_idx;

-- RETURNS ARABIC, CHINESE, ENGLISH AND FRENCH RESULTS
CREATE INDEX my_docs_doc_idx ON my_docs(doc)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('LEXER CHINESE');
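
One way to compare what each lexer actually produced is to look at the tokens stored for the index and then run the same search against both builds. The sketch below assumes the index is named my_docs_doc_idx as above, that it lives in the current schema, and that the database is 12c or later (for FETCH FIRST); the search term is a placeholder.

```sql
-- DR$<index_name>$I is the token table Oracle Text creates for a CONTEXT index.
-- Listing its most frequent tokens shows how the lexer split the documents.
SELECT token_text, token_count
FROM   dr$my_docs_doc_idx$i
ORDER  BY token_count DESC
FETCH FIRST 50 ROWS ONLY;

-- Count matching rows for a word known to be in a document.
-- 'known_word' is a placeholder; substitute a real term in the document's language.
SELECT COUNT(*)
FROM   my_docs
WHERE  CONTAINS(doc, 'known_word') > 0;
```

Running both queries once per lexer and comparing the token lists side by side should show whether the Arabic and Chinese text was tokenized at all under the world lexer.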
