12 Replies Latest reply: Jan 14, 2013 3:08 AM by sperkmandl

WORLD_LEXER attributes

sperkmandl Newbie
Hi all, I'm used to working with the BASIC_LEXER and its attributes (mixed_case, alternate_spelling, composite, and so on).
Now I need to move to a multi-language environment, and the WORLD_LEXER looks like a good fit: it avoids setting the language manually, which nobody wants to do.
But it has no attributes!
I'm referring to what is described in the Oracle Text Reference for 11g Release 2 (E24436-01).

I'm using Oracle 11g Release 2. By the way, after browsing this forum, I saw that those attributes were available some time ago.
Or was that the AUTO_LEXER?
Is there any solution or update? Am I looking at out-of-date documentation?
Thanks.

Edited by: sperkmandl on Jan 12, 2013 12:13 PM

Edited by: sperkmandl on Jan 12, 2013 12:23 PM
  • 1. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    The 11.1 auto_lexer had mixed_case and alternate_spelling:

    http://docs.oracle.com/cd/B28359_01/text.111/b28304/cdatadic.htm#CCREF1963

    Please see the following thread for what happened to the auto_lexer in 11.2 and what is planned for the future:

    What is happening to auto_lexer?
  • 2. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    Thanks Barbara.
    To summarize: the AUTO_LEXER is no longer available, and the WORLD_LEXER has no language attributes.

    The only flexible way to get automatic language recognition seems to be the MULTI_LEXER with 'AUTO' entered in the language column (by the way, must it be literally 'AUTO'?).
    The only issue here is that the languages must be known (and configured) in advance.
    Is that correct?
    Thanks.
  • 3. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    I am not sure what you are referring to when you talk about using 'AUTO' with the multi_lexer. You need a language column, which can have whatever name you give it. You need to enter valid values in that column that correspond to your languages. If you do not enter a value, then it uses the default language; it does not do any automatic language detection. You can set some attributes of the sub-lexers, such as theme_language, to auto, but there is no automatic detection of the language as a whole.

    The world_lexer does automatic language detection, without a language column. You don't enter 'AUTO' in a language column as there is no language column.
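
    For comparison with the multi_lexer setup, a minimal sketch of an index that uses the world_lexer might look like the following. The table name worlddoc, the preference name my_world_lexer, and the index name worldx are made up for illustration; I have not run this as part of this thread.

    ```sql
    -- Hypothetical table: note that there is no language column.
    create table worlddoc (
      doc_id number primary key,
      text   clob
    );

    begin
      -- world_lexer preference (the name is an assumption); no attributes are set here.
      ctx_ddl.create_preference('my_world_lexer', 'world_lexer');
    end;
    /

    -- No "language column" clause is needed in the index parameters.
    create index worldx on worlddoc(text) indextype is ctxsys.context
      parameters ('lexer my_world_lexer');
    ```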

    Edited by: Barbara Boehmer on Jan 12, 2013 2:00 PM
  • 4. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    Barbara,
    from the Text Reference manual mentioned above, page 2-39:

    If the language column is set to AUTO, then the multi-lexer detects the language of the
    document for the supported languages shown in Table 2–18.

    That is why I referred to AUTO as a way to get automatic detection.
    Regards,

    Renzo
  • 5. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    Interesting. The 11.1 documentation says that:

    http://docs.oracle.com/cd/B28359_01/text.111/b28304/cdatadic.htm#CCREF1970

    But the corresponding section of the 11.2 documentation does not:

    http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#CCREF1970

    You might have to experiment to see if it really works in 11.2.

    Edited by: Barbara Boehmer on Jan 12, 2013 2:26 PM
  • 6. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    Oops! Any idea how to check which language was recognized for a document?
  • 7. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    I believe the following test demonstrates that when 'auto' is specified in the language column, the multi_lexer falls back to the default english_lexer. The english_lexer stores tokens in upper case and the german_lexer stores them in mixed case. So, tokens that appear in the dr$globalx$i index table in upper case were treated as English, and tokens that appear in the mixed case in which they were entered were treated as German.

    In the first case, where 'auto' is used in the language column, every token is in upper case. In the second case, where 'ger' is used in the language column, the tokens for that row appear in dr$globalx$i in mixed case. So, it seems that 'auto' does not work. However, it is possible that the text was too short for automatic detection, so that the default was used instead. If you have some longer documents, you might experiment with them.
    SCOTT@orcl_11gR2> select * from v$version
      2  /
    
    BANNER
    --------------------------------------------------------------------------------
    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    PL/SQL Release 11.2.0.1.0 - Production
    CORE     11.2.0.1.0     Production
    TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
    NLSRTL Version 11.2.0.1.0 - Production
    
    5 rows selected.
    
    SCOTT@orcl_11gR2> create table globaldoc (
      2       doc_id number primary key,
      3       lang varchar2(4),
      4       text clob
      5  )
      6  /
    
    Table created.
    
    SCOTT@orcl_11gR2> begin
      2    ctx_ddl.create_preference('english_lexer','basic_lexer');
      3    ctx_ddl.set_attribute('english_lexer','index_themes','yes');
      4    ctx_ddl.set_attribute('english_lexer','theme_language','english');
      5    ctx_ddl.create_preference('german_lexer','basic_lexer');
      6    ctx_ddl.set_attribute('german_lexer','composite','german');
      7    ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
      8    ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
      9    ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
     10    ctx_ddl.create_preference('global_lexer', 'multi_lexer');
     11    ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
     12    ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
     13    ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
     14  end;
     15  /
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
      2  parameters ('lexer global_lexer language column lang sync(on commit)')
      3  /
    
    Index created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (1, 'auto', 'English Bohemian Forest')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (2, 'auto', 'Deutsch Böhmerwald')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> commit
      2  /
    
    Commit complete.
    
    SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
      2  /
    
    TOKEN_TEXT
    ----------------------------------------------------------------
    BOHEMIAN
    BÖHMERWALD
    DEUTSCH
    ENGLISH
    FOREST
    
    5 rows selected.
    
    SCOTT@orcl_11gR2> truncate table globaldoc
      2  /
    
    Table truncated.
    
    SCOTT@orcl_11gR2> drop index globalx
      2  /
    
    Index dropped.
    
    SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
      2  parameters ('lexer global_lexer language column lang sync(on commit)')
      3  /
    
    Index created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (3, null, 'English Bohemian Forest')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (4, 'ger', 'Deutsch Böhmerwald')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> commit
      2  /
    
    Commit complete.
    
    SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
      2  /
    
    TOKEN_TEXT
    ----------------------------------------------------------------
    BOHEMIAN
    Böhmerwald
    Deutsch
    ENGLISH
    FOREST
    
    5 rows selected.
    
    SCOTT@orcl_11gR2>
    Edited by: Barbara Boehmer on Jan 12, 2013 6:25 PM
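
    The dr$globalx$i table mixes the tokens of all rows, so with more than a couple of documents it may be easier to look at one row at a time. A possible sketch, assuming the globaldoc table and globalx index from the test above, uses ctx_doc.tokens with a result table. The table name token_tab and the query_id value 1 are assumptions, and I have not checked how this behaves with 'auto' in the language column.

    ```sql
    -- Result table for ctx_doc.tokens; the name token_tab is an assumption.
    create table token_tab (
      query_id number,
      token    varchar2(64),
      offset   number,
      length   number
    );

    begin
      -- Tokenize the row with primary key 4 using the settings of index globalx.
      ctx_doc.tokens('globalx', '4', 'token_tab', 1);
    end;
    /

    -- Upper-case or mixed-case tokens hint at which sub-lexer handled the row.
    select token, offset, length
    from   token_tab
    where  query_id = 1
    order  by offset;
    ```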
  • 8. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    Thanks Barbara, I will try with some real documents.
    As a matter of curiosity, in the second example - where a German document was recognized and alternate_spelling was enabled - shouldn't we find both:

    Böhmerwald and Boehmerwald

    in the index ?
  • 9. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    sperkmandl wrote:
    Thanks Barbara, I will try with some real documents.
    As a matter of curiosity, in the second example - where a German document was recognized and alternate_spelling was enabled - shouldn't we find both:

    Böhmerwald and Boehmerwald

    in the index ?
    Apparently not, but I can't explain why. However, the second search below (with 'ger' in the language column) does find the alternate spelling, whereas the first search (with 'auto') does not.
    SCOTT@orcl_11gR2> select * from v$version
      2  /
    
    BANNER
    --------------------------------------------------------------------------------
    Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
    PL/SQL Release 11.2.0.1.0 - Production
    CORE     11.2.0.1.0     Production
    TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
    NLSRTL Version 11.2.0.1.0 - Production
    
    5 rows selected.
    
    SCOTT@orcl_11gR2> create table globaldoc (
      2       doc_id number primary key,
      3       lang varchar2(4),
      4       text clob
      5  )
      6  /
    
    Table created.
    
    SCOTT@orcl_11gR2> begin
      2    ctx_ddl.create_preference('english_lexer','basic_lexer');
      3    ctx_ddl.set_attribute('english_lexer','index_themes','yes');
      4    ctx_ddl.set_attribute('english_lexer','theme_language','english');
      5    ctx_ddl.create_preference('german_lexer','basic_lexer');
      6    ctx_ddl.set_attribute('german_lexer','composite','german');
      7    ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
      8    ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
      9    ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
     10    ctx_ddl.create_preference('global_lexer', 'multi_lexer');
     11    ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
     12    ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
     13    ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
     14  end;
     15  /
    
    PL/SQL procedure successfully completed.
    
    SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
      2  parameters ('lexer global_lexer language column lang sync(on commit)')
      3  /
    
    Index created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (1, 'auto', 'English Bohemian Forest')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (2, 'auto', 'Deutsch Böhmerwald')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> commit
      2  /
    
    Commit complete.
    
    SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
      2  /
    
    TOKEN_TEXT
    ----------------------------------------------------------------
    BOHEMIAN
    BÖHMERWALD
    DEUTSCH
    ENGLISH
    FOREST
    
    5 rows selected.
    
    SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
      2  /
    
    Session altered.
    
    SCOTT@orcl_11gR2> select * from globaldoc
      2  where  contains (text, 'Boehmerwald') > 0
      3  /
    
    no rows selected
    
    SCOTT@orcl_11gR2> truncate table globaldoc
      2  /
    
    Table truncated.
    
    SCOTT@orcl_11gR2> drop index globalx
      2  /
    
    Index dropped.
    
    SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
      2  parameters ('lexer global_lexer language column lang sync(on commit)')
      3  /
    
    Index created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (3, null, 'English Bohemian Forest')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> insert into globaldoc values (4, 'ger', 'Deutsch Böhmerwald')
      2  /
    
    1 row created.
    
    SCOTT@orcl_11gR2> commit
      2  /
    
    Commit complete.
    
    SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
      2  /
    
    TOKEN_TEXT
    ----------------------------------------------------------------
    BOHEMIAN
    Böhmerwald
    Deutsch
    ENGLISH
    FOREST
    
    5 rows selected.
    
    SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
      2  /
    
    Session altered.
    
    SCOTT@orcl_11gR2> select * from globaldoc
      2  where  contains (text, 'Boehmerwald') > 0
      3  /
    
        DOC_ID LANG
    ---------- ----
    TEXT
    --------------------------------------------------------------------------------
             4 ger
    Deutsch Böhmerwald
    
    
    1 row selected.
    
    SCOTT@orcl_11gR2>
  • 10. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    Barbara,
    what is the reason for setting nls_language in the session? Do you expect it to affect the query?
    AFAIK this setting affects the choice of knowledge base when creating an index with theme_language, as well as the default stoplist, as discussed months ago.
    What else does it affect?

    Renzo
  • 11. Re: WORLD_LEXER attributes
    Barbara Boehmer Oracle ACE
    Yes, it affects the query. It won't use the German features like alternate spelling, unless the nls_language is German.
    SCOTT@orcl_11gR2> alter session set nls_language = 'AMERICAN'
      2  /
    
    Session altered.
    
    SCOTT@orcl_11gR2> select * from globaldoc
      2  where  contains (text, 'Boehmerwald') > 0
      3  /
    
    no rows selected
    
    SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
      2  /
    
    Session altered.
    
    SCOTT@orcl_11gR2> select * from globaldoc
      2  where  contains (text, 'Boehmerwald') > 0
      3  /
    
        DOC_ID LANG
    ---------- ----
    TEXT
    --------------------------------------------------------------------------------
             4 ger
    Deutsch Böhmerwald
    
    
    1 row selected.
  • 12. Re: WORLD_LEXER attributes
    sperkmandl Newbie
    This is very confusing when multiple languages are involved.
    It means that we will retrieve the right hits for the current session language (nls_language), plus an incomplete mix of hits from other languages.
    By "incomplete" I mean what is found without fully enabling language-specific features, such as alternate spelling.
    As in your example, if we have a mix of English and German documents and nls_language is set to American, we will get all matching English docs and a subset (depending on spelling) of the matching German docs.
    In other words, to achieve consistent results, we need to restrict the query to one specific language, and this must match nls_language.

    Unfortunately, only the MULTI_LEXER allows for an explicit language restriction. The WORLD_LEXER does the recognition job, but as far as I know the detected language is not stored anywhere.
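
    With the MULTI_LEXER setup from this thread, such a restriction could be sketched as an ordinary predicate on the language column, combined with a matching session language. This assumes the globaldoc table and its lang column from Barbara's test; it is an illustration of the idea, not a tested recipe.

    ```sql
    -- Match the session language to the sub-lexer that should apply at query time.
    alter session set nls_language = 'GERMAN';

    -- Restrict the hits to rows tagged as German in the language column.
    select doc_id, lang
    from   globaldoc
    where  contains(text, 'Boehmerwald') > 0
    and    lang = 'ger';
    ```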
