12 Replies Latest reply: Jan 14, 2013 5:08 AM by sperkmandl

    WORLD_LEXER attributes

    sperkmandl
      Hi all, I'm used to working with the BASIC_LEXER and its attributes (mixed_case, alternate_spelling, composite and so on).
      Now I need to move to a multi-language environment, so the WORLD_LEXER looks like a good fit. It avoids setting the language manually, which nobody wants to do.
      But ... it has no attributes!
      I'm referring to what's described in the Oracle Text Reference for 11g Release 2 (E24436-01).

      I'm using Oracle 11g Release 2. Btw, after browsing this forum, I saw those attributes described as available some time ago.
      Or was that the AUTO_LEXER?
      Is there any solution or update? Am I looking at out-of-date documentation?
      Thanks.

      Edited by: sperkmandl on Jan 12, 2013 12:13 PM

      Edited by: sperkmandl on Jan 12, 2013 12:23 PM
        • 1. Re: WORLD_LEXER attributes
          Barbara Boehmer
          The 11.1 auto_lexer had mixed_case and alternate_spelling:

          http://docs.oracle.com/cd/B28359_01/text.111/b28304/cdatadic.htm#CCREF1963

          Please see the following thread for what happened to the auto_lexer in 11.2 and what is planned for the future:

          What is happening to auto_lexer?
          • 2. Re: WORLD_LEXER attributes
            sperkmandl
            Thanks Barbara.
            To summarize: the AUTO_LEXER is effectively gone, and the WORLD_LEXER has no language attributes.

            The only flexible way to get automatic language recognition seems to be the MULTI_LEXER with 'AUTO' in the language column (btw, must it be literally 'AUTO'?).
            The only issue is that the languages must be known (and configured) in advance.
            Correct?
            Thanks.
            • 3. Re: WORLD_LEXER attributes
              Barbara Boehmer
              I am not sure where you read about using 'AUTO' with the multi_lexer. You need a language column, which can have whatever name you give it, and you need to enter valid values in that column that correspond to your languages. If you do not enter a value, then it uses the default language; it does not do any automatic language detection. You can set some attributes of the sub-lexers, such as theme_language, to auto, but there is no automatic detection of the language as a whole.

              The world_lexer does automatic language detection, without a language column. You don't enter 'AUTO' in a language column as there is no language column.
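
              For comparison, a minimal sketch of a WORLD_LEXER setup might look like the following. The names world_doc, my_world_lexer and worldx are made up for illustration; there is no language column and no lexer attribute is set, so language handling is left entirely to the lexer.

              ```sql
              -- Hypothetical table: note that there is no language column.
              create table world_doc (
                doc_id number primary key,
                text   clob
              );

              begin
                -- my_world_lexer is an illustrative preference name.
                ctx_ddl.create_preference('my_world_lexer', 'world_lexer');
              end;
              /

              -- No "language column" clause is needed in the parameters string.
              create index worldx on world_doc(text) indextype is ctxsys.context
                parameters ('lexer my_world_lexer sync (on commit)');
              ```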

              Edited by: Barbara Boehmer on Jan 12, 2013 2:00 PM
              • 4. Re: WORLD_LEXER attributes
                sperkmandl
                Barbara,
                from the Text Reference manual cited above, page 2-39:

                If the language column is set to AUTO, then the multi-lexer detects the language of the
                document for the supported languages shown in Table 2–18.

                That's why I referred to AUTO as the auto-detection option.
                Regards,

                Renzo
                • 5. Re: WORLD_LEXER attributes
                  Barbara Boehmer
                  Interesting. The 11.1 documentation says that:

                  http://docs.oracle.com/cd/B28359_01/text.111/b28304/cdatadic.htm#CCREF1970

                  But the corresponding section of the 11.2 documentation does not:

                  http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#CCREF1970

                  You might have to experiment to see if it really works in 11.2.

                  Edited by: Barbara Boehmer on Jan 12, 2013 2:26 PM
                  • 6. Re: WORLD_LEXER attributes
                    sperkmandl
                    Oops! Any idea how to check that a specific language was recognized in a document?
                    • 7. Re: WORLD_LEXER attributes
                      Barbara Boehmer
                      I believe the following test demonstrates that when 'auto' is specified in the language column, the multi_lexer falls back to the default english_lexer. The english_lexer stores tokens in upper case and the german_lexer (mixed_case set to yes) keeps them in mixed case. So, words that appear in the dr$globalx$i domain index table in upper case were lexed as English, and words that appear in the mixed case in which they were entered were lexed as German.

                      In the first case, where 'auto' is used in the language column, every token is in upper case, so both rows went through the English lexer. In the second case, where 'ger' is used in the language column, the words for that row appear in dr$globalx$i in mixed case.

                      So, it seems that 'auto' does not work. However, it is possible that the text was too short for automatic detection and the default was used instead. If you have some longer documents, you might experiment with them.
                      SCOTT@orcl_11gR2> select * from v$version
                        2  /
                      
                      BANNER
                      --------------------------------------------------------------------------------
                      Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
                      PL/SQL Release 11.2.0.1.0 - Production
                      CORE     11.2.0.1.0     Production
                      TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
                      NLSRTL Version 11.2.0.1.0 - Production
                      
                      5 rows selected.
                      
                      SCOTT@orcl_11gR2> create table globaldoc (
                        2       doc_id number primary key,
                        3       lang varchar2(4),
                        4       text clob
                        5  )
                        6  /
                      
                      Table created.
                      
                      SCOTT@orcl_11gR2> begin
                        2    ctx_ddl.create_preference('english_lexer','basic_lexer');
                        3    ctx_ddl.set_attribute('english_lexer','index_themes','yes');
                        4    ctx_ddl.set_attribute('english_lexer','theme_language','english');
                        5    ctx_ddl.create_preference('german_lexer','basic_lexer');
                        6    ctx_ddl.set_attribute('german_lexer','composite','german');
                        7    ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
                        8    ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
                        9    ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
                       10    ctx_ddl.create_preference('global_lexer', 'multi_lexer');
                       11    ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
                       12    ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
                       13    ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
                       14  end;
                       15  /
                      
                      PL/SQL procedure successfully completed.
                      
                      SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
                        2  parameters ('lexer global_lexer language column lang sync(on commit)')
                        3  /
                      
                      Index created.
                      
                      SCOTT@orcl_11gR2> insert into globaldoc values (1, 'auto', 'English Bohemian Forest')
                        2  /
                      
                      1 row created.
                      
                      SCOTT@orcl_11gR2> insert into globaldoc values (2, 'auto', 'Deutsch Böhmerwald')
                        2  /
                      
                      1 row created.
                      
                      SCOTT@orcl_11gR2> commit
                        2  /
                      
                      Commit complete.
                      
                      SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
                        2  /
                      
                      TOKEN_TEXT
                      ----------------------------------------------------------------
                      BOHEMIAN
                      BÖHMERWALD
                      DEUTSCH
                      ENGLISH
                      FOREST
                      
                      5 rows selected.
                      
                      SCOTT@orcl_11gR2> truncate table globaldoc
                        2  /
                      
                      Table truncated.
                      
                      SCOTT@orcl_11gR2> drop index globalx
                        2  /
                      
                      Index dropped.
                      
                      SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
                        2  parameters ('lexer global_lexer language column lang sync(on commit)')
                        3  /
                      
                      Index created.
                      
                      SCOTT@orcl_11gR2> insert into globaldoc values (3, null, 'English Bohemian Forest')
                        2  /
                      
                      1 row created.
                      
                      SCOTT@orcl_11gR2> insert into globaldoc values (4, 'ger', 'Deutsch Böhmerwald')
                        2  /
                      
                      1 row created.
                      
                      SCOTT@orcl_11gR2> commit
                        2  /
                      
                      Commit complete.
                      
                      SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
                        2  /
                      
                      TOKEN_TEXT
                      ----------------------------------------------------------------
                      BOHEMIAN
                      Böhmerwald
                      Deutsch
                      ENGLISH
                      FOREST
                      
                      5 rows selected.
                      
                      SCOTT@orcl_11gR2>
                      Edited by: Barbara Boehmer on Jan 12, 2013 6:25 PM
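
                      To rerun the test from a clean state, the objects and preferences it creates have to be dropped first; otherwise create_preference raises an error because the preference already exists. A sketch, using only the names from the test above:

                      ```sql
                      -- Drop the index and table created by the test.
                      drop index globalx;
                      drop table globaldoc purge;

                      begin
                        -- Drop the multi_lexer first, then the sub-lexers it refers to.
                        ctx_ddl.drop_preference('global_lexer');
                        ctx_ddl.drop_preference('english_lexer');
                        ctx_ddl.drop_preference('german_lexer');
                        ctx_ddl.drop_preference('japanese_lexer');
                      end;
                      /
                      ```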
                      • 8. Re: WORLD_LEXER attributes
                        sperkmandl
                        Thanks Barbara, I will try with some real document.
                        As a matter of curiosity, in the second example - where a German document was recognized and alternate_spelling was enabled - shouldn't we find both:

                        Böhmerwald and Boehmerwald

                        in the index ?
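
                        One way to look into this might be to list every row of the index token table rather than only token_type 0, in case other forms of the word are stored under a different token type. This is only a sketch against the dr$globalx$i table from the test above; what it returns has not been checked here.

                        ```sql
                        -- Show all tokens of all types, not just token_type = 0.
                        select token_type, token_text
                        from   dr$globalx$i
                        order  by token_type, token_text;
                        ```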
                        • 9. Re: WORLD_LEXER attributes
                          Barbara Boehmer
                          sperkmandl wrote:
                          Thanks Barbara, I will try with some real document.
                          As a matter of curiosity, in the second example - where a German document was recognized and alternate_spelling was enabled - shouldn't we find both:

                          Böhmerwald and Boehmerwald

                          in the index ?
                          Apparently not, but I can't explain why. However, the search in the second case, where the row is indexed with 'ger', does find the alternate spelling, whereas the search in the first case, with 'auto', does not.
                          SCOTT@orcl_11gR2> select * from v$version
                            2  /
                          
                          BANNER
                          --------------------------------------------------------------------------------
                          Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
                          PL/SQL Release 11.2.0.1.0 - Production
                          CORE     11.2.0.1.0     Production
                          TNS for 64-bit Windows: Version 11.2.0.1.0 - Production
                          NLSRTL Version 11.2.0.1.0 - Production
                          
                          5 rows selected.
                          
                          SCOTT@orcl_11gR2> create table globaldoc (
                            2       doc_id number primary key,
                            3       lang varchar2(4),
                            4       text clob
                            5  )
                            6  /
                          
                          Table created.
                          
                          SCOTT@orcl_11gR2> begin
                            2    ctx_ddl.create_preference('english_lexer','basic_lexer');
                            3    ctx_ddl.set_attribute('english_lexer','index_themes','yes');
                            4    ctx_ddl.set_attribute('english_lexer','theme_language','english');
                            5    ctx_ddl.create_preference('german_lexer','basic_lexer');
                            6    ctx_ddl.set_attribute('german_lexer','composite','german');
                            7    ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
                            8    ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
                            9    ctx_ddl.create_preference('japanese_lexer','japanese_vgram_lexer');
                           10    ctx_ddl.create_preference('global_lexer', 'multi_lexer');
                           11    ctx_ddl.add_sub_lexer('global_lexer','default','english_lexer');
                           12    ctx_ddl.add_sub_lexer('global_lexer','german','german_lexer','ger');
                           13    ctx_ddl.add_sub_lexer('global_lexer','japanese','japanese_lexer','jpn');
                           14  end;
                           15  /
                          
                          PL/SQL procedure successfully completed.
                          
                          SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
                            2  parameters ('lexer global_lexer language column lang sync(on commit)')
                            3  /
                          
                          Index created.
                          
                          SCOTT@orcl_11gR2> insert into globaldoc values (1, 'auto', 'English Bohemian Forest')
                            2  /
                          
                          1 row created.
                          
                          SCOTT@orcl_11gR2> insert into globaldoc values (2, 'auto', 'Deutsch Böhmerwald')
                            2  /
                          
                          1 row created.
                          
                          SCOTT@orcl_11gR2> commit
                            2  /
                          
                          Commit complete.
                          
                          SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
                            2  /
                          
                          TOKEN_TEXT
                          ----------------------------------------------------------------
                          BOHEMIAN
                          BÖHMERWALD
                          DEUTSCH
                          ENGLISH
                          FOREST
                          
                          5 rows selected.
                          
                          SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
                            2  /
                          
                          Session altered.
                          
                          SCOTT@orcl_11gR2> select * from globaldoc
                            2  where  contains (text, 'Boehmerwald') > 0
                            3  /
                          
                          no rows selected
                          
                          SCOTT@orcl_11gR2> truncate table globaldoc
                            2  /
                          
                          Table truncated.
                          
                          SCOTT@orcl_11gR2> drop index globalx
                            2  /
                          
                          Index dropped.
                          
                          SCOTT@orcl_11gR2> create index globalx on globaldoc(text) indextype is ctxsys.context
                            2  parameters ('lexer global_lexer language column lang sync(on commit)')
                            3  /
                          
                          Index created.
                          
                          SCOTT@orcl_11gR2> insert into globaldoc values (3, null, 'English Bohemian Forest')
                            2  /
                          
                          1 row created.
                          
                          SCOTT@orcl_11gR2> insert into globaldoc values (4, 'ger', 'Deutsch Böhmerwald')
                            2  /
                          
                          1 row created.
                          
                          SCOTT@orcl_11gR2> commit
                            2  /
                          
                          Commit complete.
                          
                          SCOTT@orcl_11gR2> select token_text from dr$globalx$i where token_type = 0
                            2  /
                          
                          TOKEN_TEXT
                          ----------------------------------------------------------------
                          BOHEMIAN
                          Böhmerwald
                          Deutsch
                          ENGLISH
                          FOREST
                          
                          5 rows selected.
                          
                          SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
                            2  /
                          
                          Session altered.
                          
                          SCOTT@orcl_11gR2> select * from globaldoc
                            2  where  contains (text, 'Boehmerwald') > 0
                            3  /
                          
                              DOC_ID LANG
                          ---------- ----
                          TEXT
                          --------------------------------------------------------------------------------
                                   4 ger
                          Deutsch Böhmerwald
                          
                          
                          1 row selected.
                          
                          SCOTT@orcl_11gR2>
                          • 10. Re: WORLD_LEXER attributes
                            sperkmandl
                            Barbara,
                            what's the reason for setting the session's NLS_LANGUAGE? Do you expect it to affect the query?
                            AFAIK this setting affects the choice of knowledge base when creating an index with theme_language, as well as the default stoplist - as discussed months ago.
                            What else?

                            Renzo
                            • 11. Re: WORLD_LEXER attributes
                              Barbara Boehmer
                              Yes, it affects the query. It won't use the German features like alternate spelling, unless the nls_language is German.
                              SCOTT@orcl_11gR2> alter session set nls_language = 'AMERICAN'
                                2  /
                              
                              Session altered.
                              
                              SCOTT@orcl_11gR2> select * from globaldoc
                                2  where  contains (text, 'Boehmerwald') > 0
                                3  /
                              
                              no rows selected
                              
                              SCOTT@orcl_11gR2> alter session set nls_language = 'GERMAN'
                                2  /
                              
                              Session altered.
                              
                              SCOTT@orcl_11gR2> select * from globaldoc
                                2  where  contains (text, 'Boehmerwald') > 0
                                3  /
                              
                                  DOC_ID LANG
                              ---------- ----
                              TEXT
                              --------------------------------------------------------------------------------
                                       4 ger
                              Deutsch Böhmerwald
                              
                              
                              1 row selected.
                              • 12. Re: WORLD_LEXER attributes
                                sperkmandl
                                This is very confusing in the case of multiple languages.
                                It means that we will retrieve the right hits for the current session language (NLS_LANGUAGE), plus an incomplete mix of hits from other languages.
                                By "incomplete" I mean whatever is found without language-specific features, such as alternate spelling, being fully enabled.
                                As in your example, if we have a mix of English and German documents and NLS_LANGUAGE is set to American, we will get all matching English docs and a subset (depending on spelling) of the matching German docs.
                                In other words - to achieve consistent results - we need to restrict the query to one specific language, and this must match NLS_LANGUAGE.

                                Unfortunately, only the MULTI_LEXER allows for an explicit language restriction. The WORLD_LEXER does the recognition job, but AFAIK the detected language is not stored anywhere.
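
                                With the MULTI_LEXER setup from Barbara's test, such a restriction could be sketched as below: the session language is set to match the sub-lexer, and the language column is filtered explicitly. Table, column and value names are the ones from the test; whether this gives fully consistent results on a real mixed-language corpus is an assumption that would need checking.

                                ```sql
                                -- Match the session language to the language being searched.
                                alter session set nls_language = 'GERMAN';

                                -- Restrict the text query to rows indexed with the German sub-lexer.
                                select doc_id, lang
                                from   globaldoc
                                where  contains (text, 'Boehmerwald') > 0
                                and    lang = 'ger';
                                ```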