9 Replies. Latest reply: May 7, 2012 2:19 PM by Barbara Boehmer

    languages

    sperkmandl
      Hi, the ref. manual reports German, Danish and Swedish as alternate spelling choices.
      Where do they come from, since Danish and Swedish are not in common language lists (basic lexer and basic wordlist attributes) ?

      The basic lexer index_stems attribute reports:

      1 ENGLISH
      2 DERIVATIONAL
      3 DUTCH
      4 FRENCH
      5 GERMAN
      6 ITALIAN
      7 SPANISH

      but later on it also reports:

      ■ DERIVATIONAL
      ■ DUTCH
      ■ ENGLISH
      ■ FRENCH
      ■ GERMAN
      ■ ITALIAN
      ■ NORWEGIAN
      ■ SPANISH
      ■ SWEDISH

      while the basic wordlist stemmer reports:

      ENGLISH (English inflectional)
      DERIVATIONAL (English derivational)
      DUTCH
      FRENCH
      GERMAN
      ITALIAN
      SPANISH

      Btw, Danish is missing from all lists.

      And what's the difference between using the basic wordlist stemmer vs. using the basic lexer stemmer ?

      In general, I would appreciate a bit of explanation about what to choose when switching to language X, without assuming any info from system settings (db, session, host, and so on).
      Does it make sense to state that a given index has been setup for a language X, since there is no overall setting such as NLS_LANG (for example) ?

      Thanks
        • 1. Re: languages
          Herald ten Dam
          Hi,

          Which version are you looking at? In the 11.2 manual, Danish is mentioned under the alternate_spelling attribute of the basic_lexer: http://docs.oracle.com/cd/E11882_01/text.112/e24436/cdatadic.htm#i1007615

          It is possible that the documentation was updated and that this was an omission in older versions, but you don't mention your version or link to its documentation.

          Herald ten Dam
          http://htendam.wordpress.com
          • 2. Re: languages
            sperkmandl
            I'm using 11gR2 and yes, as I wrote in my post, Danish is listed for alternate spelling (page 2.32), but nowhere else. That confuses me: there seems to be no way to create an index for a given language in a single shot.
            • 3. Re: languages
              Barbara Boehmer
              sperkmandl wrote:
              And what's the difference between using the basic wordlist stemmer vs. using the basic lexer stemmer ?
              The index_stems attribute of the basic_lexer is intended to speed up stem searches. The example below shows that German composite stemming requires the stemmer attribute of the basic_wordlist. I first demonstrate the basic_lexer with index_stems alone, then add a basic_wordlist with the stemmer attribute. Although my database language and session language are AMERICAN during index creation and queries, German alternate spelling, stemming and composite stemming all work.

              language settings:
              SCOTT@orcl_11gR2> select value from nls_database_parameters where parameter = 'NLS_LANGUAGE'
                2  /
              
              VALUE
              --------------------------------------------------------------------------------
              AMERICAN
              
              1 row selected.
              
              SCOTT@orcl_11gR2> select value from v$nls_parameters where parameter = 'NLS_LANGUAGE'
                2  /
              
              VALUE
              ----------------------------------------------------------------
              AMERICAN
              
              1 row selected.
              demo table and data:
              SCOTT@orcl_11gR2> CREATE TABLE demo_tab
                2    (demo_col  VARCHAR2(30))
                3  /
              
              Table created.
              
              SCOTT@orcl_11gR2> INSERT ALL
                2  INTO demo_tab VALUES ('Böhmer')
                3  INTO demo_tab VALUES ('Boehmer')
                4  INTO demo_tab VALUES ('Grün, Blau, Rot')
                5  INTO demo_tab VALUES ('Rotes Auto')
                6  INTO demo_tab VALUES ('Roter Zug')
                7  INTO demo_tab VALUES ('Hauptbahnhof')
                8  INTO demo_tab VALUES ('Lokomotivführer')
                9  SELECT * FROM DUAL
               10  /
              
              7 rows created.
              lexer only (the $bahnhof, $lokomotive and $führer queries, which rely on German composite stemming, return no rows):
              SCOTT@orcl_11gR2> BEGIN
                2    CTX_DDL.CREATE_PREFERENCE ('demo_lex', 'BASIC_LEXER');
                3    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'ALTERNATE_SPELLING', 'GERMAN');
                4    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'COMPOSITE', 'GERMAN');
                5    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'INDEX_STEMS', 'GERMAN');
                6  END;
                7  /
              
              PL/SQL procedure successfully completed.
              
              SCOTT@orcl_11gR2> CREATE INDEX demo_idx ON demo_tab (demo_col)
                2  INDEXTYPE IS CTXSYS.CONTEXT
                3  PARAMETERS
                4    ('LEXER        demo_lex')
                5  /
              
              Index created.
              
              SCOTT@orcl_11gR2> -- alternate spelling:
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Böhmer') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Böhmer
              Boehmer
              
              2 rows selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Boehmer') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Böhmer
              Boehmer
              
              2 rows selected.
              
              SCOTT@orcl_11gR2> -- stemming
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$rot') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Grün, Blau, Rot
              Rotes Auto
              Roter Zug
              
              3 rows selected.
              
              SCOTT@orcl_11gR2> -- stemming using german_composite:
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$Haupt') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Hauptbahnhof
              
              1 row selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$bahnhof') > 0
                2  /
              
              no rows selected
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$lokomotive') > 0
                2  /
              
              no rows selected
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$führer') > 0
                2  /
              
              no rows selected
              lexer and wordlist (everything works):
              SCOTT@orcl_11gR2> DROP INDEX demo_idx
                2  /
              
              Index dropped.
              
              SCOTT@orcl_11gR2> BEGIN
                2    CTX_DDL.CREATE_PREFERENCE ('demo_wordlist', 'BASIC_WORDLIST');
                3    CTX_DDL.SET_ATTRIBUTE ('demo_wordlist', 'STEMMER', 'GERMAN');
                4  END;
                5  /
              
              PL/SQL procedure successfully completed.
              
              SCOTT@orcl_11gR2> CREATE INDEX demo_idx ON demo_tab (demo_col)
                2  INDEXTYPE IS CTXSYS.CONTEXT
                3  PARAMETERS
                4    ('LEXER        demo_lex
                5        WORDLIST  demo_wordlist')
                6  /
              
              Index created.
              
              SCOTT@orcl_11gR2> -- alternate spelling:
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Böhmer') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Böhmer
              Boehmer
              
              2 rows selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Boehmer') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Böhmer
              Boehmer
              
              2 rows selected.
              
              SCOTT@orcl_11gR2> -- stemming
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$rot') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Grün, Blau, Rot
              Rotes Auto
              Roter Zug
              
              3 rows selected.
              
              SCOTT@orcl_11gR2> -- stemming using german_composite:
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$Haupt') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Hauptbahnhof
              
              1 row selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$bahnhof') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Hauptbahnhof
              
              1 row selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$lokomotive') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Lokomotivführer
              
              1 row selected.
              
              SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$führer') > 0
                2  /
              
              DEMO_COL
              ------------------------------
              Lokomotivführer
              
              1 row selected.
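
              The later demos in this thread reuse the names demo_tab, demo_idx and demo_lex, so the objects from this demo presumably have to be dropped before running them. The transcript does not show that step; the following is only a minimal cleanup sketch, assuming the same schema and the object names used above.

              ```sql
              -- Cleanup sketch (not part of the original transcript).
              -- Assumes the index, preferences and table created in the demo above.
              DROP INDEX demo_idx;

              BEGIN
                -- Preferences are separate objects and survive DROP INDEX.
                CTX_DDL.DROP_PREFERENCE ('demo_lex');
                CTX_DDL.DROP_PREFERENCE ('demo_wordlist');
              END;
              /

              DROP TABLE demo_tab;
              ```

              Only the first demo creates demo_wordlist, so after the Swedish or Danish demo that DROP_PREFERENCE call should be left out.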
              • 4. Re: languages
                Barbara Boehmer
                sperkmandl wrote:
                Hi, the ref. manual reports German, Danish and Swedish as alternate spelling choices.
                Where do they come from, since Danish and Swedish are not in common language lists (basic lexer and basic wordlist attributes) ?
                I don't know where they are stored internally. They may or may not be easily accessible in some table or text file. You specify them as the alternate_spelling attribute of the basic_lexer. They are listed in the documentation:

                http://docs.oracle.com/cd/E11882_01/text.112/e24436/cspell.htm#CIHHGDFH
                • 5. Re: languages
                  Barbara Boehmer
                  Here is a demo of Swedish alternate spelling and stemming, similar to the German demo that I provided earlier.
                  SCOTT@orcl_11gR2> -- settings:
                  SCOTT@orcl_11gR2> select value from nls_database_parameters where parameter = 'NLS_LANGUAGE'
                    2  /
                  
                  VALUE
                  --------------------------------------------------------------------------------
                  AMERICAN
                  
                  1 row selected.
                  
                  SCOTT@orcl_11gR2> select value from v$nls_parameters where parameter = 'NLS_LANGUAGE'
                    2  /
                  
                  VALUE
                  ----------------------------------------------------------------
                  AMERICAN
                  
                  1 row selected.
                  
                  SCOTT@orcl_11gR2> -- demo table and data:
                  SCOTT@orcl_11gR2> CREATE TABLE demo_tab
                    2    (demo_col  VARCHAR2(30))
                    3  /
                  
                  Table created.
                  
                  SCOTT@orcl_11gR2> INSERT ALL
                    2  INTO demo_tab VALUES ('Anders Jonas Ångström')
                    3  INTO demo_tab VALUES ('Anders Jonas Angstrom')
                    4  INTO demo_tab VALUES ('Jag skrattar ofta.')
                    5  INTO demo_tab VALUES ('Han skrattade högt.')
                    6  SELECT * FROM DUAL
                    7  /
                  
                  4 rows created.
                  
                  SCOTT@orcl_11gR2> -- lexer only:
                  SCOTT@orcl_11gR2> BEGIN
                    2    CTX_DDL.CREATE_PREFERENCE ('demo_lex', 'BASIC_LEXER');
                    3    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'ALTERNATE_SPELLING', 'SWEDISH');
                    4    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'INDEX_STEMS', 'SWEDISH');
                    5  END;
                    6  /
                  
                  PL/SQL procedure successfully completed.
                  
                  SCOTT@orcl_11gR2> CREATE INDEX demo_idx ON demo_tab (demo_col)
                    2  INDEXTYPE IS CTXSYS.CONTEXT
                    3  PARAMETERS
                    4    ('LEXER        demo_lex')
                    5  /
                  
                  Index created.
                  
                  SCOTT@orcl_11gR2> -- alternate spelling:
                  SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Ångström') > 0
                    2  /
                  
                  DEMO_COL
                  ------------------------------
                  Anders Jonas Ångström
                  
                  1 row selected.
                  
                  SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'Angstrom') > 0
                    2  /
                  
                  DEMO_COL
                  ------------------------------
                  Anders Jonas Angstrom
                  
                  1 row selected.
                  
                  SCOTT@orcl_11gR2> -- stemming
                  SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$skrattar') > 0
                    2  /
                  
                  DEMO_COL
                  ------------------------------
                  Jag skrattar ofta.
                  Han skrattade högt.
                  
                  2 rows selected.
                  
                  SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, '$skrattade') > 0
                    2  /
                  
                  DEMO_COL
                  ------------------------------
                  Jag skrattar ofta.
                  Han skrattade högt.
                  
                  2 rows selected.
                  • 6. Re: languages
                    sperkmandl
                    I'm afraid I wasn't clear. Concerning Danish: why can I select it as capable of alternate spelling, if I cannot select it in any other language choice ?
                    • 7. Re: languages
                      Barbara Boehmer
                      Perhaps I still don't understand. The following demonstrates Danish alternate spelling.
                      SCOTT@orcl_11gR2> -- settings:
                      SCOTT@orcl_11gR2> select value from nls_database_parameters where parameter = 'NLS_LANGUAGE'
                        2  /
                      
                      VALUE
                      --------------------------------------------------------------------------------
                      AMERICAN
                      
                      1 row selected.
                      
                      SCOTT@orcl_11gR2> select value from v$nls_parameters where parameter = 'NLS_LANGUAGE'
                        2  /
                      
                      VALUE
                      ----------------------------------------------------------------
                      AMERICAN
                      
                      1 row selected.
                      
                      SCOTT@orcl_11gR2> -- demo table and data:
                      SCOTT@orcl_11gR2> CREATE TABLE demo_tab
                        2    (demo_col  VARCHAR2(30))
                        3  /
                      
                      Table created.
                      
                      SCOTT@orcl_11gR2> INSERT ALL
                        2  INTO demo_tab VALUES ('mødt')
                        3  INTO demo_tab VALUES ('moedt')
                        4  SELECT * FROM DUAL
                        5  /
                      
                      2 rows created.
                      
                      SCOTT@orcl_11gR2> -- lexer only:
                      SCOTT@orcl_11gR2> BEGIN
                        2    CTX_DDL.CREATE_PREFERENCE ('demo_lex', 'BASIC_LEXER');
                        3    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'ALTERNATE_SPELLING', 'DANISH');
                        4    CTX_DDL.SET_ATTRIBUTE ('demo_lex', 'INDEX_STEMS', 'DANISH');
                        5  END;
                        6  /
                      
                      PL/SQL procedure successfully completed.
                      
                      SCOTT@orcl_11gR2> CREATE INDEX demo_idx ON demo_tab (demo_col)
                        2  INDEXTYPE IS CTXSYS.CONTEXT
                        3  PARAMETERS
                        4    ('LEXER        demo_lex')
                        5  /
                      
                      Index created.
                      
                      SCOTT@orcl_11gR2> -- alternate spelling:
                      SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'mødt') > 0
                        2  /
                      
                      DEMO_COL
                      ------------------------------
                      mødt
                      moedt
                      
                      2 rows selected.
                      
                      SCOTT@orcl_11gR2> SELECT * FROM demo_tab WHERE CONTAINS (demo_col, 'moedt') > 0
                        2  /
                      
                      DEMO_COL
                      ------------------------------
                      mødt
                      moedt
                      
                      2 rows selected.
                      • 8. Re: languages
                        Barbara Boehmer
                        I think maybe I understand now. Although you can do alternate spelling in Danish and you can specify Danish as an index_stems attribute of a basic_lexer, it does not appear to actually do Danish stemming. As to why, it is apparently just not an available feature at this time.
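
                        One way to check what an index is really doing is to look at its stored settings and tokens. The sketch below assumes the Danish demo index DEMO_IDX from the previous reply still exists in the current schema. CTX_USER_INDEX_VALUES and the DR$<index name>$I token table are standard Oracle Text objects, but the interpretation of TOKEN_TYPE is an assumption to verify against the documentation for your release: 0 is ordinary text, and stems written by index_stems should appear under a different, nonzero type.

                        ```sql
                        -- Attribute values recorded for the index (lexer, wordlist, and so on).
                        SELECT ixv_class, ixv_object, ixv_attribute, ixv_value
                        FROM   ctx_user_index_values
                        WHERE  ixv_index_name = 'DEMO_IDX'
                        ORDER  BY ixv_class, ixv_attribute;

                        -- Tokens actually stored in the index.
                        -- Assumption: stem tokens, if any were generated, carry a TOKEN_TYPE other than 0.
                        SELECT token_text, token_type
                        FROM   dr$demo_idx$i
                        ORDER  BY token_type, token_text;
                        ```

                        If the second query lists only TOKEN_TYPE 0 rows, that would be consistent with no stems having been indexed for that language.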
                        • 9. Re: languages
                          Barbara Boehmer
                          sperkmandl wrote:
                          Does it make sense to state that a given index has been setup for a language X, since there is no overall setting such as NLS_LANG (for example) ?
                          You can use the multi_lexer for multiple languages, specifying a language column. You can also use the world_lexer without a language column for automatic language detection. However, sometimes the automatic language detection is not accurate, especially with small text samples.
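
                          A minimal sketch of the multi_lexer approach, assuming a hypothetical table docs_tab with a text column doc_text and a language column doc_lang. These table, column and preference names are made up for illustration and do not come from the demos above. Each row is indexed with the sub-lexer that matches its language column value, and rows with any other value use the DEFAULT sub-lexer.

                          ```sql
                          -- Hypothetical table: doc_lang holds an Oracle language name such as 'GERMAN'.
                          CREATE TABLE docs_tab
                            (doc_text  VARCHAR2(200),
                             doc_lang  VARCHAR2(30));

                          BEGIN
                            -- Sub-lexer for German rows.
                            CTX_DDL.CREATE_PREFERENCE ('german_lex', 'BASIC_LEXER');
                            CTX_DDL.SET_ATTRIBUTE ('german_lex', 'ALTERNATE_SPELLING', 'GERMAN');
                            CTX_DDL.SET_ATTRIBUTE ('german_lex', 'COMPOSITE', 'GERMAN');

                            -- Sub-lexer for every other language value.
                            CTX_DDL.CREATE_PREFERENCE ('default_lex', 'BASIC_LEXER');

                            -- The multi_lexer maps language column values to sub-lexers.
                            CTX_DDL.CREATE_PREFERENCE ('global_lex', 'MULTI_LEXER');
                            CTX_DDL.ADD_SUB_LEXER ('global_lex', 'DEFAULT', 'default_lex');
                            CTX_DDL.ADD_SUB_LEXER ('global_lex', 'GERMAN', 'german_lex');
                          END;
                          /

                          CREATE INDEX docs_idx ON docs_tab (doc_text)
                          INDEXTYPE IS CTXSYS.CONTEXT
                          PARAMETERS
                            ('LEXER global_lex LANGUAGE COLUMN doc_lang');
                          ```

                          As far as I know, the multi_lexer picks the sub-lexer for a query from the session language rather than from the data, so this setup does not fully remove the dependency on session settings. The world_lexer alternative is created with CTX_DDL.CREATE_PREFERENCE and the object name 'WORLD_LEXER', and needs no language column.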