This discussion is archived
6 Replies Latest reply: Mar 23, 2005 8:18 AM by rmhardma

Oracle Text language support.

charles poulsen - oracle Newbie

Hi,

I have been reading through the Oracle Text
documentation and whitepapers, and it obviously
states that it supports the "English" language.

I assume this is US English.

Does anyone know if Oracle Text supports other varieties of
English, i.e. UK or Australian?

Thanks,
Charles
  • 1. Re: Oracle Text language support.
    rmhardma Oracle ACE
    I've never found it to be a problem, since the rules are the same across the varieties of English (whereas other languages, like Korean or Japanese, have different language rules, so separate lexers are appropriate). If you use a thesaurus, you'll obviously want to make certain it has the correct terms. I almost always use a custom stoplist as well, so I add any custom terms to the standard list of stopwords such as 'of' and 'the'.
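
    A minimal sketch of the custom stoplist idea, assuming you have execute rights on CTX_DDL (for example through the CTXAPP role). The names my_stoplist, docs, body and docs_idx are hypothetical, and 'widget' just stands in for a custom term:

    ```sql
    -- Hypothetical names throughout: my_stoplist, docs, body, docs_idx.
    begin
      -- A new basic stoplist starts out empty, so the usual
      -- stopwords have to be added along with the custom ones.
      ctx_ddl.create_stoplist('my_stoplist', 'BASIC_STOPLIST');
      ctx_ddl.add_stopword('my_stoplist', 'of');
      ctx_ddl.add_stopword('my_stoplist', 'the');
      ctx_ddl.add_stopword('my_stoplist', 'widget');  -- example custom term
    end;
    /

    -- Attach the stoplist when the index is created.
    create index docs_idx on docs(body)
    indextype is ctxsys.context
    parameters ('stoplist my_stoplist');
    ```

    Stopwords are not indexed, so a term added here cannot be searched on its own.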

    Is there a particular feature of Oracle Text that you were most concerned about, or is your inquiry more of a high-level question? If you are planning on supporting a global app, take a look at the multi_lexer or world_lexer (whether the world_lexer is available depends on your version).
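
    As a rough sketch of the world_lexer route, assuming a version that ships WORLD_LEXER and execute rights on CTX_DDL. The names global_lexer, global_docs, body and global_docs_idx are hypothetical:

    ```sql
    -- Hypothetical names: global_lexer, global_docs, body, global_docs_idx.
    begin
      -- Create a lexer preference based on the WORLD_LEXER type.
      ctx_ddl.create_preference('global_lexer', 'WORLD_LEXER');
    end;
    /

    -- Use the preference when building the index.
    create index global_docs_idx on global_docs(body)
    indextype is ctxsys.context
    parameters ('lexer global_lexer');
    ```

    The multi_lexer is set up differently (it needs sub-lexers and a language column), so check the documentation for your release before choosing between the two.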

    -Ron
  • 2. Re: Oracle Text language support.
    charles poulsen - oracle Newbie

    Hi Ron,

    We are implementing a system in Australia that will be used by school children between the ages of 6 and 18. The problem
    is the spelling of certain words. For example, American
    English spells it Organization, while Australian English
    spells it Organisation.

    Since this will be a teaching and learning system,
    we do not want to encourage incorrect spelling of words!

    So that is the problem in a nutshell :)
  • 3. Re: Oracle Text language support.
    rmhardma Oracle ACE
    Hopefully this will clarify it a bit --

    create table test (co1 varchar2(50));

    insert into test values ('organization');
    insert into test values ('organisation');
    commit;

    create index test_idx on test(co1)
    indextype is ctxsys.context;


    If the source doc has the word spelled organization, a search for organization will find it.

    SQL> select * from test where contains(co1, 'organization') > 0;

    CO1
    --------------------------------------------------
    organization



    If the source doc has the word spelled organisation, then a search for organisation will find it.

    SQL> select * from test where contains(co1, 'organisation') > 0;

    CO1
    --------------------------------------------------
    organisation



    Stemming and/or fuzzy searching will take care of some overlap if you need it.

    SQL> select * from test where contains(co1, '$organisation') > 0;

    CO1
    --------------------------------------------------
    organization
    organisation

    SQL> select * from test where contains(co1, '$organization') > 0;

    CO1
    --------------------------------------------------
    organization
    organisation
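
    The queries above use the stem operator ($). For comparison, a sketch of the fuzzy form against the same test table; which rows come back depends on your version and wordlist settings, so no output is shown here:

    ```sql
    -- ? is the fuzzy operator: it expands the term to similarly spelled
    -- tokens that actually exist in the index.
    select * from test where contains(co1, '?organisation') > 0;

    -- Function form of the same operator (newer syntax; older releases
    -- may only accept the ? form).
    select * from test where contains(co1, 'fuzzy(organisation)') > 0;
    ```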



    A thesaurus can provide even more assistance if you are indexing American English but wish to search using the Australian version of terms and stemming or fuzzy searching isn't cutting it. For example, if your docs use the term widget, but you want to search on gadget, you can use a thesaurus.
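
    A minimal sketch of the widget/gadget case, assuming execute rights on CTX_THES and an indexed row that contains the word widget. The thesaurus name my_thes is hypothetical:

    ```sql
    -- Hypothetical thesaurus name: my_thes.
    begin
      -- Create a case-insensitive thesaurus.
      ctx_thes.create_thesaurus('my_thes', false);
      -- Record widget and gadget as synonyms.
      ctx_thes.create_relation('my_thes', 'widget', 'SYN', 'gadget');
    end;
    /

    -- SYN expands gadget to its synonyms at query time, so the search
    -- can also match documents that only contain widget.
    select * from test where contains(co1, 'syn(gadget, my_thes)') > 0;
    ```

    The expansion happens in the query, so the index itself does not need to be rebuilt after the thesaurus changes.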

    Hope it helps.
  • 4. Re: Oracle Text language support.
    rmhardma Oracle ACE
    um...ok, I just re-read that and I got my own organization/organisation mixed up in the post...hope you still get the idea.
  • 5. Re: Oracle Text language support.
    charles poulsen - oracle Newbie

    Hi Ron,
    I think I get the idea...

    Basically, if we are fuzzy matching or searching against an
    index which does not contain the incorrect spellings,
    they will not show up because they don't exist.

    Thanks for the help
  • 6. Re: Oracle Text language support.
    rmhardma Oracle ACE
    Keep in mind that the lexer's job is to break your text into tokens. It doesn't alter your text. Since the rules are the same across the different flavors of English, you won't have a problem. The lexer will still break the text into tokens, and the original 'organisation' will still be spelled 'organisation'. Oracle Text offers alternate methods of search, and some add-on capabilities such as theme generation, but your original text is not altered.

    You can see this in the example I showed you by examining the DR$TEST_IDX$I.TOKEN_TEXT column. Query the column and you will see the tokens created.
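
    Something along these lines should do it, assuming the index is still named test_idx as in the earlier example:

    ```sql
    -- DR$TEST_IDX$I is the token table that Oracle Text creates
    -- for the index test_idx; each row holds one indexed token.
    select token_text, token_count
    from dr$test_idx$i
    order by token_text;
    ```

    With the default lexer settings the tokens are normally stored in upper case, but both spellings should appear as separate tokens.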