6 Replies Latest reply on Nov 19, 2013 9:42 AM by Zoltan Kecskemethy

    Umlaut-insensitive search

    897125


      Our database has varchar-fields that contain umlauts (e.g. Möller). We are receiving material in which umlauts ä, ö and ü have been replaced with a, o and u, respectively. With a bit of googling I figured I could do an umlaut-insensitive query by setting nls_comp to LINQUISTIC and nls_sort to GERMAN_AI (have also tried XGERMAN_DIN_AI). These seem to work but the queries are hopelessly slow, especially when I have to do a join.

       

      select customer.id, customer.name from customer

      inner join address on address.id on customer.id

      where customer.name like 'Moller'

      and address.street like 'Hogberggatan';

       

      With umlauts this query take 0.08 seconds. With nls_comp and nls_sort set, and without umlauts it takes over 2 minutes. We're using 11g and the chactersets are NLS_NCHAR_CHARACTERSET = AL16UTF16 and NLS_CHARACTERSET = WE8ISO8859P1.

       

      Any ideas on how we could speed up the queries?

        • 1. Re: Umlaut-insensitive search
          Zoltan Kecskemethy

          You may need "Linguistic Indexes". Please check

          Introduction To Linguistic Indexes – Part I | Richard Foote's Oracle Blog

          Oracle case insensitive indexes searching

           

          But to be sure you would need to check execution plans first of course...

          I mean query performance is slow because it cannot use indexes...

          So you would need to create the same indexes using specific NLS SORT...

          • 2. Re: Umlaut-insensitive search
            Sergiusz Wolicki-Oracle

            Check execution plans, indeed. Also, what is the actual predicate: "customer.name like 'Moller'", "customer.name like '%Moller%'" or "customer.name like 'Moller%'"?

             

             

             

            Thanks,

            Sergiusz

            • 3. Re: Umlaut-insensitive search
              897125

              Thanks for the replies. The customer.name actually contains the whole name including all middle names, and the format is 'SURNAME FIRSTNAME [MIDDLENAME]', already normalized to upper-case (so case is not an issue). We can't trust that the material we receive has the name in correct form, but need to try and identify the person. When preprocessing the material, what we do is take the first two words in the name and try those with the predicate customer.name like 'first_word second_word%'. If that doesn't match we swap the two words. So the actual value in the customer.name might be 'MÖLLER RICHARD ANTHONY', but in the material we receive it might be 'Moller Richard' or 'Moller Richard A.' or 'Richard Moller'. Since false negatives are acceptable but false positives are not, we also cross-check with some other piece of information, a birthday or an address (street). Name alone is never considered enough.

               

              Our DBA department said we should use NLSSORT(street, 'NLS_SORT=GENERIC_M') in the index, and then set nls_comp to LINGUISTIC. If I then execute the query in the original post, I get no hits. I tried reading the docs on linguistic indexes but am not sure what the NLS_SORT ought to be. The DBA Dep. are reluctant to try GERMAN_AI.

               

              Another option is to use REGEXP_LIKE and regex equivalence class, e.g. where REGEXP_LIKE(customer.name, 'M[[=o=]]ller').

              • 4. Re: Umlaut-insensitive search
                897125

                Is there a generic NLS SORT that's umlaut insensitive or should I use GERMAN? I can't find any documentation for the linguistic definitions. For example, what is the difference between GERMAN and XGERMAN?

                • 5. Re: Umlaut-insensitive search
                  Sergiusz Wolicki-Oracle

                  If you want accent (umlaut) insensitive sort, then you need to use GERMAN_AI, XGERMAN_AI, GERMAN_DIN_AI, XGERMAN_DIN_AI or GENERIC_M_AI.

                   

                  The general differences between German sorts are:

                  GERMAN - sorts lowercase before uppercase, treats 'äöü' as accented 'aou', treats 'ß' as accented 's'

                  GERMAN_DIN - sorts uppercase before lowercase, treats 'äöü' as accented 'aou', treats 'ß' as accented 's'

                  XGERMAN - sorts lowercase before uppercase, treats 'äöü' as accented 'aou', treats 'ß' as 'ss'

                  XGERMAN_DIN - sorts uppercase before lowercase, treats 'äöü' as 'aeoeue', treats 'ß' as 'ss'

                   

                  _AI are case- and accent-insensitive versions, which basically do a conversion to baseletter form before comparing.

                   

                  You will see no difference in matching between GERMAN_AI and GERMAN_DIN_AI, as they differ in sorting results only. You will see a significant difference between XGERMAN_DIN_AI and other German _AI sorts as far as umlauts are concerned. XGERMAN_DIN_AI will match 'Käse' with 'KAESE', while the others will match 'Käse' with 'KASE'.

                   

                  GENERIC_M is a generic European sort based on Unicode and ISO standards. GENERIC_M_AI should behave similarly to XGERMAN as far as German is concerned. It does however more, like sorting other languages and scripts and performing Unicode normalization. It is, therefore, more universal but also more expensive than XGERMAN.

                   

                   

                  Thanks,

                  Sergiusz

                  • 6. Re: Umlaut-insensitive search
                    Zoltan Kecskemethy

                    I link here 10g doc as you did not provide yours:

                    See more at Oracle® Database Globalization Support Guide 10g Release 2 (10.2) Linguistic Sorting and String Searching

                    but you can easily find Globalization Support Guide in any of the online documentation.Look for Master Book list and GLO shortcut.