13 Replies Latest reply: Jan 31, 2013 6:26 AM by chris227 RSS

    Regular expression [:alpha:] class not giving correct result on 11g

    987706
      Hi All,

      We have requirement to check whether input string is english or any other language ( we are checking for farsi). I am exceuting following statement to check whether it contains any alphabet, if not then taking it as non english.

      1. when i am passing farsi characters,It should return 0. But it is returning output as 1 on 11G,
      0 on 10G.


      SELECT REGEXP_INSTR ('نام', '[[:alpha:]]', 1,1)
      --- INTO v_alpha
      FROM DUAL;


      2. if i replace [:alpha:] with [A-Za-z] or [:lower:][:upper:] then it is working fine on both 10g,11g , returning 0


      Can anyone help in understanding why [:alpha:] class not working on 11g.

      Thanks in Advance
        • 1. Re: Regular expression [:alpha:] class not giving correct result on 11g
          chris227
          The rexp-function with the usage of character classes and ranges is NLS-Setting dependend.
          That means, that alpha contains all "all alphabetic characters".

          In the seocnd example you gave you restricted the alpha class down to dedicated ranges.
          See the example below for my NLS-Settings:
          SELECT 
           REGEXP_INSTR ('نام', '[[:alpha:]]', 1,1) a
          ,REGEXP_INSTR ('نام', '[y-€]', 1,1) b
          FROM DUAL
          
          A B 
          1 1 
          If really want to test only against ascii you have to do this more accuratly.
          • 2. Re: Regular expression [:alpha:] class not giving correct result on 11g
            987706
            Thanks Chris for quick reply!

            Can you please tell what NLS parameters it depends on...i hav checked in both database all parameter are same except characterset

            In 10g - NLS_CHARACTERSET     WE8ISO8859P1
            11g- NLS_CHARACTERSET     AR8MSWIN1256

            In definition of [:alpha:] class , in most of websites range mentioned is A-Z, a-z .

            How about [:upper:] [:lower:] classes?Are these classes only for english alphabets?


            Thanks Again!!


            one more thing

            i am reading one document

            http://www.oracle.com/technetwork/database/focus-areas/application-development/twp-regular-expressions-133133.pdf

            there it is mentioned at page 11:


            Also, rather than using [a-zA-Z] to define all alphabetical
            characters, you could use [:alpha:] and be portable across different languages and
            character sets as [:alpha:] consists of all alphabetical characters in the current
            character set.

            now how can we be sure specifying [a-zA-Z] range will work finding english alphabets from string.

            Edited by: 984703 on 29-Jan-2013 01:45
            • 3. Re: Regular expression [:alpha:] class not giving correct result on 11g
              chris227
              984703 wrote:
              Thanks Chris for quick reply!

              Can you please tell what NLS parameters it depends on...i hav checked in both database all parameter are same except characterset
              The document you mentioned is a good start. The answer to this question is also in it.
              NLS_CHARACTERSET and NLS_SORT
              In 10g - NLS_CHARACTERSET     WE8ISO8859P1
              11g- NLS_CHARACTERSET     AR8MSWIN1256

              In definition of [:alpha:] class , in most of websites range mentioned is A-Z, a-z .
              That's (wrong) misleading as the document states clearly:
              "[:alpha:] consists of all alphabetical characters in the current character set."
              Obviously it is not always obvious what the a or z belongs to.

              You proove it with your case using two different charactersets.
              How about [:upper:] [:lower:] classes?Are these classes only for english alphabets?
              No.
              SELECT 
               REGEXP_INSTR ('نام', '[[:alpha:]]', 1,1) a
              ,REGEXP_INSTR ('نام', '[y-€]', 1,1) b
              ,length(translate ('نام',
               chr(0)||'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
              ,'chr(0)'
              )) l
              ,nvl(length(translate ('eèé'
              ,chr(0)||'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
              ,chr(0)
              )),0) l
              ,REGEXP_substr ('eéè','[[:lower:]]+') r
              ,dump('نام', 
              1010) d
              FROM DUAL
              
              A B L L R D 
              1 1 3 2 eéè Typ=96 Len=6 CharacterSet=AL32UTF8: 217,134,216,167,217,133 
              I guess it works because there is no definition of upper and lower for this particular characters, but i may be wrong probably.
              now how can we be sure specifying [a-zA-Z] range will work finding english alphabets from string.
              Define "english alphabet".

              May be translate is worth a thought.

              regards

              Edited by: chris227 on 29.01.2013 01:58
              • 4. Re: Regular expression [:alpha:] class not giving correct result on 11g
                987706
                [:upper:][:lower:] class is working..may b bcz there is no lower or upper class for other character except A-Z

                We can conclude like


                [:alpha:] ----> Character classes --> depends on characterset -- >NLS_CHARACTERSET
                [A-Za-z] --- Range matching--->depends on Sort Order --->sort order depends ---> NLS_LANGUAGE
                [:upper:][:lower:] Character classes --> depends on characterset -- >NLS_CHARACTERSET -- > there is no lower or upper diffrence for other alphabets


                Please correct me if this is wrong understanding.



                Thanks a lot!!
                • 5. Re: Regular expression [:alpha:] class not giving correct result on 11g
                  Sven W.
                  984703 wrote:
                  [:upper:][:lower:] class is working..may b bcz there is no lower or upper class for other character except A-Z

                  We can conclude like

                  [:alpha:] ----> Character classes --> depends on characterset -- >NLS_CHARACTERSET
                  [A-Za-z] --- Range matching--->depends on Sort Order --->sort order depends ---> NLS_LANGUAGE
                  [:upper:][:lower:] Character classes --> depends on characterset -- >NLS_CHARACTERSET -- > there is no lower or upper diffrence for other alphabets

                  Please correct me if this is wrong understanding.
                  Your conclusion looks ok, however I don't see how do you want to replace [:alpha:] with [:lower:] or [:upper:].
                  Wouldn't that be the same as using \D => not a digit? See example below.
                  The main difference here are not the alpha characters (Multi character set considered), but how other chars are included like numerical or whitespace characters.

                  And one minor comment: Is there a reason, why you used regexp_instr? In my opinion it is the most useless function of all the regexp_ functions. In general, we don't need it, because regexp_substr/regexP_replace will do the same and usually include the next step (cut/replace the string) too. If it was just for demonstration or analysis purposes, then thats fine.

                  example
                  select regexp_instr ('123a','\D') first_alpha_char from dual;
                  
                  4
                  
                  select regexp_instr ('123€a','\D') first_alpha_char from dual;
                  
                  4
                  • 6. Re: Regular expression [:alpha:] class not giving correct result on 11g
                    chris227
                    984703 wrote:
                    [:upper:][:lower:] class is working..may b bcz there is no lower or upper class for other character except A-Z
                    Well, i think it's not easy like this. In ISO885915 for example there is lower and upper case for é, in AR8MSWIN1256 there is not. May be in AR8MSWIN1256 there is in fact only upper/lower cases defined for ASCII/-characters, but i would only rely on it if i were sure, and i doubt it in this case, see code below.
                    We can conclude like


                    [:alpha:] ----> Character classes --> depends on characterset -- >NLS_CHARACTERSET
                    [A-Za-z] --- Range matching--->depends on Sort Order --->sort order depends ---> NLS_LANGUAGE
                    Notice, the range is defined linguistic, é will probably be included for AR8MSWIN1256
                    [:upper:][:lower:] Character classes --> depends on characterset -- >NLS_CHARACTERSET -- > there is no lower or upper diffrence for other alphabets
                    I think characters for wich a Upper/lower flag exists in the characterset given. See
                    SELECT
                    SELECT
                     dump(convert ('é','AR8MSWIN1256'),1010) de1
                    ,dump(upper(convert ('é','AR8MSWIN1256')),1010) de2u
                    ,dump(lower(convert ('é','AR8MSWIN1256')),1010) de2l
                    ,dump(convert ('e','AR8MSWIN1256'),1010) de3
                    ,dump(upper(convert ('e','AR8MSWIN1256')),1010) de4
                    ,dump(convert (chr(225),'UTF8','AR8MSWIN1256'),1010) de3
                    ,dump(upper(convert (chr(225),'UTF8','AR8MSWIN1256')),1010) de4
                    ,regexp_substr('ل', '[[:lower:][:upper:]]') rs
                    FROM DUAL
                    
                    DE1 DE2U DE2L DE3 DE4 DE3 DE4 RS 
                    Typ=1 Len=1 CharacterSet=AL32UTF8: 233 NULL NULL Typ=1 Len=1 CharacterSet=AL32UTF8: 101 Typ=1 Len=1 CharacterSet=AL32UTF8: 69 Typ=1 Len=2 CharacterSet=AL32UTF8: 217,132 Typ=1 Len=2 CharacterSet=AL32UTF8: 217,132 -  
                    I guess, that for the last column, there will be something other than null in your 11g DB with AR8MSWIN1256
                    (chr(225) is 'ل' in AR8MSWIN1256
                    regards

                    Edited by: chris227 on 29.01.2013 04:14

                    Edited by: chris227 on 29.01.2013 04:17
                    • 7. Re: Regular expression [:alpha:] class not giving correct result on 11g
                      987706
                      That what i thought intially..

                      [[:alpha:] ]= [[:lower:][:upper:]]= [[a-zA-Z]]

                      but its not like that [[:alpha:]] is superset of both and [[:lower:][:upper:]] is of [[a-zA-Z]]

                      Eg.

                      SELECT REGEXP_INSTR ('è', '[[:lower:]]+', 1, 1)
                      FROM DUAL;


                      is returning 1

                      SELECT REGEXP_INSTR ('è', '[[A-Za-z]]+', 1, 1)
                      FROM DUAL;

                      is returning 0.


                      Another point u mentioned using REG_INSTR , we dont want to do anything with input string..just want to check alphabet(a-zA-Z mentioning it explicitly as alpha class rasied so much confusion earlier :) ) present or not...so using REG_INSTR instead of other functions.

                      Is there any other function we can use for the same?
                      • 8. Re: Regular expression [:alpha:] class not giving correct result on 11g
                        Manik
                        See this if it returns a number it means the checking string is out of alnum category.
                        SELECT LENGTH (
                                  TRIM (
                                     TRANSLATE (
                                        'نام',
                                        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                                        ' ')))
                          FROM DUAL;
                        Cheers,
                        Manik.
                        • 9. Re: Regular expression [:alpha:] class not giving correct result on 11g
                          987706
                          Definitely it will return some length!!

                          question was what does [:alpha:] class contain ...now got clear :)
                          • 10. Re: Regular expression [:alpha:] class not giving correct result on 11g
                            chris227
                            984703 wrote:
                            SELECT REGEXP_INSTR ('è', '[[A-Za-z]]+', 1, 1)
                            FROM DUAL;

                            is returning 0.
                            You dont got it yet.
                            At first there is a typo: double brackets inside the character class.
                            Secondly it depends on NLS_SORT as mentioned above
                            select * from v$NLS_PARAMETERS
                            where
                            parameter = 'NLS_SORT'
                            or
                            parameter = 'NLS_CHARACTERSET'
                            
                            PARAMETER VALUE 
                            NLS_CHARACTERSET AL32UTF8 
                            NLS_SORT BINARY 
                            
                            SELECT
                            REGEXP_INSTR ('è', '[A-Za-z]+', 1, 1) r
                            FROM DUAL;
                            
                            R 
                            0
                            
                            select * from v$NLS_PARAMETERS
                            where
                            parameter = 'NLS_SORT'
                            or
                            parameter = 'NLS_CHARACTERSET'
                            
                            PARAMETER     VALUE
                            NLS_CHARACTERSET     WE8MSWIN1252
                            NLS_SORT     GERMAN
                            
                            SELECT
                             REGEXP_INSTR ('è', '[A-Za-z]+', 1, 1) r
                            FROM DUAL;
                            
                            R
                            1
                            • 11. Re: Regular expression [:alpha:] class not giving correct result on 11g
                              987706
                              i got it in this way only...hadn't shown example with different NLS_SORT .... thanks for explaining it fully with example !
                              • 12. Re: Regular expression [:alpha:] class not giving correct result on 11g
                                987706
                                Thanks Manik ! used your code to check other lang character..it worked well :)
                                • 13. Re: Regular expression [:alpha:] class not giving correct result on 11g
                                  chris227
                                  984703 wrote:
                                  used your code to check other lang character..it worked well :)
                                  Translate was already posted before to show that it works with byte semantic instead of language semantics
                                  ,length(translate ('نام',
                                   chr(0)||'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                  ,'chr(0)'
                                  )) l
                                  ,nvl(length(translate ('eèé'
                                  ,chr(0)||'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
                                  ,chr(0)
                                  )),0) l
                                  No need to trim anything. Just use as first character a character which is not in the set of the allowed characters, like char(0) or € e.g.
                                  translate ('eèé','€abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ','€')