This discussion is archived
1 2 3 Previous Next 37 Replies Latest reply: Jul 6, 2010 6:32 AM by Aketi Jyuuzou RSS

Introduction to regular expressions ... last part.

60660 Journeyer
Currently Being Moderated
Continued from Introduction to regular expressions ... continued., here's the third and final part of my introduction to regular expressions. As always, if you find mistakes or have examples that you think could be solved through regular expressions, please post them.

Having fun with regular expressions - Part 3

In some cases, I may have to search for different values in the same column. If the searched values are fixed, I can use the logical OR operator or the IN clause, like in this example (using my brute force data generator from part 2):
SELECT data
  FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
 WHERE data IN ('abc', 'xyz', '012');
There are of course some workarounds as presented in this asktom thread but for a quick solution, there's of course an alternative approach available. Remember the "|" pipe symbol as OR operator inside regular expressions? Take a look at this:
SELECT data
  FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
 WHERE REGEXP_LIKE(data, '^(abc|xyz|012)$')
;
I can even use strings composed of values like
'abc, xyz ,  012'
by simply using another regular expression to replace "," and spaces with the "|" pipe symbol. After reading part 1 and 2 that shouldn't be too hard, right? Here's my "thinking in regular expression": Replace every "," and 0 or more leading/trailing spaces.

Ready to try your own solution?


Does it look like this?
SELECT data
  FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
 WHERE REGEXP_LIKE(data, '^(' || REGEXP_REPLACE('abc, xyz ,  012', ' *, *', '|') || ')$')
;
If I wouldn't use the "^" and "$" metacharacter, this SELECT would search for any occurence inside the data column, which could be useful if I wanted to combine LIKE and IN clause. Take a look at this example where I'm looking for 'abc%', 'xyz%' or '012%' and adding a case insensitive match parameter to it:
SELECT data
  FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
 WHERE REGEXP_LIKE(data, '^(abc|xyz|012)', 'i')
; 
An equivalent non regular expression solution would have to look like this, not mentioning other options with adding an extra "," and using the INSTR function:
SELECT data
  FROM (SELECT data, LOWER(DATA) search 
          FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
       )
 WHERE search LIKE 'abc%'
    OR search LIKE 'xyz%'
    OR search LIKE '012%'
;   
  
SELECT data
  FROM (SELECT data, SUBSTR(LOWER(DATA), 1, 3) search 
          FROM TABLE(regex_utils.gen_data('abcxyz012', 4))
       )
 WHERE search IN ('abc', 'xyz', '012')
;  
I'll leave it to your imagination how a complete non regular example with
'abc, xyz ,  012'
as search condition would look like.

As mentioned in the first part, regular expressions are not very good at formatting, except for some selected examples, such as phone numbers, which in my demonstration, have different formats. Using regular expressions, I can change them to a uniform representation:
WITH t AS (SELECT '123-4567' phone
             FROM dual
            UNION 
           SELECT '01 345678' 
             FROM dual
            UNION 
           SELECT '7 87 8787' 
             FROM dual
           ) 
SELECT t.phone, REGEXP_REPLACE(REGEXP_REPLACE(phone, '[^0-9]'), '(.{3})(.*)', '(\1)-\2')
  FROM t
;
First, all non digit characters are beeing filtered, afterwards the remaining string is put into a "(xxx)-xxxx" format, but not cutting off any phone numbers that have more than 7 digits. Using such a conversion could also be used to check the validity of entered data, and updating the value with a uniform format afterwards.

Thinking about it, why not use regular expressions to check other values about their formats? How about an IP4 address? I'll do this step by step, using 127.0.0.1 as the final test case.

First I want to make sure, that each of the 4 parts of an IP address remains in the range between 0-255. Regular expressions are good at string matching but they don't allow any numeric comparisons. What valid strings do I have to take into consideration?

Single digit values: 0-9
Double digit values: 00-99
Triple digit values: 000-199, 200-255 (this one will be the trickiest part)

So far, I will have to use the "|" pipe operator to match all of the allowed combinations. I'll use my brute force generator to check if my solution works for a single value:
SELECT data
  FROM TABLE(regex_utils.gen_data('0123456789', 3))
 WHERE REGEXP_LIKE(data, '^(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$')  
; 
More than 255 records? Leading zeros are allowed, but checking on all the records, there's no value above 255. First step accomplished. The second part is to make sure, that there are 4 such values, delimited by a "." dot. So I have to check for 0-255 plus a dot 3 times and then check for another 0-255 value. Doesn't sound to complicated, does it?

Using first my brute force generator, I'll check if I've missed any possible combination:
SELECT data
  FROM TABLE(regex_utils.gen_data('03.', 15))
 WHERE REGEXP_LIKE(data, 
                   '^((25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$'
                  )  
;  
Looks good to me. Let's check on some sample data:
WITH t AS (SELECT '127.0.0.1' ip
             FROM dual
            UNION  
           SELECT '256.128.64.32'
             FROM dual             
          )  
SELECT t.ip
  FROM t WHERE REGEXP_LIKE(t.ip, 
                   '^((25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})$'
                  )  
;  
No surprises here. I can take this example a bit further and try to format valid addresses to a uniform representation, as shown in the phone number example. My goal is to display every ip address in the "xxx.xxx.xxx.xxx" format, using leading zeros for 2 and 1 digit values.

Regular expressions don't have any format models like for example the TO_CHAR function, so how could this be achieved? Thinking in regular expressions, I first have to find a way to make sure, that each single number is at least three digits wide. Using my example, this could look like this:
WITH t AS (SELECT '127.0.0.1' ip
             FROM dual
          )  
SELECT t.ip, REGEXP_REPLACE(t.ip, '([0-9]+)(\.?)', '00\1\2')
  FROM t
;  
Look at this: leading zeros. However, that first value "00127" doesn't look to good, does it? If you thought about using a second regular expression function to remove any excess zeros, you're absolutely right. Just take the past examples and think in regular expressions. Did you come up with something like this?


WITH t AS (SELECT '127.0.0.1' ip
             FROM dual
          )  
SELECT t.ip, REGEXP_REPLACE(REGEXP_REPLACE(t.ip, '([0-9]+)(\.?)', '00\1\2'), 
                            '[0-9]*([0-9]{3})(\.?)', '\1\2'
                           )
  FROM t
;  
Think about the possibilities: Now you can sort a table with unformatted IP addresses, if that is a requirement in your application or you find other values where you can use that "trick".

Since I'm on checking INET (internet) type of values, let's do some more, for example an e-mail address. I'll keep it simple and will only check on the
"x@x.xxx", "x@x.xxxx" and "x@x.xx.xx" format, where x represents an alphanumeric character. If you want, you can look up the corresponding RFC definition and try to build your own regular expression for that one.

Now back to this one: At least one alphanumeric character followed by an "@" at sign which is followed by at least one alphanumeric character followed by a "." dot and exactly 3 more alphanumeric characters or 2 more characters followed by a "." dot and another 2 characters. This should be an easy one, right? Use some sample e-mail addresses and my brute force generator, you should be able to verify your solution.

Here's mine:
SELECT data
  FROM TABLE(regex_utils.gen_data('a1@.', 9))
 WHERE REGEXP_LIKE(data, '^[[:alnum:]]+@[[:alnum:]]+(\.[[:alnum:]]{3,4}|(\.[[:alnum:]]{2}){2})$', 'i'); 
Checking on valid domains, in my opinion, should be done in a second function, to keep the checks by itself simple, but that's probably a discussion about readability and taste.

How about checking a valid URL? I can reuse some parts of the e-mail example and only have to decide what type of URLs I want, for example "http://", "https://" and "ftp://", any subdomain and a "/" after the domain. Using the case insensitive match parameter, this shouldn't take too long, and I can use this thread's URL as a test value. But take a minute to figure that one out for yourself.


Does it look like this?
WITH t AS (SELECT 'Introduction to regular expressions ... last part. URL
             FROM dual
            UNION 
           SELECT 'http://x/'
             FROM dual
          )       
SELECT t.URL
  FROM t
 WHERE REGEXP_LIKE(t.URL, '^(https*|ftp)://(.+\.)*[[:alnum:]]+(\.[[:alnum:]]{3,4}|(\.[[:alnum:]]{2}){2})/', 'i')
 ;
Update: Improvements in 10g2

All of you, who are using 10g2 or XE (which includes some of 10g2 features) may want to take a look at several improvements in this version. First of all, there are new, perl influenced meta characters.

Rewriting my example from the first lesson, the WHERE clause would look like this:
 WHERE NOT REGEXP_LIKE(t.col1, '^\d+$')
Or my example with searching decimal numbers:
'^(\.\d+|\d+(\.\d*)?)$'
Saves some space, doesn't it? However, this will only work in 10g2 and future releases.

Some of those meta characters even include non matching lists, for example "\S" is equivalent to "[^ ]", so my example in the second part could be changed to:
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '\S')), 0)
  FROM dual
  ;
Other meta characters support search patterns in strings with newline characters. Just take a look at the link I've included.

Another interesting meta character is "?" non-greedy. In 10g2, "?" not only means 0 or 1 occurrence, it means also the first occurrence. Let me illustrate with a simple example:
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^.* +')
  FROM dual
  ;
This is old style, "greedy" search pattern, returning everything until the last space.
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^.* +?')
  FROM dual
  ;
In 10g2, you'd get only "Having " because of the non-greedy search operation. Simulating that behavior in 10g1, I'd have to change the pattern to this:
SELECT REGEXP_SUBSTR('Having fun with regular expressions', '^[^ ]+ +')
  FROM dual
  ;
Another new option is the "x" match parameter. It's purpose is to ignore whitespaces in the searched string. This would prove useful in ignoring trailing/leading spaces for example. Checking on unsigned integers with leading/trailing spaces would look like this:
SELECT REGEXP_SUBSTR(' 123 ', '^[0-9]+$', 1, 1, 'x')
  FROM dual
  ;
However, I've to be careful. "x" would also allow " 1 2 3 " to qualify as valid string.

I hope you enjoyed reading this introduction and hope you'll have some fun with using regular expressions.

C.

Fixed some typos ...
Message was edited by:
cd

Included 10g2 features
Message was edited by:
cd
  • 1. Re: Introduction to regular expressions ... last part.
    529476 Newbie
    Currently Being Moderated
    Awesoome Man!!!! Keep it on
  • 2. Re: Introduction to regular expressions ... last part.
    505805 Newbie
    Currently Being Moderated
    here's the third and final part of my introduction to regular expressions
    I bet there are intermediate and advance lessons, right?
    If so, maybe you should consider writeing your own book ;)
  • 3. Re: Introduction to regular expressions ... last part.
    KathrynS Newbie
    Currently Being Moderated
    Nevermind--I figured it out.

    Message was edited by:
    Kathryn S
  • 4. validation of item for using regexp
    523648 Newbie
    Currently Being Moderated
    hi i would like some help in converting the following by using regexp..

    IF INSTR(:BLOCK.ITEM,CHR(32),1,1) > 0 THEN
         MESSAGE( 'Please check blank space not allowed);
         RAISE FORM_TRIGGER_FAILURE;
    END IF;



    chr(32)= space character


    could you throw some light on how to use regexp in this condition..
  • 5. Re: validation of item for using regexp
    60660 Journeyer
    Currently Being Moderated
    According to your example, you're looking for the first occurrence of a space character. What improvement are you expecting by using a regular expression in this case?

    C.
  • 6. Re: validation of item for using regexp
    sgalaxy Journeyer
    Currently Being Moderated
    Ohhh CD!!!!!
    You converted a 'simple' matter - as it looks like- a 'real' science....!!!!!!

    Thanks , for your notes...
    Simon
  • 7. Re: validation of item for using regexp
    523648 Newbie
    Currently Being Moderated
    Hello,

    We are having a requirement where all the primary keys in our tables will be Alphanumeric,we can use regular expressions to match only alphanumeric numbers,will it be compitable with other languages?

    Our application is a multilingual application.

    Regards
    Amul
  • 8. Re: validation of item for using regexp
    60660 Journeyer
    Currently Being Moderated
    Hello,

    We are having a requirement where all the primary
    keys in our tables will be Alphanumeric,we can use
    regular expressions to match only alphanumeric
    numbers,will it be compitable with other languages?

    Our application is a multilingual application.
    Alphanumeric is alphanumeric, at least in those languages I know.
    What is an alphanumeric number, or are you refering to scientific notation, for example 1.3E10?

    C.
  • 9. Re: validation of item for using regexp
    523648 Newbie
    Currently Being Moderated
    no i am not referrring to scientific notation...i meant arabic.will arabic numbers and characters be interpreted by regexp as alphanum?
  • 10. Re: validation of item for using regexp
    60660 Journeyer
    Currently Being Moderated
    no i am not referrring to scientific notation...i
    meant arabic.will arabic numbers and characters be
    interpreted by regexp as alphanum?
    Why don't you test it? Definition for character classes can be found here.

    C.
  • 11. Re: Introduction to regular expressions ... last part.
    572471 Newbie
    Currently Being Moderated
    I have a list of numbers that consists of 5 symbols ('1' or '0').
    It is possible (but not obligatory) that somwhere inside the string there could be only one '(!)'.
    Can I write this condition with only one reg expr in Oracle (regexp_substr in my example)?
    With multiple regular expressions it would look like:
      
    with t as (select '10(!)010' num from dual union all
    select '1112(!)0' from dual union all --incorrect because of '2'
    select '(!)10010' from dual union all
    select '10010(!)' from dual union all
    select '10010' from dual union all
    select '13001' from dual union all -- incorrect because of '3'
    select '100!01' from dual union all --incorrect because of ! without (!)
    select '100(!)1(!)1' from dual union all -- incorrect because of two occurencies of (!)
    select '1001(!)10' from dual) --incorrect because of length=6
    --end of test data
    select t.*, regexp_substr(num, '^([10]{5}|[10]*\(!\)[10]*)$')
      from t
     where regexp_like(num, '^[10]{5}$')
        or (regexp_like(num, '^[10]*\(!\)[10]*$') and length(num) = 8)
  • 12. Re: Introduction to regular expressions ... last part.
    BluShadow Guru Moderator
    Currently Being Moderated
    where regexp_like(replace(num,'(!)'), '^[10]{5}$') and length(num) <= 8
  • 13. Re: Introduction to regular expressions ... last part.
    572471 Newbie
    Currently Being Moderated
    Can I write this condition with only one reg expr in Oracle (regexp_substr in my example)?
    I meant to use only regexp_substr in select clause and without regexp_like in where clause.

    but for better understanding what I'd like to get
    next example:

    a have strings of two blocks separated by space.
    in the first block 5 symbols of [01] in the second block 3 symbols of [01].
    In the first block it is optional to meet one (!), in the second block it is optional to meet one (>).
    The idea is to find such strings with only one reg expr using regexp_substr in the select clause, so if the string does not satisfy requirments should be passed out null in the result set.
    with t as (select '10(!)010 10(>)1' num from dual union all
    select '1112(!)0 111' from dual union all --incorrect because of '2'
    select '(!)10010 011' from dual union all
    select '10010(!) 101' from dual union all
    select '10010 100(>)' from dual union all
    select '13001 110' from dual union all -- incorrect because of '3'
    select '100!01 100' from dual union all --incorrect because of ! without (!)
    select '100(!)1(!)1 101' from dual union all -- incorrect because of two occurencies of (!)
    select '1001(!)10 101' from dual union all --incorrect because of length of block1=6
    select '1001(!)10 1011' from dual union all) --incorrect because of length of block2=4
    select '10110 1(>)11(>)0' from dual union all)--incorrect because of two occurencies of (>)
    select '1001(>)1 11(!)0' from dual)--incorrect because (!) and (>) are met not in their blocks
    --end of test data
  • 14. Re: Introduction to regular expressions ... last part.
    60660 Journeyer
    Currently Being Moderated
    Or using just one regex function and LENGTH:
     WHERE REGEXP_LIKE(num, '^[10]*(\(!\))?[10]*$') AND LENGTH (num) IN (5, 8)
    C.
1 2 3 Previous Next