This discussion is archived
1 2 Previous Next 29 Replies Latest reply: Apr 20, 2007 12:09 PM by 572471 RSS

Introduction to regular expressions ... continued.

60660 Journeyer
Currently Being Moderated
After some very positive feedback from Introduction to regular expressions ... I'm now continuing on this topic for the interested audience. As always, if you have questions or problems that you think could be solved through regular expression, please post them.

Having fun with regular expressions - Part 2

Finishing my example with decimal numbers, I thought about a method to test regular expressions. A question from another user who was looking for a way to show all possible combinations inspired me in writing a small package.
CREATE OR REPLACE PACKAGE regex_utils AS
  -- Regular Expression Utilities
  -- Version 0.1 
  TYPE t_outrec IS RECORD(
    data VARCHAR2(255)
  );
  TYPE t_outtab IS TABLE OF t_outrec;
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED;
END regex_utils;
/

CREATE OR REPLACE PACKAGE BODY regex_utils AS
-- FUNCTION gen_data returns a collection of generated varchar2 elements
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED
  IS
    TYPE t_counter IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
    --
    v_counter t_counter;
    v_exit    BOOLEAN;
    v_string  VARCHAR2(255);
    v_outrec  t_outrec;
  BEGIN 
    FOR max_length IN 1..p_length  
    LOOP
      -- init counter loop
      FOR i IN 1..max_length
      LOOP
        v_counter(i) := 1;
      END LOOP; 
      --
      -- start data generation loop
      --
      v_exit := FALSE;
      --
      WHILE NOT v_exit
      LOOP
        -- start generation
        v_string := '';
        --
        FOR i IN 1..max_length
        LOOP
          v_string := v_string || SUBSTR(p_charset, v_counter(i), 1);
        END LOOP;
        --
        -- set outgoing record
        --
        v_outrec.data := v_string;
        --
        -- now pipe the result 
        --
        PIPE ROW(v_outrec);
        --
        -- increment loop
        --
        <<inc_loop>>
        FOR i IN REVERSE 1..max_length
        LOOP
          v_counter(i) := v_counter(i) + 1;      
          IF v_counter(i) > LENGTH(p_charset) THEN 
             IF i > 1 THEN
                v_counter(i) := 1;
             ELSE
                v_exit := TRUE;   
             END IF;
          ELSE
             -- no further processing required
             EXIT inc_loop;   
          END IF;   
        END LOOP;         
        --
      END LOOP;  
    END LOOP;  
  END gen_data;
END regex_utils;
/
This package is a brute force string generator using all possible combinations of a characters in a string up to a maximum length. Together with the regular expressions, I can now show what combinations my solution would allow to pass. But see for yourself:
SELECT *
  FROM (SELECT data col1 
          FROM TABLE(regex_utils.gen_data('+-.0', 5))
       ) t
 WHERE REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1, 
                                     '^([+-]?[^+-]+|[^+-]+[+-]?)$'
                                    ), 
                       ' '
                      ), 
                   '^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
                   )
;
You will see some results, which are perfectly valid for my definition of decimal numbers but haven't been mentioned, like '000' or '+.00'. From now on I will also use this package to verify the solutions I'll present to you and hopefully reduce my share of typos.

Counting and finding certain characters or words in a string can be a tedious task. I'll show you how it's done with regular expressions. I'll start with an easy example, count all spaces in the string "Having fun with regular expressions.":
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '[^ ]')), 0)
  FROM dual
  ;
No surprise there. I'm replacing all characters except spaces with a null string. Since REGEXP_REPLACE assumes a NULL string as replacement argument, I can save on adding a third argument, which would look like this:
REGEXP_REPLACE('Having fun with regular expressions', '[^ ]', '')
So REPLACE will return all the spaces which we can count with the LENGTH function. If there aren't any, I will get a NULL string, which is checked by the NVL function. If you want you can play around by changing the space character to somethin else.

A variation of this theme could be counting the number of words. Counting spaces and adding 1 to this result could be misleading if there are duplicate spaces. Thanks to regular expressions, I can of course eliminate duplicates.

Using the old method on the string "Having fun with regular expressions" would return anything but the right number. This is, where Backreferences come into play. REGEXP_REPLACE uses them in the replacement argument, a backslash plus a single digit, like this: '\1'. To reference a string in a search pattern, I have to use subexpressions (remember the round brackets?).
SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions', '( )\1*|.', '\1')))
  FROM dual
  ;
You may have noticed that I changed from using the "^" as a NOT operator to using the "|" OR operator and the "." any character placeholder. This neat little trick allows to filter all other characters except the one we're looking in the first place. "\1" as backreference is outside of our subexpression since I don't want to count the trailing spaces and is used both in the search pattern and the replacement argument.

Still I'm not satisfied with this: What about leading/trailing blanks, what if there are any special characters, numbers, etc.? Finally, it's time to only count words. For the purpose of this demonstration, I define a word as one or more consecutive letters. If by now you're already thinking in regular expressions, the solution is not far away. One hint: you may want to check on the "i" match parameter which allows for case insensitive search. Another one: You won't need a back reference in the search pattern this time.



Let's compare our solutions than, shall we?
SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions.  !', 
                                 '([a-z])+|.', '\1', 1, 0, 'i')), 0)
  FROM dual;
This time I don't use a backreference, the "+" operator (remember? 1 or more) will suffice. And since I want to count the occurences, not the letters, I moved the "+" meta character outside of the subexpression. The "|." trick again proved to be useful.

Case insensitive search does have its merits. It will only search but not transform the any found substring. If I want, for example, extract any occurence of the word fun, I'll just use the "i" match parameter and get this substring, whether it's written as "Fun", "FUN" or "fun". Can be very useful if you're looking for example for names of customers, streets, etc.

Enough about counting, how about finding? What if I want to know the last occurence of a certain character or string, for example the postition of the last space in this string "Where is the last space?"?

Addendum: Thanks to another forum member, I should mention that using the INSTR function can do a reverse search by itself.[i]
WITH t AS (SELECT 'Where is the last space?' col1
             FROM dual)
SELECT INSTR(col1, ' ', -1)
  FROM DUAL;
Now regular expressions are powerful, but there is no parameter that allows us to reverse the search direction. However, remembering that we have the "$" meta character that means (until the) end of string, all I have to do is use a search pattern that looks for a combination of space and non-space characters including the end of a string. Now compare the REGEXP_INSTR function to the previous solution:
SELECT REGEXP_INSTR(t.col1, ' [^ ]*$')                        
  FROM t;
So in this case, it'll remain a matter of taste what you want to use. If the search pattern has to look for the last occurrence of another regular expression, this is the way to solve such a requirement.

One more thing about backreferences. They can be used for a sort of primitive "string swapping". If for example you have to transform column values like swapping first and last name, backreferenc is your friend. Here's an example:
SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1') 
  FROM dual
  ;
What about middle names, for example 'John J. Doe'? Look for yourself, it still works.

You can even use that for strings with delimiters, for example reversing delimited "fields" like in this string '10~20~30~40~50' into '50~40~30~20~10'. Using REVERSE, I would get '05~04~03~02~01', so there has to be another way. Using backreferences however is limited to 9 subexpressions, which limits the following solution a bit, if you need to process strings with more than 9 fields. If you want, you can think this example through and see if your solution matches mine.
SELECT REGEXP_REPLACE('10~20~30~40~50', 
                      '^(.*)~(.*)~(.*)~(.*)~(.*)$', 
                      '\5~\4~\3~\2~\1'
                     )
  FROM dual;
After what you've learned so far, that wasn't too hard, was it? Enough for now ...


Continued in Introduction to regular expressions ... last part..

C.

Fixed some typos and a flawed example ...
cd
1 2 Previous Next