# Introduction to regular expressions ... continued.

**cd_2**Oct 18, 2006 10:52 AM

After some very positive feedback from Introduction to regular expressions ... I'm now continuing on this topic for the interested audience. As always, if you have questions or problems that you think could be solved through regular expression, please post them.

Finishing my example with decimal numbers, I thought about a method to test regular expressions. A question from another user who was looking for a way to show all possible combinations inspired me in writing a small package.

Counting and finding certain characters or words in a string can be a tedious task. I'll show you how it's done with regular expressions. I'll start with an easy example, count all spaces in the string "Having fun with regular expressions.":

A variation of this theme could be counting the number of words. Counting spaces and adding 1 to this result could be misleading if there are duplicate spaces. Thanks to regular expressions, I can of course eliminate duplicates.

Using the old method on the string "Having fun with regular expressions" would return anything but the right number. This is, where

Still I'm not satisfied with this: What about leading/trailing blanks, what if there are any special characters, numbers, etc.? Finally, it's time to only count words. For the purpose of this demonstration, I define a word as one or more consecutive letters. If by now you're already thinking in regular expressions, the solution is not far away. One hint: you may want to check on the "i" match parameter which allows for case insensitive search. Another one: You won't need a back reference in the search pattern this time.

Let's compare our solutions than, shall we?

Case insensitive search does have its merits. It will only search but not transform the any found substring. If I want, for example, extract any occurence of the word fun, I'll just use the "i" match parameter and get this substring, whether it's written as "Fun", "FUN" or "fun". Can be very useful if you're looking for example for names of customers, streets, etc.

Enough about counting, how about

**Having fun with regular expressions**- Part 2Finishing my example with decimal numbers, I thought about a method to test regular expressions. A question from another user who was looking for a way to show all possible combinations inspired me in writing a small package.

```
CREATE OR REPLACE PACKAGE regex_utils AS
-- Regular Expression Utilities
-- Version 0.1
TYPE t_outrec IS RECORD(
data VARCHAR2(255)
);
TYPE t_outtab IS TABLE OF t_outrec;
FUNCTION gen_data(
p_charset IN VARCHAR2 -- character set that is used for generation
, p_length IN NUMBER -- length of the generated
) RETURN t_outtab PIPELINED;
END regex_utils;
/
CREATE OR REPLACE PACKAGE BODY regex_utils AS
-- FUNCTION gen_data returns a collection of generated varchar2 elements
FUNCTION gen_data(
p_charset IN VARCHAR2 -- character set that is used for generation
, p_length IN NUMBER -- length of the generated
) RETURN t_outtab PIPELINED
IS
TYPE t_counter IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
--
v_counter t_counter;
v_exit BOOLEAN;
v_string VARCHAR2(255);
v_outrec t_outrec;
BEGIN
FOR max_length IN 1..p_length
LOOP
-- init counter loop
FOR i IN 1..max_length
LOOP
v_counter(i) := 1;
END LOOP;
--
-- start data generation loop
--
v_exit := FALSE;
--
WHILE NOT v_exit
LOOP
-- start generation
v_string := '';
--
FOR i IN 1..max_length
LOOP
v_string := v_string || SUBSTR(p_charset, v_counter(i), 1);
END LOOP;
--
-- set outgoing record
--
v_outrec.data := v_string;
--
-- now pipe the result
--
PIPE ROW(v_outrec);
--
-- increment loop
--
<<inc_loop>>
FOR i IN REVERSE 1..max_length
LOOP
v_counter(i) := v_counter(i) + 1;
IF v_counter(i) > LENGTH(p_charset) THEN
IF i > 1 THEN
v_counter(i) := 1;
ELSE
v_exit := TRUE;
END IF;
ELSE
-- no further processing required
EXIT inc_loop;
END IF;
END LOOP;
--
END LOOP;
END LOOP;
END gen_data;
END regex_utils;
/
```

This package is a *brute force*string generator using all possible combinations of a characters in a string up to a maximum length. Together with the regular expressions, I can now show what combinations my solution would allow to**pass**. But see for yourself:```
SELECT *
FROM (SELECT data col1
FROM TABLE(regex_utils.gen_data('+-.0', 5))
) t
WHERE REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1,
'^([+-]?[^+-]+|[^+-]+[+-]?)$'
),
' '
),
'^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
)
;
```

You will see some results, which are perfectly valid for my definition of decimal numbers but haven't been mentioned, like '000' or '+.00'. From now on I will also use this package to verify the solutions I'll present to you and hopefully reduce my share of typos.Counting and finding certain characters or words in a string can be a tedious task. I'll show you how it's done with regular expressions. I'll start with an easy example, count all spaces in the string "Having fun with regular expressions.":

```
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '[^ ]')), 0)
FROM dual
;
```

No surprise there. I'm replacing all characters except spaces with a null string. Since REGEXP_REPLACE assumes a NULL string as replacement argument, I can save on adding a third argument, which would look like this:`REGEXP_REPLACE('Having fun with regular expressions', '[^ ]', '')`

So REPLACE will return all the spaces which we can count with the LENGTH function. If there aren't any, I will get a NULL string, which is checked by the NVL function. If you want you can play around by changing the space character to somethin else.A variation of this theme could be counting the number of words. Counting spaces and adding 1 to this result could be misleading if there are duplicate spaces. Thanks to regular expressions, I can of course eliminate duplicates.

Using the old method on the string "Having fun with regular expressions" would return anything but the right number. This is, where

*Backreference*s come into play. REGEXP_REPLACE uses them in the replacement argument, a backslash plus a single digit, like this: '\1'. To reference a string in a search pattern, I have to use subexpressions (remember the round brackets?).```
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '( )\1*|.', '\1')))
FROM dual
;
```

You may have noticed that I changed from using the "^" as a NOT operator to using the "|" OR operator and the "." any character placeholder. This neat little trick allows to filter all other characters except the one we're looking in the first place. "\1" as backreference is outside of our subexpression since I don't want to count the trailing spaces and is used both in the search pattern and the replacement argument. Still I'm not satisfied with this: What about leading/trailing blanks, what if there are any special characters, numbers, etc.? Finally, it's time to only count words. For the purpose of this demonstration, I define a word as one or more consecutive letters. If by now you're already thinking in regular expressions, the solution is not far away. One hint: you may want to check on the "i" match parameter which allows for case insensitive search. Another one: You won't need a back reference in the search pattern this time.

Let's compare our solutions than, shall we?

```
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions. !',
'([a-z])+|.', '\1', 1, 0, 'i')), 0)
FROM dual;
```

This time I don't use a backreference, the "+" operator (remember? 1 or more) will suffice. And since I want to count the occurences, not the letters, I moved the "+" meta character outside of the subexpression. The "|." trick again proved to be useful.Case insensitive search does have its merits. It will only search but not transform the any found substring. If I want, for example, extract any occurence of the word fun, I'll just use the "i" match parameter and get this substring, whether it's written as "Fun", "FUN" or "fun". Can be very useful if you're looking for example for names of customers, streets, etc.

Enough about counting, how about

*finding*? What if I want to know the**last**occurence of a certain character or string, for example the postition of the last space in this string "Where is the last space?"?*Addendum: Thanks to another forum member, I should mention that using the INSTR function can do a reverse search by itself.[i]*

One more thing about backreferences. They can be used for a sort of primitive "string swapping". If for example you have to transform column values like swapping first and last name, backreferenc is your friend. Here's an example:

You can even use that for strings with delimiters, for example reversing delimited "fields" like in this string '10~20~30~40~50' into '50~40~30~20~10'. Using REVERSE, I would get '05~04~03~02~01', so there has to be another way. Using backreferences however is limited to 9 subexpressions, which limits the following solution a bit, if you need to process strings with more than 9 fields. If you want, you can think this example through and see if your solution matches mine.

Continued in Introduction to regular expressions ... last part..

C.

Fixed some typos and a flawed example ...

cd```
WITH t AS (SELECT 'Where is the last space?' col1
FROM dual)
SELECT INSTR(col1, ' ', -1)
FROM DUAL;
```

Now regular expressions are powerful, but there is no parameter that allows us to reverse the search direction. However, remembering that we have the "$" meta character that means (until the) end of string, all I have to do is use a search pattern that looks for a combination of space and non-space characters including the end of a string. Now compare the REGEXP_INSTR function to the previous solution:```
SELECT REGEXP_INSTR(t.col1, ' [^ ]*$')
FROM t;
```

So in this case, it'll remain a matter of taste what you want to use. If the search pattern has to look for the last occurrence of another regular expression, this is the way to solve such a requirement.One more thing about backreferences. They can be used for a sort of primitive "string swapping". If for example you have to transform column values like swapping first and last name, backreferenc is your friend. Here's an example:

```
SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1')
FROM dual
;
```

What about middle names, for example 'John J. Doe'? Look for yourself, it still works.You can even use that for strings with delimiters, for example reversing delimited "fields" like in this string '10~20~30~40~50' into '50~40~30~20~10'. Using REVERSE, I would get '05~04~03~02~01', so there has to be another way. Using backreferences however is limited to 9 subexpressions, which limits the following solution a bit, if you need to process strings with more than 9 fields. If you want, you can think this example through and see if your solution matches mine.

```
SELECT REGEXP_REPLACE('10~20~30~40~50',
'^(.*)~(.*)~(.*)~(.*)~(.*)$',
'\5~\4~\3~\2~\1'
)
FROM dual;
```

After what you've learned so far, that wasn't too hard, was it? Enough for now ...Continued in Introduction to regular expressions ... last part..

C.

Fixed some typos and a flawed example ...

cd

- 19247 Views
- Tags: none (add)