SQL & PL/SQL

Introduction to regular expressions ...

cd_2, Sep 27 2006 (edited Mar 13 2023)

I'm well aware that there are already some articles on this topic, but some people asked me to share some of my knowledge on it. Please take a look at this first part and let me know if you find it useful. If so, I'll continue writing more parts using more and more complicated expressions - if you have questions or problems that you think could be solved with regular expressions, please post them.

Introduction

Oracle has always provided some character/string functions in its PL/SQL command set, such as SUBSTR, REPLACE or TRANSLATE. With 10g, Oracle finally gave us, the users, the developers and of course the DBAs, regular expressions. However, regular expressions, due to their sometimes cryptic rules, seem to be overlooked quite often, despite the existence of some very interesting use cases. Being one of the advocates of regular expressions, I thought I'd give the interested audience an introduction to these new functions in several installments.

Having fun with regular expressions - Part 1

Oracle offers the use of regular expressions through several functions: REGEXP_INSTR, REGEXP_SUBSTR, REGEXP_REPLACE and REGEXP_LIKE. The second part of each function name already gives away its purpose: INSTR for finding a position inside a string, SUBSTR for extracting a part of a string, REPLACE for replacing parts of a string. REGEXP_LIKE is a special case since it can be compared to the LIKE operator and is therefore usually used in comparisons like IF statements or WHERE clauses.
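As a quick orientation, here is a minimal sketch that calls all four functions against one literal. The sample string 'abc123def' and the column aliases are my own choices for illustration, not part of the original examples.

```sql
-- Sample literal and aliases are illustrative assumptions.
SELECT REGEXP_INSTR('abc123def', '[0-9]+')        AS pos  -- 4: where the first run of digits starts
     , REGEXP_SUBSTR('abc123def', '[0-9]+')       AS sub  -- '123': the matched digits
     , REGEXP_REPLACE('abc123def', '[0-9]+', '#') AS rep  -- 'abc#def': the digits replaced by '#'
  FROM dual
 WHERE REGEXP_LIKE('abc123def', '[0-9]')                  -- true: the string contains a digit
;
```

Note that REGEXP_LIKE is a condition, so it appears in the WHERE clause rather than in the select list.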

Regular expressions excel, in my opinion, at searching and extracting strings: finding or replacing certain strings, or checking for certain formatting criteria. They're not very good at formatting strings themselves, except for some special cases I'm going to demonstrate.

If you're not familiar with regular expressions, you should take a look at the definition in Oracle's user guide, Using Regular Expressions With Oracle Database, and please note that there have been some changes and advancements in 10g Release 2. I'll provide examples that should work on both versions.

Some of you probably already encountered this problem: checking a number inside a string, because, for whatever reason, a column was defined as VARCHAR2 and not as NUMBER as one would have expected.

Let's check for all rows where column col1 does NOT hold an unsigned integer. I'll use this SELECT for demonstrating different values and search patterns:

WITH t AS (SELECT '456' col1
             FROM dual
            UNION 
           SELECT '123x'
             FROM dual
            UNION   
           SELECT 'x123'
             FROM dual
            UNION  
           SELECT 'y'
             FROM dual
            UNION  
           SELECT '+789'
             FROM dual
            UNION  
           SELECT '-789'
             FROM dual
            UNION  
           SELECT '159-'
             FROM dual
            UNION  
           SELECT '-1-'
             FROM dual
          )     
SELECT t.col1
  FROM t
 WHERE NOT REGEXP_LIKE(t.col1, '^[0-9]+$')
;

Let's take a look at the 2nd argument of this REGEXP function: '^[0-9]+$'. Translated, it means: start at the beginning of the string, check whether there are one or more characters in the range between '0' and '9' (also called a matching character list) until the end of this string. "^", "[", "]", "+", "$" are all metacharacters.

To understand regular expressions, you have to "think" in regular expressions. Each regular expression tries to "fit" an available string into its pattern and returns a result, being successful or not, depending on the function. The "art" of using regular expressions is to construct the right search pattern for a certain task. Functions like TRANSLATE or REPLACE have already taught you to use search patterns; regular expressions are just an extension of this paradigm. Another side note: most of the search patterns are placeholders for single characters, not strings.

I'll take this example a bit further. What would happen if we removed the "$" in our example? "$" means: (until the) end of a string. Without it, this expression would only search for digits from the beginning until it encounters either another character or the end of the string. So this time, '123x' would be removed from the selection since it does fit into the pattern.

Another change: we will keep the "$" but remove the "^". This character has several meanings, but in this case it declares: (start from the) beginning of a string. Without it, the function will search for a part of a string that has only digits until the end of the searched string. 'x123' would now be removed from our selection.

Now there's a question: what happens if I remove both "^" and "$"? Well, just think about it. We now ask to find any string that contains at least one digit, so both '123x' and 'x123' will not show up in the result.
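To see the effect of the anchors side by side, here is a minimal sketch that evaluates all four variants as Y/N flags. The column aliases and the reduced set of test values are assumptions made for this illustration.

```sql
-- Y means the value matches the pattern; aliases are illustrative only.
WITH t AS (SELECT '456' col1 FROM dual UNION ALL
           SELECT '123x'     FROM dual UNION ALL
           SELECT 'x123'     FROM dual UNION ALL
           SELECT 'y'        FROM dual
          )
SELECT t.col1
     , CASE WHEN REGEXP_LIKE(t.col1, '^[0-9]+$') THEN 'Y' ELSE 'N' END both_anchors -- Y only for '456'
     , CASE WHEN REGEXP_LIKE(t.col1, '^[0-9]+')  THEN 'Y' ELSE 'N' END start_only   -- Y for '456' and '123x'
     , CASE WHEN REGEXP_LIKE(t.col1, '[0-9]+$')  THEN 'Y' ELSE 'N' END end_only     -- Y for '456' and 'x123'
     , CASE WHEN REGEXP_LIKE(t.col1, '[0-9]+')   THEN 'Y' ELSE 'N' END no_anchors   -- Y for all but 'y'
  FROM t
;
```

Since the original query uses NOT REGEXP_LIKE, every Y in a column corresponds to a row that would drop out of the result for that pattern.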

So what if I want to look for signed integers? The "+" is also used as a metacharacter in a search expression, so escaping is the name of the game. We'll just use '^\+[0-9]+$'. Did you notice the "\" before the first "+"? This is now a search pattern for the plus sign.

Should signed integers include negative numbers as well? Of course they should, and I'll once again use a matching character list. In this list, I don't need to do escaping, although it is possible. The result string would now look like this: '^[+-]?[0-9]+$'. Did you notice the "?"? This is another metacharacter that changes the placeholder for plus and minus to an optional placeholder, which means: if there's a "+" or "-", that's ok, if there's none, that's also ok. Only if there's a different character, then again the search pattern will fail.

Addendum: From this point on, I found a mistake in my examples. If you had tested my old examples with test data that included strings with multiple signs, like "--", "-+" or "++", those strings would have been accepted as valid and therefore filtered out by the SELECT statement. I mistakenly used the "*" instead of the "?" operator. The reason why this is a bad idea can also be found in the user guide: the "*" metacharacter is defined as zero or more occurrences.
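The difference between "?" and "*" can be checked with a small sketch like the following. The test values with doubled signs and the column aliases are assumptions added for this illustration.

```sql
-- Compares the optional sign ("?") with the repeated sign ("*").
WITH t AS (SELECT '+789' col1 FROM dual UNION ALL
           SELECT '-789'      FROM dual UNION ALL
           SELECT '--1'       FROM dual UNION ALL
           SELECT '-+1'       FROM dual
          )
SELECT t.col1
     , CASE WHEN REGEXP_LIKE(t.col1, '^[+-]?[0-9]+$') THEN 'Y' ELSE 'N' END optional_sign -- Y for '+789' and '-789' only
     , CASE WHEN REGEXP_LIKE(t.col1, '^[+-]*[0-9]+$') THEN 'Y' ELSE 'N' END repeated_sign -- Y for all four values
  FROM t
;
```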

Looking at the values, one could ask the question: what about the integers with a trailing sign? Quite simple, right? Let's just add another '[+-]?' and the search pattern would look like this: '^[+-]?[0-9]+[+-]?$'.

Wait a minute, what happened to the row with the column value "-1-"?

You probably already guessed it: the new pattern qualifies this one also as a valid string. I could now split this pattern into several conditions combined through a logical OR, but there's something even better: a logical OR inside the regular expression. Its symbol is "|", the pipe sign.

Changing the search pattern again to something like this '^[+-]?[0-9]+$|^[0-9]+[+-]?$' would now return the "-1-" value. Do I have to duplicate the same elements like "^" and "$"? What about more complicated, repeating elements in future examples? That's where subexpressions/grouping comes into play. If I want only certain parts of the search pattern to use an OR operator, I can put those inside round brackets. '^([+-]?[0-9]+|[0-9]+[+-]?)$' serves the same purpose and allows for further checks without duplicating the whole pattern.
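A minimal sketch comparing the pattern with two optional signs against the grouped OR pattern; the column aliases and the choice of three test values are assumptions for this illustration.

```sql
-- Y means the value is accepted as a valid signed integer by that pattern.
WITH t AS (SELECT '-789' col1 FROM dual UNION ALL
           SELECT '159-'      FROM dual UNION ALL
           SELECT '-1-'       FROM dual
          )
SELECT t.col1
     , CASE WHEN REGEXP_LIKE(t.col1, '^[+-]?[0-9]+[+-]?$')
            THEN 'Y' ELSE 'N' END two_optional_signs  -- Y for all three, including '-1-'
     , CASE WHEN REGEXP_LIKE(t.col1, '^([+-]?[0-9]+|[0-9]+[+-]?)$')
            THEN 'Y' ELSE 'N' END grouped_or          -- Y for '-789' and '159-', N for '-1-'
  FROM t
;
```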

Now looking for integers is nice, but what about decimal numbers? Those may be a bit more complicated, but all I have to do is again to think in (meta) characters. I'll just use an example where the decimal point is represented by ".", which again needs escaping, since it's also the place holder in regular expressions for "any character".

Valid decimals in my example would be ".0", "0.0", "0.", "0" (integer of course) but not ".". If you want, you can test it with the TO_NUMBER function. Finding such an unsigned decimal number could then be formulated like this: from the beginning of a string we will either allow a decimal point followed by at least one digit OR at least one digit plus an optional decimal point followed by any number of digits (including none). Think about it for a minute, how would you formulate such a search pattern?

Compare your solution to this one:

'^(\.[0-9]+|[0-9]+(\.[0-9]*)?)$'

Addendum: Here I have to use both "?" and "*" to make sure that I can have 0 to many digits after the decimal point, but only 0 to 1 occurrences of this substring. Otherwise, strings like "1.9.9.9" would be possible if I wrote it like this:

'^(\.[0-9]+|[0-9]+(\.[0-9]*)*)$'
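Both decimal patterns can be compared with a sketch like this one. The test values follow the ones named in the text plus "1.9.9.9"; the column aliases are assumptions for this illustration.

```sql
-- Y means the value is accepted as an unsigned decimal by that pattern.
WITH t AS (SELECT '.0' col1  FROM dual UNION ALL
           SELECT '0.0'      FROM dual UNION ALL
           SELECT '0.'       FROM dual UNION ALL
           SELECT '0'        FROM dual UNION ALL
           SELECT '.'        FROM dual UNION ALL
           SELECT '1.9.9.9'  FROM dual
          )
SELECT t.col1
     , CASE WHEN REGEXP_LIKE(t.col1, '^(\.[0-9]+|[0-9]+(\.[0-9]*)?)$')
            THEN 'Y' ELSE 'N' END with_question_mark  -- N for '.' and '1.9.9.9', Y for the rest
     , CASE WHEN REGEXP_LIKE(t.col1, '^(\.[0-9]+|[0-9]+(\.[0-9]*)*)$')
            THEN 'Y' ELSE 'N' END with_asterisk       -- N for '.' only; '1.9.9.9' slips through
  FROM t
;
```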

Some of you now might say: Hey, what about signed decimal numbers? You could of course combine all the ideas so far and you will end up with a very long and almost unreadable search pattern, or you start combining several regular expression functions. Think about it: Why put all the search patterns into one function? Why not split those into several steps like "check for a valid decimal" and "check for sign".

I'll just use another SELECT to show what I want to do:

WITH t AS (SELECT '0' col1
             FROM dual
            UNION
           SELECT '0.'  
             FROM dual
            UNION
           SELECT '.0'  
             FROM dual
            UNION
           SELECT '0.0'  
             FROM dual
            UNION
           SELECT '-1.0'  
             FROM dual
            UNION
           SELECT '.1-'  
             FROM dual
            UNION
           SELECT '.'  
             FROM dual
            UNION
           SELECT '-1.1-'  
             FROM dual
          )  
SELECT t.*
  FROM t
;

From this select, the only rows I need to find are those with the column values "." and "-1.1-". I'll start this with a check for valid signs. Since I want to combine this with the check for valid decimals, I'll first try to extract a substring with valid signs through the REGEXP_SUBSTR function:

NVL(REGEXP_SUBSTR(t.col1, '^([+-]?[^+-]+|[^+-]+[+-]?)$'), ' ')

Remember the OR operator and the matching character lists? But several "^"? Some of the metacharacters inside a search pattern can have different meanings, depending on their position and combination with other metacharacters. As the first character inside a matching character list, "^" negates the list, so [^+-] stands for any character that is not "+" or "-". In this case, the pattern translates into: from the beginning of the string, search for an optional "+" or "-" followed by at least one character that is not "+" or "-", up to the end of the string. The second pattern after the "|" OR operator does the same for a sign at the end of the string.

This only checks for a sign, but not whether there are also only digits and a decimal point inside the string. If the search pattern fails, for example when we have more than one sign like in "-1.1-", the function returns NULL. NULL and LIKE don't go together very well, so we'll just add NVL with a default value that tells the LIKE to ignore this string, in this case a space.
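To make the intermediate result visible, here is a minimal sketch that shows what the NVL/REGEXP_SUBSTR combination returns. The square brackets around the result, the alias and the extra test value 'abc' are assumptions added for this illustration.

```sql
-- Brackets are only there to make the single-space default visible.
WITH t AS (SELECT '-1.0' col1 FROM dual UNION ALL
           SELECT '.1-'       FROM dual UNION ALL
           SELECT '-1.1-'     FROM dual UNION ALL
           SELECT 'abc'       FROM dual
          )
SELECT t.col1
     , '[' || NVL(REGEXP_SUBSTR(t.col1, '^([+-]?[^+-]+|[^+-]+[+-]?)$'), ' ') || ']' sign_check
  FROM t
;
-- '-1.0'  -> [-1.0]
-- '.1-'   -> [.1-]
-- '-1.1-' -> [ ]     (signs on both ends: REGEXP_SUBSTR returns NULL, NVL supplies the space)
-- 'abc'   -> [abc]   (passes, because this step only looks at signs, not at digits)
```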

All we have to do now is to combine the check for the sign and the check for a valid decimal number, but don't forget an option for the signs at the beginning or end of the string, otherwise your second check will fail on the signed decimals. Are you ready?

Does your solution look a bit like this?

 WHERE NOT REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1, 
                           '^([+-]?[^+-]+|[^+-]+[+-]?)$'), 
                       ' '), 
                       '^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
                      )

Now the optional sign checks in the REGEXP_LIKE argument can be added to both ends, since the REGEXP_SUBSTR won't allow any string with signs on both ends. Thinking in regular expressions again.
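Put together with the test data from the second SELECT, the complete statement could look like the following sketch. It only combines pieces already shown; UNION ALL is used here instead of UNION, which is my own simplification.

```sql
-- Sign check (REGEXP_SUBSTR) feeding the decimal check (REGEXP_LIKE).
WITH t AS (SELECT '0' col1   FROM dual UNION ALL
           SELECT '0.'       FROM dual UNION ALL
           SELECT '.0'       FROM dual UNION ALL
           SELECT '0.0'      FROM dual UNION ALL
           SELECT '-1.0'     FROM dual UNION ALL
           SELECT '.1-'      FROM dual UNION ALL
           SELECT '.'        FROM dual UNION ALL
           SELECT '-1.1-'    FROM dual
          )
SELECT t.col1
  FROM t
 WHERE NOT REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1,
                           '^([+-]?[^+-]+|[^+-]+[+-]?)$'),
                       ' '),
                       '^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
                      )
;
-- Returns the two invalid values: '.' and '-1.1-'
```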

Continued in https://forums.oracle.com/ords/apexds/post/introduction-to-regular-expressions-continued-9561

Fixed some embarrassing typos ... and mistakes.
cd

[Edited by BluShadow 13/03/2023: Fix links in new forum platform]

473610

hi ,

Cd, it was very useful - you can keep the postings regularly updated.

cheers
Nirmal

94799

Go for it.

The more clear, well explained examples of this subject the better - and you seem to be as well-qualified as anyone around here to be explaining this.

BluShadow

Looks good to me. Keep it up.

21205

To be continued ... ?

Absolutely!

Muthukumar S

Hi CD,

Too good. If you have a blog, let us know so that we can check it for regular updates.

Regards,
S.Muthukumar.

BluShadow

CD, can I suggest you just keep adding to this thread, so we can bookmark it and keep coming back for the latest. :)

RadhakrishnaSarma

Vow! Too good! Another AskTom?

Keep adding to this thread and we will come back to this for second volume of KT (Knowledge Transfer).

Cheers
Sarma.

cd_2

Since I don't have a blog (OMG, I'm so backwards ... ;-)), I'd second your suggestion.

@Radhakrishna: g Just a developer with a soft spot for regex.

C.

Message was edited by:
cd

32685

Excellent write up CD. Very nice indeed. Hopefully you'll be completing parts 2 and 3 some time soon. And with any luck, your article will encourage others to do the same....I know there's a few I'd like to see and a few I'd like to have a go at writing too :-)

390020

Hi, cd. Good work and kudos for sharing.

Just a suggestion:
Did you consider publishing it on the OTN or maybe on the Ora Mag?
http://www.oracle.com/technology/contact/otn_submit.html

cd_2

I can't really decide if my "article" would meet the publishing standards of Oracle Mag/OTN.

C.

ebrian

cd...thanks for taking the time to post this write-up for others to learn from.

474126

Hi Cd,

This article is really useful. Keep updating this thread, this will really help others to learn and understand more about regular expression.

Cheers,
Mohana

523648

thanks for the informative post.keep going

RadhakrishnaSarma

I'd like to have a go at writing too :-)

Very good move David. Would like to have some sessions on Performance Tuning where you can start to educate a dumb man in Performance Tuning.

Congrats cd. You not only took an initiative but also drove others for the same. Thanks for that too!

I also would like to make some suggestions about this good turn in forum. Any threads taken up like this should be updated by others too to add up what was missed out or what had been discovered from their experience.

What do you say guys?

Cheers
Sarma.

32685

...Performance Tuning where you can start to educate a dumb man in Performance Tuning.

I think there are many people on this forum who are much better placed to write articles on performance tuning :-)

I also would like to make some suggestions about this good turn in forum. Any threads taken up like this should be updated by others too to add up what was missed out or what had been discovered from their experience.

What do you say guys?

What about setting up a wiki that we can collaborate on? I've not set one up before so I don't know exactly what can be done but I was thinking it could be a good way of making sure the articles are complete and accurate, giving everyone a chance to engage and provide their input. Each time there's a new article submitted, the person submitting it could update a thread on the forums page to let people know there's new material.

We could also have a list of suggestions for articles for people to submit, that way people can take on writing about things that they need to learn themselves and may not have considered writing about before, as well as subjects that they are knowledgeable on.

Just a thought....

p.s.

I wouldn't mind setting it up if anyone is up for it :-)

138365

Please take a look at this first part and let me know if you find this useful.

I found it very useful and I'm looking forward to the next part!

Cheers!

cd_2

Thanks for all the encouraging replies and suggestions. Right now I don't have the capacity to start a wiki or open a blog, but I'd be more than willing to add this and any future articles to such a site. Just to keep you updated, here's a list of things I want to write about in the next part:

-- count occurrence of a character
-- find last occurrence, "looking back"
-- swapping substrings
-- variable IN lists
-- LIKE and IN together
-- case insensitive search
-- phone numbers
-- checking inet values, like IP address, e-mail or URL
-- removing duplicate characters

If you have any ideas or examples you think could be solved by regular expressions, please keep adding them to this thread.

C.

APC

Hi CD

I mentioned your article on my blog last night.

I would be quite happy to host your articles on Radio Free Tooting. To be frank it could do with more hard tech-y stuff.

Cheers, APC

cd_2

Why thank you. I don't have any objections if you want to host them, since the whole thing is about sharing knowledge. I'm also planning on translating this one into German and adding it to a German-speaking forum which is focused on Oracle.

C.

William Robertson

I could put it up on my site. Similar to APC, I haven't added enough new stuff recently.

cd_2

Again, no objections from my side - thanks for "replicating" that knowledge. Saves me the time to put something on the web myself. ;-)

C.

Gabe2

You have some typos in some of the patterns above the first addendum ... look for '$\+[0-9]+$' and '$[+-]?[0-9]+$'

Cheers.

cd_2

Thank you. Since my first pattern was using the "^" it could have happened when I was fixing my "*" mistake. Talking about peer review ... ;-)

C.

APC

We could also have a list of suggestions for articles for people to submit, that way people can take on writing about things that they need to learn themselves and may not have considered writing about before, as well as subjects that they are knowledgeable on.

There are a large number of initiatives out there already. There is Ask Tom, Oracle-L, the Oracle FAQ forum, any number of bloggers. In this particular context there is Jonathan Lewis's Co-operative FAQ which does something very similar to what you're suggesting.

Which isn't to say that it's not a good idea, just that it will take commitment and energy both from the site admin and the contributors. As we know there are always more questioners than responders. And we will get the occasional espontáneo whose answers will need moderating. There is the risk of burnout and/or dilution.

But by all means set up a wiki or some similar site. You never know, it may be the thing that gets you your ACE-hood :)

Whilst I'm on the topic, I'm always happy to receive submissions from people who would like to present at the UKOUG Development SIG.

Cheers, APC

533549

Very nice article helping others to learn new, useful Oracle functionality.

Thank you very much C.!

529476

Too usefull CD!!
I didnt check this article yesterday!!!!
You are really genius!!!!
May be you can now publish oracle magazine!!!!!
Then we will subscribe your magazine in our office

cd_2

Interesting, in the other thread you're accusing me of quarrelling with each and everybody and being immature ... could you make up your mind?

C.

orawarebyte

It seems to be a very concrete model, though I didn't read it yet ... today is the weekend, I will digest it (Y). Good job, keep it up.

Khurram

44451

I agree with you Sarma!

Hats off CD, keep it up!

j4john

cd,

Excellent, thank you very much for sharing your expertise.

Could I please point you to 432368 to get your insight on why I'm having a problem with Apex using a regex that works OK in a non-Apex environment?

Many thanks,

john

245482

And, as I mentioned in another thread, HJR's www.dizwell.com is another good option.

469875

Mohammed Taj

Hi CD,

Too Good..

best regards
Taj

Keith Jamieson

This is just the sort of information I was looking for, and I think you should submit it to oracle magazine. I'm sure they'll like it, but they may ask for some amendments.

cd_2

Thank you. However I'm afraid Oracle Magazine won't take my article because they state that they will only publish articles that haven't been published anywhere else - and posting it here probably counts as publishing.

C.

Aman....

Hi CD,
Excellent!
Aman....

Mahmood Lebbai

I was trying to use regexp_substr function to retrieve the email domain (eg. yahoo.co.in for abc@yahoo.co.in)...but all I could get to was

"select regexp_substr('my email is abc@yahoo.co.in','@[^[:space:]]*',1,1) str from dual;" -- @yahoo.co.in

I am missing something here....Is there any way that I could get what I just wanted.....Using the trim function to remove the leading '@' is one way....

Can the regular expression do this for this situation?

Thanks.

Edited by: Mahmood Lebbai on Sep 22, 2008 10:40 PM

cd_2

Just some quick answers from my side, with REGEXP_SUBSTR and REGEXP_REPLACE:

WITH t AS (SELECT 'my email is abc@yahoo.co.in' col1
             FROM dual
          )
SELECT t.col1
     , LTRIM(REGEXP_SUBSTR(t.col1, '@[^[:space:]]+'), '@') solution_1
     , REGEXP_REPLACE(t.col1, '@([^[:space:]]+)|.', '\1') solution_2
  FROM t
;    

COL1                        SOLUTION_1                  SOLUTION_2
--------------------------- --------------------------- --------------------
my email is abc@yahoo.co.in yahoo.co.in                 yahoo.co.in

Aketi Jyuuzou

similar threads used for same regex-logic.
700680
705978

Mahmood Lebbai

Hello CD,

Thanks for your answer...

Though i thought of a sort of solution_1, solution_2 (REGEXP_REPLACE(t.col1, '@([^[:space:]]+)|.', '\1') solution_2) was really astounding...

Thanks one more time....I learnt something today!

Edited by: Mahmood Lebbai on Sep 25, 2008 8:57 AM

BluShadow

cd, I think you need to edit your article to cater for the jive forum formatting. A lot of the [ and ] are being boogered up in the text. ;)


Locked Post

New comments cannot be posted to this locked post.

Locked on Feb 19 2010

Added on Sep 27 2006

#oracle, #query, #sql

42 comments

59,840 views
