Forum Stats

  • 3,782,743 Users
  • 2,254,682 Discussions
  • 7,880,151 Comments

Discussions

Introduction to regular expressions ... continued.

cd_2
cd_2 Member Posts: 5,021
edited Mar 29, 2011 2:35AM in SQL & PL/SQL
After some very positive feedback from 429716 I'm now continuing on this topic for the interested audience. As always, if you have questions or problems that you think could be solved through regular expression, please post them.

Having fun with regular expressions - Part 2

Finishing my example with decimal numbers, I thought about a method to test regular expressions. A question from another user who was looking for a way to show all possible combinations inspired me in writing a small package.
CREATE OR REPLACE PACKAGE regex_utils AS
  -- Regular Expression Utilities
  -- Version 0.1 
  TYPE t_outrec IS RECORD(
    data VARCHAR2(255)
  );
  TYPE t_outtab IS TABLE OF t_outrec;
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED;
END regex_utils;
/

CREATE OR REPLACE PACKAGE BODY regex_utils AS
-- FUNCTION gen_data returns a collection of generated varchar2 elements
  FUNCTION gen_data(
    p_charset IN VARCHAR2 -- character set that is used for generation
  , p_length  IN NUMBER   -- length of the generated 
  ) RETURN t_outtab PIPELINED
  IS
    TYPE t_counter IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
    --
    v_counter t_counter;
    v_exit    BOOLEAN;
    v_string  VARCHAR2(255);
    v_outrec  t_outrec;
  BEGIN 
    FOR max_length IN 1..p_length  
    LOOP
      -- init counter loop
      FOR i IN 1..max_length
      LOOP
        v_counter(i) := 1;
      END LOOP; 
      --
      -- start data generation loop
      --
      v_exit := FALSE;
      --
      WHILE NOT v_exit
      LOOP
        -- start generation
        v_string := '';
        --
        FOR i IN 1..max_length
        LOOP
          v_string := v_string || SUBSTR(p_charset, v_counter(i), 1);
        END LOOP;
        --
        -- set outgoing record
        --
        v_outrec.data := v_string;
        --
        -- now pipe the result 
        --
        PIPE ROW(v_outrec);
        --
        -- increment loop
        --
        <<inc_loop>>
        FOR i IN REVERSE 1..max_length
        LOOP
          v_counter(i) := v_counter(i) + 1;      
          IF v_counter(i) > LENGTH(p_charset) THEN 
             IF i > 1 THEN
                v_counter(i) := 1;
             ELSE
                v_exit := TRUE;   
             END IF;
          ELSE
             -- no further processing required
             EXIT inc_loop;   
          END IF;   
        END LOOP;         
        --
      END LOOP;  
    END LOOP;  
  END gen_data;
END regex_utils;
/
This package is a brute force string generator using all possible combinations of a characters in a string up to a maximum length. Together with the regular expressions, I can now show what combinations my solution would allow to pass. But see for yourself:
SELECT *
  FROM (SELECT data col1 
          FROM TABLE(regex_utils.gen_data('+-.0', 5))
       ) t
 WHERE REGEXP_LIKE(NVL(REGEXP_SUBSTR(t.col1, 
                                     '^([+-]?[^+-]+|[^+-]+[+-]?)$'
                                    ), 
                       ' '
                      ), 
                   '^[+-]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)[+-]?$'
                   )
;
You will see some results, which are perfectly valid for my definition of decimal numbers but haven't been mentioned, like '000' or '+.00'. From now on I will also use this package to verify the solutions I'll present to you and hopefully reduce my share of typos.

Counting and finding certain characters or words in a string can be a tedious task. I'll show you how it's done with regular expressions. I'll start with an easy example, count all spaces in the string "Having fun with regular expressions.":
SELECT NVL(LENGTH(REGEXP_REPLACE('Having fun with regular expressions', '[^ ]')), 0)
  FROM dual
  ;
No surprise there. I'm replacing all characters except spaces with a null string. Since REGEXP_REPLACE assumes a NULL string as replacement argument, I can save on adding a third argument, which would look like this:
REGEXP_REPLACE('Having fun with regular expressions', '[^ ]', '')
So REPLACE will return all the spaces which we can count with the LENGTH function. If there aren't any, I will get a NULL string, which is checked by the NVL function. If you want you can play around by changing the space character to somethin else.

A variation of this theme could be counting the number of words. Counting spaces and adding 1 to this result could be misleading if there are duplicate spaces. Thanks to regular expressions, I can of course eliminate duplicates.

Using the old method on the string "Having fun with regular expressions" would return anything but the right number. This is, where Backreferences come into play. REGEXP_REPLACE uses them in the replacement argument, a backslash plus a single digit, like this: '\1'. To reference a string in a search pattern, I have to use subexpressions (remember the round brackets?).
SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions', '( )\1*|.', '\1')))
  FROM dual
  ;
You may have noticed that I changed from using the "^" as a NOT operator to using the "|" OR operator and the "." any character placeholder. This neat little trick allows to filter all other characters except the one we're looking in the first place. "\1" as backreference is outside of our subexpression since I don't want to count the trailing spaces and is used both in the search pattern and the replacement argument.

Still I'm not satisfied with this: What about leading/trailing blanks, what if there are any special characters, numbers, etc.? Finally, it's time to only count words. For the purpose of this demonstration, I define a word as one or more consecutive letters. If by now you're already thinking in regular expressions, the solution is not far away. One hint: you may want to check on the "i" match parameter which allows for case insensitive search. Another one: You won't need a back reference in the search pattern this time.



Let's compare our solutions than, shall we?
SELECT NVL(LENGTH(REGEXP_REPLACE('Having  fun  with  regular  expressions.  !', 
                                 '([a-z])+|.', '\1', 1, 0, 'i')), 0)
  FROM dual;
This time I don't use a backreference, the "+" operator (remember? 1 or more) will suffice. And since I want to count the occurences, not the letters, I moved the "+" meta character outside of the subexpression. The "|." trick again proved to be useful.

Case insensitive search does have its merits. It will only search but not transform the any found substring. If I want, for example, extract any occurence of the word fun, I'll just use the "i" match parameter and get this substring, whether it's written as "Fun", "FUN" or "fun". Can be very useful if you're looking for example for names of customers, streets, etc.

Enough about counting, how about finding? What if I want to know the last occurence of a certain character or string, for example the postition of the last space in this string "Where is the last space?"?

Addendum: Thanks to another forum member, I should mention that using the INSTR function can do a reverse search by itself.[i]
WITH t AS (SELECT 'Where is the last space?' col1 FROM dual) SELECT INSTR(col1, ' ', -1) FROM DUAL;Now regular expressions are powerful, but there is no parameter that allows us to reverse the search direction. However, remembering that we have the "$" meta character that means (until the) end of string, all I have to do is use a search pattern that looks for a combination of space and non-space characters including the end of a string. Now compare the REGEXP_INSTR function to the previous solution:
SELECT REGEXP_INSTR(t.col1, ' [^ ]*$') FROM t;So in this case, it'll remain a matter of taste what you want to use. If the search pattern has to look for the last occurrence of another regular expression, this is the way to solve such a requirement.

One more thing about backreferences. They can be used for a sort of primitive "string swapping". If for example you have to transform column values like swapping first and last name, backreferenc is your friend. Here's an example:
SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1') FROM dual ;What about middle names, for example 'John J. Doe'? Look for yourself, it still works.

You can even use that for strings with delimiters, for example reversing delimited "fields" like in this string '10~20~30~40~50' into '50~40~30~20~10'. Using REVERSE, I would get '05~04~03~02~01', so there has to be another way. Using backreferences however is limited to 9 subexpressions, which limits the following solution a bit, if you need to process strings with more than 9 fields. If you want, you can think this example through and see if your solution matches mine.
SELECT REGEXP_REPLACE('10~20~30~40~50', '^(.*)~(.*)~(.*)~(.*)~(.*)$', '\5~\4~\3~\2~\1' ) FROM dual;After what you've learned so far, that wasn't too hard, was it? Enough for now ...


Continued in 437109.

C.

Fixed some typos and a flawed example ...
cd
«13

Comments

  • cd_2
    cd_2 Member Posts: 5,021
    For the 3rd and final part I have this topics left:

    -- variable IN lists
    -- LIKE and IN together
    -- phone numbers
    -- checking inet values, like IP address, e-mail or URL
    -- Advancements in 10g2

    C.
  • Rob van Wijk
    Rob van Wijk Member Posts: 5,856 Bronze Trophy
    Great information C !

    One german typo :-)
    I'm replacing all characters except spaces mit a null string.
    Looking forward to the last part.

    Regards,
    Rob.
  • marias
    marias Member Posts: 1,529
    SELECT REGEXP_REPLACE('John Doe', '^(.*) (.*)$', '\2, \1') FROM dual ;

    SELECT REGEXP_REPLACE('10~20~30~40~50', '^(.*)~(.*)~(.*)~(.*)~(.*)$',
    '\5~\4~\3~\2~\1' ) FROM dual;

    those 2 are pretty interesting.

    good article.
  • Thank you very much C. Awaiting other parts.... keep going.
    One german typo :-)
    I'm replacing all characters except spaces mit a
    null string.
    I received a functional spec from my Dutch analyst in which it is written
    tnsnames voor EDWH:
    PCESCRD1 = (DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)
    (HOST=blah.blah.blah.com)
    (PORT=5227)))
    (CONNECT_DATA=(SID=pcescrd1)))
    db user: BW_I2_VIEWER / BW_I2_VIEWER_SCRD1
    Had to look for translators.

    Cheers
    Sarma.
  • Rob van Wijk
    Rob van Wijk Member Posts: 5,856 Bronze Trophy
    :-)

    "voor" means "for" usually.
  • ebrian
    ebrian Member Posts: 2,736 Gold Badge
    cd...excellent write-up!!!

    Just curious...this:
    WITH t AS (SELECT 'Where is the last space?' col1
    FROM dual)
    SELECT LENGTH(t.col1) - DECODE(INSTR(REVERSE(t.col1), REVERSE(' ')),
    0, LENGTH(t.col1),
    INSTR(REVERSE(t.col1), REVERSE(' ')) - 1)
    FROM t;
    could actually be simplified to this...right?
    select instr('Where is the last space?',' ',-1) from dual;
  • cd_2
    cd_2 Member Posts: 5,021
    I knew I should have read up the INSTR function in the user guide. Completely missed that one. So yes, searching backwords is possible through INSTR. If you want to search backward with regular expression, however, you'll still have to use the method I discussed.

    C.
  • nurhidayat
    nurhidayat Member Posts: 736
    edited Oct 5, 2006 10:57PM
    After what you've learned so far, that wasn't too hard, was it? Enough for now ...
    remembering how hard i was to understand regular expression in my college time, yes that wasn't so hard.

    Thanks, cd.
    May be you could post this helpful article to Oracle Magazine

    Message was edited by:
    nurhidayat
  • cd_2
    cd_2 Member Posts: 5,021
    edited Oct 6, 2006 2:48AM
    When I encountered regular expressions for the first time, I went like "what the heck?". In the end it's just a matter of practice ...

    As for the idea of publishing it on Oracle Mag., I'm afraid I already violated one submission rule:

    "The article is an original work and has not been published in any other form,"

    C.
  • 511769
    511769 Member Posts: 15
    Dear Friend,

    I need your help please.

    I am using Oracle 8.

    Quite a lot of forenames in our database have more than one name in the forename column.

    I need SQL or Function which:

    1) Detect when this is the case,
    2) Can pick out a Single name - like first,second or third,
    3) where you pass the forename and the number of the name - like
    F_extract_name(forename,2)
    4) I would need to return blanks where there isn't second or third or fourth name ...


    Thanks

    Thamil
This discussion has been closed.