Forum Stats

  • 3,757,566 Users
  • 2,251,245 Discussions
  • 7,869,867 Comments

Discussions

Using regular expression to find any recurring substring within string

649496
649496 Member Posts: 6
edited Jul 15, 2008 6:00PM in SQL & PL/SQL
I'm trying to come up with a regular expression to identify fields where a text string, which is made up of a set of substrings delimited by ! , contains a repeated substring. For example, the expression should match:
!x!y!z!y!
and:
! str_1 ! str_2 ! str_3 ! str_1 !
but not:
! str ! str_A ! str_B ! _B !

Something like the following REGEXP_LIKE expression with a back reference would work if the substrings were a fixed length and I knew how far apart the duplicate substrings were:
SELECT
(CASE WHEN
REGEXP_LIKE('!a!b!c!b!', '(!.!).\1')
THEN 'TRUE'
ELSE 'FALSE' END)
FROM DUAL;

But the substrings can be of any length, and duplicates can be separated by any number of substrings (or none), and using asterisks in place of the periods doesn't work: REGEXP_LIKE('!a!b!c!b!', '(*)*\1')

Is there a way to do this with regular expressions, or will I have to use a PL/SQL block loop through each string and separate out the substrings?

Thanks.

Comments

  • Sentinel
    Sentinel Member Posts: 1,284 Silver Badge
    Yep it's possible:
    with t1 as (select '!x!y!z!y!' str from dual
    union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
    union all select '! str ! str_A ! str_B ! _B !' from dual
    )
    select str
    from t1
    where regexp_like(str, '(![^!]+!).*\1')
  • 649496
    649496 Member Posts: 6
    This helps a lot, but there's one minor issue: since the pattern to be matched expects at least one character between the matching substrings, this solution does not work for strings where where the duplicate substrings are consecutive, such as:
    !x!y!y!z!

    I think I can fix this by using REPLACE to add extra delimiter characters between the substrings :

    REGEXP_LIKE('!'||replace(str,'!','!!!')||'!', '(![^!]+!).+\1','i')

    (Note that I'm also adding the leading and trailing delimiter here --they are not actually included in the original string.)

    I may play around with REGEXP_REPLACE to remove any white space within the substrings before comparing them, but unless someone has a more elegant solution, this seems to do the trick. Thanks for the help!
  • 649102
    649102 Member Posts: 90
    How about this. It works with your test data but I'm not sure about your real data.
    with t1 as (select '!x!y!y!z!' str from dual
    union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
    union all select '! str ! str_A ! str_B ! _B !' from dual
    )
    select str
    from t1
    where regexp_like(str, '(![^!]+!).*\1')
    OR regexp_like(str, '(!?[^!]+!)\1')
  • 649496
    649496 Member Posts: 6
    Not quite, because that would return TRUE for non-matching substrings in cases like:
    '!xy!y!'

    However, using .* instead of .+ before the back reference as you did makes sense. Then the REPLACE only needs to replace each delimiter with 2 delimiters rather than 3 in order to separate the subgroups, making it (slightly) more clear what the expression is trying to accomplish.
    REGEXP_LIKE('!'||replace(str,'!','!!')||'!', '(![^!]+!).*\1','i')

    Your example helped to clarify what was confusing me with my first attempt. I was thinking of the * as meaning "0 or more characters", but it really means "0 or more of the immediately preceding character (or grouping within parenthesis)." When I left out the period before the asterisk, the asterisk represented 0 or more consecutive occurances of the entire substring I was trying to find a duplicate for.

    I think I'm finally starting to get it...
    Thanks!
  • 649102
    649102 Member Posts: 90
    Sorry but I'm determined to solve your issue without you having to do unnecessary replaces. How bout this one:
    with t1 as (select '!xy!y!z!' str from dual
    union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
    union all select '! str ! str_A ! str_B ! _B !' from dual
    )
    select str
    from t1
    where regexp_like(str, '(![^!]+!).*\1')
    OR regexp_like(str, '(!)([^!]+)(!)\2\3')
    Regexs can be tricky but once you get the hang of them, they are a priceless tool.
  • 649496
    649496 Member Posts: 6
    That appears to work! You inspired me to keep working on it, and after removing some parentheses and using the Altenator operator (|) instead of the OR clause, I managed to distill it down to just this:

    regexp_like(str, '!([^!]+!)(|.*!)\1')

    Persistence pays off!
    Thanks.
  • Aketi Jyuuzou
    Aketi Jyuuzou Member Posts: 1,072 Bronze Badge
    edited Jul 15, 2008 6:24AM
    with t1 as (select '!x!y!z!y!' str from dual
    union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
    union all select '! str ! str_A ! str_B ! _B !' from dual)
    select str
    from t1
    where RegExp_Like(str,'(![^!]+).*\1(!|$)');
    STR
    ---------------------------------
    !x!y!z!y!
    ! str_1 ! str_2 ! str_3 ! str_1 !
    or
    with t1 as (select '!x!y!z!y!' str from dual
    union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
    union all select '! str ! str_A ! str_B ! _B !' from dual)
    select str
    from t1
    where RegExp_Like(str,'(![^!]+).*\1!');
  • 649496
    649496 Member Posts: 6
    <<
    where RegExp_Like(str,'(![^!]+).*\1(!|$)');
    >>
    <<
    where RegExp_Like(str,'(![^!]+).*\1!');
    >>

    Actually, these two expressions return TRUE if they find a second substring that begins with the same characters as the first, even if they're not duplicates, for example:
    '!String1!Str!'

    I read that all regular expressions are "greedy", which I thought meant that (![^!]+) would be taken as '!String1', which of course would not have found a match. However, it apparently means that it will match the longest expression it can for the subexpression and still find a match for the overall pattern. So it tries '!String1' first, and then '!String', '!Strin', '!Stri', and finally '!Str', where it finds a match and stops looking.

    Your first example reminded, though, that if the leading and trailing delimiter characters are not included in the original string, I can use the the ^ and $ metacharacters in the search pattern to avoid having to concatenate them onto STR. For anyone interested, this is my latest version. (We should find a match for the first 10 only):
    with t1 as (select 'x!x' str from dual
    union all select 'x!x!y!z' str from dual
    union all select 'y!x!x!z' from dual
    union all select 'y!z!x!x' from dual
    union all select 'x!y!x!z' from dual
    union all select 'y!x!z!x' from dual
    union all select 'w!x!y!x!z' from dual
    union all select 'str1!str1!str2' from dual
    union all select 'str1!str2!str1' from dual
    union all select 'str1!str2!str3!str2!str4' from dual
    union all select 'w!x!y!z' from dual
    union all select 'x!str1!str!1!tr!str2!_str2!2!str2_!str3!3' from dual)
    select str
    from t1
    WHERE RegExp_Like(str,'(^|!)([^!]+)!(|.+!)\2(!|$)');
  • Aketi Jyuuzou
    Aketi Jyuuzou Member Posts: 1,072 Bronze Badge
    Oh, I see.
    My final solution which is similar your solution.
    col match for a20
    with t1 as (select 'x!x' str from dual
    union all select 'x!x!y!z' str from dual
    union all select 'y!x!x!z' from dual
    union all select 'y!z!x!x' from dual
    union all select 'x!y!x!z' from dual
    union all select 'y!x!z!x' from dual
    union all select 'w!x!y!x!z' from dual
    union all select 'str1!str1!str2' from dual
    union all select 'str1!str2!str1' from dual
    union all select 'str1!str2!str3!str2!str4' from dual
    union all select 'w!x!y!z' from dual
    union all select 'x!str1!str!1!tr!str2!_str2!2!str2_!str3!3' from dual
    union all select 'AAAA!x!y' from dual
    union all select 'AAAA!AAA!y' from dual)
    select str,RegExp_Substr(str,'(^|!)([^!]+)!(.+!)?\2(!|$)') as match
    from t1
    WHERE RegExp_Like(str,'(^|!)([^!]+)!(.+!)?\2(!|$)');
    STR                       MATCH
    ------------------------ -------------------
    x!x x!x
    x!x!y!z x!x!
    y!x!x!z !x!x!
    y!z!x!x !x!x
    x!y!x!z x!y!x!
    y!x!z!x !x!z!x
    w!x!y!x!z !x!y!x!
    str1!str1!str2 str1!str1!
    str1!str2!str1 str1!str2!str1
    str1!str2!str3!str2!str4 !str2!str3!str2!
This discussion has been closed.