9 Replies Latest reply on Jul 15, 2008 10:00 PM by Aketi Jyuuzou

    Using regular expression to find any recurring substring within string

    649496
      I'm trying to come up with a regular expression to identify fields where a text string, which is made up of a set of substrings delimited by ! , contains a repeated substring. For example, the expression should match:
      !x!y!z!y!
      and:
      ! str_1 ! str_2 ! str_3 ! str_1 !
      but not:
      ! str ! str_A ! str_B ! _B !

      Something like the following REGEXP_LIKE expression with a back reference would work if the substrings were a fixed length and I knew how far apart the duplicate substrings were:
      SELECT
      (CASE WHEN
      REGEXP_LIKE('!a!b!c!b!', '(!.!).\1')
      THEN 'TRUE'
      ELSE 'FALSE' END)
      FROM DUAL;

      But the substrings can be of any length, and duplicates can be separated by any number of substrings (or none), and using asterisks in place of the periods doesn't work: REGEXP_LIKE('!a!b!c!b!', '(*)*\1')

      Is there a way to do this with regular expressions, or will I have to use a PL/SQL block loop through each string and separate out the substrings?

      Thanks.
        • 1. Re: Using regular expression to find any recurring substring within string
          Sentinel
          Yep it's possible:
          with t1 as (select '!x!y!z!y!' str from dual
          union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
          union all select '! str ! str_A ! str_B ! _B !' from dual
          )
          select str
          from t1
          where regexp_like(str, '(![^!]+!).*\1')
          • 2. Re: Using regular expression to find any recurring substring within string
            649496
            This helps a lot, but there's one minor issue: since the pattern to be matched expects at least one character between the matching substrings, this solution does not work for strings where where the duplicate substrings are consecutive, such as:
            !x!y!y!z!

            I think I can fix this by using REPLACE to add extra delimiter characters between the substrings :

            REGEXP_LIKE('!'||replace(str,'!','!!!')||'!', '(![^!]+!).+\1','i')

            (Note that I'm also adding the leading and trailing delimiter here --they are not actually included in the original string.)

            I may play around with REGEXP_REPLACE to remove any white space within the substrings before comparing them, but unless someone has a more elegant solution, this seems to do the trick. Thanks for the help!
            • 3. Re: Using regular expression to find any recurring substring within string
              649102
              How about this. It works with your test data but I'm not sure about your real data.
              with t1 as (select '!x!y!y!z!' str from dual
                union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
                union all select '! str ! str_A ! str_B ! _B !' from dual
              )
              select str
              from t1
              where regexp_like(str, '(![^!]+!).*\1')
              OR regexp_like(str, '(!?[^!]+!)\1')
              • 4. Re: Using regular expression to find any recurring substring within string
                649496
                Not quite, because that would return TRUE for non-matching substrings in cases like:
                '!xy!y!'

                However, using .* instead of .+ before the back reference as you did makes sense. Then the REPLACE only needs to replace each delimiter with 2 delimiters rather than 3 in order to separate the subgroups, making it (slightly) more clear what the expression is trying to accomplish.
                REGEXP_LIKE('!'||replace(str,'!','!!')||'!', '(![^!]+!).*\1','i')

                Your example helped to clarify what was confusing me with my first attempt. I was thinking of the * as meaning "0 or more characters", but it really means "0 or more of the immediately preceding character (or grouping within parenthesis)." When I left out the period before the asterisk, the asterisk represented 0 or more consecutive occurances of the entire substring I was trying to find a duplicate for.

                I think I'm finally starting to get it...
                Thanks!
                • 5. Re: Using regular expression to find any recurring substring within string
                  649102
                  Sorry but I'm determined to solve your issue without you having to do unnecessary replaces. How bout this one:
                  with t1 as (select '!xy!y!z!' str from dual
                  union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
                  union all select '! str ! str_A ! str_B ! _B !' from dual
                  )
                  select str
                  from t1
                  where regexp_like(str, '(![^!]+!).*\1')
                  OR regexp_like(str, '(!)([^!]+)(!)\2\3')
                  Regexs can be tricky but once you get the hang of them, they are a priceless tool.
                  • 6. Re: Using regular expression to find any recurring substring within string
                    649496
                    That appears to work! You inspired me to keep working on it, and after removing some parentheses and using the Altenator operator (|) instead of the OR clause, I managed to distill it down to just this:

                    regexp_like(str, '!([^!]+!)(|.*!)\1')

                    Persistence pays off!
                    Thanks.
                    • 7. Re: Using regular expression to find any recurring substring within string
                      Aketi Jyuuzou
                      with t1 as (select '!x!y!z!y!' str from dual
                      union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
                      union all select '! str ! str_A ! str_B ! _B !' from dual)
                      select str
                        from t1
                      where RegExp_Like(str,'(![^!]+).*\1(!|$)');
                      STR
                      ---------------------------------
                      !x!y!z!y!
                      ! str_1 ! str_2 ! str_3 ! str_1 !
                      or
                      with t1 as (select '!x!y!z!y!' str from dual
                      union all select '! str_1 ! str_2 ! str_3 ! str_1 !' from dual
                      union all select '! str ! str_A ! str_B ! _B !' from dual)
                      select str
                        from t1
                      where RegExp_Like(str,'(![^!]+).*\1!');
                      • 8. Re: Using regular expression to find any recurring substring within string
                        649496
                        <<
                        where RegExp_Like(str,'(![^!]+).*\1(!|$)');
                        >>
                        <<
                        where RegExp_Like(str,'(![^!]+).*\1!');
                        >>

                        Actually, these two expressions return TRUE if they find a second substring that begins with the same characters as the first, even if they're not duplicates, for example:
                        '!String1!Str!'

                        I read that all regular expressions are "greedy", which I thought meant that (![^!]+) would be taken as '!String1', which of course would not have found a match. However, it apparently means that it will match the longest expression it can for the subexpression and still find a match for the overall pattern. So it tries '!String1' first, and then '!String', '!Strin', '!Stri', and finally '!Str', where it finds a match and stops looking.

                        Your first example reminded, though, that if the leading and trailing delimiter characters are not included in the original string, I can use the the ^ and $ metacharacters in the search pattern to avoid having to concatenate them onto STR. For anyone interested, this is my latest version. (We should find a match for the first 10 only):
                        with t1 as (select 'x!x' str from dual
                        union all select 'x!x!y!z' str from dual
                        union all select 'y!x!x!z' from dual
                        union all select 'y!z!x!x' from dual
                        union all select 'x!y!x!z' from dual
                        union all select 'y!x!z!x' from dual
                        union all select 'w!x!y!x!z' from dual
                        union all select 'str1!str1!str2' from dual
                        union all select 'str1!str2!str1' from dual
                        union all select 'str1!str2!str3!str2!str4' from dual
                        union all select 'w!x!y!z' from dual
                        union all select 'x!str1!str!1!tr!str2!_str2!2!str2_!str3!3' from dual)
                        select str
                        from t1
                        WHERE RegExp_Like(str,'(^|!)([^!]+)!(|.+!)\2(!|$)');
                        • 9. Re: Using regular expression to find any recurring substring within string
                          Aketi Jyuuzou
                          Oh, I see.
                          My final solution which is similar your solution.
                          col match for a20
                          with t1 as (select 'x!x' str from dual
                          union all select 'x!x!y!z' str from dual
                          union all select 'y!x!x!z' from dual
                          union all select 'y!z!x!x' from dual
                          union all select 'x!y!x!z' from dual
                          union all select 'y!x!z!x' from dual
                          union all select 'w!x!y!x!z' from dual
                          union all select 'str1!str1!str2' from dual
                          union all select 'str1!str2!str1' from dual
                          union all select 'str1!str2!str3!str2!str4' from dual
                          union all select 'w!x!y!z' from dual
                          union all select 'x!str1!str!1!tr!str2!_str2!2!str2_!str3!3' from dual
                          union all select 'AAAA!x!y' from dual
                          union all select 'AAAA!AAA!y' from dual)
                          select str,RegExp_Substr(str,'(^|!)([^!]+)!(.+!)?\2(!|$)') as match
                          from t1
                          WHERE RegExp_Like(str,'(^|!)([^!]+)!(.+!)?\2(!|$)');
                          STR                       MATCH
                          ------------------------  -------------------
                          x!x                       x!x
                          x!x!y!z                   x!x!
                          y!x!x!z                   !x!x!
                          y!z!x!x                   !x!x
                          x!y!x!z                   x!y!x!
                          y!x!z!x                   !x!z!x
                          w!x!y!x!z                 !x!y!x!
                          str1!str1!str2            str1!str1!
                          str1!str2!str1            str1!str2!str1
                          str1!str2!str3!str2!str4  !str2!str3!str2!