4 Replies Latest reply: Jul 19, 2013 4:24 PM by Barbara Boehmer RSS

    String comparison using regex

    23fbd40c-af91-4ac9-aade-c8afeec630bd

      I try to develop a string compare with tiny substrings to have a value for similarity.

      That is not a problem if you use loops. But I wanted to do so with (recursive) RegEx.

       

      An example:

      If I have the two Names 'Indiana Jones' and 'John Dow'. I then want to count these substring which are identically.

      next you cut each name in two letter substrings like 'In', 'nd', 'di', 'ia', 'an', 'na', 'Jo', 'on', 'ne', 'es' and 'Jo', 'oh', 'hn', 'Do', 'ow'. Now I want do count the identical substrings out of MAX(COUNT(substrings_name1),COUNT(substrings_name1)). So I have a value for comparison.

       

      Does anybody have an idea how the RegEx have to look like.

       

      Many thanks in advance.

      Steffen

        • 1. Re: String comparison using regex
          Barbara Boehmer

          If you are trying to create a function to count the identical two-character sub-strings between two strings and return the percentage of total sub-strings in the longest of the strings, then the following demonstrates a couple of methods.  I did not include two-character strings with a space, as in the examples that you provided.  It also shows the results for other methods of comparing names or strings using the utl_match package.

           

          SCOTT@orcl_11gR2> CREATE OR REPLACE FUNCTION string_compare

            2    (p_string1     IN VARCHAR2,

            3      p_string2     IN VARCHAR2)

            4    RETURN            NUMBER

            5  AS

            6    v_similarity        NUMBER;

            7  BEGIN

            8    SELECT ROUND ((COUNT(*) /

            9          GREATEST

          10            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

          11             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

          12          * 100)

          13    INTO   v_similarity

          14    FROM   (SELECT SUBSTR (p_string1, COLUMN_VALUE, 2) string1

          15             FROM   DUAL,

          16                TABLE

          17              (CAST

          18                 (MULTISET

          19                    (SELECT LEVEL

          20                     FROM   DUAL

          21                     CONNECT BY LEVEL <= (LENGTH (p_string1) - 1))

          22               AS SYS.ODCINUMBERLIST))) s1,

          23            (SELECT SUBSTR (p_string2, COLUMN_VALUE, 2) string2

          24             FROM   DUAL,

          25                TABLE

          26              (CAST

          27                 (MULTISET

          28                    (SELECT LEVEL

          29                     FROM   DUAL

          30                     CONNECT BY LEVEL <= (LENGTH (p_string2) - 1))

          31               AS SYS.ODCINUMBERLIST))) s2

          32    WHERE  s1.string1 = s2.string2

          33    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

          34    AND    LENGTH (REPLACE (string2, ' ', '')) = 2;

          35    RETURN v_similarity;

          36  END string_compare;

          37  /

           

          Function created.

           

          SCOTT@orcl_11gR2> SHOW ERRORS

          No errors.

          SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

          SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

          SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

          SCOTT@orcl_11gR2> SELECT string_compare

            2            ('Indiana Jones', 'John Dow') || '%'

            3            AS "% of identical strings",

            4          utl_match.jaro_winkler_similarity

            5            ('Indiana Jones', 'John Dow') || '%'

            6            AS "Jaro Winkler similarity",

            7          utl_match.edit_distance_similarity

            8            ('Indiana Jones', 'John Dow') || '%'

            9            AS "edit distance similarity"

          10  FROM   DUAL

          11  /

           

          % of identical strings   Jaro Winkler similarity  edit distance similarity

          ------------------------ ------------------------ ------------------------

          10%                      53%                      24%

           

          1 row selected.

           

          SCOTT@orcl_11gR2> CREATE OR REPLACE FUNCTION string_compare

            2    (p_string1     IN VARCHAR2,

            3      p_string2     IN VARCHAR2)

            4    RETURN            NUMBER

            5  AS

            6    v_similarity        NUMBER;

            7  BEGIN

            8    SELECT ROUND ((COUNT(*) /

            9          GREATEST

          10            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

          11             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

          12          * 100)

          13    INTO   v_similarity

          14    FROM   (WITH

          15           data (r, string1) AS

          16             (SELECT 1 r, SUBSTR (p_string1, 1, 2) string1

          17              FROM   DUAL

          18              UNION ALL

          19              SELECT r + 1, SUBSTR (p_string1, r + 1, 2)

          20              FROM   data

          21              WHERE  r <= LENGTH (p_string1) - 1)

          22             SELECT string1

          23             FROM   data) s1,

          24            (WITH

          25           data (r, string2) AS

          26             (SELECT 1 r, SUBSTR (p_string2, 1, 2) string2

          27              FROM   DUAL

          28              UNION ALL

          29              SELECT r + 1, SUBSTR (p_string2, r + 1, 2)

          30              FROM   data

          31              WHERE  r <= LENGTH (p_string2) - 1)

          32             SELECT string2

          33             FROM   data) s2

          34    WHERE  s1.string1 = s2.string2

          35    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

          36    AND    LENGTH (REPLACE (string2, ' ', '')) = 2;

          37    RETURN v_similarity;

          38  END string_compare;

          39  /

           

          Function created.

           

          SCOTT@orcl_11gR2> SHOW ERRORS

          No errors.

          SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

          SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

          SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

          SCOTT@orcl_11gR2> SELECT string_compare

            2            ('Indiana Jones', 'John Dow') || '%'

            3            AS "% of identical strings",

            4          utl_match.jaro_winkler_similarity

            5            ('Indiana Jones', 'John Dow') || '%'

            6            AS "Jaro Winkler similarity",

            7          utl_match.edit_distance_similarity

            8            ('Indiana Jones', 'John Dow') || '%'

            9            AS "edit distance similarity"

          10  FROM   DUAL

          11  /

           

          % of identical strings   Jaro Winkler similarity  edit distance similarity

          ------------------------ ------------------------ ------------------------

          10%                      53%                      24%

           

          1 row selected.

          • 2. Re: String comparison using regex
            23fbd40c-af91-4ac9-aade-c8afeec630bd

            Hello Barbara,

            First of all many thanks for your answer.

            I will have a closer look as soon as possible...

            Actually I hoped to realize the compare in RegEx solely. But maybe that is impossible so I will test a function as well.

            A nice weekend meanwhile!

            • 3. Re: String comparison using regex
              23fbd40c-af91-4ac9-aade-c8afeec630bd

              Hello Barbara,

               

              again thanks for your help. I tried some ideas (inspired by your solution) but unfortunately I cannot use a function for different reason.

              If a have a smart solution I will let you know...   :-)

               

              Steffen

              • 4. Re: String comparison using regex
                Barbara Boehmer

                Why can't you use a function?  If you are going to say that you can't use something, then you should explain why you think so, so that alternatives can be offered.  Are you not allowed to create functions?  If so, then you can just use a select statement as demonstrated below.  Are there features in the function that are not supported by your Oracle version?  If so, then there are alternative syntaxes.  It might help to know if your values are being passed as variables or separate columns in the same table or different tables or what.  I doubt that there is not some way that the below cannot be adapted for your purposes.  You just need to say why you think you can't use it.

                 

                SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

                SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

                SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

                SCOTT@orcl_11gR2> WITH

                  2    test_data AS

                  3       (SELECT 'Indiana Jones' AS p_string1,

                  4           'John Dow'     AS p_string2

                  5        FROM     DUAL)

                  6    SELECT ROUND ((COUNT(*) /

                  7          GREATEST

                  8            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

                  9             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

                10          * 100) || '%' AS "% of identical strings",

                11            utl_match.jaro_winkler_similarity

                12          ('Indiana Jones', 'John Dow') || '%'

                13          AS "Jaro Winkler similarity",

                14            utl_match.edit_distance_similarity

                15          ('Indiana Jones', 'John Dow') || '%'

                16          AS "edit distance similarity"

                17    FROM   (SELECT SUBSTR (p_string1, COLUMN_VALUE, 2) string1, p_string1

                18             FROM   test_data,

                19                TABLE

                20              (CAST

                21                 (MULTISET

                22                    (SELECT LEVEL

                23                     FROM   DUAL

                24                     CONNECT BY LEVEL <= (LENGTH (p_string1) - 1))

                25               AS SYS.ODCINUMBERLIST))) s1,

                26            (SELECT SUBSTR (p_string2, COLUMN_VALUE, 2) string2, p_string2

                27             FROM   test_data,

                28                TABLE

                29              (CAST

                30                 (MULTISET

                31                    (SELECT LEVEL

                32                     FROM   DUAL

                33                     CONNECT BY LEVEL <= (LENGTH (p_string2) - 1))

                34               AS SYS.ODCINUMBERLIST))) s2

                35    WHERE  s1.string1 = s2.string2

                36    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

                37    AND    LENGTH (REPLACE (string2, ' ', '')) = 2

                38  /

                 

                % of identical strings   Jaro Winkler similarity  edit distance similarity

                ------------------------ ------------------------ ------------------------

                10%                      53%                      24%

                 

                1 row selected.