This discussion is archived
4 Replies Latest reply: Jul 19, 2013 2:24 PM by Barbara Boehmer RSS

String comparison using regex

23fbd40c-af91-4ac9-aade-c8afeec630bd Newbie
Currently Being Moderated

I try to develop a string compare with tiny substrings to have a value for similarity.

That is not a problem if you use loops. But I wanted to do so with (recursive) RegEx.

 

An example:

If I have the two Names 'Indiana Jones' and 'John Dow'. I then want to count these substring which are identically.

next you cut each name in two letter substrings like 'In', 'nd', 'di', 'ia', 'an', 'na', 'Jo', 'on', 'ne', 'es' and 'Jo', 'oh', 'hn', 'Do', 'ow'. Now I want do count the identical substrings out of MAX(COUNT(substrings_name1),COUNT(substrings_name1)). So I have a value for comparison.

 

Does anybody have an idea how the RegEx have to look like.

 

Many thanks in advance.

Steffen

  • 1. Re: String comparison using regex
    Barbara Boehmer Oracle ACE
    Currently Being Moderated

    If you are trying to create a function to count the identical two-character sub-strings between two strings and return the percentage of total sub-strings in the longest of the strings, then the following demonstrates a couple of methods.  I did not include two-character strings with a space, as in the examples that you provided.  It also shows the results for other methods of comparing names or strings using the utl_match package.

     

    SCOTT@orcl_11gR2> CREATE OR REPLACE FUNCTION string_compare

      2    (p_string1     IN VARCHAR2,

      3      p_string2     IN VARCHAR2)

      4    RETURN            NUMBER

      5  AS

      6    v_similarity        NUMBER;

      7  BEGIN

      8    SELECT ROUND ((COUNT(*) /

      9          GREATEST

    10            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

    11             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

    12          * 100)

    13    INTO   v_similarity

    14    FROM   (SELECT SUBSTR (p_string1, COLUMN_VALUE, 2) string1

    15             FROM   DUAL,

    16                TABLE

    17              (CAST

    18                 (MULTISET

    19                    (SELECT LEVEL

    20                     FROM   DUAL

    21                     CONNECT BY LEVEL <= (LENGTH (p_string1) - 1))

    22               AS SYS.ODCINUMBERLIST))) s1,

    23            (SELECT SUBSTR (p_string2, COLUMN_VALUE, 2) string2

    24             FROM   DUAL,

    25                TABLE

    26              (CAST

    27                 (MULTISET

    28                    (SELECT LEVEL

    29                     FROM   DUAL

    30                     CONNECT BY LEVEL <= (LENGTH (p_string2) - 1))

    31               AS SYS.ODCINUMBERLIST))) s2

    32    WHERE  s1.string1 = s2.string2

    33    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

    34    AND    LENGTH (REPLACE (string2, ' ', '')) = 2;

    35    RETURN v_similarity;

    36  END string_compare;

    37  /

     

    Function created.

     

    SCOTT@orcl_11gR2> SHOW ERRORS

    No errors.

    SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

    SCOTT@orcl_11gR2> SELECT string_compare

      2            ('Indiana Jones', 'John Dow') || '%'

      3            AS "% of identical strings",

      4          utl_match.jaro_winkler_similarity

      5            ('Indiana Jones', 'John Dow') || '%'

      6            AS "Jaro Winkler similarity",

      7          utl_match.edit_distance_similarity

      8            ('Indiana Jones', 'John Dow') || '%'

      9            AS "edit distance similarity"

    10  FROM   DUAL

    11  /

     

    % of identical strings   Jaro Winkler similarity  edit distance similarity

    ------------------------ ------------------------ ------------------------

    10%                      53%                      24%

     

    1 row selected.

     

    SCOTT@orcl_11gR2> CREATE OR REPLACE FUNCTION string_compare

      2    (p_string1     IN VARCHAR2,

      3      p_string2     IN VARCHAR2)

      4    RETURN            NUMBER

      5  AS

      6    v_similarity        NUMBER;

      7  BEGIN

      8    SELECT ROUND ((COUNT(*) /

      9          GREATEST

    10            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

    11             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

    12          * 100)

    13    INTO   v_similarity

    14    FROM   (WITH

    15           data (r, string1) AS

    16             (SELECT 1 r, SUBSTR (p_string1, 1, 2) string1

    17              FROM   DUAL

    18              UNION ALL

    19              SELECT r + 1, SUBSTR (p_string1, r + 1, 2)

    20              FROM   data

    21              WHERE  r <= LENGTH (p_string1) - 1)

    22             SELECT string1

    23             FROM   data) s1,

    24            (WITH

    25           data (r, string2) AS

    26             (SELECT 1 r, SUBSTR (p_string2, 1, 2) string2

    27              FROM   DUAL

    28              UNION ALL

    29              SELECT r + 1, SUBSTR (p_string2, r + 1, 2)

    30              FROM   data

    31              WHERE  r <= LENGTH (p_string2) - 1)

    32             SELECT string2

    33             FROM   data) s2

    34    WHERE  s1.string1 = s2.string2

    35    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

    36    AND    LENGTH (REPLACE (string2, ' ', '')) = 2;

    37    RETURN v_similarity;

    38  END string_compare;

    39  /

     

    Function created.

     

    SCOTT@orcl_11gR2> SHOW ERRORS

    No errors.

    SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

    SCOTT@orcl_11gR2> SELECT string_compare

      2            ('Indiana Jones', 'John Dow') || '%'

      3            AS "% of identical strings",

      4          utl_match.jaro_winkler_similarity

      5            ('Indiana Jones', 'John Dow') || '%'

      6            AS "Jaro Winkler similarity",

      7          utl_match.edit_distance_similarity

      8            ('Indiana Jones', 'John Dow') || '%'

      9            AS "edit distance similarity"

    10  FROM   DUAL

    11  /

     

    % of identical strings   Jaro Winkler similarity  edit distance similarity

    ------------------------ ------------------------ ------------------------

    10%                      53%                      24%

     

    1 row selected.

  • 2. Re: String comparison using regex
    23fbd40c-af91-4ac9-aade-c8afeec630bd Newbie
    Currently Being Moderated

    Hello Barbara,

    First of all many thanks for your answer.

    I will have a closer look as soon as possible...

    Actually I hoped to realize the compare in RegEx solely. But maybe that is impossible so I will test a function as well.

    A nice weekend meanwhile!

  • 3. Re: String comparison using regex
    23fbd40c-af91-4ac9-aade-c8afeec630bd Newbie
    Currently Being Moderated

    Hello Barbara,

     

    again thanks for your help. I tried some ideas (inspired by your solution) but unfortunately I cannot use a function for different reason.

    If a have a smart solution I will let you know...   :-)

     

    Steffen

  • 4. Re: String comparison using regex
    Barbara Boehmer Oracle ACE
    Currently Being Moderated

    Why can't you use a function?  If you are going to say that you can't use something, then you should explain why you think so, so that alternatives can be offered.  Are you not allowed to create functions?  If so, then you can just use a select statement as demonstrated below.  Are there features in the function that are not supported by your Oracle version?  If so, then there are alternative syntaxes.  It might help to know if your values are being passed as variables or separate columns in the same table or different tables or what.  I doubt that there is not some way that the below cannot be adapted for your purposes.  You just need to say why you think you can't use it.

     

    SCOTT@orcl_11gR2> COLUMN "% of identical strings"   FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "Jaro Winkler similarity"  FORMAT A24

    SCOTT@orcl_11gR2> COLUMN "edit distance similarity" FORMAT A24

    SCOTT@orcl_11gR2> WITH

      2    test_data AS

      3       (SELECT 'Indiana Jones' AS p_string1,

      4           'John Dow'     AS p_string2

      5        FROM     DUAL)

      6    SELECT ROUND ((COUNT(*) /

      7          GREATEST

      8            (LENGTH (p_string1) - 1 - (REGEXP_COUNT (p_string1, ' ') * 2),

      9             LENGTH (p_string2) - 1 - (REGEXP_COUNT (p_string2, ' ') * 2)))

    10          * 100) || '%' AS "% of identical strings",

    11            utl_match.jaro_winkler_similarity

    12          ('Indiana Jones', 'John Dow') || '%'

    13          AS "Jaro Winkler similarity",

    14            utl_match.edit_distance_similarity

    15          ('Indiana Jones', 'John Dow') || '%'

    16          AS "edit distance similarity"

    17    FROM   (SELECT SUBSTR (p_string1, COLUMN_VALUE, 2) string1, p_string1

    18             FROM   test_data,

    19                TABLE

    20              (CAST

    21                 (MULTISET

    22                    (SELECT LEVEL

    23                     FROM   DUAL

    24                     CONNECT BY LEVEL <= (LENGTH (p_string1) - 1))

    25               AS SYS.ODCINUMBERLIST))) s1,

    26            (SELECT SUBSTR (p_string2, COLUMN_VALUE, 2) string2, p_string2

    27             FROM   test_data,

    28                TABLE

    29              (CAST

    30                 (MULTISET

    31                    (SELECT LEVEL

    32                     FROM   DUAL

    33                     CONNECT BY LEVEL <= (LENGTH (p_string2) - 1))

    34               AS SYS.ODCINUMBERLIST))) s2

    35    WHERE  s1.string1 = s2.string2

    36    AND    LENGTH (REPLACE (string1, ' ', '')) = 2

    37    AND    LENGTH (REPLACE (string2, ' ', '')) = 2

    38  /

     

    % of identical strings   Jaro Winkler similarity  edit distance similarity

    ------------------------ ------------------------ ------------------------

    10%                      53%                      24%

     

    1 row selected.