This discussion is archived
1 2 Previous Next 19 Replies Latest reply: Jul 16, 2007 2:57 PM by Vadim Tropashko RSS

REGEXP_ compare two tokenized strings

-K- Guru
Currently Being Moderated
Hi,

I need to compare two fields (from two tables), to see if any of the tokenized contents match.
I have two tables with a "name" field, but the names might be incomplete or in random order. To get a match, I just need 1 part of both to really match, for example "John Doe" should match to "Doe J.".
Something like:
SELECT names1.name, names2.name
FROM names1, names2
WHERE INSTR(names1.name, REGEXP_SUBSTR(names2.name, '[[:alpha:]]+')) > 0
;
But the statement should consider all tokens, not just the first one as in the example.
Any suggestion on how to achieve this?

Thanks,
K.
  • 1. Re: REGEXP_ compare two tokenized strings
    60660 Journeyer
    Currently Being Moderated
    For that you would need some sort of "loop" inside that sql statement.

    C.
  • 2. Re: REGEXP_ compare two tokenized strings
    -K- Guru
    Currently Being Moderated
    Doe that mean the only option is to revert to PL/SQL?
  • 3. Re: REGEXP_ compare two tokenized strings
    60660 Journeyer
    Currently Being Moderated
    No. I meant you have to use a CONNECT BY, MODEL or even XML-query construct, here's an example:
    WITH t AS (SELECT 'John Doe' name1, 'Doe J.' name2
                 FROM dual
              )
    SELECT name1
         , name2
         , pos
      FROM t
     MODEL
     DIMENSION BY (0 dim)
     MEASURES(name1, name2, 0 pos)
     RULES ITERATE (9) UNTIL (pos[0] > 0)
     (pos[0] = INSTR(name1[0], REGEXP_SUBSTR(name2[0], '[[:alpha:]]+')))
    ;    
    
    NAME1    NAME2         POS
    -------- ------ ----------
    John Doe Doe J.          6
    C.
  • 4. Re: REGEXP_ compare two tokenized strings
    RobvanWijk Oracle ACE
    Currently Being Moderated
    An example using model and two tables:
    SQL> create table names1
      2  as
      3  select 1 id, 'John Doe' name from dual union all
      4  select 2, 'Larry Ellison' from dual union all
      5  select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual
      6  /

    Tabel is aangemaakt.

    SQL> create table names2
      2  as
      3  select 1001 id, 'Doe J.' name from dual union all
      4  select 1002, 'Gates Bill' from dual union all
      5  select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
      6  /

    Tabel is aangemaakt.

    SQL> select distinct n1.original_name name1
      2       , n2.original_name name2
      3    from ( select id
      4                , name
      5                , original_name
      6             from names1
      7            model
      8                  return updated rows
      9                  partition by (id)
    10                  dimension by (0 i)
    11                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
    12                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
    13                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
    14                  , original_name[iteration_number+1] = trim(name[0])
    15                  )
    16         ) n1
    17       , ( select id
    18                , name
    19                , original_name
    20             from names2
    21            model
    22                  return updated rows
    23                  partition by (id)
    24                  dimension by (0 i)
    25                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
    26                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
    27                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
    28                  , original_name[iteration_number+1] = trim(name[0])
    29                  )
    30         ) n2
    31   where n1.name = n2.name
    32  /

    NAME1                                               NAME2
    --------------------------------------------------- ---------------------------------------------------
    Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje
    John Doe                                            Doe J.

    2 rijen zijn geselecteerd.
    Regards,
    Rob.
  • 5. Re: REGEXP_ compare two tokenized strings
    -K- Guru
    Currently Being Moderated
    Thanks both for your examples of the model clause. Never used this before and took me some time to understand...

    Rob, your solution works just fine. Just hope the resulting query won't be too stressful on the database (still have to condition matches on more than part of the name, of course). Hartelijk bedankt!

    Regards,
    K.
  • 6. Re: REGEXP_ compare two tokenized strings
    494018 Newbie
    Currently Being Moderated
    Perhaps I missed something, but isn't this all you need?
    SQL*Plus: Release 10.2.0.1.0 - Production on Thu Jul 12 13:56:41 2007

    create table n1
    as
    select 1 id, 'John Doe' name from dual union all
    select 2, 'Larry Ellison' from dual union all
    select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
    select 4, null from dual
    /

    Table created.


    create table n2
    as
    select 1001 id, 'Doe J.' name from dual union all
    select 1002, 'Gates Bill' from dual union all
    select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
    /

    Table created.


    column name wrap format a20

    select n1.id, n1.name, n2.id, n2.name
    from   n1, n2
    where
      regexp_instr
      ( ' ' || n1.name || ' ',
        ' ' || replace( n2.name, ' ', ' | ' ) || ' '
      ) > 0
    ;

            ID NAME                         ID NAME
    ---------- -------------------- ---------- --------------------
             1 John Doe                   1001 Doe J.
             3 Catharina-Amalia Bea       1003 Beatrix Wilhelmina A
               trix Carmen Victoria            rmgard van Oranje
                van Oranje


    2 rows selected.
    --
    Joe Fuda
    SQL Snippets
  • 7. Re: REGEXP_ compare two tokenized strings
    RobvanWijk Oracle ACE
    Currently Being Moderated
    Perhaps I missed something, but isn't this all you
    need?
    ...
    regexp_instr
    ( ' ' || n1.name || ' ',
    ' ' || replace( n2.name, ' ', ' | ' ) || ' '
    0
    ...
    Joe,

    No, I think you are not missing anything.
    This looks much better, although I'm not sure about performance.
    To be continued ... ;-)

    Regards,
    Rob.
  • 8. Re: REGEXP_ compare two tokenized strings
    60660 Journeyer
    Currently Being Moderated
    erased Looks ok to me, after a second read. Further testing when I'm awake. ;-)

    C.
  • 9. Re: REGEXP_ compare two tokenized strings
    RobvanWijk Oracle ACE
    Currently Being Moderated
    The performance of the regexp_instr in the where clause is terrible ... I stopped working out other scenarios, because the conclusion will be the same:
    SQL> create table names1
      2  as
      3  select 1 id, 'John Doe' name from dual union all
      4  select 2, 'Larry Ellison' from dual union all
      5  select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
      6  select object_id, o
      7    from ( select object_id
      8                , owner || object_name o
      9                , ntile(200) over (order by object_id) nt
    10             from all_objects
    11         )
    12   where nt = 11
    13  /

    Tabel is aangemaakt.

    SQL> create table names2
      2  as
      3  select 1001 id, 'Doe J.' name from dual union all
      4  select 1002, 'Gates Bill' from dual union all
      5  select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual union all
      6  select object_id, o
      7    from ( select object_id
      8                , owner || object_name o
      9                , ntile(200) over (order by object_id) nt
    10             from all_objects
    11         )
    12   where nt = 22
    13  /

    Tabel is aangemaakt.

    SQL> select count(*) from names1
      2  /

      COUNT(*)
    ----------
           241

    1 rij is geselecteerd.

    SQL> select count(*) from names2
      2  /

      COUNT(*)
    ----------
           241

    1 rij is geselecteerd.

    SQL> exec dbms_stats.gather_table_stats(user,'names1')

    PL/SQL-procedure is geslaagd.

    SQL> exec dbms_stats.gather_table_stats(user,'names2')

    PL/SQL-procedure is geslaagd.

    SQL> set timing on
    SQL> set autotrace on
    SQL> select distinct n1.original_name name1
      2       , n2.original_name name2
      3    from ( select id
      4                , name
      5                , original_name
      6             from names1
      7            model
      8                  return updated rows
      9                  partition by (id)
    10                  dimension by (0 i)
    11                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
    12                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
    13                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
    14                  , original_name[iteration_number+1] = trim(name[0])
    15                  )
    16         ) n1
    17       , ( select id
    18                , name
    19                , original_name
    20             from names2
    21            model
    22                  return updated rows
    23                  partition by (id)
    24                  dimension by (0 i)
    25                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
    26                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
    27                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
    28                  , original_name[iteration_number+1] = trim(name[0])
    29                  )
    30         ) n2
    31   where n1.name = n2.name
    32  /

    NAME1
    ----------------------------------------------------------------------------------------------------
    NAME2
    ----------------------------------------------------------------------------------------------------
    Catharina-Amalia Beatrix Carmen Victoria van Oranje
    Beatrix Wilhelmina Armgard van Oranje

    John Doe
    Doe J.


    2 rijen zijn geselecteerd.

    Verstreken: 00:00:00.07
    Uitvoeringspan
    ----------------------------------------------------------
       0      SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=581 Bytes=987
              70)

       1    0   SORT (UNIQUE) (Cost=8 Card=581 Bytes=98770)
       2    1     HASH JOIN (Cost=7 Card=581 Bytes=98770)
       3    2       VIEW (Cost=3 Card=241 Bytes=20485)
       4    3         BUFFER (SORT)
       5    4           SQL MODEL (ORDERED FAST)
       6    5             TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3
              Card=241 Bytes=5784)

       7    2       VIEW (Cost=3 Card=241 Bytes=20485)
       8    7         BUFFER (SORT)
       9    8           SQL MODEL (ORDERED FAST)
      10    9             TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=3
              Card=241 Bytes=5784)





    Statistics
    ----------------------------------------------------------
            387  recursive calls
              0  db block gets
             52  consistent gets
              0  physical reads
              0  redo size
            590  bytes sent via SQL*Net to client
            508  bytes received via SQL*Net from client
              2  SQL*Net roundtrips to/from client
             11  sorts (memory)
              0  sorts (disk)
              2  rows processed

    SQL> select n1.name name1
      2       , n2.name name2
      3    from names1 n1
      4       , names2 n2
      5   where regexp_instr  ( ' ' || n1.name || ' ',       ' ' || replace( n2.name, ' ', ' | ' ) || ' '  ) > 0
      6  /

    NAME1                                                        NAME2
    ------------------------------------------------------------ ------------------------------------------------------------
    John Doe                                                     Doe J.
    Catharina-Amalia Beatrix Carmen Victoria van Oranje          Beatrix Wilhelmina Armgard van Oranje

    2 rijen zijn geselecteerd.

    Verstreken: 00:00:46.54
    Uitvoeringspan
    ----------------------------------------------------------
       0      SELECT STATEMENT Optimizer=CHOOSE (Cost=218 Card=2904 Bytes=
              121968)

       1    0   NESTED LOOPS (Cost=218 Card=2904 Bytes=121968)
       2    1     TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3 Card=241
               Bytes=5061)

       3    1     TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=1 Card=12
              Bytes=252)





    Statistics
    ----------------------------------------------------------
              1  recursive calls
              0  db block gets
            728  consistent gets
              0  physical reads
              0  redo size
            590  bytes sent via SQL*Net to client
            508  bytes received via SQL*Net from client
              2  SQL*Net roundtrips to/from client
              0  sorts (memory)
              0  sorts (disk)
              2  rows processed
    Regards,
    Rob.
  • 10. Re: REGEXP_ compare two tokenized strings
    60660 Journeyer
    Currently Being Moderated
    Interesting, I wouldn't have thought that MODEL is really that fast.

    C.
  • 11. Re: REGEXP_ compare two tokenized strings
    Vadim Tropashko Pro
    Currently Being Moderated
    select n1.id, n1.name, n2.id, n2.name
    from n1, n2
    where
    regexp_instr
    ( ' ' || n1.name || ' ',
    ' ' || replace( n2.name, ' ', ' | ' ) || ' '
    ) > 0
    ;
    Very nice, Joe!

    The fact that you can do such impressive things with RegExp is not surprising at all, because there is rich Language Parsing theory that underlines it. Anyway, here is continuation of the theme:
    http://vadimtropashko.wordpress.com/2007/07/12/string-decomposition-with-regexp/
  • 12. Re: REGEXP_ compare two tokenized strings
    60660 Journeyer
    Currently Being Moderated
    Regarding Model clause, could anybody please suggest
    a single problem that can be solved with it (I
    mean where it is not merely "yet another" ugly
    solution)?
    The documentation shows some examples. Ugly? I think
    in some cases MODEL offers better readability than
    alternative solutions, but that's just me.

    C.
  • 13. Re: REGEXP_ compare two tokenized strings
    -K- Guru
    Currently Being Moderated
    Aaah, this is what I was looking for!
    In the end it was so simple, I can't believe I didn't came up with this myself. A combination of keeping your head clear, focused and open for different perspectives.

    Thanks for switching on the light ;-)
    K.
  • 14. Re: REGEXP_ compare two tokenized strings
    -K- Guru
    Currently Being Moderated
    It's obvious the model clause performs much better, in spite of the regexp_ rules vs. just the one regexp_like.
    Nevertheless, 47 seconds seem just impossible. Something else must be interfering here. I get similar return times between the two solutions, not even near the abyss you got... weird.

    K.
1 2 Previous Next