1 2 Previous Next 19 Replies Latest reply: Jul 16, 2007 4:57 PM by Vadim Tropashko-Oracle RSS

    REGEXP_ compare two tokenized strings

    -K-
      Hi,

      I need to compare two fields (from two tables), to see if any of the tokenized contents match.
      I have two tables with a "name" field, but the names might be incomplete or in random order. To get a match, I just need 1 part of both to really match, for example "John Doe" should match to "Doe J.".
      Something like:
      SELECT names1.name, names2.name
      FROM names1, names2
      WHERE INSTR(names1.name, REGEXP_SUBSTR(names2.name, '[[:alpha:]]+')) > 0
      ;
      But the statement should consider all tokens, not just the first one as in the example.
      Any suggestion on how to achieve this?

      Thanks,
      K.
        • 1. Re: REGEXP_ compare two tokenized strings
          cd_2
          For that you would need some sort of "loop" inside that sql statement.

          C.
          • 2. Re: REGEXP_ compare two tokenized strings
            -K-
            Doe that mean the only option is to revert to PL/SQL?
            • 3. Re: REGEXP_ compare two tokenized strings
              cd_2
              No. I meant you have to use a CONNECT BY, MODEL or even XML-query construct, here's an example:
              WITH t AS (SELECT 'John Doe' name1, 'Doe J.' name2
                           FROM dual
                        )
              SELECT name1
                   , name2
                   , pos
                FROM t
               MODEL
               DIMENSION BY (0 dim)
               MEASURES(name1, name2, 0 pos)
               RULES ITERATE (9) UNTIL (pos[0] > 0)
               (pos[0] = INSTR(name1[0], REGEXP_SUBSTR(name2[0], '[[:alpha:]]+')))
              ;    
              
              NAME1    NAME2         POS
              -------- ------ ----------
              John Doe Doe J.          6
              C.
              • 4. Re: REGEXP_ compare two tokenized strings
                Rob van Wijk
                An example using model and two tables:
                SQL> create table names1
                  2  as
                  3  select 1 id, 'John Doe' name from dual union all
                  4  select 2, 'Larry Ellison' from dual union all
                  5  select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual
                  6  /

                Tabel is aangemaakt.

                SQL> create table names2
                  2  as
                  3  select 1001 id, 'Doe J.' name from dual union all
                  4  select 1002, 'Gates Bill' from dual union all
                  5  select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
                  6  /

                Tabel is aangemaakt.

                SQL> select distinct n1.original_name name1
                  2       , n2.original_name name2
                  3    from ( select id
                  4                , name
                  5                , original_name
                  6             from names1
                  7            model
                  8                  return updated rows
                  9                  partition by (id)
                10                  dimension by (0 i)
                11                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
                12                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
                13                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
                14                  , original_name[iteration_number+1] = trim(name[0])
                15                  )
                16         ) n1
                17       , ( select id
                18                , name
                19                , original_name
                20             from names2
                21            model
                22                  return updated rows
                23                  partition by (id)
                24                  dimension by (0 i)
                25                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
                26                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
                27                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
                28                  , original_name[iteration_number+1] = trim(name[0])
                29                  )
                30         ) n2
                31   where n1.name = n2.name
                32  /

                NAME1                                               NAME2
                --------------------------------------------------- ---------------------------------------------------
                Catharina-Amalia Beatrix Carmen Victoria van Oranje Beatrix Wilhelmina Armgard van Oranje
                John Doe                                            Doe J.

                2 rijen zijn geselecteerd.
                Regards,
                Rob.
                • 5. Re: REGEXP_ compare two tokenized strings
                  -K-
                  Thanks both for your examples of the model clause. Never used this before and took me some time to understand...

                  Rob, your solution works just fine. Just hope the resulting query won't be too stressful on the database (still have to condition matches on more than part of the name, of course). Hartelijk bedankt!

                  Regards,
                  K.
                  • 6. Re: REGEXP_ compare two tokenized strings
                    494018
                    Perhaps I missed something, but isn't this all you need?
                    SQL*Plus: Release 10.2.0.1.0 - Production on Thu Jul 12 13:56:41 2007

                    create table n1
                    as
                    select 1 id, 'John Doe' name from dual union all
                    select 2, 'Larry Ellison' from dual union all
                    select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
                    select 4, null from dual
                    /

                    Table created.


                    create table n2
                    as
                    select 1001 id, 'Doe J.' name from dual union all
                    select 1002, 'Gates Bill' from dual union all
                    select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual
                    /

                    Table created.


                    column name wrap format a20

                    select n1.id, n1.name, n2.id, n2.name
                    from   n1, n2
                    where
                      regexp_instr
                      ( ' ' || n1.name || ' ',
                        ' ' || replace( n2.name, ' ', ' | ' ) || ' '
                      ) > 0
                    ;

                            ID NAME                         ID NAME
                    ---------- -------------------- ---------- --------------------
                             1 John Doe                   1001 Doe J.
                             3 Catharina-Amalia Bea       1003 Beatrix Wilhelmina A
                               trix Carmen Victoria            rmgard van Oranje
                                van Oranje


                    2 rows selected.
                    --
                    Joe Fuda
                    SQL Snippets
                    • 7. Re: REGEXP_ compare two tokenized strings
                      Rob van Wijk
                      Perhaps I missed something, but isn't this all you
                      need?
                      ...
                      regexp_instr
                      ( ' ' || n1.name || ' ',
                      ' ' || replace( n2.name, ' ', ' | ' ) || ' '
                      0
                      ...
                      Joe,

                      No, I think you are not missing anything.
                      This looks much better, although I'm not sure about performance.
                      To be continued ... ;-)

                      Regards,
                      Rob.
                      • 8. Re: REGEXP_ compare two tokenized strings
                        cd_2
                        erased Looks ok to me, after a second read. Further testing when I'm awake. ;-)

                        C.
                        • 9. Re: REGEXP_ compare two tokenized strings
                          Rob van Wijk
                          The performance of the regexp_instr in the where clause is terrible ... I stopped working out other scenarios, because the conclusion will be the same:
                          SQL> create table names1
                            2  as
                            3  select 1 id, 'John Doe' name from dual union all
                            4  select 2, 'Larry Ellison' from dual union all
                            5  select 3, 'Catharina-Amalia Beatrix Carmen Victoria van Oranje' from dual union all
                            6  select object_id, o
                            7    from ( select object_id
                            8                , owner || object_name o
                            9                , ntile(200) over (order by object_id) nt
                          10             from all_objects
                          11         )
                          12   where nt = 11
                          13  /

                          Tabel is aangemaakt.

                          SQL> create table names2
                            2  as
                            3  select 1001 id, 'Doe J.' name from dual union all
                            4  select 1002, 'Gates Bill' from dual union all
                            5  select 1003, 'Beatrix Wilhelmina Armgard van Oranje' from dual union all
                            6  select object_id, o
                            7    from ( select object_id
                            8                , owner || object_name o
                            9                , ntile(200) over (order by object_id) nt
                          10             from all_objects
                          11         )
                          12   where nt = 22
                          13  /

                          Tabel is aangemaakt.

                          SQL> select count(*) from names1
                            2  /

                            COUNT(*)
                          ----------
                                 241

                          1 rij is geselecteerd.

                          SQL> select count(*) from names2
                            2  /

                            COUNT(*)
                          ----------
                                 241

                          1 rij is geselecteerd.

                          SQL> exec dbms_stats.gather_table_stats(user,'names1')

                          PL/SQL-procedure is geslaagd.

                          SQL> exec dbms_stats.gather_table_stats(user,'names2')

                          PL/SQL-procedure is geslaagd.

                          SQL> set timing on
                          SQL> set autotrace on
                          SQL> select distinct n1.original_name name1
                            2       , n2.original_name name2
                            3    from ( select id
                            4                , name
                            5                , original_name
                            6             from names1
                            7            model
                            8                  return updated rows
                            9                  partition by (id)
                          10                  dimension by (0 i)
                          11                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
                          12                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
                          13                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
                          14                  , original_name[iteration_number+1] = trim(name[0])
                          15                  )
                          16         ) n1
                          17       , ( select id
                          18                , name
                          19                , original_name
                          20             from names2
                          21            model
                          22                  return updated rows
                          23                  partition by (id)
                          24                  dimension by (0 i)
                          25                  measures (' ' || name || ' ' name, cast(null as varchar2(100)) original_name)
                          26                  rules iterate (100) until (iteration_number = length(regexp_replace(name[0],'[^ ]')) - 2)
                          27                  ( name[iteration_number+1] = regexp_substr(name[0],'[^ ]+',1,iteration_number+1)
                          28                  , original_name[iteration_number+1] = trim(name[0])
                          29                  )
                          30         ) n2
                          31   where n1.name = n2.name
                          32  /

                          NAME1
                          ----------------------------------------------------------------------------------------------------
                          NAME2
                          ----------------------------------------------------------------------------------------------------
                          Catharina-Amalia Beatrix Carmen Victoria van Oranje
                          Beatrix Wilhelmina Armgard van Oranje

                          John Doe
                          Doe J.


                          2 rijen zijn geselecteerd.

                          Verstreken: 00:00:00.07
                          Uitvoeringspan
                          ----------------------------------------------------------
                             0      SELECT STATEMENT Optimizer=CHOOSE (Cost=8 Card=581 Bytes=987
                                    70)

                             1    0   SORT (UNIQUE) (Cost=8 Card=581 Bytes=98770)
                             2    1     HASH JOIN (Cost=7 Card=581 Bytes=98770)
                             3    2       VIEW (Cost=3 Card=241 Bytes=20485)
                             4    3         BUFFER (SORT)
                             5    4           SQL MODEL (ORDERED FAST)
                             6    5             TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3
                                    Card=241 Bytes=5784)

                             7    2       VIEW (Cost=3 Card=241 Bytes=20485)
                             8    7         BUFFER (SORT)
                             9    8           SQL MODEL (ORDERED FAST)
                            10    9             TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=3
                                    Card=241 Bytes=5784)





                          Statistics
                          ----------------------------------------------------------
                                  387  recursive calls
                                    0  db block gets
                                   52  consistent gets
                                    0  physical reads
                                    0  redo size
                                  590  bytes sent via SQL*Net to client
                                  508  bytes received via SQL*Net from client
                                    2  SQL*Net roundtrips to/from client
                                   11  sorts (memory)
                                    0  sorts (disk)
                                    2  rows processed

                          SQL> select n1.name name1
                            2       , n2.name name2
                            3    from names1 n1
                            4       , names2 n2
                            5   where regexp_instr  ( ' ' || n1.name || ' ',       ' ' || replace( n2.name, ' ', ' | ' ) || ' '  ) > 0
                            6  /

                          NAME1                                                        NAME2
                          ------------------------------------------------------------ ------------------------------------------------------------
                          John Doe                                                     Doe J.
                          Catharina-Amalia Beatrix Carmen Victoria van Oranje          Beatrix Wilhelmina Armgard van Oranje

                          2 rijen zijn geselecteerd.

                          Verstreken: 00:00:46.54
                          Uitvoeringspan
                          ----------------------------------------------------------
                             0      SELECT STATEMENT Optimizer=CHOOSE (Cost=218 Card=2904 Bytes=
                                    121968)

                             1    0   NESTED LOOPS (Cost=218 Card=2904 Bytes=121968)
                             2    1     TABLE ACCESS (FULL) OF 'NAMES1' (TABLE) (Cost=3 Card=241
                                     Bytes=5061)

                             3    1     TABLE ACCESS (FULL) OF 'NAMES2' (TABLE) (Cost=1 Card=12
                                    Bytes=252)





                          Statistics
                          ----------------------------------------------------------
                                    1  recursive calls
                                    0  db block gets
                                  728  consistent gets
                                    0  physical reads
                                    0  redo size
                                  590  bytes sent via SQL*Net to client
                                  508  bytes received via SQL*Net from client
                                    2  SQL*Net roundtrips to/from client
                                    0  sorts (memory)
                                    0  sorts (disk)
                                    2  rows processed
                          Regards,
                          Rob.
                          • 10. Re: REGEXP_ compare two tokenized strings
                            cd_2
                            Interesting, I wouldn't have thought that MODEL is really that fast.

                            C.
                            • 11. Re: REGEXP_ compare two tokenized strings
                              Vadim Tropashko-Oracle
                              select n1.id, n1.name, n2.id, n2.name
                              from n1, n2
                              where
                              regexp_instr
                              ( ' ' || n1.name || ' ',
                              ' ' || replace( n2.name, ' ', ' | ' ) || ' '
                              ) > 0
                              ;
                              Very nice, Joe!

                              The fact that you can do such impressive things with RegExp is not surprising at all, because there is rich Language Parsing theory that underlines it. Anyway, here is continuation of the theme:
                              http://vadimtropashko.wordpress.com/2007/07/12/string-decomposition-with-regexp/
                              • 12. Re: REGEXP_ compare two tokenized strings
                                cd_2
                                Regarding Model clause, could anybody please suggest
                                a single problem that can be solved with it (I
                                mean where it is not merely "yet another" ugly
                                solution)?
                                The documentation shows some examples. Ugly? I think
                                in some cases MODEL offers better readability than
                                alternative solutions, but that's just me.

                                C.
                                • 13. Re: REGEXP_ compare two tokenized strings
                                  -K-
                                  Aaah, this is what I was looking for!
                                  In the end it was so simple, I can't believe I didn't came up with this myself. A combination of keeping your head clear, focused and open for different perspectives.

                                  Thanks for switching on the light ;-)
                                  K.
                                  • 14. Re: REGEXP_ compare two tokenized strings
                                    -K-
                                    It's obvious the model clause performs much better, in spite of the regexp_ rules vs. just the one regexp_like.
                                    Nevertheless, 47 seconds seem just impossible. Something else must be interfering here. I get similar return times between the two solutions, not even near the abyss you got... weird.

                                    K.
                                    1 2 Previous Next