3 Replies Latest reply: Mar 14, 2013 10:55 PM by Barbara Boehmer RSS

    Address Matching Using Oracle Text

    user11181058
      Hi,

      I am a newbie to Oracle Text. Hence please pardon my ignorance if this is a "RTFM" Query.

      We would like to clash our customer addresses against a table (GEO) that has addresses and geographic coordinates (latitude, longitude etc). The customer address data are in four VARCHAR2 columns (address1, address2, address3 & address4) and need not be in the same order as the address data in the GEO table.

      Has anybody used Oracle Text to do similar work to score addresses?

      Any links & pointers will be highly appreciated.

      Thanks in advance.

      Best Regards

      Ramdas
        • 1. Re: Address Matching Using Oracle Text
          Barbara Boehmer
          You could use a multi_column_datastore to search all four columns and use some replace statements to replace spaces in the data in the geo table with commas, removing resulting double commas, then order by the descending score. The comma is the accumulate operator. The end result should be that you get a list of pairs of values in the customers table and geo table in descending order of similarity. Please see the example below.
          SCOTT@orcl_11gR2> CREATE TABLE customers
            2    (address1  VARCHAR2(30),
            3       address2  VARCHAR2(30),
            4       address3  VARCHAR2(30),
            5       address4  VARCHAR2(30))
            6  /
          
          Table created.
          
          SCOTT@orcl_11gR2> INSERT INTO customers VALUES
            2    ('Oracle Pkwy', '500', 'CA 94065', 'Redwood City')
            3  /
          
          1 row created.
          
          SCOTT@orcl_11gR2> INSERT INTO customers VALUES
            2    ('123', 'Some Street', 'Some City', 'CA')
            3  /
          
          1 row created.
          
          SCOTT@orcl_11gR2> BEGIN
            2    CTX_DDL.CREATE_PREFERENCE
            3        ('cust_datastore',
            4         'MULTI_COLUMN_DATASTORE');
            5    CTX_DDL.SET_ATTRIBUTE
            6        ('cust_datastore',
            7         'COLUMNS',
            8         'address1, address2, address3, address4');
            9  END;
           10  /
          
          PL/SQL procedure successfully completed.
          
          SCOTT@orcl_11gR2> CREATE INDEX customers_idx
            2  ON customers (address1)
            3  INDEXTYPE IS CTXSYS.CONTEXT
            4  PARAMETERS ('DATASTORE cust_datastore')
            5  /
          
          Index created.
          
          SCOTT@orcl_11gR2> CREATE TABLE geo
            2    (address      VARCHAR2(40),
            3       coordinates  VARCHAR2(30))
            4  /
          
          Table created.
          
          SCOTT@orcl_11gR2> INSERT INTO geo VALUES
            2    ('500 Oracle Pkwy, Redwood City, CA 94065',
            3       '37° 31'' N / 122° 15'' W')
            4  /
          
          1 row created.
          
          SCOTT@orcl_11gR2> INSERT INTO geo VALUES
            2    ('123 Some Street, Some City, CA',
            3       NULL)
            4  /
          
          1 row created.
          
          SCOTT@orcl_11gR2> SELECT SCORE(1), c.*, g.*
            2  FROM   customers c, geo g
            3  WHERE  CONTAINS
            4             (c.address1,
            5              REPLACE (REPLACE (g.address, ' ', ','), ',,', ','),
            6              1) > 0
            7  ORDER  BY SCORE(1) DESC
            8  /
          
            SCORE(1) ADDRESS1                       ADDRESS2
          ---------- ------------------------------ ------------------------------
          ADDRESS3                       ADDRESS4
          ------------------------------ ------------------------------
          ADDRESS                                  COORDINATES
          ---------------------------------------- ------------------------------
                  86 Oracle Pkwy                    500
          CA 94065                       Redwood City
          500 Oracle Pkwy, Redwood City, CA 94065  37° 31' N / 122° 15' W
          
                  76 123                            Some Street
          Some City                      CA
          123 Some Street, Some City, CA
          
                  26 Oracle Pkwy                    500
          CA 94065                       Redwood City
          123 Some Street, Some City, CA
          
                  15 123                            Some Street
          Some City                      CA
          500 Oracle Pkwy, Redwood City, CA 94065  37° 31' N / 122° 15' W
          
          
          4 rows selected.
          • 2. Re: Address Matching Using Oracle Text
            user11181058
            Hi Barbara,

            Thx for the Quick response.

            A couple of queries.

            If I would like the scoring to be based on either address1 or address2 or address3 or address4 being found in the address in the GEO table is it just a matter of adding the following clause

            OR CONTAINS (c.address2, REPLACE (REPLACE (g.address, ' ', ','), ',,', ','),1) > 0
            OR CONTAINS (c.address3, REPLACE (REPLACE (g.address, ' ', ','), ',,', ','),1) > 0
            OR CONTAINS (c.address4, REPLACE (REPLACE (g.address, ' ', ','), ',,', ','),1) > 0

            Will the scoring be based on address1, address2, address3 & address4 being found in GEO.address1 and can a higher weightage be given for address1 and address2 compared to address3 and address4.

            Best Regards

            Ramdas
            • 3. Re: Address Matching Using Oracle Text
              Barbara Boehmer
              When you use a multi_column_datastore, even though the index is only created using one column name and only that column name is used in the contains clause, it searches all columns named in the multi_column_datastore. It may be clearer if you use a dummy column with a name that has more meaning; I have used addresses in the revised example below.

              Queries run faster if you put everything in one contains clause instead of using multiple contains clauses.

              If you add a section group and field sections, then you can search within those sections and apply weights. In the following example I have doubled the score for results in address1 and address2 and halved the score for results in address3 and address4.
              SCOTT@orcl_11gR2> CREATE TABLE customers
                2    (id       NUMBER,
                3       address1  VARCHAR2(30),
                4       address2  VARCHAR2(30),
                5       address3  VARCHAR2(30),
                6       address4  VARCHAR2(30),
                7       addresses VARCHAR2(1))
                8  /
              
              Table created.
              
              SCOTT@orcl_11gR2> INSERT INTO customers VALUES
                2    (1, '123 Somewhere, Someplace', 'nowhere', 'nowhere', 'nowhere', null)
                3  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> INSERT INTO customers VALUES
                2    (2, 'nowhere', 'nowhere', 'nowhere', '123 Somewhere, Someplace', null)
                3  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> INSERT INTO customers VALUES
                2    (3, 'nowhere', '500 Oracle Pkwy', 'nowhere', 'nowhere', null)
                3  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> INSERT INTO customers VALUES
                2    (4, 'nowhere', 'nowhere', '500 Oracle Pkwy', 'nowhere', null)
                3  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> 
              SCOTT@orcl_11gR2> BEGIN
                2    CTX_DDL.CREATE_PREFERENCE
                3        ('cust_datastore',
                4         'MULTI_COLUMN_DATASTORE');
                5    CTX_DDL.SET_ATTRIBUTE
                6        ('cust_datastore',
                7         'COLUMNS',
                8         'address1, address2, address3, address4');
                9    CTX_DDL.CREATE_SECTION_GROUP
               10        ('cust_sg', 'BASIC_SECTION_GROUP');
               11    CTX_DDL.ADD_FIELD_SECTION ('cust_sg', 'address1', 'address1', true);
               12    CTX_DDL.ADD_FIELD_SECTION ('cust_sg', 'address2', 'address2', true);
               13    CTX_DDL.ADD_FIELD_SECTION ('cust_sg', 'address3', 'address3', true);
               14    CTX_DDL.ADD_FIELD_SECTION ('cust_sg', 'address4', 'address4', true);
               15  END;
               16  /
              
              PL/SQL procedure successfully completed.
              
              SCOTT@orcl_11gR2> CREATE INDEX customers_idx
                2  ON customers (addresses)
                3  INDEXTYPE IS CTXSYS.CONTEXT
                4  PARAMETERS
                5    ('DATASTORE     cust_datastore
                6        SECTION GROUP cust_sg')
                7  /
              
              Index created.
              
              SCOTT@orcl_11gR2> CREATE TABLE geo
                2    (address      VARCHAR2(40),
                3       coordinates  VARCHAR2(30))
                4  /
              
              Table created.
              
              SCOTT@orcl_11gR2> INSERT INTO geo VALUES
                2    ('500 Oracle Pkwy, Redwood City, CA 94065',
                3       '37° 31'' N / 122° 15'' W')
                4  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> INSERT INTO geo VALUES
                2    ('123 Somewhere Street, Someplace City, CA',
                3       NULL)
                4  /
              
              1 row created.
              
              SCOTT@orcl_11gR2> SELECT SCORE(1), c.*, g.address, g.coordinates
                2  FROM   customers c,
                3           (SELECT address,
                4                '(' || REPLACE (REPLACE (address, ' ', ','), ',,', ',') || ')' addr,
                5                coordinates
                6            FROM   geo) g
                7  WHERE  CONTAINS
                8             (c.addresses,
                9              '((' || g.addr || ' WITHIN address1 OR ' ||
               10                   g.addr || ' WITHIN address2) * 2) OR ' ||
               11              '((' || g.addr || ' WITHIN address3 OR ' ||
               12                   g.addr || ' WITHIN address4) * 0.5)',
               13              1) > 0
               14  ORDER  BY SCORE(1) DESC
               15  /
              
                SCORE(1)         ID ADDRESS1
              ---------- ---------- ------------------------------
              ADDRESS2                       ADDRESS3
              ------------------------------ ------------------------------
              ADDRESS4                       A ADDRESS
              ------------------------------ - ----------------------------------------
              COORDINATES
              ------------------------------
                      68          1 123 Somewhere, Someplace
              nowhere                        nowhere
              nowhere                          123 Somewhere Street, Someplace City, CA
              
              
                      59          3 nowhere
              500 Oracle Pkwy                nowhere
              nowhere                          500 Oracle Pkwy, Redwood City, CA 94065
              37° 31' N / 122° 15' W
              
                      17          2 nowhere
              nowhere                        nowhere
              123 Somewhere, Someplace         123 Somewhere Street, Someplace City, CA
              
              
                      15          4 nowhere
              nowhere                        500 Oracle Pkwy
              nowhere                          500 Oracle Pkwy, Redwood City, CA 94065
              37° 31' N / 122° 15' W
              
              
              4 rows selected.