7 Replies Latest reply: Mar 30, 2013 11:52 AM by Etbin RSS

    need to identify the same people based on their (misspelled) names

    UserMB
      hello
      we have a table with persons and their name (first name and last name in 1 field) ;
      the names are often mispelled, so some string comparison is required;
      can you advise what the best approach would be to uniquely identify the same people?
      so far i have only found this functions :utl_match.jaro_winkler_similarity and SOUNDEX
      is there anything else 'out of box' i could make use of to implement the above?
      i appreciate any tips
      thanks very much
      rgds
        • 1. Re: need to identify the same people based on their (misspelled) names
          onedbguru
          Due to the number of permutations of any given name, even using SOUNDEX is not going to be 100% fool-proof. There are also "mis-spellings" of names that while sound the same are, in fact different. Take Smith, Smyth, etc... while "mis-spelled" can be different people. You should investigate using some other method of identifying individuals.
          • 2. Re: need to identify the same people based on their (misspelled) names
            JustinCave
            UTL_MATCH (either the Jaro-Winkler or the edit distance functions) would generally be preferred. SOUNDEX is a less sophisticated algorithm.

            In the general case, however, doing this sort of thing yourself is extremely difficult. There are commercial products out there that just help you do fuzzy matching on names. If you're going to build something yourself, you're likely going to spend a large amount of time trying to fine-tune the algorithm to try to balance type 1 (false positive, you match names that you shouldn't) and type 2 errors (false negative, you fail to match names that you should). To do it well will require a rather large number of meetings with users trying to figure out the appropriate balance of errors in your particular environment.

            Justin
            • 3. Re: need to identify the same people based on their (misspelled) names
              UserMB
              hello
              thank you for your answer
              there is unfortunately no other way then using names (no other identifiers are available for these people)
              i was thinking about giving it a 'weight' on how likley they are the same people or not;
              other than that i dont know how this can be done better
              rgds
              • 4. Re: need to identify the same people based on their (misspelled) names
                onedbguru
                Due to the number of permutations of any given name, even using SOUNDEX is not going to be 100% fool-proof. There are also "mis-spellings" of names that while sound the same are, in fact different. Take Smith, Smyth, etc... while "mis-spelled" can be different people. You should investigate using some other method of identifying individuals.
                • 5. Re: need to identify the same people based on their (misspelled) names
                  APC
                  UserMB wrote:
                  can you advise what the best approach would be to uniquely identify the same people?
                  There is no way you can do this with just names. Names simply aren't a UID. You need other factors for certainty.
                  so far i have only found this functions :utl_match.jaro_winkler_similarity and SOUNDEX
                  Those are the two generic functions that Oracle offers out of the box. Unless you're on 11g in which case you should have a look at the Name Search functionality which comes with Oracle Text. [url http://docs.oracle.com/cd/E11882_01/text.112/e24435/search.htm#CCAPP9534]Find out more.

                  Cheers, APC
                  • 6. Re: need to identify the same people based on their (misspelled) names
                    EdStevens
                    What about exactly the opposite situation ... two distinctly different people who happen to have exactly the same name?
                    • 7. Re: need to identify the same people based on their (misspelled) names
                      Etbin
                      Take a look at http://names.mongabay.com/most_common_surnames.htm
                      You could play with combined frequencies for first names (follow the Female First Names and Male First Names links therein) and surnames (like the Optimizer does with the Database statistical data) to get the expected cardinalities of how many different persons can have the same name (misspelled or not).
                      There are no exceptions around here too: http://en.wikipedia.org/wiki/List_of_the_most_common_surnames_in_Europe
                      Anyway not as rare as two children of the same mother having the same birth date and not being twins ;)

                      Regards

                      Etbin