4 Replies Latest reply: Jul 31, 2013 6:36 AM by 1003320

    Deduplicate Processor

    1003320

      Hi,

       

      With an incoming list of data we could have multiple contacts per account, but we only want our output to include one contact per account (the best contact).

       

      Currently I am giving each contact a score based on their job title. I then use "value from the highest" on that score when merging the data.

       

      However, the problem I am having is that in a lot of the matches several contacts share the same high score, and as a result I get a mash of both contacts (first name from one, last name from the other).

       

      Is there a way to select a single row when two records tie on "value from the highest"? I know you can raise an error when this occurs, but it would be far too time-consuming to manually pick a winning record each time records have the same score.

       

      Is this the best way to carry out what I want to do?

       

      Regards

        • 1. Re: Deduplicate Processor
          Nick Gorman-Oracle

          I'd start by coming up with a better scoring function for the data, as it's likely that a high percentage of the records won't have a job title. What are the business requirements for choosing the contact to output? Perhaps you can use a combination of attributes for scoring purposes. If you really need to make an arbitrary selection between two very similar records, then you could consider adding a unique sequence number to differentiate between the tied records.
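As a rough illustration of scoring on a combination of attributes, here is a minimal Python sketch. It is not EDQ configuration: the field names (`job_title`, `email`, `phone`), the title list, and all of the weights are hypothetical and would need to come from the real business requirements.

```python
# Hypothetical weights; real values would come from the business requirements.
TITLE_SCORES = {"ceo": 100, "director": 80, "manager": 60}


def contact_score(contact: dict) -> int:
    """Score a contact from several attributes rather than job title alone."""
    title = (contact.get("job_title") or "").strip().lower()
    score = TITLE_SCORES.get(title, 0)  # unknown or missing title scores 0
    # Reward more complete records, so a missing title is not the only signal.
    if contact.get("email"):
        score += 10
    if contact.get("phone"):
        score += 5
    return score
```

Ties are still possible with a score like this, so it reduces the number of arbitrary choices rather than removing them.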

           

          Alternatively, if more complex selection logic is required, it is possible to write custom selection functions, but I'd try to avoid this route if possible.

           

          regards,

          Nick

          • 2. Re: Deduplicate Processor
            1003320


            Thanks for your reply Nick,

             

            An arbitrary selection between the highest scoring records is what we want. We have a unique ID on each record in the incoming file.

             

            I tried adding the value from this ID as the second input for the merge, but it didn't seem to use this ID as a secondary criterion when there was a tie.

             

            How do I configure the merge processor to choose a secondary "value from the highest" if there is a tie?

            • 3. Re: Deduplicate Processor
              Nick Gorman-Oracle

              The use of a secondary value would require a custom selection function.
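To make the tie-break logic concrete, here is a minimal Python sketch of what a selection with a secondary value has to do. It runs outside EDQ and does not use EDQ's custom selection function interface; the field names (`account`, `id`, `score`) are assumptions for illustration. The key point is that the whole record is chosen at once, so attributes are never mixed between contacts.

```python
from itertools import groupby
from operator import itemgetter


def best_contact_per_account(contacts):
    """Return one whole record per account.

    The winner has the highest score; ties are broken by the highest id.
    """
    # groupby only groups adjacent items, so sort by account first.
    ordered = sorted(contacts, key=itemgetter("account"))
    winners = []
    for _, group in groupby(ordered, key=itemgetter("account")):
        # Tuples compare element by element: score first, then id.
        winners.append(max(group, key=lambda c: (c["score"], c["id"])))
    return winners
```

Because each ID is unique, the `(score, id)` pair is unique within an account, so there is always exactly one winner.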

               

              How about combining the score with your ID, so that the ID forms the decimal part of the number? That way the comparison will always produce a winner.

               

              For example:

               

              Record 1

              ID = 123

              Score = 100.123

               

              Record 2

              ID = 456

              Score = 100.456

               

              100.456 > 100.123 so record 2 wins!

               

              You'll probably need to do this by concatenating strings and then converting to a numeric attribute at the end.
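The concatenate-then-convert step can be sketched in Python as follows. This is an illustration under assumptions, not EDQ configuration: it assumes non-negative integer IDs of at most `ID_WIDTH` digits (a hypothetical limit) and left-pads the ID with zeros. Without padding, IDs such as 5 and 50 would give 100.5 and 100.50, which are the same number, so the tie would return. `Decimal` is used so that long IDs are not rounded the way they could be in a binary floating-point value.

```python
from decimal import Decimal

ID_WIDTH = 9  # assumed maximum number of digits in an ID


def composite_score(score: int, record_id: int) -> Decimal:
    """Build score.id as an exact decimal, with the ID zero-padded to a fixed width."""
    if record_id < 0 or len(str(record_id)) > ID_WIDTH:
        raise ValueError("ID does not fit in the assumed width")
    # Concatenate as strings, then convert to a number at the end.
    return Decimal(f"{score}.{record_id:0{ID_WIDTH}d}")
```

With padding, the two records in the example become 100.000000123 and 100.000000456, and record 2 still wins. A higher whole-number score always beats a lower one, because the ID only ever affects the fractional part.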

               

              regards,

              Nick

              • 4. Re: Deduplicate Processor
                1003320

                Thanks for this, Nick, it's greatly appreciated. This works well.