
Special character standardization and cleansing

2711043 Member Posts: 2
edited Jul 14, 2014 4:54AM in Data Quality

For foreign character transliteration: can OEDQ examine a field's entry character by character, or would it need to do a token-by-token analysis against a reference list? Either way, I'd like to use a reference list containing special characters, but it would be more convenient to use a pure list of special characters rather than a long list of words (tokens) containing those characters.

Answers

  • Mike-Matthews-Oracle Member Posts: 1,544 Employee
    edited Jul 11, 2014 4:34AM

    Hi,

    A number of approaches are available here. EDQ comes with a Transliterate processor that will do this task (the data will likely need post-transliteration normalization, for example to remove diacritic marks and standardize accented characters), or you can use Character Replace if you want to perform precise character substitutions. Whole-token transcription is normally only used for scripts such as Arabic that cannot readily be transliterated using character-level rules or common APIs.
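
    For illustration, the character-level idea can be sketched in plain Java; this is a minimal sketch with names of my own, not EDQ's processor API. Unicode NFD decomposition strips most diacritic marks, while an explicit replacement map catches characters such as the Polish ł that do not decompose into a base letter plus a combining mark.

        import java.text.Normalizer;
        import java.util.Map;

        public class TransliterateSketch {
            // Explicit substitutions for characters that NFD decomposition
            // does not split into base letter + combining mark.
            static final Map<Character, String> REPLACEMENTS =
                    Map.of('ł', "l", 'Ł', "L", 'ø', "o", 'Ø', "O", 'ß', "ss");

            public static String toAscii(String input) {
                // Decompose accented characters (e.g. ó -> o + combining acute),
                // then drop the combining marks.
                String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
                                              .replaceAll("\\p{M}", "");
                StringBuilder out = new StringBuilder(decomposed.length());
                for (char c : decomposed.toCharArray()) {
                    out.append(REPLACEMENTS.getOrDefault(c, String.valueOf(c)));
                }
                return out.toString();
            }

            public static void main(String[] args) {
                System.out.println(toAscii("Gdańsk Łódź")); // prints "Gdansk Lodz"
            }
        }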

    Best practice for transforming names for matching purposes is encapsulated in the EDQ Customer Data Services Pack's built-in standardization services, which you can repurpose for other uses if needed.

    If your intention is to match data in UCM, the out-of-the-box services will do all this for you, matching both on the original-script names and on their converted forms, so test the service first and consider tuning it rather than attempting to reinvent the transformations.
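
    The "matching on both forms" point can be pictured with another illustrative snippet, reusing the toAscii helper from the sketch above; this is not the actual UCM/EDQ service, just the shape of the comparison: a pair is accepted if either the original-script values or their transliterated forms agree.

        // Illustrative only: compare a name pair on the original script and
        // on the transliterated form, taking the stronger signal.
        class DualFormComparison {
            static boolean matches(String a, String b) {
                return a.equalsIgnoreCase(b)
                    || TransliterateSketch.toAscii(a)
                           .equalsIgnoreCase(TransliterateSketch.toAscii(b));
            }
        }

    For example, matches("Gdańsk", "GDANSK") returns true even though the original-script comparison fails.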

    Regards,

    Mike

  • 2711043 Member Posts: 2

    Hi Mike,

    Thank you so much for taking the time to answer. Yes, these questions are based on work in UCM.

    Let me dive a little deeper and get specific. I think what you're saying is that OEDQ has OOTB functionality to support any character set; some (double-byte, Arabic) just take a little more work than others. What I want in this case is for Polish special characters to appear in the UI in their Polish form. I don't want them substituted with normalized English characters, or any other characters.

    In the matching process, we don't want attributes containing these special characters to auto-match with similar (but different) English versions. We will probably want them to enter the suspect-match queue for data steward (librarian) review. That is a matter for the match-rule tuning process, which will be the next thing I query this forum about.

    So, to confirm what I think you are saying in the previous reply: OEDQ's OOTB functionality can publish Polish special characters in the UI, and we can adjust the match rules during the tuning process to ensure that a data steward checks an exact match between a special-character value and its non-special-character counterpart?

    Thanks-Aaron

  • Mike-Matthews-Oracle Member Posts: 1,544 Employee

    Hi Aaron,

    Which UI do you mean? If the matches are reviewed in UCM, none of the manipulations EDQ makes to the data for matching purposes are visible; the user simply reviews the flagged records.

    If you want to use EDQ's match review UI, for example to do a match review as part of deduplicating the data before load, you can see both the original data and the manipulated forms (both are used for matching in any case).

    As far as auto-matching is concerned, it is simply a matter of configuring your auto-match threshold (in UCM) and EDQ's match scoring (EDQ match rules) appropriately for your needs.
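
    To picture the threshold logic: a small sketch with hypothetical band values and names, not UCM's or EDQ's actual configuration settings. Scores at or above the auto-match threshold merge automatically, a middle band goes to the suspect queue for steward review (as Aaron wants for the special-character cases), and the rest are treated as distinct records.

        // Hypothetical routing bands; the thresholds and names are examples,
        // not UCM or EDQ configuration settings.
        enum MatchOutcome { AUTO_MATCH, SUSPECT_REVIEW, NO_MATCH }

        class MatchRouter {
            final int autoMatchThreshold; // e.g. 90
            final int suspectThreshold;   // e.g. 70

            MatchRouter(int autoMatchThreshold, int suspectThreshold) {
                this.autoMatchThreshold = autoMatchThreshold;
                this.suspectThreshold = suspectThreshold;
            }

            MatchOutcome route(int matchScore) {
                if (matchScore >= autoMatchThreshold) return MatchOutcome.AUTO_MATCH;
                if (matchScore >= suspectThreshold) return MatchOutcome.SUSPECT_REVIEW;
                return MatchOutcome.NO_MATCH;
            }
        }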

    Regards,

    Mike
