I'm trying to write an application design for a solution which will make use of the Oracle EDQ Siebel connector. EDQ looks very interesting, but I have difficulties understanding the documentation.
First question is about the live duplicate check. Software that performs a duplicate check needs access to all the data in the system - and so does EDQ. But how does that work? According to the installation document, mappings and database connections are only needed if a staging database is used - and staging databases are only necessary for batch jobs.
The live system (check accounts + contacts for duplicates on creation / update) is not a batch job: it's a web service.
But how will EDQ know which fields are to be used, if we do not have jobs > do not use staging databases > do not have mappings?
Thanks in advance...
EDQ uses a stateless architecture when providing matching services to applications. Matching uses two services - a key generation service, and a matching service.
First, the key gen service is run in batch on all existing records in the application. The keys are written to a simple table (pretty much Record ID + Value).
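As a rough illustration (this is not EDQ's actual key algorithm, just a stand-in to show the shape of the batch step), key generation amounts to computing one or more cluster key values per record and persisting them as (record ID, key value) rows:

```python
# Toy sketch of batch key generation. The key function (first three letters
# of surname + first initial) is purely illustrative; real EDQ key
# configurations are far more sophisticated.

def cluster_keys(record):
    """Return the set of cluster key values for one record."""
    surname = record["surname"].strip().upper()
    first = record["first_name"].strip().upper()
    keys = set()
    if surname:
        keys.add(surname[:3] + (first[:1] if first else ""))
    return keys

def build_key_table(records):
    """Batch run: one (record_id, key_value) row per key, as in the key table."""
    rows = []
    for rec in records:
        for key in cluster_keys(rec):
            rows.append((rec["id"], key))
    return rows

records = [
    {"id": 1, "surname": "Smith", "first_name": "John"},
    {"id": 2, "surname": "Smyth", "first_name": "Jane"},
]
print(build_key_table(records))  # [(1, 'SMIJ'), (2, 'SMYJ')]
```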
Then, for 'online processing', the clustering service is called for any record that is to be added or updated (the 'driving record'), and returns several key values.
The application then performs 'candidate selection' by querying the system for any records that share any of those key values. It then sends the driving record and its candidates to the matching service. The matching service returns any candidates that are a decent level of match (over a configurable threshold) with the driving record, and adds a score (how strong the match is) and other information about the nature of the match. (Siebel can only use the score.)
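The online flow above can be sketched end to end: key the driving record, select candidates that share any key value, score each candidate, and return those over the threshold. Every function body here is an illustrative stand-in for the real EDQ clustering and matching services, not their actual logic:

```python
# Hedged sketch of the online duplicate check, under the assumption that the
# key table is a list of (record_id, key_value) pairs held by the application.

def cluster_keys(record):
    """Stand-in for the EDQ key generation (clustering) service."""
    name = record["name"].strip().upper()
    return {name[:4]} if name else set()

def match_score(a, b):
    """Stand-in for the EDQ matching service: crude positional
    similarity scored 0-100."""
    x, y = a["name"].upper(), b["name"].upper()
    common = sum(1 for c1, c2 in zip(x, y) if c1 == c2)
    return int(100 * common / max(len(x), len(y)))

def check_duplicates(driving, key_table, records_by_id, threshold=70):
    keys = cluster_keys(driving)
    # Candidate selection: any record sharing a key with the driving record.
    candidate_ids = {rid for rid, key in key_table if key in keys}
    # Only candidates over the configurable threshold are returned, with score.
    results = [(rid, match_score(driving, records_by_id[rid]))
               for rid in candidate_ids]
    return sorted([r for r in results if r[1] >= threshold],
                  key=lambda r: -r[1])

records_by_id = {1: {"name": "John Smith"}, 2: {"name": "Joan Smith"}}
key_table = [(1, "JOHN"), (2, "JOAN")]
driving = {"name": "John Smyth"}
print(check_duplicates(driving, key_table, records_by_id))  # [(1, 90)]
```

Note that record 2 never reaches the scoring step at all: it shares no key with the driving record, which is exactly why candidate selection keeps the matching call cheap.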
The matches can then be processed by the application. This may include treating matches over a certain score as an automatic match, and/or may mean presenting possible matches for user review.
If the application is a hub, or otherwise has merge capability, it may then generate a modified master record, which then needs to be rekeyed (another call to the clustering service).
This approach ensures there is no need for synchronization of data between the application and EDQ, ensures transactional and matching integrity (for example, no lag in a record being available to match against), and means the DQ services can easily be scaled between many machines without any concern over the data they match against since it is being sent on messages. Performance of the EDQ services is extremely good due to EDQ's use of memory, multithreading etc.
Note that where EDQ is attached to Siebel, this architecture (that is, the ability to use the EDQ key generation service to ensure proper candidate selection) is only available from version 22.214.171.124 of Siebel.
The process is summarized in Section 6 of the Customer Data Services Pack Business Services Guide:
Okay - so if I get you (and the provided link) right,
Would you agree?
I've found the Siebel Integration Guide for EDQ now (http://www.oracle.com/technetwork/middleware/oedq/documentation/cdssiebel-1688412.pdf); there seems to be some valuable information inside.
It is on the central documentation page for EDQ (Oracle Enterprise Data Quality Documentation), I just did not find it. At first glance, it looks as if documentation for the most recent version of EDQ is at the top of that page, with only older versions further down. However, further down there are also links to documentation for other products, and the version numbers can be kind of misleading.
Please note that the document linked above is not the latest version.
All the latest EDQ documentation is in the documentation library here:
The keys are stored only in Siebel: S_DQ_CON_KEY for contacts, S_DQ_ACC_KEY for accounts.
You can profile the tables in EDQ if you connect it up to the Siebel database. This can be a useful way of ensuring the key generation settings are suitable for your data and volumes... you should aim for no more than 500 records that share a given key value.
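If you export the key rows (or query them via a profiling step), checking that 500-record guideline is a simple frequency count. The function and data below are illustrative, not part of any EDQ or Siebel API:

```python
# Sketch of a key-distribution check against rows from a key table such as
# S_DQ_CON_KEY: flag any key value shared by more than `limit` records,
# since oversized clusters hurt candidate-selection performance.

from collections import Counter

def oversized_keys(key_rows, limit=500):
    """key_rows: iterable of (record_id, key_value) pairs."""
    counts = Counter(key for _rid, key in key_rows)
    return {key: n for key, n in counts.items() if n > limit}

# Toy data with an artificially low limit for demonstration:
rows = [(i, "SMI") for i in range(6)] + [(10, "JON"), (11, "JON")]
print(oversized_keys(rows, limit=5))  # {'SMI': 6}
```

Any key value this flags is a sign the key configuration is too loose for your data volumes and should be tightened.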