This content has been marked as final. Show 6 replies
The primary key is a technical key for the unicity of the row.
The Natural Key is a datawarehouse concept, it's the key that represents the source. It is USED during the ETL Process to detect the row that must be updated
In a type 2 slowly changing dimension, you historize data. In a "dimension customer" table, the "customer ID" that comes from source is your natural key. But the primary key cannot be this customer ID, since the customer ID will be present for many ROWS (because of historisation).
In the SCD table, the unicity is given by "natural key + historisation date".
For your primary key, the best practice is to create a new field called "surrogate key". The surrogate key will be unique (like a sequence).
In source table, you have these 2 rows
customer ID / Name / City
01526 / Mr John / Paris
07399 / Mr Walter / New York
Customer ID is the primary key in this source table.
Suppose that Mr John was in Versailles before going to Paris. He moved this year, the 05 of January.
Our dimension will historize that :
surrogate key / natural key / Name / City / Date beginning / Date end / Flag Actif
00003 / 01526 / Mr John / Paris / 2013-01-05 / null / O
00002 / 01526 / Mr John / Versailles / 2012-01-01 / 2013-01-05 / N
00001 / 07399 / Mr Walter / New York / 2010-07-15 / null / O
As you can see, the Natural Key is the customer ID. But it can't be the primary key, since it's not unique.
The ETL process needs to identify the row in the target table if he wants to historize it. This identification is done thanks to the Natural Key.
Primary key is database construct that is unique to each row and defines that entity as unique within that table.
If it is a surrogate key or is a 'made up' value, e.g a sequence ID it doesnt offer anything to actually identify that row (its just a number)
A natural key however is based on a business value that means something to that row, and is unique to that entity, e.g Employee Number, Product Code etc. etc.
A natural key is not always the primary key though, it uniquely identifies one or more rows (e.g in a Slowly changing Dimension) in the dimension.
Hope this helps,
When natural key is not sufficient enough to identify a record uniquely, we took the surrogate key as the primary key.
But the surrogate keys also come with some disadvantages. The values of surrogate keys have no relationship with the real world meaning of the data held in a row. Therefore over usage of surrogate keys (often in the name of “standardization”) lead to the problem of disassociation and creates unnecessary ETL burden and performance degradation.
So before assigning a surrogate key, try to find our if any other unique not null natural key is there or not. If it is then avoid the surrogate key else go for it.
Thanks so much for explaining so clearly.
Please close the thread (mark it as "answered").
And do not hesitate to mark usefull posts as "correct" or "helpful".
thanks a lot