We have some really bizarre thing going on in our DW environment, same ETL jobs is running twice and because of extremely bad design with no constraints at all, we have lots of duplicate records getting inserted.
So the actual row count in DIM table should be 3 mill, but because of these multiple runs the count is like 9 mill rows. I'll illustrate this with an example.
This is the Dimension table, column (HEADER_ITEM ,LINE_ITEM) is the NATURAL_KEY, while DIM_SK is the PK, as you can see the table has three occurrence of NATURAL_KEY with increasing PK, but some time it may have same PK too, as there is no actual PK constraint on this table.
INSERT INTO SABEGH_FACT VALUES (1,'RAIN','DIDI', 'FF');
INSERT INTO SABEGH_FACT VALUES (3,'RAIN','DIDI', 'FF');
INSERT INTO SABEGH_FACT VALUES (5,'CNG','ZION','TEST');
INSERT INTO SABEGH_FACT VALUES (11,'CNG','ZION','TEST');
What I would like to do some clean up, on both these tables.
1. Preserve the Maximum Key within the group of same NATURAL_KEY in DIM table, if there are dups, load only one record. So my SABEGH_DIM table finally should have these values:
Since this FACT table, may still refer the old PK from the DIM, we need to change the PK appropriately, so that it ties up with the DIM correctly. SABEGH_FACT should contain now this data set, which is the final results.
i replaced old keys with new PK from the Dim, and if after this operation there are dups in FACT, load only one occurrence.
These table are really huge, so it would be nice to have some fast query, I can add INDEXES, CONSTRAINTS; if it helps to speed up the process. Also, I would like to have a cross-reference table, which as both old keys and new keys, there are other FACT tables, which are still referring the OLD PK from this DIM table, so I have to update that FACT table key as well,