This discussion is archived
1 Reply Latest reply: Jul 7, 2012 8:51 AM by sb92075 RSS

Dim and Fact Key sync and Clean up

getsaby Newbie
Currently Being Moderated
Hello all,

We have some really bizarre thing going on in our DW environment, same ETL jobs is running twice and because of extremely bad design with no constraints at all, we have lots of duplicate records getting inserted.
So the actual row count in DIM table should be 3 mill, but because of these multiple runs the count is like 9 mill rows. I'll illustrate this with an example.

This is the Dimension table, column (HEADER_ITEM ,LINE_ITEM) is the NATURAL_KEY, while DIM_SK is the PK, as you can see the table has three occurrence of NATURAL_KEY with increasing PK, but some time it may have same PK too, as there is no actual PK constraint on this table.

CREATE TABLE SABEGH_DIM
(
DIM_SK DECIMAL(10),
HEADER_ITEM VARCHAR2(10),
LINE_ITEM VARCHAR2(10)
);



INSERT INTO SABEGH_DIM VALUES (1,'A','X');
INSERT INTO SABEGH_DIM VALUES (2,'A','X');
INSERT INTO SABEGH_DIM VALUES (3,'A','X');

INSERT INTO SABEGH_DIM VALUES (4,'B','Y');
INSERT INTO SABEGH_DIM VALUES (5,'B','Y');
INSERT INTO SABEGH_DIM VALUES (6,'B','Y');

INSERT INTO SABEGH_DIM VALUES (11,'C','Z');
INSERT INTO SABEGH_DIM VALUES (11,'C','Z');
INSERT INTO SABEGH_DIM VALUES (11,'C','Z');

And then there is a FACT table too, which is one-to-one with this DIM.


CREATE TABLE SABEGH_FACT
(
FACT_SK DECIMAL(10),
COL1 VARCHAR2(10),
COL2 VARCHAR2(10),
COL3 VARChAR2(10)
);


INSERT INTO SABEGH_FACT VALUES (1,'RAIN','DIDI', 'FF');
INSERT INTO SABEGH_FACT VALUES (3,'RAIN','DIDI', 'FF');
INSERT INTO SABEGH_FACT VALUES (5,'CNG','ZION','TEST');
INSERT INTO SABEGH_FACT VALUES (11,'CNG','ZION','TEST');


What I would like to do some clean up, on both these tables.

SABEGH_DIM :

1. Preserve the Maximum Key within the group of same NATURAL_KEY in DIM table, if there are dups, load only one record. So my SABEGH_DIM table finally should have these values:

3......A.....X
6......B.....Y
11....C.... .Z

SABEGH_FACT:

Since this FACT table, may still refer the old PK from the DIM, we need to change the PK appropriately, so that it ties up with the DIM correctly. SABEGH_FACT should contain now this data set, which is the final results.
i replaced old keys with new PK from the Dim, and if after this operation there are dups in FACT, load only one occurrence.

3.......RAIN........     DIDI.........FF
6.......CNG........     ZION........TEST
11.....CNG........     ZION........TEST


These table are really huge, so it would be nice to have some fast query, I can add INDEXES, CONSTRAINTS; if it helps to speed up the process. Also, I would like to have a cross-reference table, which as both old keys and new keys, there are other FACT tables, which are still referring the OLD PK from this DIM table, so I have to update that FACT table key as well,

DIM_PK.......HEADER_ITEM.......LINE_ITEM.......MAX_DIM_PK
1......A.....     X.....3
2.....     A.....     X.....3
3.....     A.....     X.....3
4.....     B.....     Y.....6
5.....     B.....     Y.....6
6.....     B.....     Y.....6
11.....C.....     Z.....11
11....     C.....     Z.....11
11.....C.....     Z.....11


So, I'll use this Cross-Reference table to update KEY in other FACT table.

Thanks a lot!

Sabegh

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points