Forum Stats

  • 3,758,414 Users
  • 2,251,384 Discussions
  • 7,870,187 Comments

Discussions

best way to cleanse huge data

if I have a huge table in data base.. around 40 million records that I have to check that all data are correct inside fields based on some business rules. The DB is an Oracle DB. let us assume this is my huge table... and the rules are:

  1. US only can have [national_id_type]=NID.. others must be VISA
  2. US only with NA can own house.. other nationality with VISA cant own a house


what I think

  1. first to land the table inside a landing table to preserve the orginal data
  2. rewrite error in the column to display the errors and the number of them inside a power bi dashboard so the quality department can fix them


what I will do is

case when nationality <> 'US' and NATINAL_ID_TYPE = 'VISA' then '99\US only can have [national_id_type]=NID' ELSE NATINAL_ID_TYPE END as NATINAL_ID_TYPE ,
case when nationality <> 'US' and NATINAL_ID_TYPE <> 'NA' AND own_house='YES' then '99\US only with NA can own house' ELSE NATINAL_ID_TYPE END as own_house

is this way efficient to detect errors in 40 million Record X 70 Columns I think this will be very slow...

can you help me with some tips I can use to provide the errors for dashboarding and cleansing that can be used for performance

thank you

Tagged:

Answers

  • Solomon Yakobson
    Solomon Yakobson Member Posts: 18,729 Black Diamond
    edited Jul 17, 2021 3:55PM

    40 million rows isn't very large table, although it is all relative to hardware - 40 million rows on my laptop would be very large table while 40 million rows on,say, Exadata would be just another table. Anyway, first question you need to consider is how clean your data is. If data is relatively clean I'd create FBI:

    CASE
      WHEN NATIONALITY = 'US' AND NATINAL_ID_TYPE = 'VISA' THEN 1
      WHEN NATIONALITY <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND OWN_HOUSE = 'YES' THEN 2
    END
    

    Such index will have small footprint since it only indexes "bad" rows and we know data is relatively clean. Now we can:

    UPDATE YOUR_TABLE
       SET NATINAL_ID_TYPE = '99\US only can have [national_id_type]=NID'
     WHERE 1 = CASE
                 WHEN NATIONALITY = 'US' AND NATINAL_ID_TYPE = 'VISA' THEN 1
                 WHEN NATIONALITY <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND OWN_HOUSE = 'YES' THEN 2
               END
    /
    COMMIT
    /
    UPDATE YOUR_TABLE
       SET OWN_HOUSE = '99\US only with NA can own house'
     WHERE 2 = CASE
                 WHEN NATIONALITY = 'US' AND NATINAL_ID_TYPE = 'VISA' THEN 1
                 WHEN NATIONALITY <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND OWN_HOUSE = 'YES' THEN 2
               END
    /
    COMMIT
    /
    
    

    And if data is quite dirty then use:

    UPDATE YOUR_TABLE
       SET NATINAL_ID_TYPE = CASE
                               WHEN NATIONALITY = 'US' AND NATINAL_ID_TYPE = 'VISA' THEN '99\US only can have [national_id_type]=NID'
                               ELSE NATINAL_ID_TYPE
                             END,
           OWN_HOUSE = CASE
                         WHEN NATIONALITY <> 'US' AND NATINAL_ID_TYPE <> 'NA' AND OWN_HOUSE = 'YES' THEN '99\US only with NA can own house'
                         ELSE OWN_HOUSE
                       END
     WHERE (
                NATIONALITY = 'US'
            AND
                NATINAL_ID_TYPE = 'VISA'
           )
        OR (
                NATIONALITY <> 'US'
            AND
                NATINAL_ID_TYPE <> 'NA'
            AND
                OWN_HOUSE = 'YES'
           )
    /
    COMMIT
    /
    

    SY.