
merge data deliveries with main table; report I/U's

484086 Member Posts: 73
edited Dec 14, 2008 9:37PM in General Database Discussions
@admin: I just saw this should probably be in 'SQL and PL/SQL'. Can you please move it there? Thanks

Hi all,

I was hoping to find some help solving the following problem in an efficient manner.
Basically, I would like to merge records into an existing table, but also get a result dataset containing all the updates/inserts.

Table T1:
a set of 40,000,000 records.

Incoming dataset (delivered up to 100 times daily) T2:
a total daily volume of around 100,000,000 records (e.g., 100 files containing 1,000,000 records each). Relative to the main table, some of these are new accno's and some are relevant change records (different product id). By far most, however, contain unchanged and/or old data (by timestamp) and can be discarded. Therefore we can say the main table is relatively stable (>> indexing?).

Both tables hold 1 row per account_no.

Two Requirements:
1) a refreshed table containing:
- all 'old' T1 recs that are not in T2
- all 'new' T2 recs that are not in T1
- all 'updated' T2 recs that are in T1 but have a different attr1 field and a timestamp at least <n> days newer than the timestamp of the record in T1

2) an output dataset containing:
- all new and updated records.
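To make the conditions concrete, here is a rough sketch of the matching logic in SQL (attr1 and ts are just placeholders for my actual column names, and :n is the day threshold):

```sql
-- Rough sketch only; attr1/ts are placeholder column names.
SELECT t2.*
FROM   t2
LEFT JOIN t1 ON t1.account_no = t2.account_no
WHERE  t1.account_no IS NULL          -- brand-new accounts
   OR (t2.attr1 <> t1.attr1          -- attribute changed
       AND t2.ts > t1.ts + :n);      -- and at least n days newer
```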

As the delivery may happen up to 100 times per day, I am not sure whether and how to use indexes, join into a temp table, etc...

any help is greatly appreciated!!

many thanks

PS: Oracle version is 10g.

Edited by: user481083 on 13-dec-2008 15:09

Edited by: user481083 on 13-dec-2008 15:17

Edited by: user481083 on 13-dec-2008 15:30
- date difference on timestamp should be >n days
- attri_1 fields should be different

Answers

  • 3520 Member Posts: 1,846
    This won't be a valid SQL statement, but rather the idea:

    1) Add column flag indicating WHEN the merge is done

    2)
    MERGE INTO t1 USING t2
    ON (t1.accno = t2.accno)
    WHEN NOT MATCHED THEN INSERT VALUES (..., flag_for_this_time)
    WHEN MATCHED THEN UPDATE SET t1.cols = t2.cols,
    flag = flag_for_this_time
    WHERE t2.attr1 - t1.attr1 >= n

    As a result you have:
    - merged rows into t1
    - possibility to get all inserted and updated rows from t1 with query filtering out just rows where flag = flag_for_this_time

    Flag can be date or number or varchar or whatever.

    The MERGE statement is doing an outer join, so simply watch the MERGE execution plan to check whether the join method and scan mechanism are OK for your data volumes and types.
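    As a hedged illustration of the flag idea in valid 10g syntax (accno, attr1, ts, and batch_id are assumed column names; :this_batch identifies the current load):

```sql
-- Illustrative sketch only; column names are assumptions.
MERGE INTO t1
USING t2
ON (t1.accno = t2.accno)
WHEN MATCHED THEN
  UPDATE SET t1.attr1    = t2.attr1,
             t1.ts       = t2.ts,
             t1.batch_id = :this_batch
  WHERE t1.attr1 <> t2.attr1
    AND t2.ts > t1.ts + :n_days
WHEN NOT MATCHED THEN
  INSERT (accno, attr1, ts, batch_id)
  VALUES (t2.accno, t2.attr1, t2.ts, :this_batch);

-- The I/U report is then just a filter on the flag:
SELECT * FROM t1 WHERE batch_id = :this_batch;
```

    Note the conditional WHEN MATCHED ... WHERE clause requires 10g, which matches the poster's version.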

    Gints Plivna
    http://www.gplivna.eu
  • OrionNet
    OrionNet Member Posts: 4,542
    edited Dec 13, 2008 6:34PM
    Hello,

     Use this and extend it however you like; I found this a couple of years ago and have been using it effectively since then.
     CREATE OR REPLACE PACKAGE klondike.merge_cnter
    AS
       i_inserting CONSTANT   NUMBER := 0;
       i_updating CONSTANT    NUMBER := 1;
    
       FUNCTION merge_cnter (action_in IN number DEFAULT i_inserting )
          RETURN NUMBER;
    
       FUNCTION get_update_cnt
          RETURN NUMBER;
    
       FUNCTION get_update_cnt (merge_cnt_in IN number)
          RETURN NUMBER;
    
       FUNCTION get_insert_cnt
          RETURN NUMBER;
    
       FUNCTION get_insert_cnt (merge_cnt_in IN number)
          RETURN NUMBER;
    END merge_cnter;
    /
    
    CREATE OR REPLACE PACKAGE BODY klondike.merge_cnter
    AS
       g_update_cnter   NUMBER NOT NULL := 0;
       g_insert_cnter   NUMBER NOT NULL := 0;
    
    
       FUNCTION merge_cnter (action_in IN number DEFAULT i_inserting )
          RETURN NUMBER
       IS
       BEGIN
          CASE action_in
             WHEN i_updating
             THEN
                g_update_cnter   := g_update_cnter + 1;
             WHEN i_inserting
             THEN
                g_insert_cnter   := g_insert_cnter + 1;
             ELSE
                RAISE PROGRAM_ERROR;
          END CASE;
    
          RETURN 0;
       END merge_cnter;
    
       FUNCTION get_update_cnt
          RETURN NUMBER
       IS
       BEGIN
          RETURN g_update_cnter;
       END get_update_cnt;
    
       FUNCTION get_update_cnt (merge_cnt_in IN number)
          RETURN NUMBER
       IS
       BEGIN
          RETURN NVL (merge_cnt_in - g_insert_cnter, 0);
       END get_update_cnt;
    
    
       FUNCTION get_insert_cnt
          RETURN NUMBER
       IS
       BEGIN
          RETURN g_insert_cnter;
       END get_insert_cnt;
    
    
       FUNCTION get_insert_cnt (merge_cnt_in IN number)
          RETURN NUMBER
       IS
       BEGIN
          RETURN NVL (merge_cnt_in - g_update_cnter, 0);
       END get_insert_cnt;
    
       
    END merge_cnter;
    /
    Edited by: OrionNet on Dec 13, 2008 6:34 PM
  • 484086 Member Posts: 73
    hi gints,

    thanks for the reply. I have updated my text a bit, the cause for an update should be:
    same acc_no, different attr_1 fields AND dtimestamp new is more than <n> days newer than timestamp old.

    As to the approach with a merge: this would work, I guess; however, after the insert a full table scan would have to be done on the flag.

    Not knowing if this is going to be quick, I was hoping to find some way of e.g. building up two temporary tables: one holding the U's and I's, which I could then report directly (simply unload it all), and another table (some MINUS of some sort) which, together with the first table, would form the final table (drop old, create new as select all from those two tables). As the incoming dataset is small, I could put an index on it; on the big new one, however, I couldn't really.

    i may be off to nowhere with that thought, so let me know if that's the case :)

    thanks again
  • OrionNet Member Posts: 4,542
    Hello,

    I just posted a small package counting I/U's for the merge, alongside the merge itself; you can always extend it to fit your needs.

    Regards
    OrionNet
  • 484086 Member Posts: 73
    edited Dec 13, 2008 6:55PM
    hi orionnet,

    many thanks for that; I am now reading the code and trying to understand a little how this would be used (feel free to send a PM/email :P).
    So you would suggest not using straight SQL statements, but working with PL/SQL and functions? Some sort of cursor? Is that not going to slow things down a tad?
    I guess I need to understand more of the approach :)

    cheers

    PS: Also, I wouldn't need the counts, but in fact the actual records that were updated/inserted. These have to be sent out to an external party. So effectively, the use of this is that the receiving party gets all accounts that have changed their <attribute 1> field lately, but their last change must be at least <n> days old.

    Edited by: user481083 on 13-dec-2008 15:52
  • OrionNet Member Posts: 4,542
    edited Dec 13, 2008 7:17PM
    Here is the test case so you can see how it's being used. In this case I wouldn't suggest using straight SQL, as it might get complex, but you can give it a try and see which solution fits your needs better. I find using functions/procedures/packages good practice: it brings reusability, and you can easily parameterize and extend the functionality of functions and procedures. As far as performance goes, I didn't see any difference on my side, but every system and dataset is different, so you have to try and find out.
    BEGIN
       MERGE INTO employee e
    USING (SELECT *
           FROM employee_stg) estg
    ON (e.empno = estg.empno)
       WHEN MATCHED
       THEN
          UPDATE SET e.ename = estg.ename, e.sal = estg.sal
       WHEN NOT MATCHED
       THEN
          INSERT (e.empno, e.ename, e.sal)
          VALUES (
                    CASE merge_cnter.merge_cnter (merge_cnter.i_inserting)
                       WHEN 0 THEN estg.empno
                    END,
                    estg.ename,
                    estg.sal
                 );
    
       DBMS_OUTPUT.put_line ('(' || TO_CHAR (sql%ROWCOUNT) || ') rows merged.');
    
    
       DBMS_OUTPUT.put_line (   '('
                             || TO_CHAR (merge_cnter.get_insert_cnt)
                             || ') rows inserted.');
       DBMS_OUTPUT.put_line (   '('
                             || TO_CHAR (merge_cnter.get_update_cnt (SQL%ROWCOUNT))
                             || ') rows updated.');
    END;
    /
    Regards
    OrionNet

    P.S: Let me know if this answers your question.

    Edited by: OrionNet on Dec 13, 2008 7:14 PM

    Edited by: OrionNet on Dec 13, 2008 7:16 PM
  • 484086 Member Posts: 73
    hi orionnet,

    thanks again
    I will have to take a look at this tomorrow; it's 1:45 AM here now.
    Many thanks for the help; this might be a good approach.

    arne
  • OrionNet Member Posts: 4,542
    Hello Arne,

    If it helped you and answered your question, please mark it as answered so others can benefit.

    Regards
    OrionNet
  • 484086 Member Posts: 73
    edited Dec 14, 2008 6:40PM
    Hi OrionNet,

    thanks again for the info. In order to understand your code correctly, is it true that at the end of the merging (where inserts and updates are performed), the output is only the count of U/I's?
    The problem in my case is that I actually have to have the updated and inserted rows, and write them to an output file.

    Like you suggested I had a go at doing it in SQL and came up with the following (i agree it is messy):

    1)

    acct is the 40,000,000-rec table; new_acct is the incoming staging table, 1,000,000 recs or so. We could put an index on one of them, maybe.

    insert first
    when (new_accno = -1) then into temp_acct values (cur_accno, cur_prodid, cur_date, cur_descr)
    when (cur_accno = -1) then into out_acct values (new_accno, new_prodid, new_date, new_descr)
    when (cur_prodid = new_prodid) then into temp_acct values (cur_accno, cur_prodid, cur_date, cur_descr)
    when (new_date -3 > cur_date) then into out_acct values (new_accno, new_prodid, new_date, new_descr)
    when (1=1) then into temp_acct values (cur_accno, cur_prodid, cur_date, cur_descr)
    select NVL(cur.accno, -1) cur_accno, cur.prodid cur_prodid, cur.date cur_date, cur.descr cur_descr,
    NVL(new.accno, -1) new_accno, new.prodid new_prodid, new.date new_date, new.descr new_descr
    from cur_acct cur full join new_acct new
    on cur.accno=new.accno;

    2) out_acct provides the table that can be extracted to the filesystem.
    To end up with the final new acct table, we would need one statement to create a new account table as the total of out_acct + temp_acct (these tables are mutually exclusive). UNION ALL would suffice in this case (much faster than UNION).
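    For step 2, I imagine the rebuild being something like this (column names are only illustrative; my real date column isn't actually called date, since that's a reserved word):

```sql
-- Illustrative sketch; accno/prodid/cdate/descr are placeholder columns.
-- temp_acct and out_acct are mutually exclusive, so UNION ALL is safe here.
CREATE TABLE acct_new AS
  SELECT accno, prodid, cdate, descr FROM temp_acct
  UNION ALL
  SELECT accno, prodid, cdate, descr FROM out_acct;

-- then swap the tables:
-- DROP TABLE acct;
-- RENAME acct_new TO acct;
```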

    I would be interested to hear what you think of this approach? Some indexing might speed up the stuff, e.g. on the small incoming staging table.

    PS: new_date - 3 > cur_date, i.e. anything not at least 3 days newer should be ignored.

    thanks

    Arne

    Edited by: Rustydud on 14-dec-2008 15:39
  • OrionNet Member Posts: 4,542
    Hello Arne

    You asked
    thanks again for the info. In order to understand your code correctly, is it true that at the end of the merging (where inserts and updates are performed), the output is only the count of U/I's?
    The problem in my case is that I actually have to have the updated and inserted rows, and write them to an output file.

    It is true: it will give you the output for inserts and updates, and in addition the total number of rows merged.
    E.g., 10 rows merged
    3 rows updated
    7 rows inserted.

    Now, you've got 40 million rows in the acct table and approx. 1 million rows in the new_acct table. Question: do you have any distinct column which could be used for partitioning the table? Meaning a date column or a region column? It would be helpful if you could post the structure of your acct and new_acct tables. Also, do you have any criteria to select records based on a key column (again, date or region), so that instead of scanning the entire table you can get just the subset of data that needs to be updated?
    select NVL(cur.accno, -1) cur_accno, cur.prodid cur_prodid, cur.date cur_date, cur.descr cur_descr,
    NVL(new.accno, -1) new_accno, new.prodid new_prodid, new.date new_date, new.descr new_descr
    from cur_acct cur full join new_acct new
    on cur.accno=new.accno
    *where <column name> = <some condition>;*
    How many distinct entries does the column acct_no contain?

    Select count(distinct acct_no) from acct;

    Lastly, indexing helps only if it's used in selective criteria (a where clause, or via a hint). Indexing is good when it comes to selects, but an overhead when it comes to adding data to the same table. Please post your table structures and some sample data if possible.
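    As a hedged sketch (assuming accno is the join key), indexing only the small staging table and leaving the 40M-row table alone could look like this:

```sql
-- Index only the small staging table; whether the optimizer actually
-- uses it for the full join depends on the execution plan.
CREATE INDEX new_acct_accno_ix ON new_acct (accno);

-- Refresh optimizer statistics so the plan reflects the new index:
BEGIN
   DBMS_STATS.gather_table_stats (USER, 'NEW_ACCT', cascade => TRUE);
END;
/
```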

    Regards
    OrionNet
This discussion has been closed.