
Loading 50-60 million records efficiently

Tina
Tina Member Posts: 249 Bronze Badge
edited Apr 13, 2020 11:57AM in SQL & PL/SQL

Hi Gurus,

  I am having some performance issues with a bulk load. I have tried to break down below what is happening and what I have tried.

  Could you suggest a better approach, or anything I am missing?

1)

    create table table_a
    (
      loan_nbr        number,
      yr_mnth_cd      varchar2(6 byte),
      start_dt        date,
      maturity_dt     date,
      number_of_terms number
    );
    -- primary key: loan_acct_nbr, yr_mnth_cd (unique index)
    -- indexed by: loan_nbr, yr_mnth_cd

    create table final_load_table
    (
      loan_nbr    number,
      yr_mnth_cd  varchar2(6 byte),
      ......,
      ......,
      ......
    );
    -- indexed by: loan_nbr, yr_mnth_cd

  

2)  function my_function (loan_nbr, loan_start_dt, maturity_dt, number_of_terms) return table_type

3)  The cursor below joins table_a to the function my_function, and the result set is inserted into final_load_table.

    One loan generates 50-65 rows from the function, equal to its NUMBER_OF_TERMS.

    Some business logic is applied to each row, and the row is then loaded into a collection and returned by the function.

    So I have roughly 50-60 million records every month.

    I have tried a few approaches to this load:

         1- using bulk collect and committing every 100K records

         2- direct insert (see the direct-path sketch after the cursor below):

               insert into FINAL_LOAD_TABLE
               select * from ... ;   -- the cursor query below

    However, both of these take a very long time. Is there any other approach we can take for this load?

    We are planning to partition final_load_table by yr_mnth_cd (one partition per month), along these lines:
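    (A sketch only - the partition names are illustrative and the column list is abbreviated as above.)

        create table final_load_table
        (
          loan_nbr    number,
          yr_mnth_cd  varchar2(6 byte)
          -- ...remaining columns as above...
        )
        partition by list (yr_mnth_cd)
        (
          partition p_201912 values ('201912'),
          partition p_202001 values ('202001')
          -- one partition added per month
        );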

               

Cursor:

    SELECT Y.*,
           TRUNC(SYSDATE,'MM'),
           USER
      FROM TABLE_A A
      JOIN TABLE(MY_FUNCTION(A.LOAN_NBR, A.START_DT, A.MATURITY_DT, A.NUM_OF_TERMS)) Y
        ON A.LOAN_NBR = Y.LOAN_ACCT_NBR
     WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month
       AND A.LOAN_TYP = 'ABC'
       AND NOT EXISTS ( SELECT 1
                          FROM FINAL_LOAD_TABLE L
                         WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                           AND Y.YR_MNTH_CD = L.YR_MNTH_CD );
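    (For reference, the direct-path version of approach 2 would be something like the following - APPEND is the standard Oracle direct-path hint; whether it helps depends on logging and index maintenance, and the column list is elided as above.)

        INSERT /*+ APPEND */ INTO FINAL_LOAD_TABLE
        SELECT ... ;   -- the cursor query above
        COMMIT;        -- a direct-path insert must be committed before the table can be queried again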

Thanks


Answers

  • John Thorton
    John Thorton Member Posts: 14,493 Silver Crown
    edited Jan 2, 2020 7:13PM

    Don't do in PL/SQL that which can be done in plain SQL

    Function (PL/SQL) requires expensive context switch for every row. SLOW

    COMMIT inside loop makes elapsed time LONGER

    Click & read both URLs below & respond accordingly!

    How do I ask a question on the forums?

    How to improve the performance of my query? / My query is running slow.

  • Paulzip
    Paulzip Member Posts: 8,811 Blue Diamond
    edited Jan 2, 2020 8:03PM

    Avoid the PL/SQL table function if you can. If you really can't, use a cardinality hint for it (set to 60) or dynamic sampling - Oracle will otherwise assume 8K rows for the execution plan (see the sketch at the end of this post).

    Use subquery cursor caching for the date filtering, it'll be evaluated once, rather than for every filtered row...

    A.YR_MNTH_CD =  (select TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM') from dual)

    Use insert into ... select...

    Post execution plan, it often highlights issues.
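    For example, a sketch of the hinting (cardinality is an undocumented hint, dynamic_sampling is the documented alternative; 60 matches your rows-per-loan estimate):

        select /*+ cardinality(y 60) */      -- or /*+ dynamic_sampling(y 2) */
               y.*
          from table_a a,
               table(my_function(a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y
         where a.yr_mnth_cd = (select to_char(add_months(sysdate,-1),'YYYYMM') from dual);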

  • BEDE
    BEDE Oracle Developer Member Posts: 2,475 Gold Trophy
    edited Jan 3, 2020 12:39AM

    About that function: if you really can't avoid using it, then at least it should be a pipelined function. From what you have written so far it does not appear to be pipelined, so it takes a lot of memory to store the whole table. Pipelined functions can also run in parallel against a table they fetch records from.

    Still, if the whole thing can't be done in one insert ... select, I'd go for the facilities of dbms_parallel_execute to chunk the data and insert it in chunks, along the lines sketched below.
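    A minimal sketch of the dbms_parallel_execute approach (the task name, chunk size and parallel level are illustrative, and the insert statement must keep the :start_id/:end_id rowid binds that the package supplies):

        begin
          dbms_parallel_execute.create_task('LOAD_FINAL');
          dbms_parallel_execute.create_chunks_by_rowid(
            task_name   => 'LOAD_FINAL',
            table_owner => user,
            table_name  => 'TABLE_A',
            by_row      => false,
            chunk_size  => 10000);
          dbms_parallel_execute.run_task(
            task_name      => 'LOAD_FINAL',
            sql_stmt       => 'insert into final_load_table
                               select ... /* the cursor query, restricted by */
                               where a.rowid between :start_id and :end_id',
            language_flag  => dbms_sql.native,
            parallel_level => 8);
          dbms_parallel_execute.drop_task('LOAD_FINAL');
        end;
        /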

  • AndrewSayer
    AndrewSayer Member Posts: 13,007 Gold Crown
    edited Jan 3, 2020 3:28AM

    Where is the time currently going? Is it executing your function, or is it doing the insert? I suggest you enable extended SQL tracing (a sketch follows at the end of this post) and run the process for a representative chunk of data (a couple of months?), then run tkprof against the resulting trace file; this should show you exactly where the time is spent and should indicate any easy wins.

    insert into .. select ..; is almost always faster and less memory-intensive than select bulk collect into; insert into;, because you're not having to copy everything into the PGA first. But this might not be the most important thing.

    Indexing tables and then bulk loading them is going to be slower than bulk loading and then indexing, but that might not be important here.
    Projecting an additional USER call per row is not going to be fun, but that might not be important.
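    For example, in SQL*Plus (the tracefile identifier and tkprof sort options are just one sensible choice):

        alter session set tracefile_identifier = 'bulk_load_test';
        exec dbms_monitor.session_trace_enable(waits => true, binds => false)

        -- ... run the load for a representative chunk ...

        exec dbms_monitor.session_trace_disable

        -- then, at the OS prompt (trace file name will vary):
        -- tkprof ORCL_ora_12345_BULK_LOAD_TEST.trc load.prf sort=exeela,fchela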

  • Tina
    Tina Member Posts: 249 Bronze Badge
    edited Jan 9, 2020 3:57PM

    I have changed my code to use a pipelined function, and even after that it is still slow.

    From my analysis, the NOT EXISTS clause is taking a very long time.

    Execution of this select (cursor) doesn't even complete in 4-5 hours.

    Query 1 (i.e. before the NOT EXISTS) returns roughly 60-70 million rows, and FINAL_LOAD_TABLE has about 50 million rows.

    Index on TABLE_A:           YR_MNTH_CD, LOAN_ACCT_NBR

    Index on FINAL_LOAD_TABLE:  LOAN_ACCT_NBR, YR_MNTH_CD

    I hope the order of the composite index doesn't make that much difference.

        SELECT Y.*,
               TRUNC(SYSDATE,'MM'),
               USER
          FROM TABLE_A A
          JOIN TABLE(MY_FUNCTION(A.LOAN_NBR, A.START_DT, A.MATURITY_DT, A.NUM_OF_TERMS)) Y
            ON A.LOAN_NBR = Y.LOAN_ACCT_NBR
         WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month
           AND A.LOAN_TYP = 'ABC'    -- Query1
           AND NOT EXISTS ( SELECT 1 FROM FINAL_LOAD_TABLE L   -- Query2
                             WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                               AND Y.YR_MNTH_CD = L.YR_MNTH_CD );

    I have included the explain plan below, just in case:

    [attached screenshot: execution plan]

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,117 Blue Diamond
    edited Jan 10, 2020 4:53AM

    A couple of questions about the SQL statement - which looks as if it should probably be used in an "insert as select".

    How many rows are there in final_load_table before you start your "insert as select"? I am wondering why the "not exists" is needed if it starts as an empty table. [UPDATE: rethinking your comments about "query 1" in your later post, do you have 50M rows in final_load_table before you start the insert - and if so, how many rows get inserted?]

    How many rows generated by the function will be eliminated by the "not exists"? I am wondering why the loan could exist with its monthly entries partially in place with some gaps needing filling. (Possibly one strategy is to delete any data for loans you are about to insert and commit, so you don't have to do the not exists subquery - a sketch follows below.)
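    A sketch of that delete-first idea (predicates copied from your cursor; verify the matching rules before using anything like this):

        delete from final_load_table l
         where l.yr_mnth_cd = to_char(add_months(sysdate,-1),'YYYYMM')
           and exists ( select null
                          from table_a a
                         where a.loan_nbr   = l.loan_acct_nbr
                           and a.yr_mnth_cd = l.yr_mnth_cd
                           and a.loan_typ   = 'ABC' );
        commit;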

    Part of the problem indicated by the plan is that Oracle doesn't know you're getting 50 - 60 rows from the function per input row, and it has then done its usual join arithmetic to produce an estimate of 5,326 rows to test with the filter subquery. If you can persuade it that there will be some 50M rows in the initial join, it might switch to a hash anti-join, which might be far more efficient. (Alternatively you could just hint that path.)

    As a first test, adding the hint /*+ cardinality(a, y, 60000000) */ to the SELECT may make the nested loop operation at (current) operation 2 report Rows = 60M, and that may be sufficient for the plan to switch to the hash anti-join.

    Regards

    Jonathan Lewis

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,117 Blue Diamond
    edited Jan 10, 2020 6:45AM

    I've just run up a little model of your query.

    Hinting the cardinality won't work: the filter subquery HAS to run as a filter, because it's not valid for Oracle to unnest it - the correlation predicates are illegal (they refer back to the table-function output).

    If you want to get a hash anti-join you will need to put your "query1" into an inline view (perhaps with a no_merge hint) and then apply the "not exists" subquery to the view alias, something like:

    select  v.*
    from    (
            select  /*+ no_merge */
                    y.*
            from
                    table_a a,
                    table(my_function(
                            a.loan_acct_nbr,
                            a.yr_mnth_cd,
                            a.start_dt,
                            a.maturity_dt,
                            a.number_of_terms
                    )) y
            where
                    a.yr_mnth_cd = '202001'
            )       v
    where   not exists (
                    select
                            /*+ unnest */
                            null
                    from    final_load_table_l l
                    where   l.loan_acct_nbr = v.loan_acct_nbr
                    and     l.yr_mnth_cd    = v.yr_mnth_cd
            )
    /

    If you have 50M rows already in the final_load_table_l then you're going to need a PGA of something like 1.2GB for the session to get an optimal hash join, so the join may go one-pass - but that might still be a lot better than your 50M or so probes of the PK index.

    Note - you should be using (basic) compression on the two-column indexes, there's a lot of repetition in their leading columns.
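    For reference, basic compression is just a rebuild option, e.g. (index name illustrative; COMPRESS 2 compresses both key columns, COMPRESS 1 only the leading one):

        alter index final_load_table_ix rebuild compress 2;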

    Regards

    Jonathan Lewis

  • Tina
    Tina Member Posts: 249 Bronze Badge
    edited Jan 10, 2020 11:45AM

    Thanks Jonathan for getting back so quickly and for your time and effort.

    FINAL_LOAD_TABLE has about 56 million records, and every month about 2 million records get inserted.

    We don't want to insert any row that has already been inserted before, and I can say the NOT EXISTS is what is taking the time.

    The NOT EXISTS uses Y.LOAN_ACCT_NBR and Y.YR_MNTH_CD, which is basically a data set coming from a table function (so it has no index as such).

    Also, the NOT EXISTS eliminates a large portion of query 1: out of the 52 million rows in the query 1 result set, only close to 2.2 million get inserted.

      SELECT Y.*,
             TRUNC(SYSDATE,'MM'),
             USER
        FROM TABLE_A A
        JOIN TABLE(MY_FUNCTION(A.LOAN_NBR, A.START_DT, A.MATURITY_DT, A.NUM_OF_TERMS)) Y
          ON A.LOAN_NBR = Y.LOAN_ACCT_NBR
       WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month
         AND A.LOAN_TYP = 'ABC'    -- Query1: about 52 million rows
         AND NOT EXISTS ( SELECT 1 FROM FINAL_LOAD_TABLE L   -- Query2: FINAL_LOAD_TABLE has about 56 million rows, growing by ~2 million per month
                           WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                             AND Y.YR_MNTH_CD = L.YR_MNTH_CD );

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 10,117 Blue Diamond
    edited Jan 10, 2020 12:55PM

    So you generate a lot of data that you throw away.

    You can make the mechanics a little more efficient by using basic compression on the indexes so that they are smaller and more likely to be buffered, but you'll still be running a filter subquery tens of millions of times to find the last few (couple of million) rows you're interested in.

    I've written up a model of what you're doing with more details on what the problem is (an inherent limitation of the optimizer) and how to work around it (which is the inline view approach I've given above): https://jonathanlewis.wordpress.com/2020/01/10/collection-limitation/

    The problem with the alternative strategy is that you'll need a large memory allocation for it to run really efficiently.

    Regards

    Jonathan Lewis

  • Tina
    Tina Member Posts: 249 Bronze Badge
    edited Jan 10, 2020 3:24PM

    Thanks Jonathan,

    For the calculations in this function I need all the rows (as many as the number of terms).

    After the calculations, only those rows that aren't already in the final table need to be loaded.

    Right now the function returns everything and we filter later (with the NOT EXISTS clause), and this is causing the slowness.

    Is there any way we can use all the rows (based on terms) for the calculation, but after the calculation insert only those rows that aren't in the final table? (A sketch of one possibility follows the function below.)

    I think that should reduce the time greatly.

    FUNCTION MY_FUNCTION (P_LN_NBR NUMBER, P_START_DT DATE, P_MATURITY_DT DATE, P_NUM_OF_TERMS NUMBER)
      RETURN T_TBL_TYPE
      PIPELINED
    AS
      V_RET T_TBL_TYPE := T_TBL_TYPE();
    BEGIN
      FOR REC IN (
                   select ln_nbr, yr_mnth_cd, amt_1, amt2, terms
                     from ....
                 )
      LOOP
        V_RET.EXTEND;
        -- CALCULATIONS....
        -- .................
        -- LOAD COLLECTION
        -- Can we load this collection in some way so that we only load those
        -- records that aren't in final_table?
        -- (Note: the constructor here has to be the element (row) type of
        --  T_TBL_TYPE - written as T_ROW_TYPE below, an assumed name; using
        --  the collection type T_TBL_TYPE itself as a row constructor would
        --  not compile.)
        V_RET(V_RET.COUNT) := T_ROW_TYPE( P_LN_NBR,
                                          YR_MNTH_CD,
                                          ....
                                          ....
                                        );
        PIPE ROW (V_RET(V_RET.LAST));
      END LOOP;
      RETURN;
    END;
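    What I'm imagining is something like the following sketch: probe the final table once per calculated row inside the loop, and only pipe the rows that aren't there yet. (T_ROW_TYPE and V_DUMMY are illustrative names; this still runs one indexed lookup per candidate row, so the set-based rewrite above may well remain faster.)

        -- inside the loop, after the calculations:
        BEGIN
          SELECT 1
            INTO V_DUMMY                 -- V_DUMMY NUMBER, declared in the function
            FROM FINAL_LOAD_TABLE L
           WHERE L.LOAN_ACCT_NBR = P_LN_NBR
             AND L.YR_MNTH_CD    = REC.YR_MNTH_CD;
          NULL;                          -- row already loaded: skip it
        EXCEPTION
          WHEN NO_DATA_FOUND THEN        -- not in the final table: pipe it
            PIPE ROW (T_ROW_TYPE(P_LN_NBR, REC.YR_MNTH_CD /* , ... */));
        END;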