15 Replies Latest reply on Apr 13, 2020 3:57 PM by EdStevens

    Loading 50-60 million records efficiently

    Tina

      Hi Gurus,

        I am having some performance issues with a bulk load and have tried to break down below what is happening and what I am trying.

        Would you be able to suggest a better approach, or anything I am missing?

       

       

      1)

          create table table_a
          (
            loan_nbr        number,
            yr_mnth_cd      varchar2(6 byte),
            start_dt        date,
            maturity_dt     date,
            number_of_terms number
          );

          -- primary key (unique index): loan_acct_nbr, yr_mnth_cd
          -- additional index:           loan_nbr, yr_mnth_cd

        

        

          create table final_load_table
          (
            loan_nbr    number,
            yr_mnth_cd  varchar2(6 byte),
            ......,
            ......,
            ......
          );

          -- indexed by loan_nbr, yr_mnth_cd

        

       

       

      2)  function my_function ( loan_nbr,loan_start_dt,maturity_dt,number_of_terms) return table_type

       

       

      3)  The cursor below is used in code that joins table_a to the function my_function, and that result set is inserted into final_load_table.

          One loan generates 50-65 rows from the function, equal to its NUMBER_OF_TERMS.

          Some business logic is applied to each row, then the row is loaded into a collection and returned by the function,

          so I have roughly 50-60 million records every month.

          I have tried a few approaches for loading this:

               1- using bulk collect and committing every 100K records

               2- direct insert:

                     insert into FINAL_LOAD_TABLE
                     select * from ( <the cursor query below> );

          However, both of these take a very long time. Is there any other approach we can take for this load?

          We are planning to partition final_load_table on yr_mnth_cd (one partition per month).
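
          Something along these lines is what we have in mind (just a sketch; partition names and bounds are placeholders and the real column list is elided as above):

              create table final_load_table
              (
                loan_nbr    number,
                yr_mnth_cd  varchar2(6 byte)
                -- ... remaining columns ...
              )
              partition by range (yr_mnth_cd)
              (
                partition p_202001 values less than ('202002'),
                partition p_202002 values less than ('202003'),
                partition p_max    values less than (maxvalue)
              );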

                     

       

      Cursor:

          SELECT Y.*,

                 TRUNC(SYSDATE,'MM'),

                 USER

           FROM TABLE_A A

           JOIN TABLE(MY_FUNCTION (A.LOAN_NBR,A.START_DT,A.MATURITY_DT,A.NUM_OF_TERMS)) Y

                 ON A.LOAN_NBR = Y.LOAN_ACCT_NBR

           WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month

           AND A.LOAN_TYP='ABC'

           AND NOT EXISTS ( SELECT 1 FROM FINAL_LOAD_TABLE L
                             WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                               AND Y.YR_MNTH_CD    = L.YR_MNTH_CD );
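
      For clarity, approach 2 spelled out in full would be a single statement like the one below (a sketch only; the APPEND hint for a direct-path load is just something we are considering, it is not in the current code):

          insert /*+ append */ into final_load_table
          select y.*,
                 trunc(sysdate,'MM'),
                 user
            from table_a a
            join table(my_function(a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y
                 on a.loan_nbr = y.loan_acct_nbr
           where a.yr_mnth_cd = to_char(add_months(sysdate,-1),'YYYYMM')
             and a.loan_typ = 'ABC'
             and not exists ( select 1 from final_load_table l
                               where l.loan_acct_nbr = y.loan_acct_nbr
                                 and y.yr_mnth_cd    = l.yr_mnth_cd );
          commit;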

       

      Thanks

        • 1. Re: Loading 50-60 million records efficiently
          John Thorton

          Don't do in PL/SQL that which can be done in plain SQL

           

          Function (PL/SQL) requires expensive context switch for every row. SLOW

          COMMIT inside loop makes elapsed time LONGER

           

          Click & read both URLs below & respond accordingly!

           

          How do I ask a question on the forums?

          How to  improve the performance of my query? / My query is running slow.

          • 2. Re: Loading 50-60 million records efficiently
            Paulzip

            Avoid the PL/SQL table function if you can; if you really can't, use a cardinality hint for it (set to 60) or dynamic sampling (otherwise Oracle will assume 8K rows for the execution plan).

             

            Use scalar subquery caching for the date filter; it'll be evaluated once, rather than for every filtered row...

            A.YR_MNTH_CD =  (select TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM') from dual)

             

            Use insert into ... select...

             

            Post execution plan, it often highlights issues.
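
            Roughly, the first two suggestions together would look like this (the 60 figure comes from your number of terms per loan; treat it as a sketch, not tested code):

                select /*+ cardinality(y 60) */
                       y.*, trunc(sysdate,'MM'), user
                  from table_a a
                  join table(my_function(a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y
                       on a.loan_nbr = y.loan_acct_nbr
                 where a.yr_mnth_cd = (select to_char(add_months(sysdate,-1),'YYYYMM') from dual)
                   and a.loan_typ = 'ABC';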

            • 3. Re: Loading 50-60 million records efficiently
              BEDE

              About that function: if you really can't avoid using it, then at least it should be a pipelined function. From what you have written so far, it does not appear to be pipelined, so it takes a lot of memory to store the whole table. Pipelined functions can also be parallel-enabled against the table they fetch records from.

              Still, if the whole thing can't be done in one insert ... select, I'd go for using dbms_parallel_execute to chunk the data and have it inserted in chunks.
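
              A minimal sketch of that dbms_parallel_execute approach, chunking table_a by rowid (the task name, chunk size and parallel level are arbitrary, the insert is only illustrative, and the NOT EXISTS filtering is left out for brevity):

                  begin
                    dbms_parallel_execute.create_task('load_final_table');

                    dbms_parallel_execute.create_chunks_by_rowid(
                      task_name   => 'load_final_table',
                      table_owner => user,
                      table_name  => 'TABLE_A',
                      by_row      => true,
                      chunk_size  => 100000);

                    dbms_parallel_execute.run_task(
                      task_name      => 'load_final_table',
                      sql_stmt       => 'insert into final_load_table
                                           select y.*, trunc(sysdate,''MM''), user
                                             from table_a a,
                                                  table(my_function(a.loan_nbr, a.start_dt,
                                                                    a.maturity_dt, a.num_of_terms)) y
                                            where a.rowid between :start_id and :end_id
                                              and a.loan_nbr   = y.loan_acct_nbr
                                              and a.yr_mnth_cd = to_char(add_months(sysdate,-1),''YYYYMM'')
                                              and a.loan_typ   = ''ABC''',
                      language_flag  => dbms_sql.native,
                      parallel_level => 4);

                    dbms_parallel_execute.drop_task('load_final_table');
                  end;
                  /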

              • 4. Re: Loading 50-60 million records efficiently
                AndrewSayer

                Where is the time currently going? Is it in executing your function, or in doing the insert? I suggest you enable extended SQL tracing and run the process for a representative chunk of data (a couple of months?), then run tkprof against the resulting trace file; this should show you exactly where the time is spent and should indicate any easy wins.
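
                For example, something like this in the session running the load (the trace file location and the tkprof options shown are just one way of doing it):

                    -- in the session that runs the load
                    exec dbms_monitor.session_trace_enable(waits => true, binds => false)

                    -- ... run the query / insert for a representative chunk of data ...

                    exec dbms_monitor.session_trace_disable

                    -- then, on the database server, format the trace file:
                    --   tkprof <tracefile>.trc report.txt sys=no sort=exeela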

                 

                insert into.. select..; is almost always faster and less memory intensive than select bulk collect into; insert into ; because you’re not having to copy everything into PGA first. But this might not be the most important thing.

                Indexing tables and then bulk loading them is going to be slower than bulk loading and then indexing, but that might not be important here.
                Projecting an additional USER call per row is not going to be fun, but that might not be important.

                • 5. Re: Loading 50-60 million records efficiently
                  Tina

                  I have changed my code to use a pipelined function, and even after that it is still slow.

                  From my analysis, the NOT EXISTS clause is taking a very, very long time.

                  Execution of this SELECT (the cursor) does not even complete in 4-5 hours.

 

                  Query 1 (i.e. before the NOT EXISTS) returns roughly 60-70 million rows and FINAL_LOAD_TABLE has about 50 million rows.

                   

                  Index on table_A :                        YR_MNTH_CD, LOAN_ACCT_NBR

                  Index on FINAL_LOAD_TABLE : LOAN_ACCT_NBR, YR_MNTH_CD

                   

                  I hope the column order of the composite indexes doesn't make that much difference.

                   

                      SELECT Y.*,

                             TRUNC(SYSDATE,'MM'),

                             USER

                       FROM TABLE_A A

                       JOIN TABLE(MY_FUNCTION (A.LOAN_NBR,A.START_DT,A.MATURITY_DT,A.NUM_OF_TERMS)) Y

                             ON A.LOAN_NBR = Y.LOAN_ACCT_NBR

                       WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month

                       AND A.LOAN_TYP='ABC'    -- Query1

                      AND NOT EXISTS ( SELECT 1 FROM FINAL_LOAD_TABLE L   -- Query2
                                        WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                                          AND Y.YR_MNTH_CD    = L.YR_MNTH_CD );

                   

                  I have included the explain plan below, just in case:

                  • 6. Re: Loading 50-60 million records efficiently
                    Jonathan Lewis

                    A couple of questions about the SQL statement - which looks as if it should probably be used in an "insert as select".

                     

                    How many rows are there in final_load_table before you start your "insert as select" -- I am wondering why the "not exists" is needed if it starts as an empty table. [UPDATE: rethinking your comments about "query 1" in your later post, do you have 50M rows in final_load_table before you start the insert - if so, how many rows get inserted?]

                     

                    How many rows generated by the function will be eliminated by the "not exists" -- I am wondering why the loan could exist with its monthly entries partially in place with some gaps needing filling.  (Possibly one strategy is to delete any data for loans you are about to insert and commit so you don't have to do the not exists subquery).

                     

                    Part of the problem indicated by the plan is that Oracle doesn't know you're getting 50 - 60 rows from the function per input, and then it's done its usual join arithmetic to produce an estimate of 5,326 rows to test with the filter subquery. If you can persuade it that there will be some 50M rows in the initial join it might switch to a hash anti join which might be far more efficient. (Alternatively  you could just hint that path.)

                     

                    As a first test, adding the hint /*+ cardinality(a, y, 60000000) */ to the SELECT may result in the nested loop operation at (current) operation 2 reporting Rows = 60M, and that may be sufficient for the plan to switch to the hash anti-join.
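
                    For placement, the hint would go straight after the SELECT keyword, e.g. (first lines only):

                        SELECT /*+ cardinality(a, y, 60000000) */
                               Y.*,
                               TRUNC(SYSDATE,'MM'),
                               USER
                          FROM TABLE_A A
                          JOIN TABLE(MY_FUNCTION(A.LOAN_NBR, A.START_DT, A.MATURITY_DT, A.NUM_OF_TERMS)) Y
                               ON A.LOAN_NBR = Y.LOAN_ACCT_NBR
                          ...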

                     

                    Regards

                    Jonathan Lewis

                    • 7. Re: Loading 50-60 million records efficiently
                      Jonathan Lewis

                      I've just run up a little model of your query.

                      Hinting the cardinality won't work: the filter subquery HAS to run as a filter because it's not valid for Oracle to unnest it, the correlation predicates being illegal (they refer back to the table-function output).

                       

                      If you want to get a hash antijoin you will need to put your "query1" into an inline view (perhaps with a no-merge() hint) and then apply the "not-exists" subquery  to the view alias, something like:

                       

                      select  v.*
                      from    (
                              select  /*+ no_merge */
                                      y.*
                              from
                                      table_a a,
                                      table(my_function(
                                              a.loan_acct_nbr,
                                              a.yr_mnth_cd,
                                              a.start_dt,
                                              a.maturity_dt,
                                              a.number_of_terms
                                      )) y
                              where
                                      a.yr_mnth_cd = '202001'
                              )      v
                      where  not exists (
                                      select
                                              /*+ unnest */
                                              null
                                      from    final_load_table_l l
                                      where  l.loan_acct_nbr = v.loan_acct_nbr
                                      and    l.yr_mnth_cd    = v.yr_mnth_cd
                              )
                      /

                       

                       

                      If you have 50M rows already in the final_load_table_l then you're going to need a PGA of something like 1.2GB for the session to get an optimal hash join, so the join may go one-pass - but that might still be a lot better than your 50M or so probes of the PK index.

                       

                      Note - you should be using (basic) compression on the two-column indexes; there's a lot of repetition in their leading columns.
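
                      For example (the index names here are made up, adjust to your own; COMPRESS 1 compresses on just the leading column):

                          alter index final_load_table_ix rebuild compress 1;
                          alter index table_a_ix          rebuild compress 1;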

                       

                      Regards

                      Jonathan Lewis

                      • 8. Re: Loading 50-60 million records efficiently
                        Tina

                        Thanks Jonathan for getting back so quickly and for your time/effort.

                         

                        Final_load_table has about 56 million records, and every month about 2 million records get inserted.

                         

                        We don't want to insert any row that has already been inserted before, and I can say this NOT EXISTS is taking a long time.

                        The NOT EXISTS uses Y.LOAN_ACCT_NBR & Y.YR_MNTH_CD, and that is basically a data set from a table function (which doesn't have any index as such).

 

                        Also, the NOT EXISTS eliminates a large portion of query1: out of the 52 million rows returned by query 1, only close to 2.2 million get inserted.

                         

                          SELECT Y.*,

                                   TRUNC(SYSDATE,'MM'),

                                   USER

                             FROM TABLE_A A

                             JOIN TABLE(MY_FUNCTION (A.LOAN_NBR,A.START_DT,A.MATURITY_DT,A.NUM_OF_TERMS)) Y

                                   ON A.LOAN_NBR = Y.LOAN_ACCT_NBR

                             WHERE A.YR_MNTH_CD = TO_CHAR(ADD_MONTHS(SYSDATE,-1),'YYYYMM')  --last month

                             AND A.LOAN_TYP='ABC'    -- Query1: about 52 million rows

                        ------------------------------------------------------------------------------------------------------------------------------------------------------

                            AND NOT EXISTS ( SELECT 1 FROM FINAL_LOAD_TABLE L   -- Query2 (FINAL_LOAD_TABLE has about 56 million rows, increasing by ~2 million every month)
                                              WHERE L.LOAN_ACCT_NBR = Y.LOAN_ACCT_NBR
                                                AND Y.YR_MNTH_CD    = L.YR_MNTH_CD );

                        • 9. Re: Loading 50-60 million records efficiently
                          Jonathan Lewis

                          So you generate a lot of data that you throw away.

                           

                          You can make the mechanics a little more efficient by using basic compression on the indexes so that they are smaller and more likely to be buffered, but you'll still be running a filter subquery tens of millions of times to find the last few (couple of million) rows you're interested in.

                           

                          I've written up a model of what you're doing with more details on what the problem is (an inherent limitation of the optimizer) and how to work around it (which is the inline view approach I've given above): https://jonathanlewis.wordpress.com/2020/01/10/collection-limitation/

                           

                          The problem with the alternative strategy is that you'll need a large memory allocation for it to run really efficiently.

                           

                          Regards

                          Jonathan Lewis

                          • 10. Re: Loading 50-60 million records efficiently
                            Tina

                            Thanks Jonathan,

                             

                            For the calculations in this function I need all the rows, depending on the terms.

                            After the calculations, only those rows that aren't already in the final table need to be loaded.

                            Right now the function is returning everything, which we filter later (with the NOT EXISTS clause), and this is what is causing the slowness.

 

                            Is there any way we can use all the rows (based on terms) for the calculation,

                            but after the calculation insert only those rows that aren't already in the final table?

                            I think that should reduce the time greatly.

                             

                            FUNCTION MY_FUNCTION (P_LN_NBR NUMBER, P_START_DT DATE, P_MATURITY_DT DATE, P_NUM_OF_TERMS NUMBER)
                              RETURN T_TBL_TYPE
                              PIPELINED
                            AS
                              V_RET T_TBL_TYPE := T_TBL_TYPE();
                            BEGIN
                              FOR REC IN (
                                          select LN_NBR, yr_mnth_cd, amt_1, amt2, terms
                                            from ....
                                         )
                              LOOP
                                V_RET.EXTEND;
                                CALCULATIONS....
                                .................
                                ..................
                                -- LOAD COLLECTION
                                -- Can we load this collection in some way so that we can only load those records that aren't in final_table?
                                V_RET(V_RET.COUNT) := T_TBL_TYPE( P_LOAN_ACCT_NBR,
                                                                  YR_MNTH_CD,
                                                                  ....
                                                                  ....
                                                                );
                                PIPE ROW (V_RET(V_RET.last));

                              END LOOP;
                              RETURN;
                            END;

                            • 11. Re: Loading 50-60 million records efficiently
                              Jonathan Lewis

                              Tina,

                               

                              Three things.

                               

                              First - how do you know that the largest fraction of the time is being spent in the subquery?  (It seems likely, and that's what I assumed when I saw your first post, but you need proof to avoid wasting time and effort on the wrong thing.) If you're licensed to use the diagnostic and performance tools then you can start the query running and generate an SQL Monitor report after a few minutes - see: https://jonathanlewis.wordpress.com/2018/04/06/sql-monitor/

                               

                              Second - if you are correct and most of the time is spent in the subquery, this will come from two directions: trivially, you will be executing the subquery something like 60 million times, and even though the query is very lightweight it will be doing 3 or 4 buffer gets on each execution, and since the index is quite large it may also be doing a fairly regular disk read to pick up a leaf block; less trivially, when you turn the SELECT into an "insert as select" the subquery will be running against an index that is constantly changing and will require an increasing amount of work to be done with undo records to get a read-consistent version of the index.  There are a couple of things you could do with the SQL that may reduce the scale of the work, but you can't avoid it. Changing the query so that you get an execution plan with a hash anti-join instead of a filter subquery means you'll read the data ONCE before any other activity happens and have it available in local memory, so that you won't have to keep doing consistent gets. This may make an enormous difference to the performance.

                               

                              Finally - you're right about the ideal way to reduce the workload: if you could avoid generating the table in the function so that you didn't have to have the subquery at all, then that would be the ideal strategy.  However, it's possible that the only way to populate the collection with a minimum set is (effectively) to run the subquery inside the function instead of after the join - which means moving the workload rather than reducing it.  There is one business-related possibility, though, that might help.  Before I describe it, I'll just mention that you're wasting resources in your current code, which goes "extend collection, populate last element of collection, pipe last element of collection"; this is probably a side effect of starting with a table function rather than a pipelined function.  You don't need the collection: you can just have a single variable of the object type that defines the table type, populate that variable, then pipe that variable.
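
                              As a sketch of that simplification (T_ROW_TYPE here stands in for whatever object type T_TBL_TYPE is a table of; the name is made up, and the elided parts stay elided):

                                  FUNCTION MY_FUNCTION (P_LN_NBR NUMBER, P_START_DT DATE, P_MATURITY_DT DATE, P_NUM_OF_TERMS NUMBER)
                                    RETURN T_TBL_TYPE
                                    PIPELINED
                                  AS
                                    V_ROW T_ROW_TYPE;    -- single element variable, no collection needed
                                  BEGIN
                                    FOR REC IN (select LN_NBR, yr_mnth_cd, amt_1, amt2, terms from ....)
                                    LOOP
                                      -- calculations ...
                                      V_ROW := T_ROW_TYPE(P_LN_NBR, REC.YR_MNTH_CD /* , ... */);
                                      PIPE ROW (V_ROW);
                                    END LOOP;
                                    RETURN;
                                  END;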

                               

                              Possible (business related strategy):

                              Assume you have a typical table_a row that will generate 60 final_load_table rows. We know that most of those rows already exist in final_load_table.  WHY do they already exist?  Is it the case that (e.g.) the first 55 rows exist and the last 5 need to be created and inserted to get to the end of the term?

                               

                              IF this is the case then the function can query final_load_table to find the maximum yr_mnth_cd for the acct_nbr, and the loop can then check whether or not it has reached that yr_mnth_cd and only populate the collection from that point onwards. This means you run the subquery ONCE for each acct_nbr.
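
                              In outline (a sketch only; the variable name is made up and the elided parts stay elided):

                                  -- one lookup per account, before the loop:
                                  SELECT MAX(YR_MNTH_CD)
                                    INTO V_MAX_YR_MNTH_CD
                                    FROM FINAL_LOAD_TABLE
                                   WHERE LOAN_ACCT_NBR = P_LN_NBR;

                                  FOR REC IN ( .... )
                                  LOOP
                                    -- calculations ...
                                    IF V_MAX_YR_MNTH_CD IS NULL OR REC.YR_MNTH_CD > V_MAX_YR_MNTH_CD THEN
                                      PIPE ROW ( .... );   -- only pipe rows that aren't already in final_load_table
                                    END IF;
                                  END LOOP;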

                               

                              On the other hand, if there could be arbitrary gaps in final_load_table for any acct_nbr then you need to find a different strategy.  (If there are likely to be only a couple of random gaps per acct_nbr there is a more complex strategy for minimising the number of checks, but you need to find out how the business really works before you worry about adding complexity.)

                               

                              Regards

                              Jonathan Lewis

                               

                              P.S.  I have effectively retired from consulting, but for sufficiently interesting problems that need to be fixed in a hurry I still do the occasional day or two off-site on a "no fix - no fee" basis.

                              • 12. Re: Loading 50-60 million records efficiently
                                Paulzip

                                Hi Tina

                                 

                                1. What is the relationship between A.LOAN_NBR and Y.LOAN_ACCT_NBR: is it one to one, or one to many based on the number of terms?

                                If it is one to one, perhaps the EXISTS correlation could be based on TABLE_A rather than on the output of the pipelined function, which may facilitate a hash anti join and avoid unnecessary function evaluations.

                                 

                                e.g.

                                select y.*, trunc(sysdate,'MM'), user

                                from (                                                       

                                   select a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms

                                   from table_a a

                                   where a.yr_mnth_cd = to_char(add_months(sysdate,-1),'YYYYMM')  --last month

                                     and a.loan_typ='ABC'    -- Query1

                                    and not exists ( select 1

                                                     from final_load_table l   -- Query2

                                                     where l.loan_acct_nbr = a.loan_nbr

                                                     and l.yr_mnth_cd between a.yr_mnth_cd and to_char(add_months(a.start_dt, a.num_of_terms), 'YYYYMM'))

                                ) a

                                join table(my_function (a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y on 1 = 1

                                 

                                2. Are there many records in table_a that have reached say the number of terms, so won't populate any records in the final table?

                                 

                                3. Are there any gaps in the final table or are the records inserted always new months?

                                If so, I'm thinking you could avoid piping values that won't be used.

                                 

                                4. Can you give more information about the logic within the pipeline function?  Is it doing an amortization calculation or a payment schedule of sorts?

                                It might be more efficient to populate the full schedule for all records in table_a and mark certain records as in the future, rather than calculate everything and filter each month.

                                • 13. Re: Loading 50-60 million records efficiently
                                  Atul Joshi

                                  How about the NOT EXISTS written in a different way? Please see the sample code below.

                                  -- Define table Y as a CTE over the function output

                                  with Y (column1, column2, ...) as (

                                    select * from table(MY_FUNCTION (A.LOAN_NBR, A.START_DT, A.MATURITY_DT, A.NUM_OF_TERMS))

                                  )

                                  select * from Y;

                                  ----

                                  with first1 (EmpNo, EmpName) AS

                                  (select 101 EmpNo, 'Hari1' EmpName from dual

                                  UNION

                                  select 102 EmpNo, 'Steve' EmpName from dual

                                  )

                                  --select * from first1

                                  , second1 (EmpNo, EmpName) AS

                                  (select 103 EmpNo, 'Hari2' EmpName from dual

                                  UNION

                                  select 102 EmpNo, 'Steve' EmpName from dual

                                  )

                                  select * from (select s.EmpNo, s.EmpName, f1.EmpNo as EmpNo1, f1.EmpName as EmpName1

                                  from second1 s

                                  Left JOIN first1 f1

                                  ON f1.EmpNo = s.EmpNo

                                  ) x

                                  -- Pickup the records from second1 which are not there in first1 (Modify the above code as needed)

                                  where x.EmpNo1 is null
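
                                  Applied to the actual tables it would look something like this (just a sketch; whether the optimizer handles this any better than the NOT EXISTS would need testing):

                                      select x.*
                                        from (
                                              select y.*, l.loan_acct_nbr as already_loaded
                                                from table_a a
                                                join table(my_function(a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y
                                                     on a.loan_nbr = y.loan_acct_nbr
                                                left join final_load_table l
                                                     on  l.loan_acct_nbr = y.loan_acct_nbr
                                                     and l.yr_mnth_cd    = y.yr_mnth_cd
                                               where a.yr_mnth_cd = to_char(add_months(sysdate,-1),'YYYYMM')
                                                 and a.loan_typ   = 'ABC'
                                             ) x
                                       where x.already_loaded is null;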

                                  • 14. Re: Loading 50-60 million records efficiently
                                    Tubby

                                    You haven't shown us the complete code, but if I've followed along correctly, and your version supports it (one reason you should always share your version number when posting), perhaps you could make use of the hint IGNORE_ROW_ON_DUPKEY_INDEX, assuming you have an appropriate unique index on the target table.

                                     

                                    As an example

                                    https://richardfoote.wordpress.com/2010/12/20/oracle11g-ignore_row_on_dupkey_index-hint-micro-cuts/

                                     

                                    Using that would allow you to completely remove the NOT EXISTS clause from the insert, and assuming that is your pain point (which it seems to be) things should speed up significantly. I prefer Jonathan's solutions to this one, but that's because his is much more fun and I'm always hesitant to recommend or use hints where a viable alternative exists.
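
                                    A sketch of how that might look, assuming a unique index exists on FINAL_LOAD_TABLE(LOAN_ACCT_NBR, YR_MNTH_CD) (the index name here is made up):

                                        insert /*+ ignore_row_on_dupkey_index(final_load_table, final_load_table_uk) */
                                          into final_load_table
                                        select y.*, trunc(sysdate,'MM'), user
                                          from table_a a
                                          join table(my_function(a.loan_nbr, a.start_dt, a.maturity_dt, a.num_of_terms)) y
                                               on a.loan_nbr = y.loan_acct_nbr
                                         where a.yr_mnth_cd = to_char(add_months(sysdate,-1),'YYYYMM')
                                           and a.loan_typ = 'ABC';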

                                     

                                    Cheers,
