1 Reply Latest reply: Mar 20, 2014 2:17 AM by Franck Pachot RSS

    Parallelizing data loads



      My company is purchasing an Exadata appliance with 8 nodes for an Online Inquiry application i.e. primary access is account id; each account has one or more transactions. I am the design architect and am new to Exadata even though I have lots of experience with Oracle RAC from some years ago.

      Some years ago, In tIn order to fully use all the nodes for parallelization, I remember Oracle best practices once created a metadata table with each row in the table storing an account id range. This metadata table was first used to auto generate the CREATE TABLE partition specs for RANGE on account id. The second use of this metadata table was as follows: For a data loading job, this same table was also read by the middle tier to spawn threads across all the nodes. Each loading thread was directed to load one and only one partition. Say the cluster was 5 nodes and we had 200 account id range partitions. 200/5 = 40 partitions per node. So the middle directs 40 threads per node, each thread loads a single range partition taken from the metadata table. This approach fully paralleizes across all the nodes in the cluster and since each thread loads a single partition there is no contention between two threads.


      Does current Exadata make all the above work unnecessary?  For example for a daily load can I load a stage table with daily data and execute a simpel INSERT AS SELECT statement and have EXADATA do all the smarts to parallelize across all the nodes?

      Question 2:

      Is Exadata smart enough to try achieve thread to partition affinity i.e. will it pay attention to the number of partitions  and make these kind of decisions 200 partititions / 5 nodes = 40 partitions and auto set the degree of parallelism accordingly?  What Oracle manual and/or any white papers can I read to avoid all the manual coding at the middle tier that we used to do years ago as descriobed above?



      Willing to learn

        • 1. Re: Parallelizing data loads
          Franck Pachot



          Exadata or not, Partitioning and Parallel DML is the way to do that automatically.


          What I understand is that you needed your metadata table in order to add the 'partition' clause to the insert clause in order to lock only the partition when you insert in direct path. is that right ? Then Parallel DML should do that automatically. Well I've done that kind of insert as select in parallel both in non exadata RAC and in Exadata, and it scales well.


          Exadata will add the startscan feature for the SELECT part, and that's usually very good as you don't need all the rows/columns.

          Question 2 is also not related to Exadata, not even RAC. If you're doing parallel insert, then each parallel process will insert in its own segment (partition).


          About auto DOP, take care on Exadata because parallel_degree_policy=AUTO enables also in-memory parallel query, which you may not want on Exadata.