
fast loading large multitable flat files (sqlldr-external tables-utl_file)

DeBooserPiet Member Posts: 35 Blue Ribbon
edited Oct 24, 2008 9:18AM in Warehouse Builder
<p>
Current situation

We are loading staging tables from large flat files.

The flat files are multi-recordtype files.

An overview of the process:
</p>
<ul>
<li>We have several directories where the input files reside.</li>
<li>Each of the directories receives more than one input file per day; there is no upper limit.

The file names contain a sequence number, so no two files with the same name ever appear in the same directory on any given day.</li>
<li>The files are concatenated per input directory into one big flat file.

This concatenated file always has the same name; it is recreated once a day.</li>
<li>The big concatenated file is known in Oracle as an external table with two columns: the record type and one big column of type varchar2(4000) (see the sketch after this list).</li>
<li>The big external table is - I haven't got any idea why - inserted into an Oracle table with the same layout.</li>
<li>Then this Oracle table is read and processed with a multitable insert into several staging tables. It is read with a huge select consisting for about 80% of substr calls.</li>
<li>This huge select drives the multitable insert.</li>
</ul>
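<p>
For illustration, a minimal sketch of what such a two-column external table could look like (directory, file and column names and the field positions are assumptions, not the actual objects):
</p>
<pre>
CREATE TABLE stg_concat_ext (
  rec_type VARCHAR2(5),
  rec_data VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS (
      rec_type POSITION(1:5)    CHAR(5),
      rec_data POSITION(6:4000) CHAR(3995)
    )
  )
  LOCATION ('big_concat_file.txt')
)
REJECT LIMIT UNLIMITED;
</pre>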
<p>
The process is extremely slow: processing 8,000,000 records takes more than 20 hours, and most of that time is spent in the huge selects.

Of course I want to change this process into something that performs much better.
</p>
<p>
First of all I was thinking about external tables in Oracle, but this leads to the following problems:
</p>
<ul>
<li>Each external table has more than one flat file as its source.
<ul>
<li>This can be solved by altering the table over and over again.</li>
</ul>
</li>
<li>The source files are multi-recordtype files.
<ul>
<li>This can be solved by the <strong>load when</strong> clause of the external table definition (see the sketch after this list).</li>
</ul>
</li>
</ul>
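<p>
A minimal sketch of both workarounds, with assumed object and file names:
</p>
<pre>
-- Re-point the external table at a different set of source files
ALTER TABLE stg_concat_ext LOCATION ('file001.txt', 'file002.txt');

-- A per-recordtype external table would add a LOAD WHEN filter to its access parameters
ACCESS PARAMETERS (
  RECORDS DELIMITED BY NEWLINE
  LOAD WHEN (rec_type = 'UUDC')
  FIELDS (
    rec_type POSITION(1:5)    CHAR(5),
    rec_data POSITION(6:4000) CHAR(3995)
  )
)
</pre>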
<p>
But when I create one such external table and open it to look for the record type defined in the load when clause, it takes ages to complete; the table is about 500MB. And this file then has to be reprocessed for all the possible record types (around 20 different tables).

Another way to proceed is using SQL*Loader with <strong>load when</strong> clauses and performing a multitable insert based on the load when clause (a control-file sketch follows below).

I think this is the better way, but then I'm uncertain how to handle it in OWB.

Do I have to restart the SQL*Loader process for every other input file, and do I have to alter the external table to point to every other flat file? This means I cannot process in parallel, and won't this be counterproductive because I have to restart the mapping so many times (once per input file)?

I know SQL*Loader can process multiple source files at once, but then I do not know how to do it with OWB because of the different input file names and the variation in the number of input files; then I would have to switch over to calling SQL*Loader from the command line with a control file that can be altered for every run (how to do this?), and I lose control from OWB.

So what's the best (and fastest executing) way to proceed with this setup?
</p>
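<p>
For illustration, a minimal SQL*Loader control-file sketch that scans the input once and routes each record type to its own staging table (file, table and column names and the field positions are assumptions):
</p>
<pre>
LOAD DATA
INFILE 'file001.txt'
INFILE 'file002.txt'
APPEND
-- Records starting with 'UUDC' go to staging table STG_AA
INTO TABLE stg_aa
  WHEN (1:4) = 'UUDC'
  (col_a POSITION(6:15),
   col_b POSITION(16:20))
-- Records starting with 'UUDAO' go to staging table STG_BB
INTO TABLE stg_bb
  WHEN (1:5) = 'UUDAO'
  (col_a POSITION(9:15),
   col_b POSITION(16:17))
</pre>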

Answers

  • Oleg Member Posts: 1,463
    edited Oct 23, 2008 2:58AM
    Hi Pieter,
    as I understand it, you have three stages of processing your files:
    1) concatenating the files into one big file
    2) loading the data from the external table into an Oracle table
    3) the multitable insert

    How is the execution time distributed across these stages?
    Are you using parallel processing (parallel query and parallel DML)?
    How many CPUs (cores) are installed in your server?
    Did you monitor server resource usage (CPU usage, disk throughput) while executing the different stages of your process?

    You said that the biggest part is the multitable insert, but why do you start by optimizing the external table extraction stage?
    Maybe you want to avoid concatenating the files into one big file, am I right?
    I think that is a good idea (avoiding the concatenation). As I understand it, you want to alter the external table with a command like
    alter table YOUR_EXT_TABLE location ('file001.txt','file002.txt')
    There is a simple method for identifying the source files - create in each of your directories a file with the file list (for example named filelist.txt),
    for example on Unix with this command (and you need to refresh this file before executing your process)
    ls /u01/oradata/extfiles > /u01/oradata/extfiles/filelist.txt
    Create an external table based on this file (filelist.txt) and develop a PL/SQL procedure which
    - extracts the file names from the external table
    - composes (in a string variable) the command ALTER TABLE ... LOCATION
    - executes this command with EXECUTE IMMEDIATE
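
    For example, a minimal sketch of such a procedure (the names filelist_ext and YOUR_EXT_TABLE, and the column file_name, are assumptions about your objects):

    create or replace procedure repoint_ext_table is
      l_files varchar2(4000);
    begin
      -- collect the file names listed in filelist.txt (exposed as external table filelist_ext)
      for r in (select file_name from filelist_ext order by file_name) loop
        if l_files is not null then
          l_files := l_files || ',';
        end if;
        l_files := l_files || '''' || r.file_name || '''';
      end loop;

      -- re-point the data external table at the current set of files
      execute immediate 'alter table YOUR_EXT_TABLE location (' || l_files || ')';
    end;
    /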

    I think LOAD WHEN (with 20 external tables) will be ineffective, because in that case selecting from any one of the tables scans ALL the source files.

    In my environment scanning external table (file 130MB contains 1,000,000 rows) in parallel (8 CPU cores) takes no more than 10 seconds.
    I think there is no reason to switch from external tables to SQL*Loader.
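
    For example, a parallel scan can be enabled on the external table (a degree of 8 here is just an assumption matching 8 cores):

    alter table YOUR_EXT_TABLE parallel 8;
    -- or per statement:
    select /*+ parallel(t, 8) */ count(*) from YOUR_EXT_TABLE t;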

    Regards,
    Oleg
  • DeBooserPiet Member Posts: 35 Blue Ribbon
    Hi Oleg,
    Load when might be very bad: someone told me to have a look at the explain plan.
    It's possible that load when causes as many full table scans as there are load whens; I can check this next Monday.
    I know there will always be one full table scan, but that's normal.
    If I understand well, you're telling me to start the process not from one big flat file but to run it over and over again on each of the smaller files, and that I can accomplish this by renaming the file source of the external table.
    But what is done now is a multitable insert based on a SQL statement like this:
    if source column 1 is 'UUDC' then we have
    substr(varchar2(4000) field, 1, 10)  into col A of target table AA,
    substr(varchar2(4000) field, 11, 5)  into col B of target table AA,
    substr(varchar2(4000) field, 16, 2)  into col C of target table AA,
    substr(varchar2(4000) field, 18, 5)  into col D of target table AA

    if source column 1 is 'UUDAO' then we have
    substr(varchar2(4000) field, 4, 7)   into col A of target table BB,
    substr(varchar2(4000) field, 11, 2)  into col B of target table BB,
    substr(varchar2(4000) field, 13, 10) into col C of target table BB,
    substr(varchar2(4000) field, 23, 5)  into col D of target table BB
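
    (For reference, a minimal sketch of such a conditional multitable insert - table and column names are made up for illustration; only the record types UUDC/UUDAO and the substr offsets come from the description above:)

    insert all
      when rec_type = 'UUDC' then
        into stg_aa (col_a, col_b, col_c, col_d)
        values (substr(rec_data,  1, 10),
                substr(rec_data, 11,  5),
                substr(rec_data, 16,  2),
                substr(rec_data, 18,  5))
      when rec_type = 'UUDAO' then
        into stg_bb (col_a, col_b, col_c, col_d)
        values (substr(rec_data,  4,  7),
                substr(rec_data, 11,  2),
                substr(rec_data, 13, 10),
                substr(rec_data, 23,  5))
    select rec_type, rec_data
      from big_staging_table;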

    I think this one causes as many file scans as there are when clauses for inserting.
    I thought this would be solved by SQL*Loader scanning the file once and inserting into the correct table for every line read (done by checking the line read and inserting into AA or BB or ...).
    That's why I think SQL*Loader will be faster.
    Right or wrong?
    regards, Piet
  • Oleg Member Posts: 1,463
    edited Oct 24, 2008 4:58AM
    Hi Piet,
    "If I understand well, you're telling me to start the process not from one big flat file but to run it over and over again on each of the smaller files,
    and that I can accomplish this by renaming the file source of the external table."
    No, you don't need to join multiple files into one big one - you can define an external table based on multiple source files.
    The bad thing is that OWB doesn't support dynamically reconfiguring an external table to new file names
    (there is no way to identify the file names in a directory inside OWB and alter the external table accordingly), but you can perform this task with an additional stored procedure
    (as I described).

    Are you sure that OWB generates a multitable insert in your procedure? Please check this.

    Regards,
    Oleg
  • DeBooserPiet Member Posts: 35 Blue Ribbon
    Hi Oleg,
    the procedure as I have it right now is not generated by OWB; it is handwritten, so I am sure it contains the multitable insert.
    I think this was in my overview of the actual process.

    regards, Piet
This discussion has been closed.