
Data staging tables instead of flat files

User_OKMG9
User_OKMG9 Member Posts: 29 Red Ribbon
edited Jan 27, 2020 11:20AM in General Database Discussions

Hi Guys,

I've been given a project to move data to staging tables. In the past we have created flat files (from 50 KB to 2 GB) from one database and placed them in a directory so a second database could pick them up.

The company now wants this data going into staging tables instead of flat files. Once the data has been loaded into the second database, we currently delete the file from the directory after one month.

1. What would be the best strategy: ordinary database staging tables or global temporary staging tables? Would it be best to use partitioned tables so that I could drop the partitions after a month?

2. These tables will only be inserted into by the first database and selected from by the second database, so what table structure would be best?

3. What other performance techniques could I use to make this whole process quicker (i.e. loading into the staging tables)?

Is there any information on this, or a white paper on how to approach this situation? Has a similar company done this before?

Thanks in advance


Answers

  • GregV
    GregV Member Posts: 3,075 Gold Crown
    edited Jan 14, 2020 2:00PM

    Hi,

    Staging tables are not created as global temporary tables, simply because the processes populating them are usually not the same as those querying them, and a session cannot see another session's rows in a global temporary table.

    If you are the one in charge of populating them, then use INSERT INTO ... SELECT statements. I guess the target DB will have a db link or something to query them.

    You would want to use IDs to identify the loading/extracting processes, and probably statuses; see the sketch below.
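
    A minimal sketch of that idea, with hypothetical table, column, and db link names: each load is tagged with a batch ID and a status so that the extracting process on the other database can tell which batches are complete.

    -- LOAD_ID identifies the batch; STATUS tells readers which batches are safe to consume.
    CREATE TABLE stg_invoices (
        load_id        NUMBER       NOT NULL,
        status         VARCHAR2(10) DEFAULT 'LOADING' NOT NULL,   -- LOADING / READY / PROCESSED
        invoice_id     NUMBER,
        invoice_amount NUMBER(10,2),
        invoice_date   DATE
    );

    -- Populate with a set-based INSERT ... SELECT, then flag the batch as ready.
    INSERT INTO stg_invoices (load_id, status, invoice_id, invoice_amount, invoice_date)
    SELECT 42, 'LOADING', invoice_id, invoice_amount, invoice_date
    FROM   invoices;

    UPDATE stg_invoices SET status = 'READY' WHERE load_id = 42;
    COMMIT;

    -- The target database can then pull completed batches over a db link:
    -- SELECT * FROM stg_invoices@db1_link WHERE load_id = 42 AND status = 'READY';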

  • L. Fernigrini
    L. Fernigrini Data Engineer Sr Consultant Member Posts: 3,696 Silver Crown
    edited Jan 14, 2020 2:12PM

    If you can use partitioning, that would be a way to quickly purge old data.

    If you are not using (paid) partitioning, you can still create a table for each month (or whatever period you import) and create a view that "merges" the data from all the staging tables using UNION ALL. You can then just truncate the month tables once you no longer need them.

    You can write the import process using a synonym, and change it to point to the "new" empty table each month.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,848 Gold Crown
    edited Jan 15, 2020 4:41AM

    The first thoughts that come to mind:

    • Create a schema on the source database, and create a pipelined function in that schema that can derive the data based on some critical input values.
    • Create a database link from the destination database that connects to the source database schema, and create an interval-partitioned table with a suitable interval to address your "one month retention period".
    • Execute "insert /*+ append */ into destination table select * from [email protected]_link";  (where "source_table" is actually a call to the pipelined function.
    • If you have to index the destination table, try to stick with local indexes.
    • The insert should ideally be to a specified partition (you could also do a "touch" on the partition before running the insert to ensure that it exists on demand).
    • Depending on your version of Oracle you might be able to drop old partitions without hitting the "barrier" between the initially created partitions and the automatically created ones. (I think that detail was addressed in 12.2.)
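
    A minimal sketch of the interval-partitioned destination table and the direct-path pull over the db link (hypothetical names; the remote object is shown as a plain table rather than the pipelined function):

    -- Interval partitioning: Oracle creates a monthly partition automatically
    -- for any month that receives data; LOAD_DATE drives the retention period.
    CREATE TABLE stg_invoices (
        load_date      DATE DEFAULT SYSDATE NOT NULL,
        invoice_id     NUMBER,
        invoice_amount NUMBER(10,2),
        invoice_date   DATE
    )
    PARTITION BY RANGE (load_date)
    INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
    (
        PARTITION p_start VALUES LESS THAN (DATE '2020-01-01')
    );

    -- Direct-path pull from the source database over the db link.
    INSERT /*+ append */ INTO stg_invoices (invoice_id, invoice_amount, invoice_date)
    SELECT invoice_id, invoice_amount, invoice_date
    FROM   src_invoices@db1_link;

    COMMIT;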

    There's an example of something similar on my blog, but there are a few important differences between its implementation and what you need to achieve. In particular you would need to write your data extraction in PL/SQL, and then you would need to work out exactly how you could call a pipelined function created in one database from another database - in my example the call and the pipelined function (and its dependent types) are declared in the same database.

    Regards

    Jonathan Lewis

    UPDATE: Just did a quick test of the pipelined function idea - you can't write a pipelined function in one database and select from it in another. If you try, Oracle raises:

    ORA-30626: function/procedure parameters of remote object types are not supported

    The principle of using array-processing PL/SQL to generate the data still applies (as does the interval partitioning), and if you can do the raw data acquisition and processing efficiently from the destination database then you could write a pipelined function in the destination database. (I'm assuming that the processing requires more than just a simple "insert as select".)
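
    For reference, a minimal sketch of such a pipelined function created in the destination database (hypothetical names; any real processing would go where the comment indicates):

    -- Object and collection types describing the rows the function pipes out.
    CREATE TYPE stg_row_t AS OBJECT (invoice_id NUMBER, invoice_amount NUMBER(10,2), invoice_date DATE);
    /
    CREATE TYPE stg_tab_t AS TABLE OF stg_row_t;
    /
    CREATE OR REPLACE FUNCTION get_src_invoices RETURN stg_tab_t PIPELINED IS
    BEGIN
        FOR r IN (SELECT invoice_id, invoice_amount, invoice_date FROM src_invoices@db1_link) LOOP
            -- non-trivial processing/validation would happen here
            PIPE ROW (stg_row_t(r.invoice_id, r.invoice_amount, r.invoice_date));
        END LOOP;
        RETURN;
    END;
    /
    -- Direct-path load into the local staging table:
    INSERT /*+ append */ INTO stg_invoices SELECT * FROM TABLE(get_src_invoices);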

  • Dom Brooks
    Dom Brooks Member Posts: 5,552 Silver Crown
    edited Jan 15, 2020 5:11AM

    If the purpose of the staging table is not to store a history of the source data for a certain period for traceability and auditability, then the question is: why do this? What's the benefit driving this change? But if it is for those reasons, then any sort of temporary table is not fit for purpose (as others have mentioned, a true temporary-table approach would not work anyway if the insert and the subsequent select were done by different sessions).

    The least invasive change is to continue to generate the flat file from the source database and use an external table on the target database to load the data into a partitioned staging table. You're going to want to do that load with direct path, probably with the extended partition syntax (i.e. dynamic SQL to insert into a specific partition determined at runtime); see the sketch below.
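
    A minimal sketch of that combination (directory, file, and table names are hypothetical; stg_invoices is assumed to be the partitioned staging table):

    -- External table over the flat file the source database still produces.
    CREATE TABLE stg_invoices_ext (
        invoice_id     NUMBER,
        invoice_amount NUMBER(10,2),
        invoice_date   DATE
    )
    ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY stage_dir
        ACCESS PARAMETERS (
            RECORDS DELIMITED BY NEWLINE
            FIELDS TERMINATED BY ','
        )
        LOCATION ('invoices_202001.csv')
    );

    -- Direct-path load into one partition chosen at runtime, via dynamic SQL
    -- and the extended (PARTITION FOR) syntax.
    BEGIN
        EXECUTE IMMEDIATE
            'INSERT /*+ append */ INTO stg_invoices PARTITION FOR (DATE ''2020-01-01'') ' ||
            'SELECT * FROM stg_invoices_ext';
        COMMIT;
    END;
    /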

    The trade-off of being least invasive is that you've got the continued overhead of writing to file/reading from file.

    So you could do a series of incremental changes, of which this is the first, eventually leading to getting rid of the file generation.

    It's always worth considering how you can deliver change incrementally with feature toggles, so that it can be switched back (via a feature flag/property) in the event of a production issue - provided, of course, that such toggles and old code paths are removed later once fully proven.

  • Dom Brooks
    Dom Brooks Member Posts: 5,552 Silver Crown
    edited Jan 15, 2020 5:07AM

    There are numerous Oracle Data Warehouse Best Practice papers published by Oracle.

    Google it and you'll find quite a few version-specific ones which repeat the core features usually recommended for best throughput, bits of which will be relevant to your question:

    https://www.oracle.com/technetwork/database/bi-datawarehousing/twp-dw-best-practies-11g11-2008-09-132076.pdf

    https://blogs.oracle.com/datawarehousing/data-warehouse-best-practices

    https://www.oracle.com/technetwork/database/bi-datawarehousing/pres-best-practices-for-extreme-per-130805.pdf

    etc

  • User_OKMG9
    User_OKMG9 Member Posts: 29 Red Ribbon
    edited Jan 15, 2020 11:22AM

    Thanks guys for all your replies. The reason for staging tables (which, come to think of it, will not be temporary tables - you've highlighted why that just wouldn't work) is that the data will be held in the staging tables for around two weeks before it is processed into the second database by another process. We'd also like to keep the data in the staging tables for around another month after it has been processed, in case there were issues with the data that went into the second database; we could then inspect the data in the staging table and correct it if needed.

    The database is 12c on an Exadata box.

    1. Create the partition on the destination staging table which will be held on database 1.

    2. Execute "insert /*+ append */ into destination staging table on database 1 (not indexed) using the process that use to create the flat file to the partition created in step 1.

    3. Process/manipulate the data from the staging table into the second database's main tables, over a database link, in two weeks' time.

    4. Drop the partition in the staging table after a period of time (i.e. a month to six weeks); a sketch follows.
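
    For step 4, dropping a month's partition is a single statement (hypothetical table name; UPDATE INDEXES only matters if there are global indexes):

    ALTER TABLE stg_invoices DROP PARTITION FOR (DATE '2019-12-01') UPDATE INDEXES;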

    There are a few points that I didn't understand.

    1. "If you are not using (paid) partitioning" - what is paid partitioning?

    2. "Create a view that "merges" the data from all staging tables using UNION ALL; you can then just truncate the month tables once you do not need them." I think dropping a partition would be quicker than a truncate, and I think there will only be one staging table, which will replace the flat file. The reason for a staging table is that the data is easier to inspect if something goes wrong. I wouldn't want all the data truncated from the staging table, as some of that data may be in a later partition.

    3. You can write the import process using a synonym, and change it to point to the "new" empty table each month. Not sure what the benefit of this is... please explain?

    4. "You could also do a "touch" on the partition before running the insert to ensure that it exists on demand." What does this mean? Is this checking to see if the partition exists before trying to insert into it?

    5. "Drop old partitions without hitting the "barrier" between the initially created partitions and the automatically created ones." Sorry, I've not really worked much with partitioned tables, so please explain?

    6. "The question is why do this? What's the benefit driving this change?" They want a cleaner process and to be able to track/inspect the data more easily than they could with a bunch of flat files. The process needs to be efficient but not too complicated (without making much change to the existing process); we'd just switch from writing each line to a file to writing it to a staging table.

    Thanks for all your replies and the links you've provided - it's been a big help.

  • Jonathan Lewis
    Jonathan Lewis Member Posts: 9,848 Gold Crown
    edited Jan 15, 2020 11:53AM

    I'll take (4) and (5) because they're quoting me:

    4. "You could also do a "touch" on the partition before running the insert to ensure that it exists on demand." What does this mean? Is this checking to see if the partition exists before trying to insert into it?

    Correct - but with one extra detail. With interval partitioning, a partition will be created for the data you're inserting if it doesn't already exist. If you want to make sure it is present before you do the insert (and you may have a reason for wanting this) you could execute a statement like:

    lock table x partition for (to_date('01-Feb-2020','dd-Mon-yyyy')) in exclusive mode;

    commit;

    And this will lock (and release) the partition if it exists, but create it if it doesn't. I've got an example in the middle of this blog note.

    5. "Drop old partitions without hitting the "barrier" between the initially created partitions and the automatically created ones." Sorry, I've not really worked much with partitioned tables, so please explain?

    If you read this blog note up to Option 1 then that will explain the idea behind interval partitioning and tell you about the problem that originally appeared with interval partitions when dropping old partitions.  Then read the paragraph titled "The Good News".

    Regards

    Jonathan Lewis

  • L. Fernigrini
    L. Fernigrini Data Engineer Sr Consultant Member Posts: 3,696 Silver Crown
    edited Jan 15, 2020 12:29PM

    Regarding your questions 1, 2 and 3: you never mentioned the Oracle edition (Enterprise or Standard) in the original post. Partitioning is a feature available on Enterprise Edition only, and at an extra cost; that is why I mentioned "if you are not using (paid) partitioning".

    The method I mentioned is usually known as "poor man's partitioning" and consists of having different tables that behave as partitions, plus a view that does a UNION ALL and emulates the "entire" table.

    A very simple example would be:

    CREATE TABLE Invoices_201910 ( InvoiceID NUMBER(10), InvoiceAmount NUMBER(10,2), InvoiceDate DATE );
    CREATE TABLE Invoices_201911 ( InvoiceID NUMBER(10), InvoiceAmount NUMBER(10,2), InvoiceDate DATE );
    CREATE TABLE Invoices_201912 ( InvoiceID NUMBER(10), InvoiceAmount NUMBER(10,2), InvoiceDate DATE );
    CREATE TABLE Invoices_202001 ( InvoiceID NUMBER(10), InvoiceAmount NUMBER(10,2), InvoiceDate DATE );

    CREATE OR REPLACE VIEW CurrentInvoices AS
         SELECT * FROM Invoices_201910
        UNION ALL
         SELECT * FROM Invoices_201911
        UNION ALL
         SELECT * FROM Invoices_201912;

    CREATE SYNONYM LastMonth FOR Invoices_202001;

    You can have a procedure that does the insert from your source using LastMonth as the "target". That way you can insert data each day, at the end of the week, or at the end of the month, whenever you want, but it is not "available", since the view does not include the table Invoices_202001.

    When you are done with the import on the "current" month table, you need to recreate the view to take the new data into account and forget about the old one (October 2019):

    CREATE OR REPLACE VIEW CurrentInvoices AS
         SELECT * FROM Invoices_201911
        UNION ALL
         SELECT * FROM Invoices_201912
        UNION ALL
         SELECT * FROM Invoices_202001;

    Then you create a new empty table for next month and recreate the synonym so your import process now inserts data on the new table:

    CREATE TABLE Invoices_202002 ( InvoiceID NUMBER(10), InvoiceAmount NUMBER(10,2), InvoiceDate DATE );

    DROP SYNONYM LastMonth;
    CREATE SYNONYM LastMonth FOR Invoices_202002;

    And you may drop the Invoices_201910 table whenever you want, since it is not used anymore.
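
    Dropping the retired table is then a single statement (PURGE is optional; it just bypasses the recycle bin):

    DROP TABLE Invoices_201910 PURGE;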

    That is a poor man's replacement for partitioning.... If you include a WHERE clause on the view that filters by date, it will also emulate partition pruning when you query by dates :-)

    CREATE OR REPLACE VIEW CurrentInvoices AS
         SELECT * FROM Invoices_201911 WHERE InvoiceDate >= DATE '2019-11-01' AND InvoiceDate < DATE '2019-12-01'
        UNION ALL
         SELECT * FROM Invoices_201912 WHERE InvoiceDate >= DATE '2019-12-01' AND InvoiceDate < DATE '2020-01-01'
        UNION ALL
         SELECT * FROM Invoices_202001 WHERE InvoiceDate >= DATE '2020-01-01' AND InvoiceDate < DATE '2020-02-01';

    Having explained the approach, if you are already paying for the partitioning option then forget about this!!!!

  • User_OKMG9
    User_OKMG9 Member Posts: 29 Red Ribbon
    edited Jan 27, 2020 10:49AM

    Hi Guys, here's an update after some talks with the company.

    They've told me the staging table (if there is one) will not have any partitions on it.

    They've asked me if I could investigate other ways of transporting data from one database to another. These are listed below.

    1. Data Pump - I believe this is just an upgrade to import/export, and if the company wants to get rid of flat files then I don't see why they'd want to create a dump file from one database which would then have to be loaded into a staging table in the second database.

    2. GoldenGate - I'm not sure if this would be any good, but maybe you can enlighten me.

    3. Transportable tablespaces - put the data into a staging table, then copy and move that table's tablespace from the first database to the second database.

    I've suggested creating a view, using the query that creates the flat file, but I'm not sure what the performance implications of this are.

    Thanks for all your input up to now.

  • User_OKMG9
    User_OKMG9 Member Posts: 29 Red Ribbon
    edited Jan 27, 2020 11:08AM

    Hi my friend,

    Well, I've been told that any staging table (if created) will not have partitions on it for me to delete, so your view idea may be a good option.

    As I said, currently we create a flat file using a query in PL/SQL. I'm starting to think of using that query to create a view named <view_name>_<sysdate>; maybe this could be a materialized view.

    Then create a view which would hold a number of materialized views on database 1 (example below), or should this be on database 2?

    CREATE OR REPLACE VIEW CurrentInvoices AS
         SELECT * FROM Invoices_201911   -- materialized view
        UNION ALL
         SELECT * FROM Invoices_201912   -- materialized view
        UNION ALL
         SELECT * FROM Invoices_202001;  -- materialized view

    The second database would then process the data from the view into its database.
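
    As a rough illustration, a monthly materialized view on database 1 might look like this (a minimal sketch with hypothetical names; the defining query stands in for the PL/SQL query that currently builds the flat file):

    CREATE MATERIALIZED VIEW Invoices_202001
        BUILD IMMEDIATE
        REFRESH COMPLETE ON DEMAND
    AS
        SELECT InvoiceID, InvoiceAmount, InvoiceDate
        FROM   Invoices
        WHERE  InvoiceDate >= DATE '2020-01-01'
        AND    InvoiceDate <  DATE '2020-02-01';

    -- Re-run the extract on demand (complete refresh):
    -- EXEC DBMS_MVIEW.REFRESH('Invoices_202001', method => 'C');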