Handling replication for Large & Small volume transactions

765115
765115 Member Posts: 61
edited Mar 25, 2011 3:34PM in GoldenGate
I want to setup GG to replicate data between Source & Target (both Oracle) databases.

The Source database can have 2 types of transactions at any time:

1) Small transactions, which involve 4-5 records and occur frequently
2) Big-volume transactions, which might involve 10 million records and occur very rarely

I'm planning to use the Extract-Replicat process in GG to implement this replication. My concern is that a big-volume replication would choke the subsequent small replications, because the small replications would start only after the big-volume replication completes fully.

I'm thinking about the following 2 options. Can someone please advise on them?

1) Is there an option to run GG in a parallel mode to save time? If so, will referential integrity be maintained in parallel mode?

2) Is there an option to replicate the big-volume transaction alone in a different way, i.e. not via extract-replicat (which I think is the slowest approach for big transactions), but using a GG + SQL*Loader combination to load the data? I found this approach mentioned in the Administration Guide for the initial data load. However, I would still prefer to replicate the small transactions via extract-replicat. Given that, is there a way to implement two different replication strategies against the same target?

Answers

  • -joe
    -joe Member Posts: 226
    Hi Museshad.

    Are the 10 million operations followed by a single commit or are they broken up into multiple transactions? Also, are you able to perform an operation on a new table just before this batch job?

    If it is indeed one monster transaction, and these same tables take both batch and OLTP transactions, then you may want to perform the batch load on the source and the target separately. If you can also perform an operation on a new table used for "marker" purposes, we can talk about how to make the Extract ignore the transaction (or user) and how to make the Replicat stop gracefully at the point in time just before the batch, then start back up afterward.
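
    A rough sketch of the "stop the Replicat at a marker" part of that idea, assuming GoldenGate 11.1+ EVENTACTIONS and a hypothetical marker table ETL_ADMIN.BATCH_MARKER into which one row is inserted just before the batch starts; the group, login and schema names are all made up, so treat this as an outline rather than a tested configuration:

    -- Extract parameter file: capture the application tables and flag the marker row.
    EXTRACT ext1
    USERID ggadmin, PASSWORD oggpwd
    EXTTRAIL ./dirdat/aa
    TABLE APP.*;
    TABLE ETL_ADMIN.BATCH_MARKER, EVENTACTIONS (LOG INFO);

    -- Replicat parameter file: stop gracefully when the marker row is applied,
    -- i.e. at the point in the trail just before the batch, then restart the
    -- Replicat once the off-peak window arrives.
    REPLICAT rep1
    USERID ggadmin, PASSWORD oggpwd
    ASSUMETARGETDEFS
    MAP APP.*, TARGET APP.*;
    MAP ETL_ADMIN.BATCH_MARKER, TARGET ETL_ADMIN.BATCH_MARKER, EVENTACTIONS (STOP);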

    1) Aside from using multiple threads when capturing/extracting from a RAC instance (one thread per node), there is no parallel mode.

    2) There is no option to perform some transactions with SQL*Loader (or other methods) and others in the normal fashion, but there is an INSERTAPPEND parameter for Replicat that will add the insert APPEND hint.
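
    For reference, INSERTAPPEND is a one-line Replicat parameter that applies to the MAP statements following it; a minimal sketch, with made-up group, login and schema names:

    REPLICAT rep1
    USERID ggadmin, PASSWORD oggpwd
    ASSUMETARGETDEFS
    -- Ask Replicat to add an APPEND hint to the INSERTs it executes,
    -- which can help large insert-only batches on the target.
    INSERTAPPEND
    MAP SRC.*, TARGET TGT.*;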

    If it turns out that the 10 million operations are, or can be, broken up into smaller transactions, then we might be overthinking this. We often get questions about performance only to find that, once we simply test the software, it just works with minimal configuration.

    Regards,
    -joe
  • 765115
    765115 Member Posts: 61
    Hi Joe,

    Thanks for your response.

    A commit is issued after loading each table at the Source. We have one huge table with 10 million rows, so there will be a single commit after loading 10 million rows into that table at the Source. Typically, multiple tables (approximate row counts of 1 million, 50K and 10 million rows) get loaded one after the other via ETL and then committed; this is like the initial load. At the same time, small transactions could be happening on the tables being loaded, or on other tables. My concern is that GG will get choked replicating the initial-load tables and might take a long time before it can start replicating the small transactions.

    To make sure I understand your suggestion: are you suggesting that we skip the initial-load transactions from the Extract's view and take them up later during off-peak hours (when no small transactions are happening), and that the marker row is inserted to get a reference to the initial-load transactions later? I will check with the business to see if this is a feasible option.

    Can you please provide more information on how the Extract can be set up to skip the initial-load transactions, or rather the big-volume transactions? Also, how do we get GG to come back to the initial-load transactions at a later point?
  • -joe
    -joe Member Posts: 226
    Hi Museshad.

    If transactions are occurring on these tables while the initial load is running, then running the load separately on source and target isn't going to remove the possible bottleneck if the OLTP transactions hit the same rows as the load. If you can push the load out to off hours, then the "run the load on source and target separately" approach would work, and if we can separate these things, we'd use a marker table to stop the Replicat at a specific point in time / in the data.


    Just to be extra clear here, let's say the load takes 2 hours. During those 2 hours, say 500 other short transactions occur and commit, and then the 2-hour transaction commits. The 2-hour transaction will not hold up the short ones: Extract will be using resources (mostly memory) to track the large transaction, but the short ones will be written to the trail as they commit. Once the 2-hour transaction commits, it too is written to the trail. That could take some time, so any short transaction occurring after the 2-hour commit could get held up while the 10M rows work their way through to the target.

    That all being said, I would grab a test system as close to production as possible and just run the test. It may not be an issue; sure, you may see some lag accrue while the data is applied to the target, but we could be worrying needlessly.

    Regards,
    -joe
  • 765115
    765115 Member Posts: 61
    Hi Joe,

    Transactions (from OLTP) will be happening on the same tables the ETL loads into, but the ETL is INSERT-only, so the OLTP transactions will touch different records in those tables.

    I understand that the OLTP transactions will NOT get held up while the ETL load runs (for 2 hours) and until its COMMIT is issued; GG starts replicating to the Target only after the ETL load's COMMIT. But I'm concerned that replication of the next OLTP transaction will happen only after GG has fully applied the ETL load at the Target. If the load takes 2 hours at the Source, do you have an idea how long applying it might take at the Target? Anyway, I will try to run a test to check the timing.

    However, I would like to understand your suggestion of stopping the Replicat process. What I would like to achieve is this: once the ETL load's COMMIT happens at the Source, I do not want the huge set of records to be replicated to the Target immediately, but the subsequent OLTP transactions (which will not depend on the ETL records) should continue to be replicated. Then, during off-peak time, I want to replicate the ETL records alone, identifying them in some way. Is something like this achievable? I understand it could be done easily by timing the ETL load into the Source itself at an off-peak hour, but I do not have control over when the ETL loads into the Source; hence the problem.

    Thanks for your inputs.
  • -joe
    -joe Member Posts: 226
    Hi Museshad.

    Sorry for the delay. OK, your scenario is better than I thought, as long as the ETL tool can log in as a user that does not perform OLTP transactions. If so, then let's forget about using EVENTACTIONS to stop the Replicats, because you won't need to extend the behavior that way.

    What you can do, since the ETL-inserted rows are not being acted upon by the OLTP transactions, is create a second extract > pump > replicat set. Then use tranLogOptions ignoreUser <user> in each Extract, where <user> for the OLTP flow is the ETL user, and <user> for the ETL flow covers the other users that may perform OLTP. There is not yet an "includeOnlyUser" parameter.
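
    A rough sketch of the two Extracts, assuming the ETL job connects as ETL_BATCH and the OLTP application as APPUSER (both names, like the group, login and trail names, are made up), and using the user-exclusion option documented as TRANLOGOPTIONS EXCLUDEUSER, which matches the ignoreUser idea above; each Extract would feed its own pump, trail and Replicat:

    -- Extract for the OLTP flow: skip everything done by the ETL user.
    EXTRACT extoltp
    USERID ggadmin, PASSWORD oggpwd
    EXTTRAIL ./dirdat/ol
    TRANLOGOPTIONS EXCLUDEUSER ETL_BATCH
    TABLE APP.*;

    -- Extract for the ETL flow: skip the OLTP user(s) instead.
    EXTRACT extetl
    USERID ggadmin, PASSWORD oggpwd
    EXTTRAIL ./dirdat/et
    TRANLOGOPTIONS EXCLUDEUSER APPUSER
    TABLE APP.*;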

    This effectively splits your replication into two flows, and neither will affect the other: the ETL's massive bursts will run independently of the OLTP flow. Also, in the ETL Replicat I suggest using the BATCHSQL parameter, which gives better throughput by grouping like operations on a table into a single arrayed execution buffer. When it is on, 1,200 operations are grouped into a single execution by default, which is usually sufficient, but you can take this up to 100,000; just make sure you don't stress the target system if it is not as capable as the source.
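
    And a matching sketch of the ETL-side Replicat; OPSPERBATCH is the BATCHSQL option that controls the grouping size mentioned above (1,200 by default), and the group, login and schema names are again hypothetical:

    REPLICAT repetl
    USERID ggadmin, PASSWORD oggpwd
    ASSUMETARGETDEFS
    -- Group like operations into arrayed executions; raise OPSPERBATCH
    -- cautiously and only if the target can keep up.
    BATCHSQL OPSPERBATCH 10000
    MAP APP.*, TARGET APP.*;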

    Good luck,
    -joe
  • 765115
    765115 Member Posts: 61
    Hi Joe,

    I was on to something else and was away from GoldenGate for some time.

    I understand the solution you are suggesting, but I'm concerned that the approach might cause referential integrity issues.

    E.g.: say there are 2 users, Bulk and OLTP. As you suggested, we could set up 2 pairs of Extract and Replicat processes to replicate the data of Bulk and OLTP mutually exclusively. Now, consider the scenario below:

    1. Bulk inserts a large number of employees with one final commit at the end. Say the commit happens at 1:30 AM. Its Replicat process starts to replicate the employee data to the target; say this will take an hour, i.e. until 2:30 AM.

    2. At 1:35 AM, the OLTP user updates the details of one of the employees inserted by Bulk and commits at the source. The OLTP Replicat process would then try to replicate this change to the target and might fail if it gets there before 2:30 AM.

    This would cause a referential integrity issue.

    If you agree that this issue can occur, then the only fool-proof solution I can think of is to perform intermediate commits (say, after every 100K rows) during bulk loads, which would also keep GoldenGate from choking.

    Please let me know your comments.
  • stevencallan
    stevencallan Member Posts: 3,459
    You could also group the related transactions (per the admin guide, keep tables related by RI in the same extract/replicat group). You could also do the processing in an intermediate set of tables or a staging schema and then apply the finished result at the end.
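
    For instance (a made-up illustration of that grouping point, with hypothetical table and login names), a parent and child table tied by a foreign key would go through the same Replicat group:

    -- One Replicat applies both related tables, so parent and child changes
    -- are applied in commit order relative to each other.
    REPLICAT repone
    USERID ggadmin, PASSWORD oggpwd
    ASSUMETARGETDEFS
    MAP HR.DEPARTMENTS, TARGET HR.DEPARTMENTS;
    MAP HR.EMPLOYEES, TARGET HR.EMPLOYEES;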

    You could also commit more frequently (in chunks of 1,000, for example, similar to what is recommended for Streams).

    Bottom line, you have to test, evaluate, and decide for yourself.
  • skurup2
    skurup2 Member Posts: 186
    Have you tried splitting the transactions across parallel pumps?
    Below is the syntax for each pump (one TABLE statement per pump) to split the records with a hash on the primary key; each pump writes to its own trail on the target, and at the target you use a dedicated apply process for each trail.
    TABLE owner.table_name, FILTER (@RANGE (1, 5, Key_col1, Key_col2, ..., Key_coln));
    TABLE owner.table_name, FILTER (@RANGE (2, 5, Key_col1, Key_col2, ..., Key_coln));
    TABLE owner.table_name, FILTER (@RANGE (3, 5, Key_col1, Key_col2, ..., Key_coln));
    TABLE owner.table_name, FILTER (@RANGE (4, 5, Key_col1, Key_col2, ..., Key_coln));
    TABLE owner.table_name, FILTER (@RANGE (5, 5, Key_col1, Key_col2, ..., Key_coln));
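
    For completeness, a sketch of the target side of that split under the same assumptions: five inbound trails (here r1..r5) and one dedicated Replicat per trail, with made-up group and login names. In GGSCI, each Replicat is added against its own trail, and its parameter file is an ordinary one-to-one mapping:

    -- GGSCI: one apply process per inbound trail (repeat for rep2..rep5 / r2..r5)
    ADD REPLICAT rep1, EXTTRAIL ./dirdat/r1

    -- rep1.prm (rep2..rep5 differ only in the group name)
    REPLICAT rep1
    USERID ggadmin, PASSWORD oggpwd
    ASSUMETARGETDEFS
    MAP owner.table_name, TARGET owner.table_name;
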
This discussion has been closed.