I have an ETL job that needs to know how much (or if) the CDC process is delayed. For example, if it is behind by 2 hours (transactions entered two hours ago in the source are not in the CDC database yet) then the ETL job needs to wait until it catches up.
As near as I can tell, the only way I can determine this is to look at the table CDC$M_<change set name> which is owned by the publisher. This table has the columns CSCN$ and COMMIT_TIMESTAMP$. If max(COMMIT_TIMESTAMP$) is more than two hours old then I'll have the ETL job go to sleep for X number of minutes.
Does anyone have a better idea?
And if it never catches up? If it is always behind? Seems to me the issue is one of design. The ETL job needs to be written so it can handle that which exists and not worry about what does not yet exist.
The approach is still evolving and I'm open to suggestions. I'm thinking that the ETL job will wait for a while but eventually it will pull whatever is available and then send an email to the support group letting them know that CDC is behind and we may not have all the data we should. This ETL job has been in production for a year and it has been pleasantly reliable. It pulls whatever is available at 2 am and loads it to the warehouse.
The change we are working on is in response to a secondary project that is responsible for verifying that the data we have in the warehouse is accurate compared to the source. On those days when CDC is behind (by as much as 6 hours sometimes!) the comparison is wrong because the warehouse doesn't have the data it should.
The challenge for the ETL is figuring out if and how far behind CDC is.