There are numerous approaches to this challenge. It's important to understand that the challenge exists in any solution where queues of information must be held in some manner while another system is "offline".
Ultimately the solution depends on your application and business DR strategy. For example:
1. How long (really) is a system going to be offline? "Forever" is not reasonable, as people generally can't purchase an infinite amount of storage. "Days" might be reasonable, but in practice this is unlikely. Hours or minutes is the usual scenario.
2. When it is offline for a long time (many hours/days), what is the recovery strategy? Do you want to "replay" all those hours/days/months/years worth of events, or do you want to rebuild/copy the cluster and ignore/drop the events?
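To get a feel for the first question, it's worth doing some back-of-envelope arithmetic before an outage happens. The sketch below is purely illustrative (the update rate and entry size are assumptions, not measurements from any real system): backlog memory is roughly update rate times average entry size times outage duration.

```java
// Back-of-envelope sizing: how much queue storage does an outage consume?
// All input figures are illustrative assumptions; measure your own system.
public class OutageBacklogEstimate {

    // Bytes of backlog accumulated while the remote system is offline.
    public static long backlogBytes(long updatesPerSecond, long avgEntryBytes, long outageSeconds) {
        return updatesPerSecond * avgEntryBytes * outageSeconds;
    }

    public static void main(String[] args) {
        // Assume 500 updates/s of ~2 KB entries (hypothetical workload).
        long perHour = backlogBytes(500, 2048, 3600);
        System.out.println("Backlog after 1 hour: " + (perHour / (1024 * 1024)) + " MB");

        long perDay = backlogBytes(500, 2048, 86_400);
        System.out.println("Backlog after 1 day:  " + (perDay / (1024L * 1024 * 1024)) + " GB");
    }
}
```

Even at this modest rate the backlog runs to gigabytes per day, which is why "replay forever" is rarely a realistic strategy.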
The choice of what to do depends more on the application SLA than anything else; however, here are some pointers with respect to Push Replication.
If you've configured the Event Distribution pattern to use the Coherence Messaging solution as the infrastructure for queuing replication events, then by default the size of your cluster is the limit of your queues. If you need more memory (i.e., more time), you can increase the size of your cluster. Alternatively you can change the queue (messages) cache to use off-heap or flash storage, in which case you're bound by disk size instead, at the cost of increased application latency.
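As a rough sketch only: moving the messages cache off-heap is done in the cache configuration. The scheme and cache names below are hypothetical, and you should check the exact element names and size limits against the documentation for your Coherence and Push Replication release; this is just the general shape of an NIO-backed backing map.

```xml
<!-- Hypothetical sketch: back the messaging cache with off-heap (NIO)
     storage so queued replication events aren't limited by heap capacity.
     Scheme name and sizes are illustrative only. -->
<distributed-scheme>
  <scheme-name>messages-offheap-scheme</scheme-name>
  <backing-map-scheme>
    <external-scheme>
      <nio-memory-manager>
        <initial-size>10MB</initial-size>
        <maximum-size>1GB</maximum-size>
      </nio-memory-manager>
    </external-scheme>
  </backing-map-scheme>
  <autostart>true</autostart>
</distributed-scheme>
```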
Alternatively you may like to configure the Event Distribution pattern to use a JMS-based solution (as part of the Push Replication build we test against the JMS-compliant ActiveMQ). In this scenario there's no cost in cluster storage or size; you're only limited by disk capacity. This is an attractive option if you already have "enterprise-class messaging" in place: there's no need to introduce more "messaging" layers into your application stack; simply use what your enterprise already has available.
I think in reality most people simply "stop" event distribution when a site is lost for a long time, which should be rare. That is, after a long period they know it may take longer to replicate the events than to simply rebuild the cache from scratch. Furthermore, longer outages often come with other issues, like database replication, that have higher priority for recovery and will take precedence over caches. This decision often has more to do with the physics of the connection and the type, duration, frequency and size of updates an application is making than anything else.
Hope this helps.
PS: While we could have a simple "write-everything-to-disk" approach, this essentially makes the caches disk-based, i.e., far less performant and generally not the reason you'd be using Coherence. If performance is not a problem, simply use database replication :)
Brian Oliver | Architect | Oracle Coherence
Keep in mind that going OOM is only one part of the equation. If you are using the default Coherence Messaging to queue your replication data, it's implemented as a single cache entry (i.e., the queue). This means that every time you add an item to your cache, it needs to deserialize the whole queue, add the entry, and then reserialize it (via the publishing cache store). If you have frequent writes to that cache while the remote site is down, your local cluster will start to grind to a halt, to the point where even the JMX controls to suspend/drain the queue do not respond. Likewise, if you have a very large cache and someone does a cache.clear(), it can also kill your cluster.
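To see why a "whole queue in one value" design degrades under load, here is a plain-Java illustration (this is NOT Coherence code, just standard Java serialization standing in for the cache value): every append pays to read back and write out the entire queue, so total bytes moved grow quadratically with queue depth.

```java
import java.io.*;
import java.util.ArrayDeque;

// Plain-Java illustration (not Coherence code) of why keeping an entire
// queue inside one serialized value makes every append cost O(queue size):
// the full queue is read back, grown by one element, and written out again.
public class WholeQueueCost {

    static byte[] serialize(Serializable s) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(s);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    static ArrayDeque<String> deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (ArrayDeque<String>) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] blob = serialize(new ArrayDeque<String>());
        long bytesMoved = 0;
        for (int i = 0; i < 1_000; i++) {
            ArrayDeque<String> queue = deserialize(blob); // whole queue comes back in...
            queue.add("event-" + i);
            blob = serialize(queue);                      // ...and goes out again, one element larger
            bytesMoved += blob.length;
        }
        System.out.println("Bytes reserialized across 1,000 appends: " + bytesMoved);
    }
}
```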
In short, a word of caution with using Coherence as your messaging middleware:
- Coherence Messaging works well when all sites are up and running and writes/deletes are not very concentrated or frequent, or when there is only a "temporary" network blip.
- If the remote site goes down for a prolonged period and writes are frequent, your local site will rapidly degrade in performance (not really what you want!).
Of course you have the option to use an alternative messaging provider; you just need to weigh up the pros and cons.
Slight correction here:
... it's implemented as single cache entry (i.e. the queue). This means every time you add an item to your cache, it needs to deserialize the whole queue and add the entry and then reserialize (via the publishing cache store).
The Coherence Messaging Pattern does not queue each entry into a single Cache Entry. Only the identities of Messages are queued, not the actual entries.
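The distinction matters for append cost. The sketch below is a hypothetical illustration of the corrected design, not the actual Coherence Messaging implementation: the queue holds only small message identities, while payloads live as separate entries, so an append never touches (or reserializes) the payloads already queued.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustration only (not the actual Coherence Messaging implementation):
// queuing message identities rather than payloads keeps each append cheap,
// because existing payloads are stored separately and never reserialized.
public class IdentityQueueSketch {

    private final Map<Long, byte[]> payloads = new HashMap<>(); // one "entry" per message
    private final Deque<Long> queue = new ArrayDeque<>();       // the queue holds IDs only
    private long nextId = 0;

    public long publish(byte[] payload) {
        long id = nextId++;
        payloads.put(id, payload);  // store the payload once
        queue.add(id);              // append a small identity, not the payload
        return id;
    }

    public byte[] consume() {
        Long id = queue.poll();
        return id == null ? null : payloads.remove(id);
    }

    public int depth() {
        return queue.size();
    }
}
```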
As with all deployments, the amount of memory required depends on the number of updates occurring while one or more devices (they may not be entire sites) are unavailable. These things should be tested, not assumed. The same happens if you back your caches with disk but place no limits on them. The amount of memory consumed is application specific.
The alternative approach (supported out of the box) is to use a store-and-forward messaging provider, such as one that implements the JMS specification. All releases of Push Replication are tested with both Coherence Messaging and JMS-based providers.
Hope this helps
Brian Oliver | Architect | Oracle Coherence