Persistence Recovery - 'dynamic quorum policy objections'

Hi,
I recently took a machine out of a Coherence cluster because its battery had failed, and the data was rebalanced to the remaining three machines. I then stopped the other three machines and restarted them. To my surprise, the data was not restored from persistence.
I looked at the persistence tab of the VisualVM Coherence plugin and saw there was a 'force recovery' option. When I pressed it I got the following pop-up: "Proceeding with recovery despite the dynamic quorum policy objections may lead to the partial or full data loss at the corresponding cache service. Are you sure you want to force recovery?"
I said 'yes' and the data seems to have been restored OK.
I'm surprised at this because I explicitly set my recover quorum to 0, meaning let Coherence decide. I set the other quorum values to (number of cache nodes per machine) * (number of machines - 1), i.e. allow for the loss of one machine without the quorum policy blocking anything, but no more than that:
<partitioned-quorum-policy-scheme>
<distribution-quorum system-property="dsp.cache.distribution.quorum">18</distribution-quorum>
<restore-quorum system-property="dsp.cache.restore.quorum">18</restore-quorum>
<read-quorum system-property="dsp.cache.read.quorum">18</read-quorum>
<write-quorum system-property="dsp.cache.write.quorum">18</write-quorum>
<!-- recover quorum of 0 enables dynamic recovery quorum policy -->
<recover-quorum system-property="dsp.cache.recover.quorum">0</recover-quorum>
<!-- persistence-hosts-list is defined in the tangosol override file -->
<recovery-hosts>persistence-hosts-list</recovery-hosts>
</partitioned-quorum-policy-scheme>
$ grep quorum dealing_cluster.properties
# 08 Mar 2017 3.2 Joe Holder Leave Recover-quorum at 0 - to enable dynamic recovery quorum policy (persistence)
QUORUM_SYSTEM_PROPERTIES="-Ddsp.cache.distribution.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.restore.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.read.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.write.quorum=${CACHE_QUORUM}"
#Recover-quorum of 0 to enable dynamic recovery quorum policy
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.recover.quorum=0"
$ grep CACHE_QUORUM dealing_cluster.properties
((CACHE_QUORUM=($NBR_OF_CNSDS - 1) * $MAX_CACHESERVER))
QUORUM_SYSTEM_PROPERTIES="-Ddsp.cache.distribution.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.restore.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.read.quorum=${CACHE_QUORUM}"
QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.write.quorum=${CACHE_QUORUM}"
$ grep NBR_OF_CNSDS dealing_cluster.properties
NBR_OF_CNSDS=4
((CACHE_QUORUM=($NBR_OF_CNSDS - 1) * $MAX_CACHESERVER))
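The quorum arithmetic above can be checked with a short bash sketch. NBR_OF_CNSDS=4 and the CACHE_QUORUM formula come from the properties file; MAX_CACHESERVER is never shown in the thread, so the value 6 is an assumption inferred from the configured quorum of 18 (18 / (4 - 1)). The function name cache_quorum is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of the CACHE_QUORUM arithmetic from dealing_cluster.properties.
# NBR_OF_CNSDS=4 comes from the thread; MAX_CACHESERVER=6 is an ASSUMPTION
# inferred from the configured quorum of 18, i.e. 18 / (4 - 1).

# cache_quorum MACHINES SERVERS_PER_MACHINE
# Prints the number of storage members left after losing one machine.
cache_quorum() {
  echo $(( ($1 - 1) * $2 ))
}

NBR_OF_CNSDS=4
MAX_CACHESERVER=6   # assumed, not shown in the thread
CACHE_QUORUM=$(cache_quorum "$NBR_OF_CNSDS" "$MAX_CACHESERVER")

# Compare the members left after losing 0, 1 or 2 machines against the quorum.
for lost in 0 1 2; do
  remaining=$(( (NBR_OF_CNSDS - lost) * MAX_CACHESERVER ))
  if (( remaining >= CACHE_QUORUM )); then verdict="met"; else verdict="not met"; fi
  echo "machines lost: $lost, storage members: $remaining, quorum ($CACHE_QUORUM) $verdict"
done
```

Under that assumption the loop reports 24 and 18 members as meeting the quorum and 12 as not meeting it, which matches the stated intent of tolerating exactly one lost machine. These values apply to the distribution, restore, read and write quorums only; the recover quorum is set to 0 and so follows the dynamic policy instead.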
However, this behaviour seems to mean that if we lost one machine and subsequently restarted the others, recovery from persistence would not be invoked automatically. Is this the case? If so, what should the quorum values be set to so that recovery is invoked automatically and safely?
Answers
-
Hi Joe.
Did you see any messages in the log files regarding the recover quorum before you issued the force recovery?
Which version of Coherence are you using?
Tim
-
We are using version 12.2.1.2.1. Unfortunately I no longer have the log files, but yes, I believe it did log something.
-
Hi Joe.
I noticed you also have recovery-hosts set to persistence-hosts-list.
Does this point to all 4 hosts?
If you are using the dynamic recovery quorum, you should not need to include recovery-hosts. Let me do some testing here, but can you test the scenario in your test environment with recovery-hosts removed?
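As I read the Coherence 12.2.1 persistence documentation (worth checking against the docs for your exact release), the dynamic recovery quorum only lets recovery proceed when (1) the persisted image of every partition is reachable by the cluster, (2) the number of storage-enabled members is at least two-thirds of the last recorded membership, and (3) when persistence uses local rather than shared disk, each host has at least two-thirds of the members it had when that membership was recorded. The bash sketch below models those three checks; it is not Coherence code, and the function names, the HOST:LAST:CURRENT argument format and the host names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical model of the dynamic recovery quorum conditions.
# NOT Coherence code: names and argument formats are made up for illustration.

# at_least_two_thirds CURRENT LAST -> succeeds if CURRENT >= 2/3 of LAST
# (integer arithmetic: CURRENT * 3 >= LAST * 2)
at_least_two_thirds() {
  (( $1 * 3 >= $2 * 2 ))
}

# dynamic_quorum_allows REACHABLE LAST_TOTAL CURRENT_TOTAL [HOST:LAST:CURRENT ...]
#   REACHABLE      "yes" if every partition's persisted image can be reached
#   LAST_TOTAL     storage-enabled members in the last recorded membership
#   CURRENT_TOTAL  storage-enabled members running now
#   HOST:...       per-host member counts; only relevant for local-disk persistence
dynamic_quorum_allows() {
  local reachable=$1 last_total=$2 current_total=$3
  shift 3
  [[ $reachable == yes ]] || return 1
  at_least_two_thirds "$current_total" "$last_total" || return 1
  local entry host last current
  for entry in "$@"; do
    IFS=: read -r host last current <<< "$entry"
    at_least_two_thirds "$current" "$last" || return 1
  done
  return 0
}

# Three placeholder hosts with six storage members each, all restarted.
if dynamic_quorum_allows yes 18 18 hostA:6:6 hostB:6:6 hostC:6:6; then
  echo "allowed"
else
  echo "disallowed"
fi

# Same member counts, but some partitions have no reachable persisted image.
if dynamic_quorum_allows no 18 18 hostA:6:6 hostB:6:6 hostC:6:6; then
  echo "allowed"
else
  echo "disallowed"
fi
```

In this model the member counts can all be satisfied and recovery still be refused if any partition is unreachable.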
Thanks
Tim
-
Hi Tim,
OK, we will try that.
Joe
-
And yes, the 'persistence-hosts-list' address provider is currently set to include all machines in the cluster.
-
I removed recovery-hosts from the cache config and started the cluster with the full complement of machines.
Data was not recovered from persistence. I got errors like this:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
2017-09-05 12:37:11,282 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:11.281/945.867 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache desks: FederatedCacheMetadata{f_sCacheName='desks', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='desks', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='desks', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}
2017-09-05 12:37:17,049 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:17.049/951.636 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache d3messages: FederatedCacheMetadata{f_sCacheName='d3messages', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='d3messages', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='d3messages', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}
2017-09-05 12:37:23,298 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:23.298/957.884 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache dasUpdates: FederatedCacheMetadata{f_sCacheName='dasUpdates', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='dasUpdates', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='dasUpdates', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}
2017-09-05 12:38:10,879 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:38:10.879/1005.465 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
2017-09-05 12:38:33,675 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:38:33.675/1028.262 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=FlashJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.91 load-factor]
2017-09-05 12:38:33,721 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:38:33.721/1028.307 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=RamJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.25 load-factor]
2017-09-05 12:39:11,379 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:39:11.379/1065.965 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
2017-09-05 12:40:11,746 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:11.745/1126.332 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
2017-09-05 12:40:52,376 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:52.376/1166.962 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=FlashJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 2ms 0.91 load-factor]
2017-09-05 12:40:52,420 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:52.420/1167.006 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=RamJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.25 load-factor]
2017-09-05 12:41:12,166 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:41:12.166/1186.752 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
-
What that message is saying is that it can't reach the following partitions to recover them:
Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed
If you have started cache servers on all the machines that were present before, then it should be able to find those partitions.
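To cross-check the partitions named in the warning against what is on each machine's disk, it helps to pull the IDs out of the log line first. A minimal bash sketch, assuming the warning text has exactly the shape shown in the log above; the function name unreachable_partitions is hypothetical, and how partitions map to files in the persistence directories is not covered here.

```shell
#!/usr/bin/env bash
# unreachable_partitions LOGLINE
# Hypothetical helper: prints, one per line, the partition IDs listed after
# "Unreachable quorum info PartitionSet{" in a Coherence warning line.
# Returns 1 if the line does not contain that text.
unreachable_partitions() {
  local marker='Unreachable quorum info PartitionSet{'
  [[ $1 == *"$marker"* ]] || return 1
  local ids=${1#*"$marker"}   # drop everything up to and including the marker
  ids=${ids%%\}*}             # drop the closing brace and everything after it
  ids=${ids//[[:space:]]/}    # remove the spaces after each comma
  tr ',' '\n' <<< "$ids"
}

line='Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed'

unreachable_partitions "$line"                               # one ID per line
echo "count: $(unreachable_partitions "$line" | grep -c .)"  # count: 20
```

For the warning in this thread that gives 20 unreachable partitions out of the 1811 in PartitionSet{0..1810}, and because even one missing partition blocks recovery, the whole set is refused.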
Can you see where the active files for those partitions are on disk?
Tim
-
Yes, all machines were started up.
-
Hi Joe,
Can you file an SR (Service Request) with Coherence so that we can help you in more detail?
Regards,
Eshan