Persistence Recovery - 'dynamic quorum policy objections'

JoeHolder Member Posts: 54
edited Sep 7, 2017 12:40AM in Coherence Support

Hi,

I recently took a machine out of a Coherence cluster because its battery had failed, and the data was rebalanced to the remaining three machines. I then stopped the other three machines and restarted them. To my surprise, the data was not restored from persistence.

I looked at the persistence tab of the VisualVM Coherence plugin and saw there was a 'force recovery' option. When I pressed this I got the following pop-up: "Proceeding with recovery despite the dynamic quorum policy objections may lead to the partial or full data loss at the corresponding cache service. Are you sure you want to force recovery?"

I said 'yes' and it seems the data was restored ok.

I'm surprised at this because I explicitly set my recover quorum to 0, meaning let Coherence decide. The other quorum values I set to 'the number of cache nodes per machine * (the number of machines - 1)', i.e. allow for the loss of one machine without invoking quorum, but not more than that:

<partitioned-quorum-policy-scheme>

   <distribution-quorum system-property="dsp.cache.distribution.quorum">18</distribution-quorum>

   <restore-quorum system-property="dsp.cache.restore.quorum">18</restore-quorum>

   <read-quorum system-property="dsp.cache.read.quorum">18</read-quorum>

   <write-quorum system-property="dsp.cache.write.quorum">18</write-quorum>

   <!-- recover quorum of 0 enables dynamic recovery quorum policy -->
   <recover-quorum system-property="dsp.cache.recover.quorum">0</recover-quorum>

   <!-- persistence-hosts-list is defined in the tangosol override file -->
   <recovery-hosts>persistence-hosts-list</recovery-hosts>

</partitioned-quorum-policy-scheme>

[[email protected] scripts]$ grep quorum dealing_cluster.properties

# 08 Mar   2017   3.2                Joe Holder          Leave Recover-quorum at 0 - to enable dynamic recovery quorum policy (persistence)

QUORUM_SYSTEM_PROPERTIES="-Ddsp.cache.distribution.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.restore.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.read.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.write.quorum=${CACHE_QUORUM}"

#Recover-quorum of 0 to enable dynamic recovery quorum policy

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.recover.quorum=0"

[[email protected] scripts]$ grep CACHE_QUORUM dealing_cluster.properties

((CACHE_QUORUM=($NBR_OF_CNSDS - 1) * $MAX_CACHESERVER))

QUORUM_SYSTEM_PROPERTIES="-Ddsp.cache.distribution.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.restore.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.read.quorum=${CACHE_QUORUM}"

QUORUM_SYSTEM_PROPERTIES="$QUORUM_SYSTEM_PROPERTIES -Ddsp.cache.write.quorum=${CACHE_QUORUM}"

[[email protected] scripts]$ grep NBR_OF_CNSDS dealing_cluster.properties

NBR_OF_CNSDS=4

((CACHE_QUORUM=($NBR_OF_CNSDS - 1) * $MAX_CACHESERVER))
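
To make the arithmetic concrete, here is a minimal sketch of the same formula. The value of MAX_CACHESERVER does not appear in the grep output above; 6 is an assumption, inferred from the default of 18 in the cache config and NBR_OF_CNSDS=4. REMAINING_MACHINES is a name introduced only for this illustration.

```shell
# NBR_OF_CNSDS comes from the properties file above.
NBR_OF_CNSDS=4
# ASSUMPTION: not shown in the thread; inferred as 18 / (4 - 1).
MAX_CACHESERVER=6

# Same formula as dealing_cluster.properties, in POSIX arithmetic.
CACHE_QUORUM=$(( (NBR_OF_CNSDS - 1) * MAX_CACHESERVER ))

# Hypothetical helper value: machines left after one is removed.
REMAINING_MACHINES=$(( NBR_OF_CNSDS - 1 ))
REMAINING_MEMBERS=$(( REMAINING_MACHINES * MAX_CACHESERVER ))

echo "CACHE_QUORUM=$CACHE_QUORUM"
echo "REMAINING_MEMBERS=$REMAINING_MEMBERS"
```

Under that assumption both values come out as 18, so with three machines running the member count sits exactly at the configured distribution, restore, read and write quorum values, with no headroom.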

However, this behaviour seems to mean that if we lost one machine and subsequently restarted the others, recovery from persistence would not be invoked automatically. Is this the case? If so, what should the quorum values be set to in order to safely invoke it automatically?

Answers

  • Tmiddlet-Oracle
    Tmiddlet-Oracle Member Posts: 125
    edited Aug 24, 2017 8:08PM

    Hi Joe.

    Did you see any messages in the log files regarding recover quorum before you issued the force recovery?

    What version are you using of Coherence?

    Tim

  • JoeHolder
    JoeHolder Member Posts: 54
    edited Aug 31, 2017 5:51AM

    We are using version 12.2.1.2.1. Unfortunately I no longer have the log files, but yes, I believe it did log messages about the recover quorum.

  • Tmiddlet-Oracle
    Tmiddlet-Oracle Member Posts: 125
    edited Aug 31, 2017 10:43PM

    Hi Joe.

    I noticed you also have recovery-hosts set to persistence-hosts-list.

    Does this point to all 4 hosts?


    If you are using the dynamic recovery quorum, you should not need to include the recovery-hosts element.

    Let me do some testing here, but can you test the scenario in your test env removing the recovery-hosts?

    Thanks

    Tim

  • JoeHolder
    JoeHolder Member Posts: 54
    edited Sep 1, 2017 3:33AM

    Hi Tim,

    OK, we will try that.

    Joe

  • JoeHolder
    JoeHolder Member Posts: 54
    edited Sep 1, 2017 3:47AM

    And yes - the 'persistence-hosts-list' address provider is currently set to include all machines in the cluster.

  • JoeHolder
    JoeHolder Member Posts: 54
    edited Sep 5, 2017 8:42AM

    I removed the recovery-hosts element from the cache config and started the cluster (full complement of machines).

    Data was not recovered from persistence. I got errors like this:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

    2017-09-05 12:37:11,282 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:11.281/945.867 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache desks: FederatedCacheMetadata{f_sCacheName='desks', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='desks', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='desks', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}

    2017-09-05 12:37:17,049 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:17.049/951.636 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache d3messages: FederatedCacheMetadata{f_sCacheName='d3messages', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='d3messages', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='d3messages', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}

    2017-09-05 12:37:23,298 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:37:23.298/957.884 Oracle Coherence GE 12.2.1.2.1 <D7> (thread=FederatedCache:AdminCacheService, member=4): Metadata for cache dasUpdates: FederatedCacheMetadata{f_sCacheName='dasUpdates', f_mapParticipantMetadata={NTCN-DEALING=ParticipantMetadata{m_sDestinationCache='dasUpdates', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}, STCN-DEALING=ParticipantMetadata{m_sDestinationCache='dasUpdates', m_setSenders=[STCN-DEALING, NTCN-DEALING], m_setRepeaters=[]}}}

    2017-09-05 12:38:10,879 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:38:10.879/1005.465 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

    2017-09-05 12:38:33,675 [DEBUG] [[email protected]67 12.2.1.2.1] [Coherence] 2017-09-05 12:38:33.675/1028.262 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=FlashJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.91 load-factor]

    2017-09-05 12:38:33,721 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:38:33.721/1028.307 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=RamJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.25 load-factor]

    2017-09-05 12:39:11,379 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:39:11.379/1065.965 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

    2017-09-05 12:40:11,746 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:11.745/1126.332 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

    2017-09-05 12:40:52,376 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:52.376/1166.962 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=FlashJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 2ms 0.91 load-factor]

    2017-09-05 12:40:52,420 [DEBUG] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:40:52.420/1167.006 Oracle Coherence GE 12.2.1.2.1 <D9> (thread=RamJournalRM-Collector, member=4): [Journal GC reclaimed: 0.000000KB 0ms 0.25 load-factor]

    2017-09-05 12:41:12,166 [WARN] [[email protected] 12.2.1.2.1] [Coherence] 2017-09-05 12:41:12.166/1186.752 Oracle Coherence GE 12.2.1.2.1 <Warning> (thread=FederatedCache:AdminCacheService, member=4): Action "recover" disallowed:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

  • Tmiddlet-Oracle
    Tmiddlet-Oracle Member Posts: 125
    edited Sep 5, 2017 9:53AM

    What that message is saying is that it can't reach the following partitions to recover them:

    Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed

    If you have started cache servers on all machines that were present before, then it should be able to find the partitions.

    Can you see where the active files for those partitions are on disk?
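
For reference, the partition IDs in that message can be pulled out of the log with standard tools. This is only a sketch that parses the line quoted above; the variable names are made up for the example, and it makes no assumption about where Coherence keeps the active files on disk.

```shell
# Sample line copied from the log in this thread.
line='Unreachable quorum info PartitionSet{530, 568, 602, 715, 745, 920, 951, 998, 1006, 1053, 1056, 1122, 1209, 1276, 1313, 1355, 1429, 1530, 1557, 1684} - recovery of PartitionSet{0..1810} is disallowed'

# Contents of the first PartitionSet{...}, one partition ID per line.
unreachable=$(printf '%s\n' "$line" \
  | sed -n 's/^Unreachable quorum info PartitionSet{\([^}]*\)}.*/\1/p' \
  | tr -d ' ' | tr ',' '\n')

# Number of unreachable partitions (non-empty lines).
unreachable_count=$(printf '%s\n' "$unreachable" | grep -c .)

# Upper bound N of the second PartitionSet{0..N}; the total is N + 1.
last_partition=$(printf '%s\n' "$line" \
  | sed -n 's/.*PartitionSet{0\.\.\([0-9]*\)}.*/\1/p')
total_partitions=$(( last_partition + 1 ))

echo "unreachable: $unreachable_count of $total_partitions"
```

For the line above this reports 20 unreachable partitions out of 1811. The resulting ID list is what you would then look for when checking the active persistence files on each machine.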

    Tim

  • JoeHolder
    JoeHolder Member Posts: 54
    edited Sep 6, 2017 3:40AM

    Yes, all machines were started up. 

  • 2692471
    2692471 Member Posts: 1
    edited Sep 7, 2017 12:40AM

    Hi Joe,

    Can you file an SR (Service Request) with Coherence, so that we can help you in more detail?

    Regards,

    Eshan

This discussion has been closed.