Federation ActiveActive Topology: Replication delay backing-up local writes

Chris San Buenaventura Member Posts: 17 Green Ribbon
edited Jan 9, 2020 9:58PM in Coherence Support

Coherence version: 12.2.1.4.0

We are using ActiveActive federation across two clusters (one cluster in London and one cluster in New York). Today we had a prolonged network glitch which slowed down communication between our London and New York servers.

We observed that, because of the cross-Atlantic network delay, local writes to a cluster were getting backed up as well. Can you please check whether this is a bug or expected behaviour? If it is the latter, is there any way we can configure Federation so that the replication flow does not back up local writes?

We were getting logs like the ones below during the network glitch:

2020-01-05T23:41:08,356 WARN  [[email protected] 12.2.1.4.0][Coherence] (thread=SelectionService(channels=7, selector=MultiplexedSelector([email protected]), id=150693841), member=4) tmb://10.53.200.125:9300.45391 accepted connection migration with tmb://10.53.200.126:9300.54452 on MultiplexedSocketChannel(MultiplexedSocket{Socket[addr=/10.53.200.126,port=9300,localport=52414]}): peer=tmb://10.53.200.126:9300.54452, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/10.53.200.126,port=9300,localport=52414]}, migrations=6, bytes(in=13683513, out=23235313), flushlock false, bufferedOut=6.95KB, unflushed=0B, delivered(in=54117, out=58173), timeout(ack=7.49s), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=28914, out=29393/29412)

java.io.IOException: ack timeout after 15s
        at com.oracle.common.internal.net.socketbus.BufferedSocketBus$BufferedConnection.checkHealth(BufferedSocketBus.java:890)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$5.lambda$run$0(AbstractSocketBus.java:644)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$5$$Lambda$208/626754434.accept(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$5.run(AbstractSocketBus.java:644)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$3.run(AbstractSocketBus.java:426)
        at com.oracle.common.internal.net.RunnableSelectionService.processRunnables(RunnableSelectionService.java:533)
        at com.oracle.common.internal.net.RunnableSelectionService.process(RunnableSelectionService.java:349)
        at com.oracle.common.internal.net.RunnableSelectionService.run(RunnableSelectionService.java:274)
        at com.oracle.common.internal.net.ResumableSelectionService.run(ResumableSelectionService.java:133)
        at java.lang.Thread.run(Thread.java:745)

2020-01-05T23:41:33,974 WARN  [[email protected] 12.2.1.4.0][Coherence] (thread=SelectionService(channels=15, selector=MultiplexedSelector([email protected]a8), id=505818397), member=4) tmb://10.53.200.125:9300.45391 accepted connection migration with tmb://10.53.200.126:9300.54452 on MultiplexedSocketChannel(MultiplexedSocket{Socket[addr=/10.53.200.126,port=9300,localport=52530]}): peer=tmb://10.53.200.126:9300.54452, state=ACTIVE, socket=MultiplexedSocket{Socket[addr=/10.53.200.126,port=9300,localport=52530]}, migrations=7, bytes(in=13766783, out=23336481), flushlock false, bufferedOut=14.4KB, unflushed=0B, delivered(in=54389, out=58422), timeout(ack=2.13s), interestOps=1, unflushed receipt=0, receiptReturn 0, isReceiptFlushRequired false, bufferedIn(), msgs(in=29054, out=29524/29537)

java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.readv0(Native Method)
        at sun.nio.ch.SocketDispatcher.readv(SocketDispatcher.java:43)
        at sun.nio.ch.IOUtil.read(IOUtil.java:278)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:435)
        at com.oracle.common.internal.net.WrapperSocketChannel.read(WrapperSocketChannel.java:130)
        at com.oracle.common.internal.net.MultiplexedSocketProvider$MultiplexedSocketChannel.read(MultiplexedSocketProvider.java:1547)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$Connection.read(AbstractSocketBus.java:1956)
        at com.oracle.common.internal.net.socketbus.BufferedSocketBus$BufferedConnection.read(BufferedSocketBus.java:93)
        at com.oracle.common.internal.net.socketbus.SocketMessageBus$MessageConnection$ReadBatch.read(SocketMessageBus.java:615)
        at com.oracle.common.internal.net.socketbus.SocketMessageBus$MessageConnection.processReads(SocketMessageBus.java:206)
        at com.oracle.common.internal.net.socketbus.BufferedSocketBus$BufferedConnection.onReadySafe(BufferedSocketBus.java:700)
        at com.oracle.common.internal.net.socketbus.AbstractSocketBus$Connection.onReady(AbstractSocketBus.java:2135)
        at com.oracle.common.internal.net.RunnableSelectionService.process(RunnableSelectionService.java:401)
        at com.oracle.common.internal.net.RunnableSelectionService.run(RunnableSelectionService.java:274)
        at com.oracle.common.internal.net.ResumableSelectionService.run(ResumableSelectionService.java:133)
        at java.lang.Thread.run(Thread.java:745)


Best Answer

  • Randy Stafford-Oracle
    Randy Stafford-Oracle Member Posts: 24 Employee
    edited Jan 8, 2020 12:03AM Answer ✓

    Hi Chris,

    Back to My Oracle Support again, I'd like to request that you create a Service Request for this.  That will allow us to collect relevant information, engage the correct product engineers, etc.  Could you please do that and let me know the SR number?

    Thanks,
    Randy

Answers

  • Chris San Buenaventura
    Chris San Buenaventura Member Posts: 17 Green Ribbon
    edited Jan 9, 2020 9:07PM

    Thanks Randy, my account just got sorted today to allow me to raise SRs. For closure here is the SR number: 3-22025752461


  • Randy Stafford-Oracle
    Randy Stafford-Oracle Member Posts: 24 Employee
    edited Jan 9, 2020 9:54PM

    Glad to hear it!  Can you download patches now too?

    I'll get attention focused on the SR.

    Cheers,
    Randy

  • Pathikrit Kumar
    Pathikrit Kumar Member Posts: 1 Red Ribbon

    What was the resolution for this issue, please?

  • Randy Stafford-Oracle
    Randy Stafford-Oracle Member Posts: 24 Employee

    Hello Pathikrit, the resolution was as follows.

    A slow network can trigger tmb:// connection migrations, so the warning is okay. As for configuring Federation so that the replication flow does not back up local writes: the user can use the FederationManager MBean to temporarily stop federation to a destination (the remote cluster). Once the network is back to normal, the user can use the FederationManager MBean to perform a startWithSync operation, which performs a replicateAll first upon starting federation to the destination. The user can also temporarily pause federation to a destination instead, but in that case the changes accumulate in the local federation caches and are federated to the destination later, when federation is started again.
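
    As a rough sketch only, the stop / startWithSync sequence described above could be driven over JMX along these lines. The operation names (stop, pause, startWithSync) come from the reply above; the ObjectName pattern, the single String participant argument, and the class and method names are assumptions made for illustration and should be checked against the MBeans your cluster actually exposes (for example in JConsole) before use.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class FederationControl {

    // ASSUMPTION: key properties of the federation coordinator MBean.
    // Verify the real object name in your own cluster before relying on this.
    static ObjectName coordinatorPattern(String service) {
        try {
            return new ObjectName(
                "Coherence:type=Federation,subType=Coordinator,service=" + service + ",*");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad service name: " + service, e);
        }
    }

    // Invokes one operation (e.g. "stop", "pause", "startWithSync") for one
    // destination participant on the first MBean matching the pattern.
    // ASSUMPTION: each operation takes the participant name as a single String.
    static void invoke(MBeanServerConnection conn, String service,
                       String operation, String participant)
            throws IOException, JMException {
        Set<ObjectName> names = conn.queryNames(coordinatorPattern(service), null);
        if (names.isEmpty()) {
            throw new IllegalStateException("No federation MBean found for " + service);
        }
        ObjectName name = names.iterator().next();
        conn.invoke(name, operation,
                    new Object[] {participant},
                    new String[] {String.class.getName()});
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: FederationControl <service> <operation> <participant>");
            return;
        }
        // Uses the local platform MBean server; a remote JMXConnector could
        // supply the MBeanServerConnection instead.
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        invoke(conn, args[0], args[1], args[2]);
    }
}
```

    With this sketch, the sequence from the reply would be one call with "stop" while the link is degraded, followed by one call with "startWithSync" once the network has recovered.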


    In version 12.2.1.4.0, federated cache schemes can be configured with journalcache-highunits, and the limit can now be specified as a memory-based value (for example, 500MB).

    The journalcache-highunits element contains either a memory limit or a maximum number of entries that the federated cache service's internal cache will hold in its backlog for replication to destination participants. The element provides a mechanism to constrain the resources used by the federation service's internal caches. Once the journalcache-highunits limit is reached, the federation service moves all the destination participants to the ERROR state and removes all pending entries from federation's internal backlog cache.


    Valid values are memory values (e.g. "1G") or positive integers and zero. A memory value is treated as a memory limit on federation's backlog. If no units are specified, then the value is treated as a limit on the number of entries in the backlog. Zero implies no limit. The default value is 0.
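
    To make the value rules above concrete, here is a small illustrative sketch of how a journalcache-highunits value would be classified. It is not Coherence's own parser: the class, enum, and method names are hypothetical, and the suffix handling (K, M, G, T with an optional B, each a power of 1024) is an assumption.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JournalHighUnits {

    public enum Kind { UNLIMITED, ENTRY_COUNT, MEMORY_BYTES }

    // Result of classifying a configured value (hypothetical helper type).
    public static final class Limit {
        public final Kind kind;
        public final long value;

        Limit(Kind kind, long value) {
            this.kind = kind;
            this.value = value;
        }

        @Override
        public String toString() {
            return kind + "(" + value + ")";
        }
    }

    // ASSUMPTION: memory values are digits followed by K, M, G or T and an optional B.
    private static final Pattern MEMORY =
        Pattern.compile("(\\d+)\\s*([KMGT])B?", Pattern.CASE_INSENSITIVE);

    public static Limit interpret(String raw) {
        String s = raw.trim();
        if (s.matches("\\d+")) {
            // No units: the value is an entry count, and zero means no limit.
            long n = Long.parseLong(s);
            return n == 0 ? new Limit(Kind.UNLIMITED, 0) : new Limit(Kind.ENTRY_COUNT, n);
        }
        Matcher m = MEMORY.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a valid value: " + raw);
        }
        // ASSUMPTION: binary multiples, so K = 2^10, M = 2^20, G = 2^30, T = 2^40.
        int index = "KMGT".indexOf(Character.toUpperCase(m.group(2).charAt(0)));
        long factor = 1L << (10 * (index + 1));
        return new Limit(Kind.MEMORY_BYTES, Long.parseLong(m.group(1)) * factor);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"0", "1000", "500MB", "1G"}) {
            System.out.println(v + " -> " + interpret(v));
        }
    }
}
```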