3 Replies Latest reply: Sep 2, 2010 10:19 AM by 701681

    PushReplication retries cause local extend clients to be blocked!

    701681
      I have an issue where Push Replication retries cause local extend clients to be blocked!

      Scenario as follows:

1) I start up my London cluster and add around 130 push replication publishers (i.e. one per cache, per remote site).

2) I set these publishers with auto-start equal to true, and infinite retries in the case of failure (i.e. I want my system to be totally automated; I don't want an end user to have to go into JMX and click resume across all 130 publishers).

3) If my remote sites are up and available, the RemoteInvocationPublishers all connect, and all is fine.

4) If, however, the remote sites are not available, I see the push replication publishers retry periodically. The problem is that any extend clients connecting into London get blocked by the 130 push rep publishing service threads that are retrying, and they eventually time out (stack trace below).

      ==========================================================================
      Name: Proxy:ExtendedTcpProxyService:TcpAcceptorWorker:17
      State: BLOCKED on com.tangosol.coherence.component.util.SafeCluster@18cad92 owned by: PublishingService:Thread-5
      Total blocked: 10 Total waited: 122

      Stack trace:
      com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:995)
      com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:915)
      com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:374)
      com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:877)
      com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1088)
      com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:304)
      com.tangosol.coherence.component.net.extend.proxy.CacheServiceProxy.ensureNamedCacheProxy(CacheServiceProxy.CDB:27)
      - locked java.util.HashMap@18060fd
      com.tangosol.coherence.component.net.extend.messageFactory.CacheServiceFactory$EnsureCacheRequest.onRun(CacheServiceFactory.CDB:13)
      com.tangosol.coherence.component.net.extend.message.Request.run(Request.CDB:4)
      com.tangosol.coherence.component.net.extend.proxy.CacheServiceProxy.onMessage(CacheServiceProxy.CDB:1)
      com.tangosol.coherence.component.net.extend.Channel.execute(Channel.CDB:28)
      com.tangosol.coherence.component.net.extend.Channel.receive(Channel.CDB:26)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Peer$DaemonPool$WrapperTask.run(Peer.CDB:9)
      com.tangosol.coherence.component.util.DaemonPool$WrapperTask.run(DaemonPool.CDB:32)
      com.tangosol.coherence.component.util.DaemonPool$Daemon.onNotify(DaemonPool.CDB:63)
      com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
      java.lang.Thread.run(Thread.java:619)
      ==========================================================================

It's a massive issue if a remote failure causes local clients to block!
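From the trace it looks like every thread going through CacheFactory.ensureCluster() synchronises on the one shared SafeCluster instance, so while a publishing service thread holds that monitor during a slow connect/retry to an unreachable site, every TcpAcceptorWorker queues up behind it. Below is a minimal, standalone sketch of the contention pattern I believe is happening - plain Java with made-up class and thread names, not Coherence code:

      ==========================================================================
      // Illustrative only: two threads contending on one shared monitor, the
      // same shape as PublishingService vs. TcpAcceptorWorker in the trace above.
      public class MonitorContentionSketch
      {
          // stands in for the single shared SafeCluster instance
          private static final Object SHARED_LOCK = new Object();

          public static void main(String[] args) throws Exception
          {
              // plays the role of the publishing service thread retrying a remote site
              Thread publisher = new Thread(new Runnable()
              {
                  public void run()
                  {
                      synchronized (SHARED_LOCK)
                      {
                          try
                          {
                              // simulates a slow connect/retry while holding the monitor
                              Thread.sleep(30000);
                          }
                          catch (InterruptedException e)
                          {
                              Thread.currentThread().interrupt();
                          }
                      }
                  }
              }, "PublishingService:Thread-5");

              // plays the role of the extend proxy worker calling ensureCluster()
              Thread extendWorker = new Thread(new Runnable()
              {
                  public void run()
                  {
                      synchronized (SHARED_LOCK)   // BLOCKED until the publisher lets go
                      {
                          System.out.println("extend worker finally got the cluster lock");
                      }
                  }
              }, "Proxy:ExtendedTcpProxyService:TcpAcceptorWorker:17");

              publisher.start();
              Thread.sleep(100);   // let the publisher grab the lock first
              extendWorker.start();
          }
      }
      ==========================================================================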

Even if I set my retry count to a finite value and increase the retry interval, it seems there will still be periods where extend clients are blocked by the publishing service thread retries.
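In the meantime, the only workaround I can think of is to leave the publishers suspended (or with a small retry count) and resume them all from a little JMX client once the remote sites come back, rather than clicking resume 130 times in the console. A rough sketch is below; the JMX URL, the ObjectName pattern and the "resume" operation name are my guesses and would need checking against what the publisher MBeans actually register as:

      ==========================================================================
      import java.util.Set;
      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      // Rough sketch: resume every publisher MBean in one pass. The JMX URL,
      // ObjectName pattern and operation name below are assumptions - substitute
      // whatever the publishers really register under in your JMX console.
      public class ResumeAllPublishers
      {
          public static void main(String[] args) throws Exception
          {
              JMXServiceURL url = new JMXServiceURL(
                      "service:jmx:rmi:///jndi/rmi://localhost:9000/jmxrmi");
              JMXConnector connector = JMXConnectorFactory.connect(url);
              try
              {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();

                  // assumed pattern - check the console for the real domain and keys
                  ObjectName pattern = new ObjectName("Coherence:type=Publisher,*");

                  Set<ObjectName> publishers = mbs.queryNames(pattern, null);
                  for (ObjectName publisher : publishers)
                  {
                      mbs.invoke(publisher, "resume", new Object[0], new String[0]);
                      System.out.println("resumed " + publisher);
                  }
              }
              finally
              {
                  connector.close();
              }
          }
      }
      ==========================================================================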

      Any help much appreciated.

      Cheers,
      Neville.

NB: Coherence version 3.5.3-p3. Incubator versions used: Push Replication 2.6.1.14471, Messaging 2.6.1.14471, Common 1.6.1.14470.