11 Replies Latest reply: Apr 14, 2011 1:43 PM by Bob Hanckel

    deadlock on cluster startup - push replication + coherence 3.6.1.2

    701681
      Has anything changed in 3.6.1.2 that would cause a deadlock on startup?

      Push replication makes the following call in CoherencePushReplicationProvider.registerPublisher(...):

      NamedCache destinationCache = CacheFactory.getCache("coherence.messagingpattern.destinations");


      The interesting area of the thread dump:

      2011-04-13 15:55:35,293 ERROR [Logger@9240861 3.6.1.2] Coherence: 2011-04-13 15:55:35.292/363.217 Oracle Coherence GE 3.6.1.2 <Error> (thread=Cluster, member=7): Detected soft timeout) of {WrapperGuardable Guard{Daemon=DistributedCache:StaticDataCacheStoreService} Service=PartitionedCache{Name=StaticDataCacheStoreService, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=257, BackupCount=1, AssignedPartitions=257, BackupPartitions=0}}
      2011-04-13 15:55:35,294 WARN [Logger@9240861 3.6.1.2] Coherence: 2011-04-13 15:55:35.294/363.219 Oracle Coherence GE 3.6.1.2 <Warning> (thread=Recovery Thread, member=7): Logging stacktrace due to soft-timeout
      2011-04-13 15:55:35,314 ERROR [Logger@9240861 3.6.1.2] Coherence: 2011-04-13 15:55:35.310/363.235 Oracle Coherence GE 3.6.1.2 <Error> (thread=Recovery Thread, member=7): Full Thread Dump


      Thread[DistributedCache:StaticDataCacheStoreService,5,Cluster]
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:485)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:32)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$BinaryMap.invoke(PartitionedCache.CDB:30)
      com.tangosol.util.ConverterCollections$ConverterInvocableMap.invoke(ConverterCollections.java:2156)
      com.tangosol.util.ConverterCollections$ConverterNamedCache.invoke(ConverterCollections.java:2622)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$ViewMap.invoke(PartitionedCache.CDB:11)
      com.tangosol.coherence.component.util.SafeNamedCache.invoke(SafeNamedCache.CDB:1)
      com.oracle.coherence.patterns.pushreplication.providers.coherence.CoherencePushReplicationProvider.registerPublisher(CoherencePushReplicationProvider.java:172)
      ---XXXPushReplicationProvider.registerCachePublishers(---XXXPushReplicationProvider.java:114)
      ---XXXPushReplicationProvider.establishPublishingInfrastructureFor(---XXXPushReplicationProvider.java:71)
      com.oracle.coherence.patterns.pushreplication.PublishingCacheStore.<init>(PublishingCacheStore.java:196)
      sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
      sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
      java.lang.reflect.Constructor.newInstance(Constructor.java:513)
      com.tangosol.util.ClassHelper.newInstance(ClassHelper.java:615)
      com.tangosol.run.xml.XmlHelper.createInstance(XmlHelper.java:2501)
      com.tangosol.run.xml.XmlHelper.createInstance(XmlHelper.java:2426)
      com.tangosol.net.DefaultConfigurableCacheFactory.instantiateAny(DefaultConfigurableCacheFactory.java:3256)
      com.oracle.coherence.environment.extensible.ExtensibleEnvironment.instantiateAny(ExtensibleEnvironment.java:720)
      com.tangosol.net.DefaultConfigurableCacheFactory.instantiateCacheStore(DefaultConfigurableCacheFactory.java:3106)
      com.tangosol.net.DefaultConfigurableCacheFactory.instantiateReadWriteBackingMap(DefaultConfigurableCacheFactory.java:1674)
      com.tangosol.net.DefaultConfigurableCacheFactory.configureBackingMap(DefaultConfigurableCacheFactory.java:1429)
      com.tangosol.net.DefaultConfigurableCacheFactory$Manager.instantiateBackingMap(DefaultConfigurableCacheFactory.java:3904)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$Storage.instantiateResourceMap(PartitionedCache.CDB:22)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$Storage.setCacheName(PartitionedCache.CDB:25)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$ServiceConfig$ConfigListener.entryInserted(PartitionedCache.CDB:17)
      com.tangosol.util.MapEvent.dispatch(MapEvent.java:266)
      com.tangosol.util.MapEvent.dispatch(MapEvent.java:226)
      com.tangosol.util.MapListenerSupport.fireEvent(MapListenerSupport.java:556)
      com.tangosol.util.ObservableHashMap.dispatchEvent(ObservableHashMap.java:229)
      com.tangosol.util.ObservableHashMap$Entry.onAdd(ObservableHashMap.java:270)
      com.tangosol.util.SafeHashMap.put(SafeHashMap.java:244)
      com.tangosol.coherence.component.util.ServiceConfig$Map.updateInternal(ServiceConfig.CDB:30)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$ConfigUpdate.onReceived(Grid.CDB:42)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:11)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:33)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.PartitionedService.onNotify(PartitionedService.CDB:3)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache.onNotify(PartitionedCache.CDB:3)
      com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
      java.lang.Thread.run(Thread.java:662)

      Thread[Main-Load-Thread,5,main]
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:485)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:32)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.poll(Grid.CDB:11)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache.ensureCache(PartitionedCache.CDB:29)
      com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache.ensureCache(PartitionedCache.CDB:36)
      com.tangosol.coherence.component.util.safeService.SafeCacheService.ensureCache$Router(SafeCacheService.CDB:1)
      com.tangosol.coherence.component.util.safeService.SafeCacheService.ensureCache(SafeCacheService.CDB:33)
      com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:875)
      com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1231)
      com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:290)
      com.tangosol.net.CacheFactory.getCache(CacheFactory.java:735)
      com.tangosol.net.CacheFactory.getCache(CacheFactory.java:712)
      ---main.StorageMain.registerCachesWithCluster(StorageMain.java:174)
      ---.main.StorageMain.loadCaches(StorageMain.java:117)
      ---.StorageMain.start(StorageMain.java:76)
      ---.MainController$1.run(MainController.java:106)
      java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      java.lang.Thread.run(Thread.java:662)


      Cheers,
      Neville.

      Edited by: neville deboo on Apr 13, 2011 7:44 PM
        • 1. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
          Bob Hanckel
          Hi Neville,

          Could you remind us of what version of Push Rep you are using?

          Thanks,

          Bob
          • 2. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
            Bob Hanckel
            Also, could you post the relevant Push Rep configuration embedded in your Coherence config file?

            Thanks,

            Bob
            • 3. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
              701681
              Hi,

              We are using push rep version 3.0.3.20019, which has been working fine against Coherence 3.6.0.2 in production.

              Cheers,
              Neville.
              • 4. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                701681
                As for the push rep config, we configure push replication programmatically, not in the config file (this code has not changed for some months).

                Code looks something like this:


                RemoteClusterPublisherScheme publisherScheme = new RemoteClusterPublisherScheme();
                publisherScheme.setAutostart(true);
                publisherScheme.setBatchPublishingDelay(AbstractPublisherScheme.BATCH_PUBLISHING_DELAY_DEFAULT);
                publisherScheme.setBatchSize(AbstractPublisherScheme.BATCH_SIZE_DEFAULT);
                publisherScheme.setTotalConsecutiveFailuresBeforeSuspending(-1);
                publisherScheme.setRestartDelay(RESTART_DELAY);
                publisherScheme.setRemoteInvocationServiceName(remoteSiteName);

                LocalCachePublisherScheme localCachePublisherScheme = new LocalCachePublisherScheme();
                Expression<String> expression = new MacroParameterExpression(String.class, cacheName);
                ClassScheme<ConflictResolver> conflictResolverClassScheme = new ParameterizedClassScheme<ConflictResolver>(NullConflictResolver.class.getName(), new Object[]{});
                localCachePublisherScheme.setConflictResolverScheme(conflictResolverClassScheme);
                localCachePublisherScheme.setTargetCacheName(expression);
                publisherScheme.setRemotePublisherScheme(localCachePublisherScheme);

                GuardSupport.heartbeat();
                this.registerPublisher(cacheName, remoteSiteName, publisherScheme, parameterizedClassScheme);
                GuardSupport.heartbeat();
                • 5. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                  Bob Hanckel
                  Hi Neville,

                  I'm trying to understand the relationship between the CoherencePushReplicationProvider and
                  your own provider (i.e. -XXXPushReplicationProvider). Is -XXX extending CoherencePushReplicationProvider?

                  I will need to follow up with some folks on this and will get back to you.

                  Regards,

                  Bob
                  • 6. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                    Bob Hanckel
                    This could be a guardian tuning problem. Any idea of how long registerCachePublisher is taking?
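                    For reference, if the registration is simply long-running, the usual knobs are the cluster-wide guardian timeout and explicit heartbeats from the guarded thread. A minimal sketch, assuming the standard tangosol.coherence.guard.timeout system property and the GuardSupport API (the class name and values below are illustrative only, not taken from this configuration):

                    import com.tangosol.net.GuardSupport;

                    public class GuardianTuningSketch
                    {
                        public static void main(String[] args)
                        {
                            // Raise the cluster-wide guardian timeout before the cluster starts.
                            // The default is 305000 ms; 600000 here is just an example value.
                            System.setProperty("tangosol.coherence.guard.timeout", "600000");

                            // Inside long-running work on a guarded service thread, issue periodic
                            // heartbeats (as the snippet in reply 4 already does around registerPublisher).
                            GuardSupport.heartbeat();
                        }
                    }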
                    • 7. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                      701681
                      Yes, XXXPushReplicationProvider does indeed extend CoherencePushReplicationProvider.

                      Our implementation does two things:

                      1. It overrides the publish methods, which simply check a control flag to prevent write-through to Coherence messaging on startup (a simplified sketch follows below).

                      2. It overrides the establish-publishing-infrastructure method for a given cache, so we can be programmatically selective about which publishers we create based on the cache name.
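                      In simplified form, the control flag from point 1 is just a volatile boolean that the overridden publish methods consult before delegating to CoherencePushReplicationProvider. A rough sketch (the names here are illustrative, not our actual code):

                      // Illustrative stand-in for the control flag described in point 1.
                      public class PublishingControl
                      {
                          // false while the cluster is starting; flipped once startup has completed
                          private static volatile boolean publishingEnabled = false;

                          public static void enablePublishing()
                          {
                              publishingEnabled = true;
                          }

                          public static boolean isPublishingEnabled()
                          {
                              return publishingEnabled;
                          }
                      }

                      Each overridden publish method returns immediately while isPublishingEnabled() is false; otherwise it delegates to the superclass implementation.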
                      • 8. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                        701681
                        In the above scenario the cluster just hung for around 5 minutes on startup. Btw, our service guardian is disabled and in logging mode only. We had a similar cluster deadlock type issue with 3.5.3, and the upgrade to 3.6.0.2 was the advised fix. I notice the release notes for 3.6.1.1 mention a fix around an ensureCluster deadlock; not sure if it's related.
                        • 9. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                          701681
                          Some more info.

                          We run a total of 2 storage nodes. The thread dump snippet above was from one of the storage nodes.

                          Here is the snippet from the other node:

                          2011-04-13 15:55:35,357 WARN [Logger@9240084 3.6.1.2] Coherence: 2011-04-13 15:55:35.356/370.414 Oracle Coherence GE 3.6.1.2 <Warning> (thread=Recovery Thread, member=3): Logging stacktrace due to soft-timeout
                          2011-04-13 15:55:35,365 ERROR [Logger@9240084 3.6.1.2] Coherence: 2011-04-13 15:55:35.364/370.422 Oracle Coherence GE 3.6.1.2 <Error> (thread=Recovery Thread, member=3): Full Thread Dump

                          Thread[DistributedCacheForDestinationsWorker:0,5,DistributedCacheForDestinations]
                          com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:995)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureServiceInternal(DefaultConfigurableCacheFactory.java:923)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:892)
                          com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:497)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:874)
                          com.tangosol.net.DefaultConfigurableCacheFactory.configureCache(DefaultConfigurableCacheFactory.java:1231)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureCache(DefaultConfigurableCacheFactory.java:290)
                          com.tangosol.net.CacheFactory.getCache(CacheFactory.java:735)
                          com.tangosol.net.CacheFactory.getCache(CacheFactory.java:712)
                          com.oracle.coherence.patterns.messaging.Destination.subscribe(Destination.java:265)
                          com.oracle.coherence.patterns.messaging.Topic.subscribe(Topic.java:74)
                          com.oracle.coherence.patterns.messaging.entryprocessors.TopicSubscribeProcessor.process(TopicSubscribeProcessor.java:117)
                          com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$Storage.invoke(PartitionedCache.CDB:6)
                          com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache.onInvokeRequest(PartitionedCache.CDB:62)
                          com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.partitionedService.PartitionedCache$InvokeRequest.run(PartitionedCache.CDB:1)
                          com.tangosol.coherence.component.util.DaemonPool$WrapperTask.run(DaemonPool.CDB:1)
                          com.tangosol.coherence.component.util.DaemonPool$WrapperTask.run(DaemonPool.CDB:32)
                          com.tangosol.coherence.component.util.DaemonPool$Daemon.onNotify(DaemonPool.CDB:63)
                          com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
                          java.lang.Thread.run(Thread.java:662)

                          Thread[Main-Load-Thread,5,main]
                          java.lang.Object.wait(Native Method)
                          com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:18)
                          com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:6)
                          com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:28)
                          com.tangosol.coherence.component.util.safeService.SafeCacheService.startService(SafeCacheService.CDB:5)
                          com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:27)
                          com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:14)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureServiceInternal(DefaultConfigurableCacheFactory.java:1057)
                          com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:892)
                          com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:497)
                          com.tangosol.net.DefaultCacheServer.startServices(DefaultCacheServer.java:81)
                          com.tangosol.net.DefaultCacheServer.start(DefaultCacheServer.java:230)
                          ---.GenericMain.start(GenericMain.java:192)
                          ---.StorageMain.start(StorageMain.java:62)
                          ---.MainController$1.run(MainController.java:106)
                          java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                          java.lang.Thread.run(Thread.java:662)

                          2011-04-13 15:56:00,114 ERROR [Main-Load-Thread] StorageMain: Error on start()
                          com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=8, Name=StaticDataCacheStoreService, Type=DistributedCache
                          MemberSet=ServiceMemberSet(.....................
                          • 10. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                            Bob Hanckel
                            Thanks for the additional info, Neville. I'll look at the registerProvider function today and chase down
                            other push rep experts.
                            • 11. Re: deadlock on cluster startup - push replication + coherence 3.6.1.2
                              Bob Hanckel
                              Hi Neville,

                              We've spent some time looking at this, but the problem is not obvious. Could you file
                              an SR with detailed logs and configuration info?

                              Thanks,

                              Bob