1 Reply Latest reply: Jul 8, 2010 9:05 AM by Brian Oliver

    Error while starting service "DistributedCacheForMessages" - timeout

    704908
      Hi Guys,

      We get occasional problems starting the cluster:
      2010-04-14 18:15:46,619 ERROR lccohd2-3 [2.5.806] Log4j [Logger@9222213 3.5.3/465p2] - 35.507 <Error> (thread=main, member=6): Error while starting service "DistributedCacheForMessages": com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=4, Name=DistributedCacheForMessages, Type=DistributedCache
        MemberSet=ServiceMemberSet( 
          OldestMember=Member(Id=1, Timestamp=2010-04-14 18:15:01.801, Address=192.168.15.101:8088, MachineId=26469, Location=site:london,machine:lccohd1,process:10672,member:lccohd1-1, Role=)
          ActualMemberSet=MemberSet(Size=11, BitSetCount=2
            Member(Id=1, Timestamp=2010-04-14 18:15:01.801, Address=192.168.15.101:8088, MachineId=26469, Location=site:london,machine:lccohd1,process:10672,member:lccohd1-1, Role=)
            Member(Id=2, Timestamp=2010-04-14 18:15:05.19, Address=192.168.15.102:8088, MachineId=26470, Location=site:london,machine:lccohd2,process:17631,member:lccohd2-1, Role=)
            Member(Id=3, Timestamp=2010-04-14 18:15:12.834, Address=192.168.15.102:8089, MachineId=26470, Location=site:london,machine:lccohd2,process:17767,member:lccohd2-2, Role=)
            Member(Id=4, Timestamp=2010-04-14 18:15:12.986, Address=192.168.15.101:8089, MachineId=26469, Location=site:london,machine:lccohd1,process:10808,member:lccohd1-2, Role=)
            Member(Id=6, Timestamp=2010-04-14 18:15:14.044, Address=192.168.15.102:8090, MachineId=26470, Location=site:london,machine:lccohd2,process:17804,member:lccohd2-3, Role=)
            Member(Id=7, Timestamp=2010-04-14 18:15:14.813, Address=192.168.15.101:8091, MachineId=26469, Location=site:london,machine:lccohd1,process:10891,member:lccohd1-4, Role=)
            Member(Id=8, Timestamp=2010-04-14 18:15:15.009, Address=192.168.15.102:8091, MachineId=26470, Location=site:london,machine:lccohd2,process:17846,member:lccohd2-4, Role=)
            Member(Id=9, Timestamp=2010-04-14 18:15:15.481, Address=192.168.15.101:8092, MachineId=26469, Location=site:london,machine:lccohd1,process:10938,member:lccohd1-5, Role=)
            Member(Id=10, Timestamp=2010-04-14 18:15:16.275, Address=192.168.15.102:8092, MachineId=26470, Location=site:london,machine:lccohd2,process:17902,member:lccohd2-5, Role=)
            Member(Id=11, Timestamp=2010-04-14 18:15:16.497, Address=192.168.15.101:8093, MachineId=26469, Location=site:london,machine:lccohd1,process:10998,member:lccohd1-6, Role=)
            Member(Id=12, Timestamp=2010-04-14 18:15:17.323, Address=192.168.15.102:8093, MachineId=26470, Location=site:london,machine:lccohd2,process:17966,member:lccohd2-6, Role=)
            )
      
          MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
            1/3.5/Wed Apr 14 18:15:05 BST 2010/false,
            2/3.5/Wed Apr 14 18:15:06 BST 2010/false,
            3/3.5/Wed Apr 14 18:15:14 BST 2010/false,
            4/3.5/Wed Apr 14 18:15:14 BST 2010/false,
            6/3.5/Wed Apr 14 18:15:16 BST 2010/false,
            7/3.5/Wed Apr 14 18:15:17 BST 2010/false,
            8/3.5/Wed Apr 14 18:15:17 BST 2010/false,
            9/3.5/Wed Apr 14 18:15:18 BST 2010/false,
            10/3.5/Wed Apr 14 18:15:18 BST 2010/false,
            11/3.5/Wed Apr 14 18:15:19 BST 2010/false,
            12/3.5/Wed Apr 14 18:15:20 BST 2010/false
          )
      )
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
              at com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:28)
              at com.tangosol.coherence.component.util.safeService.SafeCacheService.startService(SafeCacheService.CDB:5)
              at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:27)
              at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:14)
              at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:1008)
              at com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:374)
              at com.tangosol.net.DefaultCacheServer.start(DefaultCacheServer.java:139)
              at com.core.AbstractQTGServer.start(AbstractQTGServer.java:108)
              at com.EmsServer.main(EmsServer.java:42)
      
      
      
      2010-04-14 18:15:46,621 ERROR lccohd2-3 [2.5.806] Log4j [Logger@9222213 3.5.3/465p2] - 35.508 <Error> (thread=DistributedCache:DistributedCacheForMessages, member=6): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
        {
        PollId=1, active
        InitTimeMillis=1271265316553
        Service=DistributedCacheForMessages (4)
        RespondedMemberSet=[1,2,3,5]
        LeftMemberSet=[]
        RemainingMemberSet=[4]
        }
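      In case it helps anyone reproduce or work around this: my understanding (an assumption on my part, not something confirmed by Oracle) is that the "Timeout during service start" is governed by the service's request timeout, which can be raised in the cache configuration. Scheme name and timeout value below are illustrative only:

      ```xml
      <!-- Sketch, assuming Coherence 3.5 cache configuration syntax.
           Raising <request-timeout> gives slow-starting members more
           time before start() fails with RequestTimeoutException. -->
      <distributed-scheme>
        <scheme-name>messages-distributed-scheme</scheme-name>
        <service-name>DistributedCacheForMessages</service-name>
        <request-timeout>2m</request-timeout>
        <backing-map-scheme>
          <local-scheme/>
        </backing-map-scheme>
        <autostart>true</autostart>
      </distributed-scheme>
      ```

      Of course this only widens the window; it doesn't address whatever is keeping member 4 from answering the handshake.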
      As these are so difficult to catch, I've followed the advice on Metalink and implemented an auto-capture of the state of the non-responsive node. This showed the following:
      "DistributedCache:DistributedCacheForMessages" daemon prio=10 tid=0x000000004ee62400 nid=0x2b78 waiting for monitor entry [0x00000000454f4000..0x00000000454f5c90]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:4)
              - waiting to lock <0x00002aaabd0049a8> (a com.tangosol.coherence.component.util.SafeCluster)
              at com.tangosol.net.cache.ContinuousQueryCache.getCache(ContinuousQueryCache.java:182)
              at com.tangosol.net.cache.ContinuousQueryCache.configureSynchronization(ContinuousQueryCache.java:1157)
              - locked <0x00002aaab28da500> (a com.tangosol.net.cache.ContinuousQueryCache)
              at com.tangosol.net.cache.ContinuousQueryCache.ensureSynchronized(ContinuousQueryCache.java:1360)
              - locked <0x00002aaab28da500> (a com.tangosol.net.cache.ContinuousQueryCache)
              at com.tangosol.net.cache.ContinuousQueryCache.<init>(ContinuousQueryCache.java:159)
              at com.tangosol.net.cache.ContinuousQueryCache.<init>(ContinuousQueryCache.java:67)
              at com.oracle.coherence.patterns.messaging.MessageEventManager.<init>(MessageEventManager.java:85)
              at com.oracle.coherence.patterns.messaging.MessageEventManager.<clinit>(MessageEventManager.java:70)
              at com.oracle.coherence.patterns.messaging.MessagePartitionListener.onEvent(MessagePartitionListener.java:76)
              at com.oracle.coherence.patterns.messaging.MessagePartitionListener.onPartitionEvent(MessagePartitionListener.java:56)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.firePartitionEvent(DistributedCache.CDB:9)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.firePartitionEvent(DistributedCache.CDB:13)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onTransferRequest(DistributedCache.CDB:157)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$TransferRequest.onReceived(DistributedCache.CDB:1)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:9)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:136)
              at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onNotify(DistributedCache.CDB:3)
              at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
              at java.lang.Thread.run(Thread.java:619)
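      Reading the dump, it looks like the static initializer of MessageEventManager is constructing a ContinuousQueryCache on the DistributedCache service thread (triggered from the partition listener), which then blocks waiting for the SafeCluster lock held during service start. A generic sketch of what I mean by deferring that work out of the static initializer (class and method names here are mine, not from the messaging pattern, and the String field just stands in for the real cache):

      ```java
      // Sketch: avoid heavyweight/blocking work in a static initializer.
      // An eager "static { ... }" block runs during class load on whatever
      // thread first touches the class -- here, the service thread firing
      // the partition event. Deferring construction to an explicit call
      // lets the application trigger it from a safe thread instead.
      public final class LazyManager {
          private static volatile LazyManager instance;  // created on demand
          private final String resource;                 // stands in for a ContinuousQueryCache

          private LazyManager() {
              // In the real code this is where the blocking construction
              // (e.g. new ContinuousQueryCache(...)) would happen.
              this.resource = "initialized";
          }

          /** Create the singleton explicitly, from a known-safe thread. */
          public static synchronized LazyManager instance() {
              if (instance == null) {
                  instance = new LazyManager();
              }
              return instance;
          }

          public String resource() {
              return resource;
          }
      }
      ```

      The idea would be to call something like `LazyManager.instance()` from `main` before `DefaultCacheServer.start(...)`, so the expensive construction never happens on the service thread. Happy to be corrected if I've misread the dump.
      
      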
      I thought this might be interesting to you guys.

      I have full cluster logs available as well as full stack dump for the unresponsive node.

      Regards, Paul