This discussion is archived
1 Reply Latest reply: Jul 8, 2010 7:05 AM by Brian Oliver RSS

Error while starting service "DistributedCacheForMessages" - timeout

704908 Newbie
Currently Being Moderated
Hi Guys,

We get occasional problems starting the cluster:
2010-04-14 18:15:46,619 ERROR lccohd2-3 [2.5.806] Log4j [Logger@9222213 3.5.3/465p2] - 35.507 <Error> (thread=main, member=6): Error while starting service "DistributedCacheForMessages": com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=4, Name=DistributedCacheForMessages, Type=DistributedCache
  MemberSet=ServiceMemberSet( 
    OldestMember=Member(Id=1, Timestamp=2010-04-14 18:15:01.801, Address=192.168.15.101:8088, MachineId=26469, Location=site:london,machine:lccohd1,process:10672,member:lccohd1-1, Role=)
    ActualMemberSet=MemberSet(Size=11, BitSetCount=2
      Member(Id=1, Timestamp=2010-04-14 18:15:01.801, Address=192.168.15.101:8088, MachineId=26469, Location=site:london,machine:lccohd1,process:10672,member:lccohd1-1, Role=)
      Member(Id=2, Timestamp=2010-04-14 18:15:05.19, Address=192.168.15.102:8088, MachineId=26470, Location=site:london,machine:lccohd2,process:17631,member:lccohd2-1, Role=)
      Member(Id=3, Timestamp=2010-04-14 18:15:12.834, Address=192.168.15.102:8089, MachineId=26470, Location=site:london,machine:lccohd2,process:17767,member:lccohd2-2, Role=)
      Member(Id=4, Timestamp=2010-04-14 18:15:12.986, Address=192.168.15.101:8089, MachineId=26469, Location=site:london,machine:lccohd1,process:10808,member:lccohd1-2, Role=)
      Member(Id=6, Timestamp=2010-04-14 18:15:14.044, Address=192.168.15.102:8090, MachineId=26470, Location=site:london,machine:lccohd2,process:17804,member:lccohd2-3, Role=)
      Member(Id=7, Timestamp=2010-04-14 18:15:14.813, Address=192.168.15.101:8091, MachineId=26469, Location=site:london,machine:lccohd1,process:10891,member:lccohd1-4, Role=)
      Member(Id=8, Timestamp=2010-04-14 18:15:15.009, Address=192.168.15.102:8091, MachineId=26470, Location=site:london,machine:lccohd2,process:17846,member:lccohd2-4, Role=)
      Member(Id=9, Timestamp=2010-04-14 18:15:15.481, Address=192.168.15.101:8092, MachineId=26469, Location=site:london,machine:lccohd1,process:10938,member:lccohd1-5, Role=)
      Member(Id=10, Timestamp=2010-04-14 18:15:16.275, Address=192.168.15.102:8092, MachineId=26470, Location=site:london,machine:lccohd2,process:17902,member:lccohd2-5, Role=)
      Member(Id=11, Timestamp=2010-04-14 18:15:16.497, Address=192.168.15.101:8093, MachineId=26469, Location=site:london,machine:lccohd1,process:10998,member:lccohd1-6, Role=)
      Member(Id=12, Timestamp=2010-04-14 18:15:17.323, Address=192.168.15.102:8093, MachineId=26470, Location=site:london,machine:lccohd2,process:17966,member:lccohd2-6, Role=)
      )

    MemberId/ServiceVersion/ServiceJoined/ServiceLeaving
      1/3.5/Wed Apr 14 18:15:05 BST 2010/false,
      2/3.5/Wed Apr 14 18:15:06 BST 2010/false,
      3/3.5/Wed Apr 14 18:15:14 BST 2010/false,
      4/3.5/Wed Apr 14 18:15:14 BST 2010/false,
      6/3.5/Wed Apr 14 18:15:16 BST 2010/false,
      7/3.5/Wed Apr 14 18:15:17 BST 2010/false,
      8/3.5/Wed Apr 14 18:15:17 BST 2010/false,
      9/3.5/Wed Apr 14 18:15:18 BST 2010/false,
      10/3.5/Wed Apr 14 18:15:18 BST 2010/false,
      11/3.5/Wed Apr 14 18:15:19 BST 2010/false,
      12/3.5/Wed Apr 14 18:15:20 BST 2010/false
    )
)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:38)
        at com.tangosol.coherence.component.util.SafeService.startService(SafeService.CDB:28)
        at com.tangosol.coherence.component.util.safeService.SafeCacheService.startService(SafeCacheService.CDB:5)
        at com.tangosol.coherence.component.util.SafeService.ensureRunningService(SafeService.CDB:27)
        at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:14)
        at com.tangosol.net.DefaultConfigurableCacheFactory.ensureService(DefaultConfigurableCacheFactory.java:1008)
        at com.oracle.coherence.environment.extensible.ExtensibleEnvironment.ensureService(ExtensibleEnvironment.java:374)
        at com.tangosol.net.DefaultCacheServer.start(DefaultCacheServer.java:139)
        at com.core.AbstractQTGServer.start(AbstractQTGServer.java:108)
        at com.EmsServer.main(EmsServer.java:42)



2010-04-14 18:15:46,621 ERROR lccohd2-3 [2.5.806] Log4j [Logger@9222213 3.5.3/465p2] - 35.508 <Error> (thread=DistributedCache:DistributedCacheForMessages, member=6): validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
  {
  PollId=1, active
  InitTimeMillis=1271265316553
  Service=DistributedCacheForMessages (4)
  RespondedMemberSet=[1,2,3,5]
  LeftMemberSet=[]
  RemainingMemberSet=[4]
  }
As they are so difficult to catch, I've followed the a advice on metalink and implemented an auto-caputure of the state of the non-responsive node. This showed the following:
"DistributedCache:DistributedCacheForMessages" daemon prio=10 tid=0x000000004ee62400 nid=0x2b78 waiting for monitor entry [0x00000000454f4000..0x00000000454f5c90]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.tangosol.coherence.component.util.SafeService.start(SafeService.CDB:4)
        - waiting to lock <0x00002aaabd0049a8> (a com.tangosol.coherence.component.util.SafeCluster)
        at com.tangosol.net.cache.ContinuousQueryCache.getCache(ContinuousQueryCache.java:182)
        at com.tangosol.net.cache.ContinuousQueryCache.configureSynchronization(ContinuousQueryCache.java:1157)
        - locked <0x00002aaab28da500> (a com.tangosol.net.cache.ContinuousQueryCache)
        at com.tangosol.net.cache.ContinuousQueryCache.ensureSynchronized(ContinuousQueryCache.java:1360)
        - locked <0x00002aaab28da500> (a com.tangosol.net.cache.ContinuousQueryCache)
        at com.tangosol.net.cache.ContinuousQueryCache.<init>(ContinuousQueryCache.java:159)
        at com.tangosol.net.cache.ContinuousQueryCache.<init>(ContinuousQueryCache.java:67)
        at com.oracle.coherence.patterns.messaging.MessageEventManager.<init>(MessageEventManager.java:85)
        at com.oracle.coherence.patterns.messaging.MessageEventManager.<clinit>(MessageEventManager.java:70)
        at com.oracle.coherence.patterns.messaging.MessagePartitionListener.onEvent(MessagePartitionListener.java:76)
        at com.oracle.coherence.patterns.messaging.MessagePartitionListener.onPartitionEvent(MessagePartitionListener.java:56)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.firePartitionEvent(DistributedCache.CDB:9)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.firePartitionEvent(DistributedCache.CDB:13)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onTransferRequest(DistributedCache.CDB:157)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache$TransferRequest.onReceived(DistributedCache.CDB:1)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:9)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:136)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.DistributedCache.onNotify(DistributedCache.CDB:3)
        at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
        at java.lang.Thread.run(Thread.java:619)
Which I thought might be interesting for you guys?

I have full cluster logs available as well as full stack dump for the unresponsive node.

Regards, Paul

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points