1 Reply. Latest reply: Jan 28, 2013 12:12 PM by pmackin

Correct assessment of the problem?

660069 Newbie
Hi,

In the early hours of this morning we experienced some cluster trouble. After digging through the logs, I have concluded that there was either an outage on the multicast LAN or the DNS for the multicast address was down.

Here's an extract of the logs. Could you give it a quick glance and confirm or correct my conclusion?


The first sign of the problem was this:
2013-01-18 03:40:32.019/3833087.695 Oracle Coherence GE 3.7.1.5 <Error> (thread=Cluster, member=9): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2012-12-04 18:55:34.204, Address=10.2.50.99:8088, MachineId=56722, Location=site:,machine:LONS00110630,process:1472,member:_echo.LONS00110630.cache.1, Role=CacheNode) that does not contain this Member(Id=9, Timestamp=2012-12-04 18:56:06.278, Address=10.2.50.100:8092, MachineId=62739, Location=site:,machine:LONS00110631,process:744,member:_echo.LONS00110631.proc.1, Role=ProcessorNode); stopping cluster service.
2013-01-18 03:40:32.019/3833087.695 Oracle Coherence GE 3.7.1.5 <Error> (thread=Cluster, member=9): Full Thread Dump

2013-01-18 03:40:32.783/3833088.459 Oracle Coherence GE 3.7.1.5 <D7> (thread=PacketListenerN, member=n/a): Growing MultiplexingWriteBufferPool segment '65536' to 6 generations
2013-01-18 03:40:32.986/3833088.662 Oracle Coherence GE 3.7.1.5 <D7> (thread=PacketListenerN, member=n/a): Growing MultiplexingWriteBufferPool segment '65536' to 7 generations

After the full thread dump, Coherence stops and restarts at this point.
When Coherence restarts, I get the following:
2013-01-18 03:41:14.592/35.195 Oracle Coherence GE 3.7.1.5 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2013-01-18 03:40:44.546, Address=10.2.50.100:8094, MachineId=62739, Location=site:,machine:LONS00110631,process:4984,member:_echo.LONS00110631.proc.1, Role=ProcessorNode) has been attempting to join the cluster at address echo-lon-pri.uk.net.intra/239.192.101.71:19000 with TTL 120 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
2013-01-18 03:41:14.592/35.195 Oracle Coherence GE 3.7.1.5 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates an older member joining:
Message "NewMemberRequestId"
  {
  FromMember=Member(Id=108, Timestamp=2013-01-18 03:40:46.812, Address=10.2.50.100:8092, MachineId=62739, Location=site:,machine:LONS00110631,process:1652,member:_echo.LONS00110631.proc.2, Role=ProcessorNode)
  FromMessageId=0
  Internal=false
  MessagePartCount=0
  PendingCount=0
  MessageType=10
  ToPollId=0
  Poll=null
  Packets
    {
    }
  Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINING), Id=0, Version=3.7.1, OldestMemberId=1}
  ToMemberSet=null
  NotifySent=false
  AttemptCounter=1
  AttemptLimit=151
  ServiceVersion=3.7.1
  }
This continues for several hours, I assume until the network came back up. Everything is fine now, but I need to get to the bottom of what happened and why.
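
To help tell the two hypotheses apart, here is a minimal sketch of a check that could be run on an affected host: it asks whether the cluster address name from the log (echo-lon-pri.uk.net.intra) resolves, and whether the result is a multicast address. The class and method names are made up for illustration, and it only tests name resolution from the host it runs on at the time it runs; it says nothing about whether multicast traffic actually gets through, and it cannot show what DNS was doing during the incident.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Coherence.
class MulticastAddressCheck {

    /**
     * Returns "multicast" if the host name or literal IP resolves to a
     * multicast address, "not multicast" if it resolves to something else,
     * and "unresolved" if name resolution fails.
     */
    static String describe(String host) {
        try {
            // A literal IP is parsed without a DNS lookup; a host name
            // goes through the platform's name resolution.
            InetAddress address = InetAddress.getByName(host);
            return address.isMulticastAddress() ? "multicast" : "not multicast";
        } catch (UnknownHostException e) {
            return "unresolved";
        }
    }

    public static void main(String[] args) {
        // Defaults to the cluster address name seen in the log above;
        // pass a different name or IP as the first argument.
        String host = args.length > 0 ? args[0] : "echo-lon-pri.uk.net.intra";
        System.out.println(host + ": " + describe(host));
    }
}
```

If the name reports "unresolved" while the literal address 239.192.101.71 reports "multicast", that would point towards DNS rather than the multicast LAN, though only for the moment the check is run.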

Any comments would be most welcome.

Thanks
Rich
