1 Reply Latest reply: Jan 28, 2013 2:12 PM by pmackin

    Correct assessment of the problem?

    660069
      Hi,

      During the small hours of this morning we experienced some cluster trouble. After digging through the logs, I have concluded that there was either an outage on the multicast LAN or the DNS for the multicast address was down.
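      One way to tell the two hypotheses apart is to check name resolution separately from multicast delivery. The following is only a minimal Python sketch: the function name `check_cluster_address` and its status strings are made up for illustration, and the hostname and group address would be the ones shown in the join warning in the logs. It says nothing about whether multicast packets actually flow on the LAN; it only rules DNS in or out.

```python
import ipaddress
import socket


def check_cluster_address(hostname, expected_group):
    """Resolve hostname and compare it with the expected multicast group.

    Hypothetical helper, not part of Coherence. Returns (status, detail).
    """
    try:
        resolved = socket.gethostbyname(hostname)
    except OSError as exc:
        # Resolution failed: this would support the "DNS was down" theory.
        return ("dns_failure", str(exc))
    if resolved != expected_group:
        # DNS answered, but with a different address than the cluster expects.
        return ("mismatch", resolved)
    if not ipaddress.ip_address(resolved).is_multicast:
        # The name resolves, but not to a multicast (224.0.0.0/4) address.
        return ("not_multicast", resolved)
    return ("ok", resolved)


if __name__ == "__main__":
    # Hostname and group taken from the join warning in the log extract.
    print(check_cluster_address("echo-lon-pri.uk.net.intra", "239.192.101.71"))
```

      If this reports "ok" while members still cannot join, the multicast path itself (switches, IGMP, TTL) is the more likely suspect.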

      Here's an extract of the logs. Could you give it a quick glance and confirm or correct my conclusion?


      The first sign of the problem was this:
      2013-01-18 03:40:32.019/3833087.695 Oracle Coherence GE 3.7.1.5 <Error> (thread=Cluster, member=9): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2012-12-04 18:55:34.204, Address=10.2.50.99:8088, MachineId=56722, Location=site:,machine:LONS00110630,process:1472,member:_echo.LONS00110630.cache.1, Role=CacheNode) that does not contain this Member(Id=9, Timestamp=2012-12-04 18:56:06.278, Address=10.2.50.100:8092, MachineId=62739, Location=site:,machine:LONS00110631,process:744,member:_echo.LONS00110631.proc.1, Role=ProcessorNode); stopping cluster service.
      2013-01-18 03:40:32.019/3833087.695 Oracle Coherence GE 3.7.1.5 <Error> (thread=Cluster, member=9): Full Thread Dump
      
      2013-01-18 03:40:32.783/3833088.459 Oracle Coherence GE 3.7.1.5 <D7> (thread=PacketListenerN, member=n/a): Growing MultiplexingWriteBufferPool segment '65536' to 6 generations
      2013-01-18 03:40:32.986/3833088.662 Oracle Coherence GE 3.7.1.5 <D7> (thread=PacketListenerN, member=n/a): Growing MultiplexingWriteBufferPool segment '65536' to 7 generations
      
      Coherence stops and restarts at this point, after a full thread dump.
      When Coherence restarts, I get the following:
      2013-01-18 03:41:14.592/35.195 Oracle Coherence GE 3.7.1.5 <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2013-01-18 03:40:44.546, Address=10.2.50.100:8094, MachineId=62739, Location=site:,machine:LONS00110631,process:4984,member:_echo.LONS00110631.proc.1, Role=ProcessorNode) has been attempting to join the cluster at address echo-lon-pri.uk.net.intra/239.192.101.71:19000 with TTL 120 for 30 seconds without success; this could indicate a mis-configured TTL value, or it may simply be the result of a busy cluster or active failover.
      2013-01-18 03:41:14.592/35.195 Oracle Coherence GE 3.7.1.5 <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates an older member joining:
      Message "NewMemberRequestId"
        {
        FromMember=Member(Id=108, Timestamp=2013-01-18 03:40:46.812, Address=10.2.50.100:8092, MachineId=62739, Location=site:,machine:LONS00110631,process:1652,member:_echo.LONS00110631.proc.2, Role=ProcessorNode)
        FromMessageId=0
        Internal=false
        MessagePartCount=0
        PendingCount=0
        MessageType=10
        ToPollId=0
        Poll=null
        Packets
          {
          }
        Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINING), Id=0, Version=3.7.1, OldestMemberId=1}
        ToMemberSet=null
        NotifySent=false
        AttemptCounter=1
        AttemptLimit=151
        ServiceVersion=3.7.1
        }
      This continues for several hours, I assume until the network came back up. Everything is fine now, but I need to get to the bottom of what happened and why.
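      To establish exactly when the join attempts started and stopped, the log lines can be parsed and filtered. The sketch below is a rough Python example, assuming the line layout seen in the extract above (timestamp/uptime, product banner, level, thread, member, message). The names `parse_line` and `join_attempt_window` are hypothetical helpers, not Coherence tooling.

```python
import re
from datetime import datetime

# Layout assumed from the extract:
# 2013-01-18 03:40:32.019/3833087.695 Oracle Coherence GE 3.7.1.5 <Error> (thread=Cluster, member=9): message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})/(?P<uptime>[\d.]+) "
    r"Oracle Coherence \S+ \S+ <(?P<level>\w+)> "
    r"\(thread=(?P<thread>[^,]+), member=(?P<member>[^)]+)\): (?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a Coherence log header line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # continuation lines such as the message dump
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f"),
        "uptime_s": float(match["uptime"]),
        "level": match["level"],
        "thread": match["thread"],
        "member": match["member"],
        "message": match["msg"],
    }


def join_attempt_window(lines):
    """Return (first, last) timestamps of 'attempting to join' warnings, or None."""
    stamps = [
        rec["timestamp"]
        for rec in map(parse_line, lines)
        if rec is not None and "attempting to join the cluster" in rec["message"]
    ]
    return (min(stamps), max(stamps)) if stamps else None
```

      Running `join_attempt_window` over each node's log and comparing the windows with the network team's records should show whether the failed joins line up with a known outage.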

      Any comments would be most welcome.

      Thanks
      Rich