Latest reply: Mar 29, 2013 3:31 AM by 999930

    Received panic from senior Member

    999930
      We are seeing some strange errors from Coherence on the production boxes, which also seem to affect the response time of the other endpoint operations.

      TTL and log level settings below:
      tangosol.coherence.ttl = 16
      tangosol.coherence.log.level = 9

      Sorry for the long post, but it may be the only way to show you the sequence of log events:

      002 is the oldest member in the cluster of 4, and 004 is the second oldest. Machines 001 and 003 keep trying to join the cluster, but they somehow don't accept 002's authority:

      host=001:
      2013-03-11 16:32:30.012: Coherence WARN - (member=2): The member formerly known as Member(Id=3, Timestamp=2013-03-11 16:32:29.623, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.

      host=002:
      2013-03-11 16:20:41.068: Coherence WARN - (member=3): Received panic from junior member Member(Id=5, Timestamp=2013-03-06 14:33:57.373, Address=x.x.x.193:8088, MachineId=193, Location=site:prd.zzz,machine:004,process:18380) caused by Member(Id=4, Timestamp=2013-03-11 16:15:50.286, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzzr,machine:003,process:15762)

      002, on the other hand, doesn't accept 001 or 003 as master and sends a Panic:
      host=002:
      2013-03-11 16:54:34.228: Coherence WARN - (member=3): An existence of a cluster island with senior Member(Id=1, Timestamp=2013-03-11 16:42:34.35, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762) containing 2 nodes have been detected. Since this Member(Id=3, Timestamp=2013-02-19 08:11:34.196, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher) is the senior of an older cluster island, the panic protocol is being activated to stop the other island's senior and all junior nodes that belong to it.

      Meanwhile, this is the communication between 004 and 001/003:

      004 receives a Panic from 002 (caused by 003):
      host=004:
      2013-03-11 16:20:41.068: Coherence ERROR - (member=5): Received panic from senior Member(Id=3, Timestamp=2013-02-19 08:11:34.196, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher) caused by Member(Id=4, Timestamp=2013-03-11 16:15:50.286, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762)

      003 then receives a Kill message from 004:
      host=003:
      2013-03-11 16:20:41.068: Coherence ERROR - (member=4): Received a Kill message from a valid Member(Id=5, Timestamp=2013-03-06 14:33:57.373, Address=x.x.x.193:8088, MachineId=193, Location=site:prd.zzz,machine:004,process:18380); stopping cluster service.

      001 receives the Kill message from 003:
      host=001:
      2013-03-11 16:20:41.068: Coherence ERROR - (member=6): Received a Kill message from a valid Member(Id=4, Timestamp=2013-03-11 16:15:50.286, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762); stopping cluster service.

      These "conversations" happen frequently, roughly once every 5-15 minutes.
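      Note that the member ids in these lines do not match the machine numbers (Id=4 is on machine 003, Id=5 is on machine 004), and a machine gets a new id each time it rejoins. A minimal sketch of a helper for reading the logs, which pulls the id-to-machine mapping out of a log line; `MemberLogParser` and `membersIn` are hypothetical names and not part of Coherence, and the pattern only assumes the `Member(Id=..., ...machine:NNN,...)` layout seen in the excerpts above:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log-reading helper; it is not a Coherence API.
class MemberLogParser {
    // Captures the member id and the machine name inside one Member(...) group.
    // "[^)]*?" keeps the match inside the parentheses; "\\s*" tolerates "machine: 004".
    private static final Pattern MEMBER =
        Pattern.compile("Member\\(Id=(\\d+),[^)]*?machine:\\s*(\\w+)");

    // Returns member id -> machine name for every Member(...) found in the line,
    // in the order they appear.
    static Map<Integer, String> membersIn(String logLine) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher m = MEMBER.matcher(logLine);
        while (m.find()) {
            result.put(Integer.parseInt(m.group(1)), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "Received panic from senior Member(Id=3, Timestamp=2013-02-19 08:11:34.196, "
            + "Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, "
            + "Role=BusinessLauncher) caused by Member(Id=4, Timestamp=2013-03-11 16:15:50.286, "
            + "Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762)";
        System.out.println(membersIn(line)); // prints {3=002, 4=003}
    }
}
```

      Running this over each host's log makes it easier to see which machine actually sent a given Panic or Kill message.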

      After these kills, 001 and 003 try to rejoin the cluster. Below are excerpts from 001's log after it receives a Kill message from 003 (Member Id=4):

      2013-03-11 16:20:41.068: Coherence ERROR - (member=6): Received a Kill message from a valid Member(Id=4, Timestamp=2013-03-11 16:15:50.286, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762); stopping cluster service.

      2013-03-11 16:20:43.159: Coherence INFO - (member=6): Restarting NamedCache: HealthCheckCache
      2013-03-11 16:20:43.159: Coherence INFO - (member=6): Restarting Service: MemoryOnlyCacheService
      2013-03-11 16:20:43.159: Coherence INFO - (member=n/a): Restarting cluster

      2013-03-11 16:20:43.159: Coherence WARN - (member=n/a): UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.

      In the last 30 days, this UnicastUdpSocket warning appears only once on 002, 4 times on 004, and over 25000 times on both 001 and 003. Could this be the cause of the "misunderstandings"?
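      The numbers in the warning imply a packet size of 1468 bytes (2096304 bytes / 1428 packets), so the 131071 bytes actually granted hold only 89 packets. A minimal sketch, in plain Java without Coherence, for checking what receive buffer the OS grants on each host; `UdpBufferCheck` and its method names are hypothetical. The post does not state the OS; on Linux the cap is usually governed by `net.core.rmem_max`, but that is an assumption here:

```java
import java.io.UncheckedIOException;
import java.net.DatagramSocket;
import java.net.SocketException;

// Hypothetical diagnostic; it does not use or configure Coherence.
class UdpBufferCheck {
    // Packet size implied by a buffer size expressed in both bytes and packets.
    static int bytesPerPacket(int bufferBytes, int packets) {
        return bufferBytes / packets;
    }

    // Number of whole packets that fit into a buffer of the given size.
    static int wholePackets(int bufferBytes, int packetBytes) {
        return bufferBytes / packetBytes;
    }

    // Requests a UDP receive buffer and returns the size the OS reports back.
    static int grantedReceiveBuffer(int requestedBytes) {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setReceiveBufferSize(requestedBytes);
            return socket.getReceiveBufferSize();
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int requested = 2096304; // bytes, from the warning (1428 packets)
        int packetBytes = bytesPerPacket(requested, 1428);
        System.out.println("packet size: " + packetBytes);                         // 1468
        System.out.println("packets in 131071: " + wholePackets(131071, packetBytes)); // 89
        System.out.println("granted: " + grantedReceiveBuffer(requested));
    }
}
```

      If the granted value on 001 and 003 is far below the requested one, raising the OS maximum socket buffer size, as the warning itself suggests, would be the first thing to try.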

      2013-03-11 16:20:45.451: Coherence INFO - (member=n/a): This Member(Id=2, Timestamp=2013-03-11 16:20:45.241, Address=x.x.x.188:8088, MachineId=188, Location=site:prd.zzz,machine:001,process:28262, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2) joined cluster "cluster:0x70FB" with senior Member(Id=3, Timestamp=2013-02-19 08:11:34.196, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)

      2013-03-11 16:22:41.523: Coherence INFO - (member=2): Restarting NamedCache: CardAPICache
      2013-03-11 16:22:41.523: Coherence INFO - (member=2): Restarting Service: NonBackedUpDistributedCache

      2013-03-11 16:32:22.967: Coherence WARN - (member=2): A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 22 seconds, although other packets were acknowledged by the same cluster member (Member(Id=3, Timestamp=2013-02-19 08:11:34.196, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher)) to this member (Member(Id=2, Timestamp=2013-03-11 16:20:45.241, Address=x.x.x.188:8088, MachineId=188, Location=site:prd.zzz,machine:001,process:28262)) as recently as 1 seconds ago. Possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.

      2013-03-11 16:32:22.967: Coherence WARN - (member=2): A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after 22 seconds, although other packets were acknowledged by the same cluster member (Member(Id=5, Timestamp=2013-03-06 14:33:57.373, Address=x.x.x.193:8088, MachineId=193, Location=site:prd.zzz,machine:004,process:18380)) to this member (Member(Id=2, Timestamp=2013-03-11 16:20:45.241, Address=x.x.x.188:8088, MachineId=188, Location=site:prd.zzz,machine:001,process:28262)) as recently as 1 seconds ago. Possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.

      2013-03-11 16:32:29.636: Coherence WARN - (member=2): Assigned 130 orphaned primary partitions
      2013-03-11 16:32:29.639: Coherence INFO - (member=2): Restored from backup 38 partitions

      2013-03-11 16:32:30.012: Coherence WARN - (member=2): The member formerly known as Member(Id=3, Timestamp=2013-03-11 16:32:29.623, Address=x.x.x.189:8088, MachineId=189, Location=site:prd.zzz,machine:002,process:17989, Role=BusinessLauncher) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.

      And it starts again:
      2013-03-11 16:32:41.793: Coherence ERROR - (member=2): Received a Kill message from a valid Member(Id=1, Timestamp=2013-03-11 16:20:45.098, Address=x.x.x.192:8088, MachineId=192, Location=site:prd.zzz,machine:003,process:15762); stopping cluster service.

      I would appreciate any help you could give me.