3 Replies Latest reply on Nov 1, 2012 2:11 PM by drowland

    Coherence cluster down issue.

    964690
      Hi All,

      I have a question about servers leaving the cluster. Our production Coherence cluster has 16 servers, one of which acts as the driver for the others. One server (for example, Coherence1Server) was unable to ping another (for example, Coherence2Server), and Coherence2Server then left the cluster. This brought the whole Coherence cluster down, and all saves and searches on the cache slowed while the cluster tried to rebalance the cache. We have not been able to find the root cause, because we do not see any load on the systems during this time. Below is the log snippet from when the server went down. Please let us know whether it points to a problem and how we can debug this.

      Log on Coherence1Server when Coherence2Server left the cluster:

      2012-10-27 12:48:34.317/1744142.197 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=47): Failed to reach address /17.34.25.246 within the IpMonitor timeout. Members [Member(Id=14, Timestamp=2012-10-07 08:18:46.098, Address=17.34.25.246:14240, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29283,member:nwk_gcrmp_ap11_14240, Role=CoherenceServer), Member(Id=13, Timestamp=2012-10-07 08:18:46.089, Address=17.34.25.246:14250, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29284,member:nwk_gcrmp_ap11_14250, Role=CoherenceServer), Member(Id=15, Timestamp=2012-10-07 08:18:46.119, Address=17.34.25.246:14210, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29280,member:nwk-gcrmp-ap11_14210, Role=CoherenceServer), Member(Id=16, Timestamp=2012-10-07 08:18:46.132, Address=17.34.25.246:14220, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29281,member:nwk_gcrmp_ap11_14220, Role=CoherenceServer), Member(Id=12, Timestamp=2012-10-07 08:18:46.074, Address=17.34.25.246:14230, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29282,member:nwk_gcrmp_ap11_14230, Role=CoherenceServer)] are suspect.
      2012-10-27 12:48:34.319/1744142.199 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=47): Timed-out members MemberSet(Size=5, BitSetCount=2
      Member(Id=12, Timestamp=2012-10-07 08:18:46.074, Address=17.34.25.246:14230, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29282,member:nwk_gcrmp_ap11_14230, Role=CoherenceServer)
      Member(Id=13, Timestamp=2012-10-07 08:18:46.089, Address=17.34.25.246:14250, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29284,member:nwk_gcrmp_ap11_14250, Role=CoherenceServer)
      Member(Id=14, Timestamp=2012-10-07 08:18:46.098, Address=17.34.25.246:14240, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29283,member:nwk_gcrmp_ap11_14240, Role=CoherenceServer)
      Member(Id=15, Timestamp=2012-10-07 08:18:46.119, Address=17.34.25.246:14210, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29280,member:nwk-gcrmp-ap11_14210, Role=CoherenceServer)
      Member(Id=16, Timestamp=2012-10-07 08:18:46.132, Address=17.34.25.246:14220, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29281,member:nwk_gcrmp_ap11_14220, Role=CoherenceServer)
      ) will be removed.
      2012-10-27 12:48:34.319/1744142.199 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=47): Member 12 left service Management with senior member 1
      2012-10-27 12:48:34.319/1744142.199 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=47): Member 12 left service DefaultPartitioned with senior member 1
      2012-10-27 12:48:34.319/1744142.199 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=47): Member 12 left service InvocationService with senior member 1
      2012-10-27 12:48:34.320/1744142.200 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=47): Member(Id=12, Timestamp=2012-10-27 12:48:34.32, Address=17.34.25.246:14230, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29282,member:nwk_gcrmp_ap11_14230, Role=CoherenceServer) left Cluster with senior member 1
      Log snippet on the main driver server:
      2012-10-27 12:48:34.320/1744250.954 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): MemberLeft notification for Member(Id=12, Timestamp=2012-10-07 08:18:46.074, Address=17.34.25.246:14230, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29282,member:nwk_gcrmp_ap11_14230, Role=CoherenceServer) received from Member(Id=44, Timestamp=2012-10-07 08:19:35.148, Address=17.34.25.212:14250, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4769,member:simcity_14250, Role=CoherenceServer)
      2012-10-27 12:48:34.320/1744250.954 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 12 left service Management with senior member 1
      2012-10-27 12:48:34.320/1744250.954 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 12 left service DefaultPartitioned with senior member 1
      2012-10-27 12:48:34.321/1744250.955 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 12 left service InvocationService with senior member 1
      2012-10-27 12:48:34.321/1744250.955 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member(Id=12, Timestamp=2012-10-27 12:48:34.321, Address=17.34.25.246:14230, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29282,member:nwk_gcrmp_ap11_14230, Role=CoherenceServer) left Cluster with senior member 1
      2012-10-27 12:48:34.326/1744250.960 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): MemberLeft notification for Member(Id=13, Timestamp=2012-10-07 08:18:46.089, Address=17.34.25.246:14250, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29284,member:nwk_gcrmp_ap11_14250, Role=CoherenceServer) received from Member(Id=44, Timestamp=2012-10-07 08:19:35.148, Address=17.34.25.212:14250, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4769,member:simcity_14250, Role=CoherenceServer)
      2012-10-27 12:48:34.326/1744250.960 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 13 left service Management with senior member 1
      2012-10-27 12:48:34.326/1744250.960 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 13 left service DefaultPartitioned with senior member 1
      2012-10-27 12:48:34.326/1744250.960 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member 13 left service InvocationService with senior member 1
      2012-10-27 12:48:34.326/1744250.960 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=1): Member(Id=13, Timestamp=2012-10-27 12:48:34.326, Address=17.34.25.246:14250, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29284,member:nwk_gcrmp_ap11_14250, Role=CoherenceServer) left Cluster with senior member 1

      Hope I find some answers
        • 1. Re: Coherence cluster down issue.
          drowland
          The log snippet shows a network timeout in the cluster, with one machine (id 10998, address 17.34.25.246) leaving the cluster. We will need to see all the logs, especially from the machines leaving the cluster, to try to determine the root cause.


          Dave
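
One way to check that every suspect member belongs to the same machine is to group the `Member(...)` descriptors in a log line by machine. The following is only a rough sketch: the regular expression is inferred from the descriptors quoted in this thread and is not an official Coherence log grammar, and `members_by_machine` is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Pattern inferred from the Member(...) descriptors quoted in this thread;
# it is an assumption, not an official Coherence log format definition.
MEMBER = re.compile(
    r"Member\(Id=(?P<id>\d+), .*?Address=(?P<ip>[\d.]+):(?P<port>\d+), "
    r"MachineId=(?P<machine_id>\d+), .*?machine:(?P<machine>[^,]+),"
)


def members_by_machine(text):
    """Group member ids by (machine id, IP address, machine name)."""
    groups = defaultdict(set)
    for m in MEMBER.finditer(text):
        key = (int(m.group("machine_id")), m.group("ip"), m.group("machine"))
        groups[key].add(int(m.group("id")))
    return {key: sorted(ids) for key, ids in groups.items()}


# Two descriptors copied from the first warning in the original post.
sample = (
    "Member(Id=14, Timestamp=2012-10-07 08:18:46.098, Address=17.34.25.246:14240, "
    "MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,"
    "process:29283,member:nwk_gcrmp_ap11_14240, Role=CoherenceServer), "
    "Member(Id=13, Timestamp=2012-10-07 08:18:46.089, Address=17.34.25.246:14250, "
    "MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,"
    "process:29284,member:nwk_gcrmp_ap11_14250, Role=CoherenceServer)"
)

if __name__ == "__main__":
    for key, ids in members_by_machine(sample).items():
        print(key, ids)
```

Run against a full warning line, a single key in the result means all suspect members sit on one host, which points at that host or its network path and not at one JVM.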
          • 2. Re: Coherence cluster down issue.
            964690
            Hi Dave,

            Thank you for your reply. Our log files are huge, so we cannot include all of them; below is more of the log from the machine that left the cluster (id 10998, address 17.34.25.246, nwk-gcrmp-ap11.corp.apple.com). We suspect the same thing but are not sure why it happened. It is a little confusing: according to our logs, the other server (17.34.25.212, simcity) was not able to ping the server that went down, yet the server that was unreachable says it could not reach simcity. Hope this helps in finding the root cause.

            2012-10-27 12:48:33.544/1744190.596 Oracle Coherence GE 3.7.0.2 <D6> (thread=PacketPublisher, member=15): Member(Id=28, Timestamp=2012-10-07 08:19:10.413, Address=17.34.25.243:14250, MachineId=10995, Location=site:corp.apple.com,machine:nwk-gcrmp-ap08,process:17724,member:nwk_gcrmp_ap08_14250, Role=CoherenceServer) has failed to respond to 17 packets; declaring this member as paused.
            2012-10-27 12:49:13.467/1744230.519 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=15): Failed to reach address /17.34.25.212 within the IpMonitor timeout. Members [Member(Id=44, Timestamp=2012-10-07 08:19:35.148, Address=17.34.25.212:14250, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4769,member:simcity_14250, Role=CoherenceServer), Member(Id=46, Timestamp=2012-10-07 08:19:35.193, Address=17.34.25.212:14230, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4767,member:simcity_14230, Role=CoherenceServer), Member(Id=49, Timestamp=2012-10-07 08:19:35.374, Address=17.34.25.212:14240, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4768,member:simcity_14240, Role=CoherenceServer), Member(Id=45, Timestamp=2012-10-07 08:19:35.169, Address=17.34.25.212:14260, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4770,member:simcity_14260, Role=CoherenceServer), Member(Id=47, Timestamp=2012-10-07 08:19:35.359, Address=17.34.25.212:14210, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4765,member:simcity_14210, Role=CoherenceServer), Member(Id=48, Timestamp=2012-10-07 08:19:35.367, Address=17.34.25.212:14220, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4766,member:simcity_14220, Role=CoherenceServer)] are suspect.
            2012-10-27 12:49:13.469/1744230.521 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=15): Timed-out members MemberSet(Size=6, BitSetCount=3

            Member(Id=47, Timestamp=2012-10-07 08:19:35.359, Address=17.34.25.212:14210, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4765,member:simcity_14210, Role=CoherenceServer)
            Member(Id=48, Timestamp=2012-10-07 08:19:35.367, Address=17.34.25.212:14220, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4766,member:simcity_14220, Role=CoherenceServer)
            ) will be removed.

            2012-10-27 12:49:13.475/1744230.527 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 44 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.476/1744230.528 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=44, Timestamp=2012-10-27 12:49:13.475, Address=17.34.25.212:14250, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4769,member:simcity_14250, Role=CoherenceServer) left Cluster with senior member 1

            2012-10-27 12:49:13.477/1744230.529 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 45 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.477/1744230.529 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=45, Timestamp=2012-10-27 12:49:13.477, Address=17.34.25.212:14260, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4770,member:simcity_14260, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.478/1744230.530 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 46 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.478/1744230.530 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=46, Timestamp=2012-10-27 12:49:13.478, Address=17.34.25.212:14230, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4767,member:simcity_14230, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.479/1744230.531 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 47 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.480/1744230.532 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=47, Timestamp=2012-10-27 12:49:13.48, Address=17.34.25.212:14210, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4765,member:simcity_14210, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.482/1744230.534 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 48 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.482/1744230.534 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=48, Timestamp=2012-10-27 12:49:13.482, Address=17.34.25.212:14220, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4766,member:simcity_14220, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.482/1744230.534 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 49 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.482/1744230.534 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=49, Timestamp=2012-10-27 12:49:13.482, Address=17.34.25.212:14240, MachineId=10964, Location=site:corp.apple.com,machine:simcity,process:4768,member:simcity_14240, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.483/1744230.535 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=15): Failed to reach address /17.34.25.244 within the IpMonitor timeout. Members [Member(Id=22, Timestamp=2012-10-07 08:19:02.061, Address=17.34.25.244:14240, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17305,member:nwk_gcrmp_ap09_14240, Role=CoherenceServer), Member(Id=25, Timestamp=2012-10-07 08:19:02.082, Address=17.34.25.244:14250, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17306,member:nwk_gcrmp_ap09_14250, Role=CoherenceServer), Member(Id=24, Timestamp=2012-10-07 08:19:02.071, Address=17.34.25.244:14220, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17303,member:nwk_gcrmp_ap09_14220, Role=CoherenceServer), Member(Id=26, Timestamp=2012-10-07 08:19:02.083, Address=17.34.25.244:14230, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17304,member:nwk_gcrmp_ap09_14230, Role=CoherenceServer), Member(Id=23, Timestamp=2012-10-07 08:19:02.066, Address=17.34.25.244:14210, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17302,member:nwk-gcrmp-ap09_14210, Role=CoherenceServer)] are suspect.
            2012-10-27 12:49:13.483/1744230.535 Oracle Coherence GE 3.7.0.2 <Warning> (thread=Cluster, member=15): Timed-out members MemberSet(Size=5, BitSetCount=3
            Member(Id=22, Timestamp=2012-10-07 08:19:02.061, Address=17.34.25.244:14240, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17305,member:nwk_gcrmp_ap09_14240, Role=CoherenceServer)
            Member(Id=23, Timestamp=2012-10-07 08:19:02.066, Address=17.34.25.244:14210, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17302,member:nwk-gcrmp-ap09_14210, Role=CoherenceServer)

            Member(Id=24, Timestamp=2012-10-07 08:19:02.071, Address=17.34.25.244:14220, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17303,member:nwk_gcrmp_ap09_14220, Role=CoherenceServer)
            Member(Id=25, Timestamp=2012-10-07 08:19:02.082, Address=17.34.25.244:14250, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17306,member:nwk_gcrmp_ap09_14250, Role=CoherenceServer)
            Member(Id=26, Timestamp=2012-10-07 08:19:02.083, Address=17.34.25.244:14230, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17304,member:nwk_gcrmp_ap09_14230, Role=CoherenceServer)
            ) will be removed.
            2012-10-27 12:49:13.483/1744230.535 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 22 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.484/1744230.536 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=22, Timestamp=2012-10-27 12:49:13.484, Address=17.34.25.244:14240, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17305,member:nwk_gcrmp_ap09_14240, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.484/1744230.536 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 23 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.484/1744230.536 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=23, Timestamp=2012-10-27 12:49:13.484, Address=17.34.25.244:14210, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17302,member:nwk-gcrmp-ap09_14210, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.485/1744230.537 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 24 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.485/1744230.537 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=24, Timestamp=2012-10-27 12:49:13.485, Address=17.34.25.244:14220, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17303,member:nwk_gcrmp_ap09_14220, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.485/1744230.537 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 25 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.486/1744230.538 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=25, Timestamp=2012-10-27 12:49:13.485, Address=17.34.25.244:14250, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17306,member:nwk_gcrmp_ap09_14250, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.486/1744230.538 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member 26 left service DefaultPartitioned with senior member 1
            2012-10-27 12:49:13.486/1744230.538 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Member(Id=26, Timestamp=2012-10-27 12:49:13.486, Address=17.34.25.244:14230, MachineId=10996, Location=site:corp.apple.com,machine:nwk-gcrmp-ap09,process:17304,member:nwk_gcrmp_ap09_14230, Role=CoherenceServer) left Cluster with senior member 1
            2012-10-27 12:49:13.487/1744230.539 Oracle Coherence GE 3.7.0.2 <Error> (thread=Cluster, member=15): Received cluster heartbeat from the senior Member(Id=1, Timestamp=2012-10-07 08:17:46.009, Address=17.34.25.210:14210, MachineId=10962, Location=site:corp.apple.com,machine:wallville,process:1259,member:wallville_14210, Role=CoherenceServer) that does not contain this Member(Id=15, Timestamp=2012-10-07 08:18:46.119, Address=17.34.25.246:14210, MachineId=10998, Location=site:corp.apple.com,machine:nwk-gcrmp-ap11,process:29280,member:nwk-gcrmp-ap11_14210, Role=CoherenceServer); stopping cluster service.
            2012-10-27 12:49:13.728/1744230.780 Oracle Coherence GE 3.7.0.2 <Error> (thread=Cluster, member=15): Full Thread Dump

            Thread[Reference Handler,10,system]
            java.lang.ref.Reference.waitForActivatedQueue(Native Method)
            java.lang.ref.Reference.access$100(Reference.java:11)
            java.lang.ref.Reference$ReferenceHandler.run(Reference.java:82)

            Thread[Main Thread,5,main]
            java.lang.Object.wait(Native Method)
            com.tangosol.net.DefaultCacheServer.monitorServices(DefaultCacheServer.java:270)
            com.tangosol.net.DefaultCacheServer.startAndMonitor(DefaultCacheServer.java:56)
            com.tangosol.net.DefaultCacheServer.main(DefaultCacheServer.java:197)

            Thread[VM JFR Buffer Thread,5,main]

            Thread[PacketListener1,8,Cluster]
            java.net.PlainDatagramSocketImpl.receive0(Native Method)
            java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
            java.net.DatagramSocket.receive(DatagramSocket.java:712)
            com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:22)
            com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
            com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
            java.lang.Thread.run(Thread.java:619)

            Thread[(Signal Handler),5,main]

            Thread[DistributedCache:DefaultPartitioned,5,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onWait(Service.CDB:4)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:3)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[IpMonitor,6,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.IpMonitor.onWait(IpMonitor.CDB:4)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[RMI Scheduler(0),5,system]
            sun.misc.Unsafe.park(Native Method)
            java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
            java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
            java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:583)
            java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:576)
            java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
            java.lang.Thread.run(Thread.java:619)

            Thread[(Code Optimization Thread 1),5,main]

            Thread[PacketPublisher,6,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketPublisher.onWait(PacketPublisher.CDB:2)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[Invocation:InvocationService,5,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onWait(Service.CDB:4)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:3)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[Finalizer,8,system]
            jrockit.memory.Finalizer.waitForFinalizees(Native Method)
            jrockit.memory.Finalizer.access$700(Finalizer.java:12)
            jrockit.memory.Finalizer$4.run(Finalizer.java:189)
            java.lang.Thread.run(Thread.java:619)

            Thread[JFR request timer,5,main]
            java.lang.Object.wait(Native Method)
            java.lang.Object.wait(Object.java:485)
            java.util.TimerThread.mainLoop(Timer.java:483)
            java.util.TimerThread.run(Timer.java:462)

            Thread[(Sensor Event Thread),5,main]

            Thread[RMI TCP Accept-0,5,system]
            java.net.PlainSocketImpl.socketAccept(Native Method)
            java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
            java.net.ServerSocket.implAccept(ServerSocket.java:453)
            java.net.ServerSocket.accept(ServerSocket.java:421)
            oracle.jrockit.management.server.LocalJMXConnector$LocalRMIServerSocketFactory$1.accept(LocalJMXConnector.java:96)
            sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
            sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
            java.lang.Thread.run(Thread.java:619)

            Thread[Invocation:Management:EventDispatcher,5,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[PacketSpeaker,8,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.queue.ConcurrentQueue.waitForEntry(ConcurrentQueue.CDB:16)
            com.tangosol.coherence.component.util.queue.ConcurrentQueue.remove(ConcurrentQueue.CDB:7)
            com.tangosol.coherence.component.util.Queue.remove(Queue.CDB:1)
            com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketSpeaker.onNotify(PacketSpeaker.CDB:21)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
            java.lang.Thread.run(Thread.java:619)

            Thread[PacketReceiver,7,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketReceiver.onWait(PacketReceiver.CDB:2)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[Cluster:EventDispatcher,5,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Service$EventDispatcher.onWait(Service.CDB:7)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[RMI TCP Accept-14214,5,system]
            java.net.PlainSocketImpl.socketAccept(Native Method)
            java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
            java.net.ServerSocket.implAccept(ServerSocket.java:453)
            java.net.ServerSocket.accept(ServerSocket.java:421)
            sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
            sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
            java.lang.Thread.run(Thread.java:619)

            Thread[Invocation:Management,5,Cluster]
            java.lang.Object.wait(Native Method)
            com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onWait(Service.CDB:4)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onWait(Grid.CDB:3)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:39)
            java.lang.Thread.run(Thread.java:619)

            Thread[PacketListener1P,8,Cluster]
            java.net.PlainDatagramSocketImpl.receive0(Native Method)
            java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:136)
            java.net.DatagramSocket.receive(DatagramSocket.java:712)
            com.tangosol.coherence.component.net.socket.UdpSocket.receive(UdpSocket.CDB:22)
            com.tangosol.coherence.component.net.UdpPacket.receive(UdpPacket.CDB:1)
            com.tangosol.coherence.component.util.daemon.queueProcessor.packetProcessor.PacketListener.onNotify(PacketListener.CDB:20)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
            java.lang.Thread.run(Thread.java:619)

            Thread[(VM Periodic Task),10,main]

            Thread[Logger@9222990 3.7.0.2,3,main]
            java.io.FileOutputStream.writeBytes(FileOutputStream.java)
            java.io.FileOutputStream.write(FileOutputStream.java:260)
            sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:202)
            sun.nio.cs.StreamEncoder.implFlushBuffer(StreamEncoder.java:272)
            sun.nio.cs.StreamEncoder.implFlush(StreamEncoder.java:276)
            sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:122)
            java.io.OutputStreamWriter.flush(OutputStreamWriter.java:212)
            org.apache.log4j.helpers.QuietWriter.flush(QuietWriter.java:57)
            org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:315)
            org.apache.log4j.RollingFileAppender.subAppend(RollingFileAppender.java:236)
            org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
            org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
            org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
            org.apache.log4j.Category.callAppenders(Category.java:203)
            org.apache.log4j.Category.forcedLog(Category.java:388)
            org.apache.log4j.Category.log(Category.java:835)
            com.tangosol.coherence.component.util.logOutput.Log4j.log(Log4j.CDB:3)
            com.tangosol.coherence.component.util.LogOutput.log(LogOutput.CDB:1)
            com.tangosol.coherence.component.util.daemon.queueProcessor.Logger.onNotify(Logger.CDB:99)
            com.tangosol.coherence.component.application.console.Coherence$Logger.onNotify(Coherence.CDB:4)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
            java.lang.Thread.run(Thread.java:619)

            Thread[(Code Generation Thread 1),5,main]

            Thread[(OC Main Thread),5,main]

            ThreadCluster
            java.lang.Thread.dumpThreads(Native Method)
            java.lang.Thread.getAllStackTraces(Thread.java:1487)
            com.tangosol.net.GuardSupport.logStackTraces(GuardSupport.java:810)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService$SeniorMemberHeartbeat.onReceived(ClusterService.CDB:33)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onMessage(Grid.CDB:33)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onNotify(Grid.CDB:33)
            com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.ClusterService.onNotify(ClusterService.CDB:3)
            com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:42)
            java.lang.Thread.run(Thread.java:619)

            2012-10-27 12:49:13.735/1744230.787 Oracle Coherence GE 3.7.0.2 <D5> (thread=Cluster, member=15): Service Cluster left the cluster
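
Since both sides report the other as unreachable, it can help to put the IpMonitor warnings from the different logs on one timeline and see which side timed out first. The sketch below assumes the header layout seen in the lines quoted in this thread (date, then `/` and an uptime value, then product, level, thread and member). The helper name `ip_monitor_warnings` is hypothetical, and the sample lines are cut off after "IpMonitor timeout." to leave out the long member lists.

```python
import re
from datetime import datetime

# Header layout inferred from the log lines in this thread (an assumption).
HEADER = re.compile(
    r"^\s*(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{1,6})/(?P<uptime>\d+\.\d+) "
    r".*?<(?P<level>\w+)> \(thread=(?P<thread>[^,]+), member=(?P<member>\d+)\): (?P<text>.*)$"
)
UNREACHABLE = re.compile(
    r"Failed to reach address /(?P<ip>[\d.]+) within the IpMonitor timeout"
)


def ip_monitor_warnings(lines):
    """Return (timestamp, reporting member id, unreachable IP), sorted by time."""
    events = []
    for line in lines:
        head = HEADER.match(line)
        if head is None:
            continue  # continuation lines (member lists, stack frames) carry no header
        hit = UNREACHABLE.search(head.group("text"))
        if hit:
            when = datetime.strptime(head.group("stamp"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((when, int(head.group("member")), hit.group("ip")))
    return sorted(events)


# Warning lines from both logs in this thread, truncated after "IpMonitor timeout."
sample_lines = [
    "2012-10-27 12:49:13.467/1744230.519 Oracle Coherence GE 3.7.0.2 <Warning> "
    "(thread=Cluster, member=15): Failed to reach address /17.34.25.212 within the IpMonitor timeout.",
    "2012-10-27 12:48:34.317/1744142.197 Oracle Coherence GE 3.7.0.2 <Warning> "
    "(thread=Cluster, member=47): Failed to reach address /17.34.25.246 within the IpMonitor timeout.",
    "2012-10-27 12:49:13.483/1744230.535 Oracle Coherence GE 3.7.0.2 <Warning> "
    "(thread=Cluster, member=15): Failed to reach address /17.34.25.244 within the IpMonitor timeout.",
]

if __name__ == "__main__":
    events = ip_monitor_warnings(sample_lines)
    first = events[0][0]
    for when, member, ip in events:
        offset = (when - first).total_seconds()
        print(f"+{offset:7.3f}s member {member} could not reach {ip}")
```

On these three lines, member 47 (on 17.34.25.212) reports 17.34.25.246 as unreachable first, and member 15 (on 17.34.25.246) reports 17.34.25.212 and 17.34.25.244 as unreachable roughly 39 seconds later. The ordering alone does not establish the root cause, but it narrows down which host's logs and network path to examine first.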
            • 3. Re: Coherence cluster down issue.
              drowland
              I think that when we get into a specific problem and need to look through a large number of log files, it is best to open an SR (service request).

              Can you open up an SR?

              This will allow us to maintain the context of the problem over time and allow all of the log files to be uploaded and reviewed.

              Thanks,
              Dave