0 Replies Latest reply on Dec 1, 2012 2:53 AM by 899771

    Coherence issues on SOA Cluster


      We have a 2-node SOA cluster, with 3 instances of it running on the same machine: one each for DEV, TST, and UAT. Cron jobs run every night that shut down all the cluster instances and then bring them back up. Recently our UAT SOA cluster started having issues. Both nodes come up fine, but soa-infra on one node is not getting deployed and stays down. I tried to start that node manually, but no luck. Below are the errors I am getting on the node where soa-infra is down. It had been working fine until recently. Any idea what might be wrong and where to look? Any help is appreciated.

      In the console all the servers show as up and running, but Enterprise Manager shows soa-infra as down and I only see one node.

      ####<Nov 28, 2012 11:10:23 PM CST> <Notice> <Stdout> <soadev01> <WLS_SOA1> <Logger@2067016530> <<WLS Kernel>>   <1354165823364> <BEA-000000> <bpel.fatal.conection.max.retry is set to 3<Nov 28, 2012 11:10:23 PM CST> <Warning> <Coherence> <BEA-000000> <2012-11-28 23:10:23.360/533.658 Oracle Coherence GE <Warning> (thread=Cluster, member=n/a): This Member(Id=0, Timestamp=2012-11-28 23:09:52.784, Address=, MachineId=3944, Location=site:xxorg.com,machine:web1,process:19968, Role=WeblogicServer) has been attempting to join the cluster using WKA list [web2.xxorg.com/, web1.xxorg.com/] for 30 seconds without success; this could indicate a mis-configured WKA, or it may simply be the result of a busy cluster or active failover.>> 
      ####<Nov 28, 2012 11:10:23 PM CST> <Notice> <Stdout> <soadev01> <WLS_SOA1> <Logger@2067016530> <<WLS Kernel>>   <1354165823365> <BEA-000000> <<Nov 28, 2012 11:10:23 PM CST> <Warning> <Coherence> <BEA-000000> <2012-11-28 23:10:23.363/533.661 Oracle Coherence GE <Warning> (thread=Cluster, member=n/a): Received a discovery message that indicates the presence of an existing cluster that does not respond to join requests; this is usually caused by a network layer failure:
      Message "SeniorMemberHeartbeat"
        FromMember=Member(Id=1, Timestamp=2012-11-28 21:28:09.017, Address=, MachineId=3968, Location=site:xxorg.com,machine:web2,process:3396, Role=WeblogicServer)
          [000]=Broadcast{PacketType=0x0DDF00D2, ToId=0, FromId=1, Direction=Incoming, ReceivedMillis=23:10:23.353, MessageType=17, ServiceId=0, MessagePartCount=1, MessagePartIndex=0, Body=0}
        Service=ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_ANNOUNCE), Id=0, Version=3.6}
        MemberSet=MemberSet(Size=1, BitSetCount=1, ids=[1])
      <Nov 28, 2012 11:14:53 PM CST> <Error> <Deployer> <BEA-149231> <Unable to set the activation state to true for the application 'soa-infra'.
           at weblogic.servlet.internal.WebAppModule.startContexts(WebAppModule.java:1510)
           at weblogic.servlet.internal.WebAppModule.start(WebAppModule.java:482)
           at weblogic.application.internal.flow.ModuleStateDriver$3.next(ModuleStateDriver.java:425)
           at weblogic.application.utils.StateMachineDriver.nextState(StateMachineDriver.java:52)
           at weblogic.application.internal.flow.ModuleStateDriver.start(ModuleStateDriver.java:119)
           Truncated. see log file for complete stacktrace
      Caused By: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(Id=0, Name=Cluster, Type=Cluster
          ActualMemberSet=MemberSet(Size=0, BitSetCount=0
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.onStartupTimeout(Grid.CDB:6)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.start(Service.CDB:28)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.start(Grid.CDB:6)
           at com.tangosol.coherence.component.net.Cluster.onStart(Cluster.CDB:637)
           at com.tangosol.coherence.component.net.Cluster.start(Cluster.CDB:11)
           Truncated. see log file for complete stacktrace
      We have 2 physical machines, with DEV, TST, and UAT (each a 2-node SOA cluster) all running on the same machines. They all go down and come back up every night for daily DB maintenance. As this is the UAT environment, any help in pointing to the right place to debug is appreciated.

      We also use unicast with Well-Known Addressing (WKA), configured in the Console under Server(s) -> Start tab. But it looks like something in Coherence is not working, so the cluster fails to come up.
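
      For reference, WKA arguments in the Server Start tab generally follow the pattern below. This is only a sketch of the usual SOA-on-Coherence unicast settings, not a copy of our actual configuration: the host names are taken from the WKA list in the log above, and the port value 9088 is a placeholder I made up for illustration. Since three environments share the same two machines, I assume each environment would need its own port so that the clusters do not answer each other's join requests, but I have not confirmed that this is the problem here.

      ```
      # Illustrative only: arguments for WLS_SOA1 on web1 (port 9088 is a placeholder)
      -Dtangosol.coherence.wka1=web1.xxorg.com
      -Dtangosol.coherence.wka2=web2.xxorg.com
      -Dtangosol.coherence.localhost=web1.xxorg.com
      -Dtangosol.coherence.localport=9088
      -Dtangosol.coherence.wka1.port=9088
      -Dtangosol.coherence.wka2.port=9088
      ```

      On the second node, tangosol.coherence.localhost would point to web2.xxorg.com instead, with the rest unchanged.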