2 Replies. Latest reply: Apr 18, 2013 8:26 AM by ggarciao.com

    CacheFactory.shutdown() call seems to hang

    ggarciao.com
      Hello everyone,

      Here is my problem ... We are implementing a continuous-delivery environment to redeploy a Coherence cluster every night/week. The Coherence cluster has four nodes:
      * Three (3) Tomcat web applications
      * One (1) Tomcat web application with no local storage (-Dtangosol.coherence.distributed.localstorage=false)

      To deploy our cluster, we use the Tomcat Manager to REDEPLOY each web application without killing the JVM that runs each Tomcat instance. To acquire and release the cluster resources correctly, we created a Spring bean called "CacheManager" that has init() and destroy() methods.
      public void init() {
          CacheFactory.ensureCluster();
          LOGGER.info("Cluster ensured!");

          // Other stuff ...
      }

      public void destroy() {
          try {
              LOGGER.info("Shutting down the Cache Factory ...");
              CacheFactory.shutdown();
              LOGGER.info("Cache Factory shutdown!");
          } catch (Exception e) {
              LOGGER.warn("The cache manager was not destroyed correctly. "
                      + "This might cause problems when redeploying the application. Reason: "
                      + e.getMessage(), e);
          }
      }
      Of course, init() is called when the Spring context is created, and destroy() is called when the context is destroyed.
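
      For reference, the lifecycle wiring looks roughly like this (the bean id and class name here are illustrative, not necessarily our exact configuration):
      <!-- Illustrative Spring wiring: init() runs on context startup,
           destroy() on context shutdown -->
      <bean id="cacheManager"
            class="com.example.CoherenceCacheManager"
            init-method="init"
            destroy-method="destroy" />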

      The problem
      When we redeploy a single member of the cluster (and not the storage-disabled one), we see that CacheFactory.shutdown() is called, but it hangs the redeployment (we need to kill the JVM if we want to reuse the Tomcat instance).
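
      As a stopgap we have considered bounding the wait, since the redeploy thread blocks indefinitely inside shutdown(). The sketch below is only a generic timed-wrapper idea, not a Coherence API; in destroy() the action passed in would be CacheFactory::shutdown. It contains the hang (the redeploy thread gets control back) but does not fix the underlying stall:

      ```java
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.concurrent.TimeUnit;
      import java.util.concurrent.TimeoutException;

      public class TimedShutdown {
          /**
           * Run a potentially blocking shutdown action on a daemon thread and
           * wait at most the given time for it to finish. Returns true if the
           * action completed in time, false if it timed out or threw.
           */
          public static boolean shutdownWithTimeout(Runnable action, long timeout, TimeUnit unit)
                  throws InterruptedException {
              ExecutorService es = Executors.newSingleThreadExecutor(r -> {
                  Thread t = new Thread(r, "coherence-shutdown");
                  t.setDaemon(true); // do not keep the webapp's JVM alive
                  return t;
              });
              Future<?> f = es.submit(action);
              try {
                  f.get(timeout, unit);
                  return true;               // shutdown completed in time
              } catch (ExecutionException e) {
                  return false;              // the action threw an exception
              } catch (TimeoutException e) {
                  f.cancel(true);            // best effort; the hang is merely contained
                  return false;
              } finally {
                  es.shutdownNow();
              }
          }
      }
      ```

      In destroy() this would be invoked as shutdownWithTimeout(CacheFactory::shutdown, 30, TimeUnit.SECONDS) (timeout value illustrative). Of course the daemon thread is still stuck in the stalled partition transfer, so this only avoids blocking the Tomcat Manager thread.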

      Here are the logs (from the member being redeployed):
      2013-04-04 17:33:37,856 [http-bio-8080-exec-3] INFO - Shutting down the Cache Factory ...
      2013-04-04 17:33:37,980 [Logger@9253293 3.7.1.0] INFO - 2013-04-04 17:33:37.980/5735.764 Oracle Coherence GE 3.7.1.0 <Info> (thread=DistributedCache:DistributedCacheForSequenceGenerators, member=n/a): Remains to transfer before shutting down: 86 primary partitions, 86 backup partitions
      2013-04-04 17:33:37,997 [Logger@9253293 3.7.1.0] INFO - 2013-04-04 17:33:37.997/5735.781 Oracle Coherence GE 3.7.1.0 <Info> (thread=DistributedCache:DistributedCacheForSequenceGenerators, member=n/a): Remains to transfer before shutting down: 43 primary partitions, 0 backup partitions
      2013-04-04 17:33:38,145 [Logger@9253293 3.7.1.0] INFO - 2013-04-04 17:33:38.145/5735.929 Oracle Coherence GE 3.7.1.0 <Info> (thread=DistributedCache:DistributedCacheForLiveObjects, member=n/a): Remains to transfer before shutting down: 86 primary partitions, 86 backup partitions
      2013-04-04 17:33:38,195 [Logger@9253293 3.7.1.0] INFO - 2013-04-04 17:33:38.195/5735.979 Oracle Coherence GE 3.7.1.0 <Info> (thread=DistributedCache:DistributedCacheForLiveObjects, member=n/a): Remains to transfer before shutting down: 43 primary partitions, 0 backup partitions
      2013-04-04 17:33:38,325 [Logger@9253293 3.7.1.0] INFO - 2013-04-04 17:33:38.324/5736.108 Oracle Coherence GE 3.7.1.0 <Info> (thread=DistributedCache:DistributedCacheForCommandPattern, member=n/a): Remains to transfer before shutting down: 86 primary partitions, 86 backup partitions
      And after a while (several seconds to minutes), this appears in the logs too:
      2013-04-04 17:38:12,852 [Logger@9253293 3.7.1.0] ERROR - 2013-04-04 17:38:12.852/6010.636 Oracle Coherence GE 3.7.1.0 <Error> (thread=Cluster, member=n/a): Detected soft timeout) of {WrapperGuardable Guard{Daemon=DistributedCache:DistributedCacheForCommandPattern} Service=PartitionedCache{Name=DistributedCacheForCommandPattern, State=(SERVICE_STOPPING), LocalStorage=disabled}}
      2013-04-04 17:38:12,860 [Logger@9253293 3.7.1.0] ERROR - 2013-04-04 17:38:12.860/6010.644 Oracle Coherence GE 3.7.1.0 <Error> (thread=Recovery Thread, member=n/a): Full Thread Dump
      
      ... (The thread dump here)
      And then, after a while:
      2013-04-04 17:38:12,861 [Logger@9253293 3.7.1.0] WARN - 2013-04-04 17:38:12.860/6010.644 Oracle Coherence GE 3.7.1.0 <Warning> (thread=Recovery Thread, member=n/a): Attempting recovery of Guard{Daemon=DistributedCache:DistributedCacheForCommandPattern}
      2013-04-04 17:38:43,353 [Logger@9253293 3.7.1.0] ERROR - 2013-04-04 17:38:43.353/6041.137 Oracle Coherence GE 3.7.1.0 <Error> (thread=Cluster, member=n/a): Detected hard timeout) of {WrapperGuardable Guard{Daemon=DistributedCache:DistributedCacheForCommandPattern} Service=PartitionedCache{Name=DistributedCacheForCommandPattern, State=(SERVICE_STOPPING), LocalStorage=disabled}}
      2013-04-04 17:38:43,359 [Logger@9253293 3.7.1.0] ERROR - 2013-04-04 17:38:43.358/6041.142 Oracle Coherence GE 3.7.1.0 <Error> (thread=Termination Thread, member=n/a): Full Thread Dump
      
      ... (The thread dump here)
      And later:
      2013-04-04 17:43:49,637 [Logger@9253293 3.7.1.0] WARN - 2013-04-04 17:43:49.637/6347.421 Oracle Coherence GE 3.7.1.0 <Warning> (thread=Termination Thread, member=n/a): Terminating Guard{Daemon=Cluster}
      Coherence <Error>: Halting this cluster node due to unrecoverable service failure
      Exception in thread "Termination Thread" (Wrapped: error finding method 'halt' for cluster object) java.lang.reflect.InvocationTargetException
           at com.tangosol.util.Base.ensureRuntimeException(Base.java:288)
           at com.tangosol.internal.net.cluster.DefaultServiceFailurePolicy.onServiceFailed(DefaultServiceFailurePolicy.java:124)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$Guard.terminate(Grid.CDB:17)
           at com.tangosol.internal.net.cluster.DefaultServiceFailurePolicy.onGuardableTerminate(DefaultServiceFailurePolicy.java:89)
           at com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid$WrapperGuardable.terminate(Grid.CDB:1)
           at com.tangosol.net.GuardSupport$Context$2.run(GuardSupport.java:677)
           at java.lang.Thread.run(Thread.java:619)
      Caused by: java.lang.reflect.InvocationTargetException
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           at java.lang.reflect.Method.invoke(Method.java:597)
           at com.tangosol.util.ClassHelper.invoke(ClassHelper.java:814)
           at com.tangosol.util.ClassHelper.invoke(ClassHelper.java:769)
           at com.tangosol.internal.net.cluster.DefaultServiceFailurePolicy.onServiceFailed(DefaultServiceFailurePolicy.java:120)
           ... 5 more
      Caused by: java.nio.channels.ClosedSelectorException
           at sun.nio.ch.SelectorImpl.keys(SelectorImpl.java:51)
           at com.tangosol.coherence.component.net.TcpRing.close(TcpRing.CDB:11)
           at com.tangosol.coherence.component.net.Cluster.halt(Cluster.CDB:28)
           ... 12 more
      I don't see anything special in the thread dump ...

      Here is the thread of my application that executes the shutdown:
      Thread[http-bio-8080-exec-3,5,main]
           java.lang.Object.wait(Native Method)
           java.lang.Object.wait(Object.java:485)
           com.tangosol.coherence.component.util.daemon.queueProcessor.service.Grid.shutdown(Grid.CDB:20)
           com.tangosol.coherence.component.util.daemon.queueProcessor.service.grid.PartitionedService.shutdown(PartitionedService.CDB:24)
           com.tangosol.coherence.component.net.Cluster.shutdown(Cluster.CDB:30)
           com.tangosol.coherence.component.util.SafeCluster.shutdown(SafeCluster.CDB:12)
           com.tangosol.coherence.component.application.console.Coherence.shutdown(Coherence.CDB:10)
           sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           java.lang.reflect.Method.invoke(Method.java:597)
           com.tangosol.net.CacheFactory.shutdown(CacheFactory.java:442)
           com.exmaple.CoherenceCacheManager.destroy(CoherenceCacheManager.java:98)
           sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
           sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
           java.lang.reflect.Method.invoke(Method.java:597)
           org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:300)
           org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:226)
           org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:498)
           org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:474)
           org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:442)
           org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1066)
           org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1040)
           org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:988)
           org.springframework.web.context.ContextLoader.closeWebApplicationContext(ContextLoader.java:548)
           org.springframework.web.context.ContextLoaderListener.contextDestroyed(ContextLoaderListener.java:142)
           org.apache.catalina.core.StandardContext.listenerStop(StandardContext.java:4819)
           org.apache.catalina.core.StandardContext.stopInternal(StandardContext.java:5466)
           org.apache.catalina.util.LifecycleBase.stop(LifecycleBase.java:232)
           org.apache.catalina.manager.ManagerServlet.undeploy(ManagerServlet.java:1350)
           org.apache.catalina.manager.ManagerServlet.doGet(ManagerServlet.java:376)
           javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
           javax.servlet.http.HttpServlet.service(HttpServlet.java:722)
           org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
           org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
           org.apache.catalina.filters.SetCharacterEncodingFilter.doFilter(SetCharacterEncodingFilter.java:108)
           org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
           org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
           org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
           org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
           org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:581)
           org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
           org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
           org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
           org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
           org.apache.catalina.ha.tcp.ReplicationValve.invoke(ReplicationValve.java:333)
           org.apache.catalina.ha.session.JvmRouteBinderValve.invoke(JvmRouteBinderValve.java:219)
           org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
           org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:999)
           org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:565)
           org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:309)
           java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
           java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
           java.lang.Thread.run(Thread.java:619)
      And several Coherence threads like:
      Thread[DistributedCache:DistributedCacheForCommandPattern|SERVICE_STOPPING,5,Cluster]
           com.tangosol.net.CacheFactory.ensureCluster(CacheFactory.java:424)
           com.oracle.coherence.patterns.command.internal.CommandExecutor.stop(CommandExecutor.java:671)
              ...
      
      Thread[NPDistributedCacheWorker:2,5,NPDistributedCache]
           java.lang.Object.wait(Native Method)
           com.tangosol.coherence.component.util.Daemon.onWait(Daemon.CDB:18)
              ...
      
      Thread[PacketListener1P,8,Cluster]
           com.tangosol.coherence.component.net.Cluster.onException(Cluster.CDB:1)
           com.tangosol.coherence.component.net.Cluster$PacketListener1P.onException(Cluster.CDB:1)
      
      ... (a lot more)
      Can anyone help me understand why this is happening?

      INFO ABOUT CONFIGURATION
      Here are the cache schemes we are using:
      <caching-schemes>
      
                <local-scheme>
                     <scheme-name>unlimited-backing-map</scheme-name>
                </local-scheme>
      
                <replicated-scheme>
                     <scheme-name>hp-replication</scheme-name>
                     <service-name>HPReplicatedCache</service-name>
                     <backing-map-scheme>
                          <local-scheme>
                               <scheme-ref>unlimited-backing-map</scheme-ref>
                          </local-scheme>
                     </backing-map-scheme>
                     <autostart>true</autostart>
                </replicated-scheme>
      
                <replicated-scheme>
                     <scheme-name>transient-replication</scheme-name>
                     <service-name>TransientReplicatedCache</service-name>
                     <backing-map-scheme>
                          <local-scheme>
                               <expiry-delay>1m</expiry-delay>
                          </local-scheme>
                     </backing-map-scheme>
                     <autostart>true</autostart>
                </replicated-scheme>
      
                <near-scheme>
                     <scheme-name>gp-near</scheme-name>
                     <front-scheme>
                          <local-scheme>
                               <eviction-policy>LRU</eviction-policy>
                               <high-units>1000</high-units>
                               <expiry-delay>5m</expiry-delay>
                          </local-scheme>
                     </front-scheme>
                     <back-scheme>
                          <distributed-scheme>
                               <scheme-name>gp-distributed</scheme-name>
                               <service-name>GPDistributedCache</service-name>
                               <backup-count>2</backup-count>
                               <backing-map-scheme>
                                    <local-scheme>
                                         <scheme-ref>unlimited-backing-map</scheme-ref>
                                    </local-scheme>
                               </backing-map-scheme>
                               <autostart>true</autostart>
                          </distributed-scheme>
                     </back-scheme>
                     <autostart>true</autostart>
                </near-scheme>
      
                <near-scheme>
                     <scheme-name>np-near</scheme-name>
                     <front-scheme>
                          <local-scheme />
                     </front-scheme>
                     <back-scheme>
                          <distributed-scheme>
                               <scheme-name>np-distributed</scheme-name>
                               <service-name>NPDistributedCache</service-name>
                               <serializer>
                                   <instance>
                                       <class-name>org.gridkit.coherence.utils.pof.AutoPofSerializer</class-name>
                                       <init-params>
                                              <init-param>
                                                   <param-type>String</param-type>
                                                   <param-value>tangosol-pof-config.xml</param-value>
                                              </init-param>
                                         </init-params>
                                   </instance>
                               </serializer>
                               <thread-count>5</thread-count>
                               <backup-count>2</backup-count>
                               <backing-map-scheme>
                                    <local-scheme>
                                         <eviction-policy>LRU</eviction-policy>
                                         <high-units>10000</high-units>
                                    </local-scheme>
                               </backing-map-scheme>
                               <autostart>true</autostart>
                          </distributed-scheme>
                     </back-scheme>
                     <autostart>true</autostart>
                </near-scheme>
      
                <near-scheme>
                     <scheme-name>np-normal</scheme-name>
                     <front-scheme>
                          <local-scheme />
                     </front-scheme>
                     <back-scheme>
                          <distributed-scheme>
                               <scheme-name>np-normal-distributed</scheme-name>
                               <service-name>NPNormalDistributedCache</service-name>
                               <backup-count>2</backup-count>
                               <backing-map-scheme>
                                    <local-scheme>
                                         <eviction-policy>LRU</eviction-policy>
                                         <high-units>10000</high-units>
                                    </local-scheme>
                               </backing-map-scheme>
                               <autostart>true</autostart>
                          </distributed-scheme>
                     </back-scheme>
                     <autostart>true</autostart>
                </near-scheme>
      
                <distributed-scheme>
                     <scheme-name>archive</scheme-name>
                     <service-name>archiveCache</service-name>
                     <backing-map-scheme>
                          <local-scheme>
                               <eviction-policy>LRU</eviction-policy>
                               <high-units>1</high-units>
                               <expiry-delay>10s</expiry-delay>
                          </local-scheme>
                     </backing-map-scheme>
                     <autostart>true</autostart>
                </distributed-scheme>
      
                <transactional-scheme>
                     <scheme-name>np-transactional</scheme-name>
                     <service-name>iap-transactional-cache</service-name>
                     <thread-count>10</thread-count>
                     <request-timeout>30000</request-timeout>
                     <autostart>true</autostart>
                </transactional-scheme>
      
           </caching-schemes>