
    Failing nodes over results in a 2-minute gap where "nothing" is done

    RyanGardner

      We're on Coherence 3.7.1.9, and I'm testing a reconfiguration of our storage nodes on new hardware with adjusted heap sizing, etc.
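      For context, a minimal sketch of how a storage-enabled node of this kind is typically launched - heap sizing is done with the usual JVM flags, and the flag values and class name below are generic placeholders rather than our exact setup:

      // Minimal sketch of a storage-enabled node launcher (Coherence 3.7.x).
      // Heap sizing and storage are normally controlled with JVM arguments, e.g.
      //   -Xms4g -Xmx4g -Dtangosol.coherence.distributed.localstorage=true
      //   -Dtangosol.coherence.cacheconfig=cache-config.xml
      // (the values above are placeholders, not our actual settings)
      import com.tangosol.net.DefaultCacheServer;

      public class StorageNode {
          public static void main(String[] args) {
              // starts the services defined in the cache configuration and keeps the node alive
              DefaultCacheServer.main(args);
          }
      }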

       

      Our primary use of Coherence is as a distributed cache, with clients having near caches. We don't use it as a data grid and therefore have no indexes - just a couple of hundred caches with an aggregate of around one to two hundred million cache entries.
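      The client access pattern is basically plain NamedCache gets and puts; a minimal sketch is below (the cache name and key/value types are placeholders - the near cache itself is defined in the client's cache configuration as a near-scheme, not in code):

      // Minimal sketch of the client-side access pattern (Coherence 3.7.x NamedCache API).
      import com.tangosol.net.CacheFactory;
      import com.tangosol.net.NamedCache;

      public class ClientExample {
          public static void main(String[] args) {
              NamedCache cache = CacheFactory.getCache("example-cache"); // placeholder name

              cache.put("some-key", "some-value");  // goes to the owning storage node
              Object value = cache.get("some-key"); // served from the near cache once it is warm
              System.out.println(value);

              CacheFactory.shutdown();
          }
      }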

       

      I have a test with about 10 client nodes hitting the set of storage nodes. The client nodes have about 100 threads each, making cache requests every 150 to 250 ms, to simulate about 3k gets per second and 1.5k puts per second. (This is just a baseline test; I've also run other tests at much higher levels of traffic.)
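      To make the traffic pattern concrete, a rough sketch of that kind of load driver is below (the thread count, cache name, and key space are placeholders, not our exact harness):

      // Rough sketch of the load pattern described above: N worker threads, each
      // doing a get or put and then sleeping a random 150-250 ms.
      import com.tangosol.net.CacheFactory;
      import com.tangosol.net.NamedCache;

      import java.util.Random;

      public class LoadDriver {
          public static void main(String[] args) throws InterruptedException {
              final NamedCache cache = CacheFactory.getCache("example-cache");
              final int threadCount = 100;

              for (int i = 0; i < threadCount; i++) {
                  Thread worker = new Thread(new Runnable() {
                      public void run() {
                          Random rnd = new Random();
                          while (!Thread.currentThread().isInterrupted()) {
                              String key = "key-" + rnd.nextInt(1000000);
                              // roughly two gets for every put, matching the
                              // ~3k gets / ~1.5k puts per second overall
                              if (rnd.nextInt(3) < 2) {
                                  cache.get(key);
                              } else {
                                  cache.put(key, "value-" + rnd.nextLong());
                              }
                              try {
                                  Thread.sleep(150 + rnd.nextInt(101)); // 150-250 ms between requests
                              } catch (InterruptedException e) {
                                  Thread.currentThread().interrupt();
                              }
                          }
                      }
                  });
                  worker.setDaemon(true);
                  worker.start();
              }

              Thread.sleep(60000); // run for a minute (placeholder duration)
              CacheFactory.shutdown();
          }
      }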

       

      When one of the storage nodes is shut down, the other storage nodes show a long pause in their logs before they actually start doing anything again. I am tracking metrics on our cache too, and I can see a very noticeable drop - it is as if any put or get requests made right before that cache node was stopped just hang, as well as the requests that go to the other nodes.
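      A minimal sketch of the kind of per-operation timing that makes the pause visible is below (the threshold, cache name, and method are placeholders, not our actual metrics code):

      // Minimal sketch: time each get and log anything that stalls. During the
      // "coffee break" these slow operations show up in bursts lasting minutes.
      import com.tangosol.net.NamedCache;

      public class TimedGet {
          private static final long SLOW_MILLIS = 1000; // placeholder threshold

          public static Object timedGet(NamedCache cache, Object key) {
              long start = System.nanoTime();
              Object value = cache.get(key);
              long elapsedMillis = (System.nanoTime() - start) / 1000000L;
              if (elapsedMillis > SLOW_MILLIS) {
                  System.err.println("slow get for " + key + ": " + elapsedMillis + " ms");
              }
              return value;
          }
      }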

       

      This lasts about 2 to 3 minutes. During this time I have monitored both CPU and network activity - there is very little network traffic and very little CPU usage - so it's not as if there is some I/O or CPU bottleneck causing the pause.


      Does anyone have any ideas on what settings I should look at tweaking to get rid of this "failover coffee break" that the other storage-enabled nodes seem to go on when one of the nodes fails?

       

      A 3-minute window where the cache doesn't respond to requests could have major negative ripple effects through the rest of our applications.

       

      Does anyone else see the same kind of "failover coffee breaks" that I see? Are yours shorter or longer than mine?

       

      We are using 4096 partitions, if that has anything to do with it - but considering that the nodes don't start to distribute vulnerable partitions to each other until after they come back from their coffee break, it doesn't seem like the partition count should be super relevant.
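      For reference, the partition and backup counts can be read off the running service to confirm the setup; a minimal sketch is below (the cache name is a placeholder - the partition count itself is set in the distributed-scheme of the cache configuration, not in code):

      // Minimal sketch of reading the partition/backup configuration off a running
      // distributed cache service, to confirm the 4096-partition setup at runtime.
      import com.tangosol.net.CacheFactory;
      import com.tangosol.net.NamedCache;
      import com.tangosol.net.PartitionedService;

      public class PartitionInfo {
          public static void main(String[] args) {
              NamedCache cache = CacheFactory.getCache("example-cache");
              PartitionedService service = (PartitionedService) cache.getCacheService();

              System.out.println("partition count: " + service.getPartitionCount());
              System.out.println("backup count:    " + service.getBackupCount());
              System.out.println("storage members: " + service.getOwnershipEnabledMembers().size());

              CacheFactory.shutdown();
          }
      }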