1 Reply. Latest reply: Apr 12, 2012 5:10 AM by robvarga

    Coherence *** WISH LIST ***

    snidely_whiplash
      Some things I'd love to see in the future...

      1) Add TCMP and Extend protocol support to Wireshark so I can monitor what apps are actually doing, e.g. when CQC registrations go out, how much data comes back, etc.

      2) Need some way to inject various errors/disconnects into the cluster to see how applications behave when the cluster is not performing correctly. Need to simulate deadlocks, timeouts, service restarts, and long GC delays, maybe using an MBean on each member. Waiting around for things to go wrong in production so you can play whac-a-mole is no good.

      3) There should be some way to tell what's going on behind the scenes. I'm using tangosol.coherence.log.level=9 and it does not log when filters, queries, etc are received by the cluster or node. It would be very helpful for diagnostics if a cluster could be monitored for what processors and CQCs are running on it.

      4) Have verbose logging levels include thread pool utilization info whenever some threshold (e.g. 80%) is crossed: "pool usage high: avg [7/8] 87% use".

      5) Configuring Coherence is still too much of a black art. Coherence should come "out of the box" with JVM args which limit GC times such that cluster members are never declared as paused, removed from the cluster, etc. Let application performance be as poor as necessary under these fail-safe defaults, but the cluster should protect itself first. It seems that if you have all Coherence client and server JVMs on one physical machine and its CPU utilization never goes over 50%, then you should never have timeouts, rescheduled packets, nodes leaving the cluster, etc. Or is that not reasonable?

      6) Documentation of <Error> messages: When a log message like this shows up:
      2012-03-21 09:11:00.886/16140.585 Oracle Coherence GE 3.7.1.1 <Error> (thread=Cluster, member=6): Assertion failed: Member 4 is unknown to this member
      I have no idea what the likely cause or solution is. A document explaining all the possible errors and what to do about them would be great.

      -Andrew
        • 1. Re: Coherence *** WISH LIST ***
          robvarga
          snidely_whiplash wrote:
          Some things I'd love to see in the future...

          1) Add TCMP and Extend protocol support to Wireshark so I can monitor what apps are actually doing, e.g. when CQC registrations go out, how much data comes back, etc.

          2) Need some way to inject various errors/disconnects into the cluster to see how applications behave when the cluster is not performing correctly. Need to simulate deadlocks, timeouts, service restarts, and long GC delays, maybe using an MBean on each member. Waiting around for things to go wrong in production so you can play whac-a-mole is no good.
          Yep, that would be good.
          3) There should be some way to tell what's going on behind the scenes. I'm using tangosol.coherence.log.level=9 and it does not log when filters, queries, etc are received by the cluster or node. It would be very helpful for diagnostics if a cluster could be monitored for what processors and CQCs are running on it.
          You can already intercept the deserialization of filters, queries, etc. in three ways:

          1. Change the POF configuration to register a custom PofSerializer which delegates to the originally configured PofSerializer and also does whatever monitoring you want. Beware that writing to a persistent store from inside the serializer will make things slower, so don't do that.

          2. Change the serializer of the service to one that does the same, e.g. a ConfigurablePofContext subclass which overrides and wraps the Serializer interface methods and does the monitoring.

          3. With your favourite AOP framework, apply an around advice that does the monitoring to the two Serializer methods (serialize and deserialize) in ConfigurablePofContext.
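
The delegation idea shared by all three options can be sketched roughly as follows. This is only an illustration under stated assumptions: SimpleSerializer is a hypothetical stand-in for Coherence's Serializer interface (the real one works on Coherence buffer types, not plain DataInput/DataOutput), and MonitoringSerializer, StringSerializer and countFor are made-up names. The monitoring is kept to an in-memory counter, in line with the advice above not to write to a persistent store from here.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for Coherence's Serializer interface, declared here
// only so the sketch compiles without the Coherence jar on the classpath.
interface SimpleSerializer {
    void serialize(DataOutput out, Object value) throws IOException;
    Object deserialize(DataInput in) throws IOException;
}

// Wraps another serializer and counts deserialized objects by class name.
final class MonitoringSerializer implements SimpleSerializer {
    private final SimpleSerializer delegate;
    private final ConcurrentHashMap<String, AtomicLong> counts = new ConcurrentHashMap<>();

    MonitoringSerializer(SimpleSerializer delegate) {
        this.delegate = delegate;
    }

    @Override
    public void serialize(DataOutput out, Object value) throws IOException {
        // Outgoing data is passed straight through to the wrapped serializer.
        delegate.serialize(out, value);
    }

    @Override
    public Object deserialize(DataInput in) throws IOException {
        // Let the wrapped serializer do the real work, then record what arrived.
        Object value = delegate.deserialize(in);
        String type = (value == null) ? "null" : value.getClass().getName();
        counts.computeIfAbsent(type, k -> new AtomicLong()).incrementAndGet();
        return value;
    }

    // Number of objects of the given class deserialized so far.
    long countFor(String className) {
        AtomicLong count = counts.get(className);
        return (count == null) ? 0L : count.get();
    }
}

// Trivial delegate that handles only strings, used to exercise the wrapper.
final class StringSerializer implements SimpleSerializer {
    @Override
    public void serialize(DataOutput out, Object value) throws IOException {
        out.writeUTF((String) value);
    }

    @Override
    public Object deserialize(DataInput in) throws IOException {
        return in.readUTF();
    }
}
```

In a real cluster the wrapped delegate would be the originally configured serializer, and the counter could be exposed however you prefer (log line, MBean attribute); both of those choices are left open here.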

          4) Have verbose logging levels include thread pool utilization info whenever some threshold (e.g. 80%) is crossed: "pool usage high: avg [7/8] 87% use".
          It is possibly not a good idea. If you have so many threads that this situation only rarely occurs, then you have overprovisioned your proxy or your service, and having fewer threads may give you less overhead (idle threads still occupy memory for their allocated stack, which by default is platform-dependent and on the order of 1 MB per thread on 64-bit JVMs, if I remember correctly). A well-sized pool would be busy most of the time, so the message would be logged almost constantly.
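
For reference, the check requested in item 4 amounts to a simple ratio test. A minimal sketch follows, assuming the active and total thread counts are obtained elsewhere; PoolUsage and warningFor are hypothetical names, not part of Coherence, and the message format is copied from the wish above.

```java
import java.util.Optional;

// Hypothetical helper: none of these names come from Coherence.
final class PoolUsage {
    private PoolUsage() {
    }

    // Returns the warning line when utilization reaches thresholdPercent,
    // or an empty Optional when the pool is below the threshold.
    static Optional<String> warningFor(int activeThreads, int totalThreads, int thresholdPercent) {
        if (totalThreads <= 0) {
            throw new IllegalArgumentException("totalThreads must be positive");
        }
        // Integer division truncates, so 7 of 8 threads (87.5%) reports as 87%.
        int percent = activeThreads * 100 / totalThreads;
        if (percent < thresholdPercent) {
            return Optional.empty();
        }
        return Optional.of("pool usage high: avg [" + activeThreads + "/" + totalThreads + "] "
                + percent + "% use");
    }
}
```

Calling PoolUsage.warningFor(7, 8, 80) yields the example message from the wish, while a pool at 3 of 8 threads yields nothing.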

          5) Configuring Coherence is still too much of a black art. Coherence should come "out of the box" with JVM args which limit GC times such that cluster members are never declared as paused, removed from the cluster, etc. Let application performance be as poor as necessary under these fail-safe defaults, but the cluster should protect itself first. It seems that if you have all Coherence client and server JVMs on one physical machine and its CPU utilization never goes over 50%, then you should never have timeouts, rescheduled packets, nodes leaving the cluster, etc. Or is that not reasonable?
          It is not reasonable. The ideal GC settings always depend on how fast your application generates garbage, how much long-lived data you need to keep in the old generation, and how frequently that long-lived data is replaced with newer long-lived data. There is no silver bullet.


          6) Documentation of <Error> messages: When a log message like this shows up:
          2012-03-21 09:11:00.886/16140.585 Oracle Coherence GE 3.7.1.1 <Error> (thread=Cluster, member=6): Assertion failed: Member 4 is unknown to this member
          I have no idea what the likely cause or solution is. A document explaining all the possible errors and what to do about them would be great.

          -Andrew
          In theory there is one for TCMP; if it does not contain your message, it may not be up to date:

          http://docs.oracle.com/cd/E24290_01/coh.371/e22838/appendix_errormsgs.htm


          Best regards,

          Robert

          Edited by: robvarga on Apr 12, 2012 11:07 AM