This discussion is archived
5 Replies Latest reply: Aug 5, 2011 1:37 AM by MagnusE RSS

How to run N instances of a service in a cluster?

MagnusE Explorer
Currently Being Moderated
Hi!

I have seen some hints that there is supposed to be support for executing N (in my case exactly one) instance of a "service" in a cluster in a highly available manner (i.e. if the node node where the server execute die the service instance will be started in another node instead) in the incubator. In my case the "service" is a plain old Java background task that perform a verification (background "sanity check") of the cache contents versus some parts of a database. It is not critical that it never goes down (i.e. it can be down a few seconds or even minutes) but we need to be sure that it is not down for hours or days...

If there indeed is support for doing this please direct me to the documentation of the feature / class that can be used. Some links to examples would be nice as well!

/Magnus
  • 1. Re: How to run N instances of a service in a cluster?
    pmackin Journeyer
    Currently Being Moderated
    This can be done using the Lease code in Common along with a backing map listener (BML). You will need to couple the Lease with some object in the cache to use the Lease functionality within the Grid. We don't have sample code to do this I have tried to describe how it would work with the following example. I hope it makes sense to you. There are variations of this but it should give you a general idea on how to write a highly available background thread. The MessagingPattern does have sample code on how to use Leases but it is different from this use case since the Messaging Subscriber "client" or LeaseExtender runs on the client, not in the grid.

    Paul

    ------------------------------------------------------------------------------------

    Consider an example with these objects:

    1) A Lease object which detects a member or process failure.

    2) A LeaseListener which is called by the LeaseExpiryCoordinator if the Lease expires. The LeaseListener is always called in the JVM
    of the Member that owns the Lease.

    3) A singleton object in the cluster, called ValidationStateMachine, which contains a few properties
    such a Lease, a state, db info, etc. ValidationStateMachine controls the high level low of validation between a database
    and the cache contents.

    4) A ValidationService which is a transient [e.g. not in the cache] object that performs the validation from a
    background thread.

    5) The LeaseExtender which is a transient object that extends the lease (by updating it) to ensure that it doesn't expire. The LeaseExtender
    always runs in the same JVM as the ValidationService

    6) The ValidationStateMachine Backing Map Listener (BML) which listens for events.

    The state transition of the ValidationStateMachine triggers the start of the ValidationService, which runs to completion
    even if the ValidationStateMachine object is transferred to a new member. In the extreme case where the cluster member
    is terminated then the ValidationService will be restarted on the new owner member of the ValidationStateMachine object.

    ----------------------------------------------------------------------------------------------

    This is how the validation is started. Note that states are fictional and you would replace them with your own.

    1) Application creates a singleton ValidationStateMachine which initializes its state to "initial". Member-1
    owns ValidationStateMachine.

    2) The BML gets the insert event and registers the Lease in suspended mode. It will stay suspended until the LeaseExtender
    extends it. From that point on it must be extended periodically or else the lease will expire.

    3) Application updates the ValidationStateMachine state to "running".

    4) The BML gets the update event and checks the state. If the state transitioned to "running" then the BML starts the
    ValidationService in a background thread, injecting the current state (and any other info) into the ValidationService.

    5) The ValidationService starts running
    a) starts the LeaseExtender thread.
    b) executes starting from its current state.

    6) The ValidationService completes
    a) stops the LeaseExtender thread
    b) updates the ValidationStateMachine state to "done"
    c) exits the thread.

    7) The BML gets the update event and de-registers the lease if the state transitioned to "done"

    ------------------------------------------------------------------------------------------------

    This is what happens if the ValidationStateMachine moves to Member-2 and the ValidationService is already running on Member-1.
    This is NOT a failure scenario.

    1) BML on Member-1 gets an "entry-departing" event and de-registers the Lease.

    2) Member-2 becomes the owner of ValidationStateMachine generating a BML "entry arriving" event. The BML
    re-registers the Lease, specifying the LeaseListener. Note, that the LeaseListener is always registered on the member
    that owns the object being leased.

    3) ValidationService and LeaseExtender continue to run on Member-1 to completion.

    4) The ValidationService completes
    a) stops the LeaseExtender thread
    b) updates the ValidationStateMachine state to "done"
    c) exits the thread.

    5) The BML gets the update event and de-registers the lease if the state transitioned to "done"

    ------------------------------------------------------------------------------------------------

    Here is the failure scenario. ValidationStateMachine is owned by Member-1 and ValidationService is running on Member-1.
    Member-1 then crashes.

    1) Member-2 becomes the owner of ValidationStateMachine generating a BML "entry arriving" event. The BML
    re-registers the Lease, specifying the LeaseListener. Member-2 does NOT start the ValidationService since the state
    hasn't changed and it doesn't know that ValidationService is no running on Member-1. At this point, there is no
    LeaseExtender running anywhere in the cluster.

    2) The Lease expires and the LeaseListener is called. The LeaseListener updates the state in the ValidationStateMachine to
    "restarting", then to "running"

    3) The BML gets the update event and sees that the state has transitioned to running so it starts the ValidationService, injecting the state.

    4) The ValidationService starts running
    a) starts the LeaseExtender thread.
    b) executes starting from its current state.

    ....

    -----------------

    thats it.
  • 2. Re: How to run N instances of a service in a cluster?
    pmackin Journeyer
    Currently Being Moderated
    Hi Magnus

    I should also mention that you should consider using the incubator Processing Pattern, which may meet your requirements. See http://coherence.oracle.com/display/INCUBATOR/Processing+Pattern

    Paul
  • 3. Re: How to run N instances of a service in a cluster?
    MagnusE Explorer
    Currently Being Moderated
    Excellent post - I will give this a try.

    A complication (that I did not go into in my initial description) is that that the "verifier thread" is supposed to run on one of my non-storage enabled nodes (one of the application-servers, still a member of the cluster (near topology) but not contributing storage). I have however a weak recollection that I have read that one can run multiple caching services that are storage enabled on different subsets of nodes. I could then create a special cache service that is only storage enabled (only started even) on the nodes executing on the application servers and use it to implement your proposed algorithm?!

    Lets see if I understand (some questions and assumptions embedded):

    1. Create a special cache (using the special caching service I talked about above) with a special BML that I have written.
    2. My appservers (in a startup bean etc) all "tries" to lock and create (if not already created) the singleton "validator" object in the special cache (only one will succeed of course).
    3. The BML of the node where the singleton validator object ends up is called at the insert and creates the lease (it can, I assume, itself act as the lease listener or?)

    In your algorithm (startup flow) you have a step 3 "application updates state" - can I skip that and just have app set the state to "start" (or whatever) directly having the BML start the thread right away? - feels like there is a potential "hole" in the logic if the application dies after creating the singleton object but before changing the state (or does the failure cases solve that - I have not read them in detail yet)?!
    I suppose the validator thread ideally should renew the lease as part of its work (say every time it has performed a validation "pass" over the cache) to guard against it getting "stuck"? This way I do not need a separate lease renewal thread and do not risk that the validation thread is stuck while the renewal thread is not.

    4. If the singleton object moves the departing BML will unregister itself as lease listener and the receiving BML will register itself as lease listener. Here I am a bit confused because I don't (yet) fully understand how the lease classes work:

    * Is a lease a "clustered" object - i.e. if I created a lease, add a listener and renew/extend it on node-1 can I then register an expiry expiry listener on node-2 or do I have to create a new (local) lease object on node-2 and in that case how can I renew/extend the newly created lease from node-1 (where my service thread is still running)?

    * What do I need to do in order for the lease package to work? I need to include the incubator "common" JAR before the coherence.jar (to pick up the override file). Is anything else needed?

    /Magnus
  • 4. Re: How to run N instances of a service in a cluster?
    pmackin Journeyer
    Currently Being Moderated
    Hi Magnus

    Here are some answers:

    1) I think your strategy of creating a specialized service would work but you should try it out first deciding what to do. You could also have a second cluster just for the service thread and connect to the main cluster using extend.

    2) You should use an entry processor to create the singleton and only create it if its doesn't exist. This will guarantee that only 1 object is created, even though many clients invoke the entry processor. For example:
            // Create the subscription if it doesn't exist.
            if (!entry.isPresent())
            {
                entry.setValue(new ValidatingStateMachine());
            }
    3) Correct. The ValidationStateMachine contains a Lease object. LeaseListener is a static inner class of ValidationStateMachine. The BML insert event you initialize the lease (to suspend it) then register it. You must make sure that you don't suspend the lease on failover, which also generates an insert event. If you use the Common
    if (event instanceof BackingMapEntryArrivedEvent)
    {
        // The ValidationStateMachine just got moved to this member 
        ...
    }
     else if (event instanceof BackingMapEntryInsertedEvent)
    {
        // The ValidationStateMachine was just inserted into the cache.
        getLease().setIsSuspended(true);
        LeaseExpiryCoordinator.INSTANCE.registerLease(getIdentifier(), getLease(), LeaseListener());
    }
    In regards to using a single state, you need some way to know if they ValidatingService must be started from the BML. I use a state transition to "running" in the example. This is done by comparing the previous state to the new state. The Common Event contains both the previous and current state. If you don't use a transition but start out "running" then you need some way to know to restart the ValidationService when the lease expires. You can probably do it from the LeaseListener itself.

    If the application dies before changing the state then the ValidationService won't run. However, if the application restarts and runs the create entry processor a second time then it will be a no-op (which is what you want), then the application can change the state. You have the same problem if the application starts running but dies before it invokes the entry processor to do the initial creation.

    Yes, the validation thread can extend the lease but your lease duration must be long enough to handle the case where the validation thread hangs for a while. If not, the lease will expire and the LeaseListener will trigger the execution of another validator thread. The validation thread should never really get stuck since the Coherence request timeout will cause the operation to abort. I don't know what timeouts the database uses.

    4) Yes, a Lease is part of a clustered object since it is a member variable of ValidatingStateMachine. This solution only uses 1 Lease object. The LeaseExpiryCoordinator (which does the registration) is not a cluster object. It provides static methods for registering and de-registering the Lease. I am not sure that answers your question.

    The Lease is included in Common. All of the instructions on how to download and use Common are here: http://coherence.oracle.com/display/INCUBATOR/Coherence+Common

    You may also want to download the MessagingPattern to see how it uses Leases.

    Paul
  • 5. Re: How to run N instances of a service in a cluster?
    MagnusE Explorer
    Currently Being Moderated
    Thanks for all the good input - I will "digest it" and see if I can get it all working!

    /Magnus

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points