A GroupChangeEvent should be issued when nodes are added to or removed from the group and when they [join or leave | http://www.oracle.com/technology/documentation/berkeley-db/je/ReplicationGuide/lifecycle.html#lifecycle-terms] the group (where a join/leave is a startup/stop and an add/remove is a more permanent change in membership). Upon inspection, it appears that we are not correctly firing the event when a node leaves the group. This is a bug, and the fix will appear under SR #18006.
There are a couple of additional considerations in this area. Issuing the GroupChangeEvent when a node joins and leaves has some shortcomings. One problem is that the GroupChangeEvent shows the membership of the group, rather than the current status of the nodes, so there's no information about which node joined or which node left. It would seem more useful to fire a different kind of event for join/leave, which includes the name of the node that joined or left. This seems to match what you're trying to achieve in your application.
Another issue with GroupChangeEvents is that, because of the distributed nature of replication, we cannot always guarantee that we issue exactly one event per change. If there are master failovers, it's possible that the application will see:
- 1 or more GroupChangeEvents when nodes are added or removed. There may be extra notifications when a new master comes up.
- exactly 1 event when a node joins
- 0 or 1 event when a node leaves: if the master crashes right after a node leaves, the event may not be fired.
One thing we're thinking of to address these issues is to add a ping command. The ping command would let an application ping a given node to check its state. It would provide a more active way to check aliveness and would complement the listener/event API. We are also thinking that perhaps the ping could be extensible so that the application can add application specific information, because database state alone may not be enough to determine whether the application is available on that node.
Let us know whether you think either of these options would improve usability!
Thanks as always for your quick response. About the various considerations -- I was indeed a little surprised that there wasn't a field in the GroupChangeEvent to indicate which node(s) were involved in the change. If I understand your response correctly, I should never see a change in the results of calling getRepGroup() when a GroupChangeEvent is fired because of a join/leave, but should always see one when generated by an add/remove (barring any edge conditions I can't think of). So even getting the leave notification won't help me identify what has changed (in my test code I iterated through the new group to see how it differed from my internal representation).
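The comparison I describe above can be sketched roughly as follows. This is only a sketch under assumptions: it presumes the node names have already been pulled out of the group returned by getRepGroup() into a plain Set<String>, and MembershipDiff is a hypothetical helper of my own, not part of the JE API.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical helper (not part of JE): the difference between two
 * snapshots of group membership, identified by node name.
 */
final class MembershipDiff {
    final Set<String> added;    // names present now but not before
    final Set<String> removed;  // names present before but not now

    private MembershipDiff(Set<String> added, Set<String> removed) {
        this.added = added;
        this.removed = removed;
    }

    /** Compares the previously known membership with the current one. */
    static MembershipDiff between(Set<String> previous, Set<String> current) {
        Set<String> added = new TreeSet<>(current);
        added.removeAll(previous);
        Set<String> removed = new TreeSet<>(previous);
        removed.removeAll(current);
        return new MembershipDiff(added, removed);
    }

    /** True when the two snapshots contain exactly the same names. */
    boolean isEmpty() {
        return added.isEmpty() && removed.isEmpty();
    }
}
```

If a join/leave does not alter membership, as discussed above, then between() on the snapshots taken before and after such an event would be expected to come back empty, which is exactly why the diff cannot identify the node involved.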
As you suggest, it sounds like the right thing to do then is to introduce a new event type that is sent when nodes join or leave the group, and to stop sending GroupChangeEvents for those cases (again, if I understand correctly a join/leave is essentially not a group change, but a status change of an existing group member).
As for the conditions where the wrong number of events may be fired, that certainly makes sense. It may be worth adding a note to the javadoc for MonitorChangeListener saying that handlers should be idempotent, since the same change may be reported more than once.
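To illustrate what I mean by idempotent, here is a minimal sketch of a handler body that tolerates duplicate notifications. It assumes the listener callback has already reduced the event to a set of node names; IdempotentGroupTracker and its method names are hypothetical and not part of the JE API.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical helper (not part of JE): remembers the last membership
 * snapshot and treats a repeated, identical snapshot as a no-op.
 */
final class IdempotentGroupTracker {
    private Set<String> members = Collections.emptySet();
    private int appliedChanges = 0;

    /**
     * Applies a membership snapshot.
     * Returns true if local state changed, false if it was a duplicate.
     */
    synchronized boolean onGroupChange(Set<String> snapshot) {
        if (members.equals(snapshot)) {
            return false; // same membership seen again, e.g. after a failover
        }
        members = Collections.unmodifiableSet(new TreeSet<>(snapshot));
        appliedChanges++;
        return true;
    }

    synchronized Set<String> members() {
        return members;
    }

    synchronized int appliedChanges() {
        return appliedChanges;
    }
}
```

This only covers the "1 or more events" case. A leave event that is never fired cannot be recovered by idempotence alone, so an application that depends on it would still need some other way to check node state.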
A ping command probably isn't something we'd need to use directly ourselves, but I can see it being useful for others. I'd guess that the implementation of the ReplicationNode interface is fairly lightweight right now, but I can see that being a good place for it. Adding an isOnline() method would mean you could get the started/stopped status for each member of the group when you do get a GroupChangeEvent.
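Roughly what I have in mind for isOnline(), as a sketch only. Nothing here exists in JE today: NodeStatusView stands in for a node interface extended with the suggested method, and FixedNodeStatus and GroupStatus are hypothetical names used purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of a group member with the suggested isOnline() method. */
interface NodeStatusView {
    String getName();
    boolean isOnline();
}

/** Trivial immutable implementation, used here only to exercise the sketch. */
final class FixedNodeStatus implements NodeStatusView {
    private final String name;
    private final boolean online;

    FixedNodeStatus(String name, boolean online) {
        this.name = name;
        this.online = online;
    }

    @Override
    public String getName() {
        return name;
    }

    @Override
    public boolean isOnline() {
        return online;
    }
}

final class GroupStatus {
    /** Returns the names of all members currently reported as stopped. */
    static List<String> offlineNodes(Iterable<? extends NodeStatusView> nodes) {
        List<String> offline = new ArrayList<>();
        for (NodeStatusView node : nodes) {
            if (!node.isOnline()) {
                offline.add(node.getName());
            }
        }
        return offline;
    }
}
```

With something like this, a handler receiving a GroupChangeEvent could walk the group members and see which ones are stopped, without needing the event itself to name the node.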