4 Replies Latest reply: Feb 11, 2013 10:50 AM by Tim Blackman, Oracle

Multi data center Berkeley DB JE?

858706 Newbie
Is anyone today running Berkeley DB Java Edition out of multiple data centers? Obviously there are issues of securing the replication communication across public links:

JE HA replication over public internet - possible to add authentication?

In addition, there are issues around maintaining a quorum in the face of the failure of a single data center while balancing the performance of write acks. I'm curious to know if anyone has been down this path with BDB JE and has some insights to share.

Ryan
  • 1. Re: Multi data center Berkeley DB JE?
    858706 Newbie
    To provide a bit more information about what we're trying to do…

    Today we have several clusters of BDB JE machines in a single data center. This isn't great from a disaster recovery/business continuity perspective. If a major event were to happen at/near our current data center, even with three-way replication we'd be in trouble. Our best bet would be offsite, periodic backups. So, we'd like to explore having off-site replicas.

    Off-site replicas bring about their own complications. First, BDB replication is not secure: it's not secure at the transport level and it's unauthenticated. Doing multi-site deployments will almost certainly involve packets traveling over the public internet. One option is to VPN-tunnel traffic from site to site. Another option, which we're currently experimenting with, is modifying BDB itself to perform SSL internally.
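
    As a rough illustration of the direction we're experimenting with, here's a minimal sketch (plain JSSE, nothing JE-specific; the keystore path and password are made up) of building an SSLContext from a keytool-generated keystore, which is roughly the kind of thing we'd want the replication sockets to use:

        import java.io.FileInputStream;
        import java.security.KeyStore;
        import javax.net.ssl.KeyManagerFactory;
        import javax.net.ssl.SSLContext;
        import javax.net.ssl.TrustManagerFactory;

        public class ReplicationSslContext {
            // Hypothetical keystore deployed alongside our configuration files.
            public static SSLContext create() throws Exception {
                char[] password = "changeit".toCharArray();
                KeyStore keyStore = KeyStore.getInstance("JKS");
                try (FileInputStream in =
                         new FileInputStream("/etc/myapp/rep-keystore.jks")) {
                    keyStore.load(in, password);
                }

                // Key managers present our own certificate; trust managers
                // decide whether to accept the peer's certificate.
                KeyManagerFactory kmf = KeyManagerFactory.getInstance(
                    KeyManagerFactory.getDefaultAlgorithm());
                kmf.init(keyStore, password);
                TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                    TrustManagerFactory.getDefaultAlgorithm());
                tmf.init(keyStore);

                SSLContext ctx = SSLContext.getInstance("TLS");
                ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
                return ctx;
            }
        }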

    In addition to securing communication, there is the matter of dealing with high-latency connections across the public internet. Imagine two nodes in our San Jose datacenter and a third node on the east coast someplace. Given the need/desire to have acks on writes from all nodes, or at least a quorum, long hauls can begin to adversely affect local writes.

    Lastly, how would you distribute replicas? 2 at one site, 1 at another? What happens if you lose the site with 2? You'll struggle to attain a quorum. You'll have to either wait for the 2 to come back or remove them from the replication group and go on without them. If the site with 2 ever returns, those nodes would have enough for a quorum, so you'll have to inform them not to start up; otherwise you'll issue writes to disconnected members of the replication group. There's an operational issue at stake here, which isn't well served by BDB JE today.

    So as we think through all of these issues, I'm left wondering if anyone is running BDB JE today in a multi-site setup.
  • 2. Re: Multi data center Berkeley DB JE?
    Tim Blackman, Oracle Newbie
    Today we have several clusters of BDB JE machines in a single data center. This isn't great from a disaster recovery/business continuity perspective. If a major event were to happen at/near our current data center, even with three-way replication we'd be in trouble. Our best bet would be offsite, periodic backups. So, we'd like to explore having off-site replicas.
    For sure!
    Off-site replicas bring about their own complications. First, BDB replication is not secure: it's not secure at the transport level and it's unauthenticated. Doing multi-site deployments will almost certainly involve packets traveling over the public internet. One option is to VPN-tunnel traffic from site to site. Another option, which we're currently experimenting with, is modifying BDB itself to perform SSL internally.
    Yes, this is definitely an issue that JE HA does not currently address.

    When using SSL, are you thinking of using certificate-based authentication or maybe LDAP? Would Kerberos be a possibility in your environment?
    In addition to securing communication, there is the matter of dealing with high-latency connections across the public internet. Imagine two nodes in our San Jose datacenter and a third node on the east coast someplace. Given the need/desire to have acks on writes from all nodes, or at least a quorum, long hauls can begin to adversely affect local writes.
    Does your application need acknowledgments from all nodes for writes? If a simple majority is sufficient and you have a quorum of nodes available locally, then specifying ReplicaAckPolicy.SIMPLE_MAJORITY should mean the needed acknowledgments will be supplied by local nodes and the remote connection speed should not interfere. How does that fit with the approach you have been taking?
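
    To make that concrete, here is a minimal sketch (assuming an already-open ReplicatedEnvironment; the sync policies are just placeholders, pick whatever your durability needs dictate) of requesting acknowledgments from a simple majority only:

        import com.sleepycat.je.Durability;
        import com.sleepycat.je.Transaction;
        import com.sleepycat.je.TransactionConfig;
        import com.sleepycat.je.rep.ReplicatedEnvironment;

        class MajorityAckWrite {
            static Transaction beginMajorityAckTxn(ReplicatedEnvironment repEnv) {
                // Require acks from a simple majority of electable nodes, so two
                // local nodes can satisfy the policy without waiting on the
                // remote one.
                Durability durability = new Durability(
                    Durability.SyncPolicy.WRITE_NO_SYNC,  // master sync policy
                    Durability.SyncPolicy.WRITE_NO_SYNC,  // replica sync policy
                    Durability.ReplicaAckPolicy.SIMPLE_MAJORITY);
                TransactionConfig txnConfig = new TransactionConfig();
                txnConfig.setDurability(durability);
                return repEnv.beginTransaction(null, txnConfig);
            }
        }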

    An additional issue to consider here involves the location of the master. You can use the ReplicationMutableConfig.setNodePriority method to reduce the election priority of remote nodes to make them less likely to be elected as masters. In most cases, that means that one of the local nodes will be the master, and writes will take advantage of the low-latency connection to that node.
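
    For example, something along these lines on each node after it opens its environment (a sketch; the particular priority values are arbitrary, and the exact semantics are described in the setNodePriority javadoc):

        import com.sleepycat.je.rep.ReplicatedEnvironment;
        import com.sleepycat.je.rep.ReplicationMutableConfig;

        class NodePriorities {
            // Higher priorities win elections, so giving local nodes a larger
            // value than remote ones makes a local master the common case.
            static void applyPriority(ReplicatedEnvironment repEnv,
                                      boolean isRemoteSite) {
                ReplicationMutableConfig mutableConfig = repEnv.getRepMutableConfig();
                mutableConfig.setNodePriority(isRemoteSite ? 1 : 2);
                repEnv.setRepMutableConfig(mutableConfig);
            }
        }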

    In some circumstances, though, it is possible for a remote node to be elected master even if it has a lower priority, in particular due to unexpectedly long network delays. In that case, you can use the DbGroupAdmin utility or the ReplicationGroupAdmin.transferMaster method to transfer the master to one of the local nodes. Note that these facilities are available in release 5.0.58 although we failed to mention them in the changelog.
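
    A sketch of the programmatic version, run against the current master (check the 5.0.58 javadoc for the exact signatures; DbGroupAdmin provides the same operation from the command line):

        import java.util.Collections;
        import java.util.concurrent.TimeUnit;
        import com.sleepycat.je.rep.ReplicatedEnvironment;

        class MoveMasterBack {
            // Ask JE HA to hand mastership to the named (local) replica,
            // waiting up to the given timeout for the transfer to complete.
            static String transferToLocalNode(ReplicatedEnvironment masterEnv,
                                              String localNodeName) {
                return masterEnv.transferMaster(
                    Collections.singleton(localNodeName), 30, TimeUnit.SECONDS);
            }
        }
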
    Lastly, how would you distribute replicas? 2 at one site, 1 at another? What happens if you lose the site with 2? You'll struggle to attain a quorum. You'll have to either wait for the 2 to come back or remove them from the replication group and go on without them. If the site with 2 ever returns, those nodes would have enough for a quorum, so you'll have to inform them not to start up; otherwise you'll issue writes to disconnected members of the replication group. There's an operational issue at stake here, which isn't well served by BDB JE today.
    With three replicas, deploying two locally and one remotely should provide the local performance you described earlier along with a remote backup for use with disaster recovery.

    Note that there are consequences to having this relatively limited number of replicas. The local non-master replica will be able to service read requests, taking some of the load off of the master, but a larger number of local replicas would be able to handle even more read load. In addition, adding more replicas at the remote site would make it a more reliable backup for use in disaster recovery. Have you considered increasing the number of replicas for reasons like these, or would that not fit with your requirements, say because it would involve too much hardware?
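
    As an aside on the read-scaling point, here is a rough sketch of a read-only transaction on a local replica that tolerates a bounded replication lag (the lag and timeout values are just illustrative):

        import java.util.concurrent.TimeUnit;
        import com.sleepycat.je.Transaction;
        import com.sleepycat.je.TransactionConfig;
        import com.sleepycat.je.rep.ReplicatedEnvironment;
        import com.sleepycat.je.rep.TimeConsistencyPolicy;

        class ReplicaRead {
            static Transaction beginReplicaReadTxn(ReplicatedEnvironment replicaEnv) {
                TransactionConfig txnConfig = new TransactionConfig();
                txnConfig.setReadOnly(true);
                // Serve the read locally as long as the replica is no more than
                // two seconds behind the master, waiting up to ten seconds for
                // it to catch up.
                txnConfig.setConsistencyPolicy(
                    new TimeConsistencyPolicy(2, TimeUnit.SECONDS,
                                              10, TimeUnit.SECONDS));
                return replicaEnv.beginTransaction(null, txnConfig);
            }
        }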

    From the other end, have you considered using just two replicas -- one local and one remote -- and using the ReplicationMutableConfig.setDesignatedPrimary method to manage the two node group? I don't mean to minimize the issues associated with two node groups, but just trying to understand your constraints here.
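
    Roughly, on the node that should keep accepting writes when its peer is unreachable (a sketch):

        import com.sleepycat.je.rep.ReplicatedEnvironment;
        import com.sleepycat.je.rep.ReplicationMutableConfig;

        class TwoNodeGroup {
            // In a two-node group, the designated primary may continue to
            // operate (and accept writes) when the other node is down.
            static void makeDesignatedPrimary(ReplicatedEnvironment repEnv) {
                ReplicationMutableConfig mutableConfig = repEnv.getRepMutableConfig();
                mutableConfig.setDesignatedPrimary(true);
                repEnv.setRepMutableConfig(mutableConfig);
            }
        }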

    Would you find it useful to have a mechanism that supports a more automatic failover mode with a replication factor of two? Or is a replication factor of three desirable because it gives you improved read performance, as well as more availability?

    Finally, regarding difficulties involving a loss of quorum, note that JE HA provides the ReplicationMutableConfig.setElectableGroupSizeOverride method as a big red button for changing the size of the replication group if a catastrophic failure has reduced the set of working members in the group to below the quorum. As the documentation warns, this option can produce bad results if the other members of the replication group are not definitively offline, but it is available to handle disaster cases like the one you describe.
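
    For reference, the override is applied on a surviving node roughly like this (a sketch; the value should be the number of nodes that are actually still working):

        import com.sleepycat.je.rep.ReplicatedEnvironment;
        import com.sleepycat.je.rep.ReplicationMutableConfig;

        class QuorumOverride {
            // Big red button: shrink the electable group size so the surviving
            // node(s) can elect a master and accept writes. Only safe when the
            // missing nodes are definitively offline.
            static void overrideGroupSize(ReplicatedEnvironment repEnv, int survivors) {
                ReplicationMutableConfig mutableConfig = repEnv.getRepMutableConfig();
                mutableConfig.setElectableGroupSizeOverride(survivors);
                repEnv.setRepMutableConfig(mutableConfig);
            }
        }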

    - Tim
  • 3. Re: Multi data center Berkeley DB JE?
    858706 Newbie
    Tim Blackman, Oracle wrote:
    When using SSL, are you thinking of using certificate-based authentication or maybe LDAP? Would Kerberos be a possibility in your environment?
    Certificate-based probably works best with our deployment story. Kerberos might be a possibility, but it's yet another dependency to deal with and manage. Certificates would (most likely) be deployed along with our software alongside configuration files. No added dependencies.
    Does your application need acknowledgments from all nodes for writes? If a simple majority is sufficient and you have a quorum of nodes available locally, then specifying ReplicaAckPolicy.SIMPLE_MAJORITY should mean the needed acknowledgments will be supplied by local nodes and the remote connection speed should not interfere. How does that fit with the approach you have been taking?
    This is how we run today. Running SIMPLE_MAJORITY in a multi-DC world is not a huge issue unless the local replica is down, at which point you need the long haul acknowledgement.
    An additional issue to consider here involves the location of the master. You can use the ReplicationMutableConfig.setNodePriority method to reduce the election priority of remote nodes to make them less likely to be elected as masters. In most cases, that means that one of the local nodes will be the master, and writes will take advantage of the low-latency connection to that node.

    In some circumstances, though, it is possible for a remote node to be elected master even if it has a lower priority, in particular due to unexpectedly long network delays. In that case, you can use the DbGroupAdmin utility or the ReplicationGroupAdmin.transferMaster method to transfer the master to one of the local nodes. Note that these facilities are available in release 5.0.58 although we failed to mention them in the changelog.
    I noticed transferMaster the other day while spelunking in the BDB source and wondered how long that had been there. When we were on 4.x I actually wrote some code that tried to do the same thing, but it was clunky and rarely worked. I'm happy to see an official mechanism for doing this. Even without multi-DC, this can help us to balance masters since we run multiple BDB Environments per machine.
    With three replicas, deploying two locally and one remotely should provide the local performance you described earlier along with a remote backup for use with disaster recovery.

    Note that there are consequences to having this relatively limited number of replicas. The local non-master replica will be able to service read requests, taking some of the load off of the master, but a larger number of local replicas would be able to handle even more read load. In addition, adding more replicas at the remote site would make it a more reliable backup for use in disaster recovery. Have you considered increasing the number of replicas for reasons like these, or would that not fit with your requirements, say because it would involve too much hardware?
    We could potentially run 5 nodes…3 in the local datacenter and 2 in the recovery datacenter. That would still provide a quorum (for local acks) in the local datacenter. I'm not too concerned about the amount of hardware, but going from 3 hosts per partition (we partition our data over multiple BDB environments) to 5 hosts per partition will increase the overhead of things like monitoring.
    From the other end, have you considered using just two replicas -- one local and one remote -- and using the ReplicationMutableConfig.setDesignatedPrimary method to manage the two node group? I don't mean to minimize the issues associated with two node groups, but just trying to understand your constraints here.
    We really prefer the automatic failover available with our current 3-way replication. We're at N+1, so if a node falls over at 3am our operations team can look at the page, go back to bed, and deal with it in the morning. With primary/secondary 2-node replication, they can't do that. Losing one node means get your butt out of bed before we lose the other one.
    Would you find it useful to have a mechanism that supports a more automatic failover mode with a replication factor of two? Or is a replication factor of three desirable because it gives you improved read performance, as well as more availability?
    Automation and availability are killer features in our current setup, given how easy they are to enable in BDB. The replication factor is certainly nice as well, but I'd say it comes third to automation and availability.
    Finally, regarding difficulties involving a loss of quorum, note that JE HA provides the ReplicationMutableConfig.setElectableGroupSizeOverride method as a big red button for changing the size of the replication group if a catastrophic failure has reduced the set of working members in the group to below the quorum. As the documentation warns, this option can produce bad results if the other members of the replication group are not definitively offline, but it is available to handle disaster cases like the one you describe.
    Yeah, we're well aware of setElectableGroupSizeOverride. We've resorted to using it in anger before. The big problem is situations where you know the primary datacenter is going to be offline for, say, 24 hours: long enough that you don't want to wait for it to come back but short enough to not be definitively down. There's no good way of signaling to the system before the app starts up (we run these out of upstart on Ubuntu) that the replication group has been rejiggered in its absence, so that the returning nodes don't accidentally start up and reach quorum among just the two of them. We really want that quorum to reject writes or fail outright rather than elect a second master node.

    Has Oracle experimented with the 2 nodes in one DC + 1 node in a geographically distant DC setup? I'm curious to know how the latency in the long haul affects BDB and how well replication can keep up over the long haul during write-heavy times.
  • 4. Re: Multi data center Berkeley DB JE?
    Tim Blackman, Oracle Newbie
    When using SSL, are you thinking of using certificate-based authentication or maybe LDAP? Would Kerberos be a possibility in your environment?
    Certificate-based probably works best with our deployment story. Kerberos might be a possibility, but it's yet another dependency to deal with and manage. Certificates would (most likely) be deployed along with our software alongside configuration files. No added dependencies.
    Right.

    Are you picturing using official (corporate/external) certificates, or would self-signed ones generated by (say) Java keytool be sufficient?
    Does your application need acknowledgments from all nodes for writes? If a simple majority is sufficient and you have a quorum of nodes available locally, then specifying ReplicaAckPolicy.SIMPLE_MAJORITY should mean the needed acknowledgments will be supplied by local nodes and the remote connection speed should not interfere. How does that fit with the approach you have been taking?
    This is how we run today. Running SIMPLE_MAJORITY in a multi-DC world is not a huge issue unless the local replica is down, at which point you need the long haul acknowledgement.
    Right.
    An additional issue to consider here involves the location of the master. You can use the ReplicationMutableConfig.setNodePriority method to reduce the election priority of remote nodes to make them less likely to be elected as masters. In most cases, that means that one of the local nodes will be the master, and writes will take advantage of the low-latency connection to that node.
    So you're using node priority in your setup?
    In some circumstances, though, it is possible for a remote node to be elected master even if it has a lower priority, in particular due to unexpectedly long network delays. In that case, you can use the DbGroupAdmin utility or the ReplicationGroupAdmin.transferMaster method to transfer the master to one of the local nodes. Note that these facilities are available in release 5.0.58 although we failed to mention them in the changelog.
    I noticed transferMaster the other day while spelunking in the BDB source and wondered how long that had been there.
    :-) We'll add a belated entry to the changelog in an upcoming release.

    Note that earlier JE 5 releases had a preliminary version of some of the transferMaster infrastructure, but there were some issues that didn't get worked out until 5.0.58.
    When we were on 4.x I actually wrote some code that tried to do the same thing, but it was clunky and rarely worked. I'm happy to see an official mechanism for doing this. Even without multi-DC, this can help us to balance masters since we run multiple BDB Environments per machine.
    Right, NoSQL DB is using it for that as well.

    The NoSQL DB implementation works to produce and maintain a good balance of masters across the physical nodes, to more evenly distribute the workload. We also fixed some tricky edge cases in the NoSQL DB code involving managing the opening and closing of JE environments, in case you want to have a look.

    We'd be interested to hear about your experience with this feature when you have a chance to try it.
    With three replicas, deploying two locally and one remotely should provide the local performance you described earlier along with a remote backup for use with disaster recovery.
    Note that there are consequences to having this relatively limited number of replicas. The local non-master replica will be able to service read requests, taking some of the load off of the master, but a larger number of local replicas would be able to handle even more read load. In addition, adding more replicas at the remote site would make it a more reliable backup for use in disaster recovery. Have you considered increasing the number of replicas for reasons like these, or would that not fit with your requirements, say because it would involve too much hardware?
    We could potentially run 5 nodes…3 in the local datacenter and 2 in the recovery datacenter. That would still provide a quorum (for local acks) in the local datacenter.
    Right.
    I'm not too concerned about the amount of hardware, but going from 3 hosts per partition (we partition our data over multiple BDB environments) to 5 hosts per partition will increase the overhead of things like monitoring.
    OK.

    I should also have mentioned that increasing the number of replicas would also increase the load on the master, since each replica needs its own replication stream from the master.
    From the other end, have you considered using just two replicas -- one local and one remote -- and using the ReplicationMutableConfig.setDesignatedPrimary method to manage the two node group? I don't mean to minimize the issues associated with two node groups, but just trying to understand your constraints here.
    We really prefer the automatic failover available with our current 3-way replication. We're at N+1, so if a node falls over at 3am our operations team can look at the page, go back to bed, and deal with it in the morning. With primary/secondary 2-node replication, they can't do that. Losing one node means get your butt out of bed before we lose the other one.
    Sorry, I don't follow your "N+1" comment. Do you mean that you aren't as concerned as in the RF=2 case because, even after a first failure, you can still tolerate a second failure without losing data? A second failure would still mean a disruption in service -- is that less important because you can recover from it?
    Would you find it useful to have a mechanism that supports a more automatic failover mode with a replication factor of two? Or is a replication factor of three desirable because it gives you improved read performance, as well as more availability?
    Automation and availability are killer features in our current setup, given how easy they are to enable in BDB. The replication factor is certainly nice as well, but I'd say it comes third to automation and availability.
    So, just to spell this out, if two-way replication was arbitrated in a way so that failover is completely automatic, as it is with three-way replication, would you still prefer three-way replication because of the greater security and performance of having multiple copies of data?
    Finally, regarding difficulties involving a loss of quorum, note that JE HA provides the ReplicationMutableConfig.setElectableGroupSizeOverride method as a big red button for changing the size of the replication group if a catastrophic failure has reduced the set of working members in the group to below the quorum. As the documentation warns, this option can produce bad results if the other members of the replication group are not definitively offline, but it is available to handle disaster cases like the one you describe.
    Yeah, we're well aware of setElectableGroupSizeOverride. We've resorted to using it in anger before. The big problem is situations where you know the primary datacenter is going to be offline for, say, 24 hours: long enough that you don't want to wait for it to come back but short enough to not be definitively down. There's no good way of signaling to the system before the app starts up (we run these out of upstart on Ubuntu) that the replication group has been rejiggered in its absence, so that the returning nodes don't accidentally start up and reach quorum among just the two of them. We really want that quorum to reject writes or fail outright rather than elect a second master node.
    Yes, this is a very tricky business.

    We'll keep thinking about this.
    Has Oracle experimented with the 2 nodes in one DC + 1 node in a geographically distant DC setup? I'm curious to know how the latency in the long haul affects BDB and how well replication can keep up over the long haul during write-heavy times.
    No, we haven't done this kind of testing yet, but it's something we'd like to do.

    - Tim
