Unexplained replication disk usage leading to running out of disk space

mgrandi
mgrandi Member Posts: 18
edited Jun 17, 2016 6:52PM in Berkeley DB Java Edition

First off, I appreciate the time you guys spend answering questions in this forum; it's quite gracious of you to offer support for non-paying users!

Anyway, yesterday we had a production issue where 2 of the 3 servers in our replication group ran out of disk space because BDB JE suddenly chewed through all of it. I have been looking for the cause for the past day and I don't think I'm making any progress.

The setup is 1 static master and 2 replicas. (The master uses the designated primary / node priority system to ensure it is the only server that will ever be elected master, and the replicas won't ever become the master. This feels very hacky, but since we are using a proxy to direct write requests to a specific server, I'm guessing that when this application was written a few years ago they thought it would be better to just have a static master rather than set up a system to inform the proxy of which node is currently the master.) We also have a couple of development instances that occasionally connect to the replication group to catch up, but these are not online 100% of the time and can go days/weeks without being connected to the group.
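
For reference, here is a minimal sketch of how that fixed-priority arrangement can be expressed with the JE HA API; the group name, node names, and host:port values are placeholders rather than our real configuration:

import com.sleepycat.je.rep.ReplicationConfig;

public class StaticMasterConfigs {

    /** HA config for the node we always want to win elections. */
    static ReplicationConfig masterConfig() {
        ReplicationConfig cfg =
            new ReplicationConfig("PlanetGroup", "master", "master-host:5001");
        cfg.setHelperHosts("master-host:5001");
        cfg.setNodePriority(2);          // highest priority in the group
        cfg.setDesignatedPrimary(true);  // designated primary, as in our setup
        return cfg;
    }

    /** HA config for a replica that must never be elected master. */
    static ReplicationConfig replicaConfig(String nodeName, String hostPort) {
        ReplicationConfig cfg =
            new ReplicationConfig("PlanetGroup", nodeName, hostPort);
        cfg.setHelperHosts("master-host:5001");
        cfg.setNodePriority(0);          // priority 0: never electable
        return cfg;
    }
}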

Looking at the graphs of disk usage over the past month (and year), I see that every server's disk usage for the BDB directory went up 20 GB around May 25. Yesterday, however, the replicas went up an additional 20 GB, almost doubling the size of the database, and ran out of disk space. We don't add a lot of new data to the database per day, as evidenced by the year-long disk usage graph, but on May 25th and yesterday the disk usage just shot out of control.

Even with the default values for replayCostPercent and replayFreeDiskPercent, both of the replica servers ended up running out of disk space. I put a je.properties file in the environment home directory with replayCostPercent = 0, and that seemed to help one of the replicas, but the other one dropped to the expected database size and then shot up again. I ended up just deleting the environment directory on that replica and having it restore itself via a NetworkRestore, and after that the second replica seems relatively stable.
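
(For anyone reading along: the same overrides can also be set programmatically before the ReplicatedEnvironment is opened, instead of via je.properties. A minimal sketch, with placeholder group/node/host values:)

import com.sleepycat.je.rep.ReplicationConfig;

public class ReplayRetentionOverride {

    /** Equivalent of "je.rep.replayCostPercent = 0" in je.properties. */
    static ReplicationConfig build() {
        // These parameters are not mutable, so they must be set before the
        // ReplicatedEnvironment is opened. Names and ports are placeholders.
        ReplicationConfig cfg =
            new ReplicationConfig("PlanetGroup", "slave-sidecar2", "replica-host:5001");
        cfg.setConfigParam(ReplicationConfig.REPLAY_COST_PERCENT, "0");
        cfg.setConfigParam(ReplicationConfig.REPLAY_FREE_DISK_PERCENT, "10");
        return cfg;
    }
}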

A coworker suggested that, because of our development instances, the group might be holding on to extra data to replicate to those instances, but I removed those nodes using DbGroupAdmin and it didn't seem to have an immediate effect. Looking into the BDB JE source code where those log messages are created, I have a hunch that the global CBVLSN was dragged down by the development instances that were not online 100% of the time, and therefore all of the replicas and the master had to save extra data to support replicating to them?
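
For completeness, a sketch of that removal using ReplicationGroupAdmin, the API counterpart of the DbGroupAdmin utility; the group name, helper host, and node name are placeholders:

import java.net.InetSocketAddress;
import java.util.Collections;
import java.util.Set;

import com.sleepycat.je.rep.util.ReplicationGroupAdmin;

public class RemoveDevNode {

    public static void main(String[] args) {
        // The helper address must point at a live member of the group.
        Set<InetSocketAddress> helpers =
            Collections.singleton(new InetSocketAddress("master-host", 5001));
        ReplicationGroupAdmin admin =
            new ReplicationGroupAdmin("PlanetGroup", helpers);

        // Permanently remove the named node so its stale replication position
        // can no longer hold back log cleaning on the remaining members.
        admin.removeMember("dev-node-1");
    }
}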

I'm just confused about what exactly is happening here. Why were all the servers holding on to so much extra data, to the point that they would rather keep this data than prevent an "out of space" exception?

Looking in the je.info logs, I see a bunch of these:

(This is the second replica, the one whose environment directory I had to clear.)

In May I see a bunch of these:

2016-05-25 14:58:28.502 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299

2016-05-25 14:58:28.507 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,609,723

2016-05-25 14:58:28.507 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-05-25 14:58:28.508 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299

2016-05-25 15:00:07.099 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299

2016-05-25 15:00:07.104 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,662,859

2016-05-25 15:00:07.105 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-05-25 15:00:07.105 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299

2016-05-25 15:02:01.782 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299

2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,726,863

2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299

2016-05-25 15:03:37.099 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299

2016-05-25 15:03:37.105 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,781,504

2016-05-25 15:03:37.106 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-05-25 15:03:37.106 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299

---------------------------

Then yesterday:

2016-06-08 23:39:44.609 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee

2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Replication prevents deletion of 39 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,471

2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Cleaner has 39 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee

2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,474

2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,474

2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,479

2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:59.835 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,482

2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,485

2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:

Here is the disk graph for replica2. (It drops when I set replayCostPercent to 0, but then it jumps back up again. The stair-step pattern is from me clearing the environment directory and doing a network restore.)

replica1.png

Disk graph for replica1: (the drop seems to be correlated with when I set the replayCostPercent property to 0)

replica2.png

Best Answer

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 14, 2016 6:05PM Accepted Answer

    No, I don't think that's related. In JE 6.0.5, the REPLAY_COST_PERCENT and REPLAY_FREE_DISK_PERCENT parameters were added, and these also allow retention of files for replicas, but without the risk of creating an out-of-disk condition. That change would likely have increased disk usage, but would not have caused out-of-disk.


    I think the problem is the 24 hour REP_STREAM_TIMEOUT value. We intended to change it from 24 hours to 30 minutes in JE 6.0.5, but the code change was accidentally omitted. The 24 hour value has been known to cause out-of-disk problems, and until now we had been assuming that we fixed this by changing the default to 30 minutes.


    We apologize for the confusion, and the trouble caused by the long timeout interval.
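
    (As a hedged workaround sketch, not an official recommendation: until a release with the corrected default ships, the timeout can be overridden explicitly before opening the environment; the group, node, and host:port values below are placeholders.)

    import com.sleepycat.je.rep.ReplicationConfig;

    public class StreamTimeoutOverride {

        /** Pin the replication stream timeout to 30 minutes explicitly. */
        static ReplicationConfig build() {
            // REP_STREAM_TIMEOUT is not mutable, so set it before the
            // environment is opened. Names and ports are placeholders.
            ReplicationConfig cfg =
                new ReplicationConfig("PlanetGroup", "slave-sidecar1", "replica-host:5001");
            cfg.setConfigParam(ReplicationConfig.REP_STREAM_TIMEOUT, "30 min");
            return cfg;
        }
    }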


    --mark


Answers

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 10, 2016 7:39PM

    Yes, your hunch about the development nodes that occasionally connect is very likely to be correct.

    We recently improved the javadoc a little for replayCostPercent and replayFreeDiskPercent. I'm pasting it below, and the bolded part (which was added) may help to explain.

    /**
    * The cost of replaying the replication stream as compared to the cost of
    * performing a network restore, represented as a percentage. Specifies
    * the relative cost of using a log file as the source of transactions to
    * replay on a replica as compared to using the file as part of a network
    * restore. This parameter is used to determine whether a cleaned log file
    * that could be used to support replay should be removed because a network
    * restore would be more efficient. The value is typically larger than
    * 100, to represent that replay is usually more expensive than network
    * restore for a given amount of log data due to the cost of replaying
    * transactions. If the value is 0, then the parameter is disabled, and no
    * log files will be retained based on the relative costs of replay and
    * network restore.
    *
    * <p>Note that log files are always retained if they are known to be
    * needed to support replication for electable replicas that have been in
    * contact with the master within the {@link #REP_STREAM_TIMEOUT} period,
    * or by any replica currently performing replication. This parameter only
    * applies to the retention of additional files that might be useful to
    * secondary nodes that are out of contact, or to electable nodes that have
    * been out of contact for longer than REP_STREAM_TIMEOUT.</p>
    *
    * <p>To disable the retention of these additional files, set this
    * parameter to zero.</p>
    *
    * <p><table border="1">
    * <tr><td>Name<td>Type<td>Mutable<td>Default<td>Minimum<td>Maximum</tr>
    * <tr><td>{@value}<td>Integer<td>No<td>150<td>0<td>200</tr>
    * </table>
    *
    * @see #REPLAY_FREE_DISK_PERCENT
    */
    public static final String REPLAY_COST_PERCENT = "je.rep.replayCostPercent";


    /**
    * The amount of free disk space that should be maintained when deciding
    * whether to retain log files for use in replaying the replication stream,
    * specified as a percentage of the total disk space. This parameter
    * prevents the retention of log files as determined by the {@link
    * #REPLAY_COST_PERCENT} parameter if retaining the files would reduce free
    * space below the specified percentage. If the value is 0, then the
    * parameter is disabled, and decisions about which log files to remove
    * will not consider the amount of free disk space.
    *
    * <p>Note that log files are always retained if they are known to be
    * needed to support replication for electable replicas that have been in
    * contact with the master within the {@link #REP_STREAM_TIMEOUT} period,
    * or by any replica currently performing replication. This parameter only
    * applies to the retention of additional files that might be useful to
    * secondary nodes that are out of contact, or to electable nodes that have
    * been out of contact for longer than REP_STREAM_TIMEOUT.</p>
    *
    * <p>To disable the retention of these additional files, set {@link
    * #REPLAY_COST_PERCENT} to zero.</p>
    *
    * <p><table border="1">
    * <tr><td>Name<td>Type<td>Mutable<td>Default<td>Minimum<td>Maximum</tr>
    * <tr><td>{@value}<td>Integer<td>No<td>10<td>0<td>99</tr>
    * </table>
    *
    * @see #REPLAY_COST_PERCENT
    */
    public static final String REPLAY_FREE_DISK_PERCENT = "je.rep.replayFreeDiskPercent";

    If one of your dev nodes is online, and then taken offline, files will be retained by the other nodes for 30 minutes (the default for REP_STREAM_TIMEOUT) with the hope that the offline node will come back up. These files are retained regardless of whether it might cause the disk to fill, as you said. Could this explain it?

    If you need to connect your dev nodes occasionally, I suggest using secondary nodes (NodeType.SECONDARY). Data files won't be retained by the other (electable) nodes for the sake of the off-line secondary nodes, if this would cause the disk to fill. So it seems like a good fit.
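
    A minimal sketch of that configuration for a dev node (group, node, and host:port values are placeholders):

    import com.sleepycat.je.rep.NodeType;
    import com.sleepycat.je.rep.ReplicationConfig;

    public class SecondaryDevNodeConfig {

        static ReplicationConfig build() {
            // SECONDARY nodes receive the replication stream but do not vote,
            // and electable nodes will not retain log files on their behalf
            // if doing so would fill the disk.
            ReplicationConfig cfg =
                new ReplicationConfig("PlanetGroup", "dev-node-1", "dev-host:5001");
            cfg.setHelperHosts("master-host:5001");
            cfg.setNodeType(NodeType.SECONDARY);
            return cfg;
        }
    }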

    See:

    Secondary Nodes

    --mark

  • mgrandi
    mgrandi Member Posts: 18
    edited Jun 10, 2016 9:06PM

    Background: our application has a lot of activity during normal working hours (9-5, essentially); at night a few cron jobs run, but nothing too intensive, until the next morning at 9 am when the cycle starts again.

    So, from what I understand: since there was enough free disk space left (as dictated by REPLAY_FREE_DISK_PERCENT), it kept files around for both REP_STREAM_TIMEOUT and REPLAY_COST_PERCENT. The files held for REP_STREAM_TIMEOUT would become eligible for deletion after the timeout expired, but if BDB JE decided they were worth saving because of REPLAY_COST_PERCENT, files accumulated until it was running out of free space (less than REPLAY_FREE_DISK_PERCENT), hence the log messages saying "Replication prevents deletion of 21 files by the Cleaner to support replication using free space available beyond the 10% free space requirement". What I'm confused about is why, when the disk space started getting low, it didn't start deleting the files it had kept around for REPLAY_COST_PERCENT. It seems to me, since the documentation of REPLAY_FREE_DISK_PERCENT says:


    The amount of free disk space that should be maintained when deciding whether to retain log files for use in replaying the replication stream, specified as a percentage of the total disk space. This parameter prevents the retention of log files as determined by the REPLAY_COST_PERCENT parameter if retaining the files would reduce free space below the specified percentage.


    why was disk space not reclaimed when space started getting low? Was it because the disk dropped below the REPLAY_FREE_DISK_PERCENT threshold during normal hours, when there was a lot of write/read activity, and the files being retained for REP_STREAM_TIMEOUT were bigger than the ones being deleted to keep free disk space at REPLAY_FREE_DISK_PERCENT?


    It just seems that when you say REPLAY_FREE_DISK_PERCENT is 10%, free space shouldn't drop below that threshold unless there is no other choice and the complete size of the database, with no extra files, is getting too big. But that was not the case here, and it seems like an oversight in the logic between these 3 parameters that will cause a disk to fill up and the application to crash, even when you have said in the configuration that BDB JE should keep REPLAY_FREE_DISK_PERCENT free. I understand it was an oversight on our part not to make the developer nodes secondary nodes, but that extra data should only be around for 30 minutes, and as you can see in the monthly disk usage graph, the disk usage grew over a period of a month.


    ~Mark

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 10, 2016 10:11PM

    If a replica (one of your dev nodes) is off-line, files will be saved by the other nodes for 30 minutes. This happens without regard to REPLAY_FREE_DISK_PERCENT, REPLAY_COST_PERCENT, or whether you'll run out of disk. In other words, REPLAY_FREE_DISK_PERCENT and REPLAY_COST_PERCENT are ignored when it comes to retaining files for the 30-minute period. I don't think I can reconstruct exactly what happened in your case, but this seems like the most likely cause.

    If you use secondary nodes for your dev nodes, or at least remove them from the group when they are taken off-line, that will eliminate this factor as a possible cause. I suggest doing this, and if the problem happens again, we can take another look.

    Also, in an upcoming release (due to ship during the first half of July), we have added more information to the log messages about retained files, to make understanding this a little easier.

    --mark

  • mgrandi
    mgrandi Member Posts: 18
    edited Jun 13, 2016 5:04PM

    Understood.

    However, to throw yet another wrench in the works, I just had to shut down our application on REPLICA 1, because it ran out of disk space. The disk graph:

    screenshot_197.png

    screenshot_198.png

    the current je.properties overrides:

    je.rep.replayFreeDiskPercent = 15

    je.rep.replayCostPercent = 0

    and there are also no 'development' nodes in the replication group (those were all removed last Wednesday, when this happened the first time), so it shouldn't be holding on to extra data. And this only happened to this replica; the other two are fine, holding at around 51 GB of data.

    This is what I meant when I said I felt the developer node issue was a problem, but not the cause of THIS problem. Why is it running out of space if all of the nodes are online? The only things we've really changed in the years this application has been around are upgrading it from JE 5.x to the latest version (6.4.25) and raising the log file size to 1 GB from the default of 10 MB (because it was a pain to back up a ton of tiny files).
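
    (Side note on the sizing: the cleaner messages above count whole log files, so with 1 GB files, "40 files not deleted because they are protected by replication" works out to roughly 40 GB of held space. A minimal sketch of the file-size override, assuming it was applied via EnvironmentConfig.LOG_FILE_MAX / je.log.fileMax, with the value in bytes:)

    import com.sleepycat.je.EnvironmentConfig;

    public class LogFileSizeConfig {

        static EnvironmentConfig build() {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setTransactional(true);  // replicated environments must be transactional
            // 1 GB log files instead of the 10 MB default; the je.properties
            // form would be "je.log.fileMax = 1073741824" (bytes).
            envConfig.setConfigParam(EnvironmentConfig.LOG_FILE_MAX, "1073741824");
            return envConfig;
        }
    }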

    The BDB info logs from today:  https://dl.dropboxusercontent.com/u/962389/je_info_june13_sidecar1.zip

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 13, 2016 4:43PM

    Please post the messages in the log before/after the time of the disk filling.

    --mark

  • mgrandi
    mgrandi Member Posts: 18
    edited Jun 13, 2016 5:09PM

    I updated my earlier post to include the log messages, but maybe I edited it after you replied. Here are all of the log messages (going way back, to May 1st):

    https://dl.dropboxusercontent.com/u/962389/je_logs_june13_sidecar1.tgz

    I also added an image with a zoomed-in view of the disk graph so it is easier to correlate the je.info.* log messages with the disk usage. The disk graph is in the MST timezone, so it's -7 hours from the log messages in je.info.*.

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 13, 2016 5:39PM

    2016-06-13 16:49:13.974 UTC INFO [slave-sidecar1] Replication prevents deletion of 48 files by the Cleaner to support replication by node slave-sidecar2, last updated 2016-06-13 16:09:28.027 UTC, 2371 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0x95857 holds CBVLSN 7,724,522,192, end file=0x95892 holds last VLSN 7,725,719,877

    2016-06-13 16:49:13.974 UTC INFO [slave-sidecar1] Files chosen for deletion by HA:

    [snip]

    2016-06-13 16:49:15.566 UTC SEVERE [slave-sidecar1] checkpointId=260924

    com.sleepycat.je.LogWriteException: Environment invalid because of previous exception: (JE 6.4.25) slave-sidecar1(5):/home/svcs/security/data/bdb java.io.IOException: No space left on device LOG_WRITE: IOException on write, log is [crash]

    This is saying that slave-sidecar2 is preventing file deletion, because it was last updated less than REP_STREAM_TIMEOUT ago. It looks like slave-sidecar2 has been down for around 40 minutes. This is the cause of the problem.


    However, 40 minutes is suspiciously more than the 30 minute value I mentioned earlier, and in fact it looks like the default value is 24 hours. This is either a code or doc bug -- let me check with others and get back to you.


    --mark

  • mgrandi
    mgrandi Member Posts: 18
    edited Jun 13, 2016 6:16PM

    The replica slave-sidecar2 is not down; we have monitors for that. Here are the je.* files for sidecar2, whose logs show that it has been running ever since I cleared out the environment directory and had it perform a network restore on June 8th (last Wednesday):

    https://dl.dropboxusercontent.com/u/962389/je_info_files_june13_sidecar2.tgz

    If those messages mean that it thinks those nodes are down, then something else is wrong. When this happened on June 8th, all 3 nodes were up, and one of the replicas (sidecar2 in this case) was complaining that it was holding data for the master:

    2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef

    2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,479


    And yet our own heartbeat check (the master writes a datetime value to the database and the replicas read it; if that value is older than a certain amount of time, we get notified) said that sidecar2 did not disconnect from the master at all until we started having problems (and ran out of disk space).
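
    (For context, a simplified, hypothetical sketch of what that heartbeat check does; the key name, value encoding, and Database handle wiring are stand-ins for our actual application code:)

    import java.nio.charset.StandardCharsets;

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class HeartbeatCheck {

        private static final DatabaseEntry KEY =
            new DatabaseEntry("heartbeat".getBytes(StandardCharsets.UTF_8));

        /** Master side: record the current time (auto-commit write). */
        static void writeHeartbeat(Database db) {
            byte[] now = Long.toString(System.currentTimeMillis())
                .getBytes(StandardCharsets.UTF_8);
            db.put(null, KEY, new DatabaseEntry(now));
        }

        /** Replica side: true if the replicated heartbeat is older than maxAgeMs. */
        static boolean isStale(Database db, long maxAgeMs) {
            DatabaseEntry value = new DatabaseEntry();
            if (db.get(null, KEY, value, LockMode.DEFAULT) != OperationStatus.SUCCESS) {
                return true;  // no heartbeat record yet
            }
            long written = Long.parseLong(
                new String(value.getData(), StandardCharsets.UTF_8));
            return System.currentTimeMillis() - written > maxAgeMs;
        }
    }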

  • mgrandi
    mgrandi Member Posts: 18
    edited Jun 13, 2016 6:11PM

    It also seems that REP_STREAM_TIMEOUT's default value is indeed 24 hours:

    /**
    * The maximum amount of time the replication group guarantees preservation
    * of the log files constituting the replication stream. After this period
    * of time, nodes are free to do log cleaning and to remove log files
    * earlier than this period. If a node has crashed and does not re-join the
    * group within this timeout period it may need to perform a network
    * restore operation to catch up.
    */
    public static final DurationConfigParam REP_STREAM_TIMEOUT =

      new DurationConfigParam(ReplicationConfig.REP_STREAM_TIMEOUT,
      null,  // min
      null,  // max
      "24 h",  // default
      false,  // mutable
      true);


    versus 30 minutes in the documentation: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/ReplicationConfig.html#REP_STREAM_TIMEOUT

  • Greybird-Oracle
    Greybird-Oracle Member Posts: 2,690
    edited Jun 13, 2016 6:15PM

    Yes, I know. I'm checking with others to see if this is a doc or code bug. I believe it's a code bug. I'm learning a little as we go here also, since I primarily work on the JE storage engine rather than HA.

This discussion has been closed.