First off, I appreciate the time you guys spend answering questions in this forum; it's quite gracious of you to offer support for non-paying users!
Anyway, yesterday we had a production issue where 2 of the 3 servers in our replication group ran out of disk space because BDB JE suddenly chewed through all of it. I've been looking for the cause for the past day and I don't think I'm making any progress.
The setup is 1 static master and 2 replicas. The master uses the designated primary / node priority system to ensure it is the only server that will ever be elected master, so the replicas can never take over (a rough sketch of the config is below). This feels very hacky, but since we use a proxy to direct write requests to a specific server, I'm guessing that when this application was written a few years ago it seemed better to just have a static master rather than set up a system to inform the proxy of which node is currently the master. We also have a couple of development instances that occasionally connect to the replication group to catch up, but these are not online 100% of the time and can go days or weeks without being connected to the group.
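The master-pinning config looks roughly like this (a minimal sketch from memory, not the exact production code; group name, node names, ports and paths are made up):

    import java.io.File;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.rep.ReplicatedEnvironment;
    import com.sleepycat.je.rep.ReplicationConfig;

    ReplicationConfig repConfig =
        new ReplicationConfig("ourGroup", "master", "master-host:5001");
    repConfig.setHelperHosts("master-host:5001");
    // Mark this node as the designated primary and give it the highest
    // election priority; the replicas are started with priority 0 so they
    // can never win an election.
    repConfig.setDesignatedPrimary(true);
    repConfig.setNodePriority(100);

    EnvironmentConfig envConfig = new EnvironmentConfig();
    envConfig.setAllowCreate(true);
    envConfig.setTransactional(true);

    ReplicatedEnvironment env = new ReplicatedEnvironment(
        new File("/data/bdb/env"), repConfig, envConfig);

On the replicas the same code runs with setDesignatedPrimary(false) and setNodePriority(0).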
Looking at the graphs of disk usage over the past month (and year), I see that every server's disk usage for the BDB directory went up about 20 GB around May 25. Yesterday, however, the replicas went up an additional 20 GB, almost doubling the size of the database, and ran out of disk space. We don't add much new data to the database per day, as the year-long disk usage graph shows, but on May 25 and yesterday the disk usage just shot out of control.
Even with the default values for replayCostPercent and replayFreeDiskPercent, both replica servers ended up running out of disk space. I put a je.properties file in the environment home directory with replayCostPercent = 0, and that seemed to help one of the replicas; the other dropped to the expected database size and then shot up again. I ended up having to delete that replica's environment directory and let it restore itself via a NetworkRestore, and since then the second replica seems relatively stable. (Roughly what I did is sketched below.)
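For concreteness, the override was just the one property (I believe the full key is je.rep.replayCostPercent; the free-disk threshold mentioned in the log messages would be je.rep.replayFreeDiskPercent, which I left at its default):

    # je.properties in the environment home directory
    je.rep.replayCostPercent=0

And the restore of the broken replica followed the standard pattern, more or less (a sketch, not our exact code; envHome, repConfig and envConfig are whatever we normally open the environment with):

    import com.sleepycat.je.rep.InsufficientLogException;
    import com.sleepycat.je.rep.NetworkRestore;
    import com.sleepycat.je.rep.NetworkRestoreConfig;
    import com.sleepycat.je.rep.ReplicatedEnvironment;

    ReplicatedEnvironment env;
    try {
        env = new ReplicatedEnvironment(envHome, repConfig, envConfig);
    } catch (InsufficientLogException ile) {
        // Starting from an emptied directory, the node can't sync up from
        // its own log, so copy the log files over from another group member.
        NetworkRestore restore = new NetworkRestore();
        NetworkRestoreConfig nrConfig = new NetworkRestoreConfig();
        nrConfig.setRetainLogFiles(false); // don't keep the obsolete local files
        restore.execute(ile, nrConfig);
        env = new ReplicatedEnvironment(envHome, repConfig, envConfig);
    }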
A coworker suggested that maybe, because of our development instances, the group might be holding on to extra data for replication to those instances. I removed those nodes using DbGroupAdmin (roughly as shown below), but it didn't seem to have an immediate effect. Looking at the BDB JE source code where those log messages are created, my hunch is that the global CBVLSN was dragged down by the development instances that were not online 100% of the time, so all of the replicas and the master had to keep extra data around to support replicating to them?
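The removal was done through the DbGroupAdmin API, something like this (group name, helper address, and node names are placeholders, not our real ones):

    import com.sleepycat.je.rep.util.DbGroupAdmin;
    import java.net.InetSocketAddress;
    import java.util.Collections;

    DbGroupAdmin admin = new DbGroupAdmin(
        "ourGroup",
        Collections.singleton(new InetSocketAddress("master-host", 5001)));
    // Remove each development instance from the group membership.
    admin.removeMember("dev-instance-1");
    admin.removeMember("dev-instance-2");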
I'm just confused about what exactly is happening here. Why were all the servers holding on to so much extra data, to the point that they would rather hold on to it than prevent an "out of space" exception?
Looking in the je.info logs on the second replica (the one where I had to clear the environment directory), I see a bunch of entries like these.
In May:
2016-05-25 14:58:28.502 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299
2016-05-25 14:58:28.507 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,609,723
2016-05-25 14:58:28.507 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-05-25 14:58:28.508 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299
2016-05-25 15:00:07.099 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299
2016-05-25 15:00:07.104 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,662,859
2016-05-25 15:00:07.105 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-05-25 15:00:07.105 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299
2016-05-25 15:02:01.782 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299
2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,726,863
2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-05-25 15:02:01.788 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299
2016-05-25 15:03:37.099 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa0284-0xa0299
2016-05-25 15:03:37.105 UTC INFO [slave-sidecar2] Replication prevents deletion of 22 files by the Cleaner to support replication using free space available beyond the 10% free space requirement specified by the ReplicationConfig.REPLAY_FREE_DISK_PERCENT parameters. Start file=0xa0284 holds CBVLSN 7,542,288,376, end file=0xa029b holds last VLSN 7,545,781,504
2016-05-25 15:03:37.106 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-05-25 15:03:37.106 UTC INFO [slave-sidecar2] Cleaner has 22 files not deleted because they are protected by replication. Files: 0xa0284-0xa0299
---------------------------
Then yesterday:
2016-06-08 23:39:44.609 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee
2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Replication prevents deletion of 39 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,471
2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-06-08 23:39:44.610 UTC INFO [slave-sidecar2] Cleaner has 39 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee
2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,474
2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-06-08 23:39:47.629 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5177 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03f9 holds last VLSN 7,687,374,474
2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-06-08 23:39:51.243 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,479
2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-06-08 23:39:55.618 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:59.835 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,482
2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
2016-06-08 23:39:59.836 UTC INFO [slave-sidecar2] Cleaner has 40 files not deleted because they are protected by replication. Files: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Candidates for deletion: 0xa03c4-0xa03e2 0xa03e4-0xa03e8 0xa03ea-0xa03eb 0xa03ee-0xa03ef
2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Replication prevents deletion of 40 files by the Cleaner to support replication by node master, last updated 2016-06-08 22:12:39.443 UTC, 5232 seconds before the most recently updated node, and less than the 86400 seconds timeout specified by the ReplicationConfig.REP_STREAM_TIMEOUT parameter. Start file=0xa03c4 holds CBVLSN 7,687,370,481, end file=0xa03fa holds last VLSN 7,687,374,485
2016-06-08 23:40:03.366 UTC INFO [slave-sidecar2] Files chosen for deletion by HA:
Here are the disk graphs for replica2. (It drops when I set replayCostPercent to 0, but then it jumps back up again. The stair-stepping pattern is me clearing the environment directory and doing a network restore.)

Disk graph for replica1 (the drop seems to correlate with when I set the replayCostPercent property to 0):
