3 Replies Latest reply: Jul 11, 2012 10:02 AM by Linda Lee RSS

    some questions about BDB JE HA

    947780
      Hi
      I'm running a Group of BDB HA replication Sets

      but I always found some WARNING logs in je.info.0 file

      120706 14:23:07:002 WARNING [node1] Exception during read: Connection reset by peer

      My Questions are:
      1. what may cause this problems?
      2. Is this WARNING LOG can cause "WARNING [node1] Cleaner has 130 files not deleted because they are protected by replication." problem?
      3. How can I relieve it? does it associate with "RepParams.REPLICA_ACK_TIMEOUT" property ? my current value is "repConfig.setReplicaAckTimeout(60, TimeUnit.SECONDS);" .


      thanks.

      Edited by: user10473325 on 2012-7-6 上午12:18
        • 1. Re: some questions about BDB JE HA
          sjh
          Jim, Wang wrote:
          Hi
          I'm running a Group of BDB HA replication Sets

          but I always found some WARNING logs in je.info.0 file

          120706 14:23:07:002 WARNING [node1] Exception during read: Connection reset by peer
          >
          My Questions are:
          1. what may cause this problems?
          This message means that node1 lost its network connection to a node it was communicating with. Such network interruptions, if they are of a short duration and infrequent, should not have an adverse impact. However, if you see such messages very frequently in the logs, it could indicate there is a network issue on the path from node1 to these other nodes. The other log messages around this time may contain additional information about the problem.
          2. Is this WARNING LOG can cause "WARNING [node1] Cleaner has 130 files not deleted because they are protected by replication." problem?
          If one of the nodes in the replication group is down, or not reachable due to network issues like the ones you mentioned above, then it will prevent the log cleaner from deleting log files and provoke the warnings you see in the logs.
          3. How can I relieve it? does it associate with "RepParams.REPLICA_ACK_TIMEOUT" property ? my current value is "repConfig.setReplicaAckTimeout(60, TimeUnit.SECONDS);" .
          The first step would be to ensure that all the nodes in the replication group are up and that they can communicate with each other over the network. You can use JE's DbPing utility described at http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/util/DbPing.html#main for this purpose.

          The ReplicationConfig.REPLICA_ACK_TIMEOUT parameter determines how long the Transaction.commit(com.sleepycat.je.Durability) operation on the Master will wait for a sufficient number of acknowledgements from the Replicas and is quite likely unrelated to the issue described above. The default time out value of 5 seconds is adequate for most applications. If you need higher values, it may mean that the machines are overloaded and cannot acknowledge commit requests in a timely manner. The following link: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/ReplicationConfig.html#REPLICA_ACK_TIMEOUT, describes the use of the parameter in detail.

          Regards

          Sam
          • 2. Re: some questions about BDB JE HA
            947780
            hi Sam .

            thanks a lot for your reply.

            Here are my addtional questions:

            first of all, like you said. "it will prevent the log cleaner from deleting log files due to network issues"

            in my enviroument . the exception: "connection reset by peer" emerged frequently..


            120709 11:59:21:702 WARNING [node1] Exception during read: Connection reset by peer
            120709 15:39:34:683 WARNING [node1] Exception during read: Connection reset by peer
            120709 19:19:05:893 WARNING [node1] Exception during read: Connection reset by peer
            120709 22:58:03:663 WARNING [node1] Exception during read: Connection reset by peer
            120710 02:37:01:085 WARNING [node1] Exception during read: Connection reset by peer
            120710 06:19:10:748 WARNING [node1] Exception during read: Connection reset by peer
            120710 09:59:38:810 WARNING [node1] Exception during read: Connection reset by peer


            and so , these may cause a big problems: the default propertie value of prevent the log cleaner from cleaning the log files is REP_STREAM_TIMEOUT . and the default value of REP_STREAM_TIMEOUT is 12 hours.

            so the cleaner is hardly to deleate old files and my disk space is consumed rapidly.

            ---------------------

            second of all .

            When I try to add addtional replcation node to the data groups that I mentioned earlier.
            I got a exception :

            com.sleepycat_ustore_5055.je.EnvironmentFailureException: (JE 5.0.55) node9(-1):ustore_data_ha Feeder: node4(4). com.sleepycat_ustore_5034.je.rep.impl.RepGroupImpl$NodeConflictException: (JE 5.0.34) New node:node9, is configured with the socket address: /172.24.64.45:7035. It conflicts with the socket already used by the member: node9 HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed.
            at com.sleepycat_ustore_5055.je.rep.stream.ReplicaFeederHandshake.verifyMembership(ReplicaFeederHandshake.java:233)
            at com.sleepycat_ustore_5055.je.rep.stream.ReplicaFeederHandshake.execute(ReplicaFeederHandshake.java:171)
            at com.sleepycat_ustore_5055.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:566)
            at com.sleepycat_ustore_5055.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:373)
            at com.sleepycat_ustore_5055.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
            at com.sleepycat_ustore_5055.je.rep.impl.node.RepNode.run(RepNode.java:1312)



            but the name "node9" is the new node name that I want to add to the data groups.

            I use DBGroupAdmin with -dumpGroup option , and I find the node9 isn't in the node sets .

            I restarted the node9 times and times ,but It can't resloved my problem.

            until I restarted the master in these node set, the new node will be add to the replica.

            -------------------------------------------------------------------------

            last of all :

            the property ReplicationConfig.CONSISTENCY_POLICY seems that it's useless.

            when the replca left far behind the master .

            I always find a exception:


            120703 21:42:25:053 INFO [node3] Replica stats - Lag waits: 0 Lag wait time: 0ms. VLSN waits: 0 Lag wait time: 0ms.
            120703 21:42:25:054 INFO [node3] RepNode main thread shutting down.
            120703 21:42:25:055 INFO [node3] RepNode shutdown exception:
            (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x448c/0x220e1a9 parent IN=1358624 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x78138f0 lastLoggedVersion=0x449a/0x78138f0 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,196,451 isReplicated="1" txn=-211997213node3(3)[REPLICA]
            No feeders.
            GlobalCBVLSN=911,946,085
            Group info [panalyzergroup00] ac3535b3-64a1-4396-b8f6-4f5da8de3285
            Representation version: 2
            Change version: 5
            Max rep node ID: 4
            Node:node1 172.23.110.96:7031 (is removed) changeVersion:3 LocalCBVLSN:261,502,287 at:Wed Jun 20 23:57:26 CST 2012
            Node:node4 172.23.110.96:7031 (is member) changeVersion:5 LocalCBVLSN:1,215,950,988 at:Mon Jul 02 11:27:53 CST 2012
            Node:node2 172.24.64.45:7032 (is member) changeVersion:2 LocalCBVLSN:911,946,085 at:Mon Jul 02 11:25:42 CST 2012
            Node:node3 172.23.110.43:7031 (is member) changeVersion:4 LocalCBVLSN:1,081,910,305 at:Mon Jul 02 11:28:00 CST 2012
            vlsnRange=first=814,455,670 last=1,216,196,450 sync=1,216,196,450 txnEnd=1,216,196,450
            lastReplayedTxn= VLSN: 1,216,196,450 masterTxnEndTime=Mon Jul 02 11:28:26 CST 2012 lastReplayedVLSN=1,216,196,451 numActiveReplayTxns=6

            120703 21:42:25:055 SEVERE [node3] node3(3):ustore_data_ha:DETACHED node3(3) exited unexpectedly with exception com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x448c/0x220e1a9 parent IN=1358624 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x78138f0 lastLoggedVersion=0x449a/0x78138f0 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,196,451 isReplicated="1" txn=-211997213com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x448c/0x220e1a9 parent IN=1358624 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x78138f0 lastLoggedVersion=0x449a/0x78138f0 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,196,451 isReplicated="1" txn=-211997213
            at com.sleepycat_ustore_5034.je.tree.IN.fetchTarget(IN.java:1438)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeUntilSplit(Tree.java:1739)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeSplitsAllowed(Tree.java:1682)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSplitsAllowed(Tree.java:1248)
            at com.sleepycat_ustore_5034.je.tree.Tree.findBinForInsert(Tree.java:2158)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.putInternal(CursorImpl.java:829)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.put(CursorImpl.java:774)
            at com.sleepycat_ustore_5034.je.Cursor.putAllowPhantoms(Cursor.java:2231)
            at com.sleepycat_ustore_5034.je.Cursor.putNoNotify(Cursor.java:2188)
            at com.sleepycat_ustore_5034.je.Cursor.putNotify(Cursor.java:2105)
            at com.sleepycat_ustore_5034.je.Cursor.putForReplay(Cursor.java:2024)
            at com.sleepycat_ustore_5034.je.DbInternal.putForReplay(DbInternal.java:136)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.applyLN(Replay.java:799)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.replayEntry(Replay.java:479)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:465)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
            at com.sleepycat_ustore_5034.je.rep.impl.node.RepNode.run(RepNode.java:1318)
            Caused by: java.lang.NullPointerException

            120703 21:42:25:055 INFO [node3] RepNode main thread: DETACHED node3(3) exited.
            120703 21:42:25:056 INFO [node3] ServiceDispatcher shutdown starting. HostPort=my110043.cm3.tbsite.net:7031 Registered services: []
            120703 21:42:25:057 INFO [node3] ServiceDispatcher shutdown completed. HostPort=my110043.cm3.tbsite.net:7031
            120703 21:42:25:057 INFO [node3] node3(3) shutdown completed.
            120703 21:42:31:272 INFO [node3] Started ServiceDispatcher. HostPort=my110043.cm3.tbsite.net:7031
            120703 21:42:31:274 INFO [node3] Current group size: 3
            120703 21:42:31:275 INFO [node3] Existing node node3 querying for a current master.
            120703 21:42:31:277 INFO [node3] Master changed to node4
            120703 21:42:31:279 INFO [node3] Node node3 started
            120703 21:42:31:279 INFO [node3] Replica loop started with master: node4(4)
            120703 21:42:31:280 INFO [node3] Replica-feeder handshake start
            120703 21:42:31:286 INFO [node3] Replica-feeder node4 handshake completed.
            120703 21:42:31:287 INFO [node3] Replica-feeder node4 syncup started. Replica range: first=814,455,670 last=1,216,196,044 sync=1,216,196,044 txnEnd=1,216,196,044
            120703 21:42:31:314 INFO [node3] Rollback to matchpoint 1,216,196,044 at 0x449a/0x9856c55 status=End of range equals matchpoint, nothing to rollback
            120703 21:42:31:315 INFO [node3] Replica-feeder node4 start stream at VLSN: 1,216,196,045
            120703 21:42:31:315 INFO [node3] Replica-feeder node4 syncup ended. Elapsed time: 29ms
            120703 21:42:31:316 INFO [node3] Replica initialization completed. Replica VLSN: 1,216,196,044 Heartbeat master commit VLSN: 1,353,816,939 VLSN delta: 137,620,895
            120703 21:47:31:375 INFO [node3]
            === Interim BECOME_CONSISTENT Report ===
            checkpointId = 11152[daemon] firstActive[0x449a/0x2051d9e], ckptStart[0x449a/0x205676f], ckptEnd[0x449a/0x917f598], lastUsed[0x449a/0x9856c55]

            Approximate distances:
            firstActive->ckptStart bytes = 18,897
            ckptStart->ckptEnd bytes = 118,656,553
            ckptEnd->end bytes bytes = 7,173,821
            Elapsed(ms)
            TOTAL_ENV_OPEN : 6,209
            TOTAL_RECOVERY : 6,196
            FIND_END_OF_LOG : 516
            FIND_LAST_CKPT : 0
            BUILD_TREE : 4,537
            READ_MAP_INS : 922
            UNDO_MAP_LNS : 118
            REDO_MAP_LNS : 129
            READ_INS : 167
            REDO_INS : 151
            UNDO_LNS : 147
            REDO_LNS : 2,898
            POPULATE_UP : 320
            REMOVE_TEMP_DBS : 0
            CKPT : 795
            TOTAL_JOIN_GROUP : started at 07/03/12,21:42:31:269
            FIND_MASTER : 4
            BECOME_CONSISTENT : 300,094

            nRead nProcessed nDeleted nAux nRepeatRd nCacheMiss
            FIND_MASTER 0 0 0 0 0 8
            BECOME_CONSISTENT 0 0 0 0 0 157,837
            CKPT stats:
            Checkpoints: Frequency and extent of checkpointing activity.
            lastCheckpointEnd=0x449a/0xb9b86e7
            lastCheckpointId=11,153
            lastCheckpointStart=0x449a/0x9869375
            nCheckpoints=1
            nDeltaINFlush=11,709
            nFullBINFlush=319
            nFullINFlush=4,785

            120703 21:47:31:376 INFO [node3]
            === Replication Join Group Report ===
            start = 07/03/12,21:42:31:269
            end = 07/03/12,21:47:31:375

            % of total Elapsed(ms)
            TOTAL_JOIN_GROUP 100 300,106
            -------------------------
            FIND_MASTER 0 4
            BECOME_CONSISTENT 99 300,094

            nRead nProcessed nDeleted nAux nRepeatRd nCacheMiss
            FIND_MASTER 0 0 0 0 0 8
            BECOME_CONSISTENT 0 0 0 0 0 157,837

            120703 21:47:31:376 INFO [node3] Shutting down node node3(3)
            120703 21:47:31:377 INFO [node3] Elections shutdown initiated
            120703 21:47:31:377 INFO [node3] Elections shutdown completed
            120703 21:47:31:386 SEVERE [node3] Replica unexpected exception com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x4485/0xbecbe02 parent IN=1916442 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x39d3a17 lastLoggedVersion=0x449a/0x39d3a17 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,797,985 isReplicated="1" txn=-212072503 com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x4485/0xbecbe02 parent IN=1916442 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x39d3a17 lastLoggedVersion=0x449a/0x39d3a17 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,797,985 isReplicated="1" txn=-212072503
            at com.sleepycat_ustore_5034.je.tree.IN.fetchTarget(IN.java:1438)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeUntilSplit(Tree.java:1739)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeSplitsAllowed(Tree.java:1682)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSplitsAllowed(Tree.java:1248)
            at com.sleepycat_ustore_5034.je.tree.Tree.findBinForInsert(Tree.java:2158)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.putInternal(CursorImpl.java:829)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.put(CursorImpl.java:774)
            at com.sleepycat_ustore_5034.je.Cursor.putAllowPhantoms(Cursor.java:2231)
            at com.sleepycat_ustore_5034.je.Cursor.putNoNotify(Cursor.java:2188)
            at com.sleepycat_ustore_5034.je.Cursor.putNotify(Cursor.java:2105)
            at com.sleepycat_ustore_5034.je.Cursor.putForReplay(Cursor.java:2024)
            at com.sleepycat_ustore_5034.je.DbInternal.putForReplay(DbInternal.java:136)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.applyLN(Replay.java:799)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.replayEntry(Replay.java:479)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:465)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
            at com.sleepycat_ustore_5034.je.rep.impl.node.RepNode.run(RepNode.java:1318)
            Caused by: java.lang.NullPointerException

            120703 21:47:31:387 WARNING [node3] Exiting inner Replica loop with exception com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x4485/0xbecbe02 parent IN=1916442 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x39d3a17 lastLoggedVersion=0x449a/0x39d3a17 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,797,985 isReplicated="1" txn=-212072503
            com.sleepycat_ustore_5034.je.EnvironmentFailureException: (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x4485/0xbecbe02 parent IN=1916442 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x39d3a17 lastLoggedVersion=0x449a/0x39d3a17 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,797,985 isReplicated="1" txn=-212072503
            at com.sleepycat_ustore_5034.je.tree.IN.fetchTarget(IN.java:1438)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeUntilSplit(Tree.java:1739)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSubTreeSplitsAllowed(Tree.java:1682)
            at com.sleepycat_ustore_5034.je.tree.Tree.searchSplitsAllowed(Tree.java:1248)
            at com.sleepycat_ustore_5034.je.tree.Tree.findBinForInsert(Tree.java:2158)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.putInternal(CursorImpl.java:829)
            at com.sleepycat_ustore_5034.je.dbi.CursorImpl.put(CursorImpl.java:774)
            at com.sleepycat_ustore_5034.je.Cursor.putAllowPhantoms(Cursor.java:2231)
            at com.sleepycat_ustore_5034.je.Cursor.putNoNotify(Cursor.java:2188)
            at com.sleepycat_ustore_5034.je.Cursor.putNotify(Cursor.java:2105)
            at com.sleepycat_ustore_5034.je.Cursor.putForReplay(Cursor.java:2024)
            at com.sleepycat_ustore_5034.je.DbInternal.putForReplay(DbInternal.java:136)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.applyLN(Replay.java:799)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replay.replayEntry(Replay.java:479)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.doRunReplicaLoopInternalWork(Replica.java:465)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:374)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:309)
            at com.sleepycat_ustore_5034.je.rep.impl.node.RepNode.run(RepNode.java:1318)
            Caused by: java.lang.NullPointerException

            120703 21:47:31:387 INFO [node3] Replica stats - Lag waits: 0 Lag wait time: 0ms. VLSN waits: 0 Lag wait time: 0ms.
            120703 21:47:31:387 INFO [node3] RepNode main thread shutting down.
            120703 21:47:31:388 INFO [node3] RepNode shutdown exception:
            (JE 5.0.34) node3(3):ustore_data_ha fetchTarget of 0x4485/0xbecbe02 parent IN=1916442 IN class=com.sleepycat_ustore_5034.je.tree.IN lastFullVersion=0x449a/0x39d3a17 lastLoggedVersion=0x449a/0x39d3a17 parent.getDirty()=false state=0 java.lang.NullPointerException LOG_INTEGRITY: Log information is incorrect, problem is likely persistent. Problem seen replaying entry INS_LN_TX/8 vlsn=1,216,797,985 isReplicated="1" txn=-212072503node3(3)[REPLICA]
            No feeders.

            ============
            these NPE exception always came with Another exception in my ternimal :
            error com.sleepycat_ustore_5034.je.rep.ReplicaConsistencyException: (JE 5.0.34) Unable to achieve consistency despite waiting for 300000 ms.. classclass com.sleepycat_ustore_5034.je.rep.ReplicaConsistencyException


            =============

            I try to find how these can occur , then I got my jstack here:

            "main" prio=10 tid=0x000000004d397000 nid=0x67c9 waiting on condition [0x0000000040de8000]
            java.lang.Thread.State: TIMED_WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)

            parking to wait for <0x000000062edb7a70> (a java.util.concurrent.CountDownLatch$Sync)
            at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303)
            at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
            at com.sleepycat_ustore_5034.je.rep.utilint.RepUtils$ExceptionAwareCountDownLatch.awaitOrException(RepUtils.java:255)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica$ConsistencyTracker.await(Replica.java:1049)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica$ConsistencyTracker.VLSNAwait(Replica.java:1031)
            at com.sleepycat_ustore_5034.je.rep.impl.node.Replica$ConsistencyTracker.anyVLSNAwait(Replica.java:1005)
            at com.sleepycat_ustore_5034.je.rep.impl.PointConsistencyPolicy.ensureConsistency(PointConsistencyPolicy.java:82)
            at com.sleepycat_ustore_5034.je.rep.impl.node.RepNode.joinAsReplica(RepNode.java:1682)
            at com.sleepycat_ustore_5034.je.rep.impl.node.RepNode.joinGroup(RepNode.java:1599)
            at com.sleepycat_ustore_5034.je.rep.impl.RepImpl.joinGroup(RepImpl.java:459)
            locked <0x00000006642dae00> (a com.sleepycat_ustore_5034.je.rep.impl.RepImpl)
            at com.sleepycat_ustore_5034.je.rep.ReplicatedEnvironment.joinGroup(ReplicatedEnvironment.java:415)
            at com.sleepycat_ustore_5034.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:467)
            at com.sleepycat_ustore_5034.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:338)
            at com.sleepycat_ustore_5034.je.rep.ReplicatedEnvironment.<init>(ReplicatedEnvironment.java:397)

            ========
            My inference is :
            when the slave node is far more behind the master node, "PointConsistencyPolicy" will wait for 300000 ms and throw a exceptoin, these exception may cause the whole ReplicatedEnvironment is reset . the evictor is set to null during ReplicatedEnv restart , so I got the NPE ...

            so I try to find how to avoid let the bdb je do not throw these excpetion.

            I find code here:

            void com.sleepycat_ustore_5055.je.rep.impl.node.RepNode.joinAsReplica(ReplicaConsistencyPolicy
            consistency) throws InterruptedException
            if (consistency == null) {
            final int consistencyTimeout =
            getConfigManager().getDuration(ENV_CONSISTENCY_TIMEOUT);
            consistency = new PointConsistencyPolicy
            (new VLSN(replica.getMasterTxnEndVLSN()),
            consistencyTimeout, TimeUnit.MILLISECONDS);
            }


            so the default constency is null.

            but I have already set Consistency to "NoConsistencyRequiredPolicy" by this methods :
            repConfig.setConfigParam(ReplicationConfig.CONSISTENCY_POLICY,"NoConsistencyRequiredPolicy");

            at last , a find I have to set constency WHEN I INIT MY REPLICATEDENVIROUMENT like these :
            new ReplicatedEnvironment(repo_dir, repConfig, envConfig,
                                          new NoConsistencyRequiredPolicy(),
                                          QuorumPolicy.SIMPLE_MAJORITY);

            and I resloved my problem.

            but the propertie ReplicationConfig.CONSISTENCY_POLICY make me vary confusing...

            Edited by: Jim, Wang on 2012-7-9 下午9:35
            • 3. Re: some questions about BDB JE HA
              Linda Lee
              Jim,

              In reply to your next set of questions:

              First question:

              Yes, the default REP_STREAM_TIMEOUT value of 12 hours is very large. We set that as a default when we first released JE HA, and since then, the volume of activity supported by a typical JE HA application has grown a lot, and a lot of activity can be processed in 12 hours. We have discussed changing it, but instead of making it another constant number, it seemed like it would be nicer to make such a threshold smarter, so it is somehow a function both of time and of disk space. The project got harder, and it isn't high enough yet on our priority list, so it's just a discussion for us now. It sounds like you should certainly set it to a smaller value.

              Second question:

              Yes, your addition of node 9 seems to be doing something strange, and I understand why you would expect to see node 9 present in the group membership list, due to the exception message. One possibility is that there are liingering processes that are somehow holding a port active, and that is interfering with the addition of node 9, although the error message does imply that node 9 should be listed in the group membership.

              I'm afraid the information you've provided isn't enough for us to answer the question. I suggest that you retry the scenario, doing the following steps to add more information:

              - use a system tool such as netstat (Unix) or lsof(Linux) to display processes and ports in use
              - look in the je.info files for possible additional exception messages close to the time of the problem
              - continue to use DbGroupAdmin to look at the group membership

              If you had a detailed, step by step description of the actions, what the logs and DbGroupAdmin say, and what ports and processes were in use, we may be able to provide more information.

              Third question:

              There are two parts of this problem. One is a documentation issue about the difference between the replica consistency policy that is provided as a parameter to the ReplicatedEnvironment constructor, and the one that is specified to the ReplicationConfig. The javadoc for the parameters to ReplicatedEnvironment says this: http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/ReplicatedEnvironment.html
              >
              consistencyPolicy - the consistencyPolicy used by the Replica at startup to make its environment current with respect to the master. This differs from the consistency policy specified by ReplicationConfig.setConsistencyPolicy(com.sleepycat.je.ReplicaConsistencyPolicy) because
              it is used only at construction, when the node joins the group for the first time. The consistency policy set in ReplicationConfig is used any time a policy is used after node startup, such as at transaction begins.

              >

              but the one for ReplicationConfig.CONSISTENCY_POLICY doesn't make it clear that it is only used after node startup. We'll improve the javadoc there, to make it clear that it has no impact on replication environment startup.

              The other problem is that you have found that the shutdown code for the ReplicatedEnvironment is closing the internal daemons too soon, and is resulting in a NPE for the replica's replay thread. I created support request #21590 to fix that. Thank you for diagnosing and reporting it. On the face of it, it is a minor fix, but the shutdown order can be tricky, so we'll have to experiment to see how to to change the order.

              Regards,

              Linda Lee