1 Reply Latest reply on Oct 1, 2012 2:27 PM by Linda Lee-Oracle

    Why finding replication stream matchpoint takes too long

      I am using BDB JE 5.0.58 HA (a two-node group, with a 6 GB JVM for each node).
      Sometimes a bdb node takes too long to restart (about 2 hours).
      When this occurs, I capture the stack of the bdb JVM process with jstack.
      After analyzing the stack, I found that "ReplicaFeederSyncup.findMatchpoint()" was taking all the time.
      I want to know why this method takes so much time, and how I can avoid this bad case.


      Edited by: liang_mic
        • 1. Re: Why finding replication stream matchpoint takes too long
          Linda Lee-Oracle

           2 hours is indeed a huge amount of time for a node restart. It's hard to be sure what may be going wrong without a more detailed analysis of your log, but I do wonder if it is related to the problem you reported in "outOfMemory error presents when cleaner occurs" [#21786]. Perhaps the best approach is for me to describe in more detail what happens when a replicated node is connecting with a new master, which might give you more insight into what is happening in your case.

          The members of a BDB JE HA replication group share the same logical stream of replicated records, where each record is identified with a virtual log sequence number, or VLSN. In other words, the log record described by VLSN x on any node is the same data record, although it may be stored in a physically different place in the log of each node.
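           To make the VLSN idea concrete, here is a small self-contained Java sketch. The maps and offsets are hypothetical, purely for illustration, and are not JE's internal representation: each node holds the same logical records, keyed by VLSN, at different physical offsets in its own log.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: each node stores the same logical record (named by its
// VLSN) at a possibly different physical offset in its own on-disk log.
public class VlsnConcept {

    public static void main(String[] args) {
        // Hypothetical per-node maps: VLSN -> physical offset in that node's log.
        Map<Long, Long> masterLog = new HashMap<>();
        Map<Long, Long> replicaLog = new HashMap<>();

        long offsetMaster = 0;      // master's log happens to start at 0
        long offsetReplica = 4096;  // replica's log starts somewhere else
        for (long vlsn = 1; vlsn <= 5; vlsn++) {
            masterLog.put(vlsn, offsetMaster);
            replicaLog.put(vlsn, offsetReplica);
            offsetMaster += 100;    // physical record spacing differs per node
            offsetReplica += 120;
        }

        // VLSN 3 names the same logical record on both nodes, even though it
        // lives at a different physical place in each log.
        System.out.println("VLSN 3 on master at offset " + masterLog.get(3L));
        System.out.println("VLSN 3 on replica at offset " + replicaLog.get(3L));
    }
}
```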

           When a replica in a group connects with a master, it must find a common point, the matchpoint, in that replication stream. There are different situations in which a replica may connect with a master: for example, it may have just come up and joined the group, or it may already be up when a new master is elected for the group. One way or another, the replica wants to find the most recent point in its log that it has in common with the log of the master. Only certain kinds of log entries, tagged with timestamps, are eligible to be used for such a match; usually these are transaction commits and aborts.
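           As a rough illustration of that search, here is a hedged, self-contained Java sketch. The entry types, lists, and helper method are invented for illustration and are not JE's actual syncup code: scan the replica's log backward and return the most recent commit or abort VLSN that the master's log also contains.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch of the matchpoint idea: walk the replica's log backward
// and find the most recent commit/abort VLSN that the master's log also holds.
public class MatchpointSketch {

    enum Type { INSERT, COMMIT, ABORT }

    record Entry(long vlsn, Type type) {}

    // Hypothetical helper: returns the matchpoint VLSN, or -1 if none exists.
    static long findMatchpoint(List<Entry> replicaLog, Set<Long> masterVlsns) {
        for (int i = replicaLog.size() - 1; i >= 0; i--) {
            Entry e = replicaLog.get(i);
            boolean isTxnEnd = e.type() == Type.COMMIT || e.type() == Type.ABORT;
            // Only timestamped transaction-end entries are match candidates.
            if (isTxnEnd && masterVlsns.contains(e.vlsn())) {
                return e.vlsn();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Entry> replicaLog = List.of(
            new Entry(1, Type.INSERT), new Entry(2, Type.COMMIT),
            new Entry(3, Type.INSERT), new Entry(4, Type.COMMIT),
            new Entry(5, Type.INSERT));            // open txn, no commit yet
        Set<Long> masterVlsns = Set.of(1L, 2L, 3L, 4L); // logs diverge after 4
        System.out.println("matchpoint VLSN = "
            + findMatchpoint(replicaLog, masterVlsns));
    }
}
```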

           Now, in your previous forum posting, you reported an OOME because of a very large transaction, so at first this syncup issue seems like it might be related. Perhaps your replication nodes need to traverse a great many records, in an incomplete transaction, to find the matchpoint. But the syncup code does not blindly traverse all records; it uses the vlsn index metadata to skip to the optimal locations. In this case, even if the last transaction was very long and incomplete, it should know where the previous transaction end was, and find that location directly, without having to do a scan.
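           The skipping behavior can be sketched with a sparse index. Here is a self-contained Java example, assuming one metadata entry per 10,000 VLSNs (an illustrative number, not JE's actual stride): a floor lookup jumps close to the target position, so only a bounded tail of records ever needs scanning.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of why a sparse VLSN index avoids a full scan: a map
// from every Nth VLSN to a physical log position lets syncup jump near the
// previous transaction end instead of reading every intervening record.
public class VlsnIndexSketch {

    public static void main(String[] args) {
        final long stride = 10_000;  // assumed spacing, for illustration only
        TreeMap<Long, Long> index = new TreeMap<>(); // VLSN -> log offset
        for (long v = 0; v <= 15_000_000; v += stride) {
            index.put(v, v * 100);   // fake physical offsets
        }

        // To position near a target VLSN, take the nearest indexed VLSN at or
        // below it, then scan forward within at most one stride of records.
        long target = 14_999_950;
        Map.Entry<Long, Long> near = index.floorEntry(target);

        System.out.println("index entries = " + index.size());
        System.out.println("jump to VLSN " + near.getKey()
            + " at offset " + near.getValue());
        System.out.println("records left to scan <= " + (target - near.getKey()));
    }
}
```

           Note how roughly 1,500 index entries suffice to cover 15 million records at this assumed spacing; that is the flavor of economy the real metadata aims for.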

           On a possibly related note, I did wonder if something was unusual about your vlsn index metadata. I did not explain this in "outOfMemory error presents when cleaner occurs", but I later calculated that the transaction which caused the OOME should only have contained about 1,500 records. I think you said that you figured out that you were deleting about 15 million records, and that it was the vlsn index update transaction which was holding many locks. But because the vlsn index does not record every single record, it should only take about 1,500 metadata records in the vlsn index to cover 15 million application data records. It is still a bug in our code to update that many records in a single transaction, but the OOME was surprising, because 1,500 locks shouldn't be catastrophic.

          There are a number of ways to investigate this further.

          - You may want to try using a SyncupProgress listener described at http://docs.oracle.com/cd/E17277_02/html/java/com/sleepycat/je/rep/SyncupProgress.html to get more information on which part of the syncup process is taking a long time.
           - If that confirms that finding the matchpoint is the problem, we have an unadvertised utility, meant for debugging, that examines the vlsn index. The usage is as follows; you would use the -dumpVLSN option and run this on the replica node. However, this would require our assistance to interpret the results. We would be looking for the records that mention where "sync" points are, and would correlate that to the replica's log; that might give more information on whether this is indeed the problem, and why the vlsn index was not acting to optimize the search.

          $ java -jar build/lib/je.jar DbStreamVerify
          usage: java { com.sleepycat.je.rep.utilint.DbStreamVerify | -jar je-<version>.jar DbStreamVerify }
          -h <dir> # environment home directory
          -s <hex> # start file
          -e <hex> # end file
          -verifyStream # check that replication stream is ascending
          -dumpVLSN # scan log file for log entries that make up the VLSN index, don't run verify.
          -dumpRepGroup # scan log file for log entries that make up the rep group db, don't run verify.
          -i # show invisible. If true, print invisible entries when running verify mode.
          -v # verbose