3 Replies Latest reply: Dec 16, 2010 12:18 PM by 522690

    Suspected replication issue while using MVCC on a slave

    676400
      Environment: Red Hat Enterprise Linux AS release 3 (Taroon Update 6)
      Distribution: Berkeley DB XML 2.4.16 (consolidated patch applied) combined with Berkeley DB 4.7.25 (first 3 patches applied).

      We have a replication group with 1 master and 1 slave; the code uses the Java APIs.

      The master performs a fairly steady stream of queries, inserts and updates, while the slave is handling some fairly large and long running queries for reporting purposes.

      We have enabled MVCC and started using transactions with snapshot isolation for the queries on the slave as the replication traffic was causing a large number of deadlock exceptions during the queries. The master is not configured for MVCC.

      Is this a supported configuration? After we started using MVCC on the slave we have been getting panics on the slave with messages similar to the following. The frequency of the panics is fairly inconsistent, maybe once or twice a day.

      Log sequence error: page LSN 112 66503; previous LSN 112 75892
      transaction failed at [112][76047]
      Error processing txn [112][87244]

      I can provide additional configuration information if you feel it would be helpful.

      Thanks,
        • 1. Re: Suspected replication issue while using MVCC on a slave
          524300
          Using MVCC for reads on slaves is supported.

          It's difficult to know what is going wrong based on the error message you are seeing: it is indicating that when trying to apply a change from the master, the version of a page found on a slave was not the expected one. This obviously shouldn't happen, but there could be various reasons for it.
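          To make the description of the error concrete, here is a minimal sketch of the kind of comparison behind a "Log sequence error" message. It is an illustration only, not the Berkeley DB source: the class and method names (LsnCheck, Lsn, isSequenceError) are hypothetical, and the rule that a page older than the "previous LSN" named in the log record counts as an error is an assumption drawn from the explanation above. An LSN is treated as a (log file number, offset) pair, ordered by file first and then by offset.

```java
// Illustrative sketch only; this is not Berkeley DB source code.
final class LsnCheck {

    /** A log sequence number: log file number plus byte offset in that file. */
    static final class Lsn {
        final long file;
        final long offset;

        Lsn(long file, long offset) {
            this.file = file;
            this.offset = offset;
        }

        @Override
        public String toString() {
            return "[" + file + "][" + offset + "]";
        }
    }

    /** Orders LSNs by file number first, then by offset within the file. */
    static int compare(Lsn a, Lsn b) {
        if (a.file != b.file) {
            return Long.compare(a.file, b.file);
        }
        return Long.compare(a.offset, b.offset);
    }

    /**
     * Assumed rule: the log record names the LSN the page should carry
     * before the change is applied. A page older than that suggests an
     * earlier update never reached this copy of the page.
     */
    static boolean isSequenceError(Lsn pageLsn, Lsn previousLsn) {
        return compare(pageLsn, previousLsn) < 0;
    }

    public static void main(String[] args) {
        // Values taken from the reported message:
        // "page LSN 112 66503; previous LSN 112 75892"
        Lsn page = new Lsn(112, 66503);
        Lsn previous = new Lsn(112, 75892);
        System.out.println("page " + page + " vs previous " + previous
                + " -> sequence error: " + isSequenceError(page, previous));
    }
}
```

          With the values from the report, the page on the slave is older than the log record expects, which matches the reading that the slave found an unexpected version of the page.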

          Can you please tell me some more about the configuration of the slaves?

          * Is message processing single-threaded or are there multiple threads?
          * Can you send me the output from running "db_stat -m" in the environment directory?
          * Does the frequency change if you increase the cache size?
          * Is it possible for you to test with the newly-released Berkeley DB 4.8?

          Regards,
          Michael Cahill, Oracle Berkeley DB.
          • 2. Re: Suspected replication issue while using MVCC on a slave
            676400
            Is message processing single-threaded or are there multiple threads?
            There are multiple threads performing concurrent operations on both the master and slave.
            Can you send me the output from running "db_stat -m" in the environment directory?
            80MB 1KB 752B     Total cache size
            1     Number of caches
            1     Maximum number of caches
            80MB 8KB     Pool individual cache size
            0     Maximum memory-mapped file size
            0     Maximum open file descriptors
            0     Maximum sequential buffer writes
            0     Sleep after writing maximum sequential buffers
            0     Requested pages mapped into the process' address space
            39M     Requested pages found in the cache (99%)
            239822     Requested pages not found in the cache
            29605     Pages created in the cache
            239822     Pages read into the cache
            493192     Pages written from the cache to the backing file
            248866     Clean pages forced from the cache
            453     Dirty pages forced from the cache
            0     Dirty pages written by trickle-sync thread
            10109     Current total page count
            9860     Current clean page count
            249     Current dirty page count
            8191     Number of hash buckets used for page location
            39M     Total number of times hash chains searched for a page (39774861)
            9     The longest hash chain searched for a page
            81M     Total number of hash chain entries checked for page (81585966)
            4     The number of hash bucket locks that required waiting (0%)
            2     The maximum number of times any hash bucket lock was waited for (0%)
            0     The number of region locks that required waiting (0%)
            0     The number of buffers frozen
            0     The number of buffers thawed
            0     The number of frozen buffers freed
            269452     The number of page allocations
            605263     The number of hash buckets examined during allocations
            12     The maximum number of hash buckets examined for an allocation
            249319     The number of pages examined during allocations
            2     The max number of pages examined for an allocation
            1     Threads waited on page I/O
            Pool File: SomeDatabase.db
            4096     Page size
            0     Requested pages mapped into the process' address space
            9136     Requested pages found in the cache (99%)
            5     Requested pages not found in the cache
            0     Pages created in the cache
            5     Pages read into the cache
            1262     Pages written from the cache to the backing file
            Pool File: SomeContainer.dbxml
            8192     Page size
            0     Requested pages mapped into the process' address space
            18M     Requested pages found in the cache (98%)
            230519     Requested pages not found in the cache
            17729     Pages created in the cache
            230519     Pages read into the cache
            450040     Pages written from the cache to the backing file
            Pool File: __db.rep.db
            4096     Page size
            0     Requested pages mapped into the process' address space
            21M     Requested pages found in the cache (99%)
            77     Requested pages not found in the cache
            10031     Pages created in the cache
            77     Pages read into the cache
            75     Pages written from the cache to the backing file
            Pool File: SomeOtherContainer.dbxml
            8192     Page size
            0     Requested pages mapped into the process' address space
            59862     Requested pages found in the cache (98%)
            870     Requested pages not found in the cache
            215     Pages created in the cache
            870     Pages read into the cache
            7174     Pages written from the cache to the backing file
            Pool File: YetAnotherContainer.dbxml
            8192     Page size
            0     Requested pages mapped into the process' address space
            5546     Requested pages found in the cache (99%)
            44     Requested pages not found in the cache
            30     Pages created in the cache
            44     Pages read into the cache
            759     Pages written from the cache to the backing file
            Pool File: YetSomeOtherContainer.dbxml
            8192     Page size
            0     Requested pages mapped into the process' address space
            328259     Requested pages found in the cache (97%)
            8307     Requested pages not found in the cache
            1600     Pages created in the cache
            8307     Pages read into the cache
            33882     Pages written from the cache to the backing file
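            As an aid to reading the figures above, the percentage printed beside "Requested pages found in the cache" appears to be hits divided by total requests, truncated to a whole number. The sketch below assumes that formula; the class and method names (CacheHitRatio, hitPercent) are hypothetical, and the "18M" figure is rounded in the output, so 18,000,000 is only an approximation.

```java
// Illustrative sketch only; assumes percentage = hits * 100 / (hits + misses),
// truncated to an integer.
final class CacheHitRatio {

    /** Integer percentage of page requests served from the cache. */
    static long hitPercent(long hits, long misses) {
        long total = hits + misses;
        if (total == 0) {
            return 0; // no requests yet, so report 0 rather than divide by zero
        }
        return (hits * 100) / total;
    }

    public static void main(String[] args) {
        // SomeContainer.dbxml: "18M" found (rounded in the output), 230519 not found
        System.out.println("SomeContainer.dbxml: "
                + hitPercent(18_000_000L, 230_519L) + "%");
        // YetSomeOtherContainer.dbxml: 328259 found, 8307 not found
        System.out.println("YetSomeOtherContainer.dbxml: "
                + hitPercent(328_259L, 8_307L) + "%");
    }
}
```

            Rerunning the same calculation after a cache size change would be one way to see whether the miss count for the large container moves.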
            Does the frequency change if you increase the cache size?
            Is it possible for you to test with the newly-released Berkeley DB 4.8?
            The cache size is fairly static, always has been. I may be able to scratch up a couple of machines to test that. Unfortunately, I wouldn’t be able to upgrade the environment in which this is occurring to 4.8 any time soon.

            Here is a terse version of what the startup configuration on the master and slave would look like. Variables have been substituted with literals and all error checking removed, of course.

            Both master and slave environments have a DB_CONFIG with the following line…

            set_lg_regionmax 122880

            MASTER

            config = new EnvironmentConfig();
            config.setAllowCreate(true);
            config.setInitializeCache(true);
            config.setCacheSize(67108864);
            config.setTransactional(true);
            config.setInitializeLocking(true);
            config.setInitializeLogging(true);
            config.setTxnNoSync(false);
            config.setMultiversion(false);
            config.setLockDetectMode(LockDetectMode.MINWRITE);
            config.setMaxLockers(1310720);
            config.setMaxLockObjects(1310720);
            config.setMaxLocks(1310720);
            config.setRegister(true);
            config.setRunRecovery(true);
            … Add local & remote sites
            config.setReplicationPriority(100);
            config.setReplicationManagerAckPolicy(ReplicationManagerAckPolicy.NONE);
            config.setInitializeReplication(true);

            env = new Environment(path, config);
            env.setReplicationConfig(ReplicationConfig.BULK, false);
            env.replicationManagerStart(3, ReplicationManagerStartPolicy.REP_MASTER);

            SLAVE

            config = new EnvironmentConfig();
            config.setAllowCreate(true);
            config.setInitializeCache(true);
            config.setCacheSize(67108864);
            config.setTransactional(true);
            config.setInitializeLocking(true);
            config.setInitializeLogging(true);
            config.setTxnNoSync(false);
            config.setMultiversion(true);
            config.setLockDetectMode(LockDetectMode.MINWRITE);
            config.setMaxLockers(1310720);
            config.setMaxLockObjects(1310720);
            config.setMaxLocks(1310720);
            config.setRegister(true);
            config.setRunRecovery(true);
            … Add local & remote sites
            config.setReplicationPriority(0);
            config.setReplicationManagerAckPolicy(ReplicationManagerAckPolicy.NONE);
            config.setInitializeReplication(true);

            env = new Environment(path, config);
            env.setReplicationConfig(ReplicationConfig.BULK, false);
            env.replicationManagerStart(3, ReplicationManagerStartPolicy.REP_CLIENT);

            Thanks,

            Neil
            • 3. Re: Suspected replication issue while using MVCC on a slave
              522690
              Neil,

              Is this still an issue? Unfortunately Michael Cahill is no longer with Oracle so I cannot find out what he was looking into.

              If this is still an issue, can you use the db_printlog utility to get more information on the transactions involved in the problem? The output could be quite large, so we would probably want to arrange to upload it to our FTP site so that I can examine it.

              Thank you
              Michael Ubell
              Oracle Berkeley DB