1 Reply Latest reply: Mar 8, 2011 4:57 PM by 522690 RSS

    Replication client gets stuck reading pages during queries

    422583
      We are using Berkeley DB XML 2.5.16 on top of Berkeley DB 5.1.25.NC on Linux 2.6.9-74.ELsmp #1 SMP x86_64 x86_64 x86_64 GNU/Linux on 2 Dual-Core AMD Opteron(tm) Processor 8218 machines, with 32GB of RAM each, replicated.

      We compiled Berkeley DB XML 2.5.16 with Berkeley DB 5.1 due to a previous ussue, the sudden segfault of the replication client while processing replication messages: {message:id=9354464}.

      We use transactions around every operation, be it a query or an insert/update/modify. What we are observing now is a random lock-up (an infinite loop or a spin of some sort) on our replicated client when its retrieving data from pages that match a query. It happens during massive updates and queries running in parallel, but the infinite wait only happens on the slave, in a thread processing a query, inside the native function com.sleepycat.dbxml.dbxml_javaJNI.HelperFunctions_getContentAsString. The thread takes up 100% of the CPU its using. Attachnig gdb and scanning through the threads, I found this thread, stuck and never returning:
      #0  0x00000038488b9895 in _xstat () from /lib64/tls/libc.so.6
      #1  0x0000002ebc6fcfd2 in __os_exists () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #2  0x0000002ebc6d025f in __db_fullpath () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #3  0x0000002ebc6d054d in __db_appname () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #4  0x0000002ebc6d9b3e in __fop_file_setup () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #5  0x0000002ebc6b253c in __db_open () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #6  0x0000002ebc699be5 in __db_master_open () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #7  0x0000002ebc6b3023 in __db_reopen () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #8  0x0000002ebc61a52f in __bam_get_root () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #9  0x0000002ebc618fc1 in __bam_rsearch () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #10 0x0000002ebc6188d0 in __ramc_get () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #11 0x0000002ebc606cea in __bam_bulk_duplicates () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #12 0x0000002ebc607433 in __bam_bulk () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #13 0x0000002ebc69d6e2 in __dbc_iget () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #14 0x0000002ebc6ab187 in __dbc_get_pp () from /home/svctebd/dbxml/lib/libdb-5.1.so
      #15 0x0000002ebd44b568 in DbXml::NsEventReader::nextNode () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #16 0x0000002ebd44baaf in DbXml::NsEventReader::getNode () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #17 0x0000002ebd44c7ad in DbXml::NsEventReader::NsEventReader () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #18 0x0000002ebd437e4b in DbXml::NsDocumentDatabase::getContent () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #19 0x0000002ebd52828d in DbXml::Document::id2dbt () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #20 0x0000002ebd52d159 in DbXml::Document::getContentAsDbt () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #21 0x0000002ebd5710b0 in DbXml::XmlDocument::getContent () from /home/svctebd/dbxml/lib/libdbxml-2.5.so
      #22 0x0000002ebc4b35b8 in Java_com_sleepycat_dbxml_dbxml_1javaJNI_HelperFunctions_1getContentAsString ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdbxml_java-2.5.so
      #23 0x0000002a96716f50 in ?? ()
      #24 0x0000002af556ac08 in ?? ()
      #25 0x0000000045054368 in ?? ()
      #26 0x0000000045054370 in ?? ()
      #27 0x0000002a9a35cf60 in ?? ()
      #28 0x00000000450543d0 in ?? ()
      #29 0x0000002a9a36e630 in ?? ()
      #30 0x0000000000000000 in ?? ()
      Why would it be stuck there? Over time, it goes in and out of xstat, but it always stays within stack frame #5, dbopen.
      Meanwhile, I see the replication manager thread is stuck inside another method, db_tas_mutex_lock:
      #0  0x0000002b204c49ff in __db_tas_mutex_lock ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #1  0x0000002b205b86f7 in __memp_dirty () from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #2  0x0000002b2057f9c6 in __db_addrem_recover ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #3  0x0000002b2056e05d in __db_dispatch () from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #4  0x0000002b20524b80 in __rep_process_txn ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #5  0x0000002b2052514c in __rep_process_rec ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #6  0x0000002b2052641e in __rep_apply () from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #7  0x0000002b205aa418 in __log_rep_split () from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #8  0x0000002b2051ee81 in __rep_bulk_log () from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #9  0x0000002b20528519 in __rep_process_message_int ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #10 0x0000002b205336f8 in __repmgr_msg_thread ()
         from /local/1/home/svctebd/deploy/dbxml/dbxml-2.5.16_db-5.1.25.NC/lib/libdb_java-5.1.so
      #11 0x0000003849106137 in start_thread () from /lib64/tls/libpthread.so.0
      #12 0x00000038488c7533 in clone () from /lib64/tls/libc.so.6
      I wonder if somehow thats related. Please let us know if this is a known bug, or something specific to our setup
      Oh, also, if another thread attempts the same query (I guess tried to pull out the same pages), the original thread that was stuck gets the following exception and the new thread gets stuck:
      com.sleepycat.dbxml.XmlException: Error: DB_LOCK_NOTGRANTED: Lock not granted File: ../src/dbxml/nodeStore/NsEventReader.cpp Line: 829, errcode = DATABASE_ERROR
              at com.sleepycat.dbxml.dbxml_javaJNI.HelperFunctions_getContentAsString(
      I wonder if thats related to the mutex or select() issues replication has been having in previous releases?

      Thank you,
      Alexander.

      Edited by: ashenker on Mar 7, 2011 7:57 AM
        • 1. Re: Replication client gets stuck reading pages during queries
          522690
          Alexander,

          The first stack indicates that something is wrong in the code. It is searching a off page duplicate sub tree of your database but it has called __db_reopen which should only be called if the tree it is searching is the main btree of a sub-database. If you are a supported customer it would be best to open a case with our support organization. If not, do you have some way to replicate this situation that we can reproduce? It is unlikely that this has anything to do with replication.

          Thank you,
          Michael Ubell
          Oracle Berkeley DB