10 Replies Latest reply: Feb 23, 2011 2:30 AM by 655560

    Segfaulting on full container operations - debugging help

    671148
      Hi all -

      I have some data that has been through lots of dbxml upgrades, and since the last two upgrades all full container operations segfault. I was able to ignore this data for a year but now it's coming back to haunt me. All full container operations, such as reindex, add/remove index, and dumping all data through either dump or getDocuments, segfault. It's data specific and my container is 16GB. I want to be able to provide more info than the basic gdb stack trace but I'm not sure how. I have recompiled dbxml and bdb with debug flags but I'm not sure where to go from there - I get more or less the same undetailed information on the bug (below). Is it because I'm running the test through the shell? I want to help myself but it's been quite some time since I have written any C++. If there is ANY other way to get my documents out of this DB I'm happy to dump and reload, but everything I can think of segfaults. I'm sure it's just an ancient data issue.

      Thanks in advance!


      #0 0x00002b1b0d060f80 in DbXml::NsFormat::unmarshalInt64 () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #1 0x00002b1b0d077a21 in DbXml::NsRawNode::setNode () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #2 0x00002b1b0d05cecb in DbXml::NsEventReader::getNode () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #3 0x00002b1b0d05d07f in DbXml::NsEventReader::endElement () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #4 0x00002b1b0d05d355 in DbXml::NsEventReader::next () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #5 0x00002b1b0d049625 in DbXml::EventReaderToWriter::doEvent () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #6 0x00002b1b0d049a92 in DbXml::EventReaderToWriter::start () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #7 0x00002b1b0d133cc8 in DbXml::DocumentDatabase::reindex () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #8 0x00002b1b0d103d61 in DbXml::Container::reindex () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #9 0x00002b1b0d10ad72 in DbXml::Container::setIndexSpecificationInternal () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #10 0x00002b1b0d10b648 in DbXml::Container::setIndexSpecification () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #11 0x00002b1b0d16f1a0 in DbXml::XmlContainer::setIndexSpecification () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #12 0x00002b1b0d16f52b in DbXml::XmlContainer::deleteIndex () from /usr/local/xmldb/lib/libdbxml-2.5.so
      #13 0x000000000040f67e in DeleteIndexCommand::execute ()
      #14 0x0000000000427475 in Shell::mainLoop ()
      #15 0x000000000042c7f9 in main ()
        • 1. Re: Segfaulting on full container operations - debugging help
          637288
          Hi,

          What is your container version? If it's an older one, did you try to upgrade it?
          You can use gdb to spot the exact place in the code where the segfault happens. Is it the same place for both reindexing and dump?
          Not sure if the exact place would give a precise idea of what is going wrong, but it might...

          Vyacheslav
          • 2. Re: Segfaulting on full container operations - debugging help
            671148
            Hi Vyacheslav -

            I have tried upgrades a million times, reinstalling clean, and I'm getting just about nothing. I get the same error with reindex, dump, everything. I'm not sure how to get more specific than what gdb is giving me - I think it has to do with the fact that it is a shared object, but I'm still trying to recall. where and bt just keep pointing to: 0x00002afad307df80 in DbXml::NsFormat::unmarshalInt64 () from /usr/local/xmldb/lib/libdbxml-2.5.so, which is just about useless, I know. I have compiled both dbxml and bsddb3 with debug flags, although I just realized I didn't make clean, so maybe that's why the .so still lacks symbols.

            In the meanwhile, I'm sure this is a data issue. Is there any way for me to iterate document by document through all documents? This way maybe I can at least find the document causing the issue.

            Thanks!

            Liz
            • 3. Re: Segfaulting on full container operations - debugging help
              637288
              Hi, Liz,

              1) But which version of DB XML created your problematic container?
              2) gdb would also tell you the line number in the source code.
              3) "In the meanwhile, I'm sure this is a data issue. Is there any way for me to iterate document by document through all documents?"
              Are you able to fetch a document by its name from DB XML? Are you able to perform some queries? In that case you could read one document at a time and put it into another container...

              Anyways, if the above does not help, I guess we should wait for an answer from the DB XML team.

              Vyacheslav
              • 4. Re: Segfaulting on full container operations - debugging help
                671148
                Hi Vyacheslav -

                Sorry to not have the line number - I was missing some magic incantations in my library loading so it wasn't finding the symbols. Full bt with line numbers is below. I'll look into this as well, but if you've seen this before I'd be happy to hear about it! BTW, when not run in debug this is a SEGFAULT, but when run under debug it's an assertion error. I am using the exact same test case and getting different results.

                No idea what the first version was - it was about 5 years ago. Everything worked fine until the 2.5.16 upgrade, which happened to fix bugs that OTHER containers were having, so it was a must. I can go document by document but I literally have 16GB of documents (49307 docs). I get the segfault when I try to do any full container query, so getting all ids, for example, isn't working. I'll keep plugging on this part - I was just looking for a cursor-ish way to find out which one is making things upset.

                Thanks again

                Liz



                #0 0x00000037dec30265 in raise () from /lib64/libc.so.6
                #1 0x00000037dec31d10 in abort () from /lib64/libc.so.6
                #2 0x00002b1f3fb24a27 in DbXml::assert_fail (expression=0x2b1f3fba1bc2 "data.data", file=0x2b1f3fba1ae0 "src/dbxml/nodeStore/NsEventReader.cpp", line=665) at src/dbxml/DbXmlInternal.cpp:21
                #3 0x00002b1f3fa3c3cf in DbXml::NsEventReader::getNode (this=0x7fff7e6887b0, parent=0x27215420) at src/dbxml/nodeStore/NsEventReader.cpp:665
                #4 0x00002b1f3fa3c648 in DbXml::NsEventReader::endElement (this=0x7fff7e6887b0) at src/dbxml/nodeStore/NsEventReader.cpp:874
                #5 0x00002b1f3fa3c7ae in DbXml::NsEventReader::next (this=0x7fff7e6887b0) at src/dbxml/nodeStore/NsEventReader.cpp:508
                #6 0x00002b1f3fa2307c in DbXml::EventReaderToWriter::doEvent (this=0x7fff7e6886e0, writer=0x7fff7e6889b0, isInternal=true) at src/dbxml/nodeStore/EventReaderToWriter.cpp:160
                #7 0x00002b1f3fa23955 in DbXml::EventReaderToWriter::start (this=0x7fff7e6886e0) at src/dbxml/nodeStore/EventReaderToWriter.cpp:146
                #8 0x00002b1f3fa5b15c in DbXml::NsWriter::writeFromReader (this=0x7fff7e6889b0, reader=@0x7fff7e6887b0) at src/dbxml/nodeStore/NsWriter.cpp:104
                #9 0x00002b1f3fa25d0b in DbXml::NsDocumentDatabase::getContent (this=0x167b7180, context=@0x16a84368, document=0x16a842d0, flags=0) at src/dbxml/nodeStore/NsDocumentDatabase.cpp:111
                #10 0x00002b1f3fb2b409 in DbXml::Document::id2dbt (this=0x16a842d0) at src/dbxml/Document.cpp:888
                #11 0x00002b1f3fb2e267 in DbXml::Document::getContentAsDbt (this=0x16a842d0) at src/dbxml/Document.cpp:530
                #12 0x00002b1f3fb8017d in DbXml::XmlDocument::getContent (this=0x16a844a8, s=@0x7fff7e688c60) at src/dbxml/XmlDocument.cpp:145
                #13 0x00002b1f3fb197c9 in DbXml::DbXmlNodeValue::asString (this=0x16a84490) at src/dbxml/Value.cpp:507
                #14 0x00002b1f3fb8a263 in DbXml::XmlValue::asString (this=0x7fff7e689010) at src/dbxml/XmlValue.cpp:299
                #15 0x000000000040d16a in PrintCommand::execute (this=0x167a7a90, args=@0x7fff7e689160, env=@0x7fff7e689880) at src/utils/shell/PrintCommand.cpp:83
                #16 0x0000000000427dda in Shell::mainLoop (this=0x7fff7e689900, in=@0x7fff7e689240, env=@0x7fff7e689880) at src/utils/shell/Shell.cpp:66
                #17 0x0000000000416d4b in IncludeCommand::execute (this=0x167a7870, args=@0x7fff7e689580, env=@0x7fff7e689880) at src/utils/shell/IncludeCommand.cpp:56
                #18 0x0000000000427dda in Shell::mainLoop (this=0x7fff7e689900, in=@0x647790, env=@0x7fff7e689880) at src/utils/shell/Shell.cpp:66
                #19 0x000000000042eac0 in main (argc=1, argv=0x7fff7e689ad8) at src/utils/shell/dbxmlsh.cpp:248

                Edited by: eleddy on Feb 22, 2011 2:39 PM
                • 5. Re: Segfaulting on full container operations - debugging help
                  637288
                  Hi Liz,

                  One more question. Could you use lazy evaluation to run a query that returns all document ids (names)? Maybe in that case it won't segfault? What about a metadata iterator over the dbxml:name metadata field?
                  Also, you can try something like:
                  for $i in 1 to 49307 return 
                    let $name := collection()[$i]/dbxml:metadata('dbxml:name')
                    return ($name, $i)
                  or maybe use a predicate [position() < $i]; it might make a difference for query evaluation.

                  Anyway, it was just an idea for now...

                  Vyacheslav
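A position predicate like the one suggested above also makes it possible to bisect rather than walk all 49307 positions. What follows is only a minimal sketch of that idea, not anything from DB XML itself. It assumes exactly one document is unreadable, and `reads_cleanly` is a hypothetical probe that you would have to supply yourself, for example by running a position-restricted query in a separate process and treating a crash as failure.

```python
from typing import Callable


def find_bad_position(total: int, reads_cleanly: Callable[[int, int], bool]) -> int:
    """Return the 1-based position of the single unreadable document.

    reads_cleanly(lo, hi) is a hypothetical probe supplied by the caller.
    It must return True when every document at positions lo..hi (inclusive)
    can be read without a crash, and False otherwise.
    """
    lo, hi = 1, total
    while lo < hi:
        mid = (lo + hi) // 2
        if reads_cleanly(lo, mid):
            lo = mid + 1  # lower half is clean, so the bad document is above mid
        else:
            hi = mid      # the crash is somewhere in lo..mid
    return lo
```

With 49307 documents this needs at most 16 probes. If more than one document is damaged, the sketch still lands on one of them, but the search would have to be repeated on the remaining ranges to find the others.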
                  • 6. Re: Segfaulting on full container operations - debugging help
                    671148
                    Awesome - I didn't know you could iterate the container like that. Very helpful. Thanks!
                    • 7. Re: Segfaulting on full container operations - debugging help
                      671148
                      FWIW I found the document that causes this error, but I can't remove it without segfaulting either. I can't add or remove nodes. Nothing. I think the only option will be to recreate the entire container without this document. I'm pretty sure the document is not empty, which is a bummer because I'll never know what it was.

                      Any Oracle peeps know of a patch that could save me the trouble? Are there other ways to remove a document besides removeDocument from the shell?

                      Thanks!

                      Liz
                      • 8. Re: Segfaulting on full container operations - debugging help
                        655560
                        Hi Liz,

                        I haven't heard from other customers that they have run into this issue. If you can point me to any clues for reproducing the problem, it would be appreciated.

                        If you need a solution to load your data urgently, an alternative is:
                        1. Print all doc names into a list.
                        2. Remove the problematic xml doc from the list.
                        3. Generate a dbxml shell script from the list to load all docs from the container. e.g.:
                        getdoc doc1.xml
                        print doc1.xml
                        4. Put the loaded xml files into a new Container (you can also do this job by generating a dbxml shell script). e.g.:
                        putdoc doc1.xml doc1.xml f
                        Best regards,
                        Rucong Zhao
                        Oracle Berkeley DB XML
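The four steps above can be scripted. This is a rough sketch under a few assumptions: the `getdoc`, `print` and `putdoc ... f` lines are copied from the reply as written, the function names `export_lines` and `import_lines` are made up for illustration, document names are assumed to contain no spaces or quote characters, and each exported file is assumed to be named after its document.

```python
def export_lines(names, bad_names):
    """Build shell lines that fetch each healthy document and print it to a file."""
    lines = []
    for name in names:
        if name in bad_names:
            continue  # step 2: leave the problematic document out of the list
        lines.append(f"getdoc {name}")  # step 3: load the document by name
        lines.append(f"print {name}")   # write the result to a file of the same name
    return lines


def import_lines(names, bad_names):
    """Build shell lines that load each exported file into the new container."""
    return [
        f"putdoc {name} {name} f"  # step 4: 'f' form as shown in the reply
        for name in names
        if name not in bad_names
    ]
```

Writing `"\n".join(export_lines(names, bad))` to a file gives a script to run against the old container, and the same for `import_lines` against the new one. Each script still needs whatever container-opening command your setup uses at the top; that is left out here on purpose.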
                        • 9. Re: Segfaulting on full container operations - debugging help
                          671148
                          Hi Rucong -

                          I found the id for the document that's the problem child but I can't do anything with it. Anything that involves node traversal, it seems (which is everything), causes the segfault. I wish I had more information but I literally can't get to anything about the data. If you have ideas on how to get info on the doc without actually accessing it, please let me know :)

                          I am writing a custom script to work around this but it's painful (~50000 docs is not pretty). I'll post the full script later in case anyone else needs it. This data is pretty old. I even have one document that exports from the db fine and won't reimport with the exact same xml. Odd. That's on the list for tomorrow, to figure out wtf is going on there.

                          While I have you though, I have been using a dbxml script, and occasionally a document import fails, which stops the import of the rest of the documents since the script stops (this upgrade has to happen in one swoop - live data pain). It would be nice if the script didn't completely stop when one line of the shell script failed. Is there a flag for that? A debug level? That way I can import everything I can and then look at the problem kids later. I really want the db to not be crashing all the time, so if I have to just skip some data when reimporting to get things running smoothly, that's fine.

                          Thanks!

                          Liz
                          • 10. Re: Segfaulting on full container operations - debugging help
                            655560
                            Hi Liz,

                            Please try "setIgnore" in dbxml shell:
                            setIgnore         - Tell the shell to ignore script errors
                            Best regards,
                            Rucong Zhao
                            Oracle Berkeley DB XML
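For a generated import script, the command above can simply be placed on the first line. The sketch below does that and also lists which documents did not make it into the new container, so they can be looked at later. Note the assumptions: the `on` argument in `setIgnore on` is a guess at the argument form (check the shell's own help for `setIgnore` before relying on it), and both function names are hypothetical.

```python
def tolerant_script(lines, ignore_line="setIgnore on"):
    """Return the script lines with the error-ignoring command placed first.

    The default 'setIgnore on' is an assumed argument form, not confirmed
    by the thread; pass a different ignore_line if your shell expects one.
    """
    if lines and lines[0].strip().lower().startswith("setignore"):
        return list(lines)  # already present, leave the script as it is
    return [ignore_line, *lines]


def missing_after_import(expected_names, imported_names):
    """Return the names that were expected but are absent after the import."""
    return sorted(set(expected_names) - set(imported_names))
```

The list of imported names would have to come from the new container itself, for example from a name listing taken after the import has finished.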