13 Replies. Latest reply on Aug 20, 2009 8:14 AM by 545142

    AddIndex - failed to allocate memory

    611511
      dbxml> addindex "" pkeyword node-element-substring-string
      Adding index type: node-element-substring-string to node: {}:pkeyword
      stdin:26: addIndex failed, Error: Buffer: failed to allocate memory

      The database is 100 MB. The pkeyword element occurs very often and contains many words.

      Is there a way to adjust this in a DB_CONFIG file?
      And how does one determine the numbers needed for any memory parameters?
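
      For reference, BDB cache settings can go in a DB_CONFIG file placed in the environment home directory, although (as the replies below explain) this particular failure comes from an in-process buffer used while indexing rather than from the BDB cache, so DB_CONFIG may not help here. A minimal sketch, with an illustrative size rather than a recommendation -- the line below requests a single 512 MB cache region (0 GB plus 536870912 bytes, 1 region):

      set_cachesize 0 536870912 1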
        • 1. Re: AddIndex - failed to allocate memory
          Gmfeinberg-Oracle
          Walt,

          Is your container one large document? How large is your largest document?

          Regards,
          George
          • 2. Re: AddIndex - failed to allocate memory
            611511
            We have 4 to 6 containers - each holding 1 or more documents.
            The documents are 100-150 MB (yes, very large).
            • 3. Re: AddIndex - failed to allocate memory
              Gmfeinberg-Oracle
              Walt,

              There is a data structure used to hold indexes while indexing occurs on a given document
              and that structure can run out of memory if (1) the document is very large and (2) there are a lot of index entries added and/or deleted from the document during indexing. Reindexing is more expensive than initial indexing because it potentially removes entries as well.

               I'm not sure I have a good answer for you other than to try creating a new container with the desired indexes, then re-inserting the large documents.
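
               In outline, that might look like this with the Python bindings (a rough sketch, not a tested recipe; the container name, index node name and document are placeholders, and error handling is omitted):

               from dbxml import *

               mgr = XmlManager()
               uc = mgr.createUpdateContext()

               # create a fresh, empty container with node indexes
               cont = mgr.createContainer("new.dbxml", DBXML_INDEX_NODES)

               # declare the desired indexes while the container is still empty,
               # so nothing has to be reindexed at this point
               cont.addIndex("", "pkeyword", "node-element-substring-string", uc)

               # re-insert the large documents; they are indexed as they are added
               cont.putDocument("bigdoc.xml", open("bigdoc.xml").read(), uc)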

              This is a known potential issue but I think you are the first user to (apparently) run into it.

              Regards,
              George
              • 4. Re: AddIndex - failed to allocate memory
                545142
                 I've run into this as well. I'm compiling a lot of small documents (average 1.5 KB) into one large-ish document (around 55 MB), which I want to supply as the only source document for a read-only DB XML container (Windows, DB XML 2.4). Then I create the container (DBXML_INDEX_NODES), add indexes, and try to add the one large document. There are several substring indexes. Doing this via Python, it dies with the message:

                dbxml.XmlNoMemoryError: XmlException 20, Error: Buffer: failed to allocate memory

                 What's the alternative? The size will remain pretty static, since the small source XML documents will be modified in minor ways only and no new documents will be created. Unless I misunderstood, I don't think your suggestion to Walt will help here, since I'm already creating the container afresh.

                The odd thing is that I can load the same container successfully on a slightly older machine with half the memory resources.

                Tim
                • 5. Re: AddIndex - failed to allocate memory
                  Gmfeinberg-Oracle
                  Tim,

                   This error has nothing to do with physical memory resources and everything to do with virtual memory resources. The answer may lie in figuring out what's different on your older machine -- e.g. version of Python, BDB cache size, any other possible consumers of VM in the process, etc. Do you have transactions configured? That will consume additional VM.

                   Also, to verify that this is in fact the same error, you should monitor the process's VM and observe that it grows to the point where allocations fail. If you do not observe that, the problem may be elsewhere.

                  Regards,
                  George
                  • 6. Re: AddIndex - failed to allocate memory
                    679878
                     Just for information, I have the same problem too. It happens when I add a 167 MB document to a brand new container with an index on a node occurring 3,524,018 times in the document (quite a lot, for sure).

                    Jean-Philippe
                    • 7. Re: AddIndex - failed to allocate memory
                      545142
                      Hi

                      I never really solved this problem -- I simply omitted the index that was causing the XmlNoMemoryError -- but I am returning to this in the hope of solving it and would appreciate any pointers.

                       The element I am trying to index occurs in the one 55 MB document, which is the sole document added to the container. This element occurs 45,000 times, and when the contents are added together they total about 10,000,000 characters, or roughly 10 MB, since it's UTF-8 and unusual characters are rare. I am using this as a 'full text' element and need a substring index on it. I know substring indexes come with demanding requirements, but disc space and indexing time are not a problem (the database is searched as a read-only container and is rebuilt daily after hours). My use of dbxml is very simple -- no transactions and not even an explicit environment. I've also seen suggestions in this context to use an alternative full-text search engine, but it would be much simpler for me if I could resolve the memory error.

                      I can add this index from the dbxml shell after inserting the document, but when I try it via python it fails with "dbxml.XmlNoMemoryError: XmlException 20, Error: Buffer: failed to allocate memory". I've tried various things like synching to disc between operations or indexing the problem element in a discrete step (closing and reopening the container) without success.

                       My RAM (4 GB) and virtual memory allocation (6 GB) are more than adequate. The Python process dies at around 700 MB memory + 730 MB VM. The shell uses about 750 MB memory + 800 MB VM when it adds the index successfully. Naturally I want to schedule the task and run it via Python, not add the index manually.

                       My questions are: could I get the dbxml shell to run automatically and perform some sort of crude batch job (on Windows)? Alternatively, would increasing the cache size actually help? Is it possible to change the cache size without using environments? I have a lot of simple querying code, and environments look like they would probably complicate things.
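
                       (For reference, the kind of environment setup I'm asking about would, as I understand it, look roughly like this -- a sketch using the bsddb3 bindings; the home directory, cache size and container name are placeholders:)

                       from bsddb3.db import *
                       from dbxml import *

                       env = DBEnv()
                       env.set_cachesize(0, 512 * 1024 * 1024, 1)     # 512 MB cache in one region
                       env.open("dbenv_home", DB_CREATE | DB_INIT_MPOOL, 0)

                       mgr = XmlManager(env, 0)                       # manager on top of the environment
                       cont = mgr.openContainer("big.dbxml")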

                      Tim
                      • 8. Re: AddIndex - failed to allocate memory
                        597600
                        Hi Tim,

                        some remarks:

                         (1) Consider reshaping your unwieldy 55 MB XML document into several documents. I think the kind of query you can do on a 55 MB document could also be done on several smaller documents. 55 MB is simply quite big. Does it really have to be like this?

                         (2) A substring index is not a full-text index. I found it rather unsuitable for full-text search, so I added a separate indexer/searcher. I'm sure you could find one for Python.

                        (3) You can use the dbxml shell with -s your-script.txt:
                        milu@colinux:~ > cat > Werkstatt/dbxml-shell/test.txt
                        open tv.dbxml
                        query { collection()[//Monitor] }
                        printnames
                        print
                        quit
                        milu@colinux:~ > dbxml -h dbenv -s Werkstatt/dbxml-shell/test.txt
                        Joined existing environment
                        chmon-2009-08-11.2009-08-11
                        chmon-2009-08-13.2009-08-13
                        ...
                        BTW, all standalone utilities (including the dbxml shell) are documented. There is a link from the main documentation page of your copy of the DBXML distribution.

                        Michael Ludwig
                        • 9. Re: AddIndex - failed to allocate memory
                          545142
                          Hi Michael

                          Thanks for your comments. I had missed the dbxml -s option which I could use as a last resort. That's useful. I'm coming back to dbxml after last working with it closely a year or so ago, so excuse my cobwebs.

                           Yes, my initial approach was to load a lot of small documents rather than one big one, but I ran into indexing problems there as well, and followed the suggestion to change to a single document, indexing nodes. I can see now how I might avoid this particular memory error with smaller documents, but I'm not sure it would be worth my while changing all my building and querying code. The sole reason for adding an index on this element is to speed up a so-called full-text search option which currently takes 4-5 seconds because it doesn't have a substring index. But I have only about the same number of users, and they don't use that search option very often. Most of the time they're using field-specific queries -- ID, work title or author name etc. The performance issue on 'full-text' seems to bother me more than it does them.

                           I imagine that adding a proper full-text indexing and searching layer is not going to be trivial. It will also make the searching code something of a hybrid, since there I'm relying heavily on node-specific XQueries and dbxml indexes. So perhaps you'll understand why I'm reluctant to go that route, though I agree that would be the ideal indexing scenario. I'm happy to be corrected if this is simpler than I expect. As it stands, the 'full-text' element I'm trying to index is an element which I populate with relevant content as a preprocessing step. Not all of the document needs indexing in any case (e.g. some system-related attributes of no interest to users). So strictly speaking I don't /want/ a full-text index of the entire document. It is more useful to be able to choose the content and put it in an index element as a text string.

                          I'd like to understand what's causing this memory error. Why doesn't it occur when using the shell?

                          Tim
                          • 10. Re: AddIndex - failed to allocate memory
                            597600
                            Tim,

                             hooking in another piece of software for indexing and searching will (a) cost you time to learn about this new technology, (b) make the application code a little more complex, (c) introduce an additional dependency, and (d) result in improved search if and only if you arrange your documents so that the identifiers returned by the indexer can be used by some DBXML index, which, in general, should be quite feasible. (For example, your index could return document names, which you then use with getDocument().)
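
                             In outline, the lookup side could be as small as this (a sketch; external_fulltext_search is a hypothetical call into whatever indexer you choose, and the container name is a placeholder):

                             from dbxml import *

                             mgr = XmlManager()
                             cont = mgr.openContainer("records.dbxml")

                             # the external engine returns the document names (or IDs)
                             # that were stored alongside the indexed text
                             names = external_fulltext_search("some query terms")

                             for name in names:
                                 doc = cont.getDocument(name)   # fetch the matching record from DBXML
                                 print(doc.getName())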

                            You're right: If your search is good enough, why change it?

                            As for the memory error, the C++ docs simply say: XmlException::NO_MEMORY_ERROR - An attempt to allocate memory failed. I'm not an expert, but isn't this simply the OS cutting off further memory supply to the application? Maybe some threshold has been reached. Does ulimit -a yield any clues?

                             Maybe Python adds some significant overhead compared to the dbxml shell. Maybe you're parsing the giant document in Python into a tree representation using libxml2, in addition to having the Berkeley library parse it? That might easily add 500 MB of memory (rule of thumb: memsize = filesize * 10). If so, maybe you can use the XmlEventWriter interface to get around this?

                            Michael Ludwig
                            • 11. Re: AddIndex - failed to allocate memory
                              545142
                               I'll try to find some sort of Windows equivalent to ulimit -- thanks for the suggestion.
                               > hooking in another piece of software for indexing and searching will (a) cost you time to learn about this new technology, (b) make the application code a little more complex, (c) introduce an additional dependency, and (d) result in improved search if and only if you arrange your documents so that the identifiers returned by the indexer can be used by some DBXML index, which, in general, should be quite feasible. (For example, your index could return document names, which you then use with getDocument().)
                               Time is the main limitation, but this would be worth investigating and perhaps planning for, as we'll need to accommodate bigger datasets in future. Currently my queries return elements which represent discrete records of interest to the user (the 55 MB XML file being a collation of 45K records of identical structure). So would this work roughly as follows: the searching code, when processing a full-text query, would query an index (perhaps also an embedded DB, or some sort of static index) which exists alongside the dbxml database. That could return an ID number for the record containing the full-text match. The full-text index would be created separately as a preprocessing step. I avoid complexities related to full-text index updates because the database is periodically compiled afresh from the source XML records anyway. Is it that simple -- in outline, anyway? I have not worked with a full-text index before, hence the basic questions.

                              Are there any recommended packages for such a full-text index?
                               > Maybe Python adds some significant overhead compared to the dbxml shell. Maybe you're parsing the giant document in Python into a tree representation using libxml2, in addition to having the Berkeley library parse it? That might easily add 500 MB of memory (rule of thumb: memsize = filesize * 10). If so, maybe you can use the XmlEventWriter interface to get around this?
                               No, I've tried getting Python to do nothing other than index that 'full text' element, and it fails with the memory error. I can go to the relevant call in the Python source, but this in turn refers to the compiled Python bindings, which I can't penetrate.

                              Tim
                              • 12. Re: AddIndex - failed to allocate memory
                                597600
                                So you have 45000 records in your 55 MB file. I think it would make perfect sense to split the big file up. That'll solve your memory problem. Also, getDocument() will start making sense.
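
                                 In rough outline, splitting and loading could look like this (a sketch; the record element name and id attribute are guesses at your structure, and error handling is omitted; the ElementTree parse of the 55 MB file is itself memory-hungry, but it is a one-off preprocessing step):

                                 import xml.etree.ElementTree as ET
                                 from dbxml import *

                                 mgr = XmlManager()
                                 uc = mgr.createUpdateContext()
                                 cont = mgr.createContainer("records.dbxml", DBXML_INDEX_NODES)

                                 tree = ET.parse("big-55mb-file.xml")
                                 for rec in tree.getroot().findall("record"):      # one element per record
                                     name = rec.get("id")                          # unique per-record identifier
                                     cont.putDocument(name, ET.tostring(rec), uc)  # one small document per record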

                                 Adding an index would work just as you outlined. It works like that for me. Updates would always go to both the database and the index.

                                 As for recommendations, I can only talk about SWISH-E and Apache Solr, which are the only ones I've used. With SWISH-E, you're confined to 8-bit encoding schemes (like Latin-1). It's easy to set up, but I don't know about Python bindings. Apache Solr is bigger and more mature; that's what I'm using now, as a Java server accessed via HTTP. I'd recommend Apache Solr. But take a look around - maybe there are more suitable packages for Python.
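
                                 With Solr, the query side from Python could be as small as this (a sketch that could stand in for the external lookup mentioned earlier; the URL, field names and row count are placeholders, and it assumes Solr's JSON response writer):

                                 import json, urllib, urllib2

                                 def external_fulltext_search(terms):
                                     # placeholder Solr URL and schema: 'id' is assumed to hold
                                     # the DBXML document name for each indexed record
                                     params = urllib.urlencode({"q": terms, "fl": "id", "wt": "json", "rows": "50"})
                                     resp = urllib2.urlopen("http://localhost:8983/solr/select?" + params)
                                     data = json.load(resp)
                                     return [doc["id"] for doc in data["response"]["docs"]]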

                                Michael Ludwig
                                • 13. Re: AddIndex - failed to allocate memory
                                  545142
                                   Thanks for the useful feedback, Michael. BTW, I did start off adding the records as individual documents and then switched to a single document, but it's too far back now for me to recall why. That's a good point, though -- the question could use review.

                                  Tim