Forum Stats

  • 3,781,447 Users
  • 2,254,521 Discussions
  • 7,879,706 Comments

Discussions

Indexing time and buffer memory

673401
673401 Member Posts: 19
edited Jan 27, 2009 7:52AM in Berkeley DB XML
Hi,

I'm trying to index a large database and I found that the indexing time was really slow (to index some hundred megabytes of data it take hours) so I run some tests.

Those test try to prove that if the XML data is smaller than the total buffer memory set, the time to index is relatively low. However for XML data bigger than the buffer memory, the indexing time becames incredibly large.
The buffer is set to 512Mb and I create one single index.

42Mb XML data => 113s to create one index
92Mb XML data => 242s to create one index
150Mb XML data => 582s to create onde index
539Mb XML data => one hour and continuing....

I conclude that:
- during the indexing, the disk is constantly being accessed
- The time is aceptable if the size of data is smaller then the buffer memory

So my question is if the Berkeley DB XML tries to load all the data from to container to RAM in order to generate the indexes? If that is the case, it seems logic that when the data is bigger that the buffer memory it takes much more time ( the time is not linear as expected ).

If that is the case, how is it possible to index 100Gb data in a "measurable" time?

---
Jorge Teixeira
Tagged:
«1

Answers

  • 524395
    524395 Member Posts: 149
    Jorge,

    I can't speak to your timing, I'm sure that someone on the DBXML team will ask question that will help find the source of your slowdown (cache size? trickle thread?) But I can tell you a few things from my experience using DBXML and as someone who worked on Berkeley DB code.

    First, I have a test project that has about 4 GB of data to begin with. Hundreds of thousands of documents each with anywhere from 0-10,000 items. These are HTTP log files, each file is a document, each request an element in the document. I have about eight indexes on this single container. I run on simple laptop hardware, years old in fact, 1GHz Athelon with 2GB RAM on a slow IDE disk (a four year old Acer Ferrari 4005 WLMi running OpenSolaris 2008.11 snv_101b using the latest DBXML and DB 4.7 rather than 4.6 via the Python API). It takes less than 10 minutes to create all the indexes. Time varies based on what I'm doing on the laptop (web browsing or not, reading email, etc. this is my work laptop). I guess what I'm saying is, your timings are suspect and I guess can be optimized quite a bit.

    In terms of your specific DB questions. No, DB maintains only the portions of the BTREE in cache that are required at the time. It tries to keep the cache "hot" keeping the most accessed pages around as any database would. DBXML uses DB so it benefits from that. You should investigate what is happening in DB using 'db_stat'. See if you are having a lot of cache misses which would mean going to disk (read: slowing things down) a lot or not. Sizing and optimizing DB parameters make a huge difference in run time.

    good luck, keep investigating, tuning is tricky, we're working on it, :)

    -greg

    Product Manager, Berkeley DB Products, [email protected]
  • 673401
    673401 Member Posts: 19
    The cash hitting is high (Requested pages found in the cache (99%)) so I guess the cache is correctly configured.

    However, every time i create an index the disk is constantly being used...

    I finally got the indexing time for the 539Mb XML data and it was 5443 seconds!

    This specific container is composed of 2000 documents. The index creation is just preformed after the insertion of all the documents is complete.

    I'm using Berkeley DB XML 2.4.13 and Berkeley DB 4.6.21, under Linux Ubuntu 7.10 Gutsy Gibbon (32bits) and a intel quadcore Q9300 @ 2.5Ghz and 4Gb RAM.

    Any ideas?
  • Indexes should be created before document insertion, so the documents can be indexed while they are being inserted. Creating the index after insertion requires that all the documents be read from disk again (which is where the constant disk usage is coming from) so they can be indexed.

    Lauren Foutz
  • 524395
    524395 Member Posts: 149
    You have 4 cores, how many processes/threads are you running to load the data? More than 4 is likely too many.

    Also, if you're using transactions (and I am guessing that you are), you might run another thread to setup a mempool trickle (http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/memp_trickle.html) and reduce the durability of the transactions a bit (http://www.oracle.com/technology/documentation/berkeley-db/db/api_c/env_set_flags.html#DB_TXN_WRITE_NOSYNC) to take advantage of the operating system's file I/O buffers.

    -greg
    524395
  • 673401
    673401 Member Posts: 19
    Hi,

    I run some tests again (buffer set to 500Mb) and the results shown that indexing the container previously to load the data is even worst:

    26Mb XML data => 68s (index after load) => 79s (index before load)
    42Mb XML data => 113s (index after load) => 158s (index before load)
    92Mb XML data => 242s (index after load) => 533s (index before load)
    539Mb XML data => 5443s (index after load) => 8444s (index before load)

    In what concerns to the number of processes, actually I'm only running. So it seems once more the cache if afecting the indexing performance....
  • 673401
    673401 Member Posts: 19
    I'm runing some more tests :) and this time I've tried something different. Since the last tests were run in a LVM disk, I tried to do the same tests in a ext3 disk with no LVM and the results were incredible better!

    Do you think there is something to do with the LVM performance?
  • 524395
    524395 Member Posts: 149
    The "multiprocessing" module was called the "processing" (PyProcessing) module (http://pyprocessing.berlios.de/). Try that, it works with Python 2.4 and was later renamed and added to Python 2.5 and Python 3.

    -greg
  • 673401
    673401 Member Posts: 19
    I understand multiprocessing can improve the indexing performance, but that improvement is minimal compared to the decrease in performance I'm having if the container file is bigger than 2 Gb.

    I'm wondering if there is any kind of configuration or so, since a 1.8Gb container takes about 200 seconds to index, and a 2.3Gb container (in the exact same conditions and the same index) takes more than 4000 seconds, twenty times more!!

    Thanks,

    Jorge
  • Lucas Vogel
    Lucas Vogel Member Posts: 87
    Has anyone tried creating a new, indexed container, and moving documents from the old into the new? Assuming you had the hardware resources available to support such a transition, wouldn't that be a little faster?
  • Gmfeinberg-Oracle
    Gmfeinberg-Oracle Posts: 1,328 Employee
    Jorge,

    You never said if you were using a node container or wholedoc or what language you are using (language won't matter but container type may). Are you reindexing inside a transaction? What indexes are you using? Substring indexes are quite expensive and large and will generate the most activity. Also why can't you increase your cache size? Have you tried going to 1G?

    Based on your experiment with LVM vs ext3 it appears that you end up I/O bound on writes. Can you post the results using ext3? The Berkeley DB cache is backed by the file system so if it pages a lot there will be file system activity. If you are using transactions that will also generate I/O to the log file. The db_stat utility can be used to give you an idea of how much paging (reading AND writing) is going on during your indexing operations that take a long time.

    Ultimately it is possible to use large amounts of data so it'd be good to get to the bottom of this.

    Regards,
    George
This discussion has been closed.