Working with large data

edited Aug 21, 2006 10:26PM in Berkeley DB
I have to create a database with a specific distribution of key and data
sizes. Keys are a few bytes each, while data items range widely, from a few
bytes to several megabytes, with an average size near 64K. The overall size
of the populated database is several gigabytes. You can think of the
key/data pairs as forming a context index (inverted file) for a large set of
Russian/English texts.

Could you recommend the best way to configure such a data store? I am looking
for the best random read speed for key/data pairs in the database and a good
enough write speed (for updating the "context index").


    The most important setting in your case is the cache size. The larger the cache, the better the performance, especially for random, read-oriented workloads. Documentation for the cache configuration API is here:
    An article about tuning the cache size is available here:
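
    As a rough illustration of the cache setting: DB_ENV->set_cachesize takes the size as three parts (whole gigabytes, additional bytes, and the number of cache regions), and the same three values can be given on a set_cachesize line in a DB_CONFIG file. The sketch below only does that arithmetic; the helper names cachesize_args and db_config_line are made up for this example and are not part of Berkeley DB.

```python
GIB = 1 << 30  # bytes in one gigabyte, as set_cachesize counts them


def cachesize_args(total_bytes, ncache=1):
    """Split a cache budget into the (gbytes, bytes, ncache) triple.

    Hypothetical helper: it only prepares the three values that
    set_cachesize expects, it does not call Berkeley DB.
    """
    if total_bytes <= 0 or ncache < 1:
        raise ValueError("cache size and ncache must be positive")
    gbytes, extra_bytes = divmod(total_bytes, GIB)
    return gbytes, extra_bytes, ncache


def db_config_line(total_bytes, ncache=1):
    """Format the same triple as a DB_CONFIG line."""
    gbytes, extra_bytes, n = cachesize_args(total_bytes, ncache)
    return f"set_cachesize {gbytes} {extra_bytes} {n}"


if __name__ == "__main__":
    # Example budget of 1.5 GB; pick a value that fits your machine.
    print(cachesize_args(3 * GIB // 2))
    print(db_config_line(3 * GIB // 2))
```

    How much memory to give the cache depends on your machine and on how much of the index is "hot"; the db_stat output is the way to check whether the value you chose is working.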

    The choice of database format (access method) will also have an impact. Given your description, I suggest that hash is likely the best choice, since the data access will be random. You should test with both hash and btree. An article describing the benefits and drawbacks of both is here:
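
    One possible way to run the hash versus btree comparison is a small timing harness like the sketch below. It assumes each database is reachable through a mapping-like object (anything supporting store[key]), which is an assumption about your language binding and not something Berkeley DB itself prescribes; time_random_reads is a made-up name. Run it against one handle per access method, with the same keys and a data set larger than the cache, otherwise you mostly measure the cache.

```python
import random
import time


def time_random_reads(store, keys, n_reads=10_000, seed=0):
    """Time n_reads lookups of randomly chosen keys in a mapping-like store.

    Returns (elapsed_seconds, total_bytes_read). The fixed seed makes
    both access methods see the same sequence of keys.
    """
    rng = random.Random(seed)
    sample = [rng.choice(keys) for _ in range(n_reads)]

    total_bytes = 0
    start = time.perf_counter()
    for key in sample:
        total_bytes += len(store[key])  # the lookup being measured
    elapsed = time.perf_counter() - start
    return elapsed, total_bytes


if __name__ == "__main__":
    # Stand-in data; replace the dict with a real database handle.
    demo = {b"k%d" % i: b"x" * 64 for i in range(1000)}
    seconds, nbytes = time_random_reads(demo, list(demo), n_reads=5000)
    print(f"{seconds:.4f}s for {nbytes} bytes")
```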

    Then you might want to adjust the page size. Given that your data items are generally large, a bigger page size will probably result in better performance. The API is here:
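
    To get a feel for why page size matters with items this large, here is a back-of-the-envelope sketch. Berkeley DB page sizes are powers of two from 512 bytes to 64KB, so even the largest page cannot hold your biggest items in one piece and they will span several pages. The per-page overhead of 32 bytes used below is an assumed round number for illustration, not the real on-disk figure, and both function names are made up.

```python
MIN_PAGESIZE = 512
MAX_PAGESIZE = 64 * 1024


def is_valid_pagesize(pagesize):
    """True if pagesize is a power of two between 512 bytes and 64KB."""
    in_range = MIN_PAGESIZE <= pagesize <= MAX_PAGESIZE
    return in_range and (pagesize & (pagesize - 1)) == 0


def pages_for_item(item_bytes, pagesize, overhead=32):
    """Estimate how many pages one data item occupies.

    overhead is an ASSUMED per-page header size, so treat the result
    as an approximation only.
    """
    if not is_valid_pagesize(pagesize):
        raise ValueError("page size must be a power of two, 512..65536")
    usable = pagesize - overhead
    return -(-item_bytes // usable)  # ceiling division


if __name__ == "__main__":
    for pagesize in (4096, 16384, 65536):
        # 64K is the average item size from the question, 1MB a large one.
        print(pagesize,
              pages_for_item(64 * 1024, pagesize),
              pages_for_item(1 << 20, pagesize))
```

    Fewer pages per item generally means fewer separate reads per lookup, which is the reason to try the larger page sizes first and then compare them with db_stat and your own timings.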

    The db_stat utility is a very useful tool for tuning your database. Documentation can be found here:

    If you have any specific questions I will be glad to help.

