3 Replies Latest reply: Aug 22, 2014 8:49 PM by 2736814

    Lots of write I/O in read-only workload

    2736814

      Hi,

       

      I get some very surprising behavior on a read-only workload (only gets). First, I create a database by inserting 10M keys into a btree. Then I sync, close the database, and exit. At this point no I/O happens anymore, and everything looks good.

       

      Now, when I do lookups in this database, everything looks good for a while (about 15M lookups or so). Then performance suddenly drops:

      first 10 M lookups: 0.508826M lookups/s

      next 10 M lookups: 0.207377M lookups/s

      next 10 M lookups: 0.0563629M lookups/s

      next 10 M lookups: 0.0512832M lookups/s

       

      iotop shows that, once performance drops, the lookup process is writing dozens of MB per second to disk, but not reading anything since everything fits into the buffer pool. (Remember that I only do lookups.) The database directory contains the following files:

      259M Aug 21 04:02 access.db

      384K Aug 21 04:11 __db.001

      52M Aug 21 04:14 __db.002

      275M Aug 21 04:14 __db.003

       

      During the lookups the timestamps of __db.002 and __db.003 are constantly updated, indicating that these files are written to. ptrace does not show any read or write system calls, so the I/O seems to happen through mmap. I get this behavior with Ubuntu's Berkeley DB version 6.0 and a self-compiled version 6.1.19. It also happens with hashing instead of the btree, and when I open the database read-only. I assign 1 GB to the buffer pool.

       

      Is this expected behavior?

       

      Here's the insert program:

      #include <iostream>
      #include <iomanip>
      #include <vector>
      #include <db_cxx.h>

      using namespace std;

      int main(int argc, char** argv) {
         uint64_t n = 10e6;

         DbEnv env(DB_CXX_NO_EXCEPTIONS);
         Db* pdb = NULL;

         try {
            env.set_error_stream(&cerr);
            env.set_cachesize(1, 0, 0);   // 1 GB buffer pool
            env.open("/tmp/db", DB_CREATE | DB_INIT_MPOOL | DB_THREAD, 0);

            pdb = new Db(&env, DB_CXX_NO_EXCEPTIONS);
            pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE, 0);

            vector<uint64_t> keys(n);
            for (uint64_t i = 0; i < n; i++)
               keys[i] = i;

            // Insert n sequential keys; key and value are the same 4-byte integer.
            for (uint64_t i = 0; i < n; i++) {
               int k = keys[i];
               Dbt key(&k, sizeof(int));
               Dbt value(&k, sizeof(int));
               pdb->put(NULL, &key, &value, 0);
            }

            // Flush everything to disk before closing.
            pdb->sync(0);
            env.memp_sync(NULL);

            if (pdb != NULL) {
               pdb->close(0);
               delete pdb;
            }
            env.close(0);

         } catch (DbException& e) {
            cerr << "DbException: " << e.what() << endl;
            return -1;
         } catch (std::exception& e) {
            cerr << e.what() << endl;
            return -1;
         }

         return 0;
      }
      
      

       

      And here's the lookup program:

      #include <sys/time.h>
      #include <iostream>
      #include <vector>
      #include <algorithm>
      #include <iomanip>
      #include <cassert>
      #include <db_cxx.h>

      using namespace std;

      // Wall-clock time in seconds.
      static double gettime(void) {
        struct timeval now_tv;
        gettimeofday(&now_tv, NULL);
        return ((double)now_tv.tv_sec) + ((double)now_tv.tv_usec)/1000000.0;
      }

      int main(int argc, char** argv) {
         uint64_t n = 10e6;

         DbEnv env(DB_CXX_NO_EXCEPTIONS);
         Db* pdb = NULL;

         try {
            env.set_error_stream(&cerr);
            env.set_cachesize(1, 0, 0);   // 1 GB buffer pool
            env.open("/tmp/db", DB_CREATE | DB_INIT_MPOOL | DB_THREAD, 0);

            pdb = new Db(&env, DB_CXX_NO_EXCEPTIONS);
            pdb->open(NULL, "access.db", NULL, DB_BTREE, DB_CREATE /*| DB_RDONLY*/, 0);

            // Look up all keys in random order, 12 rounds of n lookups each.
            vector<uint64_t> keys(n);
            for (uint64_t i = 0; i < n; i++)
               keys[i] = i;
            random_shuffle(keys.begin(), keys.end());

            for (int nt = 1; nt <= 12; nt++) {
               uint64_t sum = 0;
               double start = gettime();
               int v;
               Dbt data;
               data.set_data(&v);
               data.set_ulen(sizeof(v));
               data.set_flags(DB_DBT_USERMEM);
               for (uint64_t i = 0; i < n; i++) {
                  int k = keys[i];
                  Dbt key(&k, sizeof(int));
                  assert(pdb->get(NULL, &key, &data, 0) != DB_NOTFOUND);
                  sum += v;
               }
               // Millions of lookups per second; sum is printed so the loop isn't optimized away.
               cout << (n/1e6)/(gettime()-start) << " " << sum << endl;
            }

            if (pdb != NULL) {
               pdb->close(0);
               delete pdb;
            }
            env.close(0);

         } catch (DbException& e) {
            cerr << "DbException: " << e.what() << endl;
            return -1;
         } catch (std::exception& e) {
            cerr << e.what() << endl;
            return -1;
         }

         return 0;
      }
      
      
        • 1. Re: Lots of write I/O in read-only workload
          Winter Zhang

          Hi,

           

          We will take a look and get back to you later.

           

          Regards,

          Winter, Oracle Berkeley DB

          • 2. Re: Lots of write I/O in read-only workload
            Winter Zhang

            Hi,

             

            I cannot reproduce the issue you mentioned using your code. In my runs on two machines, the 12 results are very close to each other, without such a big difference.

             

            My results are close to your third and fourth ones. Perhaps you collected the data using different flags to env.open in different runs. During my testing, removing the DB_THREAD flag increased the ops/second a lot, because the number of mutex operations decreased a lot. Also, adding DB_PRIVATE to env.open improves performance, since the cache is then kept in heap memory.
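
            For example, a minimal sketch of what that env.open change could look like (only the flag set differs from your original programs; whether you can drop DB_THREAD or add DB_PRIVATE depends on whether other threads or processes need to share the environment):

            // Sketch: private, heap-backed environment regions, no free-threaded handles.
            // DB_PRIVATE keeps the cache and statistics in process memory instead of the
            // memory-mapped __db.00* region files; DB_THREAD is only needed when handles
            // are shared between threads.
            DbEnv env(DB_CXX_NO_EXCEPTIONS);
            env.set_cachesize(1, 0, 0);
            env.open("/tmp/db", DB_CREATE | DB_INIT_MPOOL | DB_PRIVATE, 0);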

             

            You are right, the write operations happen through mmap. By default, Berkeley DB has statistics enabled; these statistics cover the cache, mutexes, etc. So whenever a get executes, the cache statistics change, and the mutex statistics change as well. These statistics live in the memory-mapped region files __db.002 and __db.003. The system writes the modified mmapped memory back to disk, and that's why you see the timestamps change.
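
            If you want to see the counters that are being updated, the environment handle can print them (a minimal sketch; the output goes to the environment's message channel, which can be redirected with set_message_stream):

            // Sketch: dump the statistics that every get touches.
            env.memp_stat_print(0);    // buffer pool (cache) statistics
            env.mutex_stat_print(0);   // mutex statistics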

             

            Regards,

            Winter, Oracle Berkeley DB

            • 3. Re: Lots of write I/O in read-only workload
              2736814

              Thanks! DB_PRIVATE does the trick for me; with it there's no more write I/O.