3 Replies. Latest reply: Feb 4, 2013 1:28 AM by vesh

    Direct I/O and Linux - any chance for it?

      It seems like Berkeley DB's support for O_DIRECT on Linux is far from perfect, at least on 2.6.28-11 x86_64. When I build the library with --enable-o_direct and use direct_db/direct_log, it gives all sorts of cryptic errors on startup. Searching the web, I'm not the only one with this problem.

      The thing is, the Linux FS cache sucks for servers. It tends to grow into all free memory, and then the system goes downhill: disks get thrashed beyond any reason. At the moment I see a load average of 9-10 with insane I/O waits after pumping just 3 GB of data into a BDB database from a Java app. And this is an 8-core box with 12 GB of RAM.

      On my desktop Windows PC (which has O_DIRECT working, 1 CPU, 4 GB RAM) the same thing does not create any performance issues and even completes faster.

      I'm trying to use Berkeley DB in a rather large-scale project, where BDB is to be used as the main data store/search index (a database of about 200 GB, updated with 2 GB of data each hour, which should be accessible online).

      Maybe someone from Oracle can help with getting O_DIRECT working? Or am I doing something horribly wrong here? :)

      * 2.6.28-11-server #42-Ubuntu SMP Fri Apr 17 02:45:36 UTC 2009 x86_64 GNU/Linux
      * 4 x Intel(R) Xeon(R) CPU E5345 @ 2.33GHz (8 cores total)
      * 12GB RAM, swappiness=0
      * DB is on RAID5, 15 MB/sec average writes.
      * db-4.8.24
      * configure --enable-java --enable-o_direct

      set_flags DB_TXN_NOSYNC
      set_flags DB_TXN_WRITE_NOSYNC
      set_flags DB_DIRECT_DB
      set_flags DB_DIRECT_LOG

      set_flags DB_LOG_AUTOREMOVE

      set_verbose DB_VERB_DEADLOCK
      set_verbose DB_VERB_RECOVERY

      set_lock_timeout 500000
      set_txn_timeout 500000

      set_lg_max 31457280
      set_lg_bsize 104857600

      set_cachesize 9 0 10

      set_lk_detect DB_LOCK_OLDEST
      set_lk_max_lockers 300000
      set_lk_max_locks 300000
      set_lk_max_objects 300000

      write: 0x7f5a5009d190, 8192: Invalid argument
      write: 0x7f57e5c4e378, 98: Invalid argument
      Exception in thread "main" java.lang.IllegalArgumentException: Invalid argument: write: 0x7f5a5009d190, 8192: Invalid argument
      write: 0x7f57e5c4e378, 98: Invalid argument
      at com.sleepycat.db.internal.db_javaJNI.Db_open(Native Method)
      at com.sleepycat.db.internal.Db.open(Db.java:449)
      at com.sleepycat.db.DatabaseConfig.openDatabase(DatabaseConfig.java:2106)
      at com.sleepycat.db.Environment.openDatabase(Environment.java:314)
      at com.sleepycat.compat.DbCompat.openDatabase(DbCompat.java:310)
      at com.sleepycat.persist.impl.PersistCatalog.<init>(PersistCatalog.java:183)
      at com.sleepycat.persist.impl.Store.<init>(Store.java:178)
      at com.sleepycat.persist.EntityStore.<init>(EntityStore.java:109)

      Please help.
        • 1. Re: Direct I/O and Linux - any chance for it?
          Oracle, Sandra Whitman

          Thanks for the post and apologies on the delay. I will
          read this over closely and get back to you.

          • 2. Re: Direct I/O and Linux - any chance for it?
            Andrei Costache, Oracle-Oracle

            We are aware of this issue. O_DIRECT support on Linux has always been problematic because of the complicated memory alignment requirements Linux imposes.
            By default, O_DIRECT support is turned off on Linux. To turn it on, users have to configure with --enable-o_direct when building Berkeley DB, but when they do that, calls to read and write fail (the "Invalid argument" error messages keep showing up).
            This is because the Berkeley DB buffers (cache and log buffers) are not aligned in memory in the way Linux expects.
            This has never worked in any version of Berkeley DB, which is why configure disables it for Linux.

            Back in the days of the 2.4 kernel, Linux had a rather strange and complicated requirement: buffers had to be aligned to boundaries that were multiples of the block size of the filesystem where the file resided. Hence, we decided this problem was intractable, because there was no way to work out what memory alignment was required for a particular file. Furthermore, the Berkeley DB binaries/libraries tend to be used on platforms other than the ones they were built on.
            Solaris, for example, doesn't have any such requirement for alignment to specific boundaries. With Linux 2.6 now widely used, at least the alignment requirement is clear (all buffers need to be aligned to a 512-byte boundary).
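To make the alignment rule concrete, here is a minimal sketch that checks the two failing writes from the original post against a 512-byte boundary. It assumes that in each "write: ..." line the hex value is the buffer address and the number after it is the transfer length; `direct_io_problems` is a hypothetical helper written for this illustration, not part of Berkeley DB.

```python
# Sketch: test logged writes against the 512-byte rule described above.
ALIGNMENT = 512  # value stated in the reply above


def direct_io_problems(address, length, alignment=ALIGNMENT):
    """Return the reasons a transfer would violate the alignment rule."""
    problems = []
    if address % alignment != 0:
        problems.append(
            f"address is {address % alignment} bytes past a {alignment}-byte boundary"
        )
    if length % alignment != 0:
        problems.append(f"length {length} is not a multiple of {alignment}")
    return problems


# (address, length) pairs copied from the error lines in the original post.
# Assumption: first field is the buffer address, second is the byte count.
logged_writes = [(0x7F5A5009D190, 8192), (0x7F57E5C4E378, 98)]

for address, length in logged_writes:
    print(hex(address), length, direct_io_problems(address, length))
```

Under that assumption, the 8192-byte write has an acceptable length but a buffer that starts 400 bytes past a boundary, while the 98-byte write fails on both address and length, which is consistent with the "Invalid argument" errors.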

            For reference this issue has been discussed in another thread:
            Re: BerkeleyDB-4.7.25 with the option O_DIRECT and  Invalid argument
            That thread also explains why the suggested patch has not been adopted. That said, if you really need to use O_DIRECT on Linux, that patch is required.
            We were advised by Linux experts that changing the kernel's default "swappiness" setting (specifically to 0) is generally preferable to using direct I/O on Linux.
            That is, setting /proc/sys/vm/swappiness to 0 resulted in far better performance.
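For anyone who wants to check or change the setting from a script, a minimal sketch follows. The helper names are hypothetical; writing the file needs root, and a value written this way does not survive a reboot.

```python
from pathlib import Path

SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set vm.swappiness (requires root; not persistent across reboots)."""
    Path(path).write_text(f"{value}\n")


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```

The same change can be made from a root shell with `sysctl -w vm.swappiness=0`.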

            Here is an interesting article on this, and also an article with Linus Torvalds's take on O_DIRECT:

            • 3. Re: Direct I/O and Linux - any chance for it?
              FYI, setting swappiness to 0 does not make a meaningful difference in performance for me.
              My test just writes 1,000,000 (key, value) pairs and commits, where each key is 5 bytes and each value is 4000 bytes.
              With default swappiness (60), time was real 2m42s
              With swappiness 0, time was real 2m40s
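To put those two timings in perspective, here is a rough back-of-the-envelope sketch. It counts only the raw key and value bytes and ignores B-tree, log, and page overhead, so the figures are a lower bound on what actually hit the disk.

```python
RECORDS = 1_000_000
KEY_BYTES = 5
VALUE_BYTES = 4000
PAYLOAD_BYTES = RECORDS * (KEY_BYTES + VALUE_BYTES)


def throughput_mb_per_s(elapsed_seconds):
    """Raw payload throughput in MB/s (1 MB = 10**6 bytes)."""
    return PAYLOAD_BYTES / elapsed_seconds / 1e6


default_run = 2 * 60 + 42  # seconds, swappiness 60
tuned_run = 2 * 60 + 40    # seconds, swappiness 0

print(f"payload: {PAYLOAD_BYTES / 1e9:.3f} GB")
print(f"swappiness 60: {throughput_mb_per_s(default_run):.2f} MB/s")
print(f"swappiness 0:  {throughput_mb_per_s(tuned_run):.2f} MB/s")
print(f"time saved: {(default_run - tuned_run) / default_run * 100:.1f}%")
```

Both runs move about 4 GB of payload at roughly 25 MB/s, and the gap between them is close to 1%, which is well within what run-to-run noise could explain.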