12 Replies Latest reply: Jun 14, 2007 10:48 AM by Greybird-Oracle

    DPL performance of bulk insertions with secondary index present

    578006
      I have a simple directed graph that I am trying to load from two flat files, and performance just falls off the cliff with secondary indexes in place.

      There are about 400,000 nodes and 1,500,000 relationships

      With the secondary keys in the classes below commented out, the load takes about 40 seconds.

      Having just one of them present, the load trundles along nicely for a while, and then goes from almost no cache misses (i.e. 20 misses after 1,000,000 inserts) to multiple misses per insertion.

      Running a 512M JVM, the cache grows to 319M well before performance degrades.

      The code is pretty simple, doing a putNoOverwrite for the insert, and calling
      commitNoSync() every 10,000 inserts.

      Is there something specific I should be doing to get better performance when doing bulk inserts?

      Many thanks.

      --Eric Mays

      @Entity
      public class Node {
          @PrimaryKey
          int node_id;
          String stuff;
      }

      @Entity
      public class Relationship {
          @PrimaryKey
          private long relationship_id;
          @SecondaryKey(relate = MANY_TO_ONE)
          private int node_id1;
          @SecondaryKey(relate = MANY_TO_ONE)
          private int node_id2;
      }
        • 1. Re: DPL performance of bulk insertions with secondary index present
          Greybird-Oracle
          Hi Eric,
          Is there something specific I should be doing to get
          better performance when doing bulk inserts?
          Yes, there are a couple things you can try.

          First, you said that you're doing a commit every 10k records. This means that you're using a very large transaction, which takes up a large part of the cache. For each record written (including each secondary) a lock object is stored in the cache.
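
To put a rough number on this, here is a minimal sketch that takes the statement above literally: one lock per primary record plus one per secondary entry, all held until the commit. The class and method names are hypothetical, and the per-lock memory cost is not given in this thread, so only the count is computed.

```java
class LockEstimate {
    /** Locks held by one open transaction: one per primary record plus one per secondary entry. */
    static long locksPerTransaction(long insertsPerCommit, int secondaryKeysPerEntity) {
        return insertsPerCommit * (1L + secondaryKeysPerEntity);
    }

    public static void main(String[] args) {
        // 10,000 Relationship inserts per commit, each with two secondary keys
        System.out.println(locksPerTransaction(10_000, 2)); // prints 30000
        // The same batch with the secondary keys commented out
        System.out.println(locksPerTransaction(10_000, 0)); // prints 10000
    }
}
```

Under that assumption, the two secondary keys triple the number of locks each 10,000-insert transaction holds.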

          So the first thing you could try is to use a non-transactional store -- do not call StoreConfig.setTransactional or pass false to this method. After performing the load, you can close the store and re-open it as a transactional store. In other words, data can be loaded non-transactionally and then used transactionally later.

          But even better would be to try our DeferredWrite mode. This mode was created for bulk load situations. In this mode, records are not logged until the cache fills. This reduces the logging of intermediate versions of the Btree index. To use this mode, call StoreConfig.setDeferredWrite(true) before opening the store.

          In DeferredWrite mode, durability is not guaranteed until you call EntityStore.sync(). When the load is complete, call sync(), close the store, and re-open it as a transactional store. This is the option that should give the best performance for a bulk insert.
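
The effect described above can be illustrated with a toy write-behind buffer. This is only a model of the idea, not how JE implements DeferredWrite: the class, its capacity parameter, and the use of integer keys as stand-ins for Btree nodes are all assumptions made for illustration. Repeated updates to the same entry are coalesced in memory, and nothing is written until the buffer fills or sync() is called.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy write-behind buffer; a model of the idea only, NOT JE's implementation. */
class DeferredWriteModel {
    private final Map<Integer, String> dirty = new LinkedHashMap<>();
    private final int capacity;
    private long logWrites = 0;

    DeferredWriteModel(int capacity) { this.capacity = capacity; }

    /** Keep only the newest version in memory; flush when the buffer is full. */
    void put(int key, String value) {
        dirty.put(key, value);
        if (dirty.size() >= capacity) flush();
    }

    /** Analogous to sync(): everything still buffered is written out. */
    void sync() { flush(); }

    private void flush() {
        logWrites += dirty.size();
        dirty.clear();
    }

    long logWrites() { return logWrites; }

    public static void main(String[] args) {
        DeferredWriteModel model = new DeferredWriteModel(1_000);
        // 10,000 updates that keep touching the same 100 entries
        for (int i = 0; i < 10_000; i++) model.put(i % 100, "v" + i);
        model.sync();
        System.out.println(model.logWrites()); // prints 100
    }
}
```

Logging every update immediately would have produced 10,000 writes in this model; deferring them leaves only the final version of each entry to be written at sync time.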

          Mark
          • 2. Re: DPL performance of bulk insertions with secondary index present
            578006
            Hi Mark,

            Thanks for your help. With this change, the load now completes in about 25 minutes with both secondary indexes configured. I would still like to see if I can speed this up.

            I started looking into deferring secondary index creation until after the load completes. I see that one can create a SecondaryDatabase below the DPL and then create a SecondaryIndex in the DPL from that. Is that the way to go? Or is there some way to programmatically change the entity model and call evolve?

            When reading the doc on the SecondaryDatabase, I was a bit concerned about the following comment.

            Note that the associations between primary and secondary databases are not stored persistently. Whenever a primary database is opened for write access by the application, the appropriate associated secondary databases should also be opened by the application. This is necessary to ensure data integrity when changes are made to the primary database.

            I'm guessing the DPL takes care of this when the index is created in the entity model, but not clear how this works otherwise.

            Thanks, Eric
            • 3. Re: DPL performance of bulk insertions with secondary index present
              Greybird-Oracle
              I started looking into deferring secondary index
              creation until after the load completes. I see that
              one can create a SecondaryDatabase below the DPL and
              then create a SecondaryIndex in the DPL from that. Is
              that the way to go? Or is there some way to
              programmatically change the entity model and call
              evolve?
              I'm not sure whether deferring secondary population will improve performance, but it is something you can try.

              Creating the SecondaryDatabase explicitly is one possible solution, but you won't be able to take advantage of DPL bindings or annotations if you do this -- you'll have to implement your own bindings.

              I'd like to suggest another option to try first:

              1) Define your classes with all members present but without the @SecondaryKey annotations
              2) Load the data
              3) Add the @SecondaryKey annotations and open the store.

              In step (3) the DPL will create the secondary databases and populate them at the time you open the store and call getPrimaryIndex. It does this by reading through the entire primary index in primary key order and inserting the necessary records in the secondary indices for each primary record that it reads.
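
The population pass described above can be sketched with plain Java collections. This is an in-memory analogy rather than DPL code: the TreeMap stands in for the primary index, the nested Relationship class mirrors the fields from the original post, and the method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SecondaryPopulation {
    /** Stand-in for the Relationship entity from the original post. */
    static final class Relationship {
        final long relationshipId;
        final int nodeId1;
        final int nodeId2;

        Relationship(long relationshipId, int nodeId1, int nodeId2) {
            this.relationshipId = relationshipId;
            this.nodeId1 = nodeId1;
            this.nodeId2 = nodeId2;
        }
    }

    /** Build a MANY_TO_ONE index on node_id1 with one pass over the primary in key order. */
    static TreeMap<Integer, List<Long>> buildNodeId1Index(TreeMap<Long, Relationship> primary) {
        TreeMap<Integer, List<Long>> secondary = new TreeMap<>();
        for (Map.Entry<Long, Relationship> entry : primary.entrySet()) {
            secondary.computeIfAbsent(entry.getValue().nodeId1, k -> new ArrayList<>())
                     .add(entry.getKey());
        }
        return secondary;
    }

    public static void main(String[] args) {
        TreeMap<Long, Relationship> primary = new TreeMap<>();
        primary.put(3L, new Relationship(3L, 7, 9));
        primary.put(1L, new Relationship(1L, 7, 8));
        primary.put(2L, new Relationship(2L, 5, 7));
        System.out.println(buildNodeId1Index(primary)); // prints {5=[2], 7=[1, 3]}
    }
}
```

Each secondary key maps to the primary keys of every record that carries it, which is the shape a MANY_TO_ONE index takes once the scan finishes.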

              The drawback of this approach is that it requires that you restart the process between steps (2) and (3), in order to use an updated version of your persistent classes. I don't know whether that will be practical for your application or not.

              If this approach is not practical, or if you have a different algorithm in mind for loading the secondaries (perhaps pre-sorting the input data), please let me know and I'll suggest other possibilities.
              When reading the doc on the SecondaryDatabase, I was a
              bit concerned about the following comment.

              Note that the associations between primary and
              secondary databases are not stored persistently.
              Whenever a primary database is opened for write
              access by the application, the appropriate associated
              secondary databases should also be opened by the
              application. This is necessary to ensure data
              integrity when changes are made to the primary
              database.


              I'm guessing the DPL takes care of this when the
              index is created in the entity model, but not clear
              how this works otherwise.
              Yes, the DPL takes care of this and in fact does persistently store the relationships between primary and secondary databases. If you work with the lower level base API, it is up to you to open the secondary databases explicitly and to maintain the knowledge, either stored persistently somehow or implicit in your code, of the relationships.

              Mark
              • 4. Re: DPL performance of bulk insertions with secondary index present
                578006
                If the approach of adding the SecondaryKey annotations just requires closing the store, that's fine; I just don't see how one uses the entity model APIs to add the annotations.

                If you mean exiting the JVM, recompiling, and re-start, that's not going to fly.

                Thanks, Eric
                • 5. Re: DPL performance of bulk insertions with secondary index present
                  Greybird-Oracle
                  If you mean exiting the JVM, recompiling, and
                  re-start, that's not going to fly.
                  Yes, by restarting the process I meant restarting the JVM. Let me do a little experimenting with other approaches and I'll get back to you.

                  Mark
                  • 6. Re: DPL performance of bulk insertions with secondary index present
                    Greybird-Oracle
                    Eric,
                    Yes, by restarting the process I meant restarting the
                    JVM. Let me do a little experimenting with other
                    approaches and I'll get back to you.
                    On second thought, before I spend time on this could you please try an experiment using the approach I described -- where a recompile and restart of the JVM is required -- and measure the results. If the results are acceptable, then I'll work on finding a way for you to do this without a restart/recompile.

                    Thanks,
                    Mark
                    • 7. Re: DPL performance of bulk insertions with secondary index present
                      578006
                      Mark,

                      OK, I ran the experiment, and the aggregate load time is cut by a bit more than half. But it looks like something may have gone wrong in the process. After the load, I changed the version and added the annotations. Then I ran a method that does a get on the primary and both secondaries, followed by a store.sync(). The stack trace below is thrown at the cursor iteration line. This error does not occur when the db is loaded from empty with the annotations present.

                      Thanks, Eric

                      EntityCursor<Relationship> cursor = relationship_c2.subIndex(id).entities();
                                try {
                                     for (Relationship rel : cursor) {
                                          System.out.println(rel.getId1());
                                     }
                                } finally {
                                     cursor.close();
                                }


                      java.lang.IndexOutOfBoundsException
                           at com.sleepycat.bind.tuple.TupleInput.readUnsignedInt(TupleInput.java:414)
                           at com.sleepycat.bind.tuple.TupleInput.readInt(TupleInput.java:233)
                           at com.sleepycat.persist.impl.SimpleFormat$FInt.readPrimitiveField(SimpleFormat.java:403)
                           at com.sleepycat.persist.impl.ReflectionAccessor$PrimitiveAccess.read(ReflectionAccessor.java:429)
                           at com.sleepycat.persist.impl.ReflectionAccessor.readNonKeyFields(ReflectionAccessor.java:274)
                           at com.sleepycat.persist.impl.ComplexFormat$PlainFieldReader.readFields(ComplexFormat.java:1606)
                           at com.sleepycat.persist.impl.ComplexFormat$MultiFieldReader.readFields(ComplexFormat.java:1814)
                           at com.sleepycat.persist.impl.ComplexFormat$EvolveReader.readObject(ComplexFormat.java:1943)
                           at com.sleepycat.persist.impl.PersistEntityBinding.readEntity(PersistEntityBinding.java:88)
                           at com.sleepycat.persist.impl.PersistEntityBinding.entryToObject(PersistEntityBinding.java:58)
                           at com.sleepycat.persist.EntityValueAdapter.entryToValue(EntityValueAdapter.java:56)
                           at com.sleepycat.persist.BasicCursor.returnValue(BasicCursor.java:206)
                           at com.sleepycat.persist.BasicCursor.next(BasicCursor.java:74)
                           at com.sleepycat.persist.BasicIterator.hasNext(BasicIterator.java:50)
                      • 8. Re: DPL performance of bulk insertions with secondary index present
                        Greybird-Oracle
                        Hi Eric,

                        Thanks for doing the experiment. I'll take a look at ways of doing this that don't require a compile/restart and get back to you. This could take a day or so.

                        The exception probably has something to do with class evolution. This will not be pertinent to what you're doing in the end, but I'll take a look at what is causing this also.

                        Mark
                        • 9. Re: DPL performance of bulk insertions with secondary index present
                          Greybird-Oracle
                          Thanks for doing the experiment. I'll take a look at
                          ways of doing this that don't require a
                          compile/restart and get back to you. This could take
                          a day or so.
                          I've taken a look at this and concluded that the simplest thing is to add a new feature to the DPL that supports this optimization. There are other ways of doing it, using the lower level API, but they require contortions that I would rather not spend a lot of time trying to describe. The new feature turns out to be very simple.

                          The new feature involves a new configuration property for entity stores called SecondaryBulkLoad. If this property is true (it would be false by default), and you don't explicitly call getSecondaryIndex, the secondary will not be updated automatically as the primary is updated. Then, the first time that getSecondaryIndex is called, the secondary will be populated by reading through the primary. From then on, the SecondaryBulkLoad property would have no effect.

                          The usage would be something like this:
                          // Open the store with SecondaryBulkLoad configured
                          StoreConfig config = ...
                          config.setSecondaryBulkLoad(true);
                          EntityStore store = new EntityStore(..., config);

                          // Open the primary index and perform insertions
                          PrimaryIndex<X,E> primary = store.getPrimaryIndex(...);
                          primary.put(...);
                          ...

                          // Sometime later, open the secondary index
                          SecondaryIndex<X,Y,E> secondary = store.getSecondaryIndex(...);
                          // The secondary is now fully populated
                          ...
                          How does this sound to you?
                          The exception probably has something to do with class
                          evolution. This will not be pertinent to what you're
                          doing in the end, but I'll take a look at what is
                          causing this also.
                          I have reproduced this problem with a new unit test case in our test suite. We'll get a fix into an upcoming release.

                          Mark
                          • 10. Re: DPL performance of bulk insertions with secondary index present
                            Greybird-Oracle
                            In advance of the feature I described, I thought of a way you can do this now
                            with the current release, without having to bend too far over backwards.

                            First open the entity store and get some information about the lower level
                            Database for the primary index that you wish to load. Don't write any data
                            during this step. Leave the store open so that the entity binding will work
                            during the next step.
                            EntityStore store = new EntityStore(env, ...);
                            PrimaryIndex<X,Y> index = store.getPrimaryIndex(X.class, Y.class);
                            Database indexDb = index.getDatabase();
                            DatabaseConfig dbConfig = indexDb.getConfig();
                            String dbName = indexDb.getDatabaseName();
                            EntityBinding dbBinding = index.getEntityBinding();
                            Then, open and load the primary index database using the JE base API, using
                            your input data. Because you are opening a separate standalone handle for the
                            database, writing to this database will not cause the secondary index to be
                            updated.
                            Database standaloneDb = env.openDatabase(null, dbName, dbConfig);
                            DatabaseEntry keyEntry = new DatabaseEntry();
                            DatabaseEntry dataEntry = new DatabaseEntry();
                            while (moreInputAvailable) {
                                Y myEntity = ...; // create entity from input data
                                dbBinding.objectToKey(myEntity, keyEntry);
                                dbBinding.objectToData(myEntity, dataEntry);
                                standaloneDb.put(null, keyEntry, dataEntry);
                            }
                            standaloneDb.close();
                            Now access the entity store as usual. The first time you call
                            getSecondaryIndex, the secondary index data will be loaded by reading through
                            the primary index, as in your experiment.

                            Mark
                            • 11. Re: DPL performance of bulk insertions with secondary index present
                              578006
                              Hi Mark,

                              Yes, your proposed addition would work well for my purposes. Hopefully you see this as useful for others as well.

                              I am able to make progress in the meantime.

                              Thanks, Eric
                              • 12. Re: DPL performance of bulk insertions with secondary index present
                                Greybird-Oracle
                                This feature has been added and is available in JE 3.2.31 and higher. The reference number in the change log will be [#15525].