12 Replies. Latest reply: Jun 14, 2007 8:48 AM by greybird

DPL performance of bulk insertions with secondary index present

578006 Newbie
I have a simple directed graph that I am trying to load from two flat files, and performance falls off a cliff when secondary indexes are in place.

There are about 400,000 nodes and 1,500,000 relationships.

With the secondary keys in the classes below commented out, the load takes about 40 seconds.

With just one of them present, the load trundles along nicely for a while, and then goes from almost no cache misses (i.e. 20 misses after 1,000,000 inserts) to multiple misses per insertion.

Running a 512M JVM, the cache grows to 319M well before performance degrades.

The code is pretty simple: it does a putNoOverwrite for each insert and calls commitNoSync() every 10,000 inserts.

Is there something specific I should be doing to get better performance when doing bulk inserts?

Many thanks.

--Eric Mays

@Entity
public class Node {
    @PrimaryKey
    int node_id;
    String stuff;
}

@Entity
public class Relationship {
    @PrimaryKey
    private long relationship_id;
    @SecondaryKey(relate = MANY_TO_ONE)
    private int node_id1;
    @SecondaryKey(relate = MANY_TO_ONE)
    private int node_id2;
}
  • 1. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    Hi Eric,
    > Is there something specific I should be doing to get
    > better performance when doing bulk inserts?
    Yes, there are a couple of things you can try.

    First, you said that you're doing a commit every 10k records. This means that you're using a very large transaction, which takes up a large part of the cache. For each record written (including each secondary) a lock object is stored in the cache.
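
    As a rough illustration of the point above, here is some back-of-the-envelope arithmetic. It is not a JE API call: it assumes exactly one lock per record written, as described, and counts the two secondary keys on Relationship. The class and method names are made up for this sketch, and real lock accounting in JE may differ.

```java
// Back-of-the-envelope estimate only; nothing here calls into JE.
class LockEstimate {

    // Assumption: one lock per record written, i.e. the primary record
    // plus one record for each secondary index on the entity.
    static long locksPerTransaction(long insertsPerCommit, int secondaryIndexes) {
        return insertsPerCommit * (1L + secondaryIndexes);
    }

    public static void main(String[] args) {
        // Relationship: two secondary keys, commit every 10,000 inserts.
        System.out.println(locksPerTransaction(10_000, 2)); // prints 30000
        // Node: no secondary keys, same commit interval.
        System.out.println(locksPerTransaction(10_000, 0)); // prints 10000
    }
}
```

    Under these assumptions, each secondary index adds another 10,000 locks to every transaction at the current commit interval.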

    So the first thing you could try is to use a non-transactional store -- do not call StoreConfig.setTransactional or pass false to this method. After performing the load, you can close the store and re-open it as a transactional store. In other words, data can be loaded non-transactionally and then used transactionally later.

    But even better would be to try our DeferredWrite mode. This mode was created for bulk load situations. In this mode, records are not logged until the cache fills. This reduces the logging of intermediate versions of the Btree index. To use this mode, call StoreConfig.setDeferredWrite(true) before opening the store.

    In DeferredWrite mode, durability is not guaranteed until you call EntityStore.sync(). When the load is complete, call sync(), close the store, and re-open it as a transactional store. This is the option that should give the best performance for a bulk insert.

    Mark
  • 2. Re: DPL performance of bulk insertions with secondary index present
    578006 Newbie
    Hi Mark,

    Thanks for your help. With this, the load now completes in about 25 minutes with both secondary indexes configured. I would still like to see if I can speed this up.

    I started looking into deferring secondary index creation until after the load completes. I see that one can create a SecondaryDatabase below the DPL and then create a SecondaryIndex in the DPL from that. Is that the way to go? Or is there some way to programmatically change the entity model and call evolve?

    When reading the doc on the SecondaryDatabase, I was a bit concerned about the following comment.

    Note that the associations between primary and secondary databases are not stored persistently. Whenever a primary database is opened for write access by the application, the appropriate associated secondary databases should also be opened by the application. This is necessary to ensure data integrity when changes are made to the primary database.

    I'm guessing the DPL takes care of this when the index is created in the entity model, but it is not clear how this works otherwise.

    Thanks, Eric
  • 3. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    > I started looking into deferring secondary index
    > creation until after the load completes. I see that
    > one can create a SecondaryDatabase below the DPL and
    > then create a SecondaryIndex in the DPL from that. Is
    > that the way to go? Or is there some way to
    > programmatically change the entity model and call
    > evolve?
    I'm not sure whether deferring secondary population will improve performance, but it is something you can try.

    Creating the SecondaryDatabase explicitly is one possible solution, but you won't be able to take advantage of DPL bindings or annotations if you do this -- you'll have to implement your own bindings.

    I'd like to suggest another option to try first:

    1) Define your classes with all members present but without the @SecondaryKey annotations
    2) Load the data
    3) Add the @SecondaryKey annotations and open the store.

    In step (3) the DPL will create the secondary databases and populate them at the time you open the store and call getPrimaryIndex. It does this by reading through the entire primary index in primary key order and inserting the necessary records in the secondary indices for each primary record that it reads.
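
    To make the population step concrete, here is a minimal in-memory sketch of the idea, assuming TreeMaps as stand-ins for the primary and secondary databases. It is not how the DPL is implemented internally and uses no JE classes; Rel, populateByNodeId1 and the sample values are hypothetical names and data chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model only: TreeMaps stand in for the primary and secondary databases.
class SecondaryPopulationSketch {

    // Hypothetical stand-in for the Relationship entity discussed in this thread.
    static final class Rel {
        final long relationshipId;
        final int nodeId1;
        final int nodeId2;

        Rel(long relationshipId, int nodeId1, int nodeId2) {
            this.relationshipId = relationshipId;
            this.nodeId1 = nodeId1;
            this.nodeId2 = nodeId2;
        }
    }

    // Read the primary in primary key order and, for each record read,
    // add one secondary entry mapping the secondary key to the primary key.
    static TreeMap<Integer, List<Long>> populateByNodeId1(TreeMap<Long, Rel> primary) {
        TreeMap<Integer, List<Long>> secondary = new TreeMap<>();
        for (Map.Entry<Long, Rel> entry : primary.entrySet()) {
            secondary.computeIfAbsent(entry.getValue().nodeId1, k -> new ArrayList<>())
                     .add(entry.getKey());
        }
        return secondary;
    }

    public static void main(String[] args) {
        // Step 2: load the primary only; no secondary is maintained during the load.
        TreeMap<Long, Rel> primary = new TreeMap<>();
        primary.put(3L, new Rel(3L, 10, 20));
        primary.put(1L, new Rel(1L, 10, 30));
        primary.put(2L, new Rel(2L, 20, 30));

        // Step 3: populate the secondary in a single pass over the primary.
        System.out.println(populateByNodeId1(primary));
    }
}
```

    Because the primary is read in key order, the primary keys collected under each secondary key also come out sorted.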

    The drawback of this approach is that it requires that you restart the process between steps (2) and (3), in order to use an updated version of your persistent classes. I don't know whether that will be practical for your application or not.

    If this approach is not practical, or if you have a different algorithm in mind for loading the secondaries (perhaps pre-sorting the input data), please let me know and I'll suggest other possibilities.
    > When reading the doc on the SecondaryDatabase, I was a
    > bit concerned about the following comment.
    >
    > Note that the associations between primary and
    > secondary databases are not stored persistently.
    > Whenever a primary database is opened for write
    > access by the application, the appropriate associated
    > secondary databases should also be opened by the
    > application. This is necessary to ensure data
    > integrity when changes are made to the primary
    > database.
    >
    > I'm guessing the DPL takes care of this when the
    > index is created in the entity model, but it is not
    > clear how this works otherwise.
    Yes, the DPL takes care of this and in fact does persistently store the relationships between primary and secondary databases. If you work with the lower level base API, it is up to you to open the secondary databases explicitly and to maintain the knowledge, either stored persistently somehow or implicit in your code, of the relationships.

    Mark
  • 4. Re: DPL performance of bulk insertions with secondary index present
    578006 Newbie
    If the approach of adding the SecondaryKey annotations just requires closing the store, that's fine; I just don't see how one uses the entity model APIs to add the annotations.

    If you mean exiting the JVM, recompiling, and re-start, that's not going to fly.

    Thanks, Eric
  • 5. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    > If you mean exiting the JVM, recompiling, and
    > re-start, that's not going to fly.
    Yes, by restarting the process I meant restarting the JVM. Let me do a little experimenting with other approaches and I'll get back to you.

    Mark
  • 6. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    Eric,
    > Yes, by restarting the process I meant restarting the
    > JVM. Let me do a little experimenting with other
    > approaches and I'll get back to you.
    On second thought, before I spend time on this, could you please try an experiment using the approach I described -- where a recompile and restart of the JVM is required -- and measure the results? If the results are acceptable, then I'll work on finding a way for you to do this without a restart/recompile.

    Thanks,
    Mark
  • 7. Re: DPL performance of bulk insertions with secondary index present
    578006 Newbie
    Mark,

    OK, I ran the experiment and the aggregate load time is cut by a bit more than half. But it looks like something may have gone wrong in the process. After the load, I changed the version and added the annotations. Then I ran a method that does a get on the primary and both secondaries, followed by a store.sync(). The exception below is thrown at the cursor iteration line. This error does not occur when the db is loaded from empty with the annotations present.

    Thanks, Eric

    EntityCursor<Relationship> cursor = relationship_c2.subIndex(id).entities();
    try {
        for (Relationship rel : cursor) {
            System.out.println(rel.getId1());
        }
    } finally {
        cursor.close();
    }


    java.lang.IndexOutOfBoundsException
         at com.sleepycat.bind.tuple.TupleInput.readUnsignedInt(TupleInput.java:414)
         at com.sleepycat.bind.tuple.TupleInput.readInt(TupleInput.java:233)
         at com.sleepycat.persist.impl.SimpleFormat$FInt.readPrimitiveField(SimpleFormat.java:403)
         at com.sleepycat.persist.impl.ReflectionAccessor$PrimitiveAccess.read(ReflectionAccessor.java:429)
         at com.sleepycat.persist.impl.ReflectionAccessor.readNonKeyFields(ReflectionAccessor.java:274)
         at com.sleepycat.persist.impl.ComplexFormat$PlainFieldReader.readFields(ComplexFormat.java:1606)
         at com.sleepycat.persist.impl.ComplexFormat$MultiFieldReader.readFields(ComplexFormat.java:1814)
         at com.sleepycat.persist.impl.ComplexFormat$EvolveReader.readObject(ComplexFormat.java:1943)
         at com.sleepycat.persist.impl.PersistEntityBinding.readEntity(PersistEntityBinding.java:88)
         at com.sleepycat.persist.impl.PersistEntityBinding.entryToObject(PersistEntityBinding.java:58)
         at com.sleepycat.persist.EntityValueAdapter.entryToValue(EntityValueAdapter.java:56)
         at com.sleepycat.persist.BasicCursor.returnValue(BasicCursor.java:206)
         at com.sleepycat.persist.BasicCursor.next(BasicCursor.java:74)
         at com.sleepycat.persist.BasicIterator.hasNext(BasicIterator.java:50)
  • 8. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    Hi Eric,

    Thanks for doing the experiment. I'll take a look at ways of doing this that don't require a compile/restart and get back to you. This could take a day or so.

    The exception probably has something to do with class evolution. This will not be pertinent to what you're doing in the end, but I'll take a look at what is causing this also.

    Mark
  • 9. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    > Thanks for doing the experiment. I'll take a look at
    > ways of doing this that don't require a
    > compile/restart and get back to you. This could take
    > a day or so.
    I've taken a look at this and concluded that the simplest thing is to add a new feature to the DPL that supports this optimization. There are other ways of doing it, using the lower level API, but they require contortions that I would rather not spend a lot of time trying to describe. The new feature turns out to be very simple.

    The new feature involves a new configuration property for entity stores called SecondaryBulkLoad. If this property is true (it would be false by default), and you don't explicitly call getSecondaryIndex, the secondary will not be updated automatically as the primary is updated. Then, the first time that getSecondaryIndex is called, the secondary will be populated by reading through the primary. From then on, the SecondaryBulkLoad property would have no effect.

    The usage would be something like this:
    // Open the store with SecondaryBulkLoad configured
    StoreConfig config = ...
    config.setSecondaryBulkLoad(true);
    EntityStore store = new EntityStore(..., config);

    // Open the primary index and perform insertions
    PrimaryIndex<X,E> primary = store.getPrimaryIndex(...);
    primary.put(...);
    ...

    // Sometime later, open the secondary index
    SecondaryIndex<Y,X,E> secondary = store.getSecondaryIndex(...);
    // The secondary is now fully populated
    ...
    How does this sound to you?
    > The exception probably has something to do with class
    > evolution. This will not be pertinent to what you're
    > doing in the end, but I'll take a look at what is
    > causing this also.
    I have reproduced this problem with a new unit test case in our test suite. We'll get a fix into an upcoming release.

    Mark
  • 10. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    In advance of the feature I described, I thought of a way you can do this now
    with the current release, without having to bend too far over backwards.

    First open the entity store and get some information about the lower level
    Database for the primary index that you wish to load. Don't write any data
    during this step. Leave the store open so that the entity binding will work
    during the next step.
    EntityStore store = new EntityStore(env, ...);
    PrimaryIndex<X,Y> index = store.getPrimaryIndex(X.class, Y.class);
    Database indexDb = index.getDatabase();
    DatabaseConfig dbConfig = indexDb.getConfig();
    String dbName = indexDb.getDatabaseName();
    EntityBinding dbBinding = index.getEntityBinding();
    Then, open and load the primary index database using the JE base API, using
    your input data. Because you are opening a separate standalone handle for the
    database, writing to this database will not cause the secondary index to be
    updated.
    Database standaloneDb = env.openDatabase(null, dbName, dbConfig);
    DatabaseEntry keyEntry = new DatabaseEntry();
    DatabaseEntry dataEntry = new DatabaseEntry();
    while (moreInputAvailable) {
        Y myEntity = ...; // create entity from input data
        dbBinding.objectToKey(myEntity, keyEntry);
        dbBinding.objectToData(myEntity, dataEntry);
        standaloneDb.put(null, keyEntry, dataEntry);
    }
    standaloneDb.close();
    Now access the entity store as usual. The first time you call
    getSecondaryIndex the secondary index data will be loaded by reading through
    the primary, as in your experiment.

    Mark
  • 11. Re: DPL performance of bulk insertions with secondary index present
    578006 Newbie
    Hi Mark,

    Yes, your proposed addition would work well for my purposes. Hopefully you see this as useful for others as well.

    I am able to make progress in the meantime.

    Thanks, Eric
  • 12. Re: DPL performance of bulk insertions with secondary index present
    greybird Expert
    This feature has been added in JE 3.2.31 and higher. The reference number in the change log will be [#15525].