This discussion is archived
11 Replies Latest reply: Sep 26, 2013 7:19 AM by greybird RSS

Help setting up correct key/index structure

9b78ff64-42a5-49e3-ac23-f5d19e32749a Newbie
Currently Being Moderated

Hi,

 

I'm new to BDB and have spent some time reading the docs and playing with the examples. I'm a little confused as to how best to structure my keys and indexes to achieve what I want, and would appreciate any pointers to set me on the right track and avoid going down too many blind alleys. I'm looking to use the Java DPL API if possible to keep things simple.

 

I want to store a single Event entity class with the following fields: id (int), timestamp (java.util.Date), componentId (int), status (int).

 

I would like to query all events between a time period (start/end date) with a specific status, and ordered by componentId. What's the best approach to optimise this query?

 

Thanks,

 

Simon.

  • 1. Re: Help setting up correct key/index structure
    Andrei Costache, Oracle Journeyer
    Currently Being Moderated

    Hi Simon,

     

    I think you'll find the following two resources very useful:

    Performing Queries in Oracle Berkeley DB Java Edition

    Getting Started with Berkeley DB Java Edition  (specifically, the "Programming with the Direct Persistence Layer" section)

    The Javadoc for the JE's APIs is here.

     

    In the Event entity class your primary key will be id, and all others will be secondary keys.  Your query expressed in SQL would be something like:  SELECT * FROM event WHERE (timestamp >= start AND timestamp <= end) AND status = X  ORDER BY componentId;

     

    Assuming that the primary index is defined and created like this:

    PrimaryIndex<integer, Event> eventById = store.getPrimaryIndex(Integer.class, Event.class);

     

    a) Getting the entities that satisfy the the range query on timestamp can be done using and EntityIndex.entities() call like:

    EntityCursor<Event> curEventsInTimestampRange = eventByTimestamp.entities((new Date()).setTime(startTime), true, (new Date()).setTime(endTime), true);

    where "eventByTimestamp" is the secondary index on timestamp defined and created as

    SecondaryIndex<Date, Integer, Event> eventByTimestamp = store.getSecondaryIndex(eventById, Date.class, "timestamp");

     

    b) Getting the entities that satisfy the status criteria can be done using EntityIndex.subIndex().entities():

    EntityCursor<Event> curEventsWithStatus = eventByStatus.subIndex(1).entities();

    where "eventByStatus" is the secondary index on status defined and created as

    SecondaryIndex<Integer, Integer, Event> eventByStatus = store.getSecondaryIndex(eventById, Integer.class, "status");

     

    c) Getting all the entities ordered by componentId can be done using EntityIndex.entities():

    EntityCursor<Event> curEventsOrderedByComponentId = eventByComponentId.entities();

    where "eventByComponentId" is the secondary index on componentId defined and created as

    SecondaryIndex<Integer, Integer, Event> eventByComponentId = store.getSecondaryIndex(eventById, Integer.class, "componentId");

     

    But the above are a bit unpractical to use for the scope of the query here, since you'll need to do some sort of iterating through the results with some nested loops etc.  So I think you're better off using the Java Collections framework / JE's Collections API, specifically for this case, SortedMap / StoredSortedMap and SortedSet / StoredSortedValueSet, for restricting/filtering and ordering the results, using something like:

    // get Events satisfying the two criterias

    SortedSet<Event> sortSetEventsTimestampInRange = eventByTimestamp.sortedMap().subMap((new Date()).setTime(startTime), true, (new Date()).setTime(endTime), true).values();

    SortedSet<Event> sortSetEventsWithStatusX = eventByStatus.sortedMap().subMap(statusX, statusX).values;

    // Event needs to override equals()

    // EventComponentIdComparator is a Comparator<Event> comparing by componentId and allowing the ordering

    SortedSet<Event> eventsTimestampInRangeAndStatusXOrderedByComponentId = new TreeSet<Event>(new EventComponentIdComparator());

    eventsTimestampInRangeAndStatusXOrderedByComponentId.addAll(sortSetEventsTimestampInRange.retainAll(sortSetEventsWithStatusX));

    // iterate and process the results

    for(Event e : eventsTimestampInRangeAndStatusXOrderedByComponentId) ...

     

    Regards,

    Andrei

  • 2. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    Andrei, I think this can be done in a simpler way.  Simon, please let us know if this meets your needs.

     

    Rather than create three secondary keys and three secondary indexes, only a single secondary key/index is needed that contains the three fields: status, timestamp, and component.  If the fields are declared in that order (using the @KeyField annotation to specify the ordering), then this allows performing a query that is equivalent to the SQL statement that Andrei described, which I believe is the query you asked about.

     

    For example:

    @Persistent
    class MySecondaryKey {
          @KeyField(1) int status;
          @KeyField(2) long date;
          @KeyField(3) int component;
          MySecondaryKey(int status, long date, int component) {...}
    }
    @Entity
    class Event {
        @PrimaryKey int id;
        @SecondaryKey MySecondaryKey secKey;
    }
    PrimaryIndex eventById = store.getPrimaryIndex(Integer.class, Event.class);
    SecondaryIndex eventBySecKey = store.getSecondaryIndex(eventById, MySecondaryKey.class, "secKey");
    
    // Perform a query
    MySecondaryKey startKey = new MySecondaryKey(status, startDate, 0);
    MySecondaryKey endKey = new MySecondaryKey(status, endDate, Long.MAX_VALUE);
    EntityCursor cursor = eventBySecKey.entities(startKey, true /*inclusive*/, endKey, true /*inclusive*/);
    // Use cursor to iterate over matching events...
    

     

    Does this make sense?  The javadoc for SecondaryIndex and KeyField may help, but feel free to ask additional questions.

     

    There is no need to use our Collection API unless you would like to do that.  The Collection API is mainly valuable for two reasons: 1) familiarity (java.util classes are commonly understood), and 2) interoperability with other components (e.g., a library that requires passing a Collection object).

     

    --mark

  • 3. Re: Help setting up correct key/index structure
    9b78ff64-42a5-49e3-ac23-f5d19e32749a Newbie
    Currently Being Moderated

    Andrei, Mark - many, many thanks for taking the time to write such thorough answers. I hadn't expected such detail and I'm very appreciative of your help and suggestions.

     

    I think you've understood what I'm trying to achieve, despite me not giving more detail in my original post. Andrei's SQL representation of the query is spot on. To give a bit more context, there will be a large number of events stored (around 1000 per day initially, probably growing to 100k per day over the next couple of years). Every 24 hours, we want to query events generated in the previous 24 hrs, processing them in order of component ID. Whilst I could see how to achieve each part of the query individually (time range, status comparison, component ID ordering) I was struggling to see how to combine them without having to inefficiently iterate over a larger set.

     

    Andrei, I had initially discounted the Collection API because the docs seemed to imply that it was intended for 3rd party components that are build around Collections, but the approach you outlined seems straightforward (it would have taken me a while to figure this out so, again, thank you).  If I've understood your example code, it creates two SortedSets (one for time range and one for the status comparison) and then combines them using a TreeSet? I will read the Collection API docs more thoroughly.

     

    Mark, your suggestion is the kind of approach I was envisaging but when I thought about a secondary key with multiple key fields, I couldn't see how the date ordering would work when combined with the component ID. Looking at your code, I'm still struggling to see how this works but I could be missing something. For example, considering some simplified event data (ignoring the status field):

     

    TIME / COMPONENT ID

    9:00 / 1

    10:00 / 2

    11:00 / 1

    12:00 / 2

     

    Sorting on time before component ID will give the component IDs out of sequence won't it? Won't the component ID sorting only work for records with identical timestamps? Apologies if I've misunderstood.

     

    Thanks again for your help.

     

    Simon.

  • 4. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    >> Mark, your suggestion is the kind of approach I was envisaging but when I thought about a secondary key with multiple key fields, I couldn't see how the date ordering would work when combined with the component ID. Looking at your code, I'm still struggling to see how this works but I could be missing something. For example, considering some simplified event data (ignoring the status field):

     

    Nice reply Simon.  You're right.  I was thinking that ordering by component within date was what you wanted, but now I understand that it's not.  (My mistake.)

     

    It's not possible to create a secondary key that returns only the records you want with a single query, in the order you want.  This is because you want to order by a field that is not used for selection.  In other words, BDB does not have a ORDER BY feature -- ordering is based only on the index key used.

     

    So the alternatives are:

     

    1) Use a secondary key that is only for ordering (perhaps just {component}), and filter the results to select only the status and date range you want.  By filter I mean discard the events that are not desired.  This is probably the least desirable solution, because it requires reading all records and therefore may perform poorly as the data set grows.

     

    2) Use a secondary key with only the two fields needed for selection {status, date} and sort the results yourself.  This is a good solution if the result set size is bounded and small enough to hold in memory.  It is pretty simple to implement.  But beware of running out of memory when the data set grows.

     

    3) Use a secondary key such as {component, status, date}, do a separate query for each component value, then concatenate the results.  If (2) is not practical and there are a small number of component values, this may be a good solution.  Only the records you're interested in are read, and they are returned in the desired order.  This may be the best choice performance-wise, but is a little more difficult to implement.

     

    --mark

  • 5. Re: Help setting up correct key/index structure
    9b78ff64-42a5-49e3-ac23-f5d19e32749a Newbie
    Currently Being Moderated

    Mark, thanks for following up and clarifying. Of the 3 alternatives you give, I don't think the first two will be practical due to likely dataset size. Option 3 isn't ideal because there will be tens of thousands of component IDs and only a relatively small subset of these will be present in the period being queried, but perhaps this is the least worst option.

     

    What do you think of Andrei's collections-based suggestion? Being new to BDB, it seems strange that this offers a fairly elegant solution whereas you say that BDB doesn't have the ORDER BY feature that I need in this particular case. Is there a downside I'm missing? Again, apologies if I'm missing something obvious.

     

    Cheers,

     

    Simon.

  • 6. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    >> Mark, thanks for following up and clarifying. Of the 3 alternatives you give, I don't think the first two will be practical due to likely dataset size. Option 3 isn't ideal because there will be tens of thousands of component IDs and only a relatively small subset of these will be present in the period being queried, but perhaps this is the least worst option.

     

    Because BDB is embedded, doing multiple queries like this is not nearly as expensive as with a client-server database. However, tens of thousands of queries is a lot, so I agree it's not ideal.

     

    >> What do you think of Andrei's collections-based suggestion? Being new to BDB, it seems strange that this offers a fairly elegant solution whereas you say that BDB doesn't have the ORDER BY feature that I need in this particular case. Is there a downside I'm missing? Again, apologies if I'm missing something obvious.

     

    I would consider this a variant of option (2) because it does sort the results in memory after retrieving them.  The TreeSet below is an in-memory set of all results.

     

    SortedSet<Event> eventsTimestampInRangeAndStatusXOrderedByComponentId = new TreeSet<Event>(new EventComponentIdComparator());

    eventsTimestampInRangeAndStatusXOrderedByComponentId.addAll(sortSetEventsTimestampInRange.retainAll(sortSetEventsWithStatusX));


    Also the sortSetEventsTimestampInRange.retainAll call will delete the non-qualifying records from the database, so this approach has a more serious problem.  retainAll removes the records from 'this' that are not present in the collection passed.  In this case 'this' is the database.

     

    In general, the Collections API can't provide added performance, as it's just a layer on top.  It provides convenience.

     

    I'll try to think a little more on this, but I'm not sure there is an ideal solution.  Perhaps there is some way to leverage what you said above: "only a relatively small subset of these will be present in the period being queried".

     

    --mark

  • 7. Re: Help setting up correct key/index structure
    9b78ff64-42a5-49e3-ac23-f5d19e32749a Newbie
    Currently Being Moderated

    Ok, I hadn't appreciated the in-memory nature of TreeSet (although I was wondering how it worked) and definitely not the impact of the retainAll call.

     

    Out of interest, when BDB is used as an engine for a SQL RDBMS such as SQLite/MySQL, how would a query such as this be handled by BDB?

  • 8. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    >> Out of interest, when BDB is used as an engine for a SQL RDBMS such as SQLite/MySQL, how would a query such as this be handled by BDB?

     

    Those aren't the products I work on, but in the case we're discussing they would have to sort the results after querying them (in memory).  All databases must do this, if there isn't an index the provides the requested order.  Either you store the data/index in the order you want to query it, or you have to sort in memory after doing the query.

     

    When I said that BDB doesn't support ORDER BY, I meant it doesn't do the in-memory sort for you.  And of course I was not referring to the SQLite product that implements SQL on top of BDB.

     

    --mark

  • 9. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    I haven't been able to come up with any new ideas except the following variation of option 2, which is where the secondary key is {status, date} and you sort the results after doing the query. 

     

    The variation is to sort the results by inserting them into a temporary database that is ordered by {component, id} (or whatever key you prefer to give the order you'd like), and then query them from that database in key order.  You can create a temporary store for this purpose using the DPL by calling StoreConfig.setTemporary(true).

     

    A further optimization would be to do the sort in memory when the result set is small enough to fit, and only resort to the in-database sort when necessary.

     

    It may even be worthwhile to prototype this approach, and also option 3, and compare their performance.

     

    --mark

  • 10. Re: Help setting up correct key/index structure
    9b78ff64-42a5-49e3-ac23-f5d19e32749a Newbie
    Currently Being Moderated

    Mark, thanks for this suggestion. I think I will go with this approach. Using a temporary store makes sense. Thanks again for your time on this - I'll let you know how I get on and may well have some other questions along the way!

  • 11. Re: Help setting up correct key/index structure
    greybird Expert
    Currently Being Moderated

    You're welcome.  Yes, please do let us know how it goes.

     

    I was also going to mention that temporary databases have the property that nothing is flushed to disk until the JE cache fills.  So for small result sets, it may not be worth the effort to implement a special case for in-memory sorting.

     

    --mark

Legend

  • Correct Answers - 10 points
  • Helpful Answers - 5 points