This discussion is archived
5 Replies Latest reply: Jun 29, 2010 3:29 AM by EJP RSS

A couple of million keys based collection with fastest look up algorithm

843853 Newbie
Currently Being Moderated
What collection allows us to store about ~10 million String/Long keys that are later used only for lookup, check if a key exists in this collection and very rarely add or delete from this collection.

It requires me to use such a collection as a reference in my JVM, as of now, only to check if the records key exists in the collection, to do specific processing on the actual record.

A list does not make sense, a sorted linked list with a custom search algorithm is far from my league, a hashmap might be sensible but even with a uniform hashcode assuming the key itself is an integer/long as the primary key of the table record, searching within the 10 million hashcodes should be slower?

Would balanced tree based algorithms that databases internally use for indices fit here, if so, are there any good implementations that fit my criteria that I can quickly start using from the JDK?

Thanks
Rama
  • 1. Re: A couple of million keys based collection with fastest look up algorith
    EJP Guru
    Currently Being Moderated
    I dont't understand your objection to a hash table. It is the obvious choice until proven otherwise.
  • 2. Re: A couple of million keys based collection with fastest look up algorith
    843853 Newbie
    Currently Being Moderated
    :), curiously, as database indices are fast and support very large datasets and the idea to make "lookups faster" with very few inserts or deletes (~1-5% of lookups) with no updates. Even the inserts/deletes from different threads are on different keys, will be Strings, at all time! The lookups only determine if its new or old before being persisted.

    The size of the collection in known in advance, (no rehashing) and the keys are unique with a 100% uniform hashcode (PK + number : toString()) makes this an ideal case for HashTable (thread safety!), as a bad hashcode would make a hashTable, the worst choice.

    There is no value attribute here, and the hashcode is 100% uniform/unique, why not use part of the hashcode algorithm.

    HashTable:
    table[(key.hashCode() & 0x7FFFFFFF) % tab.length]

    HashMap:
    h ^= (h >>> 20) ^ (h >>> 12);
    h ^ (h >>> 7) ^ (h >>> 4);
    table[h & (length-1)]

    why the difference?

    gives bunch of String key nodes, without needing to maintain HashMap<String,value> nodes in the HashTable's Entry. Two keys with different hashcode(), can hash index into the same table entry.

    This may not be worth if a value attribute may be added/used later.

    The multi-threaded inserts/deletes on are different keys (100% sure) but we cannot use a HashMap because the hashes can be same for multiple keys.

    Is this worth rewriting HashTable?

    Each insert computes toString(), hashcode, hash and inserts/adjusts the table entry.

    The cost is toString(), hashcodes, hashes and lookup/update index in a ~10 million array.

    Can an ordered String[] array with an efficient search algorithm beat a hash lookup for lots of lookups (~10 million)?
  • 3. Re: A couple of million keys based collection with fastest look up algorith
    EJP Guru
    Currently Being Moderated
    If the hash codes are 100% unique, how can the hash code be the same for multiple keys? And why would that be a reason for not using a hash table anyway?

    This isn't making much sense to me.
  • 4. Re: A couple of million keys based collection with fastest look up algorith
    843853 Newbie
    Currently Being Moderated
    I meant the hash not the hashcode(), the keys being unique PK (String) generate a uniform and distinct hashCode(), the hash however need not be unique right!

    Thanks
    Rama
  • 5. Re: A couple of million keys based collection with fastest look up algorith
    EJP Guru
    Currently Being Moderated
    I meant the hash not the hashcode()
    and the difference here is?
    the keys being unique PK (String) generate a uniform and distinct hashCode()
    So that's the end of the problem. Use a HashMap.
    the hash however need not be unique right!
    I don't know what you mean by 'the hash'. If it's the same as the hashCode the question is answered. If it isn't, it is irrelevant, as HashMap uses the hashCode.

    I don't know why you're not trialling this instead of wasting time here.