I am trying to cluster high-dimensional data (dim = 100 to 1000+) using locality-sensitive hashing (http://en.wikipedia.org/wiki/Locality-sensitive_hashing).
This method hashes every feature vector into a hash bucket through some hash function. The idea is that data points that are close in the high-dimensional space should also be close after hashing (the hash function has special properties that make this approximately true). Once all the data is hashed, we compare the points that fell into the same bucket and perhaps run some local clustering on them (e.g. k-nearest neighbors).
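To make the idea concrete, here is a minimal sketch of one common LSH scheme, random-hyperplane hashing, in pure Python. The dimension, number of hyperplanes, and vectors are made-up illustrative values, not anything from the original question:

```python
# Sketch of random-hyperplane LSH bucketing (illustrative values only).
import random

DIM = 100          # dimensionality of the feature vectors (assumption)
NUM_PLANES = 16    # hash length in bits; more planes -> smaller buckets

random.seed(0)
# Each hyperplane is a random vector; the sign of the dot product
# against it contributes one bit of the hash.
planes = [[random.gauss(0, 1) for _ in range(DIM)]
          for _ in range(NUM_PLANES)]

def lsh_hash(vec):
    """Map a vector to a bucket key: one bit per hyperplane."""
    bits = []
    for p in planes:
        dot = sum(a * b for a, b in zip(p, vec))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

# Nearby vectors fall on the same side of most hyperplanes,
# so they usually land in the same bucket.
v = [random.gauss(0, 1) for _ in range(DIM)]
w = [x + random.gauss(0, 0.001) for x in v]   # tiny perturbation of v
print(lsh_hash(v), lsh_hash(w))
```

Note that the guarantee is only probabilistic: very close vectors can still straddle a hyperplane and get different keys, which is why LSH implementations typically use several independent hash tables.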
I know I can use BDB (Berkeley DB) to hash my data with my special hash function, but I am less sure how to efficiently retrieve all the data that has fallen into the same bucket. I am also wondering whether there are other technical issues I might run into when doing this with a BDB hash table.
Furthermore, one reason I think BDB is a good fit is that I want the table to be persistent: instead of running the method on an in-memory table, I can keep adding data (e.g. daily) and let the table grow.
The main question is: how do I access all the data in the same hash bucket?
Does that clarify a bit?
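One simple way to make "everything in one bucket" directly retrievable is to store the bucket key explicitly as the database key, with the list of item ids as the value. Below is a sketch using Python's stdlib `dbm` module as a stand-in for a persistent BDB hash table; the function names (`add_to_bucket`, `items_in_bucket`) and item ids are illustrative, not any BDB API:

```python
# Sketch: key each record by its LSH bucket key, storing a
# newline-separated list of item ids as the value. Python's stdlib
# dbm plays the role of a persistent BDB-style hash table here.
import dbm

def add_to_bucket(db, bucket_key, item_id):
    """Append item_id to the id list stored under bucket_key."""
    key = bucket_key.encode()
    existing = db[key] if key in db else b''
    db[key] = existing + item_id.encode() + b'\n'

def items_in_bucket(db, bucket_key):
    """Return every item id stored under this bucket key."""
    key = bucket_key.encode()
    raw = db[key] if key in db else b''
    return [line.decode() for line in raw.splitlines()]

# 'n' starts a fresh file each run; use 'c' instead to keep
# growing an existing table across runs (the persistence goal above).
with dbm.open('lsh_buckets', 'n') as db:
    add_to_bucket(db, '0110', 'doc-17')
    add_to_bucket(db, '0110', 'doc-42')
    print(items_in_bucket(db, '0110'))   # -> ['doc-17', 'doc-42']
```

With this layout, fetching a whole bucket is a single key lookup, and appending new data daily just extends the stored value, so no page-level access to the underlying hash table is needed.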
Here's the answer I got from a colleague of mine:
We do not export any notion of pages through the API, so you can't do exactly what you want, but you can certainly use a cursor and get a set of objects that are all "next to each other", meaning either in the same bucket or in adjacent buckets. I think that would let you get the K nearest neighbors: you search for the item and then ask for the K/2 items before it (Prev) and the K/2 items after it (Next).
It's not exactly what you requested, but may be sufficient.
Hope this helps.
Thanks and warm regards.
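The cursor idea from the answer above can be sketched without BDB at all, using a sorted key list and `bisect` as a stand-in for a btree cursor (this is not the BDB cursor API itself; `neighbors_by_key` and the keys are illustrative):

```python
# Sketch of the "K/2 Prev, K/2 Next" cursor walk over sorted bucket
# keys, emulating a btree cursor with Python's stdlib bisect module.
import bisect

def neighbors_by_key(sorted_keys, query_key, k):
    """Return ~k keys adjacent to query_key in sort order:
    roughly k/2 before it and k/2 after it."""
    i = bisect.bisect_left(sorted_keys, query_key)
    lo = max(0, i - k // 2)                       # k/2 steps of Prev
    hi = min(len(sorted_keys), lo + k)            # remaining steps of Next
    return sorted_keys[lo:hi]

keys = sorted(['0001', '0010', '0011', '0100', '0110', '0111', '1000'])
print(neighbors_by_key(keys, '0100', 4))   # -> ['0010', '0011', '0100', '0110']
```

One caveat worth noting: adjacency in key order is not the same as adjacency in the original high-dimensional space (keys differing in a high-order bit sort far apart yet can be close in feature space), so this gives approximate neighbors at best, which matches the answer's "may be sufficient" framing.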