I'm pondering the idea of using BDB to do some clustering via 'locality sensitive hashing' on an [extremely] large scale. I know I can set my own hash function, but I'm wondering whether there is any (simple?) way to access all of the values in the same hash bucket.
Has anyone tried this before with BDB?
Are there any major issues I am overlooking here that might cause problems?
I am trying to cluster high-dimensional data (dim = 100 - 1000+) using 'locality sensitive hashing' (http://en.wikipedia.org/wiki/Locality-sensitive_hashing).
This method requires hashing every feature vector into a hash bucket through some hash function. The idea is that data points that are close in the high-dimensional space should also land close together after hashing (the hash function has special properties that make this approximately true). Once all the data has been hashed, we compare the data that has fallen into the same bucket, and perhaps run some local clustering on it (e.g. k-nearest neighbors).
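As a concrete illustration of the kind of hash function involved, here is a minimal sketch of one standard LSH family (random-hyperplane hashing for cosine similarity): each random hyperplane contributes one bit, depending on which side of the plane the vector falls. All names here (`make_lsh`, `lsh`) are illustrative, not from any particular library.

```python
import random

def make_lsh(dim, n_planes, seed=0):
    """Build a random-hyperplane LSH function: each of n_planes random
    hyperplanes contributes one bit of the bucket id, so nearby vectors
    tend to end up with the same bits."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(n_planes)]

    def lsh(vec):
        bits = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits  # integer bucket id with n_planes bits

    return lsh

lsh = make_lsh(dim=100, n_planes=16)
v = [0.5] * 100
w = [0.5] * 99 + [0.51]  # a near-identical vector; usually lands in the same bucket
bucket = lsh(v)
```

With 16 planes the bucket id is a 16-bit integer; more planes mean smaller, purer buckets at the cost of splitting true neighbors more often.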
I know I can make BDB hash my data using my special hash function, but I am less sure how to efficiently retrieve all of the data that has fallen into the same bucket. I am also wondering whether there are other technical issues I might run into when doing this with a BDB hash table.
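One way to sidestep the problem of reaching inside the store's internal buckets is to make the LSH bucket id an explicit key and store the list of item ids under it; then "all values in the same bucket" is a single lookup. A minimal sketch, using Python's stdlib `dbm` (itself a persistent hash-table store) as a stand-in for the BDB table; `add_item` and `bucket_items` are hypothetical helper names:

```python
import dbm
import json
import os
import tempfile

# Stand-in for the BDB table: rather than relying on the store's internal
# bucket layout, the LSH bucket id is the key, and the value is the JSON
# list of item ids currently in that bucket.

path = os.path.join(tempfile.mkdtemp(), "lsh_buckets")

def add_item(db, bucket_id, item_id):
    key = str(bucket_id).encode()
    ids = json.loads(db[key]) if key in db else []
    ids.append(item_id)
    db[key] = json.dumps(ids).encode()

def bucket_items(db, bucket_id):
    key = str(bucket_id).encode()
    return json.loads(db[key]) if key in db else []

with dbm.open(path, "c") as db:   # 'c' = create the file if missing
    add_item(db, 42, "vec-001")
    add_item(db, 42, "vec-007")
    add_item(db, 7, "vec-003")

# Reopen later (e.g. the next day) and keep growing the table.
with dbm.open(path, "w") as db:
    add_item(db, 42, "vec-212")
    same_bucket = bucket_items(db, 42)  # ['vec-001', 'vec-007', 'vec-212']
```

The reopen step mirrors the persistence requirement below: the table survives between runs and can be grown incrementally.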
Furthermore, one reason I think BDB is a good fit is that I want the table to be persistent: instead of running the method on an in-memory table, I can keep adding data (e.g. daily) and grow the table over time.
The main question is how do I access all data in the same hash bucket?
Here's the answer I got from a colleague of mine:
We do not export any notion of pages through the API, so you can't do exactly what you want, but you can certainly use a cursor and get a set of objects that are all "next to each other", which would mean either in the same bucket or in adjacent buckets. I think that would allow you to get the K nearest neighbors (you search for the item and then ask for the K/2 items Prev and the K/2 items Next).
It's not exactly what you requested, but may be sufficient.
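The cursor trick described above (position at the item, then walk K/2 steps backward and K/2 steps forward) can be sketched without BDB by standing in a sorted key list for the cursor's key ordering; `bisect` plays the role of positioning the cursor, and the slice plays the role of the Prev/Next calls. The function name is illustrative, not a BDB API:

```python
import bisect

def neighbors_by_cursor(sorted_keys, key, k):
    """Mimic the cursor approach: position at `key`, then take up to
    k//2 items on each side of it in key order."""
    i = bisect.bisect_left(sorted_keys, key)          # "search for the item"
    lo = max(0, i - k // 2)                            # K/2 steps Prev
    hi = min(len(sorted_keys), i + k // 2 + 1)         # K/2 steps Next
    return [x for x in sorted_keys[lo:hi] if x != key]

keys = sorted([3, 8, 15, 16, 23, 42, 99])
nearby = neighbors_by_cursor(keys, 16, 4)  # [8, 15, 23, 42]
```

Note the caveat from the answer: "next to each other" in cursor order means same or adjacent buckets, not necessarily nearest in the original high-dimensional space, so the returned candidates still need a real distance check.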
Thank you for the reply. I figured that would be the best method as well, except I will likely use either the values in the same bucket (by iterating) or the actual bucket plus the previous and next buckets; same idea, though.