In the Locality-Sensitive Hashing (LSH) method we are given two sets of points from a multidimensional feature space equipped with some metric: a set of training points with known values and a set of test points.
In our LSH method the predicted value for each test point is calculated as the average of the values of its nearest neighbor points from the set of training points.
The set of nearest neighbors of a point \(x\) is approximated by the set of training points \(y\) such that \(g(y)=g(x)\) for some hash function \(g\). For this method to work it is essential to use many hash functions from an LSH function family tuned for the given metric. LSH functions ensure a much higher probability of collision for close points than for distant ones.
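This intuition has a standard formalization (cf. [1]): a family \(\mathcal{H}\) of hash functions is locality-sensitive for a radius \(r\), an approximation factor \(c > 1\) and probabilities \(p_1 > p_2\) if for all points \(x\) and \(y\):

\[ d(x,y) \leq r \;\Rightarrow\; \Pr_{h \in \mathcal{H}}[h(x) = h(y)] \geq p_1, \qquad d(x,y) \geq cr \;\Rightarrow\; \Pr_{h \in \mathcal{H}}[h(x) = h(y)] \leq p_2. \]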
This solution is a simplification of the algorithm presented in [1].
For an integer parameter \(k\), we first generate a set of LSH functions \(h_i\) \((1 \leq i \leq k)\).
Next, for each training point \(t\), we update the hash map \(m\), which maps each key \(g(t) = (h_1(t), \ldots, h_k(t))\) to the average category of the training points with that key.
Then, for each test point \(q\), the predicted category is \(m[g(q)]\).
We repeat all previous steps \(L\) times, and for each test point \(q\) the final category is the average of the nonempty predicted categories from all iterations (or the average category of all training points if the hash key of \(q\) was not present in the hash map in any iteration).
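The following is a minimal self-contained sketch of a single pass of this scheme, written for clarity rather than performance; it is not the library's implementation, and the function name single_pass and the key-folding constant are made up for illustration. The hash functions are supplied by the caller as plain callables:

#include <cstddef>
#include <functional>
#include <limits>
#include <unordered_map>
#include <vector>

using Point = std::vector<double>;
using HashFun = std::function<long(Point const &)>;

// Running average of the categories sharing one hash key.
struct avg { double sum = 0; unsigned count = 0; };

// One pass: build the hash map m over the training points, then look up
// each test point. The key g(p) = (h_1(p), ..., h_k(p)) is folded into a
// single integer here for brevity.
std::vector<double> single_pass(std::vector<Point> const &training_points,
                                std::vector<double> const &training_results,
                                std::vector<Point> const &test_points,
                                std::vector<HashFun> const &h) {
    auto g = [&](Point const &p) {
        std::size_t key = 0;
        for (auto const &h_i : h)  // combine h_1(p), ..., h_k(p)
            key = key * 1000003 + static_cast<std::size_t>(h_i(p));
        return key;
    };

    std::unordered_map<std::size_t, avg> m;
    for (std::size_t i = 0; i < training_points.size(); ++i) {
        avg &a = m[g(training_points[i])];
        a.sum += training_results[i];
        ++a.count;
    }

    std::vector<double> predictions;
    for (auto const &q : test_points) {
        auto it = m.find(g(q));
        // NaN marks an empty bucket; the full algorithm instead averages the
        // nonempty predictions over all L passes.
        predictions.push_back(it == m.end()
                                  ? std::numeric_limits<double>::quiet_NaN()
                                  : it->second.sum / it->second.count);
    }
    return predictions;
}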
Based on [1], we implemented LSH function families for the Hamming, Euclidean (\(l_2\)), Manhattan (\(l_1\)) and Jaccard (via min-wise independent permutations) distances. All hash function generators are defined in the file lsh_functions.hpp.
The example file is lsh_nearest_neighbors_regression_example.cpp.
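For illustration, a family for the Hamming distance can be as simple as projecting a point onto one randomly chosen coordinate: two points \(x\) and \(y\) in \(\{0,1\}^d\) then collide with probability \(1 - \mathrm{Ham}(x,y)/d\), which decreases with their distance. Below is a minimal sketch of such a generator; it is a stand-in written for this text, not the generator from lsh_functions.hpp:

#include <cstddef>
#include <random>
#include <vector>

// Hamming LSH: each generated function is h(x) = x[i] for a coordinate i
// drawn uniformly at random.
struct hamming_hash_function_generator {
    explicit hamming_hash_function_generator(std::size_t dimension)
        : dimension_(dimension) {}

    auto operator()() {
        std::uniform_int_distribution<std::size_t> coordinate(0, dimension_ - 1);
        std::size_t i = coordinate(rng_);
        return [i](std::vector<int> const &x) { return x[i]; };
    }

private:
    std::size_t dimension_;
    std::mt19937 rng_{std::random_device{}()};
};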
There are two make functions constructing the lsh_nearest_neighbors_regression model. The more general one:
auto make_lsh_nearest_neighbors_regression(
        TrainingPoints &&training_points, TrainingResults &&training_results,
        unsigned passes, LshFunctionGenerator &&lsh_function_generator,
        unsigned threads_count = hardware_concurrency);
And the specialized version, which assumes that the LSH function is a concatenation of several homogeneous functions:
auto make_lsh_nearest_neighbors_regression_tuple_hash(
        TrainingPoints &&training_points, TrainingResults &&training_results,
        unsigned passes, FunctionGenerator &&function_generator,
        unsigned hash_functions_per_point,
        unsigned threads_count = hardware_concurrency);
The model can be updated with additional training points by using the following method:
void update(TrainingPoints &&training_points, TrainingResults &&training_results, unsigned threads_count = hardware_concurrency);
The model can be tested with the test points by using the following method:
void test(TestPoints &&test_points, OutputIterator result) const;
IN: TrainingPoints &&training_points - range of training points (each training point is a range of its coordinates)
IN: TrainingResults &&training_results - range of training points categories
IN: TestPoints &&test_points - range of test points (each test point is a range of its coordinates)
OUT: OutputIterator result - output iterator for the predicted categories for the test points
IN: unsigned passes - number of algorithm iterations (parameter \(L\)), more iterations give more accurate results
IN: unsigned hash_functions_per_point - number of hash functions used by a single hash key (parameter \(k\)). The value has to be chosen carefully: too small a value increases the probability of hash key collisions for distant points, while too big a value decreases the probability of hash key collisions for near points.
IN: LshFunctionGenerator &&lsh_function_generator - functor generating LSH functions.
IN: unsigned threads_count - number of threads used by the computation (defaults to the hardware concurrency)
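Putting the pieces together, usage could look roughly as follows. This is a sketch under assumptions: the header names are invented for the example, and the hamming_hash_function_generator sketched earlier stands in for one of the real generators from lsh_functions.hpp:

#include <iterator>
#include <vector>

// Header names here are assumptions for this sketch; consult the library
// sources for the real ones.
#include "lsh_functions.hpp"
#include "lsh_nearest_neighbors_regression.hpp"

int main() {
    std::vector<std::vector<int>> training_points = {
        {0, 0, 1}, {0, 1, 1}, {1, 1, 0}};
    std::vector<double> training_results = {0.0, 0.5, 1.0};
    std::vector<std::vector<int>> test_points = {{0, 0, 0}, {1, 1, 1}};

    unsigned const passes = 10;                  // parameter L
    unsigned const hash_functions_per_point = 3; // parameter k

    // hamming_hash_function_generator stands in for whichever generator
    // from lsh_functions.hpp matches the metric of the points.
    auto model = make_lsh_nearest_neighbors_regression_tuple_hash(
        training_points, training_results, passes,
        hamming_hash_function_generator{3}, hash_functions_per_point);

    // The model can later absorb additional training data ...
    model.update(std::vector<std::vector<int>>{{1, 0, 0}},
                 std::vector<double>{0.25});

    // ... and predict categories for the test points.
    std::vector<double> predictions;
    model.test(test_points, std::back_inserter(predictions));
}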
The solution can also be used as the standalone binary program lsh-regression.
For more details on usage, please run the program with the --help option.
The total time complexity is \(O(L(m+n)kh)\), where \(L\) is the number of iterations, \(m\) and \(n\) are the numbers of training and test points respectively, \(k\) is the number of hash functions used by a single hash key, and \(h\) is the time complexity of a single application of one LSH function.
The space complexity is \(O(n+k+s)\), where \(s\) is the maximal size of the hash map, which is proportional to the number of distinct hash keys of all training points for a single hash function \(g\).
The LSH function families and the original version of the algorithm are described in [1].