Locality-sensitive hashing nearest neighbors regression

# Problem definition

In the Locality-Sensitive Hashing (LSH) method we are given two sets of points from a multidimensional feature space equipped with some metric:

1. a set of training points, each with a given category;
2. a set of test points, for which we want to predict the category.

In our LSH method the predicted value for each test point is calculated as the average of the values of its nearest neighbor points from the set of training points.

The set of nearest neighbors of a point $$x$$ is approximated by the set of training points $$y$$ such that $$g(y)=g(x)$$ for some hash function $$g$$. For this method to work it is essential to use many hash functions from an LSH function family tuned for the given metric. LSH functions ensure a much higher probability of collision for close points than for distant ones.
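As an illustration of the collision property, consider the classic bit-sampling family for the Hamming metric: each hash function returns one randomly chosen coordinate of the point, so two points collide with probability $$1 - d/N$$, where $$d$$ is their Hamming distance and $$N$$ the dimension. The following is a minimal sketch (not part of the library; all names are illustrative):

```cpp
#include <bitset>
#include <cassert>
#include <random>

// Bit-sampling LSH for Hamming distance on N-bit vectors:
// h(x) = x[i] for a randomly chosen coordinate i.
// Pr[h(x) == h(y)] = 1 - hamming(x, y) / N, so close points
// collide with higher probability than distant ones.
template <std::size_t N>
struct bit_sample_hash {
    std::size_t i; // sampled coordinate
    bool operator()(const std::bitset<N> &x) const { return x[i]; }
};

template <std::size_t N>
bit_sample_hash<N> make_bit_sample_hash(std::mt19937 &rng) {
    std::uniform_int_distribution<std::size_t> dist(0, N - 1);
    return {dist(rng)};
}
```

Drawing many such functions and counting collisions empirically recovers the $$1 - d/N$$ collision rate.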

# Solution

This solution is a simplification of the algorithm presented in [1].

For an integer parameter $$k$$, we first generate a set of LSH functions $$h_i$$ ($$1 \leq i \leq k$$).

Next, for each training point $$t$$ we update the hash map $$m$$, which maps each key $$g(t) = (h_1(t), \ldots, h_k(t))$$ to the average category of the training points with that key.

Then for each test point $$q$$ the predicted category is $$m[g(q)]$$.

We repeat all previous steps $$L$$ times, and for each test point $$q$$ the final category is the average of the nonempty predicted categories from all iterations (or the average category of all training points if the hash key of $$q$$ was not present in the hash map in any iteration).
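The loop described above can be sketched in a self-contained way. This is illustrative code, not the library's implementation: points are plain integer vectors, the LSH family samples single coordinates, and $$g$$ concatenates $$k$$ single hashes into a string key.

```cpp
#include <map>
#include <random>
#include <string>
#include <utility>
#include <vector>

using point = std::vector<int>;

// Minimal sketch of the L-pass LSH regression loop: in each pass draw k
// hash functions, bucket the training points by the concatenated key
// g = (h_1, ..., h_k), and predict the bucket average for the query.
double lsh_regression_predict(const std::vector<point> &train,
                              const std::vector<double> &results,
                              const point &query,
                              unsigned k, unsigned L, unsigned seed = 0) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> coord(0, query.size() - 1);

    double global_sum = 0;
    for (double r : results) global_sum += r;
    const double global_avg = global_sum / results.size();

    double sum = 0;
    unsigned hits = 0;
    for (unsigned pass = 0; pass < L; ++pass) {
        // draw k hash functions h_i(x) = x[c_i]
        std::vector<std::size_t> coords(k);
        for (auto &c : coords) c = coord(rng);
        auto g = [&](const point &p) {
            std::string key;
            for (auto c : coords) key += std::to_string(p[c]) + ',';
            return key;
        };
        // m maps each key to (sum, count), i.e. the bucket average
        std::map<std::string, std::pair<double, unsigned>> m;
        for (std::size_t t = 0; t < train.size(); ++t) {
            auto &bucket = m[g(train[t])];
            bucket.first += results[t];
            ++bucket.second;
        }
        auto it = m.find(g(query));
        if (it != m.end()) {
            sum += it->second.first / it->second.second;
            ++hits;
        }
    }
    // average of nonempty predictions; global average if no bucket hit
    return hits ? sum / hits : global_avg;
}
```

On the four corner points used in the example below, a query near the low-valued corner receives a lower prediction than one near the high-valued corner, for any random draw.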

# LSH functions

Following [1], we implemented LSH function families for the Hamming, Euclidean ($$l_2$$), Manhattan ($$l_1$$) and Jaccard (via min-wise independent permutations) distances. All hash function generators are defined in the file lsh_functions.hpp.
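For the Euclidean case, the standard construction from [1] uses p-stable projections: $$h(x) = \lfloor (a \cdot x + b) / w \rfloor$$, where $$a$$ has i.i.d. standard normal coordinates, $$b$$ is uniform on $$[0, w)$$, and $$w$$ is the bucket width. A minimal sketch (illustrative names, not the library's API):

```cpp
#include <cmath>
#include <random>
#include <vector>

// Sketch of a p-stable LSH function for the Euclidean (l_2) metric:
// h(x) = floor((a . x + b) / w). Nearby points project to nearby
// values and tend to land in the same width-w bucket.
struct l2_hash {
    std::vector<double> a; // random normal direction
    double b;              // random offset in [0, w)
    double w;              // bucket width
    long operator()(const std::vector<double> &x) const {
        double dot = 0;
        for (std::size_t i = 0; i < x.size(); ++i) dot += a[i] * x[i];
        return static_cast<long>(std::floor((dot + b) / w));
    }
};

l2_hash make_l2_hash(std::size_t dim, double w, std::mt19937 &rng) {
    std::normal_distribution<double> normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, w);
    l2_hash h{std::vector<double>(dim), unif(rng), w};
    for (auto &c : h.a) c = normal(rng);
    return h;
}
```

As with bit sampling, points much closer than $$w$$ collide with high probability, while points much farther than $$w$$ rarely do.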

# Example

```cpp
#include "paal/regression/lsh_nearest_neighbors_regression.hpp"

#include <boost/numeric/ublas/vector.hpp>
#include <boost/range/algorithm/copy.hpp>

#include <iostream>
#include <iterator>
#include <vector>

using coordinate_t = int;
using point_t = boost::numeric::ublas::vector<coordinate_t>;

auto make_point(const std::initializer_list<int> &list) {
    point_t point(list.size());
    boost::copy(list, point.begin());
    return point;
}

int main() {
    const std::vector<point_t> training_points =
        {make_point({0, 0}), make_point({0, 1}),
         make_point({1, 0}), make_point({1, 1})};
    const std::vector<double> training_results = {0.0, 0.4, 0.6, 1.0};
    const std::vector<point_t> test_points =
        {make_point({0, -1}), make_point({2, 1})};

    auto const passes = 50;
    auto const hash_functions_per_point = 10;
    auto const dimensions = training_points.front().size();
    auto const threads_count = 1;
    // w_param should be substantially bigger than the radius of the
    // expected test point neighborhood
    auto const w_param = 3.0;

    // LSH function generator for the Euclidean (l_2) metric
    auto lsh_function_generator =
        paal::lsh::l_2_hash_function_generator<>{dimensions, w_param};
    auto model = paal::make_lsh_nearest_neighbors_regression_tuple_hash(
        training_points, training_results,
        passes,
        std::move(lsh_function_generator),
        hash_functions_per_point,
        threads_count);

    std::vector<double> results;
    model.test(test_points, std::back_inserter(results));

    std::cout << "Solution:" << std::endl;
    boost::copy(results, std::ostream_iterator<double>(std::cout, ","));
    std::cout << std::endl;

    return 0;
}
```

The example file is lsh_nearest_neighbors_regression_example.cpp.

# Main functions and methods

There are two make functions constructing the lsh_nearest_neighbors_regression model. The more general one:

```cpp
auto make_lsh_nearest_neighbors_regression(
    TrainingPoints &&training_points, TrainingResults &&training_results,
    unsigned passes,
    LshFunctionGenerator &&lsh_function_generator,
    unsigned threads_count = hardware_concurrency);
```


And a specialized version, which assumes that the LSH function is a concatenation of several homogeneous functions:

```cpp
auto make_lsh_nearest_neighbors_regression_tuple_hash(
    TrainingPoints &&training_points, TrainingResults &&training_results,
    unsigned passes,
    FunctionGenerator &&function_generator,
    unsigned hash_functions_per_point,
    unsigned threads_count = hardware_concurrency);
```


The model can be updated with additional training points by using the following method:

```cpp
void update(TrainingPoints &&training_points, TrainingResults &&training_results,
            unsigned threads_count = hardware_concurrency);
```


The model can be tested with the test points by using the following method:

```cpp
void test(TestPoints &&test_points, OutputIterator result) const;
```


# Parameters

IN: TrainingPoints &&training_points - range of training points (each training point is a range of its coordinates)

IN: TrainingResults &&training_results - range of training points categories

IN: TestPoints &&test_points - range of test points (each test point is a range of its coordinates)

OUT: OutputIterator result - output iterator for the predicted categories of the test points

IN: unsigned passes - number of algorithm iterations (parameter $$L$$), more iterations give more accurate results

IN: unsigned hash_functions_per_point - number of hash functions used by a single hash key (parameter $$k$$). The value has to be chosen carefully: too small a value increases the probability of hash key collisions for distant points, while too big a value decreases the probability of hash key collisions for near points.

IN: LshFunctionGenerator &&lsh_function_generator - functor generating LSH functions.
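The trade-off in choosing $$k$$ can be made concrete with the bit-sampling family for the Hamming metric (an illustrative calculation, not library code): a single hash collides on a pair at distance $$d$$ with probability $$p = 1 - d/N$$, so the concatenated key collides with probability $$p^k$$, which decays much faster for distant pairs than for close ones.

```cpp
#include <cmath>

// For bit-sampling LSH over N-bit vectors, a key of k concatenated
// hashes collides on a pair at Hamming distance d with probability
// (1 - d/N)^k. Increasing k suppresses collisions of distant pairs
// much faster than collisions of close pairs.
double key_collision_probability(unsigned N, unsigned d, unsigned k) {
    return std::pow(1.0 - double(d) / N, k);
}
```

For example, with $$N = 64$$ and $$k = 10$$, a pair at distance 4 still collides more than half the time, while a pair at distance 32 almost never does.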

# Binary

The solution can be used as a binary program lsh-regression which supports:

• reading both training and test data from files,
• writing predicted results for the test data to a file and computing the logloss measure on the test data,
• a configurable data buffer size or reading files using mmap, in order to control memory usage,
• the svm file format; for details see paal::detail::svm_row::operator>>(),
• sparse/dense data point representations,
• serialization of the model to a file and reading the serialized model back from a file.

For more details on usage, please run the binary program with --help option.

# Complexity

The total time complexity is $$O(L(m+n)kh)$$, where $$L$$ is the number of iterations, $$m$$ and $$n$$ are numbers of training and test points respectively, $$k$$ is the number of hash functions used by a single hash key, $$h$$ is the time complexity of a single application of one LSH function.

The space complexity is $$O(n+k+s)$$, where $$s$$ is the maximal size needed by the hash map, which is proportional to the number of different hash keys of all training points for a single hash function $$g$$.

# References

The LSH function families and the original version of the algorithm are described in [1].