k-means clustering engine

# Problem definition.

In the k-means clustering problem we are given:

• a set of observation $$(x_1,x_2, ...x_n)$$
• function given distances between observation and center $$d$$.
• function given center for cluster $$c$$.
• function given start position of cluster center $$c$$. k-means clustering aims to partition the n observations into k sets $$(k<=n)$$. $$S ={S_1,S_2,...,S_k}$$ and give for each set center so as to minimize the within-cluster sum of distances from center to all observation:

\begin{eqnarray*} \mbox{argmin}_{S}\sum\limits_{i=1}^k\sum_{x_j\in S_i} d(x_j-c(S_i)) & \\ \end{eqnarray*}

# Solution

TODO

Example:

#include <vector>
#include <limits>
#include <cmath>
int main() {
// in this example cetroid is always chosen from the original points
// in this version of algorithm, user is required to provide functors which are used
// to run k_medians. That is closest_to and centroid
using Point = std::vector<double>;
// sample data
const int NUMBER_OF_CLUSTER = 2;
std::vector<Point> points = { { 0, 0 },
{ 0, 3 },
{ 4, 0 } };
std::vector<Point> centers(NUMBER_OF_CLUSTER);
paal::get_random_centers(points, NUMBER_OF_CLUSTER, centers.begin());
std::vector<std::pair<Point, int>> point_cluster_pairs(points.size());
//this centroid chooses centers from points
auto centroid = [&](const std::vector<std::vector<double>> &points) {
double best_dist = std::numeric_limits<double>::max();
std::vector<double> result;
for (auto && center : points) {
double dist = 1e99;
for (auto && point : points) {
dist = std::min(dist, paal::distance_square(point, center));
}
if (dist < best_dist) {
best_dist = dist;
result = center;
}
}
assert(!result.empty());
return result;
};
//default closest_to
auto closest_to = [&](const Point & point) {
return paal::closest_to(point, centers);
};
// solution
paal::k_means(points,
centers, centroid, closest_to,
point_cluster_pairs.begin());

complete example is k_means_clustering_engine_example.cpp

## Parameters

IN: ObservationRange observation_range

IN: const unsigned int number_of_cluster

IN: GetCoordinates distance

IN: Center center functor return center of set of samples

IN: StarterCenter get_starter_center

OUT: OutputIterator result

TODO