Frequent-directions

# Problem definition.

In the sketch problem we are given matrix $$A$$. Our goal is to find matrix $$B$$ which is significantly smaller than $$A$$ and approximates it well.

# Solution

We solve the problem using frequent-directions algorithm. We start with zero matrix $$B$$. Rows from matrix $$A$$ replace all-zero valued rows of sketch $$B$$. If there is no zero valued row we nullify some of them in two steps:

1. the sketch is rotated using its Singular Value Decomposition (SVD) such that its rows are orthogonal and in descending magnitude order
2. the least important rows are nullified so that $$compress\_size$$ of them are nonzero and rest are set to zero

Default $$compress\_size$$ is half of number of rows in sketch.

Pseudocode:

\begin{eqnarray*} & & \\ & & \color{blue}B \leftarrow all\ zeros\ matrix\ R^{\ l \times m}\\ & & \textbf{for } \color{red}i \in [\color{red}n] \textbf{ do}\\ & & \hspace{1cm}Insert\ \color{blue}{A_i}\ into\ a\ zero\ valued\ row\ of\ \color{blue}B\\ & & \hspace{1cm}\textbf{if }\color{blue}B\ has\ no\ zero\ valued\ rows \textbf{ then}\\ & & \hspace{2cm}[\color{blue}U, \color{blue}\Sigma, \color{blue}V] \leftarrow SVD(\color{blue}B)\\ & & \hspace{2cm}\color{red}\delta \leftarrow \color{red}\sigma^2_{compress\_size}\\ & & \hspace{2cm}\color{blue}{\hat{\Sigma}} \leftarrow \sqrt{max(\color{blue}\Sigma^2 - I_l\color{red}\delta,0)}\\ & & \hspace{2cm}\color{blue}B \leftarrow \color{blue}{\hat{\Sigma}V}^T\\ \end{eqnarray*}

# Example

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>
#include <iostream>
#include <vector>
using coordinate_t = double;
using matrix_t = boost::numeric::ublas::matrix<coordinate_t>;
int main() {
std::size_t const rows_count = 5;
std::size_t const columns_count = 3;
std::size_t const sketch_size = 4;
auto fd_sketch = paal::make_frequent_directions<coordinate_t>(sketch_size, columns_count);
matrix_t data(rows_count, columns_count);
for (auto i : paal::irange(rows_count)) {
for (auto j : paal::irange(columns_count)) {
data(i, j) = i + j;
}
}
fd_sketch.update(std::move(data));
std::vector<std::vector<coordinate_t>> rows = {
{2, 1, 0},
{3, 2, 1},
{4, 3, 2},
{5, 4, 3},
{6, 5, 4}};
fd_sketch.update_range(std::move(rows));
auto row = {7.0, 6.0, 5.0};
fd_sketch.update_row(std::move(row));
fd_sketch.compress();
auto actual_size = fd_sketch.get_sketch().second;
std::cout << "Actual sketch size: " << actual_size << std::endl;
auto sketch = fd_sketch.get_sketch().first;
boost::numeric::ublas::matrix_range<matrix_t> sketch_range (sketch,
boost::numeric::ublas::range(0, actual_size),
boost::numeric::ublas::range(0, columns_count));
std::cout << "Sketch data:" << std::endl;
paal::print_matrix(std::cout, sketch_range, " ");
std::cout << std::endl;
return 0;
}

example file is frequent_directions_example.cpp

# Main functions and methods

There are four make functions constructing the frequent_directions model. First two with user matrix as sketch:

auto make_frequent_directions(Matrix matrix);
auto make_frequent_directions(Matrix matrix, size_t const compress_size);


And other which creates matrix of given sizes:

auto make_frequent_directions(size_t rows_count, size_t columns_count);
auto make_frequent_directions(size_t rows_count, size_t columns_count, size_t const compress_size);


The model can be updated with additional data by using following methods:

void update(MatrixData&& matrix);
void update_row(InputRow&& input_row);
void update_range(RowRange&& row_range);


# Parameters

IN: Matrix matrix - starting sketch matrix

IN: size_t compress_size

IN: size_t rows_count - numer of rows in a sketch

IN: size_t columns_count - number of columns in a sketch

IN: MatrixData &&matrix - data to update sketch in matrix form

IN: InputRow &&input_row - single row to update

IN: RowRange &&row_range - range of rows to update

# Binary

The solution can be used as a binary program frequent-directions which supports:

• reading data from a file or standard input,
• writing sketch to a file or standard output,
• configurable data buffer size in order to control memory usage,
• configurable $$compress\_size$$,
• serialization of the model to a file and reading the serialized model from a file.

For more details on usage, please run the binary program with --help option.

# Approximation Ratio

After adding all data and running compress phase at the end, we get:

\begin{eqnarray*} &0\preceq B^TB \preceq A^TA\\ and&\\ &\Vert A^TA - B^TB \Vert \leq \Vert A \Vert ^2_f / compress\_size\\ \end{eqnarray*}

# Complexity

Complexity of the algorithm is $$O(n*m*l / (1-\frac{compress\_size}{l}))$$ where $$n$$ is number of rows in matrix $$A$$, $$m$$ is number of columns in matrix $$A$$, $$l$$ is number of rows in sketch $$B$$ and $$compress\_size$$ is number of nonzero rows after compress phase.

# References

The algorithm analysis is described in [14]