Problem definition.

In the sketch problem we are given matrix \(A\). Our goal is to find matrix \(B\) which is significantly smaller than \(A\) and approximates it well.

Solution

We solve the problem using frequent-directions algorithm. We start with zero matrix \(B\). Rows from matrix \(A\) replace all-zero valued rows of sketch \(B\). If there is no zero valued row we nullify some of them in two steps:

the sketch is rotated using its Singular Value Decomposition (SVD) such that its rows are orthogonal and in descending magnitude order
the least important rows are nullified so that \(compress\_size\) of them are nonzero and rest are set to zero

Default \(compress\_size\) is half of number of rows in sketch.

Pseudocode:

\begin{eqnarray*} & & \\ & & \color{blue}B \leftarrow all\ zeros\ matrix\ R^{\ l \times m}\\ & & \textbf{for } \color{red}i \in [\color{red}n] \textbf{ do}\\ & & \hspace{1cm}Insert\ \color{blue}{A_i}\ into\ a\ zero\ valued\ row\ of\ \color{blue}B\\ & & \hspace{1cm}\textbf{if }\color{blue}B\ has\ no\ zero\ valued\ rows \textbf{ then}\\ & & \hspace{2cm}[\color{blue}U, \color{blue}\Sigma, \color{blue}V] \leftarrow SVD(\color{blue}B)\\ & & \hspace{2cm}\color{red}\delta \leftarrow \color{red}\sigma^2_{compress\_size}\\ & & \hspace{2cm}\color{blue}{\hat{\Sigma}} \leftarrow \sqrt{max(\color{blue}\Sigma^2 - I_l\color{red}\delta,0)}\\ & & \hspace{2cm}\color{blue}B \leftarrow \color{blue}{\hat{\Sigma}V}^T\\ \end{eqnarray*}

Example

#include "paal/sketch/frequent_directions.hpp"
#include "paal/utils/irange.hpp"
#include "paal/utils/print_collection.hpp"
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>
#include <iostream>
#include <vector>
using coordinate_t = double;
using matrix_t = boost::numeric::ublas::matrix<coordinate_t>;
int main() {
    std::size_t const rows_count = 5;
    std::size_t const columns_count = 3;
    std::size_t const sketch_size = 4;
    auto fd_sketch = paal::make_frequent_directions<coordinate_t>(sketch_size, columns_count);
    matrix_t data(rows_count, columns_count);
    for (auto i : paal::irange(rows_count)) {
        for (auto j : paal::irange(columns_count)) {
            data(i, j) = i + j;
        }
    }
    fd_sketch.update(std::move(data));
    std::vector<std::vector<coordinate_t>> rows = {
        {2, 1, 0},
        {3, 2, 1},
        {4, 3, 2},
        {5, 4, 3},
        {6, 5, 4}};
    fd_sketch.update_range(std::move(rows));
    auto row = {7.0, 6.0, 5.0};
    fd_sketch.update_row(std::move(row));
    fd_sketch.compress();
    auto actual_size = fd_sketch.get_sketch().second;
    std::cout << "Actual sketch size: " << actual_size << std::endl;
    auto sketch = fd_sketch.get_sketch().first;
    boost::numeric::ublas::matrix_range<matrix_t> sketch_range (sketch,
         boost::numeric::ublas::range(0, actual_size),
         boost::numeric::ublas::range(0, columns_count));
    std::cout << "Sketch data:" << std::endl;
    paal::print_matrix(std::cout, sketch_range, " ");
    std::cout << std::endl;
    return 0;
}

example file is frequent_directions_example.cpp

Main functions and methods

There are four make functions constructing the frequent_directions model. First two with user matrix as sketch:

auto make_frequent_directions(Matrix matrix);
auto make_frequent_directions(Matrix matrix, size_t const compress_size);

And other which creates matrix of given sizes:

auto make_frequent_directions(size_t rows_count, size_t columns_count);
auto make_frequent_directions(size_t rows_count, size_t columns_count, size_t const compress_size);

The model can be updated with additional data by using following methods:

void update(MatrixData&& matrix);
void update_row(InputRow&& input_row);
void update_range(RowRange&& row_range);

Parameters

IN: Matrix matrix - starting sketch matrix

IN: size_t compress_size

IN: size_t rows_count - numer of rows in a sketch

IN: size_t columns_count - number of columns in a sketch

IN: MatrixData &&matrix - data to update sketch in matrix form

IN: InputRow &&input_row - single row to update

IN: RowRange &&row_range - range of rows to update

Binary

The solution can be used as a binary program frequent-directions which supports:

reading data from a file or standard input,
writing sketch to a file or standard output,
configurable data buffer size in order to control memory usage,
configurable \(compress\_size\),
serialization of the model to a file and reading the serialized model from a file.

For more details on usage, please run the binary program with --help option.

Approximation Ratio

After adding all data and running compress phase at the end, we get:

\begin{eqnarray*} &0\preceq B^TB \preceq A^TA\\ and&\\ &\Vert A^TA - B^TB \Vert \leq \Vert A \Vert ^2_f / compress\_size\\ \end{eqnarray*}

Complexity

Complexity of the algorithm is \(O(n*m*l / (1-\frac{compress\_size}{l}))\) where \(n\) is number of rows in matrix \(A\), \(m\) is number of columns in matrix \(A\), \(l\) is number of rows in sketch \(B\) and \(compress\_size\) is number of nonzero rows after compress phase.

References

The algorithm analysis is described in [14]