SVD: the easy case

First, I assume the data is well behaved (see part 1 to see how to make the data well behaved).

The easy case is when you have

a small number of features (e.g. 100 to 1000)
the features are dense

In the usual notation

X: is data by feature matrix
n: data points (rows)
m: features columns

To compute the SVD we can use:

For the sample covariance matrix
C = (X'X)/(n - 1)
Smooth the covariance matrix
C_smooth = C + lambda*I (lambda is a small constant)

(V,S1,Vt) = svd(C_smooth) #call an SVD solver

//Mathematically the V above is the same V as in X = USVt (but the S is not S1)
//therefore X*(Vt)' = U*S

project the dataset X onto (Vt').

output new representation of datapoints = X*(Vt)'

The value of lambda affects numerical stability and the singular values of S1 (those values have lambda added to them)

SVD: the harder part of the easy case

The heavy lifting in the easy case is to compute X’X. Imagine X has 1000 columns but 1 million data points.

We can compute X’X exactly, but if the data is dense (i.e. most data points have most of the feature set), we don’t have to. We can compute X’X approximately without affecting too much the output.

Why is that? X’X is the sample covariance matrix of the whole sample. One can think that if we take a subsample of sufficient size, it will do. However, there is a better method. This is the following method due to Woodruf and Clarkson and related to the so called hashing trick used by Vowpal Wabbit machine learning system and the Hashing Vectorizer.

A faster covariance matrix at the speed of sampling.

Input: X is an n by m matrix. We assume that m is small.

Allocate a summary matrix, called it K. K has z = 10000 rows and k columns.
K is a summary of X but it has fewer columns.

scan over the rows of X
  let row_of_X be the next row
  random_pairs = generate 50 pairs of random numbers: (sign, row_index in K)
  //sign is a +/- 1 and row_index is a random number [0, z - 1].
  for (sign, random_row_index) in random_pairs:
    K(random_row_index) += sign*row_of_X 

compue K'K instead of X'X.
(V,S1,Vt) = svd(K'K + lambda*I)

SVD: the hard case

The hard case is when we have

large number of data points (e.g. documents)
large number of features (e.g. words or phrases)
very sparse matrix
a very large dataset (e.g. does not fit in the memory of the standard SVD solvers)

The easier part of the hard case

The easier part of the hard case is when the data is not “too sparse”

In that case:

Let R be a random matrix containing +/-1 chosen randomly
the dimenions of R are n and z. Take z to be between 200 and 1000.

Pass 1:
Compute K = X'X*R
K is a tall thin matrix of dimension m by z

Assume K can fit into memory
In other words, the feature space m is not so large (may be its less than 1 million)

Pass2:
Compute
M = X*K

Let (U,_,_) = svd(M)
U is n by z
U is already some approximation of the final result

compute F = X'*U
F is m by z

let (_,_, Vt) = svd(F)
Vt is l by z (in practice we pick l = z, but writing l helps check the dimensions)

(Vt') is z by l

Pass 3:
Output = U*(Vt)' 
Output is n by l (but we picked l = z)

SVD: the harder part of the hard case

The harder case is when the data is too sparse but also has very little structure. For example, when the matrix represents a graph in which there are a huge number of very small connected components will fit this case.

Then it’s very hard to change the representation from a sparse matrix to a dense matrix with small dimension.

How can this case be seen from the SVD? We have to look at the singular values: in that case they are not decaying rapdily. For example 100th singular value is still quite large. This means that the top 100 singular vectors/values have not captured enough of the structure in the dataset. We need a higher dimension to capture the dataset.

What can be done in this case is to first run the procedure from “SVD: the easier part of the hard case”. Then we remove by orthogonalization the first 100 singular vectors and apply again the SVD procedure to compute the next 100 singular vectors.

SVD: finding points with large approximation errors