In this blog post, I motivate the SVD as a way to create better features for machine learning. The improved features are based on features present in the original data. I start with the formula for the variance of two random variables. Then I discuss how correlated features “break” regression and nearest neighbor based on Euclidean Distance. The way to “fix” those problems is to decorrelate the features. This is where the SVD comes in. It is (an) optimal decorrelation method where the new features are linear combinations of the original ones.

That being said, the best use case for the application of the SVD is when the data has multiple weak features (e.g. pixels/words) which are of the same type. Those features should have some structure (correlations) for the SVD to be useful. For example, two neighboring pixels on an image are likely correlated (have similar colors). Ideally features should not have too heavy tailed distribution. This means we don’t want features to take too extreme values too often. This is the reason why for word data we apply some transformations prior to applying the SVD.

SVD: Singular Value Decomposition
PCA: Principle Component Analysis
Mahalanobis Distance: a generalization of Euclidean Distance which takes into account correlation
between original features

Key formula: sum of variance of two random variables

The formula for the variance of a linear combination of random variables is:

This formula tells us that if two variables are correlated the variance of their sum is increased with twice the amount of their correlation in addition to the sum of the individual variances.

As a simple case consider $\alpha=\beta=1$ and $Var(X) = Var(Y)$. We consider two cases: X and Y are uncorrelated and correlated.

This means that correlation causes the variance of the sum to be twice as large compared to the case when there is no correlation.

A better way to write the above comparision is to create a new feature which is the mean of the two variables:

As it can be seen, averaging uncorrelated variables reduces the variance, while for correlated variables, avaraging did not change the variance (in this case the average is the same variable).

We will consider two use cases for correlation. The first use case is during prediction with linear regression. In this case, when we predict the label of a new data point, we add correlated variables multipled by their estimated coefficients. The coefficients themselves in this case are likely to have high variance. This is why the predicted value will have high variance.

The second use case is related to building stronger predictive features. Strong predictive features tend to have high variance. One way to increase the variance is to add correlated variables to make a stronger feature.

The key difference between those two cases is whether one estimates coefficients based on variables that are correlated or on variables that are not correlated.

Correlated Observations in Linear Regression

Imagine the following situation arising in prediction with linear regression:

There are two sets of variables 
  X1, X2, ..., Xm
  and Z1, Z2, ..., Zk

You want to predict Y as a function of all of those:
  Y ~ lm(X_i + Z_j) #lm == linear model

X1, X2, ..., Xm are correlated. They measure the same signal but with noise:
  X_i = true X + noise

Similarly Z_j's are correlated:
  Z_i = true Z + noise.

As a practical example $X$ and $Z$ could be two topics. Then $X_i$ are words from the topic $X$ while $Z_j$ could be words from the topic $Z$.

Now I will propose two simple fitting strategies:

Strategy 1:
  Run linear regression on all the variables:
    Y ~ lm(X_i + Z_j)
Strategy 2:
  First average all X_i's and Z_j's:
  X_mean = mean(X_i)
  Z_mean = mean(Z_j)
  Run a regression on X_mean and Z_mean:
    Y ~ lm(X_mean + Z_mean)

Main question: Which estimation strategy works better?

It depends. There are three parameters of importance:

n: number of training points
m, k: number of correlated observations (we make m = k for simplicity)
noise: how much is the noise due to measurements

However, as we’ll see the first strategy is likely to result in high variance especially in small n and large m and k.

Under the first strategy, let’s imagine we do a stage-wise fitting approach. This means that first we fit to the mean of $Y$, then we choose the better predictor of $X_i$ and $Z_j$. Assume $X_i$ are better predictors. Those variables have essentially the same predictive power, so without loss of generality let us image that we pick $X_1$. At the next state we pick some $Z$, for example $Z_1$. After that none of the remaining variables $X_2$ through $X_m$ and $Z_2$ through $Z_k$ will result in substantial reduction in the error. Their estimated coefficients will be close to 0, but not exactly 0. This is how large variance in the estimated coefficients comes in.

Now there are two ways out:

  • We prune those variables $X_2 \ldots X_m$ and $Z_2 \ldots Z_m$. This is suboptimal since the information in them is not used;
  • We leave them. However, this is even worse since we end up with large variance when we predict $Y$.

The second strategy is much better since by averaging, we reduce the variance in $mean(X_i)$ and $mean(Z_j)$. Then we just fit two coefficients which is more stable.

Just want to make clear that L2-penalized regression (or ridge regression) is likely to fix the issue. Internally ridge regression works through the SVD.

Euclidean distance/Cosine on correlated data

In the nearest neighbor prediction model, two data points $a$ and $b$ are similar (or neighbors) if the distance between them is small. As before, let the data points have two components $x_i$ and $z_j$. The variables $x_i$ are correlated between themselves; $z_j$ are also correlated between themselves, but $x_i$ and $z_j$ are not correlated.

Similarly to the regression use case, consider two ways of measuring the distance between $a$ and $b$

Strategy 1:
Do not preprocess the variables
Strategy 2: 
First find the means of x and z, then compute the distance between the means:

Again, why do you want to use the second strategy? Because $x_i$ and $y_i$ are noisy measurements and the distance is not really the “true” distance, but a noisy version with it. The first strategy amplifies the noise present in the measurement, while the second strategy reduces it. The key is to first average the correlated variables to reduce the noise.

A related technique is the so called (penalized) Mahanobis Distance. Mahanonobis distance decorelates the features and normalizes them to have unit variance. Features with very small variances will cause issues unless the penalized version of the Mahalonobis Distance is used. This is where the SVD come in - it makes it possible to better estimate the covariance matrix and to invert it. Be careful with matrix inversion in statistics problems. Always use smoothing techniques. (See the blog post Better Euclidean Distance With the SVD ).

Main Observation

Both euclidean distance and linear regression are expected to work better (lower variance) if the variables are uncorrelated. Therefore the SVD can be motivated by the need to decorrelate the variables.

Optimal decorrelation

Now that we know that decorrelation is good, how can we achieve it?

Both the euclidean distance and linear regressioin are based on linear transformation of the original variables. So, it’s natural to consider a reparameterization of the original inputs into new inputs which are linearly related. Assume $x=(x_1,x_2,\ldots,x_m)$ are the original inputs written as a row vector. Then we can consider $k$ new features:

Those features are simply linear combinations of the original features. There are two natural requirements of those features:

  • The new features $f_j$ are uncorrelated

  • The new features $f_j$ have high variances

  • A third (technical) requirement is that the L2-norm of the coefficients $\beta_{j}$ is 1.

The first requirement is motivated by the previous discussion how correlation results in a large variance of the preditions. The second requirement states that $f_j$ should have high variances. Why? Consider what happens under the opposite possibility, namely that $f_i$ have small variances. The smallest variance is zero, which means that the variable is a constant. A constant will yield no information neither for regression nor for computing the closes points with euclidean distance. Large variance of the feature means exactly the opposite. It means an informative (discriminative) feature. For example, making measurements of the feature more fine grained will increase the information (variance/entropy) of the feature in general.

Now one can try the following stagewise approach:

Stage 1: Find the linear combination of the original features that has the highest variance.
Stage 2: Find a second linear combination which is uncorrelated with the first, but
      again with largest possible variance.
Stage 3: A linear combination uncorrelated with the first and the second derived feature and maximal variance

As it happens, this procedure is equivalent to the PCA. PCA is the SVD not of the original input features, but of the feature minus their respective means. This is because the formula for the covariance requires subtracting the means. For now we’ll keep this intuition, but later on we’ll compare the SVD to the PCA (i.e. in what cases subtacting the means of the orignal variables makes sense).

Truncated SVD

As it can be seen from the stagewise algorithm, each subsequent feature has less variance (less information). However, the variance of the feature has two components:

  • One component is due to structure
  • The other component is due to noise

Since we systematically exhaust the structure, at the end we are left mostly with noise. The common wisdom is that we just keep the first few linear combination (50 to 500) depending on the size of the data and drop the rest.

Caution. Not all data points are treated equally well by the (truncated) SVD

It may be that for some data points, the optimal choice is 50 features from the SVD while for other datapoints the optimal is 5 features. We should also be aware that data points with norms close to 0 (such as data points having fewer measurements as well as outliers) will not be well-represented by the SVD. This means the structure behind those data points is not found, but noise is added nonetheless. The noise comes from imposing a structure based on other data points. That’s why this argument about SVD cleaning the noise, while true on average, is wishful thinking for some data points. This will be a separate blog post on that issue with worked example as well as advice what to do about the issue.

Is the SVD really optimal?

It is optimal in the sense of the mathematical model just described: that we can decompose the original features into a smaller number of features while preserving most of the variance (interesting structure) of the dataset. But what are the shortcoming to this approach:

  • Non-sparsity. Most coefficients are close to zero but not exactly zero. A better approach in some cases could be to have a few non-zero coefficients that are larger rather than a lot of coefficients that are smaller.
  • The SVD does not restrict the matrixes U and V to be non-negative. If you restrict U and V you can get a more interpetable representation. The SVD entails complex (non-interpetable) cancelations. See non-negative matrix factorizations.
  • The SVD is too greedy. You can think of the SVD as a stage wise method that finds a representation that reduces the variance. However, this procedure is too greedy in the sense that exhausts the variance too quickly. Therefore components with large indexes look random.
  • Too small basis. This causes the SVD to produce worse representation for some data points than the original representation.
  • Applicable to features that are of the same type (e.g. words, or click data), but not applicable if you mix for example a location feature or a time feature with word feature. Features need to be properly normalized before applying SVD. Typically this means taking the log or the square root.
  • Finding statistically independent features may be superior to uncorrelated features (this is Statistical Component Analysis vs. Principle Component Analysis comparison).