In my blog post “SVD Part 1: Why is correlation bad?” I argued that when the data contains correlated features, it is better to “remove” the correlations first by applying the SVD. This is (almost) the same as applying the penalized Mahalanobis distance. (“Almost” because we may not want to center the input data by subtracting the means of the features.)

A lot of Machine Learning textbooks mention the Mahalanobis distance, but of the textbooks that I have, only “The Elements of Statistical Learning” by Hastie, Tibshirani and Friedman mentions the penalized version (which is the version one should use in practice). When I was at university, I was always being warned by my professor: “Never compute the inverse of a matrix unless you have to”. A more useful version of this warning is: “If you want to compute the inverse of the covariance matrix, always smooth the covariance matrix first”.

What is the reason for the professor’s warning? If you analyze the inverse of the covariance matrix through the SVD, you will find that the inverse entails a division by the singular values of the covariance matrix (see below). When the input features are correlated, some singular values will be close to 0, so computing the inverse of the covariance matrix means dividing by very small numbers. This makes some of the “newly derived features” very large, which is bad, since those features are precisely the ones that are least useful for machine learning purposes.
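To see this effect, here is a minimal NumPy sketch (the data size and noise scale are illustrative choices): two almost perfectly correlated features produce a near-zero singular value, and inverting the covariance matrix blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly identical (highly correlated) features.
a = rng.normal(size=1000)
X = np.column_stack([a, a + 1e-3 * rng.normal(size=1000)])

C = X.T @ X                      # (un-centered) covariance matrix
s = np.linalg.svd(C, compute_uv=False)
print(s)                         # one singular value is tiny

C_inv = np.linalg.inv(C)
print(np.abs(C_inv).max())       # huge entries: we divided by ~0
```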

We need some notation to properly describe what happens.

The input data is a matrix $X$ where the rows are the data points and the columns are the features:

The (un-centered) covariance matrix is:

$X^T X$

If you want the centered version, first compute the mean value for each column and subtract the mean value from its respective column.

Let $X = U S V^T$ be the SVD of $X$.

What we want to accomplish is to compute

$(X^T X + \lambda I) ^{-1}$

Let $s_1, s_2, \ldots, s_k$ be the top $k$ singular values of $X$.

Just replace $X$ with its SVD in the formula above, apply the Woodbury matrix identity, and you will work out the answer to be:

$(X^T X + \lambda I)^{-1} = V \text{diag}\big(\frac{1}{s_i^2 + \lambda}\big) V^T$
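This identity is easy to check numerically. A minimal NumPy sketch (the matrix size, seed, and value of $\lambda$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
lam = 0.1

# SVD of X; the rows of Vt are the right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Direct inverse vs. the inverse computed through the SVD.
direct = np.linalg.inv(X.T @ X + lam * np.eye(5))
via_svd = Vt.T @ np.diag(1.0 / (s**2 + lam)) @ Vt

print(np.allclose(direct, via_svd))  # True
```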

This formula clearly shows:

• how we avoid division by a small number: the denominator $s_i^2 + \lambda$ is bounded below by $\lambda$;
• all the features are “shrunk”, but the more important features are shrunk less. Penalization (adding $\lambda$ to the diagonal of the covariance matrix) is sometimes called shrinkage because it “shrinks” the contribution of the features. Without the penalty, the scaling would have been $\frac{1}{s_i^2}$.

In practical terms, when the number of points is large but the number of features is small (on the order of 1000s), we can work with the SVD of the (un-centered) covariance matrix:

$\text{svd}(X^T X) = V S^2 V^T$

where $V$ and $S$ are the same $V$ and $S$ as in the SVD of $X$. In this way, we don’t require special SVD solvers. You can also use the eigenvalue decomposition, since for the symmetric matrix $X^T X$ it coincides with the SVD.
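For instance, the following sketch (toy data; the shapes and seed are illustrative) confirms that the eigenvalues of $X^T X$ are exactly the squared singular values of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
C = X.T @ X

# Eigendecomposition of the symmetric PSD matrix C.
evals, evecs = np.linalg.eigh(C)            # eigh returns ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort descending, like the SVD

_, s, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(evals, s**2))  # True
```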

The procedure in this whole blog post can be summarized as:

Step 1: Compute $X^T X$.

Step 2: Compute the SVD of the result of Step 1: $\text{svd}(X^T X) = V S^2 V^T = V D V^T$ (where $D = S^2$).

(Steps 1 and 2 are for the case where you don’t have a solver for the SVD of large matrices. Check this answer for why the SVD of $X^T X$ is less desirable to use than the SVD of $X$.)

Step 3: Take the top $k$ singular values of $X^T X$. Those are $d_1 = s^2_1, \ldots, d_k = s^2_k$.

Step 4: Compute the transformed features:

$Z = X V \, \text{diag}\big(\frac{1}{\sqrt{d_i + \lambda}}\big)$

Step 5: Compute the Euclidean distance using the transformed features

A little bit of explanation of Step 4 follows. $X V$ is an $n$ by $k$ matrix ($n$ is the number of data points). Row $i$ of $X V$ is a new representation of data point $i$. It’s a row vector; let us call it $[v_1, v_2, \ldots, v_k]$. We then weight the components of this row vector by $w_i = \frac{1}{\sqrt{d_i + \lambda}}$, giving the transformed row $[w_1 v_1, w_2 v_2, \ldots, w_k v_k]$.
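Steps 1–5 can be sketched in NumPy as follows (a toy example; the shapes, seed, and `lam` are illustrative assumptions, and $k$ is taken to be the full feature count so that the distance matches the penalized Mahalanobis distance exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
lam = 1.0
k = 6  # keep all singular values in this toy example

# Steps 1-2: SVD (here via eigendecomposition) of X^T X.
d, V = np.linalg.eigh(X.T @ X)
d, V = d[::-1][:k], V[:, ::-1][:, :k]   # Step 3: top-k, descending

# Step 4: transformed features Z = X V diag(1 / sqrt(d_i + lam)).
Z = (X @ V) / np.sqrt(d + lam)

# Step 5: Euclidean distance in the transformed space...
i, j = 0, 1
dist_transformed = np.linalg.norm(Z[i] - Z[j])

# ...equals the penalized Mahalanobis distance (with k = all features).
M = np.linalg.inv(X.T @ X + lam * np.eye(6))
diff = X[i] - X[j]
dist_mahalanobis = np.sqrt(diff @ M @ diff)

print(np.isclose(dist_transformed, dist_mahalanobis))  # True
```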

If you want the Mahalanobis distance, first apply the following Step 0:

Step 0: Replace each column $col$ in $X$ with $col - \text{mean}(col)$. For Steps 1 to 5, take $X$ to be the matrix with the replaced columns.
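Step 0 is just per-column mean removal, e.g. (the data here is an illustrative toy):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, size=(100, 3))

# Step 0: subtract each column's mean from that column.
Xc = X - X.mean(axis=0)
print(np.allclose(Xc.mean(axis=0), 0.0))  # True
```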

Should you do Step 0? It could be that Step 0 makes the nearest-neighbor solution worse. Why is that? Typically the features are encoded so that when the value of a feature is close to 0, the feature bears the least information for the problem at hand (for example, prediction). So when you subtract the mean, you shift that meaningful zero point and make the features less useful for solving the prediction problem.