SVD vs. PCA
One question that arises often in the context of PCA/SVD is how they are different and when one should use one or the other.
Common Notation
We will use the following notation:

X is an n by m matrix:
    rows = data points, columns = features
    n: number of rows (data points)
    m: number of columns (features)
What is PCA?
PCA is the following transformation:
Transform the dataset:
    for each column i of X:
        column_i = X[:,i]
        X[:,i] = column_i - mean(column_i)

Let U, S, V = svd(X, k = 100)   -- SVD after the transformation
    U is an n by k orthonormal matrix
    S is a k by k diagonal matrix
    V is an m by k orthonormal matrix

Then we have a new representation of the dataset:
    Option a) U (n by k matrix; k new features)
    Option b) U*diag(S)
    Option c) V*diag(sqrt(S*S/(S*S + 1.0)))
    Option d) V*diag(f(S)), where f(s) = 1/s if s >= threshold else 0.0
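As a rough illustration, here is a minimal NumPy sketch of this procedure (the function name, the toy data, and the choice of k are made up for the example); it uses option (b), U*diag(S), as the new representation.

import numpy as np

def pca_via_svd(X, k):
    # subtract each column's mean, then take a truncated SVD
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    U, S, Vt = U[:, :k], S[:k], Vt[:k, :]      # keep the top k components
    scores = U * S                             # option (b): U * diag(S), an n by k matrix
    return scores, Vt.T                        # new representation and the k principal directions

X = np.random.rand(100, 20)                    # n = 100 data points, m = 20 features
Z, V = pca_via_svd(X, k=5)
print(Z.shape, V.shape)                        # (100, 5) (20, 5)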
Geometrically, PCA corresponds to “centering the dataset” and then rotating it to align the directions of highest variance with the principal axes.
PCA vs. SVD question
Since we use the SVD to compute the PCA, the question is really whether it makes sense to center our dataset by subtracting the means.
A related question arises in the context of linear regression: does it make sense to subtract the means from our variables (and also divide by the standard deviations)? This is the so-called Z-score normalization. One way to answer this question is to note that these transformations (subtracting a constant and dividing by a constant) are affine transformations of the input, and linear regression with an intercept can absorb any such transformation, so it does not matter whether the normalizations are carried out: the fitted predictions are the same. This is true for standard linear regression. However, in practice we do L2-penalized regression (also called ridge regression or Bayesian linear regression), and then the transformation actually matters. For the penalized version of regression, both subtracting the mean and dividing by the standard deviation are likely to be harmful transformations.
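As a rough check (using scikit-learn on synthetic data; the data and the penalty strength are made up for the example), the sketch below shows that ordinary least squares with an intercept gives identical predictions with and without Z-score normalization, while ridge regression does not:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(200, 10)).astype(float)   # nonnegative count-like features
y = X @ rng.normal(size=10) + rng.normal(size=200)

X_z = (X - X.mean(axis=0)) / X.std(axis=0)           # Z-score normalization

ols_raw = LinearRegression().fit(X, y).predict(X)
ols_z = LinearRegression().fit(X_z, y).predict(X_z)
print(np.allclose(ols_raw, ols_z))                   # True: OLS predictions are unchanged

ridge_raw = Ridge(alpha=10.0).fit(X, y).predict(X)
ridge_z = Ridge(alpha=10.0).fit(X_z, y).predict(X_z)
print(np.allclose(ridge_raw, ridge_z))               # False: the penalty is not invariant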
Here we’ll just look at the case of subtracting the means. In many problems our features are positive values such as counts of words or pixel intensities. Typically a higher count or a higher pixel intensity means that a feature is more useful for classification/regression. If you subtract the mean, then features whose original value was zero are forced to take a negative value that is large in magnitude. This means the feature values that are unimportant to the classification problem have just been made as important (in magnitude) as the most important feature values.
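A tiny made-up word-count matrix makes this concrete: after subtracting the column means, the zeros (words that are absent from a document) become negative values of non-trivial magnitude, and the matrix is no longer sparse.

import numpy as np

counts = np.array([[0, 0, 12],
                   [0, 9,  0],
                   [7, 0,  0]], dtype=float)   # rows = documents, columns = word counts
centered = counts - counts.mean(axis=0)         # subtract each column's mean
print(centered)                                 # the zeros become roughly -2.33, -3.0, -4.0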
The same reasoning holds for PCA. If the useful information in your features lies in their variation around the mean (the mean itself carries little signal), then it makes sense to subtract the mean. If the useful information lies in the large values themselves, as with counts or intensities, then subtracting the mean does not make sense.
Subtracting the mean vs. projecting on the mean
In some problems such as image classification or text classification (with bag-of-words models), we can see empirically that the first right singular vector of the SVD is very similar to the vector computed by averaging all data points. In this case, the very first stage of the SVD (if you think of the SVD as a stage-wise procedure) takes care of the global structure. PCA takes care of the global structure by subtracting the means.
Here is one way to think of the first singular component of the SVD:
col_means_vector = [col_mean_1, col_mean_2, ..., col_mean_m]   # the means of the features
L2-normalize col_means_vector
V[:,1] = col_means_vector                  # col_means_vector plays the role of V[:,1]

U[1,1] = X[1,:].dot(V[:,1])                # this plays the role of a correlation; the result is a scalar
U[2,1] = X[2,:].dot(V[:,1])
...
s1 = norm(U[:,1])                          # this plays the role of the first singular value
normalize U[:,1]

X[1,:] = X[1,:] - s1 * U[1,1] * V[:,1].T   # remove the global structure
X[2,:] = X[2,:] - s1 * U[2,1] * V[:,1].T
...
So, the column means play a role but in a different way than in the PCA procedure.
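To check the empirical claim above, here is a small sketch on synthetic nonnegative data (made-up Poisson counts standing in for word counts or pixel intensities): the cosine similarity between the first right singular vector and the L2-normalized column means comes out very close to 1.

import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(500, 50)).astype(float)   # nonnegative "count" features

# first right singular vector of the uncentered data
U, S, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]

# L2-normalized vector of column means
col_means = X.mean(axis=0)
col_means /= np.linalg.norm(col_means)

print(abs(v1 @ col_means))                           # cosine similarity, close to 1.0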