Which matrix

In the usual notation

X: n by m matrix
n = data points (rows)
m = features (columns)

We can consider the SVD of the following matrices

X: data point to feature
X*X': data point to data point
X'*X: feature to feature matrix

Those three matrices are related via the SVD:

(U,S,Vt) = svd(X)       [equation 1]
(U,S^2, U') = svd(X*X') [equation 2]
(V,S^2, V') = svd(X'*X) [equation 3]

We can take equation 1 and apply it twice to get any of equation 2 and equation 3.

These results show mathematically it does not matter which of the three matrices we are considering.

Sometimes (this goes to part 2) we can also consider the SVD of the square of the data-to-data matrix:

let D = X*X' (the data to data matrix, or Gram matrix)
(U,S^2,U') = svd(D)
note that D is symmetric: D = D'
(U,S^4,U') = svd(D*D)

What are the goals:

However, in practice it matters to which matrix one decides to take the SVD.

Consider data consiting of documents (data points) and words (features). We can have at least three tasks:

compute most similar words to a given word: P(word_i word_j)
compute the most similar documents to a given documents: P(document_i document_j)
select words that best describe the document (including words not in the document): P(word_i document_j)

Given the task of computing the most similar words, the following is one of the best performing similarity:

P(a,b)/P(a)*P(b)

or

P(a|b)/P(a)

where
P(a) = normalized frequency of word a
P(b) = normalized frequency of word b
P(a,b) = normalized number of contexts (documents) in which a and b appear together

For this formula we need to the log(P(a,b)). So it seems it’s better to do the SVD of the following (or related) transformation:

log(P(a,b)) - log(P(a)) - log(P(a))