Weak features, strong features, what about the SVD?

Some datasets have a few strong features, while other datasets have many weak features. The SVD fits the latter case: the case where we have many weak features. What the SVD produces is a smaller number (say 50 to 500) of stronger features.

This discussion is connected to linear regression. Linear regression comes in a few flavors:

  • ridge regression
  • sparse regression (the lasso)

Ridge regression will make use of many weak features, combining them in a single output of the regression function. The lasso (or sparse regression), on the other hand, will select the few most informative features.
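
Here is a minimal sketch of that contrast, assuming scikit-learn and a made-up dataset in which every feature is only weakly informative:

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 500, 50
X = rng.normal(size=(n, d))
# Made-up target: every feature contributes a little (many weak features).
y = X @ np.full(d, 0.1) + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print((np.abs(ridge.coef_) > 1e-6).sum())  # ridge keeps essentially all features, each with a small weight
print((np.abs(lasso.coef_) > 1e-6).sum())  # the lasso sets many coefficients exactly to zero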

Either of those methods might be preferable depending on the circumstances.

The point here is to know when the SVD is preferable. If you have a few strong features, the SVD will probably not bring you much (and it can even hurt quality). If you have many weak features, then you should try the SVD.

What about the case of mixed strong and weak features? Here is my recommendation: do the SVD on the weak features only and produce a few aggregated features. Why is that? If you put strong features together with weak features, the strong features will mask the weak ones. At a later stage, create a new dataset containing the original strong features and the aggregated weak features.

Ideally, the weak features should have roughly the same variance.
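
A minimal sketch of this recipe, assuming scikit-learn's TruncatedSVD and made-up data with a handful of strong columns and many weak count columns:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_strong = rng.normal(size=(5000, 8))                       # a few strong features
X_weak = rng.poisson(0.3, size=(5000, 1000)).astype(float)  # many weak (count) features

# Run the SVD on the weak features only, so the strong ones cannot mask them.
svd = TruncatedSVD(n_components=50, random_state=0)
X_weak_aggregated = svd.fit_transform(X_weak)               # 50 aggregated, stronger features

# New dataset: original strong features + aggregated weak features.
X_new = np.hstack([X_strong, X_weak_aggregated])
print(X_new.shape)                                          # (5000, 58)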

What the SVD does is replace the original features with linear combinations of those features. Here is an example:

Original features: 
f1
f2
f3
f4


Features after SVD:
s1 = -f1 + f2 - f3 - f4
s2 =  f1           - f4
s3 =  -f1     + f3
s4 =  f1 + f2 + f3 - 4f4
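
To make this picture concrete, here is a small NumPy sketch on made-up data; the rows of Vt hold the coefficients of those linear combinations:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))     # 100 data points with original features f1..f4

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# The SVD features s1..s4 of each data point are linear combinations of f1..f4;
# row i of Vt contains the coefficients of s_i in terms of f1..f4.
S_features = X @ Vt.T
print(np.round(Vt, 2))
print(np.allclose(S_features, U * S))   # True: the same features, computed two ways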

Now you can ask how to recover the original features from the features after the SVD.

We get

f1 = some linear combination of s1, s2, s3 and s4

But the thin (or truncated) SVD uses only the first k features. In this example, say we use the first 3 features:

f1 = the truncated combination (using s1, s2, s3 only) + rest.

Under the thin SVD, we cannot recover the original features exactly; we can recover them only approximately. In particular, the “rest” is a linear combination of the original features. The “rest” term is simply the approximation error, and we want it to be as small as possible. How small the “rest” term is will depend on the magnitude (variance) of each of the features f_i.

Which features will end up in the “rest” will depend on each data point. For the purpose of analysis we can assume that those are a random set of features with random coefficients. Those random coefficients will be small, but if the features themselves are large, the approximation error can be large and can fluctuate from data point to data point.
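
A small sketch of this effect, with made-up independent features whose scales differ widely; after a rank-k reconstruction, the “rest” is tiny for the few largest features and is roughly the size of the feature itself for the ones that are not captured:

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 20, 5

scales = np.logspace(0, 3, d)                 # feature scales ranging from 1 to 1000
X = rng.normal(size=(n, d)) * scales

U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * S[:k]) @ Vt[:k]             # thin (rank-k) SVD reconstruction

rest = X - X_k                                # the "rest" term, per data point and feature
print(np.round(rest.std(axis=0), 2))          # large features that are not captured leave a large "rest"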

What we need is a way to transform the features so that they have bounded variance. One way to do this, if the feature is a count, is to take its log or its square root. These are transformations related to the more general Box-Cox power transform, which transforms a random variable so that it looks approximately normally distributed.
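
A minimal sketch with a made-up count feature, using NumPy and SciPy's boxcox:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=10000) + 1       # made-up count feature (kept strictly positive)

log_feature = np.log(counts)                  # log transform for counts
sqrt_feature = np.sqrt(counts)                # square-root transform for counts

# Box-Cox estimates the power that makes the data look most Gaussian.
boxcox_feature, lam = stats.boxcox(counts.astype(float))

print(counts.var(), log_feature.var(), sqrt_feature.var(), round(lam, 2))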

Linear combinations of transformed feature values need to make sense

Suppose that we want to compute the similarity between two text documents. One good similarity measure here is based on BM25:

document P = (w1, w2, w3)
document Q = (w1, w2, w3)
# documents P and Q contain the words w1, w2, w3

similarity = idf(w1)*tfNorm(P, w1)*tfNorm(Q, w1) +
             idf(w2)*tfNorm(P, w2)*tfNorm(Q, w2) +
             idf(w3)*tfNorm(P, w3)*tfNorm(Q, w3)

It’s best to have a similarity measure in mind from which to extract a useful transformation. Here the suggested transformation is:

input: word_count of the word w in a document of length document_length
output: the transformed feature value
idf(w)*tfNorm(word_count, document_length)
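
As an illustration, here is a sketch of that transformation using one common BM25 variant for idf and tfNorm; the parameter values k1=1.2 and b=0.75 are the usual defaults, and the helper names are mine:

import math

def idf(doc_freq, num_docs):
    # Inverse document frequency of a word (a common BM25 variant).
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(word_count, doc_length, avg_doc_length, k1=1.2, b=0.75):
    # BM25 term-frequency normalization: saturates the raw count and
    # discounts long documents.
    return word_count * (k1 + 1.0) / (
        word_count + k1 * (1.0 - b + b * doc_length / avg_doc_length))

def transform(word_count, doc_length, doc_freq, num_docs, avg_doc_length):
    # The suggested feature value for word w in a document.
    return idf(doc_freq, num_docs) * tf_norm(word_count, doc_length, avg_doc_length)

print(transform(word_count=3, doc_length=120,
                doc_freq=50, num_docs=10000, avg_doc_length=200))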

Since the SVD is about computing dot products, the similarity measure itself needs to be representable as a dot product.

Here is one similarity measure that works well on its own but cannot be expressed as a dot product:

P = document with normalized word frequencies (probabilities) (p1, p2, p3)
Q = document with normalized word frequencies (q1, q2, q3)
log(P|Q) = p1*log(q1) + p2*log(q2) + p3*log(q3)
log(Q|P) = q1*log(p1) + q2*log(p2) + q3*log(p3)
similarity = min(log(P|Q), log(Q|P))
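
A small sketch of this measure for two made-up distributions; the min() at the end is exactly what prevents it from being written as a single dot product:

import numpy as np

def log_likelihood(p, q, eps=1e-12):
    # The "log(P|Q)" term above: log-likelihood of P's frequencies under Q.
    return float(np.sum(p * np.log(q + eps)))

def similarity(p, q):
    return min(log_likelihood(p, q), log_likelihood(q, p))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
print(similarity(p, q))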

What are the practical applications of those findings?

First, exclude “strong” or “dominant” features, that is, features whose presence is likely to mask other features in a linear combination.

Second, find a good similarity measure that can be expressed as a dot product. Often the cosine does a good job, and the cosine already includes a normalization of vector length. If you are working with probability distributions, say (p1, p2, p3), the following transformation makes sense: (sqrt(p1), sqrt(p2), sqrt(p3)). With respect to ranking, this is equivalent to the Hellinger distance (see the sketch after this list).

Third, the similarity measure will dictate what kind of transformation makes sense.

Fourth, make sure the transformed features have bounded variance and that their distribution looks Gaussian.
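
Here is a small sketch of the second point, with two made-up distributions: the cosine of the square-rooted probabilities is a monotone function of the Hellinger distance, so both give the same ranking.

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two probability distributions.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# For unit-sum distributions, sqrt(p) and sqrt(q) have norm 1, so
# cosine(sqrt(p), sqrt(q)) = 1 - hellinger(p, q)**2.
print(cosine(np.sqrt(p), np.sqrt(q)), 1.0 - hellinger(p, q) ** 2)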

A large number of sparse features with noise