This is the second blog post in a series of posts introducing the cosine similarity.

In the first blog post, I talked about how to represent a collection of documents (or clicks) as vectors. Once represented as vectors, two documents are defined as similar if the angle between the vectors is small.

The simplest approach, which does not work well, is to set the $w$-th component of the vector equal to the number of times word $w$ appears in the document.

For example, assuming the vocabulary consists of the words (a, b, c, d), the document “d a a c a d” is represented as the vector [3, 0, 1, 2].
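A minimal sketch in Python (the toy vocabulary and document are the ones from this example; the code itself is only illustrative):

# build a count vector over a fixed vocabulary
vocabulary = ["a", "b", "c", "d"]
document = "d a a c a d".split()
count_vector = [document.count(word) for word in vocabulary]
print(count_vector)  # prints [3, 0, 1, 2]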

Representing words as raw counts does not work well, because some words, like “the” and “a”, appear with high counts in most documents. The high counts then just amplify the “noise”; if those counts are made low, the “noise” is suppressed.

Note: the cosine similarity can be used to compare documents, clicks, images and other data. The current transformations only apply to comparing documents and possibly clicks. Documents and clicks are types of data where the original components are raw counts.

IDF (Inverse Document Frequency) Transformation

The first transformation is to discount frequent words in the collection. To do this, one first needs to compute the so-called df (document frequency) statistic.

  • df (document frequency) of a term = number of documents in the collection in which the term appears

DF is a global statistic (collection-level statistic) because it requires the processing of all documents in the collection.

In pseudo-code the computation of DF is organized as:

from collections import defaultdict

document_term_frequencies = defaultdict(int)  # word -> number of documents containing it
for document in collection:
  for word in unique_words(document):  # unique_words(d): the set of distinct words in d
    document_term_frequencies[word] += 1

Then the following statistic, called idf (inverse document frequency), is one of the hallmark achievements of Information Retrieval:

N = total number of documents in the collection
df = document frequency of word w
idf = log[(N - df + 0.5)/(df + 0.5)]
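A minimal sketch of this formula in Python (the function name and example numbers are illustrative, not from the original post):

import math

def idf(df, N):
  # N = number of documents in the collection, df = document frequency of the word
  return math.log((N - df + 0.5) / (df + 0.5))

# example: with N = 10000 documents, a rare word (df = 10) gets a much
# larger weight than a word that appears in half the collection (df = 5000)
print(idf(10, 10000), idf(5000, 10000))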

Sometimes, the idf is formulated a bit differently, but I prefer to use this definition. One reason I like this definition is that it is nicely summarized in the paper “Modern Information Retrieval: A Brief Overview” by Amit Singhal. I’ll take a small digression here. Amit Singhal is Google’s SVP of Search and a student of Gerard Salton. Gerard Salton was one of the founding figures of Information Retrieval; he pioneered the representation of documents as vectors and suggested comparing documents and queries using the cosine. The idf itself comes from the work of Steven Robertson and Karen Sparck Jones. Steven Robertson, again a founding figure of Information Retrieval, works for Microsoft Research, Cambridge, UK. Even though many publications define the idf, the paper mentioned above is probably one of the best introductions to the topic.

An alternative, simpler, but worse-performing definition of IDF is:

#alternative definition of idf (do not use)
N = total number of documents in the collection
df = document frequency of word w
idf = log[N/df]

This definition is easier to justify. First, for a word $w$, $p = \frac{df}{N}$ is the probability that a randomly chosen document from the collection contains the word $w$. Then $idf = -\log{p}$ is the amount of information (measured in bits if the logarithm is base 2) that one gains when told that a randomly chosen document contains the word $w$. If a word is very common, it appears in most documents, so being told that a document contains it teaches us very little.
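For example, in a hypothetical collection of $N = 1000$ documents, a word that appears in $df = 500$ of them has $p = 0.5$ and $idf = -\log_2{0.5} = 1$ bit of information, while a rarer word that appears in only $df = 31$ documents carries $-\log_2(31/1000) \approx 5$ bits.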

I will leave the justification of the more complex formula for another post. Suffice it to say that the idf is related to deciding if a document is relevant for a particular query. The idf is then the log-odds used in statistics to make a decision about the relevance of a document.

TF (Term Frequency) Transformation

The simplest transformation is to divide the count of a word in a document by the document length.

tf = number of times word appears in a document, the term frequency
doclen = total number of words (including duplicates) in a document, the document length
tfnorm = tf/doclen
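For the toy document “d a a c a d” from above (doclen = 6), the count vector [3, 0, 1, 2] becomes [3/6, 0, 1/6, 2/6] = [0.5, 0, 0.17, 0.33] after this normalization.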

Referring again to the work of Steven Robertson and Karen Sparck Jones, one can use the following more advanced term frequency normalization (assuming all documents are of the same length).

tfnorm = tf/(tf + 1)
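To see the saturation effect: a count of 1 gives $1/2 = 0.5$, a count of 2 gives $2/3 \approx 0.67$, and a count of 10 gives $10/11 \approx 0.91$, so each additional occurrence of a word contributes less and less extra weight.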

If you also want to account for the document length, then the following formula is used:

tf = number of times word appears in a document, the term frequency
doclen = total number of words (including duplicates) in a document, the document length
avg_doclen = average of all document lengths in the collection
doclen_correction = 0.75*doclen/avg_doclen + 0.25
tfnorm = tf/(tf + doclen_correction)

You can see that if all documents had the same length, the document length corrected formula would simply be equal to $\frac{tf}{tf + 1}$, which is the simpler version mentioned above.
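A minimal sketch of this normalization in Python (the function name and example numbers are illustrative):

def tf_normalized(tf, doclen, avg_doclen):
  # term-frequency saturation with document-length correction
  doclen_correction = 0.75 * doclen / avg_doclen + 0.25
  return tf / (tf + doclen_correction)

# for a document of average length the correction is 1, recovering tf/(tf + 1)
print(tf_normalized(3, doclen=100, avg_doclen=100))  # prints 0.75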

Combining TF and IDF

To combine the two, just multiply the idf and the tf normalizations. The idf normalization is global, depending on the distribution of individual words across the collection. The tf normalization is local, and its purpose is to dampen the effect of words that appear many times in a single document.
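Putting the pieces together, here is a minimal sketch (all names are illustrative; the vectors are assumed to be aligned with a fixed vocabulary, as in the toy example above) that weights a count vector and then compares two weighted vectors with the cosine:

import math

def tfidf_vector(counts, doclen, avg_doclen, dfs, N):
  # counts: raw word counts for one document, aligned with the vocabulary
  # dfs: document frequencies per vocabulary word; N: number of documents
  weights = []
  for tf, df in zip(counts, dfs):
    doclen_correction = 0.75 * doclen / avg_doclen + 0.25
    tf_part = tf / (tf + doclen_correction)
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    weights.append(tf_part * idf_part)
  return weights

def cosine(u, v):
  dot = sum(a * b for a, b in zip(u, v))
  norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
  return dot / norm if norm > 0 else 0.0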

Application to Clicks

The tf and idf transformations also make sense when comparing customer profiles consisting of clicks. A customer profile serves the same function as a document, while a click on a link is analogous to the presence of a word in a document. It makes sense to discount pages that are very common (for example a start page) since those pages are not informative about subsequent behavior. Such discounting can be achieved via idf.
