One application of cosine similarity is answering queries of the form “More documents like this…”. This is the use case I will consider in this blog post.

Cosine similarity can also be used in a search engine to match queries to documents. I am not going to discuss this use case here.

When we try to use cosine similarity for the task “More documents like this…”, we run into a serious problem: how do we find the documents most similar to a given document without resorting to brute force? The brute-force algorithm is given below:

input: a document
scored_documents = []
for each other_document in collection:
  sim = cosine(input, other_document)
  scored_documents.append((other_document, sim))
sort scored_documents by sim in descending order and report the top K
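To make this concrete, here is a minimal Python version of the brute-force search. The toy three-document collection mirrors the example used later in this post; all function names are my own:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bags of words (word -> count)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k_brute_force(query, collection, k):
    """Score the query against every document and keep the top K."""
    scored = [(doc_id, cosine(query, doc)) for doc_id, doc in collection.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

collection = {
    "doc1": Counter(["a", "c"]),
    "doc2": Counter(["a", "b", "d"]),
    "doc3": Counter(["b", "d"]),
}
# The query document itself scores 1.0 and comes out on top.
print(top_k_brute_force(collection["doc1"], collection, 2))
```

Note that this touches every document in the collection, which is exactly the cost we want to avoid.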

The problem with brute force is that it is simply too slow. There are a number of approaches that make the computation of the top-K most similar documents faster. One approach is to use an inverted index (i.e. a text search engine). Others are special nearest-neighbor structures and algorithms (such as LSH, random projections, KGraph). Further helpful approaches go by keywords such as dimensionality reduction, clustering, and sketching.

The inverted index approach

The inverted index approach is best described through an example. Suppose you have 3 documents (doc1, doc2, doc3) with a vocabulary of four words (a, b, c, d).

A sample collection of documents with words:
doc1 = [a, c]
doc2 = [a, b, d]
doc3 = [b, d]

The inverted index is simply an organization of the data by word. The original data was organized by document.

Inverted index:
word a = [doc1, doc2]
word b = [doc2, doc3]
word c = [doc1]
word d = [doc2, doc3]

If you want to build a proper search engine there are many more concerns, such as storing the document IDs efficiently using compression, dynamic updates, APIs, and so on.

But the index structure that search engines use is pretty simple at the logical level.
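A minimal sketch of building such an index in Python (function and variable names are my own):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in index.items()}

docs = {
    "doc1": ["a", "c"],
    "doc2": ["a", "b", "d"],
    "doc3": ["b", "d"],
}
index = build_inverted_index(docs)
# index maps: a -> [doc1, doc2], b -> [doc2, doc3],
#             c -> [doc1],       d -> [doc2, doc3]
```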

How does this structure help us with search? Suppose I need to find the documents that are similar to document 1. With brute force, I would need to examine documents 2 and 3.

brute force approach:
query: doc1
touched documents: doc1, doc2 and doc3

The inverted index approach instead looks up the per-word lists (called postings lists in the search engine literature). For document 1 those are the lists for words a and c. Then we take the union of the documents across those lists.

inverted index approach
query: doc1
doc1 contains words a and c
list a = [doc1, doc2]
list c = [doc1]

the union of all documents = [doc1, doc2]

In this particular example, there is a saving, since doc3 is never looked at.
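In code, the lookup is a union over the postings of the query document's words. This is a sketch; the index literal below corresponds to the three-document example:

```python
def candidate_documents(query_words, index):
    """Union of the postings lists for every word in the query document."""
    result = set()
    for word in query_words:
        result |= set(index.get(word, []))
    return sorted(result)

index = {
    "a": ["doc1", "doc2"],
    "b": ["doc2", "doc3"],
    "c": ["doc1"],
    "d": ["doc2", "doc3"],
}
# Querying with doc1's words never touches doc3's postings.
print(candidate_documents(["a", "c"], index))  # ['doc1', 'doc2']
```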

The bottlenecks in the inverted index

Typical search engine queries contain only a few words, which means that only a few postings lists need to be examined. If the collection is big, those lists will be very long. However, search engines can deal with this because a query typically asks for an intersection of postings, in which case a large portion of each inverted list can be skipped (search for “skip lists Lucene” to find out how).

For the use case discussed here, we cannot really use intersection, since the query contains too many words (on the order of hundreds to thousands). The intersection of a hundred words will be empty.

One can see that a typical inverted index alone will not do. The problems we encountered are the following:

  • query contains too many words
  • posting lists are too big

Solving the bottlenecks

If we manage to reduce the size of each document to, say, 50 words, then we are making progress. This makes the inputs to the inverted index small. At the same time, it also reduces the number of documents in the postings, which was our second goal.

We can make even more progress towards the second goal. Notice that some words have very long postings (e.g. “the”, “a”, “to”), while others have short postings. The words with long postings are not informative; such words are informative only as part of a phrase (e.g. “to” is not informative, but “to be or not to be” is). Following this observation, we should remove words with long postings. This reduces the search time while keeping the quality of the similarity under control.
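A sketch of this pruning rule, dropping any word whose document frequency exceeds a fixed fraction of the collection. The `max_df` threshold of 0.4 is an arbitrary choice to illustrate the mechanics on the toy index; a real collection would use a far less aggressive cutoff:

```python
def prune_index(index, num_docs, max_df):
    """Drop words whose postings cover more than max_df of the collection."""
    return {word: postings
            for word, postings in index.items()
            if len(postings) / num_docs <= max_df}

index = {
    "a": ["doc1", "doc2"],
    "b": ["doc2", "doc3"],
    "c": ["doc1"],
    "d": ["doc2", "doc3"],
}
# With 3 documents and max_df=0.4, words appearing in 2 of 3 documents
# (a, b, d) are dropped; only the rare word "c" survives.
print(prune_index(index, num_docs=3, max_df=0.4))  # {'c': ['doc1']}
```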

Methods for reducing the size of the documents

I will put the methods under a few categories:

  • For each document independently find and keep top K informative words;
  • Use min sketch to select what is called a “conditional” random sample of words for each document;
  • Use some combination of the above two;
  • Compute the postings lists and sample them.

Methods for finding informative words

Tf-IDF based methods

The simplest method is to sort the words of a document by tf-idf and take the top 50. The intuition is that one wants words that are both common in the document and rare in the collection. Typically such an ordering does surface informative words, but words like “the” still appear lower in the sorted list. Therefore, we may additionally want a “hard” rule that excludes any word appearing in more than a fixed fraction of all documents in the collection. Words that are too rare may also be poor candidates for inclusion, because they will match only a few documents.
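A minimal sketch of this selection, combining the tf-idf ranking with the hard document-frequency cutoff. The smoothing-free idf formula and the parameter names are my assumptions:

```python
import math
from collections import Counter

def top_k_by_tfidf(docs, k, max_df):
    """For each document, keep the k words with the highest tf-idf,
    skipping words that appear in more than max_df of all documents."""
    n = len(docs)
    df = Counter(word for words in docs.values() for word in set(words))
    reduced = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        scored = [(word, count * math.log(n / df[word]))
                  for word, count in tf.items()
                  if df[word] / n <= max_df]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        reduced[doc_id] = [word for word, _ in scored[:k]]
    return reduced

docs = {"doc1": ["a", "a", "b"], "doc2": ["a", "c"]}
# "a" appears in every document, so the hard rule excludes it.
print(top_k_by_tfidf(docs, k=1, max_df=0.5))  # {'doc1': ['b'], 'doc2': ['c']}
```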

Word-Graph based methods

The tf-idf method uses local and global statistics, but no correlations between the words themselves. One method that does is to first create a graph of words and then assign weights to each word (see “Graph-of-word and TW-IDF: New Approach to Ad Hoc IR”, François Rousseau, Michalis Vazirgiannis). However, there is no mainstream approach in the Information Retrieval literature for creating such word graphs.

What is the problem with those two methods? It can be seen most clearly when we set the number of selected words to 1. Suppose we have two documents doc1 = (“a”:1.1, “b”:1.12) and doc2 = (“a”:1.13, “b”:1.11). This is shorthand for saying that doc1 contains two words “a” and “b”, and that the tf-idf of word “a” in doc1 is 1.1. If we are to select a single word from each document, we select word “b” for doc1 and word “a” for doc2. As can be seen, the two selections do not overlap. The problem is that the words selected for doc1 are not guaranteed to be selected for doc2, so we find no overlap even though it exists.

Conditional random sampling (Sketching)

Conditional random sampling (CRS) works like this. First, before we start indexing, we create a (global) random permutation of all unique words. For example, if the vocabulary is “a”, “b”, “c”, “d”, “e”, “f”, a random permutation could be the following (“d”, “a”, “f”, “c”, “b”, “e”). In this random permutation each word is mapped to a position: (“d”:0, “a”:1, “f”:2, “c”:3, “b”:4, “e”:5).

Suppose we have the following documents and we want to select only 3 words from them:

doc1 = ("a", "b", "a", "e", "f")
doc2 = ("b", "e", "d", "f")
doc3 = ("b")
doc4 = ("f", "c")

Let’s take doc1. First we find all unique words, ignoring their counts and tf-idf weights. Those are (“a”, “b”, “e”, “f”). Then we sort this list according to the position of each word in the global random permutation. The positions are: (“a”:1, “b”:4, “e”:5, “f”:2). Sorting by position gives the list: (“a”, “f”, “b”, “e”). We truncate this list to 3 words.

The selected words for the complete example are:

doc1 = ("a", "f", "b")
doc2 = ("d", "f", "b")
doc3 = ("b")
doc4 = ("f", "c")

This method ignores the tf-idf weights; it would be best suited to the case where all weights are equal to 1. In the literature, this method has been applied to compute similarity between words, not documents.
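The whole procedure fits in a few lines of Python. To keep the sketch deterministic, the permutation is passed in explicitly; it reproduces the example above:

```python
def crs_sample(docs, permutation, k):
    """Keep, for each document, the k unique words that appear earliest
    in the (global) random permutation of the vocabulary."""
    position = {word: i for i, word in enumerate(permutation)}
    return {doc_id: sorted(set(words), key=position.get)[:k]
            for doc_id, words in docs.items()}

docs = {
    "doc1": ["a", "b", "a", "e", "f"],
    "doc2": ["b", "e", "d", "f"],
    "doc3": ["b"],
    "doc4": ["f", "c"],
}
sample = crs_sample(docs, ["d", "a", "f", "c", "b", "e"], k=3)
# {'doc1': ['a', 'f', 'b'], 'doc2': ['d', 'f', 'b'],
#  'doc3': ['b'], 'doc4': ['f', 'c']}
```

In practice the permutation would be generated once (e.g. with `random.shuffle` over the vocabulary) and shared across all documents.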

Combined methods (TF-IDF weighed and sampling)

One can imagine a way to combine the benefits of the tf-idf method and conditional random sampling. Typically such a combination is achieved through stratified sampling: first sort the words in a document by weight, then group them into buckets, and apply conditional random sampling within each bucket. Buckets with higher weights should be sampled more heavily.
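Since there is no established recipe for this combination, the following is purely a sketch of one possible implementation; the bucket thresholds, per-bucket quotas, and all names are my own assumptions:

```python
def stratified_crs(weighted_doc, permutation, quotas):
    """weighted_doc: {word: tf-idf weight}.
    quotas: list of (weight_threshold, k) pairs, highest threshold first.
    Each bucket of words at or above its threshold is sampled with CRS;
    higher-weight buckets should be given a larger k."""
    position = {word: i for i, word in enumerate(permutation)}
    selected = []
    remaining = dict(weighted_doc)
    for threshold, k in quotas:
        bucket = [w for w, wt in remaining.items() if wt >= threshold]
        # CRS within the bucket: keep the k earliest words in the permutation.
        selected.extend(sorted(bucket, key=position.get)[:k])
        for w in bucket:
            del remaining[w]
    return selected

weights = {"a": 3.0, "b": 2.5, "c": 1.0, "d": 0.5}
# One word from the high-weight bucket (a, b), one from the rest (c, d).
print(stratified_crs(weights, ["d", "a", "c", "b"], [(2.0, 1), (0.0, 1)]))
# ['a', 'd']
```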

There is no literature on this approach, but I’m listing it as a possibility.

Twitter’s Method for Cosine Similarity: DISCO

One can also imagine the following method. First, compute the postings for each word. Then, within each posting, sort the documents by tf-idf weight and truncate the posting to contain only the top-scoring documents.

There is no such method in the literature, but the DISCO method, recently introduced by Twitter, works along those lines. DISCO stands for Dimension Independent Similarity Computation. The method is roughly as follows:

DISCO method:
organize the data by postings, i.e. tuples of (word -> document list)
choose an oversampling parameter f = p/eps  # see the paper
for each (word, posting):
  for each pair of documents (d1, d2) in posting:
    compute the tf-idf weight (or some other weight)
    keep the pair with probability f * weight

Only the selected triples of document pairs and words (d1, d2, word) are used for the similarity computation. At the next stage, one can organize the data by document, giving entries like this:

doc1: [(word, other_doc)]

For doc1, some document, for example doc56, could appear m times with different words. We then report $\frac{m}{f \cdot weight}$ as the cosine similarity. As one can see, this scheme is an application of weighted sampling.
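The sampling and counting stages can be sketched as follows. This is a loose illustration of the idea, not the exact DISCO algorithm from the paper; the uniform weight function and the value of `f` below are placeholders:

```python
import random
from collections import Counter
from itertools import combinations

def sample_pairs(index, weight, f, seed=0):
    """Emit (d1, d2, word) triples, keeping each with probability
    min(1, f * weight(word)). A loose sketch of the sampling stage."""
    rng = random.Random(seed)
    kept = []
    for word, postings in index.items():
        for d1, d2 in combinations(postings, 2):
            if rng.random() < min(1.0, f * weight(word)):
                kept.append((d1, d2, word))
    return kept

def co_occurrence_counts(doc_id, kept):
    """Count how often each other document was paired with doc_id (the m
    in the similarity estimate above)."""
    counts = Counter()
    for d1, d2, _ in kept:
        if d1 == doc_id:
            counts[d2] += 1
        elif d2 == doc_id:
            counts[d1] += 1
    return counts

index = {"a": ["doc1", "doc2"], "b": ["doc2", "doc3"], "d": ["doc2", "doc3"]}
# With f large enough, every pair is kept deterministically.
kept = sample_pairs(index, weight=lambda w: 1.0, f=10.0)
print(co_occurrence_counts("doc2", kept))  # Counter({'doc3': 2, 'doc1': 1})
```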

Next: Cosine Similarity Shootout

In the next blog post, I will evaluate how these methods perform in conjunction with different weighting schemes.