In this blog post, I will evaluate a number of toolkits for nearest neighbor search on the MNIST dataset available from Kaggle.

A number of these tools were already evaluated in a wonderful blog post by RaRe Technologies, the document similarity shootout. (RaRe Technologies are known for the gensim similarity toolkit.) That blog post used a large dataset consisting of all the documents from Wikipedia, and its goal was to find nearest neighbor software suitable for production deployments.

My goal is different. I want to understand the role of the indexing algorithm in the nearest neighbor task, as well as the role of a number of tuning parameters. In that regard, this blog post might be useful to anyone who deploys spotify/annoy. The MNIST dataset is tiny compared to Wikipedia, but this makes it possible to compare the toolkits under more settings.

Here is a list of the tools I will be comparing:

Software

Introduction

This blog post is related to the talk at Berlin Buzzwords 2015. Consult the slides (to appear) for additional context.

spotify/annoy

This tool builds a number of random trees and differentiates between training (index-building) time and query time. The core implementation of the algorithm is in C++, but the client code is in Python. Item ids must be contiguous, from $0$ to $n - 1$ inclusive, where $n$ is the total number of items added to the index.

This is how to build/rebuild this tool:

python setup.py build --force #use force for rebuilding after modification
python setup.py install --force

The following snippet gives the code for training the index:

from annoy import AnnoyIndex

num_features = 784 #for this problem (MNIST pixels)
num_trees = 10     #number of random trees to build

t = AnnoyIndex(num_features, 'angular') #or 'euclidean'

f = open('training file', 'r')
for i, line in enumerate(f):
  vector = parse_line(line) #returns just a python list of floats
  t.add_item(i, vector)
f.close()

t.build(num_trees)

The code for querying is below:

#num_neighbors controls how many candidates are examined.
#Even if you only want the top 2 neighbors, you might want to set it to 800
#(for 784 features); at a minimum, set it to the number of features in the
#dataset so that "full leaves" are explored.
num_neighbors = 800

#vectors holds the parsed query vectors; num_points = len(vectors)
results = open('results file', 'w')
for i in range(num_points):
  #items = t.get_nns_by_item(i, num_neighbors)
  items = t.get_nns_by_vector(vectors[i], num_neighbors)
  results.write(" ".join(map(str, items)) + "\n") #write the neighbor ids to a file
results.close()

At query time, a new point is run down each tree until it reaches a leaf. Points from the training dataset are stored at the leaves at training time. We collect the points from each reached leaf into a set of candidates, and each candidate is then scored with the exact distance (euclidean or angular, depending on the index type).
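To make the re-scoring step concrete, here is a minimal NumPy sketch of how a candidate set gathered from the leaves could be re-scored with the exact euclidean distance; dataset, candidate_ids and query are hypothetical names, not part of the annoy API:

import numpy as np

def rescore_candidates(query, candidate_ids, dataset, top_n):
  #Exact euclidean distance between the query and every candidate.
  candidates = dataset[candidate_ids]
  distances = np.linalg.norm(candidates - query, axis=1)
  order = np.argsort(distances)[:top_n] #keep the closest top_n
  return [candidate_ids[j] for j in order]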

How does the indexing algorithm work?

The algorithm builds each tree by recursively splitting the dataset with a random hyperplane. Pseudocode and parameter defaults follow.

input: dataset
k: parameter, maximum number of points in a leaf of the tree
   (by default k = number of features)
n: parameter, number of trees
m: parameter, maximum number of evaluated points
   (by default m = number of points requested * number of trees;
    more points than this could still be evaluated, see below)

def build_tree(dataset):
  if number of points in dataset <= k:
    return leaf with points from the dataset
  else:
    pick a random hyperplane (different code for cosine or euclidean distance)
    split the data into two parts
    the index type (angular, euclidean) determines how the split is made
      (there is a special case if one cannot split)
    call build_tree recursively on each part
    return a tree consisting of the trees from both recursive calls
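The pseudocode translates almost directly into Python. Below is a simplified sketch of the recursive splitting for the angular case (a random hyperplane through the origin); it illustrates the scheme, and is not annoy's actual code:

import numpy as np

def build_tree(ids, data, k, rng):
  #ids: numpy array of row indices into data; k: maximum number of points per leaf
  if len(ids) <= k:
    return {'leaf': ids} #leaf node: store the point ids
  normal = rng.standard_normal(data.shape[1]) #random hyperplane through the origin (angular case)
  side = data[ids] @ normal > 0 #which side of the hyperplane each point falls on
  left, right = ids[~side], ids[side]
  if len(left) == 0 or len(right) == 0: #degenerate split: give up and make a leaf
    return {'leaf': ids}
  return {'normal': normal,
          'left': build_tree(left, data, k, rng),
          'right': build_tree(right, data, k, rng)}

#A forest of n trees is just n independent calls:
#forest = [build_tree(np.arange(len(data)), data, k=100, rng=np.random.default_rng(seed))
#          for seed in range(n)]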

Parameters

It is important to know what the parameters are since the quality of the nearest neighbors depends strongly on them.

Distance: Cosine vs Euclidean distance

The first choice is the type of distance function: cosine (angular) or euclidean. This is controlled by the metric chosen when the index is created, and the two varieties have to be tested separately. In general, if cosine works for your problem, choose cosine because it is cheaper and more reliable. When cosine is used, it is preferable to normalize the vectors of the dataset before adding them to the index.
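For the angular index, the normalization is a few lines inside the training loop shown earlier (parse_line is the same hypothetical helper; this is just a sketch, not something annoy requires):

import numpy as np

vector = np.asarray(parse_line(line), dtype=np.float32)
norm = np.linalg.norm(vector)
if norm > 0:
  vector = vector / norm #unit length, so the angular distance behaves like cosine similarity
t.add_item(i, vector.tolist())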

Why is cosine cheaper? In the spotify/annoy implementation, when euclidean distance is used the split code attempts multiple projection lines and chooses the one that results in the widest projection. This means more computation for euclidean distance than for cosine, where just one projection is attempted.
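As a toy illustration of that idea (a sketch of the strategy described above, not annoy's actual split code), one could sample a few candidate projection directions and keep the one that spreads the projected points the widest:

import numpy as np

def widest_projection(points, num_attempts, rng):
  #Among a few random directions, keep the one that spreads the points the most.
  best_direction, best_spread = None, -1.0
  for _ in range(num_attempts):
    direction = rng.standard_normal(points.shape[1])
    direction /= np.linalg.norm(direction)
    projected = points @ direction
    spread = projected.max() - projected.min() #width of the projection onto this line
    if spread > best_spread:
      best_direction, best_spread = direction, spread
  return best_direction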

Why is cosine more reliable? With cosine, the points effectively lie on a high-dimensional sphere (this is why I suggested the normalization). With euclidean distance, the points can lie anywhere, so the search space is smaller for cosine.

Number of points stored at the leaves (k)

The second parameter is the number of points stored at the leaves of the tree. Spotify/Annoy fixes this number in the source code to be (roughly) the number of features used. You can change this number and rebuild the code. The parameter is located in:

  // file: annoylib.h

  AnnoyIndex(int f) : _random() {
    _f = f; //number of features
    ...
    _K = (sizeof(T) * f + sizeof(S) * 2) / sizeof(S);
    //it is suggested to set this parameter to at most 100
  }
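To see what the default works out to, assume (as an illustration, since T and S are template parameters) that T is a 4-byte float and S a 4-byte integer id. For 784 features the default leaf capacity is then:

sizeof_T, sizeof_S, f = 4, 4, 784 #assumed sizes of T (float) and S (int32), and the MNIST feature count
K = (sizeof_T * f + sizeof_S * 2) // sizeof_S
print(K) #786 points per leaf by default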

Number of nearest neighbors required

The third parameter is the number of nearest neighbors required. You may want just ten neighbors, but to increase the chance of actually getting the closest ones you should set this parameter to at least $f + 1$, where $f$ is the number of features. This will of course slow down the computation. To counteract that increase, I suggest also modifying the code for the parameter $k$.
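In code, the $f + 1$ rule simply means over-requesting neighbors at query time and keeping only the ones you need; query_vector below is a hypothetical parsed query vector and t is the index built earlier:

wanted = 10                          #neighbors you actually need
requested = num_features + 1         #over-request to improve the chance of getting the true neighbors
items = t.get_nns_by_vector(query_vector, requested)
top_items = items[:wanted]           #annoy returns neighbors ordered from closest to farthest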

The role of the parameters in selecting candidates

It is important to know that the search for nearest neighbors consists of two phases. The first phase selects good candidate neighbors; in the second, each candidate neighbor is re-scored with the exact distance. In practice, you should not examine more than about 1000 candidates per query, because the system will get very slow.

Illustration of how the candidate nearest neighbors are selected

Let us suppose that we want only one tree, that we have set the number of points per leaf to 100, and that we want the top 2 nearest neighbors. Often the query is itself part of the dataset, so in order to exclude it we need to ask for one extra neighbor. During the search, the tree is walked down until a leaf is reached. Now there are two possibilities:

  • The leaf contains a single point.
  • The leaf contains more than one point.

If the leaf contains a single point, then, because we requested two points, another leaf will be explored. Later on I will explain how this leaf is selected; most likely it will be a neighboring leaf. Since the second leaf contains at least one point and we requested two, the candidate search then finishes.

How are leaves explored? We reach a leaf by following a series of splits. Let us take cosine as the distance function. Each split is a decision whether to go left or right. In geometric terms, this means there are two vectors (a and b) pointing in opposite directions; the query is compared to each of them, and the vector with the smaller angle to the query is selected. Before the first leaf is reached the decision is straightforward: we just walk down the tree following the best path. If we want to try another leaf, we look at all possibilities (splits on the paths examined so far) and choose the best one, namely the split vector whose cosine with the query is least negative. This means that when a leaf that is not sufficiently full is reached, we can jump anywhere in the tree, for example back to the root. I believe it is possible to traverse the tree multiple times from the top levels, although this might not be very likely.
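The "choose the best among all splits seen so far" behaviour is essentially a best-first search over pending splits across all trees. Here is a simplified sketch of that strategy in plain Python with heapq, reusing the toy tree structure from the build sketch earlier; it illustrates the idea rather than annoy's actual C++ code:

import heapq
import numpy as np

def collect_candidates(query, forest, num_candidates):
  #Max-heap of pending nodes, keyed by how promising they are.
  #heapq is a min-heap, so priorities are negated; id(node) breaks ties.
  heap = [(-np.inf, id(root), root) for root in forest]
  candidates = []
  while heap and len(candidates) < num_candidates:
    neg_priority, _, node = heapq.heappop(heap)
    priority = -neg_priority
    if 'leaf' in node:
      candidates.extend(node['leaf']) #duplicates across trees would be removed before re-scoring
      continue
    score = float(node['normal'] @ query) #signed distance to the split hyperplane
    near, far = (node['right'], node['left']) if score > 0 else (node['left'], node['right'])
    heapq.heappush(heap, (-min(priority, abs(score)), id(near), near))
    heapq.heappush(heap, (-min(priority, -abs(score)), id(far), far))
  return candidates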

What happens when there are multiple trees? Under the spotify/annoy algorithm, what matters is the angle to a number of already examined random vectors; it does not matter whether those random vectors come from different trees or at which depth of a tree they occur. However, deeper in the tree we expect tighter matches. In effect, the query is compared to multiple random vectors (at the splits) until a number of small areas around the query point are found.

In the end, the quality of the neighbors depends on how many “relatively full” leaves of the tree are examined. The same number of examined candidate neighbors can be reached in a number of ways:

  • examine more trees, each of which has smaller leaves (fewer points per leaf). This is a trade-off with memory usage.
  • specify more candidate neighbors at query time. This increases the pool of candidates and the running time.

Evaluation of spotify/annoy on Kaggle MNIST dataset under various parameters

The table below shows, for the MNIST dataset, what accuracy is reached and how many data points are examined under various parameter settings.

settings                      | indexing + searching whole dataset (time) | accuracy (query label agrees with NN label), % | candidates examined per query point
nn = 2,   k = 100,     t = 1  | 0m38.319s | 84.99 | from 2 to 100
nn = 100, k = 100,     t = 1  | 0m45.743s | 91.11 | from 100 to 200
nn = 100, k = 10,      t = 1  | 1m24.128s | 93.11 | from 100 to 110
nn = 100, k = 10,      t = 2  | 1m43.829s | 95.41 | from 100 to 110
nn = 100, k = 10,      t = 5  | 2m45.949s | 96.69 | from 500 to 510
nn = 100, k = 10,      t = 10 | 3m18.547s | 97.11 | from 1000 to 1010
nn = 2,   k = default, t = 10 | 2m7.822s  | 94.10 | from 2 to 784 (to be verified)
nn = 100, k = default, t = 10 | 4m29.444s | 96.13 | to be found

(nn = number of neighbors requested, k = points per leaf, t = number of trees)

The running times include the time to read the data, create the index, select the candidates, and re-score the candidates with the exact distance (the dot product in the angular case). The total time is dominated by the candidate re-scoring.

I would emphasize the following observations:

  • A single tree can achieve a very high accuracy of 93.11%.
  • Trees with smaller leaves perform better.
  • With 10 trees, near-optimal performance is reached at a reasonable cost (optimal performance is around 98%).
  • The defaults were not optimal for this problem; the same accuracy could be achieved more cheaply.