In this short blog post I describe a recipe for fast and accurate nearest neighbor search with random projections. This recipe is a result of of applying random projections to the MNIST image dataset with the spotify/annoy tool and my tool (random-trees)
Here is the recipe:
Input: dataset (n by m matrix where n datapoints(rows) m features(columns)) Step 1: Apply svd to the original dataset decor_dataset = svd(dataset) # decorrelated dataset this is an n by k matrix, k dimensions after reduction) Step 2: Build randomized trees for i until num trees do random_matrix = generate a hadamard matrix of dimension k by s do not choose s to be more than 4*log(n) because it's not necessary and you'll save computation randomized_dataset = decor_dataset %*% random_matrix at each tree node simply choose a single feature (the features are orthogonal to each other) Step 3: Run nearest neighbor for each point in the dataset and store the top 10 neighbors in the index
On very sparse data random projections will not perform well. This is known. So it is better to make the data denser. Use SVD for this. (One can apply random projections on the sparse data to make it dense. However, it is known that random projections will need more dimensions than the SVD. Fewer dimensions will make the search easier. That’s why I recommend the SVD.)
Even when the data is dense, it is recommended to apply the SVD. Why is that? Because the efficiency of random projections drops when it has to deal with correlated features.
It is known how to make the SVD fast and scalable and even how to make it work as an online algorithm, so SVD is not really a bottleneck.
On building randomized trees
There are many ways to build randomized trees. But after some experimentation the following properties emerged:
- The trees should be fast during search.
- The trees should not be too weak (see later what weak means)
If you build trees with standard random projections, then a computation at a non-leaf node (to choose whether to go to the left or right child) will take time proportional to the number of features (e.g. 100). This is simply too much when just splitting on one single feature will suffice. So, it is essential to have run the SVD to sphere the data. The multiplication of the data by a random matrix before building the tree is simply a random rotation of the dataset. This is necessary to guarantee that different trees will explore the space of the points in different ways.
On point 2. What are good trees? It means that if two points are really nearest neighbors, then they should end-up together (in the same leaf) in many trees. For high-dimensional and very noisy data, this might not happen.
Point 2 simply means: if we explore the space in two different (random ways), we should end-up at the same conclusion “sufficiently” often.
On Hadamard matrix
This matrix is a special matrix whose columns are orthogonal. Simply, it is a number of random projection column vectors stacked next to each other. However, this matrix has a special structure that allows a computation of a vector of size $k$ by this matrix (which itself has size k x k) to take $k \times log(k)$ steps and not $k \times k$ steps. This is a huge saving.
You don’t need to generate the Hadamard matrix to multiply by it. Use the following recursive algorithm.
Multiply by Hadamard input: a vector of numbers multiply the input by random numbers (gaussian or bernoulli) call loop(input) output: the same vector under random rotation def loop(input): (a,b) = split(input) # split at length(input)/2 result_a = loop(a) result_b = loop(b) if length(a) == length(b): # case 1 create a vector by appending one after each other return append( (result_a + result_b), (result_a - result_b)) else # case 2 # one vector is bigger (assume it's the vector a) # it must be bigger by one position only let result_a' = result_a[0:length(result_a) - 1] let last_a = result[length(result_a) - 1] result_a' and result_b are of the same length. handle like case 1 let rand_b = result_b[random_pos] let random_sign = +1 or -1 return append( (result_a' + result_b), [last_a + random_sign*rand_b], (result_a' - result_b) )
On Step 3
Nearest neighbor search is an approximate algorithm. In order to improve it the following heuristic might be useful:
The neighbor of a neighbor is likely a neighbor.
At query time do the following:
Query time: - compute the approximate neighbors using the index (from steps 1 and 2) - get the neighbors of the neighbors (that's why we need them precomputed) - some of those might be a better neighbor because of the approximation
There is actually a method for nearest neighbor search called K-graph that works entirely on this principle.