This page gives an annotated list of references on random projections, organized by type of projection and application area. The literature on random projections is vast, spanning at least 20 years of work across scientific communities such as Statistics, Algorithms, Data Mining, Artificial Intelligence, and Information Retrieval. This list is just the tip of the iceberg.

Note: curating this list is a work in progress.

## Overview

• Sanjoy Dasgupta, Experiments with Random Projection. Dasgupta is an authority on machine learning and algorithms, and this is one of his early papers on random projections. The paper is highly readable (as scientific papers go) and helps the reader build intuition about random projections. One interesting claim is that random projections preserve clustering structure, while PCA can merge some of the clusters.

• Samuel Kaski, Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. I’m listing this publication because it is an easy read on the properties of random projections. For example, the paper shows that when vectors are projected onto a random subspace, dot products are approximately preserved: if A is the data and R is the random projection matrix, then (AR)(AR)’ = ARR’A’ ≈ AA’, because R is nearly orthogonal (RR’ ≈ I).
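
A minimal numpy sketch of this dot-product preservation (the dimensions and the Gaussian choice of R are my own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k = 4, 4000, 1000  # n vectors in d dimensions, projected down to k
A = rng.normal(size=(n, d))

# R has i.i.d. N(0, 1/k) entries, so E[R R'] = I_d, and the Gram matrix of
# the projected data, (A R)(A R)' = A R R' A', approximates A A'.
R = rng.normal(scale=1.0 / np.sqrt(k), size=(d, k))
G_exact = A @ A.T
G_proj = (A @ R) @ (A @ R).T

# Error per entry, normalized by the vector lengths; shrinks like 1/sqrt(k).
norms = np.linalg.norm(A, axis=1)
rel_err = np.max(np.abs(G_proj - G_exact) / np.outer(norms, norms))
print(rel_err)
```

Increasing k tightens the approximation at the cost of a larger projected dimension.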

## Proofs of Johnson-Lindenstrauss theorem

• Sanjoy Dasgupta and Anupam Gupta, An Elementary Proof of a Theorem of Johnson and Lindenstrauss. The Johnson-Lindenstrauss theorem is cited in almost every paper on random projections. Dasgupta and Gupta give a proof of this theorem using elementary methods. The paper is useful for understanding why the theorem is true.

## Sparse Transforms

• Dimitris Achlioptas, Database-friendly Random Projections: Johnson-Lindenstrauss with Binary Coins. This is a highly cited paper in the area. It was one of the first to suggest replacing the Gaussian random variables with +/-1 random variables (so-called binary coins), and it also suggested using a sparse matrix rather than the usual dense one. In practice both optimizations lead to algorithmic speed-ups. Written in a friendly style, this paper makes for a pleasant read.
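
A small numpy sketch of the sparse variant from the paper (entries +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6; the dimensions here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

d, k = 4000, 800
# Achlioptas' database-friendly matrix: two thirds of the entries are zero,
# and no Gaussian sampling is needed; the 1/sqrt(k) scaling makes the
# projection approximately distance-preserving.
entries = rng.choice([np.sqrt(3.0), 0.0, -np.sqrt(3.0)],
                     size=(d, k), p=[1/6, 2/3, 1/6])
R = entries / np.sqrt(k)

x = rng.normal(size=d)
y = rng.normal(size=d)
dist_before = np.linalg.norm(x - y)
dist_after = np.linalg.norm(x @ R - y @ R)
print(dist_after / dist_before)  # close to 1, as Johnson-Lindenstrauss promises
```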

• Ping Li, Trevor J. Hastie, Kenneth W. Church, Very Sparse Random Projections. This paper goes beyond the sparse projection of Achlioptas (see previous reference), in the sense that it suggests an even sparser projection matrix. This makes applying random projections even faster.

• [Low Rank Approximation and Regression in Input Sparsity Time](http://arxiv.org/abs/1207.6365) by Kenneth L. Clarkson and David P. Woodruff. (Ravi Kannan mentions this method in one of his lectures as a way to get a cheap SVD; the description is quite difficult to follow.)

## Preprocessing of Data before applying Random Projections

Roughly, follow this checklist to see if random projections will work well:

• Are all dimensions equally important? “Pairwise vector distances are meaningful only when all dimensions of the data are more or less equally important” (Ping Li, Trevor J. Hastie, Kenneth W. Church, Very Sparse Random Projections).

• Is the data heavy-tailed? Pairwise distances are dominated by outliers. The suggestion is to apply a Box-Cox transform; a simple rule of thumb (and an instance of the Box-Cox family) is to take the square root of the features.
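
A toy illustration of the square-root rule of thumb (the data here is made up): the outlier coordinate contributes less of the squared distance after the transform.

```python
import numpy as np

# Heavy-tailed term counts: one outlier dimension dominates the
# pairwise Euclidean distance before the transform.
x = np.array([1.0, 2.0, 3.0, 1000.0])
y = np.array([2.0, 1.0, 4.0, 1.0])

d_raw = np.linalg.norm(x - y)
d_sqrt = np.linalg.norm(np.sqrt(x) - np.sqrt(y))

# Share of the squared distance contributed by the outlier coordinate;
# the square root shrinks it.
share_raw = (x[3] - y[3]) ** 2 / d_raw ** 2
share_sqrt = (np.sqrt(x[3]) - np.sqrt(y[3])) ** 2 / d_sqrt ** 2
print(share_raw, share_sqrt)
```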

• Are there correlated features? For example, a few groups where the features within each group are highly correlated will break random projections. The suggestion here is to average the correlated features before applying a random projection. This advice is typically given for regression, but it is valid in the context of random projections as well. One can also “orthogonalize” the data via SVD before applying random projections. (Random projections are still useful for search, although SVD achieves dimensionality reduction by itself.)

• Euclidean distances are more robust to estimate than dot products. This is because dot products require a normalization (division by estimated vector lengths), and in statistics, division by an estimated quantity leads to large variance. For a solution see: Ping Li, Trevor J. Hastie, and Kenneth W. Church, Improving Random Projections Using Marginal Information. If you are using random projections with documents, you can take the normalized smoothed counts (that is, the probability of a word, smoothed with Dirichlet smoothing) and apply the square root; the resulting vector then has length 1. For this approach see: Krstovski et al., Efficient Nearest-Neighbor Search in the Probability Simplex.
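
A tiny sketch of that recipe (the counts and the smoothing parameter alpha are invented for illustration):

```python
import numpy as np

counts = np.array([5.0, 0.0, 2.0, 0.0, 13.0])  # word counts for one document
alpha, V = 0.1, len(counts)                     # Dirichlet smoothing parameter

# Smoothed word probabilities sum to 1 ...
p = (counts + alpha) / (counts.sum() + alpha * V)
# ... so the square-rooted vector lies on the unit sphere: ||sqrt(p)||_2 = 1,
# which removes the troublesome length normalization from the dot product.
x = np.sqrt(p)
print(np.linalg.norm(x))  # 1.0
```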

• Are most vector components non-zero, or only a few? For example, in information retrieval the vectors may represent words: the vector for the word “the” will have many non-zeros, while a topic word will have far fewer. This is an issue in general because the vector for “the” will be “close” to every other vector.
• Is the data too sparse? Then one suggestion is to first apply a fast Johnson–Lindenstrauss transform with a Hadamard matrix to make the data dense, and then apply a sparse projection. For details see: Nir Ailon and Bernard Chazelle, The Fast Johnson–Lindenstrauss Transform and Approximate Nearest Neighbors.
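
A sketch of the densification step (a randomized Hadamard transform); the Sylvester construction and the example vector are my own simplifications of the Ailon–Chazelle transform:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
d = 256
x = np.zeros(d)
x[[3, 17, 200]] = [1.0, -2.0, 0.5]   # a very sparse input vector

# Randomized Hadamard transform H D x: random signs D, then the Hadamard
# matrix spreads the energy of x over all coordinates ("densification"),
# while the 1/sqrt(d) scaling keeps the Euclidean norm unchanged.
D = rng.choice([-1.0, 1.0], size=d)
y = hadamard(d) @ (D * x) / np.sqrt(d)

print(np.count_nonzero(x), np.count_nonzero(y))  # 3 non-zeros before, 256 after
```

After this step, a cheap sparse projection can be applied safely, because no single coordinate of the densified vector carries too much of the norm.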

## Dot Products vs. Euclidean Distance

Ping Li, Trevor J. Hastie, and Kenneth W. Church. Improving random projections using marginal information. In Proc. of COLT, Pittsburgh, PA, 2006.

## Application to Dimensionality Reduction

Dimitris Achlioptas, Frank McSherry, and Bernhard Scholkopf. Sampling techniques for kernel methods. In Proc. of NIPS, pages 335–342, Vancouver, BC, Canada, 2001.

Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. In Proc. of KDD, pages 245–250, San Francisco, CA, 2001.

### Relationship to SVD (and PCA)

Samuel Kaski, Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. Suggests that random projections preserve the separability between classes of documents as well as PCA does, provided that more dimensions are used. On the dataset used, PCA with 50 dimensions and random projection with 100 dimensions achieve the same result.

## Application to Classification

• Dmitriy Fradkin and David Madigan. Experiments with random projections for machine learning.

• Ali Rahimi and Benjamin Recht, Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. This paper describes an interesting idea: instead of trying to optimize the parameters of a machine learning model (for example AdaBoost), just create random features (or projections on random lines with splits). Given enough of those features, they reach the performance of AdaBoost. The advantage here is that code that uses randomization is much easier to write. Later papers (see the next reference) speed up this idea many fold.

• Le, Sarlos, Smola, Fastfood: Approximating Kernel Expansions in Loglinear Time. This paper builds upon the previously cited one (the “random kitchen sinks” paper). First, it sets out to do the same as “kitchen sinks”, but much faster; the speed-up relies on a special orthogonal matrix known as the Hadamard matrix. Moreover, after this paper the relationship between random projections and kernel methods becomes clear: the authors propose to write out the feature map of the Gaussian kernel explicitly. Typically (or at least until recently) the advantage of kernel methods was that they could work without writing a high-dimensional (or even infinite) feature map, so this is a sharp departure (see the blog reference cited below). The infinite feature map is approximated via random projections.

• Smola’s blog related to the topic

• Vikas Sindhwani, Random Embeddings, Matrix-valued Kernels and Deep Learning. This is a YouTube talk that shows an application of the idea of explicitly approximating the feature map via randomization. In particular, the slide titled “DNNs vs Kernel Methods on TIMIT (Speech)” shows a few-line MATLAB script that applies an SVM with a Gaussian kernel. The script does not scale, but it is excellent for illustrating the basic idea of what is going on.

• Fast Random Feature Expansions for Nonlinear Regression. This is a YouTube talk that could be useful in explaining the Fastfood paper. The slide “Random Kitchen Sinks: Implementation” gives a MATLAB implementation of random kitchen sinks with a Gaussian kernel.

• Random Projections for Linear Support Vector Machines by Saurabh Paul, Christos Boutsidis, Malik Magdon-Ismail, and Petros Drineas. Evaluates (and describes) the efficient projection method due to Clarkson and Woodruff.
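
The random-feature idea behind the kitchen-sinks and Fastfood papers can be sketched with plain random Fourier features for the Gaussian kernel (a simplified version; the bandwidth, dimensions, and feature count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random Fourier features: approximate the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / 2) by an explicit random feature map
# z(x) = sqrt(2/D) * cos(W x + b), so that z(x) . z(y) ~ k(x, y).
d, D = 5, 20000                      # input dimension, number of random features
W = rng.normal(size=(D, d))          # rows ~ N(0, I) match bandwidth sigma = 1
b = rng.uniform(0.0, 2 * np.pi, size=D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = rng.normal(size=d)
y = rng.normal(size=d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
approx = z(x) @ z(y)
print(exact, approx)  # the two values agree to within O(1/sqrt(D))
```

A linear model trained on z(x) then behaves like a kernel machine with the Gaussian kernel; Fastfood replaces the dense matrix W with Hadamard-based structured matrices to cut the cost of computing W @ x.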

## Sparsity

Toward a unified theory of sparse dimensionality reduction in Euclidean space by Jean Bourgain, Sjoerd Dirksen, Jelani Nelson.

## Application to Clustering

• One approach is to compute all k nearest neighbors and then use single-link agglomerative clustering.

• Another approach is to randomly project the data into a lower-dimensional space and perform the clustering there. For example, Sanjoy Dasgupta, Experiments with Random Projection, suggests projecting the data into a random subspace of lower dimension and clustering there with the EM algorithm. More recently, Ravi Kannan’s work on SVD uses a random projection into a lower-dimensional subspace, in which he first applies SVD and then an approximate heuristic for k-means.

• Xiaoli Zhang Fern and Carla E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. Related to the first approach mentioned above, but more expensive to execute. They choose a dimension k for the random subspace, then repeat t times: project the data into a random subspace of dimension k, cluster the data in that subspace, and observe which points belong to the same cluster. After this procedure there are t different clusterings (sets of clusters). What matters is in how many of those clusterings two points a and b share the same cluster; this gives a measure of distance between a and b. They then apply complete-link agglomerative clustering (which is a computational bottleneck). One takeaway is that when projecting into a small dimension, different random subspaces may reveal different clustering structure. They used k = 5, which may be too low.
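
A sketch of the ensemble procedure, with a toy two-blob dataset and a minimal 2-means in place of the paper's clustering algorithm (all parameters here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def two_means_labels(X, iters=20):
    """Tiny 2-means (Lloyd's algorithm) with farthest-point initialization."""
    c0 = X[0]
    c1 = X[np.argmax(((X - c0) ** 2).sum(axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated Gaussian blobs in 50 dimensions.
n, d, k_sub, t = 40, 50, 5, 10
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),
               rng.normal(6.0, 1.0, (n // 2, d))])

# Repeat t times: project into a random subspace of dimension k_sub and
# cluster there; co[a, b] records the fraction of runs in which points
# a and b landed in the same cluster.
co = np.zeros((n, n))
for _ in range(t):
    R = rng.normal(size=(d, k_sub)) / np.sqrt(k_sub)
    labels = two_means_labels(X @ R)
    co += (labels[:, None] == labels[None, :]).astype(float)
co /= t  # 1.0 = always together, 0.0 = never; 1 - co is the ensemble distance
```

The matrix 1 - co is then fed to an agglomerative clustering routine (complete-link, in the paper).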

## Applications

### Documents

• Krstovski et al., Efficient Nearest-Neighbor Search in the Probability Simplex. It is known that the Jensen-Shannon distance often achieves superior results to cosine similarity. This paper shows an approximation that simply takes the square root of the probabilities; after this transformation, Jensen-Shannon is well approximated by the cosine.
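
A small check of the square-root trick: after the transform, 1 − cosine equals half the squared Hellinger distance between the distributions exactly (the further connection to Jensen-Shannon is argued in the paper; the distributions below are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two word distributions (points on the probability simplex).
p = rng.dirichlet(np.ones(50))
q = rng.dirichlet(np.ones(50))

# After the square-root transform both vectors have unit length, so the
# cosine similarity is just sum_i sqrt(p_i q_i) (the Bhattacharyya
# coefficient), and 1 - cos = (1/2) ||sqrt(p) - sqrt(q)||^2, the squared
# Hellinger distance.
sp, sq = np.sqrt(p), np.sqrt(q)
cos = sp @ sq
hellinger_sq = 0.5 * np.sum((sp - sq) ** 2)
print(np.isclose(1 - cos, hellinger_sq))  # True: an exact identity
```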

## Application to visualization

If one studies the methods that embed high-dimensional data in two dimensions for visualization, one notices the following trick. The methods that work well (e.g. t-SNE) optimize a sum of weighted pairwise distances, where the weighting over-emphasizes all small distances while under-emphasizing the large ones, or even ignoring them by setting their weight to 0. Faithfully embedding high-dimensional data in two dimensions is not possible, so compromises have to be made, and the compromise is to focus on small distances. Small distances correspond to nearest neighbors, so any method that computes nearest neighbors will be useful for visualization.

The following example from scikit-learn on random forest embedding is useful to illustrate this trick.

## Relationship of random projections to other machine learning methods

• Random Forest
• SVM

## Comparison of Software Tools

Radim Řehůřek, Performance Shootout of Nearest Neighbours: Querying

## Tutorials

Bob Durrant and Ata Kaban. Random Projections for Machine Learning and Data Mining: Theory & Applications https://sites.google.com/site/rpforml/ http://www.cs.bham.ac.uk/~durranrj/ECML_Tutorial/ECML_Tutorial.pdf

## Other

Feature Hashing for Large Scale Multitask Learning by Weinberger, Dasgupta, Langford, Smola, and Attenberg. This paper solves the following problem: you have a lot of features and you want to do regression or SVM or something else that involves dot products. You don’t want to explicitly store the vocabulary of n features (and work with all n features – there are just too many). Instead, you map the original n features to m features via hashing (m is much smaller than n). In this way you will get collisions, of course. While collisions are unavoidable, the trick they use (in conjunction with hashing) is to make sure the dot products computed on the hashed features are equal in expectation to the dot products computed on the original features.
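
A minimal sketch of the hashing trick (the function names are my own, and MD5 stands in for the paper's hash functions so that results are stable across runs):

```python
import hashlib

def _h(s):
    # Stable (run-independent) hash of a string, unlike Python's built-in hash().
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def hashed_vector(tokens, m):
    """Map a bag of tokens to an m-dimensional vector without storing a vocabulary.

    Each token's index and sign come from hashes; the random signs are what
    make the dot product of two hashed vectors an unbiased estimate of the
    original dot product (the key property in Weinberger et al.).
    """
    v = [0.0] * m
    for tok in tokens:
        index = _h(tok) % m
        sign = 1.0 if _h(tok + "#sign") % 2 == 0 else -1.0
        v[index] += sign
    return v

a = hashed_vector("the quick brown fox".split(), m=32)
b = hashed_vector("the quick red fox".split(), m=32)
dot = sum(ai * bi for ai, bi in zip(a, b))  # estimates the overlap of the two bags
```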