In this and next blog posts, I will discuss some opportunities for data clustering using random projections.

## Hierarchical Clustering

One of the simplest clustering approaches is bottom-up hierarchical clustering. In this approach one starts with each data point being its own “cluster”. If we start with $n$ data points we have $n$ clusters at the first iteration. Then the closest two clusters are merged and one ends up with $(n - 1)$ clusters. Continuing in this way, one obtains a tree. The original data points are at the leaves of the tree.

## Index structure for nearest neighbor

Random projections are used to solve the nearest neighbor problem efficiently. This means that by using random projections one can construct an index which is able to efficiently execute a search operation. Given a point id, the closest point is returned as the answer to a search query.

One possibility to use this nearest-neighbor indexing structure is to find for each data point its closest point. Then we can use the so called single link clustering to merge the points. For example, if points (a,b) are neighbors and points (b, c) are neighbors, then we’ll put them together in the same cluster containing (a,b,c).

One can observe, that if for every point one finds the top nearest neighbor, and then applies single link clustering, then the size of the data set is reduced by at least a factor of two. Then one proceeds recursively until a predefined number of clusters if found. It takes less than $log(n)$ steps.

## Kaggle Digits Dataset

I am using the dataset for the Digit Recognizer Tutorial from Kaggle. This dataset has about 40000 images of digits from 0 to 9. I ran the clustering algorithm described above and set the algorithm to return less than 20 clusters. The clusters obtained are showed in the figure below. Each block represents a top-level cluster. The top-level clusters are composed of nested clusters which are represented as sub-blocks with different colors within each block. Each digit is represented as a data point in different color. If the digits had clustered perfectly, one would see 10 blocks each containing points of a single color.

## Artificial Dataset

I ran the same algorithm on an artificial dataset. The dataset was generated in the following way. First ten prototype vectors of dimension 700 consisting of +1 and -1 were generated. Each +1/-1 was chosen with probability 0.5. Then 2000 datapoints were generated from each prototype by taking the prototype, selecting randomly 40% of the components and multiplying them by -1. This means that 40% of the prototype was corrupted by noise.

The result of the clustering is here:

## Clustering Algorithm

The clustering algorithm is described in pseudo code below:

Input: a dataset of n points
Output: k clusters

def cluster(dataset, weights):
if dataset.numPoints < threshold:
return hierarchy of clusters
else:
index = buildRandomProjectionsIndex(dataset)
uf = unionFindStructure()
for each point a in dataset:
b = index.nearestNeighbor(a)
# compute b, the nearest neighbor of a
uf.merge(a,b)
clusters = uf.getConnectedComponents()
newDataset = create a new dataset of the next iteration
newWeights = []
for cluster in clusters:
clusterRepr = zero_vector
for point in cluster:
clusterRepr += weights(point)*datset.getVectorRepr(point)
length = compute the length of clusterRepr