Clustering: References - Algorithms for Big Data

CLUSTERING TO MINIMIZE THE MAXIMUM INTERCLUSTER DISTANCE, T. Gonzalez (known as Farthest-first traversal)

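Finds a 2-approximation to the minimization of the maximum intercluster distance (in the paper's terminology, the largest distance between two points of the same cluster) or of the maximum cluster radius. The algorithm is efficient: O(k*n) distance computations for k clusters. The distances need to satisfy the triangle inequality for the approximation factor to hold.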
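A sketch of the standard argument for the factor 2, to show where the triangle inequality enters. The notation ($h_1, \dots, h_k$ for the heads, $r$, $x^*$, $\mathrm{OPT}$) is introduced here and is not taken from the paper. Let $r = \max_x \min_j d(x, h_j)$ be the largest distance from any point to its nearest head, attained at $x^*$. Each head was the farthest point from the existing heads when it was chosen, and that farthest distance never increases as heads are added, so the $k+1$ points $h_1, \dots, h_k, x^*$ are pairwise at distance at least $r$. Any clustering into $k$ clusters puts two of them in the same cluster, hence $\mathrm{OPT} \ge r$ for the maximum within-cluster distance. For two points $x, y$ in the same cluster with head $h_j$:

$$d(x, y) \le d(x, h_j) + d(h_j, y) \le 2r \le 2\,\mathrm{OPT}.$$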

Algorithm: Initially all points are in the same cluster with one arbitrary point denoted as the “head” of the cluster.

Iteration 1: Step 1: From the first cluster (which contains all points), choose the point furthest from the head and make it the head of a new, second cluster. Step 2: Go over the points in the first cluster and partition them in two: a point moves to the second (just created) cluster if its distance to the head of the second cluster is smaller than its distance to the head of the first cluster.

Iteration 2: Now we have two clusters. From the first cluster, find the point furthest from the head of the first cluster; from the second cluster, find the point furthest from the head of the second cluster. Of these two, choose the point that is furthest from its respective “head” and make it the head of a third cluster. Go over the points in the first and second clusters and move to the third cluster those that are closer to the head of the third cluster than to their respective “heads”,

and so on until k clusters have been created.
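The iterations above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pseudocode: it assumes the points are coordinate tuples under Euclidean distance, and the names `farthest_first_traversal`, `dist_to_head`, `assignment` and `radii` are made up for the example. Each point keeps the distance to its current head, so adding a head costs one pass over the data, O(k*n) in total.

```python
import math

def farthest_first_traversal(points, k, first=0):
    """Return (heads, assignment, radii) for a k-clustering of `points`.

    heads      -- indices of the chosen heads, in order of selection
    assignment -- assignment[i] is the position in `heads` of point i's head
    radii      -- radii[j] is the distance from head j to its nearest earlier head
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")
    heads = [first]
    # Distance from every point to the head of its current cluster.
    dist_to_head = [math.dist(p, points[first]) for p in points]
    assignment = [0] * n
    radii = [math.inf]  # the first head has no earlier head
    while len(heads) < k:
        # The point farthest from its own head becomes the next head.
        new_head = max(range(n), key=dist_to_head.__getitem__)
        radii.append(dist_to_head[new_head])
        heads.append(new_head)
        # Move every point that is strictly closer to the new head.
        for i, p in enumerate(points):
            d = math.dist(p, points[new_head])
            if d < dist_to_head[i] or i == new_head:
                dist_to_head[i] = d
                assignment[i] = len(heads) - 1
    return heads, assignment, radii
```

The `radii` list records, for each new head, its distance to the closest head chosen before it; these values never increase from one head to the next.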

If the data is sparse, the distances may not be very reliable; use the SVD. The SVD can also be used to initialize the first two clusters (take the two furthest points along the first principal direction). If only a small number of clusters is required (e.g. k = log(n)), this algorithm is sufficient. For each point we maintain the distance to its current head. When a new head is created, we need to evaluate the distances from the new head to each of the points. Maybe one can use some approximation here, since we only need to know the points that are closer to the newly opened cluster. One can also do a bottom-up pass as a preprocessing step to merge points that are “obviously” close. In this way n is reduced.

For an alternative description see Figure 4 in Dasgupta: Performance guarantees for hierarchical clustering. http://cseweb.ucsd.edu/~dasgupta/papers/hier-jcss.pdf

Performance guarantees for hierarchical clustering, Sanjoy Dasgupta


Based on farthest-first traversal, the paper describes how to create a tree (a hierarchy of clusters) with some optimality properties.

The idea is that farthest-first traversal is run first, so that the points are numbered: 1 is the first point, 2 is the second point added, and so on. While doing that, we are implicitly defining clusterings with k clusters: first k = 1, then k = 2, and so on. As we go to larger values of k, the radii of the clusters shrink. We use the distances between the cluster centers: $R_i$ is the minimum distance from cluster center $i$ to the previous cluster centers (1, 2, …, i - 1). Now we bucket those radii. If $R$ is the radius of the complete dataset, then the buckets contain the points with radii in the ranges:

R/2 to R
R/4 to R/2
R/8 to R/4

It is probably the case (although this is not mentioned) that the number of points in the buckets increases by a factor of 2 with each level.
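The bucketing can be sketched as follows. This is an illustration under stated assumptions: `radii` holds the values $R_i$ in traversal order (with a placeholder for point 1, which has no earlier center), $R$ is taken to be $R_2$, and each bucket is treated as half-open, (R/2, R], (R/4, R/2], and so on. The function name and the convention that point 1 gets level 0 are choices made for this example.

```python
def granularity_levels(radii):
    """Map each R_i to a bucket index: level j means R / 2**j < R_i <= R / 2**(j - 1).

    radii[0] belongs to the first point and is ignored (it gets level 0);
    R is taken to be radii[1].
    """
    if len(radii) < 2:
        return [0] * len(radii)
    R = radii[1]
    levels = [0]
    for r in radii[1:]:
        if r <= 0:
            raise ValueError("radii must be positive (duplicate points?)")
        level, upper = 1, R
        # Halve the upper end of the bucket until r falls inside it.
        while r <= upper / 2:
            upper /= 2
            level += 1
        levels.append(level)
    return levels
```

Repeated halving is used instead of a logarithm so that values exactly on a bucket boundary are not misplaced by floating-point rounding.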

At the next stage of the algorithm we build a tree. We start with point 1 (the “head” of the first cluster, which contains all the points). We then take the second point, and we have a tree with two nodes. Next we pick point 3. It is added under 1 or 2, whichever is closer. Suppose we add it under 2. The tree is then the chain 1 → 2 → 3.

Next we pick 4. It could potentially be added under 1, 2 or 3. However, we also consider the buckets based on the radii $R_i$. Suppose that 4 is in bucket (R/4, R/2) and 3 is in the same bucket. We cannot add 4 under 3 because that would not decrease the radius of 3. So we add 4 under 1 or 2, whichever is closer; suppose 1. What we want to achieve with this is a guarantee that the radius shrinks by a factor of two or more.
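The parent rule described above can be sketched like this. It is a reading of the rule in these notes rather than the paper's exact pseudocode: the points are given in traversal order together with their bucket levels (point 1 at level 0), and each point is attached to the closest point in a strictly lower-numbered bucket. Euclidean distance and the name `build_tree` are assumptions of the example.

```python
import math

def build_tree(points, levels):
    """Return parent[i] for each point; the first point is the root (parent None).

    points -- coordinate tuples in farthest-first order
    levels -- levels[i] is the bucket index of points[i], with levels[0] == 0
    """
    parent = [None] * len(points)
    for i in range(1, len(points)):
        # Only earlier points in a strictly coarser bucket may be the parent.
        candidates = [j for j in range(i) if levels[j] < levels[i]]
        parent[i] = min(candidates, key=lambda j: math.dist(points[i], points[j]))
    return parent
```

For example, if points 3 and 4 share a bucket, point 4 is never attached under point 3 even when 3 is its nearest neighbour.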

Thresholded SVD: A provable SVD-based algorithm for learning topics in dominant admixture corpus, T. Bansal, C. Bhattacharyya, and R. Kannan.

This is potentially faster and more accurate than LDA.

Stage 1:

The key is first to throw out a word from a document if its frequency in that document is not above a threshold. The threshold is decided individually for each word. The reason is that some words, e.g. “run”, belong to a topic (“sports”) but can also be used outside of that topic (“to run an election”). Therefore, we set a threshold above which there is a higher chance that the word “run” is used indicatively of a topic.
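A minimal sketch of the truncation step. How the per-word thresholds are chosen is specified in the paper and is not reproduced here, so the thresholds are simply passed in. The representation (each document is a dict from word to relative frequency), the function name, and the default threshold of 0 for unlisted words are assumptions of the example.

```python
def truncate_documents(docs, thresholds):
    """Drop each word whose frequency in a document is not above that word's threshold.

    docs       -- list of dicts mapping word -> relative frequency in the document
    thresholds -- dict mapping word -> threshold (words not listed use 0.0)
    """
    truncated = []
    for doc in docs:
        kept = {w: f for w, f in doc.items() if f > thresholds.get(w, 0.0)}
        truncated.append(kept)
    return truncated
```

With a threshold of 0.05 for “run”, a sports report in which “run” makes up 10% of the words keeps it, while a political article that uses it once in passing loses it.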

Stage 2:

We then use the “truncated” documents to build topic clusters. The aim is for the correlations between two “truncated” documents to be more “truthful”. A substantial number of documents is also needed (those are available in practice) to overcome the “data loss” from the truncation. The “truncated” documents are clustered by first representing them in the SVD space and then running a k-means approximation algorithm (e.g. farthest-first traversal on the SVD representation). The k-means approximation is then refined using the usual iterative k-means scheme. Notice that the authors run the iterative k-means stage on the truncated matrix, not the SVD’ed one. Each cluster defines a topic vector. We also keep track of the individual documents in each topic.

Stage 3:

Clean up the topics via catchwords. The idea is that only a few words, the so-called catchwords, are enough to indicate a topic. The other words add more noise than signal. Given that we already have the topics, the catchwords are going to stick out: they are more frequent and take some of the highest entries in the topic vector.
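One way to make “stick out” concrete is sketched below. The selection rule is an assumption made for illustration and is not the paper's exact criterion: a word counts as a catchword of a topic if its weight there is at least `ratio` times its weight in every other topic, and only the `top` heaviest such words are kept. The function name and both parameters are hypothetical.

```python
def catchwords(topics, ratio=2.0, top=3):
    """Pick, for each topic, words that are much heavier there than in any other topic.

    topics -- list of dicts mapping word -> weight of the word in that topic
    ratio  -- required factor over the word's largest weight in the other topics
    top    -- maximum number of catchwords kept per topic
    """
    result = []
    for t, topic in enumerate(topics):
        picked = []
        for word, weight in topic.items():
            # Largest weight of this word in any other topic (0.0 if absent).
            rival = max(
                (other.get(word, 0.0) for s, other in enumerate(topics) if s != t),
                default=0.0,
            )
            if weight > 0 and weight >= ratio * rival:
                picked.append(word)
        picked.sort(key=lambda w: topic[w], reverse=True)
        result.append(picked[:top])
    return result
```

Under this rule a word like “the”, which is heavy in every topic, is never selected, while “goal” is selected for a sports topic if it is rare elsewhere.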

Notes: instead of thresholds, one can consider a hypothesis test: “is this word coming from the distribution of a document (or topic), or is it coming from a general background distribution?”. It could be more reliable than thresholds that are picked to make the proofs work.
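The hypothesis-test idea could look like the following sketch. It is only one possible formalization and does not come from the paper: the null hypothesis is that each of the `doc_length` word positions independently produces the word with its background rate, and the word is kept if seeing at least `count` occurrences would be unlikely under that model (a one-sided exact binomial test). The function names and the significance level `alpha` are assumptions.

```python
import math

def binomial_tail(count, n, p):
    """P[X >= count] for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(count, n + 1)
    )

def is_indicative(count, doc_length, background_rate, alpha=0.01):
    """True if `count` occurrences are significantly more than the background predicts."""
    return binomial_tail(count, doc_length, background_rate) < alpha
```

For a word with a background rate of 1%, five occurrences in a 100-word document pass the test at `alpha=0.01`, while two occurrences do not.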

Other

Subquadratic Approximation Algorithms For Clustering Problems in High Dimensional Spaces, Borodin, Ostrovsky, and Rabani. This paper uses approximate nearest neighbor search to solve hierarchical clustering. The paper mentions that the furthest neighbors can also be computed by techniques similar to those used for the closest neighbors. This might be useful for farthest-first traversal applications.