- Word clustering. Cluster words using word vector representations.
- Image clustering (coming soon)
- Document clustering (coming soon)
- Search results clustering (coming soon)
Structuring unstructured data
Clustering is one technique to organize “unstructured” data for the purpose of seeing a “bird’s eye view of the data” It sounds very cool in theory to be able to “just organize” any data in a meaningful way, but in practice there could be short-comings to the clustering approach which people should be aware of. My advice is here to start with the practical or business requirements. Depending on those you should decide if:
- is clustering the best fit to your business problem
- are there easier technical ways to accomplish the same business goal (e.g. classification or grouping by a small number of fixed rules)
- are you going to show your clusters to people, or only use the results in a black-box machine learning system
- do you need/want to provide some labeled examples to guide the clustering
- does the clustering need to change frequently over time
- how many clusters are needed (10, 100, 1000, 10000, 1 million, etc.)
- how many points need to be clustered (e.g. 1000 or half a billion)
- how to remove outliers (what points can be removed from the clustering)
From a business perspective if could be best to imagine how people are going to “consume” the results of the clustering, in what user interface will the results be presented. It could be a good idea to manually hand-craft clusters first and then see if the quality is actually achievable by automated methods if there is a need to scale.
This blog post is about some practical techniques to get your clustering to look as good as possible while running as fast as possible.
What problem are you solving?
Are you solving a grouping problem? If so, what intuitive meaning do you want the groups to have? Do you need interpretability or good labels of the clusters?
Are you building a automated taxonomy for the purpose of navigating the data?
Are you in an insight discovery mode where you try to find dominant patterns (or trends)
In practice, I have not seen en many business examples targeted at end end users that expose raw clustering results. Most probably this is because it is hard to guarantee what an end user might see or how the displayed results should be interpreted.
Types of clustering
bottom-up clustering (i.e. create clusters by mering points). Preferable when the number of clusters is large (e.g. more than 10000). Small clusters will look good, but big clusters composed hierarchically of small clusters will be bad (not much structure)
top-down clustering (i.e. create clusters by splitting the dataset). Preferable when the number of clusters is small (e.g. 10 to 1000) At the bottom of the hierarchy, points which should have been in the same cluster will end up in multiple clusters. Typically the first level of the hierarchy will look OK.
k-means clustering. Often the choice in practice but good initialization is critical.
density based clustering. Intuitively appealing but it has problems when the points lie in regions with different density.
There are some examples of clustering in two dimensional datasets in scikit-learn website that show that bottom-up hierarchical clustering and DBScan performs the best. DBscan will not perform well when the clusters have different density (this case is typical in practice). Also, as on those datasets, the single link rule will perform extremely well, while in practice it is problematic. The other clustering algorithms seem to fail even those simple datasets.
All clustering problems are going to have big issues with high-dimensional datasets. This is typically the issue in practice: very few points belong “obviously” to a single cluster. It’s very hard to have a few big clusters that capture well the complete dataset when you have multiple dimensions.
Clustering as deriving rules to organize the data
One good way to think about clustering is that clustering gives you automatically rules to group the data. Once you have the “rules” clustering is just about apply the rules to your dataset and grouping by those rules. In clustering the grouping criterion is often expressed as a “bag of words and frequencies” if you are clustering text or as a list of numbers, which represent the average over a set of datapoints. Another way to understand the rule is to look at the “closest” data points that fall in the cluster.
Prefer the k-means algorithm unless you need a hierarchy
Decide on the similarity criterion. For text the cosine after tf-idf or Jenson-Shannon divergence are very good choices. The cosine is simpler to understand while the Jenson-Shannon divergence and related similarities will give better performance. To understand which similarity works well one can look at similarity between examples.
Watch for symmetric vs. asymmetric similarities. This means that you have to decide whether to use similarity between examples (this is symmetric) vs. similarity between cluster and example. There are two ways: P(example given cluster) and P(cluster given example)
Remove outliers. For clustering one can define outliers as points which lie is sparse regions of the dataset (i.e. there not many points around them). This can be done efficiently with nearest neighbor search or node centrality in graphs using the SVD.
Decrease dimensionality of the dataset as much as possible.
Use the best initialization possible.
As I said, there are three tips:
decrease the dimensionality of the data (the easiest way is to use the SVD, but there are other good ways)
decrease the size of the dataset
choose “non-outliers” that are farthest apart
The recommended initialization for k-means is the so called farthest-first traversal. It seems to be a standard choice although not many people know about it. Farthest-first traversal means that initial clusters are points that are as far apart as possible. Unfortunately, in many practical situations those points could turn out to be outliers. Removing outliers and/or restricting initial cluster centers to be points in the dense regions is beneficial. Once this is done, farthest-first traversal can be applied 100000 data points from which to select good initial centers. Farthest-first traversal is a quadratic algorithm however using tricks such as a hashing and exploiting the sparsity of a dataset makes this algorithm perform well. If a small number of clusters is needed (for example 100) one can use the SVD and look for clusters centers in the “corners” given by points with largest projections on the singular vectors. The SVD can still be applied successfully even if more cluster centers are needed (e.g. 1000) by applying it multiple times and removing points which are already well fitted by previously selected cluster centers. As a last tip for obtaining good initialization consider the random projection method. This means the following: draw a random line and choose two points with farthest projections on the line. In recent years there are papers on a random initialization called kmeans++. This is a cheap version of farthest first traversal, where points far from the previous points are selected cheaply (or randomly weighted by their distance to previous centers).
- Implementation of scalable kmeans++ in Oryx
- Scalable K-Means++ by Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, Sergei Vassilvitskii
- Parallel Streaming Signature EM-tree: A Clustering Algorithm for Web Scale Applications by Christopher M. de Vries, Lance De Vine, Shlomo Geva, Richi Nayak
- Power Iteration Clustering by Frank Lin, William W. Cohen