Algorithms for Big Data

Topics

Cosine Similarity

A series of blog posts on cosine similarity. Even if you are an experienced data scientist, you may be surprised to learn that cosine similarity is related to a more advanced statistical similarity measure, the Jensen-Shannon divergence. The series starts with the basics and works up to advanced topics such as the Jensen-Shannon divergence and sketching for efficient computation.
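As a starting point for the basics, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the choice to raise an error on zero vectors are assumptions made for illustration, not something taken from the series.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: treat the similarity with a zero vector as undefined.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Because the result depends only on direction, scaling either vector by a positive constant leaves the similarity unchanged.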

Singular Value Decomposition

This series contains tips and tricks about the SVD, a foundational technique in machine learning. It approaches the SVD from several viewpoints to build useful intuition about what is actually happening behind the SVD equation. For example, one can think of the SVD as a matrix approximation technique, as a way to find interesting one-dimensional projections of the dataset, and as a pattern extraction technique…
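To make the "matrix approximation" and "one-dimensional projection" views concrete, the sketch below finds the largest singular value and its singular vectors with power iteration, then builds the corresponding rank-one approximation. Power iteration is used here only because it fits in a few lines of plain Python; it is not necessarily the method the series uses. The function names, the fixed iteration count, and the uniform starting vector are all assumptions, and the code assumes a non-zero matrix whose top right singular vector is not orthogonal to the starting vector.

```python
import math


def top_singular_triplet(A, iters=200):
    """Return (sigma, u, v) for the largest singular value of A.

    Uses power iteration on A^T A. A is a list of equal-length rows.
    """
    n_rows, n_cols = len(A), len(A[0])
    # Starting guess: uniform unit vector (an arbitrary choice).
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iters):
        # Av is the projection of every row of A onto the direction v.
        Av = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        # w = A^T (A v), then renormalize to get the next estimate of v.
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v


def rank_one_approximation(A):
    """Rank-one matrix sigma * u * v^T built from the top singular triplet."""
    sigma, u, v = top_singular_triplet(A)
    return [[sigma * ui * vj for vj in v] for ui in u]
```

Here `v` plays the role of the one-dimensional projection direction, and `sigma * u` holds the coordinates of each row along that direction.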

Practical Clustering

Tips for fast, high-quality clustering.

What’s new in Nearest Neighbor Search?

Perhaps a more common name is Similarity Search. The problem also appears in different disguises. One is the so-called all nearest neighbors problem, which asks to compute the similarity between all pairs and select the top similarities. A related name is Thresholded Correlation Matrix. When we ask for all similar pairs above a very high similarity threshold, we are solving the so-called deduplication problem, also known as record linkage.
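The all-pairs formulation can be sketched with a brute-force loop: normalize every vector, take dot products between all pairs, and keep the pairs at or above a threshold. This is only a baseline that costs time quadratic in the number of vectors, not an efficient search method. The names `normalize` and `similar_pairs`, and the use of cosine similarity as the measure, are assumptions for illustration; the code also assumes no vector is all zeros.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similar_pairs(vectors, threshold):
    """Return (i, j, similarity) for all pairs with cosine similarity >= threshold.

    Results are sorted from most to least similar.
    """
    unit = [normalize(v) for v in vectors]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        # For unit vectors the dot product equals the cosine similarity.
        sim = sum(a * b for a, b in zip(unit[i], unit[j]))
        if sim >= threshold:
            pairs.append((i, j, sim))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs
```

With a threshold close to 1, the surviving pairs are near-duplicates, which is the deduplication setting described above.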

Random Projections for Search and Machine Learning

Custom Similarity for Elasticsearch

Finding phrases with suffix arrays