A series of blog posts on Cosine Similarity. Even if you are an advanced data scientist, you may be surprised to learn that Cosine Similarity is related to a more advanced statistical similarity measure called Jensen-Shannon Divergence. In this series, I start with the basics and work up to advanced topics such as Jensen-Shannon Divergence and sketching for efficient computation.
This series contains tips and tricks about the SVD. The SVD is a foundational technique in Machine Learning. This series approaches the SVD from a few viewpoints to build useful intuition about what is actually happening behind the SVD equation. For example, one can think of the SVD as a matrix approximation technique, as a technique for finding interesting one-dimensional projections of the dataset, and as a pattern extraction technique…
Tips about high-quality and fast clustering
Perhaps a more common name is Similarity Search. This problem also appears in different disguises. One disguise is the so-called all nearest neighbors problem, which asks to compute the similarity between all pairs and select the top similarities. A related name for all nearest neighbors is Thresholded Correlation Matrix. When we ask for all pairs whose similarity exceeds a very high threshold, we are solving the so-called deduplication, or record linkage, problem.
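To make the all-pairs formulation concrete, here is a minimal brute-force sketch in Python. It assumes cosine similarity as the similarity measure; the function names, the toy vectors and the 0.99 threshold are hypothetical choices made only for illustration. It compares every pair, so it is quadratic in the number of items and is meant to show the problem statement, not an efficient solution.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return (i, j, similarity) for every pair at or above the threshold,
    most similar pairs first. Brute force: checks all n*(n-1)/2 pairs."""
    pairs = []
    for i, j in combinations(range(len(vectors)), 2):
        s = cosine(vectors[i], vectors[j])
        if s >= threshold:
            pairs.append((i, j, s))
    pairs.sort(key=lambda t: t[2], reverse=True)
    return pairs


# Hypothetical toy data: items 0 and 1 point in the same direction,
# item 3 is a slightly perturbed copy, item 2 is unrelated.
docs = [[1.0, 0.0, 1.0], [2.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 0.1, 0.9]]
near_duplicates = similar_pairs(docs, threshold=0.99)
```

With a very high threshold this acts as a deduplication pass; lowering the threshold, or keeping only the first few results, gives the top-similarities variant.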
This is the website for my talk at Berlin Buzzwords 2015.
A detailed description, with source code, of how to implement a custom similarity for Elasticsearch.
The suffix array allows one to find arbitrarily long phrases in very large strings. Here’s how it works.
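As a rough illustration of the idea, the Python sketch below sorts all suffix start positions of a string and then uses binary search to locate a phrase. The function names are hypothetical, and the construction is deliberately naive (it sorts whole suffixes), so it suits small examples only; very large strings need a dedicated construction algorithm.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order of the suffixes.
    Naive construction: fine for a demo, too slow and memory hungry for huge inputs."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_phrase(text, suffix_array, phrase):
    """Return the sorted start positions at which phrase occurs in text.
    Suffixes that start with the phrase form one contiguous block in the
    suffix array, so two binary searches find the block's boundaries."""
    m = len(phrase)

    # Lower bound: first suffix whose first m characters are >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < phrase:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


text = "to be or not to be"
sa = build_suffix_array(text)
hits = find_phrase(text, sa, "to be")
```

Each lookup costs a logarithmic number of string comparisons regardless of how long the phrase is, which is what makes the structure attractive for phrase search.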