Case studies

Below are some case studies (demos) I have developed outside of my normal project work. If you are interested in descriptions of actual customer projects, please go to the projects page.

Nearest Neighbor Search with Random Projections (on Github)

This Scala library takes a set of data points, each represented as a dense vector of dimension 100-300, and indexes them. The index can then be queried for the vectors closest to a query vector by cosine similarity.

Visit on Github.

Technical Information.
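
To make the idea behind the nearest neighbor library concrete, here is a minimal sketch of sign-based random projections for cosine similarity. It is written in Python for brevity, not Scala, and it is not the library's API: the class name RandomProjectionIndex, its parameters (num_planes, num_tables, seed) and the helper cosine are all hypothetical names chosen for illustration. The real library may use a different indexing scheme.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class RandomProjectionIndex:
    """Hypothetical index: buckets vectors by the signs of random projections."""

    def __init__(self, dim, num_planes=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        # Each table gets its own set of random Gaussian hyperplanes.
        self.tables = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
            for _ in range(num_tables)
        ]
        self.buckets = [defaultdict(list) for _ in range(num_tables)]
        self.vectors = []

    def _signature(self, planes, v):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(
            sum(p * x for p, x in zip(plane, v)) >= 0.0 for plane in planes
        )

    def add(self, v):
        idx = len(self.vectors)
        self.vectors.append(v)
        for planes, buckets in zip(self.tables, self.buckets):
            buckets[self._signature(planes, v)].append(idx)
        return idx

    def query(self, q, k=5):
        # Gather candidates that share a bucket with q in any table,
        # then rank only those candidates by exact cosine similarity.
        candidates = set()
        for planes, buckets in zip(self.tables, self.buckets):
            candidates.update(buckets.get(self._signature(planes, q), []))
        ranked = sorted(
            candidates, key=lambda i: cosine(q, self.vectors[i]), reverse=True
        )
        return ranked[:k]


# Usage on synthetic data (assumed for illustration only).
rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(200)]
index = RandomProjectionIndex(dim=100)
for vec in data:
    index.add(vec)
result = index.query(data[0], k=3)
print(result)
```

Vectors separated by a small angle tend to fall on the same side of most random hyperplanes, so they often share a bucket. Using several tables raises the chance that a true neighbor shares at least one bucket with the query, at the cost of more memory and more candidates to rank.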

Word Clustering Demo (Precomputed Demo)

This is the output of a scalable clustering algorithm that I have implemented. Its purpose is to answer the question: "How can one navigate a dataset where objects are represented as vectors?"

Browse Demo.

Technical Blog Post: Hierarchical Clustering That Works.

TextDrill: A New Knowledge Discovery Application for Text

TextDrill aims to provide an easy-to-use, flexible, faithful and fast interface for knowledge discovery in unstructured text. TextDrill is centered around the common tasks of word, ngram, sentence and document clustering, as well as entity extraction. Everything, including entities, can be represented and linked in the same "knowledge" graph. At the moment, it is still early days for the tool, and only a technical presentation of some of the features is available. I am driving the features based on user feedback.

TextDrill Tour (on textdrill.io).

Extensions of Spark SQL via User-Defined Types (on Github)

One can extend Spark SQL with additional functions. To amortize the implementation effort, these should be frequently used functions, most likely ones related to the business domain. This approach makes it possible to a) get very good performance, similar to native Spark, while coding in Python; b) minimize noise in the code; c) speak the language of the business via Spark SQL.

Visit on Github.

GraphFormation: Exploring the Foundation of Infrastructure as Code

A bare-bones Python application (with tests and demos) that explores the foundational underpinnings of (immutable) infrastructure as code.

Visit on Github.