In Algorithms for Big Data we strive to design and implement the world’s most advanced, accurate and efficient algorithms for handling Big Data.
For us, Big Data is not just about storing data and moving it around between storage systems like MongoDB, Hadoop, or Elasticsearch. For us, Big Data is about looking inside the data. Starting from a bird’s eye view, we are able to drill down to structures and patterns that develop over time, across geographic locations, and through customer and product profiles. We are also able to query anything in real time, not only about the past but also about the future.
While there are other companies that try to handle the same tasks, what sets us apart is the accuracy and efficiency of our algorithms. In fact, we have the word algorithm in our name for this very reason: we are an algorithmic research company.
Our first task was to solve a couple of core pieces of the big data puzzle:
- how to store big data in memory based on efficient compression algorithms and streaming techniques
- how to extract patterns from the data
- how to search big complex data efficiently
- how to build accurate and efficient machine learning models
As a second step, we aim to solve the general query problem: arbitrary queries about the past and the future.
- how to query our datasources with SQL about the past
- how to query our knowledge base about the future (again with SQL)
We list the tasks from easiest to most difficult, which is also the order in which we had to solve them. While the first group of tasks already has working research prototypes, we call the second group simply our five year plan for big data queries.
The data summary approach: our way of storing Big Data
It turns out that the ability to store big data in memory with efficient compression is key to our success. The idea is the same as in JPEG image compression: most of the time you don’t need the full-resolution image to see what the picture shows. It is the same with big data: for trends, pattern extraction, fuzzy search, and machine-learning-based prediction, access to the full dataset is not required. In fact, it is exactly the opposite: computing an exact answer to a question whose result is already an approximation (like a search query) by scanning the full dataset is a waste of time and resources. While a Hadoop-based system is evaluating one hypothesis on the full dataset, we will evaluate a million and find the one the data supports.
The following observation is key: to store n numbers, you don’t need storage of size n. A celebrated result of theoretical computer science shows that storage on the order of 100*log(n) suffices if you accept answers within a small error, say 0.1, with probability 90%. This is an exponential reduction in the size of the storage, and it is how we approached the first problem.
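To make the flavor of such a result concrete, here is a minimal sketch of one well-known structure of this kind, the Count-Min sketch, which answers frequency queries over a stream using space independent of the number of distinct items. This is an illustration of the general technique, not a description of our production code; the parameters and hashing scheme are our own choices for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts over a stream.

    Space is width * depth counters, independent of how many
    distinct items the stream contains.
    """

    def __init__(self, width=272, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash position per row, derived from a keyed hash.
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Never underestimates; the overestimate is bounded by
        # roughly (e / width) * total_count, with probability
        # growing with depth.
        return min(self.table[row][col] for row, col in self._cells(item))
```

The estimate can only err upward (hash collisions add counts, never remove them), which is why taking the minimum over the rows gives the tightest answer.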
To discover patterns in big data, you don’t need the whole dataset. If you have the right summary (not just a sample), you can solve this problem easily. It turns out the summary we build for storage already contains all the dominant, interesting patterns in the data.
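As one simple illustration of how a summary can retain the dominant patterns, consider the classic Misra-Gries frequent-items summary: it keeps at most k-1 counters, yet any item occurring in more than a 1/k fraction of the stream is guaranteed to survive in it. Again, this is a textbook sketch chosen for illustration, not our actual summary.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k-1 counters.

    Guarantee: every item whose true frequency exceeds
    len(stream) / k is present in the returned summary.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The summary is tiny compared to the stream, yet the heavy (dominant) items cannot be evicted from it, which is exactly the sense in which a good summary keeps the patterns that matter.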
Our system not only stores the data efficiently but also allows the execution of complex queries which matter in business. While our system is internally based on linear algebra and matrices, the interface we expose is based on relational algebra and SQL, so we have to translate from relational algebra to linear algebra. SQL is a very high level and powerful language. As it happens, a top-k query which asks for the items with the highest similarity is just a nearest neighbor search. So the first problem we had to solve on the way to the five year plan was efficient nearest neighbor search.