The ability to store and process clicks
The ability to store and process clicks is foundational for today’s information economy. To take a simple example, I care about how many people liked my status update; that is click counting at its simplest. A more complex example is the Google Analytics summary of the number of users per city who viewed my posts. But these are all statistics on counts.
In today’s information economy, clicks are a big deal and big business. A concrete example is online advertising, where companies compete and pay for clicks.
Clicks are just events, like those coming from manufacturing or the internet of things
I just want to make clear that while I’m talking about clicks, a lot of this information applies perfectly well to other events, such as those coming from manufacturing equipment and, of course, the internet of things (that is, devices like microwaves, fridges and other gadgets sending information to the cloud).
Approaches to processing clicks
As a website’s user and product base grows, the simple approaches to processing clicks may not work so well. In particular, the number of clicks may get so large that it no longer fits in traditional database systems. It’s much worse for manufacturing and the internet of things, where the data is produced by machines. As we know, those have a much larger capacity when it comes to sinking your grandmother’s database system (I mean things like MySQL, Postgres and the like).
Another approach is to throw the click (or event) data at Hadoop. However, one has to wait for an overnight or weekly batch job to complete in order to surface useful information. Even then, there is a large chance that a complex Hadoop processing pipeline will fail due to bad data, out-of-date software components or the most common pitfall: subtle and sudden schema changes.
In today’s world, where everyone needs to know everything the moment it happens, (shaky) overnight batches may produce results a bit too late.
Streaming based on keeping a window of the data
A competing approach, so-called streaming, is based on processing data on the fly, computing answers valid within some small timeframe, and then throwing the data away. The core distinction between streaming approaches and Hadoop is the memory restriction. To take a concrete example, a simple streaming algorithm may store only the last 100 users who clicked on a product page, while a Hadoop cluster has the capacity to store the complete history.
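To make the memory restriction concrete, here is a minimal sketch in Python of a window that remembers only the last 100 clickers (the class and method names are made up for illustration):

```python
from collections import deque

class ClickWindow:
    """Remembers only the last `size` users who clicked; older events are forgotten."""

    def __init__(self, size=100):
        # deque with maxlen automatically drops the oldest entry when full
        self.window = deque(maxlen=size)

    def record(self, user_id):
        self.window.append(user_id)

    def distinct_users(self):
        # Distinct clickers within the current window only
        return len(set(self.window))
```

Everything older than the last 100 events is simply gone; that is the price of the small memory footprint.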
Streaming relies on the need to forget in order to know the most useful information in the moment. The great thing about streaming systems is that they do not fail as often as Hadoop pipelines, and when they do, they can be restarted immediately. What is not so great is that you have to decide a priori what you need to know, and teach the streaming system what to remember, what to forget and how.
A data abstraction for what you need to know
If I ran an e-commerce company, my business would revolve around users interacting with products. The best kind of interaction is of course a purchase, but other interactions, like page visits, are also valuable since they signal an intent to purchase.
The most natural and useful abstraction for this kind of data is a simple 2-dimensional table, with one row per user and one column per product.
If a user John clicks on a product “bananas”, I write a +1 entry into the corresponding cell of the table.
At the end of the day, if I want to know how many purchases John made, I look at John’s row and count all the +1’s. If I want to know how many times bananas were sold, I count the +1’s in the bananas column.
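As a sketch, the table and its row and column sums might look like this in Python (the user and product names are hypothetical):

```python
from collections import defaultdict

# User-by-product click table: table[user][product] -> count
table = defaultdict(lambda: defaultdict(int))

def click(user, product):
    table[user][product] += 1   # write a +1 into the corresponding cell

click("john", "bananas")
click("john", "bananas")
click("mary", "bananas")

# Row sum: all of John's interactions
john_total = sum(table["john"].values())                        # 2
# Column sum: how many times bananas were clicked
bananas_total = sum(row["bananas"] for row in table.values())   # 3
```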
One nice thing about such tables is that information can also be easily aggregated by summing rows, columns and intersections. Speaking abstractly, these are just linear algebra operations, and for reasons I will mention later on, this is VERY, VERY good (and your servers will appreciate it as well).
One can also do more complex aggregations, such as looking at sales across geographic regions and user profiles. This is also linear algebra, because it means adding up a bunch of rows corresponding to user profiles.
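To make the linear algebra concrete, here is a small NumPy sketch; the users, products and the profile grouping are all made up for illustration:

```python
import numpy as np

# Toy click-count table: rows = users, columns = products
users = ["ann", "bob", "carl"]
products = ["bananas", "apples"]
T = np.array([[2, 0],
              [1, 3],
              [0, 1]])

# Per-product totals are just column sums
product_totals = T.sum(axis=0)   # [3, 4]

# Aggregating users into profiles is a matrix product:
# G[i, u] = 1 if user u belongs to profile i.
G = np.array([[1, 1, 0],    # profile A = ann + bob
              [0, 0, 1]])   # profile B = carl
profile_table = G @ T       # profile-by-product click counts
```

The same matrix-multiply pattern covers region rollups, cohort summaries and so on, which is exactly why the linear algebra view pays off.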
It gets even better with machine learning.
Curiously enough, when it comes to streaming, linear algebra is more useful than relational algebra.
At the end of the day, when you choose a system vendor, make sure they can do the data summary approach.
Hadoop and exact statistics are good, but slow to produce. Information delivered without delay within a tolerable error may be worth more than late but accurate information.
The tolerable error of streaming approaches is a function of the available memory and the system design.
Smart statistics, aka sketches and hashes
Perhaps a more useful statistic is the number of distinct users who clicked on the page. Naively, this statistic requires storing all the user ids, while a smart data structure called HyperLogLog requires only k*log(log(n)) space, where n is the number of unique users who visited the page and k is some constant related to the accuracy. Another useful statistic could be the number of distinct users from a particular geographic category. The ultimate summary is the user ids themselves. Surprisingly, there are also ways to store this information in much less space than the raw user ids would take.
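To give a feel for how this works, here is a toy HyperLogLog sketch in Python. It is a simplified version of the real algorithm (the bias constant and small-range correction follow the standard formulation), not production code:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct counts using 2^p small registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit deterministic hash of the item
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        j = h >> (64 - self.p)                # first p bits pick a register
        w = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        # rank = position of the leftmost 1-bit in w (1-based)
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=10 this uses about a thousand tiny registers regardless of how many millions of distinct users pass through, and the relative error stays within a few percent.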
A useful data summary abstraction
An example data summary
The approach based on data summaries and streaming is an approximation by necessity (we would need more memory if we were to store everything exactly). However, this approach is very fast (online) and uses little memory. It’s nice to combine it with a batch approach (a Hadoop deployment), which is exact but needs a lot of memory and is very slow (overnight batches).
One can go even further and, instead of storing atomic pieces of information such as user ids, store cleverly designed hashes. In this way, one can save memory.
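One classic example of such a clever hash structure is the Bloom filter, which answers “have we seen this user id?” using a fixed bit array instead of storing the ids themselves. A toy sketch, not tuned for real error rates:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: set-membership in a fixed number of bits.

    It may report false positives, but never false negatives.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0   # a Python int used as a bit array

    def _positions(self, item):
        # Derive several deterministic bit positions per item
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Whether you store three user ids or three million, the filter itself stays 1024 bits; the price is a tunable false-positive rate.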
In the future, Hadoop will be important for storing the logs and for historical processing, but not for daily analytics investigations.
In a way, streaming vs. Hadoop is the same as smart methods (algorithms) vs. brute force (hardware).