The code is open source and available on GitHub.
- “Which pages are getting an unusual number of hits in the last 30 minutes?”
- “Which categories of items are now hot?”
We want to know which items exceed a certain frequency, and to identify events and patterns. Answering such questions in real time over a continuous data stream is not easy when serving millions of hits, because of the following challenges:
- Single Pass
- Limited memory
- Volume of data in real-time
These constraints call for a smart counting algorithm. Data stream mining to identify events and patterns can be performed with algorithms such as Lossy Counting and Sticky Sampling. Below I will demonstrate how these problems can be solved efficiently.
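To give a feel for how a single-pass, bounded-memory counter can work, here is a minimal Python sketch of Lossy Counting. It is not the code from the repository mentioned above; the class name `LossyCounter`, its methods, and the parameter names `epsilon` (maximum counting error, as a fraction of the stream length) and `support` (frequency threshold) are assumptions made for illustration.

```python
import math


class LossyCounter:
    """Approximate frequency counts over a stream in a single pass."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        # The stream is conceptually split into buckets of this many items.
        self.width = math.ceil(1 / epsilon)
        self.n = 0          # number of items seen so far
        self.entries = {}   # item -> [count, delta (maximum missed count)]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets,
            # so it could have been missed at most (bucket - 1) times.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that are too rare to matter.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose counted frequency is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

Feeding each page hit to `add` and periodically calling `frequent(support)` returns the pages that are currently hot. Reported counts can undercount the true frequency by at most `epsilon * n`, which is the price paid for keeping only a small table in memory instead of a counter per distinct item.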