Frequency Counting Algorithms over Data Streams

The code is open source and available on GitHub.

– “Which pages have received an unusual number of hits in the last 30 minutes?”
– “Which categories of items are now hot?”

We want to know which items exceed a certain frequency so that we can identify events and patterns. Answering such questions in real time over a continuous data stream is not easy when serving millions of hits, because of the following challenges:

  • Single pass
  • Limited memory
  • Volume of data in real-time

These constraints call for a smart counting algorithm. Mining a data stream to identify events and patterns can be done with the following algorithms: Lossy Counting and Sticky Sampling. Below I will demonstrate how these problems can be solved efficiently.
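
To make the idea concrete, here is a minimal sketch of Lossy Counting in Python, following the algorithm as it is commonly described in the literature. It is not taken from the open-source code mentioned above; the class name `LossyCounter`, its methods, and the parameter names `epsilon` and `support` are illustrative assumptions. The stream is processed in a single pass, split into buckets of width ceil(1/epsilon), and low-count entries are pruned at every bucket boundary so that memory stays bounded.

```python
import math


class LossyCounter:
    """Illustrative Lossy Counting sketch (names are hypothetical)."""

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max_error]

    def add(self, item):
        """Process one stream item (single pass, no buffering)."""
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # current bucket id
        entry = self.entries.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # A new item may have been missed in earlier buckets,
            # so its count can be too low by at most bucket - 1.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                key: value
                for key, value in self.entries.items()
                if value[0] + value[1] > bucket
            }

    def frequent(self, support):
        """Return items whose estimated frequency is at least support."""
        threshold = (support - self.epsilon) * self.n
        return {
            key: value[0]
            for key, value in self.entries.items()
            if value[0] >= threshold
        }
```

A caller would create one counter, call `add` for every incoming hit, and call `frequent(support)` whenever a report is needed. The reported counts can underestimate the true counts by at most `epsilon` times the number of items seen, which is the price paid for the bounded memory.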