Frequency Counting Algorithms over Data Streams

The code is open-source and available on GitHub.

– “Which pages are getting an unusual number of hits in the last 30 minutes?”
– “Which categories of items are now hot?”

We want to know which items exceed a certain frequency so that we can identify events and patterns. Answering such questions in real time over a continuous data stream is not an easy task when serving millions of hits, due to the following challenges:

  • Single pass
  • Limited memory
  • Volume of data in real-time

These constraints call for a smart counting algorithm. Data stream mining to identify events and patterns can be performed with two algorithms: Lossy Counting and Sticky Sampling. Below I will demonstrate how these problems can be solved efficiently.
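As a rough illustration of the first of the two algorithms, here is a minimal Python sketch of Lossy Counting. It is not the code from the repository mentioned above. The class name LossyCounter, its methods, and the parameter names epsilon (maximum counting error as a fraction of the stream length) and support (the frequency threshold) are assumptions made for this example.

```python
import math
from typing import Dict, Hashable, List, Tuple


class LossyCounter:
    """Illustrative Lossy Counting sketch (names are hypothetical)."""

    def __init__(self, epsilon: float) -> None:
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be in (0, 1)")
        self.epsilon = epsilon
        # The stream is split into buckets of ceil(1 / epsilon) items each.
        self.width = math.ceil(1 / epsilon)
        self.n = 0  # number of items seen so far
        # item -> (count, delta), where delta bounds the undercount
        self.entries: Dict[Hashable, Tuple[int, int]] = {}

    def add(self, item: Hashable) -> None:
        """Process one item of the stream (single pass)."""
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        # A new item may have been seen and pruned up to bucket - 1 times.
        count, delta = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, delta)
        # At each bucket boundary, drop entries that are too infrequent.
        if self.n % self.width == 0:
            self.entries = {
                key: (c, d)
                for key, (c, d) in self.entries.items()
                if c + d > bucket
            }

    def frequent(self, support: float) -> List[Tuple[Hashable, int]]:
        """Return items whose tracked count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return [(key, c) for key, (c, _) in self.entries.items() if c >= threshold]


# Example: count page hits and ask which pages account for at least half of them.
counter = LossyCounter(epsilon=0.2)
for page in ["a", "a", "b", "a", "c", "a", "d", "a", "e", "a"]:
    counter.add(page)
hot_pages = counter.frequent(support=0.5)
```

The memory used depends on epsilon rather than on the length of the stream: rare items are pruned at every bucket boundary, and any tracked count undershoots the true count by at most epsilon times the number of items seen.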

My conversation with the great Nathan Marz

Three months ago I attended the NoSQL matters conference in Barcelona. The keynote speaker was Nathan Marz. Nathan is the creator of Storm, an open source real-time processing framework that I have relied on for heavy scaling over the past year and a half. His blog is motivating (it’s probably the reason I started this blog) and he is writing a new book on Big Data. So overall, I had solid reasons to want to meet and talk with a person I admire.

7 Lessons Learned at a London Startup

So I will add one more post to the pile on this topic by sharing my own experiences of the startup world. I worked for a tech startup for about a year. I was hired as the first employee, doing back-end (and not only!) development. The company had already begun its business 5 months before I joined.

The main product of the company tells you who the most important people engaging with your brand on Twitter are. It also offers other detailed reports, such as gender and location breakdown, engagement, and the potential reach of your content marketing, but the real value lies in creating a top-influencers list. You should engage with your key people, try to nurture them, and turn them into customers or change their negative opinions. Not rocket science, but a clever idea. Below I share 7 lessons that I learned during my time there, which I will never regret.

How to spot first stories on Twitter using Storm

The code is open-source and available on GitHub.
Discussion on Hacker News

As a first blog post, I decided to describe a way to detect first stories (a.k.a. new events) on Twitter as they happen. This work is part of the thesis I wrote last year for my MSc in Computer Science at the University of Edinburgh. You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The published information can be retrieved and analyzed for news detection. The immediate spread of events on Twitter, combined with its large number of users, makes it well suited to extracting first stories. To that end, this project implements distributed real-time first story detection (FSD) on Twitter data, built on top of Storm.