Deploying a Spark job using Juju !

Juju makes it easy to setup and monitor a Spark cluster with a few commands.
In this guide we will setup a new cluster and deploy a Spark job using this tool. According to the official definition, Juju is a service modelling tool that allows people to model, configure and deploy applications in the cloud. It also offers some monitoring capabilities as well as scaling. Continue Reading


Using Spark for Anomaly (Fraud) Detection

The code is open-source and available on Github.


Anomaly detection is a method used to detect outliers in a dataset and take some action. Example use cases can be detection of fraud in financial transactions, monitoring machines in a large server network, or finding faulty products in manufacturing. This blog post explains the fundamentals of this Machine Learning algorithm and applies the logic on the Spark framework, in order to allow for large scale data processing. Continue Reading

How to spot first stories on Twitter using Storm

The code is open-source and available on Github.
Discussion on Hacker News

As a first blog post, I decided to describe a way to detect first stories (a.k.a new events) on Twitter as they happen.  This work is part of the Thesis I wrote last year for my MSc in Computer Science in the University of Edinburgh.You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter combined with the large number of Twitter users prove it suitable for first stories extraction. Towards this direction, this project deals with a distributed real-time first story detection (FSD) using Twitter on top of Storm. Continue Reading