Deploying a Spark job using Juju!

Juju makes it easy to set up and monitor a Spark cluster with a few commands.
In this guide we will set up a new cluster and deploy a Spark job using this tool. According to the official definition, Juju is a service modelling tool that allows people to model, configure and deploy applications in the cloud. It also offers some monitoring and scaling capabilities. Continue Reading

Using Spark for Anomaly (Fraud) Detection

The code is open-source and available on GitHub.

Introduction

Anomaly detection is a method used to detect outliers in a dataset so that some action can be taken. Example use cases include detecting fraud in financial transactions, monitoring machines in a large server network, and finding faulty products in manufacturing. This blog post explains the fundamentals of this Machine Learning technique and applies it on the Spark framework to allow for large-scale data processing. Continue Reading
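To make the idea of an outlier concrete, here is a minimal sketch in plain Python. It assumes a simple z-score rule (flag values that lie far from the mean, measured in standard deviations), which is only one common approach and not necessarily the one used in the post or its Spark code. The function name `find_outliers`, the threshold and the sample amounts are all made up for illustration.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return the values whose z-score exceeds the threshold.

    Assumption: the data is roughly bell-shaped, so distance from the
    mean (in standard deviations) is a reasonable anomaly score.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22, 500]
print(find_outliers(amounts, threshold=2.5))
```

On a small sample a single extreme value inflates the standard deviation, which is why the example uses a lower threshold than the default.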

How to improve the flawed interview process

Hiring engineers is a painful, time-consuming, error-prone process. I think we can all agree on that. If done right, a good candidate becomes part of the company and adds value to it. If done wrong, candidates can waste the interviewer’s valuable time as well as their own. Moreover, a bad candidate can end up being hired and contribute little or even negative productivity: the team gets less work done with the new member on board.

The truth is that (un)fortunately, everything starts from Continue Reading

Frequency Counting Algorithms over Data Streams

The code is open-source and available on GitHub.


– “Which pages are getting an unusual hit in the last 30 minutes?”
– “Which categories of items are now hot?”

We want to know which items exceed a certain frequency and identify events and patterns. Answering such questions in real time over a continuous data stream is not an easy task when serving millions of hits, due to the following challenges:

  • Single Pass
  • Limited memory
  • Volume of data in real-time

These constraints call for a smart counting algorithm. Data stream mining to identify events and patterns can be performed by applying the following algorithms: Lossy Counting and Sticky Sampling. Below I will demonstrate how these problems can be solved efficiently. Continue Reading
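As a rough illustration of how one of these algorithms copes with a single pass and limited memory, here is a minimal Python sketch of Lossy Counting as it is commonly described. It is not the code from the linked repository; the class name `LossyCounter`, its methods and the sample page names are hypothetical.

```python
import math


class LossyCounter:
    """Approximate frequency counts over a stream in one pass.

    epsilon is the maximum error, as a fraction of the stream length,
    by which a reported count may undercount the true frequency.
    """

    def __init__(self, epsilon):
        if not 0 < epsilon < 1:
            raise ValueError("epsilon must be between 0 and 1")
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # items per bucket
        self.n = 0                           # items seen so far
        self.entries = {}                    # item -> [count, max_error]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        entry = self.entries.get(item)
        if entry is not None:
            entry[0] += 1
        else:
            # A new item may have been seen (and pruned) in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # At a bucket boundary, drop items that are too rare to matter.
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def frequent(self, support):
        """Return items whose count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


# Hypothetical stream: one hot page followed by 40 pages hit once each.
counter = LossyCounter(epsilon=0.1)
for page in ["home"] * 60 + [f"p{i}" for i in range(40)]:
    counter.add(page)
print(counter.frequent(support=0.5))  # {'home': 60}
```

The pruning step at each bucket boundary is what keeps memory bounded: pages hit only once are discarded, while the hot page survives with its count intact.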

My conversation with the great Nathan Marz

Three months ago I attended the NoSQL matters conference in Barcelona. The keynote speaker was Nathan Marz. Nathan is the creator of Storm, an open source real-time processing framework that I have relied on for heavy scaling over the past year and a half. His blog is motivating (it’s probably the reason I started this blog) and he is writing a new book on Big Data. So overall, I had solid reasons to want to meet and talk with a person I admire. Continue Reading

7 Lessons Learned at a London Startup

So I will add one more post to the pile on this topic by sharing my own experiences of the startup world. I worked for a tech startup for about a year. I was hired as the first employee doing back-end (and not only!) development. The company had already begun its business 5 months before I joined.

The main product of the company informs you about the most important people who engage with your brand on Twitter. Apart from offering other detailed reports such as gender-location breakdown, engagement, and the potential reach of your content marketing, the value lies in creating a top-influencers list. You should engage with your key people, try to nurture them and turn them into customers or influence their negative thinking. Not rocket science, but a clever idea. Below I share 7 lessons that I learned during my time there, which I’ll never regret. Continue Reading

How to spot first stories on Twitter using Storm

The code is open-source and available on GitHub.
Discussion on Hacker News

As a first blog post, I decided to describe a way to detect first stories (a.k.a. new events) on Twitter as they happen. This work is part of the thesis I wrote last year for my MSc in Computer Science at the University of Edinburgh. You can find the document here.

Every day, thousands of posts share information about news, events, automatic updates (weather, songs) and personal information. The information published can be retrieved and analyzed in a news detection approach. The immediate spread of events on Twitter, combined with its large number of users, makes it suitable for first story extraction. To that end, this project implements distributed, real-time first story detection (FSD) on Twitter, on top of Storm. Continue Reading