The Cassandra London meetup group recently celebrated its six-month anniversary, and after a string of fantastic speakers it was left to me to follow up my talk at the first-ever meetup with another.
I decided to start out by giving a brief history of my work with Cassandra: from joining VisualDNA, through the hard times struggling with GC issues, up to the present – successfully running a 16-node Cassandra cluster on EC2. Looking back, working with Cassandra has been a very positive experience, but the analytics side of things (carrying out complex analysis of data stored in Cassandra) seemed harder than it needed to be.
This is where DataStax’s Brisk comes into play.
DataStax’s Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services.
Put simply, Brisk gives you the real-time capabilities of Cassandra combined with a simple interface to MapReduce via Hive, in an easy-to-use bundle.
Case study – segmenting users
As a case study I built a very simple system in PHP for segmenting users into buckets. The key idea is to have a pixel that can be included on a website to track users (via a cookie) and put them into various buckets. This demonstrates the key features of Brisk:
- Real-time API access
- A Hive query to find out how many users are in each segment
- A Hive query to calculate the average and standard deviation of the number of groups that each user is part of
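To make the two analytics questions concrete, here is a minimal in-memory sketch in Python of what those queries compute. It is not the Hive or Brisk code from the talk: the sample data, the (user, segment) row layout and the function names are all assumptions made for illustration, and the standard deviation is the population form, since the post does not say which variant was used.

```python
from collections import Counter
from math import sqrt

# Hypothetical sample data: one (user_id, segment) row per pixel hit.
# In the real system these rows would live in Cassandra.
memberships = [
    ("u1", "sports"),
    ("u1", "travel"),
    ("u2", "sports"),
    ("u3", "sports"),
    ("u3", "travel"),
    ("u3", "music"),
    ("u1", "sports"),  # repeat hit; must not be counted twice
]


def users_per_segment(rows):
    """Count distinct users in each segment."""
    unique = set(rows)  # drop repeat hits for the same user and segment
    return Counter(segment for _, segment in unique)


def segments_per_user_stats(rows):
    """Return (mean, population std dev) of segments per user."""
    unique = set(rows)
    per_user = Counter(user for user, _ in unique)
    values = list(per_user.values())
    if not values:
        return 0.0, 0.0  # no users seen yet
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, sqrt(variance)


if __name__ == "__main__":
    print(dict(users_per_segment(memberships)))
    print(segments_per_user_stats(memberships))
```

Both functions deduplicate rows first, because a tracking pixel will typically fire many times for the same user and segment. In Hive the same two steps would be expressed as grouped aggregations over the stored rows rather than as Python loops.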