Archive for the ‘Cassandra’ Category

posted on Tuesday 17th May 2011 by Dave

Cassandra + Hadoop = Brisk

The Cassandra London meetup group recently celebrated its six-month anniversary and, after a string of fantastic speakers, it was left to me to follow up my talk at the first-ever meetup with another one.

I decided to start by giving a brief history of my work with Cassandra: from when I joined VisualDNA, through the hard times struggling with GC issues, up to the present, where we are successfully running a 16-node Cassandra cluster on EC2. Looking back, working with Cassandra has been a very positive experience, but the analytics side of things (carrying out complex analysis of data stored in Cassandra) seemed harder than it could be.

This is where DataStax’s Brisk comes into play.

DataStax’ Brisk is an enhanced open-source Apache Hadoop and Hive distribution that utilizes Apache Cassandra for many of its core services.

Put simply, Brisk gives you the real-time capabilities of Cassandra combined with a simple interface to MapReduce via Hive, in an easy-to-use bundle.

Case study – segmenting users

As a case study I built a very simple system, written in PHP, for segmenting users into buckets. The key idea is to have a pixel that can be included on a website to track users (via a cookie) and put them into various buckets. This demonstrates the key features of Brisk:

Real-time API access

Batch analytics

  • A Hive query to find out how many users are in each segment
  • A Hive query to calculate the average and standard deviation of the number of groups that each user is part of
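
The Hive queries themselves are not reproduced here, so as a rough illustration of what the two batch jobs compute, here is a minimal sketch in plain Python. The sample data, the segment names and the function names are all hypothetical, and I am assuming a population (rather than sample) standard deviation.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample data: (user_id, segment) pairs, roughly what a
# tracking pixel might record. None of these names come from the talk.
memberships = [
    ("u1", "sports"),
    ("u1", "travel"),
    ("u2", "sports"),
    ("u3", "sports"),
    ("u3", "travel"),
    ("u3", "music"),
]


def users_per_segment(rows):
    """Count the distinct users in each segment."""
    unique_pairs = set(rows)  # ignore repeated (user, segment) pairs
    return Counter(segment for _, segment in unique_pairs)


def segments_per_user_stats(rows):
    """Return (mean, population std dev) of the number of segments per user."""
    per_user = Counter(user for user, _ in set(rows))
    counts = list(per_user.values())
    return mean(counts), pstdev(counts)


segment_counts = users_per_segment(memberships)
avg, std = segments_per_user_stats(memberships)
print(dict(segment_counts))
print(avg, std)
```

In Brisk the same aggregations would run as Hive queries over data held in Cassandra; the sketch only shows the shape of the calculation on a tiny in-memory list.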

The talk

You can watch a podcast of the talk on the SkillsMatter website.

posted on Wednesday 1st December 2010 by Dave

Running Cassandra on EC2

As the founder of Cassandra London it was left to me to provide the first talk; hopefully this won’t be necessary every month! To kick things off I talked about running Cassandra on Amazon EC2. At VisualDNA we run a production cluster on EC2, but this hasn’t been without its difficulties!

This talk covers the advantages and disadvantages of running Cassandra on EC2 and includes some I/O benchmarks – including some excellent work from Corey Hulen. There is also a basic overview of what actually happens when Cassandra reads and writes (although this is simplified to a single node).
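
For readers who have not seen the slides, the single-node read and write path can be sketched roughly as below. This is a toy model under heavy assumptions, not Cassandra's actual implementation: the class name and the `memtable_limit` parameter are made up, and real Cassandra also deals with timestamps, tombstones, bloom filters and compaction, all of which are left out.

```python
class ToyNode:
    """Toy single-node store: commit log, memtable and immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # flushed, immutable tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        # A write is appended to the commit log, then applied to the memtable.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # A full memtable is written out as a new SSTable and then cleared.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Check the memtable first, then the SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # memtable is full, so it is flushed to an SSTable
node.write("a", 3)   # the newer value for "a" lives in the memtable
print(node.read("a"), node.read("b"), node.read("c"))
```

The point of the sketch is that writes only ever append (to the log and the memtable), while a read may have to consult several on-disk tables, which is one reason disk I/O matters so much.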

The main reason EC2 can be problematic comes down to I/O performance and, perhaps more importantly, the predictability of that performance. This aside, there are many reasons why you may want to use EC2. This interview with Adrian Cockcroft looks at why Netflix chose to go down the EC2 route and is a recommended read.

posted on Thursday 7th October 2010 by Dave

Cassandra: replication and consistency

Cassandra can be an unforgiving beast if you don’t know what you’re doing. I have first-hand experience of this! My advice: learn everything you can. This is a good introduction to replication and consistency in Cassandra.

posted on Friday 2nd July 2010 by Dave

PHP and Cassandra

Yesterday (1st July) I presented for the first time at the PHP London user group. It was a gentle introduction: a five-minute “lightning” talk slot. I spoke about Cassandra, giving a short introduction to using it with PHP.

To summarise my main points from the talk (perhaps something I should have done in the talk!):

  • Cassandra is a “highly scalable second-generation distributed database”
  • It can be considered a schema-less database insofar as each row can have different columns
  • Cassandra is designed to be both fault tolerant and horizontally scalable – both read and write throughput go up linearly as more boxes are added to the cluster
  • I think the best way of accessing Cassandra from PHP is directly via the Thrift API. This allows a beginner to learn about the core functionality of Cassandra including its limitations
  • Cassandra has Hadoop support, which means that Hadoop MapReduce jobs (a scalable, distributed mechanism for processing data) can read from and write to Cassandra*
  • Cassandra does not have any query language (as opposed to MySQL or MongoDB which both allow you to query data in different ways)
  • When designing your data model, I think it’s easiest to try to forget about SQL and concentrate on how Cassandra works (don’t design a relational schema and then “port” it over)

* As of version 0.7!
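
To make the “schema-less” point above a little more concrete, here is a minimal sketch that models a column family as nested dictionaries. It is only an illustration of the data model, not Thrift client code; the `users` data and the `columns_of` helper are hypothetical.

```python
# A column family modelled as {row_key: {column_name: value}}.
# Hypothetical data: the two rows deliberately carry different columns.
users = {
    "alice": {"email": "alice@example.com", "city": "London"},
    "bob": {"email": "bob@example.com", "twitter": "@bob", "age": "31"},
}


def columns_of(column_family, row_key):
    """Return the sorted column names stored for one row (empty if absent)."""
    return sorted(column_family.get(row_key, {}))


print(columns_of(users, "alice"))
print(columns_of(users, "bob"))
```

Nothing forces the two rows to share a set of columns, which is what makes it easier to design around how the data will be read than around a fixed relational schema.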

Overall, I think Cassandra is a very useful tool. Whether it fits your use case or not is another matter!

If you’re interested in learning more about using Cassandra in a PHP project, I recommend the following starting points:

  1. Using Cassandra with PHP
    https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP
  2. WTF is a SuperColumn? An Intro to the Cassandra Data Model
    http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model