Posts Tagged ‘PHP’

posted on Tuesday 16th July 2013 by Dave

From PHP to Go

I’ve built stuff in PHP for 12 years. For the last 8 weeks I’ve programmed solely in Go. I feel liberated.

Go (or golang if you’re Googling it) is a compiled language, with memory management and some nice concurrency features.

Why not PHP?

My career has been in three phases – firstly working on my own startup (in PHP4!), secondly working predominantly in marketing (in PHP5) and lastly working in startups.

When I started out, PHP made sense. As a self-taught programmer, I found it incredibly accessible. However, that looseness and lack of a “one true way” encouraged me to develop bad patterns. I believe my PHP is now pretty good. I can work around many of the problems people associate with PHP. I accept PHP for the cruft pile it is, but still enjoy working with it.

However, right now PHP does not solve the sort of problems I’m working on. For some time, I have been using the wrong tool for the job. My issue with PHP was around efficiency.

  1. Efficiency of machine utilisation — you can scale out PHP, if you build your app correctly, but you’ll need a tonne of CPU to do it
  2. Efficiency of developer time — with PHP a lot of unit tests are needed to cover even the most basic of things, working around deficiencies in the PHP standard library and the lack of typing
  3. Efficiency of running code — creating the entire application context for each request and then destroying it each time is not efficient – it is hard to build a PHP application which has very low and consistent response times (single digit ms)

Why golang?

There is a lot to like about Go. Here are the things I like.

Lightweight

Go feels like the antidote to heavy-weight do-everything frameworks that are increasingly popular. The standard library feels lean and the language design encourages a similar approach in your own code. Go packages encourage small reusable units of functionality.

Safety

There is nothing like the feeling of refactoring code and then compiling to catch any problems. Lots of bugs that could be introduced when writing PHP are impossible with Go due to type safety.

Easy to learn

Go has one way of doing things. Unsure how to lay out code? There is only one way – and go fmt will automatically apply a coding style for you. Go only has one type of loop (the `for` loop). For many things you do in Go there is only one way of doing them, so learning Go is largely a matter of learning these patterns. For me, having tried to give Scala a crack, I found Go much more approachable.
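
To make the loop point concrete, here is a quick runnable sketch (nothing beyond the standard library) showing how the single `for` keyword covers the jobs PHP splits across `for`, `while` and `foreach`:

package main

import "fmt"

func main() {
    // the classic three-part loop
    for i := 0; i < 3; i++ {
        fmt.Println(i)
    }

    // a "while" loop: just drop the init and post statements
    n := 0
    for n < 3 {
        n++
    }

    // an infinite loop: drop the condition too
    for {
        break
    }

    // iterating a collection, PHP's foreach
    for i, v := range []string{"a", "b", "c"} {
        fmt.Println(i, v)
    }
}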

Productive

It is very easy to be productive with Go – especially when coming from a dynamic language. Automatic memory management (through Garbage Collection) avoids the intricacies of doing it manually. The `:=` short variable declarations make code look dynamic.

Performant

With Go you can quickly write an API that will handle an order of magnitude more requests per second, with significantly lower latency, than PHP. There are no real tricks involved in achieving this – it is simply out-of-the-box performance.
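
As a flavour of what “out of the box” means, here is a minimal sketch of a JSON API using only the standard library – the route and response are invented for illustration:

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

func main() {
    http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(map[string]string{"status": "ok"})
    })
    // each request is served on its own goroutine; no application
    // context is built up and torn down per request as with PHP
    log.Fatal(http.ListenAndServe(":8080", nil))
}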

PHP to golang

Types

PHP variables have no fixed type, so you can do stuff like:

$a = "A";
$a = 1.1;
$a += "B";
echo $a;

With Go, we’re going to be telling it what type everything is, either explicitly:

var a string
a = "A"

Or implicitly:

a := "A"

The colon-equals operator is incredibly useful and hides a lot of type declaration nonsense. However, `a` is still a string in either case, so in Go we can’t just then do this:

a += 1.1

Occasionally you’ll need to move data between types (there is a runnable sketch after this list).

  • strings to numbers `a, err := strconv.ParseInt("12345", 10, 64)`
  • numbers to strings `b := fmt.Sprintf("%v", a)`
  • direct conversions `a := []byte("foo")` or, if `i` is an `int`, `j := int64(i)`
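
Putting those together into a runnable sketch – note that `strconv.ParseInt` also wants a base and a bit size, and returns an error you must check:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // string to number – the error tells us if parsing failed
    a, err := strconv.ParseInt("12345", 10, 64)
    if err != nil {
        panic(err)
    }

    // number to string
    b := fmt.Sprintf("%v", a)

    // direct conversions
    c := []byte("foo")
    i := 42
    j := int64(i)

    fmt.Println(a, b, c, i, j)
}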

Objects

Hold onto your hats folks – Go is not an object oriented language. Since 2008, when a man named Neil explained object oriented design to me and I invested an enormous amount of time reading the classic texts, I have been wedded to OOP. I could not understand why people thought OOP in Java land was such a problem.

With Go, you can attach methods to a type. So you can have some of the features of OOP. But remember – this is Go – we’re focussed on being lean. So with Go you’ll often see structs with public properties.

type User struct {
    GivenName, FamilyName string
    Age int
}

func main() {
    u := &User{"Dave", "Gardner", 34}
    fmt.Println(u)
}

http://play.golang.org/p/61B3jZSCn9

Job done, thank you kindly.

  • There are no `public`, `private`, `protected` keywords – instead, if the first character of a property is uppercase, it is “exported” from the package (think public) otherwise it is private.
  • Go has no inheritance for structs, so `protected` is meaningless.
  • The Go way is simply to make stuff public, unless you have a reason not to. For entities, just have them public.
  • The `&` operator is saying “make an instance of this struct and return a pointer to it”
  • You should construct structs explicitly naming the fields (the positional form I used above is not recommended practice and may at some point be deprecated), eg: `&User{GivenName: "Dave", FamilyName: "Gardner", Age: 34}`

If we want to, we can attach a method to this data type.

func (u *User) Name() string {
    return fmt.Sprintf("%s %s", u.GivenName, u.FamilyName)
}

http://play.golang.org/p/_0vevLo9CJ

We could just as easily declare a string type and attach a method.

type Name string

func (f Name) SayHello() string {
    return fmt.Sprintf("Hello %s!", f)
}

http://play.golang.org/p/e2msqvYAGL

Packages

One interesting point here is the “exported” concept — when writing Go you’re thinking about what you will export from your package, in terms of struct properties, functions, methods, types, constants etc. Within a package, there is no facility for information hiding — any piece of code can access the properties (exported or otherwise) of any variable.

This has led me to construct my programs as small, often reusable, packages. A package is where you carry out your information hiding.

A word on scope

Unlike PHP’s namespaces, Go’s packages define scope. If I were to write this:

package foo

var bar string

Then `bar` is in the global scope within this package, but is entirely inaccessible outside of the package. If I had written `Bar` then it would have been exported and directly accessible outside the package.

PHP has global scope, object scope (for properties) and local function scope (for locally declared variables). PHP also has some strange syntax for dealing with scope.

<?php

$bar = "The Three Greyhounds";

function foo() {
    global $bar;
    echo "$bar\n";
}

In Go we would write:

var bar string = "The Three Greyhounds"

func foo() {
     fmt.Println(bar)
}

http://play.golang.org/p/7yOmIlW6nv

Here the function `foo` simply closes over the package-scoped variable `bar`.

Go also has more levels of scope, for example within an `if` statement.

<?php

if ($a > 0) {
    $b = 1;
} else {
    $b = 0;
}

echo $b;

In Go you might be tempted to write:

func main() {
    a := 1
    if a > 0 {
        b := 1
    } else {
        b := 0
    }

    log.Println(b)
}

http://play.golang.org/p/2c5yLLLbCt

However `b` was created within the scope of the `if` statement only, and so is not defined when you come to `Println`.
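
The fix is to declare `b` in the enclosing scope, before the `if`:

func main() {
    a := 1
    var b int
    if a > 0 {
        b = 1
    } else {
        b = 0
    }

    log.Println(b)
}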

Interfaces and duck typing

In PHP you would write `class foo implements bar` – explicitly declaring that your class foo is going to implement bar. In Go, if something implements the methods defined in an interface, then it automatically implements that interface. Let’s have a look.

type Stringer interface {
    String() string
}

So to be a `Stringer`, we just need to have a method named `String` that returns a `string`.

func (u *User) String() string {
    return fmt.Sprintf("%s %s (%v)", u.GivenName, u.FamilyName, u.Age)
}
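
A `*User` can now be passed anywhere a `Stringer` is expected, with no `implements` declaration in sight. A minimal runnable sketch – the `announce` function is invented for illustration:

package main

import "fmt"

type Stringer interface {
    String() string
}

type User struct {
    GivenName, FamilyName string
    Age                   int
}

func (u *User) String() string {
    return fmt.Sprintf("%s %s (%v)", u.GivenName, u.FamilyName, u.Age)
}

// announce will accept anything that satisfies Stringer
func announce(s Stringer) {
    fmt.Println(s.String())
}

func main() {
    announce(&User{GivenName: "Dave", FamilyName: "Gardner", Age: 34})
}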

Errors and Exceptions

Hold onto your hats again folks – Go does not have exceptions. Instead, the Go way is to return an error (something that implements the built-in `error` interface) from a function or method and then check it explicitly. Let’s have a look:

foo, err := gossie.NewConnectionPool()

// now it's up to me!

The other thing we’re introducing here is Go’s concept of multiple return values — we can return a connection pool and an error. This same pattern is used in much Go code: make a call, then check and handle the returned error.

if err := DoFoo(); err != nil {
    // handle error
}

There is also `panic` and `recover`. These are for exceptional circumstances – where there is some critical run-time bug. An example is if you attempt to do something with a `nil` pointer:

func main() {
    var user *User
    fmt.Println(user.Name())
}

http://play.golang.org/p/ven1GGzvyl

It is possible to recover from panics. A good pattern here would be to catch any panics within an HTTP request handler, so that you can serve up an HTTP 500 Internal Server Error (instead of the program dying). However I rarely use this method of recovery — panic/recover is not designed for normal error handling flow.
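
A minimal sketch of that pattern (the wrapper name is invented for illustration, and `net/http` and `log` are assumed to be imported):

func recoverHandler(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer func() {
            if err := recover(); err != nil {
                log.Println("recovered from panic:", err)
                http.Error(w, "Internal Server Error", http.StatusInternalServerError)
            }
        }()
        next.ServeHTTP(w, r)
    })
}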

PHP Arrays

Arrays are one of the areas most obviously different from PHP – stemming from the fact that PHP has a single construct that does all the things. With Go, we’ll need to declare explicitly when we want a map vs a list etc. In PHP, everything is an array. In Go, you’ll find yourself learning more about data structures and then picking the right one for the right workload (Trie anyone?).

Array

Literal arrays in PHP:

$arr = array("dave", "loves", "cheese");

And in Go:

arr := [3]string{"dave", "loves", "cheese"}

In the Go world, (almost) no one uses arrays directly. Instead you use a `slice`, which you get by simply omitting the number of elements from the declaration.

arr := []string{"daves", "loves", "cheese"}

Here `arr` is a `slice` of strings. Remember that Go is typed — you have to tell it what’s in your array, and then the type system will make sure you stick to that.

Once we have a slice, we can append easily enough. In PHP:

$arr[] = "dreams";

And in Go:

arr = append(arr, "dreams")

Maps

Literal maps in PHP:

$map = array("foo" => "bar", "baz" => "bing");

And in Go:

m := map[string]string{"foo":"bar", "baz":"bing"}

Test if something exists in a map in PHP:

$exists = array_key_exists("baz", $map);

And in Go:

_, exists := m["baz"]

http://play.golang.org/p/pwOwd3AM_1

The underscore is a placeholder for the value – by using an underscore we’re explicitly saying we don’t care what this return value is.
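
For completeness, the other map operations a PHP developer will reach for – `unset($map["baz"])` and `foreach ($map as $k => $v)` – translate to the built-in `delete` and the `range` clause:

delete(m, "baz")

for key, value := range m {
    fmt.Println(key, value)
}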

Other useful data structures
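
Beyond slices and maps, the standard library ships a few more specialised structures – `container/heap`, `container/list` and `container/ring` – and anything more exotic (that Trie, say) is straightforward to build yourself on top of slices, maps and structs.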

Conclusions

Learning new languages is important for any developer. For me, it was always something I knew I needed to do, but something I could never prioritise. I also found a real barrier being back at the start – not being an expert anymore. I feel like Go changed that. It was understandable. It was accessible. It showed me a new way of solving problems. With this knowledge I could, if I wished, take some of these patterns back to PHP land – but I probably won’t. Once you’ve been using Go, it would seem perverse to switch back to PHP.

In my next post I’ll introduce some patterns I use when programming in Go.

posted on Tuesday 6th November 2012 by Dave

Stream de-duplication

I’ve recently started playing around with NSQ, in my hunt for a good, resilient, highly available, guaranteed “at least once” delivery queue. That’s a lot of adjectives, but basically it boils down to a queue that puts a copy of messages on N nodes and is able to operate (without losing messages) with any X of them failing, obviously where X < N.

NSQ attacks this problem in an interesting way. Instead of trying to form a cluster (in the sense that say RabbitMQ does), it instead treats each `nsqd` instance as a separate entity. It is only the clients that know there is more than one of them, and the directory service `nsqlookupd`. This actually makes it very reliable, in the sense that there are no troublesome master/slave relationships to preserve or leaders to elect.

This simplicity forces some of the work back on the client.

  • NSQ is guaranteed “at least once”, rather than “exactly once”; hence subscribers should operate in an idempotent way
  • when using with replication, it is up to the client to de-duplicate the messages on subscription

Deduplication

To de-duplicate, a subscriber needs to determine if it has seen a message before. Doing so in an accurate way would involve storing all the message IDs or some digest of the message itself in a large hash table. With this we could simply test:

if (message is in hash map) {
    ignore
}
process

Then we just need to make sure we add each message we see to the hash map. With a lossless hash map (eg: store everything), this is going to use unbounded memory.

The opposite of a Bloom filter

Bloom filters were my first thought when trying to come up with a way of bounding memory. Bloom filters are a probabilistic data structure that is able to test whether some element is a member of a set. A Bloom filter will never tell you an item is missing when it is really in the set (no false negatives), but may tell you an item is in the set when really it isn’t (a chance of false positives).

What I actually want is _the opposite_ of a Bloom filter.

http://lmgtfy.com/?q=opposite+of+a+bloom+filter

So picking the first link on Google, I checked out the blog post on somethingsimilar.com. @jmhodges’s solution is simple: use a fixed-size hash map and simply overwrite entries on collision. Let’s go through that slowly.

Here’s our hash map, with 10 buckets:

[Diagram: our empty hash buckets]

Now we process our first message and add it:

[Diagram: our content hashes to bucket 3]

To test if some new message has been seen we need to check whether we have got exactly this message content within the appropriate bucket. If the content does match, then we can be sure we’ve seen it. If the content does not match, then we cannot know. The reason is that we may have just overwritten this message with a new message that collided into the same bucket.

So now we write in our next message, and it hashes to the same bucket. At this point we’ve lost our knowledge of having ever seen the first message we processed.

[Diagram: our next item also hashes to bucket 3; we have now lost knowledge of having seen the previous item]
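
A minimal sketch of this structure – in Go purely for illustration, since the nsqphp implementations discussed below are PHP – hashing content to pick a bucket, comparing, then overwriting:

package main

import (
    "fmt"
    "hash/fnv"
)

type dedupe struct {
    buckets [][]byte
}

func newDedupe(n int) *dedupe {
    return &dedupe{buckets: make([][]byte, n)}
}

// Seen reports whether this exact content currently occupies its bucket
// (definitely seen before), then records it, overwriting on collision.
func (d *dedupe) Seen(content []byte) bool {
    h := fnv.New32a()
    h.Write(content)
    i := int(h.Sum32() % uint32(len(d.buckets)))
    seen := d.buckets[i] != nil && string(d.buckets[i]) == string(content)
    d.buckets[i] = content
    return seen
}

func main() {
    d := newDedupe(100)
    fmt.Println(d.Seen([]byte("msg1"))) // false – first sighting
    fmt.Println(d.Seen([]byte("msg1"))) // true – duplicate detected
}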

Deciding how big to make it

So with this data structure, we will lose knowledge of messages we have seen; however we can determine how quickly this happens by choosing the size of our hash map (how many buckets we have).

Intuitively, there is a trade off between the amount of space used and our ability to detect duplications. At one extreme, with 1 bucket, we can only ever de-duplicate if we receive messages in order. At the other extreme, with a huge number of buckets, we can nearly always de-duplicate (we are bounded by our hash function’s ability to determine unique values for different content).

To get a clearer picture, we can consider our implementation in terms of probability. Call P(x) the probability of having overwritten a stored message after x further messages have been written. Starting with a single message stored, the probability of overwriting this message with the next message (assuming a perfectly random hash function) is 1/N, where N is the number of buckets:

P(1) = 1/N

On our next go, the chances of us overwriting on this go is:

P(overwrite on exactly the 2nd go) = (1 - 1/N)(1/N)

This combines the probability of us _not_ having overwritten on the first go with the probability of overwriting this time. To get the probability of us having overwritten by this go, we simply add up:

P(2) = 1/N + (1 - 1/N)(1/N)

Our next go looks like this:

P(3) = 1/N + (1 - 1/N)(1/N) + (1 - 1/N)^2 (1/N)

And we can express this as a sum, for any given x (where x is the number of additional messages we’ve written into our hash map):

P(x) = (1/N) + (1 - 1/N)(1/N) + … + (1 - 1/N)^(x-1) (1/N)

Plotting this, for N=100, we get:

[Plot: probability of having overwritten a previously stored message after X further messages processed (x axis), for N = 100]

So what we are saying here is that with 100 buckets, after adding 459 additional messages, we are 99% certain to have overwritten our initial message and hence 99% certain that we won’t be able to de-duplicate this message if it turned up again.

We can work out the equation of this graph by summing the geometric series:

P(x) = 1 - (1 - 1/N)^x

We can visualise this as it varies with both N and X:

[Surface plot: the equation as it varies in X and N]

So suppose we have a stream running at 1,000 events per second, and we want to know how big N must be before the chance of having overwritten a message that arrived an hour earlier reaches 90% (y = 0.9, x = 1000*60*60):

0.9 = 1 - (1-1/N) ^ 3600000
0.1 = (1-1/N) ^ 3600000
0.999999360393234 = 1-1/N
1 / N = 0.000000639606766

So N = 1,563,461
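
Rearranging the equation for N gives N = 1 / (1 - (1-y)^(1/x)); a quick sketch to check the arithmetic:

package main

import (
    "fmt"
    "math"
)

func main() {
    y := 0.9           // probability of the entry having been overwritten
    x := 1000.0 * 3600 // messages arriving over the hour
    n := 1 / (1 - math.Pow(1-y, 1/x))
    fmt.Printf("N = %.0f\n", n) // prints N = 1563461
}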

NSQPHP implementation

The @jmhodges implementation of opposite of a Bloom filter has an atomic “check and set” to test membership. nsqphp ships with two implementations which implement the same basic interface. The first implementation runs in a single process (and hence doesn’t have to worry about this anyway – due to PHP’s lack of threads).

In this implementation I’m actually storing an MD5 of the entire content, to save space. This introduces a theoretical possibility of a false positive (saying it has seen a message when it hasn’t) should two different messages produce the same digest.

The second implementation uses Memcached to store the actual hash map; this completely ignores races on the basis that they will only mean we may not quite de-duplicate as many messages as we could have.

The only other complication is with failed messages; here we need to erase our knowledge of having “seen” a message. To achieve this we simply update our hash map so that the message we’re interested in is no longer the content within the hash bucket:

if (entry at hash index foo is our content) {
    overwrite with some placeholder (eg: "deleted")
}
posted on Monday 13th February 2012 by Dave

PHP Deployment with Capistrano

This blog post explains how to use Capistrano to deploy PHP, including some tips for integration with Jenkins.

Background

Back when I worked at Imagini, we used the home-baked Cruftly Cloudly deployment system (built by @pikesley) to roll releases. It had some nice features:

  • new code had to be pushed to staging first (as a release candidate)
  • final deploys were always against the exact release candidate
  • it would release to a bunch of machines in one hit, with symlinks switched at the end
  • from a developer perspective, it was “push button” (we just ran a script that requested a deploy of our code)
  • it presented an ASCII art dinosaur upon successful release

Once I’d moved on I was keen to recreate the Cloudly experience, but I didn’t want to have to hand-craft my own solution. Luckily Capistrano now exists which provides a very handy tool to deploy code from SCM to one or more servers.

[Image: Cruftosaurus]

What is Capistrano?

“Capistrano is a developer tool for deploying web applications”

Capistrano is written in Ruby and offers up a basic DSL from which you can craft quite flexible deployment scripts. Typically, the deploy process would be to deploy a particular version (branch, commit etc.) from SCM to one or more boxes (into a new folder on the server), then switch a symlink so that they all immediately run off the new code. Support for instant rollback is provided. That said, it’s very flexible. In my current setup I have it deploying to multiple environments (dev, staging, production), building code (think Phing), running tests on the servers before finalising the deploy and then restarting worker processes on completion.

All of this functionality is driven from a simple command line interface:

cap deploy
cap deploy:rollback

We can list all the available commands with:

cap -T
cap deploy               # Deploys your project.
cap deploy:check         # Test deployment dependencies.
cap deploy:cleanup       # Clean up old releases.
cap deploy:cold          # Deploys and starts a `cold' application.
cap deploy:migrations    # Deploy and run pending migrations.
cap deploy:pending       # Displays the commits since your last deploy.
cap deploy:pending:diff  # Displays the `diff' since your last deploy.
cap deploy:rollback      # Rolls back to a previous version and restarts.
cap deploy:rollback:code # Rolls back to the previously deployed version.
cap deploy:start         # Blank task exists as a hook into which to install ...
cap deploy:stop          # Blank task exists as a hook into which to install ...
cap deploy:symlink       # Updates the symlink to the most recently deployed ...
cap deploy:update        # Copies your project and updates the symlink.
cap deploy:update_code   # Copies your project to the remote servers.
cap deploy:upload        # Copy files to the currently deployed version.
cap deploy:web:disable   # Present a maintenance page to visitors.
cap deploy:web:enable    # Makes the application web-accessible again.
cap invoke               # Invoke a single command on the remote servers.
cap shell                # Begin an interactive Capistrano session.

Some tasks were not listed, either because they have no description,
or because they are only used internally by other tasks. To see all
tasks, type `cap -vT'.

Getting started

Step 1 – install capistrano.

apt-get install capistrano

Step 2 – “capify” one of your projects

cd /my/project/location
capify .

This step creates the files Capfile and config/deploy.rb.

Taking it over for PHP

Capistrano is pretty simple; these are the basics:

  • Capistrano runs on your local machine (it’s not a server-side thing)
  • A recipe is a bunch of named tasks that combine to define your deploy process
  • The “default” process is tailored for releasing Rails applications – therefore you’ll have to customise the recipes for PHP
  • Capistrano is built around the concept of roles, but for a simple PHP setup you can just have a “web” role and think of this as meaning “where I’m going to deploy to”

To learn more about the Capistrano deployment process, I found the execution-path diagram and the “Capistrano PHP” project useful (both are linked from the recipe below).

So, back to PHP. I followed the directions from the aptly named “Capistrano PHP” project. All we are doing is overriding the finalize_update and migrate tasks. The restart task is actually blank by default, so we only need to define it if we want it to do something. We’ll make it reload nginx (I symlink the nginx config from the deployed code).

## php cruft ##

# https://github.com/mpasternacki/capistrano-documentation-support-files/raw/master/default-execution-path/Capistrano%20Execution%20Path.jpg
# https://github.com/namics/capistrano-php

namespace :deploy do

  task :finalize_update, :except => { :no_release => true } do
    transaction do
      run "chmod -R g+w #{releases_path}/#{release_name}"
    end
  end

  task :migrate do
    # do nothing
  end

  task :restart, :except => { :no_release => true } do
    run "sudo service nginx reload"
  end
end

Multi-stage deployments

I needed to be able to deploy to different environments – dev, staging and production. The obvious starting point was Googling, which led to the Capistrano multistage extension. I worked through this for some time, however the requirement for an extra dependency seemed more complicated than necessary. The footnote on the multistage page offered an alternative – multiple stages without the multistage extension.

With this pattern, all we have to do is define extra tasks for each of our environments. Within these tasks we define the key information about the environment, namely the roles that we want to deploy to (which servers we have).

## multi-stage deploy process ##

task :dev do
  role :web, "dev.myproject.example.com", :primary => true
end

task :staging do
  role :web, "staging.myproject.example.com", :primary => true
end

task :production do
  role :web, "production.myproject.example.com", :primary => true
end

Now when we deploy we have to include an environment name in the command. I don’t bother defining a default, so leaving it out will throw an error (you could define a default if you wanted to).

cap staging deploy

Tag selection

The next feature I wanted from my killer deploy system was the ability to release specific versions. My plan was

  • Jenkins would automatically release master on every push
  • a separate Jenkins project would automatically tag and release production-ready “builds” to a staging environment (anything pushed to release branch)
  • releasing to production would always involve manually tagging with a friendly version number

Nathan Hoad had some good advice on releasing a specific tag via Capistrano – including a snippet that makes Capistrano ask you which tag to release, defaulting to the most recent. One change I made was the addition of the `unless exists?(:branch)` condition, which means we can set up dev and staging releases to go unsupervised.

## tag selection ##

# we will ask which tag to deploy; default = latest
# http://nathanhoad.net/deploy-from-a-git-tag-with-capistrano
set :branch do
  default_tag = `git describe --abbrev=0 --tags`.split("\n").last

  tag = Capistrano::CLI.ui.ask "Tag to deploy (make sure to push the tag first): [#{default_tag}] "
  tag = default_tag if tag.empty?
  tag
end unless exists?(:branch)

For staging, I use this handy bit of bash-fu, courtesy of @jameslnicholson (split across lines to aid readability):

set :branch, `git tag | \
    xargs -I@ git log --format=format:"%ci %h @%n" -1 @ | \
    sort | \
    awk '{print  $5}' | \
    egrep '^b[0-9]+$' | \
    tail -n 1`

Putting it all together

Here’s a fictitious deployment script for the “We Have Your Kidneys” ad network. There are some extra nuggets in here that are worth highlighting.

Run a task once deployment has finished:

after "deploy", :except => { :no_release => true } do

Run build script, including tests. This will abort the deployment if they do not pass:

# run our build script
run "echo '#{app_environment}' > #{releases_path}/#{release_name}/config/environment.txt"
run "cd #{releases_path}/#{release_name} && phing build"

The deployment script in full

# Foo Bar deployment script (Capistrano)

## basic setup stuff ##

# http://help.github.com/deploy-with-capistrano/
set :application, "Foo Bar PHP Service"
set :repository, "git@github.com:davegardnerisme/we-have-your-kidneys.git"
set :scm, "git"
default_run_options[:pty] = true
set :deploy_to, "/var/www/we-have-your-kidneys"

# use our keys, make sure we grab submodules, try to keep a remote cache
ssh_options[:forward_agent] = true
set :git_enable_submodules, 1
set :deploy_via, :remote_cache
set :use_sudo, false

## multi-stage deploy process ##

# simple version @todo make db settings environment specific
# https://github.com/capistrano/capistrano/wiki/2.x-Multiple-Stages-Without-Multistage-Extension

task :dev do
  role :web, "dev.davegardner.me.uk", :primary => true
  set :app_environment, "dev"
  # this is so we automatically deploy current master, without tagging
  set :branch, "master"
end

task :staging do
  role :web, "staging.davegardner.me.uk", :primary => true
  set :app_environment, "staging"
  # this is so we automatically deploy the latest numbered tag
  # (with staging releases we use incrementing build number tags)
  set :branch, `git tag | xargs -I@ git log --format=format:"%ci %h @%n" -1 @ | sort | awk '{print  $5}' | egrep '^b[0-9]+$' | tail -n 1`
end

task :production do
  role :web, "prod01.davegardner.me.uk", :primary => true
  role :web, "prod02.davegardner.me.uk"
  set :app_environment, "production"
end

## tag selection ##

# we will ask which tag to deploy; default = latest
# http://nathanhoad.net/deploy-from-a-git-tag-with-capistrano
set :branch do
  default_tag = `git describe --abbrev=0 --tags`.split("\n").last

  tag = Capistrano::CLI.ui.ask "Tag to deploy (make sure to push the tag first): [#{default_tag}] "
  tag = default_tag if tag.empty?
  tag
end unless exists?(:branch)

## php cruft ##

# https://github.com/mpasternacki/capistrano-documentation-support-files/raw/master/default-execution-path/Capistrano%20Execution%20Path.jpg
# https://github.com/namics/capistrano-php

namespace :deploy do

  task :finalize_update, :except => { :no_release => true } do
    transaction do
      run "chmod -R g+w #{releases_path}/#{release_name}"
      # run our build script
      run "echo '#{app_environment}' > #{releases_path}/#{release_name}/config/environment.txt"
      run "cd #{releases_path}/#{release_name} && phing build"
    end
  end

  task :migrate do
    # do nothing
  end

  task :restart, :except => { :no_release => true } do
    # reload nginx config
    run "sudo service nginx reload"
  end

  after "deploy", :except => { :no_release => true } do
    run "cd #{releases_path}/#{release_name} && phing spawn-workers > /dev/null 2>&1 &", :pty => false
  end
end
posted on Wednesday 27th July 2011 by Dave

A mapper pattern for PHP

The “mapper” pattern allows us to either:

  • map an object graph to some representation, or
  • map a representation to an object graph

This sounds very much like straight-forward serialisation, but there are some key differences.

  1. It is not necessarily two-way
    It is possible to map from an object graph to a representation that does not contain all the information to reconstruct an object graph. An example might be a case where we map a user’s “screen name” out of the user object; a useful piece of information, but not enough to construct a fully-formed user object on its own.
  2. The representation is flexible
    It is possible to write mappers that map to and from various different representation formats (eg: JSON, XML, Cassandra Mutation Map). PHP’s in-built serialisation has a fixed end-result, determined by the PHP engine itself.
  3. There are a number of added-extras
    These include in-built caching, the ability to override values and provide defaults. More on this later.

How they fit into the overall architecture

I’m a huge fan of Domain Driven Design. When implementing a new set of functionality, I usually start with CRC cards, formulate a design for the domain objects and then start on prototyping in conjunction with unit testing. Aside – I generally don’t practice Test Driven Development, although I will hopefully try it out at some point.

The mappers come into play when creating representations of domain objects or creating an object graph from representations.

[Diagram: how mappers can be used in the overall architecture]

Web Service

RESTful Web Service GET requests can be created by mapping domain object graphs to representations (JSON, XML, HTML). The schema (more on this later) allows intricate control over exactly which bits of the domain object graph are mapped, allowing for a variety of different representations of the same domain objects. I’m ignoring HATEOAS for the purposes of this post; it can easily be added.

RESTful Web Service POST and PUT requests can be handled by mapping the POSTed or PUT representations into a domain object graph and then saving these via the persistence layer.

Persistence

Retrieving domain objects from a persistence service (loading) can be facilitated by the representation to object mapper. One example of a specific implementation is a Cassandra Mutation Map (a representation) to object mapper.

Persisting domain objects (saving) can be facilitated by the object to representation mapper, for example mapping to a Cassandra Mutation Map or SQL.

Building blocks of the domain layer

There are many ways that this mapper pattern could be implemented. The implementation I have created relies on a number of consistent design principles within the domain layer. The important building blocks are outlined below.

Keyed objects, value objects, collection objects

We can construct our domain object based around three core object types. Any concrete domain objects therefore share functionality of one of these types and implement a common interface.

  • Keyed objects
    Domain objects that have a uniquely identifiable key – globally unique within the application. These objects are stored in an Identity Map to make sure only one instance for each unique key value is ever created.
  • Value objects
    Domain objects that do not have a uniquely identifiable key. These objects cannot be stored in an Identity Map, nor would it make sense to do so.
  • Collections
    Domain objects that represent a list of other objects. These, at their most basic level, implement the Iterator interface. Each different type of domain object will have its own accompanying collection object.

Virtual-proxy pattern for lazy-loading

All keyed domain objects can be replaced with a virtual-proxy. This behaves like the original object (implements the same interface) but only loads itself from the database at the last minute when needed.

Getters

All objects have a consistently named “getter” defined. I used to think this was a bad idea but I have since mellowed in my opinion (setters are still evil).

Object to representation mapper

The aim here is to turn an object graph into a representation. An example:

interface user {
    public function getUsername();
    public function getName();
    public function getRegisteredTimestamp();
}

We then map this to a JSON representation using the schema:

$schema = array(
    'username',
    'name',
    'registeredTimestamp'
    );

Resulting in:

{"username":"davegardnerisme","name":"Dave Gardner","registeredTimestamp":1284850800}

Schemas

Object graphs nearly always have great complexity, often involving circular relationships. When mapping to representations we often don’t want all this complexity, nor would it be feasible to include it all! With a rich domain layer built up from many contained lazy-loading objects, if you continue to dig down into the object graph you could end up loading every single object that exists within your application!

This is why we use schemas when mapping from objects to representations – we need to choose what to actually map. What we are really doing is identifying how far to dip into the object graph when formulating the representation.

How it works

The implementation of the object-to-representation concept has a number of key elements:

  1. Object graph walker
    Aware of how to walk through the object graph, according to the given schema, drilling into any contained objects and collections.
  2. Object property extractor
    Able to extract a property from an object according to a schema.
  3. Property to value convertor
    Able to turn a retrieved property into a scalar value (string/integer/float).

Pseudo code

We run this code passing in the object to map from plus the schema. The code is structured to be run recursively, building the output array (passed by reference) as it goes.

foreach (entry in schema)
{
    if (schema entry indicates we should map to a list)
    {
        assert (we have a list)
        foreach (item in list)
        {
            call this recursively with this item and the sub-selection of schema
        }
    }
    else
    {
        extract property of object according to the schema
        convert this property to a scalar value
        add this property to the mapped-to data
    }
}

Caching

One interesting thing about the object to representation mapper is that you can add in a caching layer which avoids, in many cases, the need to actually carry out the mapping. This is particularly effective when complex object graphs are built from immutable objects. The reason is that we don’t actually need to load the objects themselves from the database; we can use the virtual-proxy key to give us a cache key and then simply load the representation directly. With PHP it’s usually a good idea to use APC to cache representations (over Memcached) to avoid Memcached’s 1MB item limit. This makes it more effective when mapping/caching large object graphs to large representations.

A caching system can be added into the mapping code:

if (object to map is a keyed object)
{
    cache key = hash on (schema + object key)
    if (!exists in cache)
    {
        map as normal
    }
    else
    {
        return the cached representation
    }
}

Object verification

The domain layer is built upon the principle of lazy-loading, making use of the virtual-proxy pattern. These virtual-proxies will initialise a concrete object when they need to. Remember that a virtual-proxy has a property that indicates some kind of unique identifier for the object, and it knows how to load itself. Therefore when mapping, if we need to turn a virtual-proxy object into a single value, we don’t need to load it, since we already have the unique identifier. This is a nice optimisation. However, this is not always desirable.

Sometimes you want to guarantee that objects exist, by forcing any objects touched via the mapper to be loaded from the database. This is where the verification feature comes in.

Overrides

We can tweak the final representation by adding overrides. These are a way of saying “please ignore any value that could be extracted from the object and use this object/callback instead”. One interesting way I have used these is to attach URL properties to domain objects. URLs are a property that does not usually belong in the problem domain, but rather are concerned with a specific representation scheme – HTTP. Therefore the overrides can be added within the web service layer to allow us to map a URL for a domain object within a specific context. One interesting point to note, required by the URL example, is that the override doesn’t actually have to override anything. The original object does not necessarily have to have this property in the first place.

Example

All these examples use the domain objects from my GlastoFinder project. You can check out the interfaces for these in the last section of this post.

Mapping the list of places to JSON representation

$schema = array(
    service_mapper::LOOP_ITEMS => array(
        'key',
        'title',
        'category' => array(
            'key',
            'title'
            ),
        'location' => array(
            'latitude',
            'longitude'
            ),
        'icon',
        'hashTag',
        'details'
        )
    );
$mapper = $this->diContainer->getInstance(
        'service_mapper_objectToArray',
        $schema
        );
$json = $mapper->map($list);

Sample output:

[
    {
        "key": "0f45b80d-96d5-546d-bade-c2e583489783",
        "title": "Poetry & Words",
        "category": {
            "key": "other_venues",
            "title": "Other Venues"
        },
        "location": {
            "latitude": 51.149364968572,
            "longitude": -2.5799948897032
        },
        "icon": "/i/otherstage-sm.png",
        "hashTag": "poetryandwords",
        "details": null
    },
    {
        "key": "10b11ab0-e164-5ac7-8ea4-ea5670cdf54e",
        "title": "Pedestrian Gate E",
        "category": {
            "key": "gates",
            "title": "Gates"
        },
        "location": {
            "latitude": 51.147641084707,
            "longitude": -2.6001275707455
        },
        "icon": "/i/gate-sm.png",
        "hashTag": "gatee",
        "details": null
    }
]

Representation to object mapper

This is the complete opposite of the object to representation mapper. We take some representation and then turn this into an object graph; potentially constructed of many related objects. An example:

Representation:

{"username":"davegardnerisme","name":"Dave Gardner","registeredTimestamp":1284850800}

This will be able to yield us a user object with the following constructor:

class user {
/**
 * Constructor
 *
 * @param string $username The user's username - an alphanumeric (a-zA-Z0-9) string, unique
 * @param string $name The user's full name, forename plus surname, or however they want to name themselves
 * @param integer $registeredTimestamp The date this user was registered
 */
public function __construct(
    $username,
    $name,
    $registeredTimestamp
    ) {
        // ... assign properties here
    }
}

The mapper, when asked to construct a user object, will examine the values needed by the object (by looking at its constructor) and then extract these properties from the representation. Unlike the object to representation mapper, no schema is needed. The schema is inherent in the object definitions themselves.

How it works

The implementation of the representation-to-object concept has a number of key elements:

  1. Code analysis tool
    Ideally a static analysis tool to avoid the cost of reflection. This would know what parameters are needed to construct each domain object.
  2. Recursive object-building code
    Able to determine what type of thing to build (depends on what value available in the representation) and then build it.

Pseudo code

We run this code passing in the representation to map from. The code is structured to be run recursively, building up the object graph as it goes.

decide which value to use by looking at representation, defaults and overrides
if (value to use is a scalar [string, int, float])
{
    build placeholder domain object
}
else
{
    build full domain object
}

if (thing to build is list)
{
    foreach (item within representation)
    {
        call function recursively to build item, then add to list
    }
}

verify all built objects, if required

Overrides and defaults

When we map from a representation to an object, it’s often useful to be able to tweak the process, supplying default values where necessary and overrides in certain situations. To understand a use-case, it’s important to understand the principle of “fully formed objects”. This means that whenever we create an object, we always supply every piece of information to the object constructor, meaning that if ever we have an object instance, we know that all the data is present and correct. So for example a user object may have a createdTimestamp; this should be supplied, as a valid value, whenever we construct the object. To adhere to this, we could not pass in a NULL value and leave it to the object to provide a default value (eg: now). Instead we should use a Factory for this, or we could use a mapper default!

If we had a web service end point that created a new user, we may require a representation (eg: JSON) to be PUT. However what about the createdTimestamp? Should this be in the PUT representation? My thinking is that no, it shouldn’t. Instead we use the mapper override feature to force the createdTimestamp to be exactly the time that the user was created, according to the server processing the request.

The following illustrates defining an override callback to force a createdTimestamp to be defined at time of mapping. This makes use of PHP 5.3 anonymous functions.

$mapper = $diContainer->getInstance('service_mapper_jsonToObject');
$mapper->addOverride(
    array('createdTimestamp'),
    function () { return time(); }
    );

Rules for building domain objects

Placeholders (virtual-proxy) vs full domain objects

The mapping algorithm ultimately boils down to a situation where we need to build some object, based on some value. This value could be a scalar (string, integer, float) or an array. We need some rules to determine how we go about building a domain object based on the situation we find, specifically what type of value we have.

  • Scalar – string, integer or float
    When we are asked to build a domain object and we find a scalar in the representation, we make the assumption that this value refers to some unique identifier for the object and therefore build a placeholder object (virtual-proxy) instead.
  • Associative array
    When we are asked to build a domain object and we are presented with an array of values, we make the assumption that all of the necessary constructor properties are present in the representation. We then dig through these values and match them up with constructor arguments, before finally constructing the fully-formed domain object.

Collections – recursion

Whenever we are building a collection object, we look for a numerically-indexed array of items within the representation. We then call the “turn a value into a domain object” method recursively to yield us objects to push onto our list. The assumption here is that collections are always represented as an Iterable object within the domain layer.

Verification

When we carry out the mapping, we may create any number of placeholder (virtual-proxy) objects in place of real domain objects. These will then be lazy-loaded on demand. This is all well and good, but not if you want to ensure that all the objects are valid. Luckily it is trivial for the mapper to keep track of any placeholders it mints during its mapping job. With the verification flag set, this list of placeholders can then be force-loaded to ensure that they actually exist. Whether or not this is desirable depends on the situation: when accepting PUT or POSTed representations via a web service the safety net is useful; when mapping from a database representation we often don’t want the overhead.

GlastoFinder domain object reference

Location interface

interface domain_location_interface
{
    /**
     * Get latitude
     *
     * @return float
     */
    public function getLatitude();

    /**
     * Get longitude
     *
     * @return float
     */
    public function getLongitude();

    /**
     * Get distance to another location
     *
     * Uses "haversine" formula
     * @see http://www.movable-type.co.uk/scripts/latlong.html
     *
     * @param mz_domain_location $otherLocation The other location
     *
     * @return float The distance measured in kilometres
     */
    public function getDistanceTo(mz_domain_location $otherLocation);
}

Place interface

interface domain_place_interface
        extends     domain_keyed_interface,
                    domain_hasLocation_interface
{
    /**
     * Get title
     *
     * @return string
     */
    public function getTitle();

    /**
     * Get category
     *
     * @return domain_place_category_interface
     */
    public function getCategory();

    /**
     * Get icon
     *
     * @return string
     */
    public function getIcon();

    /**
     * Get hash tag
     *
     * @return domain_hashTag_interface
     */
    public function getHashTag();

    /**
     * Get details
     *
     * @return string
     */
    public function getDetails();
}
posted on Monday 25th July 2011 by Dave

GlastoFinder – writing a mobile application

Have you ever been to Glastonbury? It’s massive. Seriously huge. This is a fantastic feature – there’s loads to do and you can wander aimlessly for hours and hours. However! If you’re trying to find your friends it’s a problem. Figuring out where everyone is can be tricky.

The solution, to a web technologist like myself, was obvious – build a website! As they say, when all you have is a hammer… And so GlastoFinder was born.

[Image: looking down on Glastonbury]

Interesting features of the build

This project was completed quickly. It leaned on a ton of features and patterns I have been developing over a number of years, plus some new tricks (EC2)!

EC2

This was my first personal project that made use of Amazon’s Elastic Compute Cloud (EC2). The process felt good. I created an instance and then did (roughly) this:

svn co http://svn.davegardner.me.uk/path/to/repo glastofinder_com
cd glastofinder_com
./scripts/install-build-dependencies
phing build

This feels good. Everything is scripted. Once I finished with the project, I simply shut down the instance. This makes use of the Phing build tool, which is mentioned in one of my previous posts.

Building on top of Twitter

This is the first time I’ve built an application on top of Twitter. What I mean by that is that the application relies on Twitter to operate and to supply some of its features (Tweet-to-check-in). My experience is broadly positive. Leveraging an existing network removes the need to build boring sign up, sign in and “follow” functionality. My only grumble is that I think the Twitter API could be less restrictive.

Using jQuery mobile for the frontend

This was picked at the last minute to get a vaguely professional frontend in a short amount of time. Whether it achieved this lofty goal is up for debate.

Solid architecture principles

The backend (PHP) makes use of a number of nice patterns that are worth a mention.

In particular, the interplay between a strong and pure domain layer, the use of Data Access Objects (DAOs) for persistence and the mapper pattern to generate Web Service representations worked very well indeed.

Field testing

There’s nothing like a baptism of fire. I discovered at 10pm on the Tuesday before Glastonbury (I had to leave my house at 7am the next morning) that Twitter search did not include results from new users (most of my friends). There was much last-minute hackery to implement a workaround. Testing is always an important activity, and mimicking the conditions that the application will operate under is especially important to ensure things work smoothly “in the field”.

I would say that overall the app worked “acceptably”. There were a lot of lessons learned from actually being at the festival!

Things that were a problem

  • Patchy network
    This probably had the biggest impact. The network was not unusable, merely patchy and often slow. This meant that the more advanced features (Google Map) were much less useful than the light-weight features (timeline / check-in via Twitter).
  • Setup process difficult
    We didn’t invest an enormous amount of energy trying to make this smooth. The idea was that you would choose which of your Twitter friends could track your location. This avoided a blanket approach but was fiddly to setup and the UX was less than obvious.
  • Twitter search is heavy cruft
    It’s not possible to build an app that relies on Twitter search. You are not guaranteed to get a message appearing in Twitter search and new users will be absent completely for a few days. Therefore you cannot simply sign-up to the app as a new user and start using it straight away. This would be a massive problem. There are other issues with Twitter search like the fact that the from_user_id is not actually a valid Twitter user ID.

Things that worked well

  • Timeline
    The timeline view was great. Firstly because you could see quickly both where people were and also who was actively using the app. Secondly because of the amount of information crammed into each entry (where someone was, when, where they were heading off to, what they were up to). Thirdly because it was fast!
  • Check-in and check-out via Twitter
    Despite the failings of Twitter search, this still turned out to be an amazing feature. It was super fast to check-in via Twitter at one of the many hash-tagged venues – especially at busy times when 3G didn’t work (Tweet via SMS).

Potential improvements

  • Crowd-sourced locations
    Each Glastonbury, things pop up that you didn’t know about before. Epic food stalls set up shop and become instant favourites (like the cheese on toast stand). Allowing users to add and share places would be a good addition.
  • Use DataSift!
    I know these guys from Cassandra London and their product efficiently processes the Twitter “firehose”. Using this instead of the Twitter search API would save lots of effort.
  • Build a native app
    This is something I was against when I started the project. However a native app would have a few key advantages over a website-based implementation. The major advantage would be that the map could be cached within the app itself, making it much more usable at the festival. This would not be without its drawbacks. The current implementation (based around Google Maps) lends itself to scalability – it would be relatively simple to add lots of other events, simply using Google Maps and a crowd-sourced locations database.

The results

I decided to analyse my movements at the festival based on my check-ins and check-outs. As a starter for 10, I’ve just pulled out all the check-ins by day. If I get time I’ll mash this up with a map so you can see how much ground I’ve covered and/or if the geo-location was accurate!

Wednesday

  • It’s 6:59 and the computer is going off now. Leaving the house in ~ 40 minutes. Any bugs are now classified as "features". 159km from the festival at 7:00am
  • At Waterloo! 176km from the festival at 8:30am
  • Slow progress. 171km from the festival at 10:29am
  • Services! 85km from the festival at 11:45am
  • Holy moly I’m getting near. Last 3G signal? 44km from the festival at 1:07pm
  • Frome now. The end is in sight. And it has a huge black cloud over it. 25km from the festival at 1:41pm
  • Epic queues at the gate. FFS! Very near Bus Station at 2:22pm
  • Oh yeah! Arrived at #busstation #glasto #checkin /cc @glastonick @sdiddy At Bus Station at 2:25pm
  • I can _see_ gate a. But I can also see a lot of people between me and gate a. Very near Bus Station at 3:02pm
  • #checkin #glasto #gatea Finally! At Pedestrian Gate A at 3:30pm
  • On way shopping passing brielfy #glade #checkin #glasto. At The Glade at 5:20pm
  • Passing #tinyteatent on way for food. #checkin #glasto At The Tiny Tea Tent at 8:55pm
  • Cider, but not at a bus. Rather Brothers cider. Very near Pie Minister at 9:10pm
  • Brother’s. Oh yeah. Very near Pie Minister at 9:29pm
  • This is all good. Fire time at glasto. And the Internet work. 170m from HMS Sweet Charity at 10:56pm
  • This is the camp site. 170m from HMS Sweet Charity at 11:14pm

Thursday

  • Where am I? Campsite. Late-o-clock with BHL Very near Le Grand Bouffe at 12:48am
  • Breakfast with BHL and friends. Very much the raw prawn this morning. Very near Le Grand Bouffe at 11:08am
  • Milling about. Wondering without aim! Very near Reggae Delights (TBC) at 12:23pm
  • Sunbathing at #pyramid stage. #glasto #checkin At Pyramid at 1:40pm
  • Still hanging out at #pyramid waiting for @sdiddy to get the hell out of his van! #checkin #glasto At Pyramid at 2:20pm
  • Watching first live music at the bandstand! Yay! 170m from HMS Sweet Charity at 4:10pm
  • Having tea at the #tinyteatent – will be here for a bit. #glasto #checkin At The Tiny Tea Tent at 5:00pm
  • Near the #glade watching the band who opened first #glasto at spirit of 71 stage. #checkin At The Glade at 6:05pm
  • Still at #parkstage but heading off soon. #glasto #checkin At The Park Stage at 10:25pm

Friday

  • Wakey wakey rise and shine! At #pyramid with @terrahawkes for first band of the day. #glasto #checkin At Pyramid at 11:00am
  • At #westholts watching 30 drummers tearing it up. #glasto #checkin At West Holts at 12:30pm
  • Passing #tinyteatent on way to Avalon. #glasto #checkin At The Tiny Tea Tent at 1:40pm
  • Found real ale. At a bar. Neat Avalon I think. Hanging here. Is good. I have zero faith in GPS location ability! 170m from HMS Sweet Charity at 2:09pm
  • Listening to the mad hatters (i think) tear it up at #avalon. #glasto #checkin At Avalon at 2:45pm
  • It’s official. @terrahawkes is a bad influence. At #avalon still. Good music here. #glasto #checkin At Avalon at 3:45pm
  • Watching BB King at #pyramid #glasto #checkin At Pyramid at 4:55pm

Saturday

  • Gentle introduction to the day at #tinyteatent. #glasto #checkin At The Tiny Tea Tent at 11:00am
  • Managed to watch a band twice. #checkin #glade #glasto At The Glade at 12:35pm
  • At #dance (west) for fujiya miyagi. #checkin #glasto At Dance Village at 1:05pm
  • At #Avalon watching someone who is the champion of the world. #checkin #glasto At Avalon at 2:25pm
  • In #avalon cafe with a coffee and the paper. Hedonistic! #glasto #checkin At Avalon at 2:40pm
  • At #westholts with @sdiddy going to get a cider. The sun is out again! #glasto #checkin At West Holts at 3:15pm
  • Still at #westholts. #checkin #glasto At West Holts at 4:20pm
  • I appear to have not left #westholts yet. I blame @sdiddy. #checkin #glasto At West Holts at 5:15pm
  • _still_ at #westholts. The sun is hot. I have a ukekele. All is good. /cc @Pikesley (start a band?) #glasto #checkin At West Holts at 5:30pm
  • At #pyramid to see some band @sdiddy wants to see! #checkin #glasto At Pyramid at 6:05pm
  • About to peak? #pyramid #glasto #checkin At Pyramid at 7:25pm
  • Holy moly, it’s Elbow at #pyramid and I’ve just sung happy birthday! #checkin #glasto At Pyramid at 9:10pm
  • Tell a lie. I mean #Avalon. #checkin #glasto At Avalon at 10:55pm

Sunday

  • At #pyramid, so early the litter-pickers are still going and there’s no music. #checkin #glasto At Pyramid at 10:00am
  • Watching fisherman’s friends. #checkin #glasto #pyramid At Pyramid at 11:05am
  • Still at #pyramid. Staying for Paul Simon. #glasto #checkin /cc @sdiddy At Pyramid at 2:20pm
  • Still at #pyramid. Paul Simon on next. #glasto #checkin At Pyramid at 3:55pm
  • At #Avalon for Show of Hands. #glasto #checkin At Avalon at 6:05pm
  • At #pyramid waiting for Beyoncé to start. #checkin #glasto At Pyramid at 9:55pm

Monday

  • Heading out of #gatea. #checkin #glasto At Pedestrian Gate A at 7:40am
  • 176km from the festival at 12:27pm

A final word

It was great fun building a mobile app. I’m a firm believer that you should try to work on new things all the time to build your knowledge. A big thank you to my colleagues for the endless discussions and in particular @oksannnna for the pin icons and @paugay for all the frontend work.

posted on Sunday 21st November 2010 by Dave

Why you should always use PHP interfaces

This post was sparked by a very simple question from an ex-colleague:

Why bother with PHP interfaces?

The subtext here was that the classes themselves contain the interface definition (through the methods they define) and hence interfaces are not needed, particularly where there is only one class that implements a given interface. One response to this question would be “hang on, let me Google that for you”. However the question got me thinking about how best to demonstrate the purpose and power of interfaces. Plus there isn’t that much good stuff out there about the use of PHP interfaces.

Why interfaces?

1. It helps you think about things in the right way

Object oriented design is hard. It’s usually harder for programmers, like myself, who started out in a procedural world. Allen Holub sums it up nicely in the primer section of his book Holub on Patterns (which includes much of this article from Javaworld) when he explains how people often think that an “object is a datastructure of some sort combined with a set of functions”. As Holub points out – this is incorrect.

First and foremost, an object is a collection of capabilities. An object is defined by what it can do, not by how it does it — and the data is part of “how it does it.” In practical terms, this means that an object is defined by the messages it can receive and send; the methods that handle these messages comprise the object’s sole interface to the outer world.

It’s a subtle distinction, especially for those of us with a procedural background. Thinking in terms of these messages is hard, which is exactly why static methods in PHP and “God” objects feel so much more familiar. Making wide use of interfaces when building an application is an easy way of getting out of these bad habits.

When I’m trying to design an OO system I usually start with Cunningham and Beck’s CRC cards (Class, Responsibility, Collaboration). This approach involves writing down on a separate index card (or small piece of paper) the name of each class, what its responsibility is and which other classes it collaborates with. From here, I then define the public interface of each class (the methods that handle the messages it can send and receive).
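As a hedged sketch of how a card translates into code (this example is mine, not from the original design), a CRC card reading “class: invitation; responsibility: knows whether it has been accepted; collaborators: user, inviteList” maps almost mechanically onto an interface:

// Hypothetical example: the public interface falls straight out of the CRC card.
interface invitationInterface
{
    public function accept($user);
    public function isAccepted();
}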

Thinking in the right way is the biggest advantage of interfaces, because it is the hardest thing to get right.

2. It makes for more future-proof code

Having a separate interface for every domain object may at first seem like overkill. Why bother? One reason is the flexibility it gives you going forward. An example of this is the Virtual Proxy lazy-load pattern. This relies on a placeholder object having exactly the same interface as a real domain object. Whenever one of the Virtual Proxy’s methods is called, it will lazy-load the actual object itself and then pass the message through. Stefan Priebsch recently gave a talk entitled “A new approach to object persistence”, which includes this pattern and is well worth a read.

The interesting thing here is that by making classes that collaborate with a given class rely only on an interface, rather than a concrete class, we can implement a Virtual Proxy pattern at any time, safe in the knowledge that none of our application will break!

// this is good - userInterface is an interface name
public function sendReminder(userInterface $user)
{
    // do something
}
// this is less good! user is a specific class implementation
public function sendReminder(user $user)
{
    // do something
}

The following simple example shows the Virtual Proxy pattern in action.

// our interface
interface userInterface
{
    public function isMember();
}

// our concrete user class
class user implements userInterface
{
    private $paidUntil;

    public function __construct($paidUntil)
    {
        $this->paidUntil = $paidUntil;
    }
    public function isMember()
    {
        return $this->paidUntil > time();
    }
}

// our proxy
class userProxy implements userInterface
{
    private $dao;
    private $user;
    private $userId;

    public function __construct($dao, $userId)
    {
        $this->dao = $dao;
        $this->userId = $userId;
        // set user to NULL to indicate we haven't loaded yet
        $this->user = NULL;
    }

    public function isMember()
    {
        if ($this->user === NULL)
        {
            $this->lazyLoad();
        }
        return $this->user->isMember();
    }

    private function lazyLoad()
    {
        $this->user = $this->dao->getById($this->userId);
    }
}

Now let’s assume we have some other part of our code that collaborates with the user object to find out if someone is a member.

class inviteList
{
    public function add(userInterface $user)
    {
        if (!$user->isMember())
        {
            throw new youMustPayException('You must be a member to get an invite.');
        }
        // do something else
    }
}

Our inviteList class does not care what type of object it receives (proxy or real); it simply requires something that implements userInterface.

The same argument holds true for the Decorator pattern; by programming to interface each collaborator doesn’t care which concrete implementation (the outer-most decorated layer) it is given, as long as it adheres to the right interface.
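As a quick sketch of the Decorator case (again illustrative; loggingUser and the injected logger are hypothetical names), any userInterface implementation can be wrapped without collaborators noticing:

// Hypothetical decorator: wraps any userInterface implementation and
// logs each membership check before delegating to the wrapped object.
class loggingUser implements userInterface
{
    private $user;
    private $logger;

    public function __construct(userInterface $user, $logger)
    {
        $this->user = $user;
        $this->logger = $logger;
    }

    public function isMember()
    {
        $this->logger->log('membership check');
        return $this->user->isMember();
    }
}

// inviteList accepts it unchanged:
// $inviteList->add(new loggingUser(new userProxy($dao, $userId), $logger));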

In a nutshell: always program to interface.

posted on Friday 2nd July 2010 by Dave

PHP and Cassandra

Yesterday (1st July) I presented for the first time at the PHP London user group. It was a gentle introduction; a five minute “lightning” talk slot. I spoke about Cassandra, giving a short introduction to using it with PHP.

To summarise my main points from the talk (perhaps something I should have done in the talk!):

  • Cassandra is a “highly scalable second-generation distributed database”
  • It can be considered a schema-less database insofar as each row can have different columns (see the sketch after this list)
  • Cassandra is designed to be both fault tolerant and horizontally scalable – both read and write throughput go up linearly as more boxes are added to the cluster
  • I think the best way of accessing Cassandra from PHP is directly via the Thrift API. This allows a beginner to learn about the core functionality of Cassandra including its limitations
  • Cassandra has Hadoop support which means that Hadoop Map Reduce jobs (a scalable, distributed mechanism for processing data) can read and write to Cassandra*
  • Cassandra does not have any query language (as opposed to MySQL or MongoDB which both allow you to query data in different ways)
  • When designing your data model, I think it’s easiest to try to forget about SQL and concentrate on how Cassandra works (don’t design a relational schema and then “port” it over)

* As of version 0.7!
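
To make the schema-less point concrete, here is a rough illustration of the data model in PHP terms (my own sketch, not tied to any particular client library): each row is just a map of column names to values, and two rows in the same column family can carry entirely different columns.

// Illustrative only: two rows in one column family with different columns.
$rows = array(
    'user:1001' => array(
        'name'  => 'Dave',
        'email' => 'dave@example.com',
    ),
    'user:1002' => array(
        'name'       => 'Terra',
        'twitter'    => 'terrahawkes',
        'last_login' => '2010-07-01 09:00:00',
    ),
);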

Overall, I think Cassandra is a very useful tool. Whether it fits your use case or not is another matter!

If you’re interested in learning more about using Cassandra in a PHP project, I recommend the following starting points:

  1. Using Cassandra with PHP
    https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP
  2. WTF is a SuperColumn? An Intro to the Cassandra Data Model
    http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model

posted on Monday 22nd March 2010 by Dave

Caching dependency-injected objects

This blog post talks about caching and retrieving objects in PHP (eg: via Memcache) where the objects themselves have a number of injected dependencies. It includes using the PHP magic methods __sleep and __wakeup to manage serialisation. It also discusses mechanisms for re-injecting dependencies on wakeup via a method that maintains Inversion of Control (IoC).

Sample system

To illustrate the idea, I’ll use a simple domain model where we have a userList object (iterator) containing a number of user objects. Each user has an injected userDao dependency which is used for lazy-loading usageHistory on request.

class userList implements Iterator, Countable
{
    public function current() { }

    public function key() { }

    public function next() { }

    public function rewind() { }

    public function valid() { }

    public function count() { }
}

class user
{
    private $dao;
    private $usageHistory;

    public function __construct($dao, $userDataRow)
    {
        $this->dao = $dao;
        $this->usageHistory = NULL;
    }

    public function getUsageHistory()
    {
        if ($this->usageHistory === NULL)
        {
            $this->usageHistory = $this->dao->lazyLoadHistory($this);
        }
        return $this->usageHistory;
    }
}

class userDao
{
    public function __construct($database, $cache, $logger) { }

    public function getList() { }

    public function lazyLoadHistory() { }
}

class usageHistory
{
}

Dependency Injection

A sample invocation of this simple system might be to ask the DAO for a user list object. To create a DAO object we will almost certainly need to pass in a bunch of dependencies such as database services, caching services and logging services.

I’m using a DI container to create objects. To get a really quick idea of what these are about you can imagine doing this:

$diContainer = new diContainer();
$userDao = $diContainer->getInstance('userDao');

Instead of this:

$configuration = new systemConfig();

$database = new mysqlDatabaseConnection($configuration);
$cache = new memcacheConnection($configuration);
$logger = new firebugLogger();

$userDao = new userDao($database, $cache, $logger);
$userList = $userDao->getList();

The key idea is that the DI container will build the object graph for you. For each dependency needed it will go away and fetch that, building any other dependencies of those objects and so on recursively up the tree.

I’m using an annotation system to power my own DI container, making the whole process simple and configuration-light.
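
To sketch the recursive idea (the container described above uses annotations; this minimal illustration uses constructor type hints instead, and assumes every constructor parameter is a type-hinted service):

// Minimal illustrative DI container: recursively builds constructor
// dependencies based on type hints. A real container would also handle
// interface-to-class mapping, shared instances and configuration values.
class diContainer
{
    public function getInstance($className)
    {
        $reflection  = new ReflectionClass($className);
        $constructor = $reflection->getConstructor();

        if ($constructor === NULL)
        {
            return new $className();
        }

        $arguments = array();
        foreach ($constructor->getParameters() as $parameter)
        {
            // recurse: build each type-hinted dependency first
            $arguments[] = $this->getInstance($parameter->getClass()->getName());
        }

        return $reflection->newInstanceArgs($arguments);
    }
}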

Putting objects to sleep

Caching is a very handy tool to improve the performance of applications. Storing objects in a cache (for example Memcache) prevents us having to go to the database each time. Memcache is a very simple system; a key-value store. You give it some data (less than 1MB) and it stores it for you until you ask for it again. Storing objects is slightly more complex than simple strings; with objects you need to serialise them. Memcache actually does this for you (you don’t need to call serialize() first).
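
For example, caching an object with the pecl/memcache extension is a one-liner in each direction (a minimal sketch; the key name is arbitrary):

// Minimal sketch using the pecl/memcache extension.
$memcache = new Memcache();
$memcache->connect('localhost', 11211);

// store the object for an hour; serialisation happens automatically
$memcache->set('user:42', $user, 0, 3600);

// later: returns an unserialised user object (or FALSE on a miss)
$cachedUser = $memcache->get('user:42');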

However caching objects can be problematic. Whenever you start to really use the power of OOP you inevitably end up with complex object graphs. Our user object, for example, contains a userDao object. This in turn contains a database service object, a cache service object and a logging service object. Some of these objects have their own dependencies! For example the database service object contains a configuration object.

The key point here is that by default, when we serialise a user object we will be serialising all the internal properties, including all the dependencies. This is undesirable.

This is where PHP’s built-in magic __sleep method comes to the rescue. Using __sleep we can tell PHP what we do want to store. Let’s assume our user object has the following properties:

class user
{
    private $dao;
    private $name;
    private $age;
    private $email;
    private $phoneNumber;
    private $usageHistory;
}

What we’ll do is tell PHP what we want to save.

class user
{
    public function __sleep()
    {
        return array('name', 'age', 'email', 'phoneNumber');
    }
}

Now we can serialise and/or cache objects without the overhead of complex dependency graphs.
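
A quick round trip (illustrative) shows the effect:

$serialised = serialize($user);   // __sleep() limits what is stored
$restored   = unserialize($serialised);
// name, age, email and phoneNumber are intact; dao and usageHistory
// are back at their declared defaults (NULL) and must be re-injected.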

Waking objects up

When it comes to restoring objects, for example via Memcache::get or via unserialize(), we will end up with a user object that has a valid name, age, email and phoneNumber property. What we won’t have is the DAO dependency or the usageHistory property. It is important to realise that the class constructor will not be called when the object is unserialised.

For pure simplicity we can use PHP’s built-in magic __wakeup method to execute code on unserialisation.

class user
{
    public function __wakeup()
    {
        $this->usageHistory = NULL;
        $diContainer = new diContainer();
        $this->dao = $diContainer->getInstance('userDao');
    }
}

This is handy for ensuring that the usageHistory property is properly set to NULL (so it will lazy-load). The problem with this approach is that we lose the Inversion of Control. Instead of injecting the dependencies, we are instead looking them up; we have a tightly coupled dependency to the DI container. One of the key points of DI is that the objects themselves shouldn’t really know or care about the DI container.

When constructing objects using the DI container we never directly use the “new” keyword to create objects – instead we rely on the DI container to do this for us. This supplies all dependencies as parameters. However we can’t replace the call to __wakeup, and therefore we can’t inject dependencies there.

Restoring dependencies

To ensure that dependencies are restored correctly I use a “magic” method __restoreDependencies. Ok so it’s not actually that magic; PHP doesn’t call it automatically! However the serialisation/unserialisation in my application is localised within my cache object. Therefore what I can do is adjust my cache::get method:

class cache
{
    public function get($key)
    {
        $value = $this->memcache->get($key);
        if (is_object($value) && $value instanceof cacheable)
        {
            $this->diContainer->wakeup($value);
        }
        return $value;
    }
}

To make life easy I actually use a “cacheable” interface that objects must implement in order to be stored in cache. This formality really just ensures that no one tries to cache objects without thinking about the implications for dependencies. The cacheable interface simply ensures that an object has a __restoreDependencies() method.

The (bespoke) DI container has a “wakeup” method that will:

1. Call the __restoreDependencies() method injecting any required services (dependency objects)

2. If the __restoreDependencies() method returns an array of other objects, call the wakeup() method on those objects as well. This can repeat recursively if required.

The second point here ensures that we can cache an entire userList object and wake it up effectively. The userList object’s __restoreDependencies() method would return an array of all user objects that need waking up.

The result is that I can cache complex object graphs without dependencies, but have these dependencies automatically “fixed” when objects are retrieved from cache. The objects themselves don’t really know anything about the process. Instead all they need to do is define a simple interface which defines the required dependencies.
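
For illustration (the real container and interface are bespoke, so treat this as a sketch of the contract rather than the actual code), the interface and the user object’s side of it might look like this:

// Sketch only. The extra parameter on the implementing method is
// optional so the interface contract still holds; I am assuming the
// container reflects on the concrete signature to decide what to inject.
interface cacheable
{
    public function __restoreDependencies();
}

class user implements cacheable
{
    public function __restoreDependencies(userDao $dao = NULL)
    {
        $this->dao          = $dao;
        $this->usageHistory = NULL; // force a fresh lazy-load next time
        return array();             // no child objects need waking up
    }
}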

posted on Monday 23rd November 2009 by Dave

PHP dependency strategies: dependency injection and service locator

In this post I’m hoping to answer my own question: what strategy shall I use for handling dependencies in my new project? I’m going to explore three possible strategies that should help create good quality, uncoupled code:

  1. Simple Dependency Injection (DI)
  2. The Service Locator pattern
  3. A DI framework

This includes a bespoke implementation of a DI framework for PHP that automatically creates configuration by conducting a simplistic static analysis of code.

What are dependencies?

Consider the following code. This simple application comprises an event domain object combined with a Data Access Object (DAO) that deals with persistence.

event.class.php
class event
{
    private $name;
    private $cost;
    private $eventDate;

    /**
     * @param array $row Information on this event from DAO
     */
    public function __construct($row)
    {
        $this->name = $row['name'];
        $this->cost = new money($row['cost']);
        $this->eventDate = new date($row['date']);
    }

    public function __toString()
    {
        return "EVENT: {$this->name}\nCOST:  {$this->cost}\nDATE:  {$this->eventDate}\n";
    }
}
eventDao.class.php
class eventDao
{
    public function getById($id)
    {
        $db = new database('localhost','mydb','user','password');

        $row = $db->fetchAll(
           "SELECT name, cost, date FROM events WHERE id = ".$db->quote($id)
        );

        return new event($row);
    }
}

An event object is dependent on a money object and a date object. It needs to create these to function correctly.

An eventDao object is dependent on a database object and an event object. It needs the database object to get the data and it needs to create and return a new event object.

Why are dependencies problematic?

Dependencies are not in themselves problematic. It is going to be impossible to write any useful code that doesn’t have some dependencies. The problem is how we handle those dependencies. The code example provided above presents the following problems.

1. It makes testing impossible

Writing a unit test for the event object will inevitably end up testing the money and date objects. When we create a new event object we have no control over the creation of those dependent objects. This means our unit test will cross the class boundary and what we end up with is an integration test rather than a unit test. Instead of testing the logic of the specific, isolated “unit” (our event object), we are instead testing the event object works in relation to the rest of the program.

While it may not seem immediately obvious why that’s a problem with fairly trivial dependencies such as a money object, the problem is more obvious when considering the DAO. Here we could not test the getById method without inadvertently testing the database object. Without a fully-functioning database, set up with the expected data, our unit test will fail. Again, this isn’t a unit test, it’s more likely an integration test or possibly even a system test.

2. The objects are tightly coupled

The eventDao class is tightly coupled to the specific concrete classes event and database. What if we want to use a different database object on our test environment? We can’t. Of course there are ways round this immediate problem without changing too much at all. We could use global constants DATABASE_NAME, DATABASE_USER etc.. Don’t even go there! Tight coupling makes for brittle code. If you’re not convinced you can read this article or spend 10 minutes with Google.

3. It goes against the Don’t Repeat Yourself (DRY) principle

If we imagine adding some other domain objects and DAOs to our system we will end up repeating the new database() line again and again. The same goes for other domain objects that want to represent information internally as a date object.

An alternative coupling

Let’s say you’ve got this in your DAO instead:

        $db = database::getInstance();

Same problems! We probably still can’t test it (unless we have some kind of database::setInstance() method) and it’s certainly still tightly coupled regardless.

Strategy 1: Dependency Injection

Dependency Injection is very straightforward. In fact it’s so straightforward you’ve almost certainly already done it, even if you didn’t refer to it as DI. Fabien Potencier of Symfony fame explains it expertly in these slides.

To make use of dependency injection, our eventDao can be updated to:

eventDao.class.php
class eventDao
{
    private $db;

    public function __construct($db)
    {
       $this->db = $db;
    }

    public function getById($id)
    {

        $row = $this->db->fetchAll(
           "SELECT name, cost, date FROM events WHERE id = ".$this->db->quote($id)
        );

        return new event($row);
    }
}
$db = new database('localhost','mydb','user','password');
$dao = new eventDao($db);
$event = $dao->getById(1);

We can now test the getById method using a mock database object because we are injecting the dependency into the DAO object.
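
For example, with PHPUnit the database can be stubbed out entirely (a sketch, assuming PHPUnit 3’s mock object API and that the database class is loadable):

class eventDaoTest extends PHPUnit_Framework_TestCase
{
    public function testGetByIdBuildsEventFromRow()
    {
        $row = array('name' => 'PHP London', 'cost' => 0, 'date' => '2010-07-01');

        // stub the database: no real connection needed
        $db = $this->getMock('database', array('fetchAll', 'quote'));
        $db->expects($this->once())
           ->method('fetchAll')
           ->will($this->returnValue($row));
        $db->expects($this->any())
           ->method('quote')
           ->will($this->returnValue('1'));

        $dao   = new eventDao($db);
        $event = $dao->getById(1);

        $this->assertTrue($event instanceof event);
    }
}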

We can’t, however, isolate testing of the DAO completely because of the event dependency. This can be fixed by delegating responsibility for event creation to a Factory.

eventFactory.class.php
class eventFactory
{
    public function create($row)
    {
        return new event($row);
    }
}
eventDao.class.php
class eventDao
{
    private $db;
    private $eventFactory;

    public function __construct($db, $eventFactory)
    {
       $this->db = $db;
       $this->eventFactory = $eventFactory;
    }

    public function getById($id)
    {

        $row = $this->db->fetchAll(
           "SELECT name, cost, date FROM events WHERE id = ".$this->db->quote($id)
        );

        return $this->eventFactory->create($row);
    }
}

Our code is now loosely coupled and we are programming to interface (albeit that I haven’t actually put any interfaces into the code at this stage!) To quote kdgregory from StackOverflow:

Programming to an interface is saying “I need this functionality and I don’t care where it comes from.”

By putting in place a factory for creating objects we can program only to the interface we require. In our event class, this means we don’t have to rely on specific concrete implementations for date and money, instead we merely require some object that implements iDate and iMoney, and we can use a factory to make us one of those objects.
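
To make that concrete, here is one possible reworking of the event class (a sketch, not code from the original application; iMoney, iDate and the factories are illustrative names):

// Illustrative: event depends only on interfaces plus injected factories.
interface iMoney { public function __toString(); }
interface iDate  { public function __toString(); }

class moneyFactory
{
    public function create($amount)
    {
        return new money($amount); // swap the concrete class here, and only here
    }
}

class event
{
    private $name;
    private $cost;
    private $eventDate;

    public function __construct($row, $moneyFactory, $dateFactory)
    {
        $this->name      = $row['name'];
        $this->cost      = $moneyFactory->create($row['cost']);
        $this->eventDate = $dateFactory->create($row['date']);
    }
}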

Inversion of Control (IoC)

It’s worth noting that by injecting dependencies into objects we have inverted control: rather than the procedural/linear style of setting up an object and then doing something with it, we pass in an object and execute what almost amounts to a ‘call back’ on it. The term “Inversion of Control” seems to come up frequently when reading about DI, although I’m not sure that it’s always that clearly explained. Fowler explains it in this article. There are also some other interesting blog posts on the subject. If you want the short answer on what IoC is, check out this definition.

Object graphs

Using DI, what we inevitably end up with is a complex object graph. An object graph is simply a set of interconnected objects. In the case of DI, because we pass all our dependencies around as objects, we end up with a lot of objects related to a lot of other objects at run time!

Strengths

This form of dependency injection is easy to understand. We avoid tight coupling, we can test our code and we are programming to interface. All is good!

Weaknesses

One of the web apps I work on uses a Front Controller pattern to handle incoming requests. The bootstrap code looks a bit like this:

$fc = new frontController();
$fc->dispatch();

If I take the issue of dependency injection to its extreme (slightly insane, but on some level undeniably logical) conclusion, I would have to inject all dependencies needed by the entire application into the constructor of the front controller! This post by Ben Scheirman on StackOverflow gives another example:

var svc = new ShippingService(new ProductLocator(),
   new PricingService(), new InventoryService(),
   new TrackingRepository(new ConfigProvider()),
   new Logger(new EmailLogger(new ConfigProvider())));

You get the idea.

We could be pragmatic about this and suggest that individual page controllers are allowed to be tightly coupled to domain objects. However it merely defers the inevitable.

The problem with this strategy is that if you add a dependency to an object, you then have to add it to all parent objects that use that object. This becomes a recursive task so the change causes a ripple effect to other code. The documentation to Google Guice explains it quite well (scroll down to the “Dependency Injection” section – due to their clever reader thing I can’t get an anchor link straight to it!) This problem relates to the inherent complexity involved in creating a large object graph.

Unfortunately, now the clients of BillingService need to lookup its dependencies. We can fix some of these by applying the pattern again! Classes that depend on it can accept a BillingService in their constructor. For top-level classes, it’s useful to have a framework. Otherwise you’ll need to construct dependencies recursively when you need to use a service

Strategy 2: Service Locator

Martin Fowler explains the idea of a service locator in detail in his article on the subject of DI. I’m going to explain it in the context of a PHP application.

A service locator is a straightforward system whereby objects can “look up” any dependencies they need from a central source. This gives the following advantages:

  • It is easy to add a dependency to any object
  • It is easy to replace which dependency is provided project wide, so we are adhering to the DRY principle
  • It removes tight coupling between objects

The simplest service locator may look like:

serviceLocator.class.php
class serviceLocator
{
    public static function getDatabase()
    {
        return new database();
    }

    public static function getDateFactory()
    {
        return new dateFactory();
    }

    public static function getMoneyFactory()
    {
        return new moneyFactory();
    }

    public static function getEventFactory()
    {
        return new eventFactory();
    }
}

Our DAO now becomes:

eventDao.class.php
class eventDao
{
    public function getById($id)
    {
        $db = serviceLocator::getDatabase();

        $row = $db->fetchAll(
           "SELECT name, cost, date FROM events WHERE id = ".$db->quote($id)
        );

        return serviceLocator::getEventFactory()->create($row);
    }
}

For testing we’d need to add in equivalent methods like serviceLocator::setMoneyFactory and serviceLocator::setDatabase.

We can simplify (or complicate depending on your point of view) our service locator by replacing methods like serviceLocator::getMoneyFactory() with a more generic serviceLocator::getService($serviceName). We could then configure the service locator in our bootstrap with calls to serviceLocator::registerService($serviceName, $object). If we really wanted to go to town we could use an XML or YAML file to store the details of the dependencies that the service locator provided. For a working system, we probably would want to go this far.
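
A sketch of that generic version (illustrative, not a drop-in implementation):

// Illustrative generic service locator; registration doubles as the
// test seam, since a test can register a mock under the same name.
class serviceLocator
{
    private static $services = array();

    public static function registerService($serviceName, $object)
    {
        self::$services[$serviceName] = $object;
    }

    public static function getService($serviceName)
    {
        if (!isset(self::$services[$serviceName]))
        {
            throw new Exception("Unknown service: {$serviceName}");
        }
        return self::$services[$serviceName];
    }
}

// bootstrap: serviceLocator::registerService('database', new database());
// in a test: serviceLocator::registerService('database', $mockDatabase);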

In terms of coupling, we have replaced the coupling of objects from our very first example (where eventDao was tightly coupled to database and event) with equally tight coupling, albeit this time to a single object – the service locator object. Whether this is desirable will come down to the details of the application. As Fowler points out in his discussion of locator vs injector:

The key difference is that with a Service Locator every user of a service has a dependency to the locator. The locator can hide dependencies to other implementations, but you do need to see the locator. So the decision between locator and injector depends on whether that dependency is a problem.

In terms of practical implementations, Mutant PHP has published an article on this subject which includes a sample service locator class.

The fairly new Symfony Dependency Injection Container appears to be based around the idea of a service locator. I say this because it doesn’t implement an inversion of control mechanism – as covered in strategy 3.

Strengths

The service locator provides a simple strategy for managing dependencies that is easily understood. It allows for testing and it avoids tight coupling between classes.

Weaknesses

The use of a service locator leads to a tight coupling between classes and the service locator itself.

Strategy 3: DI Framework

A dependency injection “framework” is an alternative to the arguably simpler service locator for dealing with dependencies. The key idea is to stick with dependency injection (either into the constructor or via a setter), but have some external object (Fowler calls this an “assembler” in his article on the subject) actually deal with managing the dependencies, injecting them into objects as required, without the user having to worry about it.

Now coming from a PHP background, I’ve searched about for PHP-specific information on DI frameworks. So far, I haven’t managed to find anything that I feel explains the concept as well as the Guice documentation does. In terms of responsibility-driven design, Guice outlines the role of the “injector” (Fowler’s “assembler”):

The injector’s job is to assemble graphs of objects. You request an instance of a given type, and it figures out what to build, resolves dependencies, and wires everything together.

This sounds promising; although I’m not 100% convinced I need a DI framework, I’m starting to see some advantages. There is an interesting discussion on StackOverflow (again!) about the need for a DI framework.

A bespoke DI Framework

To help understand the advantages of a DI framework I built my own, which I’ve rather confusingly called a “Service Injector”. As Benjamin Eberlei explains in a blog post on the subject of DI:

My subjective feeling tells me there are now more PHP DI containers out there than CMS or ORMs implemented in PHP, including two written by myself (an overengineered and a useful one).

Having recently read the excellent Coders at Work (go and buy it now if you haven’t read it), I took some advice from Donald Knuth who said:

The problem is that coding isn’t fun if all you can do is call things out of a library, if you can’t write the library yourself.

So I decided to write my own.

Design motivations

In Benjamin’s post, he goes on to say:

Its an awesome pattern if used on a larger scale and can (re-) wire a complex business application according to a clients needs without having to change much of the domain code.

I think my motivations for a DI framework are somewhat different. I don’t see myself wanting to “re-wire” an application at a later date. Ideally I want my logic and wiring to remain clear at the code level; I personally don’t want to delegate all the wiring to some configurator – I can see that making any debugging task harder. What I want from a framework is something that will do the hard work for me; something that will supply actual dependencies automatically.

This led me to make the following decisions:

  • I wanted an automated builder; something that would look at the code and get the DI framework setup ready to go – based on class names and interfaces
  • I wanted to keep Factory classes; I think it makes logical sense to have a class whose responsibility is to create new objects of type blah

A sample application

At the top level, I can ask the DI framework to create me an object:

// setup the service injector
include APP_PATH.'siConfig.php';
$serviceInjector = new serviceInjector();

// ----

// for our test app we'll just pretend we're looking at the details
// of event #1:

$oDao = $serviceInjector->getInstance('eventDao');
$oEvent = $oDao->getById(1);

The DAO object is created, along with its dependencies; this all happens simply by annotating the code within the PHPDocumentor-style comment blocks:

    /**
     * Constructor
     * @param iDatabase $database A service that will allow us to execute SQL
     * @param iEventFactory $eventFactory A service that will create event objects for us
     * @inject This informs the DI builder to inject constructor parameters on object creation
     */
    public function __construct(iDatabase $database, iEventFactory $eventFactory)

The service injector will find a class that implements iDatabase and iEventFactory and automatically inject these on object creation. The interesting thing is that either of these two services can have their own dependencies. For example, my eventFactory class declaration looks like this:

class eventFactory extends factory implements iEventFactory

It extends the Layer Supertype factory. The factory base class has a method to set its own dependency, again specified via annotation:

    /**
     * Set service injector
     * @inject This informs the DI builder to inject method parameters immediately after object creation
     */
    public function setServiceInjector(iServiceInjector $serviceInjector)
    {
        $this->serviceInjector = $serviceInjector;
    }

The service injector will happily go away and recursively assemble the required objects and their dependencies.

The builder

I have a script that can be executed as part of an automated build process (see my other post) that will create a pure-PHP configuration file for my service injector. It works by conducting a somewhat crude static analysis of the code you tell it to examine. It then works out which classes wire up to which interfaces, what extends what, which methods need parameters injecting and which classes should be shared (rather than a new instance created on every request).

Right now, it works as well as it needs to for the sample application. However it doesn’t do very well if you have more than one class that implements a given interface, and then you ask the service injector to build you a blah interface – in this situation it will fail. You’ll notice that although I’ve got a lot of interfaces in the sample application, they all have one class that implements the interface. I think this is a worthwhile exercise because it gets you into the mindset that you are programming to interface and thinking about messages that the objects send other objects.

Pros and cons

I like how my implementation creates the wiring-up configuration automatically based on the actual code. I also like how the service injector is really focussed on programming to interface: a service is simply some object that will provide a set of capabilities and the service injector’s only job is to inject these objects at run time. It does not deal with injecting strings and other configuration parameters; I think that’s OK since a string is not a service – and I set out to build something that would only do that job.

I guess that’s where my service injector differs from other implementations of dependency injection containers – I have focussed purely on something that will provide services, not any other types of dependency, such as configuration strings. Perhaps this could be considered a con!

The static analysis in this simple version is fairly rudimentary, although that said it will quite happily analyse the Zend framework source code. I tried this out and then made my date factory ask the service injector for a new Zend_Date object. This all worked fine – simply by changing one line of code.

The source code

So I’ve written this tool purely as a way to learn about the ideas involved and also to see if I could find a structure that I thought was useful for my application. It’s been done pretty quickly but if you’d like to have a closer look you can browse the source code here.

Other implementations for PHP

Conclusions

Through this process of research I have come to the following conclusions:

  • I prefer DI over a service locator because the individual modules are cleaner; dependencies are passed in rather than the object itself going and asking for them
  • A DI framework seems the way to go, simply to reduce the complexity involved in manually creating complex object graphs
  • I like Factory classes because they serve a clear purpose and make code easy to understand
  • I want a DI framework to be able to work (almost) completely from the source code

My next step is to look more closely at existing implementations to see if they could work in a production project.

posted on Monday 9th November 2009 by Dave

Setting up continuous integration for PHP using Hudson and Phing

In this, my first post, I’m going to write about the benefits of Unit Testing and how Continuous Integration (CI) can be used to get the best out of Unit Testing. This will include details of how I set up a CI system using the Hudson CI server and the Phing build tool, combined with various other analysis tools (including PHPUnit).

One of the best explanations of Unit Testing I’ve read was posted by benzado on Stack Overflow.

Unit testing is a lot like going to the gym. You know it is good for you, all the arguments make sense, so you start working out. There’s an initial rush, which is great, but after a few days you start to wonder if it is worth the trouble.

The difficulty with Unit Testing is keeping it up. It is very easy to slip into poor habits and before you know it there’s a huge chunk of code with no tests. Possibly a huge, badly designed chunk of code, that didn’t benefit from having tests written before it was coded. Before you know what’s going on, you end up with a project that you really can’t write tests for, because retrofitting the tests is near impossible.

For me, there are two critical reasons for Unit Testing:

  1. Enforcing good design
    To be able to write tests, you need to be able to zero in on a “unit” of code, isolating it from all the rest of your 1,000,000 lines of web application. Writing Unit Tests forces you to design systems that have loose coupling because otherwise it is impossible to test.
  2. Allowing changes to be made in confidence
    Without Unit Tests, you get to the point where no one really wants to make any changes to the code. This is especially true in a commercial environment, where many people have worked on the code, including some key team member who has since left. Unit Tests allow you to make changes to one part of the code and be pretty convinced you haven’t messed up something else.

Continuous integration

Martin Fowler describes the process of Continuous Integration in detail. He suggests:

Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily – leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly. This article is a quick overview of Continuous Integration summarizing the technique and its current usage.

The key idea behind CI is to do what is most painful often, namely “building” everyone’s code from source and making sure it all works.

A CI system usually consists of the following key elements:

  • Developers commit code
  • CI server detects changes
  • CI server checks out code, runs tests, analyses code
  • CI server feeds back to development team

If you want to find out more about CI, I recommend the excellent book Continuous Integration: Improving Software Quality and Reducing Risk. There is an excerpt published on JavaWorld which covers a lot of the key advantages. In particular, it highlights:

1. Reduce risks
2. Reduce repetitive manual processes
3. Generate deployable software at any time and at any place
4. Enable better project visibility
5. Establish greater confidence in the software product from the development team

CI gets the most out of Unit Tests by forcing them to be run after every change. Not only that, but with a good CI setup, developers instantly know if they haven’t written enough tests. It avoids the situation where Joe Bloggs has added in a huge chunk of code with zero tests.

Setting up CI for a PHP project

To get my environment setup, I consulted the following blog posts which are worth a read:

  1. http://blog.jepamedia.org/2009/10/28/continuous-integration-for-php-with-hudson/
  2. http://toptopic.wordpress.com/2009/02/26/php-and-hudson/

I’m assuming you’re using a CentOS 5 server (or I guess RHEL5). If not, you may still find various parts of this useful.

1. Install JDK

EPEL provides a set of CentOS packages, including a package for OpenJDK. This is the easiest way of installing Java.

Firstly, set up EPEL by installing its epel-release RPM (the exact package version changes over time, so check the EPEL wiki for the current URL):

rpm -Uvh http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm

Next install OpenJDK:

yum install java-1.6.0-openjdk

2. Install Hudson

Download and install the CentOS RPM for Hudson:

wget -O /etc/yum.repos.d/hudson.repo http://hudson-ci.org/redhat/hudson.repo
rpm --import http://hudson-ci.org/redhat/hudson-ci.org.key
yum install hudson

Now Hudson is installed, we can start it using the standard CentOS “service” command.

service hudson start

We can check Hudson is working by pointing the browser at port 8080 (the default Hudson port). Hudson will work “out of the box” and give you a web interface immediately. This is the primary reason I decided to go with Hudson over the other possibilities, eg: CruiseControl and phpUnderControl. Although I didn’t do an exhaustive analysis before I decided on Hudson, it just seemed right to me.

To get the graphing engine working for Hudson, you may need to install X.

yum groupinstall base-x

3. Install phing

Phing is a PHP project build system or build tool based on Apache Ant. A build tool ensures that the process of creating your working web application from source code happens in a structured and repeatable way. This helps reduce the possibility of errors caused by simply uploading files via FTP or some other simple method.

Make sure PEAR is installed for PHP (this is the easiest way of installing phing):

yum install php-pear

Then install the PEAR phing package:

pear channel-discover pear.phing.info
pear install phing/phing

4. Setup SVN

If you haven’t got a Subversion repository, you’re going to need one (or some other SCM tool like CVS, Git or Mercurial).

yum install mod_dav_svn

The simplest setup involves creating a repo in /var/www/svn/<my repo>

mkdir -v /var/www/svn/test
svnadmin create --fs-type fsfs /var/www/svn/test
chown -R apache:apache /var/www/svn/test

Set up Apache by pretty much uncommenting the lines in /etc/httpd/conf.d/subversion.conf. Once Apache is restarted, you’ll be able to get to it via /repos/test, assuming you’re using the default settings (which set up SVN on /repos). I haven’t gone into the details of getting SVN up and running; there are lots of resources out there that will help you do this.

5. Install PHP tools

PHPDocumentor – to generate documentation automatically from code
pear install PhpDocumentor
PHP CPD – “copy and paste detector” for PHP

This requires PHP 5.2. At the time of writing, this wasn’t standard with CentOS 5, but is part of the CentOS “test” repo. This can be set up by creating a yum repo file, eg: /etc/yum.repos.d/centos-test.repo and populating it with:

[c5-testing]
name=CentOS-5 Testing
baseurl=http://dev.centos.org/centos/5/testing/$basearch/
enabled=1
gpgcheck=1
gpgkey=http://dev.centos.org/centos/RPM-GPG-KEY-CentOS-testing

Then you can do:

yum update php

You may also need to upgrade PEAR if the install of phpcpd (below) fails. To do this, try:

pear upgrade pear

or, if it needs to be forced and you think that’s a good idea (I did):

pear upgrade --force pear

Finally we can install phpcpd!

pear channel-discover pear.phpunit.de
pear install phpunit/phpcpd
PHP Depend – helps analyse the quality of a codebase

Note you may have to update PHP to include the DOM module (first line below).

yum install php-dom
pear channel-discover pear.pdepend.org
pear install pdepend/PHP_Depend-beta
PHP Code Sniffer – analyse code for adherence to style/standards
pear install PHP_CodeSniffer-1.2.0
PHP Unit – unit test framework for PHP
pear channel-discover pear.phpunit.de
pear install phpunit/PHPUnit

To make PHPUnit work (specifically its code coverage), we need Xdebug, the PHP debugging and profiling extension, installed.

yum install php-devel gcc
pecl install xdebug

6. Install Hudson plugins

Use the web interface to install the following plugins (Manage Hudson -> Plugins).

  • Checkstyle
  • Clover
  • DRY
  • Green Balls (handy because it shows successful builds as green circles rather than blue)
  • JDepend
  • xUnit (will handle the output of PHPUnit test results XML)

7. Setup the phing build script

The Phing build script defines what steps will be taken to “build” the application.

Hudson itself works by placing our code into a project workspace. It will check out the code from Subversion and place it into the following location, where “Test” is the name of our project.

/var/lib/hudson/jobs/Test/workspace/

We can then use the Phing build script to carry out a number of processes on this code. When we talk about “building”, what we will actually do is place the code where we need it so it can actually run the website (we’ll keep this within the workspace), plus run the tests and other analysis.

We’ll keep the build script in the Subversion repository, so effectively it will be updated from SVN on each build. For this approach to work, the following XML needs to be saved in a file named build.xml in the project root folder (within trunk), eg: /trunk/build.xml

<?xml version="1.0" encoding="UTF-8"?>
 <project name="test" basedir="." default="app">
    <property name="builddir" value="${ws}/build" />

    <target name="clean">
        <echo msg="Clean..." />
        <delete dir="${builddir}" />
    </target>

    <target name="prepare">
        <echo msg="Prepare..." />
        <mkdir dir="${builddir}" />
        <mkdir dir="${builddir}/logs" />
        <mkdir dir="${builddir}/logs/coverage" />
        <mkdir dir="${builddir}/docs" />
        <mkdir dir="${builddir}/app" />
    </target>

    <!-- Deploy app -->
    <target name="app">
        <echo msg="We do nothing yet!" />
    </target>

    <!-- PHP API Documentation -->
    <target name="phpdoc">
        <echo msg="PHP Documentor..." />
        <phpdoc title="API Documentation"
                destdir="${builddir}/docs"
                sourcecode="yes"
                defaultpackagename="MHTest"
                output="HTML:Smarty:PHP">
            <fileset dir="./app">
                <include name="**/*.php" />
            </fileset>
        </phpdoc>
    </target>

    <!-- PHP copy/paste analysis -->
    <target name="phpcpd">
        <echo msg="PHP Copy/Paste..." />
        <exec command="phpcpd --log-pmd=${builddir}/logs/pmd.xml source" escape="false" />
    </target>

    <!-- PHP dependency checker -->
    <target name="pdepend">
        <echo msg="PHP Depend..." />
        <exec command="pdepend --jdepend-xml=${builddir}/logs/jdepend.xml ${ws}/source" escape="false" />
    </target>

    <!-- PHP CodeSniffer -->
    <target name="phpcs">
        <echo msg="PHP CodeSniffer..." />
        <exec command="phpcs --standard=ZEND --report=checkstyle ${ws}/source > ${builddir}/logs/checkstyle.xml" escape="false" />
    </target>

    <!-- Unit Tests & coverage analysis -->
    <target name="phpunit">
        <echo msg="PHP Unit..." />
        <exec command="phpunit --log-junit ${builddir}/logs/phpunit.xml --log-pmd ${builddir}/logs/phpunit.pmd.xml --coverage-clover ${builddir}/logs/coverage/clover.xml --coverage-html ${builddir}/logs/coverage/ ${ws}/source/tests"/>
    </target>
</project>

8. Setup Hudson

The first step is to create a new job.

  • From the Hudson homepage, click New Job.
  • Enter a Job name, for example “Dave’s Product Build” and choose “Build a free-style software project”. Click OK.

Now you need to configure the job; the configuration form should be displayed immediately after adding.

Under Source Code Management choose Subversion and enter:

  • Repository URL: http://www.myrepo.com/path/to/repo
  • Local module directory: source
  • Check “Use update” which speeds up checkout

Under Build Triggers select Poll SCM and enter the following schedule:

*/5 * * * *

Note that this will poll for changes to the repository every 5 minutes and rebuild if any changes are detected.

Under Build click the button to Add build step and choose Execute shell, enter the command:

phing -f $WORKSPACE/source/build.xml prepare app phpdoc phpcs phpcpd phpunit -Dws=$WORKSPACE

Under Post-build Actions choose:

  • Check Publish Javadoc and then enter:
    Javadoc directory = build/docs/
  • Check Publish testing tools result report and then click Add and pick PHP Unit, enter:
    + PHPUnit Pattern = build/logs/phpunit.xml
  • Check Publish Clover Coverage Report and enter:
    + Clover report directory = build/logs/coverage
    + Clover report file name = clover.xml
  • Check Publish duplicate code analysis results and enter:
    + Duplicate code results = build/logs/pmd.xml
  • Check Publish Checkstyle analysis results and enter:
    + Checkstyle results = build/logs/checkstyle.xml

Finally, click Build Now to test it all works.