Applying collective intelligence to PHP UK Conference 2011

I had a cracking time at the PHP UK Conference this year. It’s usually pretty good, but this year I thought the talks were slightly better than normal. I think the free beer at the end always helps!

This got me wondering.

“What talks did I miss out on that I would have liked?”

As you may be aware, many delegates have been using joind.in to provide feedback on the talks. It turns out that joind.in have an API, and this, in turn, means we can carry out some basic collective intelligence techniques to provide “recommendations” on what other talks would have been of interest.

The term “collective intelligence” refers to intelligence that emerges from the collaboration of a group. In this case, we can leverage the data within joind.in and make “intelligent” recommendations.

This post looks at building a simple recommendation engine using the data from joind.in. You can download the entire source code here (gzipped) or view via PasteBin here and try it out for yourself.

The joind.in API

The API is not entirely simple to understand, and examples are fairly thin on the ground within the documentation. The main thing to figure out is that you have to POST data to the appropriate API end point, where the POST data itself contains the “action” to carry out.

This PHP function uses cURL to POST a JSON-encoded request to the chosen end point and returns the decoded JSON response.

/**
 * Hit the Joind.in API
 *
 * @param string $endPoint API end point, eg: "event" to hit event API
 * @param string $action The desired action, eg: "gettalks"
 * @param array $params Any params to send
 *
 * @return array Decoded JSON data
 */
function joindInApi($endPoint, $action, array $params = array())
{
    $requestData = array(
        'request' => array(
            'action' => array(
                'type' => $action,
                'data' => $params
            )
        )
    );
    $options = array(
        CURLOPT_RETURNTRANSFER => TRUE,     // return web page
        CURLOPT_HEADER         => FALSE,    // don't return headers
        CURLOPT_FOLLOWLOCATION => TRUE,     // follow redirects
        CURLOPT_ENCODING       => '',       // handle all encodings
        CURLOPT_USERAGENT      => 'DAVE!',  // who am i
        CURLOPT_AUTOREFERER    => TRUE,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        CURLOPT_HTTPHEADER     => array('Content-Type: application/json'),
        CURLOPT_POSTFIELDS     => json_encode($requestData)
    );

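    $ch = curl_init('http://joind.in/api/' . $endPoint);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $errmsg = curl_error($ch);
    curl_close($ch);

    // curl_exec() returns FALSE on failure, so don't try to decode that
    if ($content === FALSE)
    {
        trigger_error('joind.in API request failed: ' . $errmsg, E_USER_WARNING);
        return array();
    }

    // fall back to an empty array if the response was not valid JSON
    $decoded = json_decode($content, TRUE);
    return is_array($decoded) ? $decoded : array();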
}
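
To make the “action” envelope concrete, here is a small sketch of the JSON body that the function above would send for the “gettalks” action. It only reuses the structure already shown in joindInApi() and the event ID used later in this post; no request is actually made.

```php
<?php
// Build the same envelope that joindInApi() POSTs (no network call here).
$requestData = array(
    'request' => array(
        'action' => array(
            'type' => 'gettalks',
            'data' => array('event_id' => 506),
        ),
    ),
);

$body = json_encode($requestData);
echo $body . "\n";
// prints {"request":{"action":{"type":"gettalks","data":{"event_id":506}}}}
```

The end point (“event”, “talk”) goes in the URL, while the action and its parameters travel inside the POST body.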

Grab talks and ratings

The first thing we need to do is fetch all the talks for the conference along with any user ratings. We can do this via the Event API “gettalks” action followed by the Talk API “getcomments” action.

// Phase 1: grab ratings via the Joind.in API

$userRatings = array();     // [userId][talkId] = rating
$talkTitles = array();      // we'll store these for later

$talks = joindInApi('event', 'gettalks', array('event_id' => 506));
foreach ($talks as $talk)
{
    $talkTitles[$talk['ID']] = $talk['talk_title'];
    echo $talk['ID'] . "\t" . $talk['talk_title'] . "\n";
    $comments = joindInApi('talk', 'getcomments', array('talk_id' => $talk['ID']));
    foreach ($comments as $comment)
    {
        echo ' -> ' . $comment['uname'] . "\t" . $comment['rating'] . "\n";
        $userRatings[$comment['uname']][$talk['ID']] = $comment['rating'];
    }
}

Calculating similar users

To work out recommendations we’ll use the classic “people like me like” method (a type of collaborative filter). This works by calculating a similarity score between a user and every other user. It is easy to implement and works well for small user sets. Companies with a lot of users, for example Amazon, usually use item-based collaborative filtering instead of user-based, because comparing every user with every other user becomes too expensive at that scale.
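
As a rough illustration of what “item-based” means for our data, the ratings array can be flipped so that talks, rather than users, become the things being compared. This is only a sketch: transposeRatings() is a hypothetical helper that is not part of the original script, and the users, talk IDs and ratings are made up.

```php
<?php
/**
 * Flip [userId][talkId] = rating into [talkId][userId] = rating.
 * (Hypothetical helper, not part of the original script.)
 */
function transposeRatings(array $userRatings)
{
    $talkRatings = array();
    foreach ($userRatings as $userId => $talksWithRatings)
    {
        foreach ($talksWithRatings as $talkId => $rating)
        {
            $talkRatings[$talkId][$userId] = $rating;
        }
    }
    return $talkRatings;
}

// Made-up sample data
$sample = array(
    'alice' => array(101 => 5, 102 => 4),
    'bob'   => array(101 => 3),
);
$talkRatings = transposeRatings($sample);
print_r($talkRatings);
```

With the array in this shape, the same similarity functions can score how alike two talks are instead of two users. This post sticks with the user-based approach.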

There are many different algorithms that will score how similar two users are, based on a set of data. Examples include Euclidean distance, Jaccard index and Pearson correlation.

It is often very difficult to know in advance which similarity measure will give the best results, so the best advice is to try them all out! We will use the Pearson correlation in this example.
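
For comparison, here is a minimal sketch of two of the other measures, written against the same [userId][talkId] = rating array. These functions are not part of the original script: the names calculateEuclidean() and calculateJaccard() are my own, the 1 / (1 + distance) mapping is just one common way of turning a distance into a similarity, and the sample ratings are made up.

```php
<?php
/**
 * Euclidean-distance similarity: 1 means identical ratings on the talks
 * both users rated; returns 0 when they share no talks.
 */
function calculateEuclidean(array $userRatings, $user1, $user2)
{
    $shared = array_intersect_key($userRatings[$user1], $userRatings[$user2]);
    if (count($shared) === 0)
    {
        return 0;
    }
    $sumOfSquares = 0;
    foreach (array_keys($shared) as $talkId)
    {
        $sumOfSquares += pow($userRatings[$user1][$talkId] - $userRatings[$user2][$talkId], 2);
    }
    // map distance to similarity: distance 0 -> 1, large distance -> near 0
    return 1 / (1 + sqrt($sumOfSquares));
}

/**
 * Jaccard index over the sets of talks each user rated (ignores the scores).
 */
function calculateJaccard(array $userRatings, $user1, $user2)
{
    $both = count(array_intersect_key($userRatings[$user1], $userRatings[$user2]));
    $either = count($userRatings[$user1] + $userRatings[$user2]);
    return $either === 0 ? 0 : $both / $either;
}

// Made-up sample data
$sample = array(
    'alice' => array(101 => 5, 102 => 3),
    'bob'   => array(101 => 4, 102 => 3, 103 => 2),
);
$euclidean = calculateEuclidean($sample, 'alice', 'bob');   // 1 / (1 + sqrt(1))
$jaccard = calculateJaccard($sample, 'alice', 'bob');       // 2 shared / 3 rated by either
echo $euclidean . "\t" . $jaccard . "\n";
```

Note that Jaccard only looks at which talks were rated, not how highly, so it answers a slightly different question from the other two.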

The following is a PHP implementation of Pearson, borrowing heavily from the excellent beginner’s book Programming Collective Intelligence.

/**
 * Calculate Pearson correlation
 *
 * This calculates the pearson correlation between user1 and user2; a measure
 * of how similar users are.
 *
 * @param array $userRatings Our array of user ratings; [userId][talkId] = rating
 * @param string $user1 The first userId
 * @param string $user2 The second userId
 *
 * @return integer|float A number between -1 and 1, where -1 indicates very
 *      dissimilar, and 1 indicates very similar
 */
function calculatePearson($userRatings, $user1, $user2)
{
    // get list of talks both have rated
    $talks = array_keys(array_intersect_key(
            $userRatings[$user1],
            $userRatings[$user2]
            ));
    $numBothHaveRated = count($talks);
    if ($numBothHaveRated === 0)
    {
        $pearson = 0;
    }
    else
    {
        $sumOfRatingsUser1 = 0;
        $sumOfSquareOfRatingsUser1 = 0;
        $sumOfRatingsUser2 = 0;
        $sumOfSquareOfRatingsUser2 = 0;
        $sumOfProducts = 0;

        foreach ($talks as $talkId)
        {
            $sumOfRatingsUser1 += $userRatings[$user1][$talkId];
            $sumOfSquareOfRatingsUser1 += pow($userRatings[$user1][$talkId], 2);
            $sumOfRatingsUser2 += $userRatings[$user2][$talkId];
            $sumOfSquareOfRatingsUser2 += pow($userRatings[$user2][$talkId], 2);
            $sumOfProducts += $userRatings[$user1][$talkId] * $userRatings[$user2][$talkId];
        }

        // calculate pearson
        $numerator = $sumOfProducts - ($sumOfRatingsUser1 * $sumOfRatingsUser2 / $numBothHaveRated);
        $denominator = sqrt(
                ($sumOfSquareOfRatingsUser1 - pow($sumOfRatingsUser1, 2) / $numBothHaveRated)
              * ($sumOfSquareOfRatingsUser2 - pow($sumOfRatingsUser2, 2) / $numBothHaveRated)
                );
        if ($denominator == 0)
        {
            $pearson = 0;
        }
        else
        {
            $pearson = $numerator / $denominator;
        }
    }

    return $pearson;
}
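
For reference, the calculation in the function above corresponds to the following formula. The symbols are my own notation rather than anything from the original script: x_i and y_i are the ratings that user1 and user2 gave to the i-th talk they have both rated, and n is $numBothHaveRated.

```latex
r = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
         {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)
                \left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}}
```

The numerator is $numerator and the square root is $denominator in the code. When the denominator is zero (for example, one user gave every shared talk the same rating) the function returns 0 instead of dividing.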

We can now run through all the users we found (who had provided comments!) and work out their similarity with every other user.

// Phase 2: Calculate user similarity (via Pearson correlation)

$pearson = array();

$users = array_keys($userRatings);
foreach ($users as $user1)
{
    foreach ($users as $user2)
    {
        if ($user1 !== $user2 && !isset($pearson[$user1][$user2]))
        {
            $value = calculatePearson(
                    $userRatings,
                    $user1,
                    $user2
                    );
            $pearson[$user1][$user2] = $value;
            $pearson[$user2][$user1] = $value;
            echo $user1 . "\t" . $user2 . "\t" . $value . "\n";
        }
    }
}

echo "\nLike me:\n";

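// WHO_AM_I is assumed to be a constant holding my joind.in username,
// defined elsewhere in the full script
arsort($pearson[WHO_AM_I]);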
foreach ($pearson[WHO_AM_I] as $user => $value)
{
    echo $user . "\t" . $value . "\n";
}

So who is like me? Turns out it’s these guys:

  • welworthy = 1
  • ianb = 1
  • manarth = 1
  • m.whitby@gmail.com = 0.99999999999999
  • rowan_m = 0.5
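
A likely reason for so many perfect scores: when two users have only two talks in common, Pearson can only come out as 1, -1 or 0, because two points always lie on a straight line. The sketch below shows this with made-up ratings. pearsonOfLists() is a hypothetical helper that applies the same formula as calculatePearson() to two plain lists. The 0.99999999999999 above is almost certainly a 1 that has picked up floating-point rounding.

```php
<?php
/**
 * Pearson correlation of two equal-length lists of ratings.
 * (Hypothetical helper; same formula as calculatePearson().)
 */
function pearsonOfLists(array $x, array $y)
{
    $n = count($x);
    if ($n === 0)
    {
        return 0;
    }
    $sumX = array_sum($x);
    $sumY = array_sum($y);
    $sumXX = 0;
    $sumYY = 0;
    $sumXY = 0;
    for ($i = 0; $i < $n; $i++)
    {
        $sumXX += pow($x[$i], 2);
        $sumYY += pow($y[$i], 2);
        $sumXY += $x[$i] * $y[$i];
    }
    $numerator = $sumXY - ($sumX * $sumY / $n);
    $denominator = sqrt(
        ($sumXX - pow($sumX, 2) / $n) * ($sumYY - pow($sumY, 2) / $n)
    );
    return $denominator == 0 ? 0 : $numerator / $denominator;
}

// Two shared talks, both users preferred the second one: prints 1
echo pearsonOfLists(array(4, 5), array(3, 5)) . "\n";
// Two shared talks, opposite preferences: prints -1
echo pearsonOfLists(array(4, 5), array(5, 3)) . "\n";
// One user gave both talks the same rating: prints 0
echo pearsonOfLists(array(4, 4), array(3, 5)) . "\n";
```

In other words, a score of 1 here says “we agreed on the few talks we both rated”, not “we have identical taste”, so the recommendations are only as good as the overlap behind them.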

Providing recommendations

Now that I know which users are most similar to me, I can see which talks they liked. The following recommendation algorithm does just this, weighting each rating by how similar I am to the user who gave it.

/**
 * Get recommendations
 *
 * Return recommendations on talks I _should_ have seen (if I could have!)
 *
 * @param array $userRatings Our user ratings; [userId][talkId] = rating
 * @param string $user The user to get recommendations for
 * @param array $similarities The similarities of all users; [user1][user2] = #
 *
 * @return array [talkId] = <how much you should have seen it!>
 */
function getRecommendations(array $userRatings, $user, array $similarities)
{
    $totals = array();
    $similaritySums = array();

    foreach ($userRatings as $compareWithUser => $talksWithRatings)
    {
        // don't compare against self
        if ($user === $compareWithUser)
        {
            continue;
        }

        // how similar?
        $similarity = $similarities[$user][$compareWithUser];
        // ignore users if they aren't similar (<=0)
        if ($similarity <= 0)
        {
            continue;
        }

        foreach ($talksWithRatings as $talkId => $rating)
        {
            // skip if I saw this talk
            if (isset($userRatings[$user][$talkId]))
            {
                continue;
            }
            if (!isset($totals[$talkId]))
            {
                $totals[$talkId] = 0;
            }
            $totals[$talkId] += $rating * $similarity;
            if (!isset($similaritySums[$talkId]))
            {
                $similaritySums[$talkId] = 0;
            }
            $similaritySums[$talkId] += $similarity;
        } // end foreach talks
    } // end foreach users

    // generate normalised list
    foreach ($totals as $talkId => &$score)
    {
        $score /= $similaritySums[$talkId];
    }
    unset($score); // break the reference left over from the loop

    arsort($totals);

    return $totals;
}
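
To see what the normalisation step does, here is the arithmetic for a single talk, using made-up numbers: one user with similarity 1.0 rated it 5, and another with similarity 0.5 rated it 3. This is only a sketch of the weighted average, not extra code from the original script.

```php
<?php
// Made-up numbers: two similar users rated a talk I missed.
$neighbours = array(
    array('similarity' => 1.0, 'rating' => 5),
    array('similarity' => 0.5, 'rating' => 3),
);

$total = 0;
$similaritySum = 0;
foreach ($neighbours as $neighbour)
{
    $total += $neighbour['rating'] * $neighbour['similarity'];
    $similaritySum += $neighbour['similarity'];
}

$score = $total / $similaritySum;   // (5 * 1.0 + 3 * 0.5) / (1.0 + 0.5)
echo round($score, 2) . "\n";       // prints 4.33
```

Dividing by the sum of similarities keeps the score on the original rating scale, and stops a talk from ranking highly simply because more people rated it.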

The final stage is to run this through for me!

// Phase 3: Get recommendations

echo "\nRecommended talks:\n";

$recommendations = getRecommendations($userRatings, WHO_AM_I, $pearson);
foreach ($recommendations as $talkId => $recommendation)
{
    echo $talkId . "\t" . $talkTitles[$talkId] . " ($recommendation)\n";
}

So my final recommendations are (with a rating in brackets):

  • 2514: Beyond Frameworks (5)
  • 2511: 99 Problems, But The Search Ain’t One (5)
  • 2512: Advanced OO Patterns (5)
  • 2521: Varnish in Action (4)
  • 2520: Running on Amazon EC2 (4)
  • 2513: Agility and Quality (3)

Conclusion

When I first ran this through on Saturday evening, my recommendations did not include “Beyond Frameworks” nor “Agility and Quality”. Now, on Sunday evening, there is more data and these have popped up. I think I prefer my Saturday evening list, but it’s not too far off.

It would be interesting to experiment with different similarity algorithms to see what impact this has. It would also be cool to use the joind.in API to look at other talks that my similar users have rated positively, outside of this conference. These are left as exercises for the reader!

If you’re interested in learning more, I’d recommend starting with the O’Reilly book, Programming Collective Intelligence. The examples take a bit of work to fully understand, but the book shields you from the maths.


One Response to “Applying collective intelligence to PHP UK Conference 2011”

  1. LornaJane says:

    Dave, this is fabulous and uber-geeky, thanks for sharing :)

    The joind.in API is indeed pretty quirky, a new version is currently under development that will hopefully be much easier to use. I’ve got a few talks coming up in that use it as an example though so conference-driven development should mean users see the new service in the next few months!
