Activity Data Synthesis

Tuesday, 2 August 2011

Draft Guide: 'Dealing with Activity Data'

[This is a draft Guide that will be published as a deliverable of the synthesis team's activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the 'Additional Resources' section]

The problem:

A project that aims to make use of activity data from sources such as those in the Identifying Activity Data draft Guide can’t avoid the fact that its team will have to roll up their collective sleeves and get hands-on with various data sources. It is likely that the data you hope to extract and manipulate will be hard to reach, unwieldy, incompatible, incomplete, downright uncooperative, or all of the above. This guide shares some helpful hints from the experiences of the JISC Activity Data projects and the wider world of library data hacking.

The solution:

Dealing with activity data relies on embracing a pioneering mindset, requiring equal measures of experimentation and hacking, together with a sixth sense of how far down one route you should go before accepting that a different solution is needed. Unfortunately, there are no hard and fast rules you can follow, but here are some helpful principles and pointers that have come out of the JISC AD projects and beyond:

Taking it further:

If you are releasing open data with the hope that people outside of the project and the institution will do something with that data, it’s worth taking steps to remove any unnecessary barriers. Many of those barriers will be the same things that made it a challenge for you to deal with the data in the first place:

  • Create small sample files that enable potential end-users to get a feel for the scope and structure of the data you’re sharing.
  • Use lowest-common-denominator, widely accepted formats, e.g. CSV.
  • Publish the scripts you yourself used to manipulate the data. If you adapted someone else’s script/code, then share what you’ve done with them to create a virtuous cycle of iterative improvements.
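
As an illustration of the first two points, here is a minimal sketch of how a small sample file might be cut from a larger CSV export. It is not taken from any of the JISC AD projects: the function name write_sample, the column names and the file names in the usage note are all hypothetical, and it assumes the export has a single header row.

```python
import csv
import io
from itertools import islice


def write_sample(source, target, max_rows=100):
    """Copy the header row plus the first max_rows data rows.

    source and target are open text-mode file objects (hypothetical
    inputs; any CSV export with a single header row should work).
    Returns the number of data rows written.
    """
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")

    header = next(reader, None)
    if header is None:
        # Empty input: nothing to sample.
        return 0
    writer.writerow(header)

    count = 0
    for row in islice(reader, max_rows):
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up column names.
    full = io.StringIO(
        "user_id,item_id,date\n"
        "1,a,2011-01-01\n"
        "2,b,2011-01-02\n"
        "3,c,2011-01-03\n"
    )
    sample = io.StringIO()
    rows = write_sample(full, sample, max_rows=2)
    print(rows, "data rows written")
    print(sample.getvalue())
```

With real files you would open both with newline="" (for example open("loans.csv", newline="") and open("loans_sample.csv", "w", newline=""), both hypothetical names) and pass them in. Taking the first rows is the simplest option; if the export is sorted by date or user, a random sample may give end-users a fairer picture of the data.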

Additional resources:

Tony Hirst’s Online Exchange presentation covers some of the issues mentioned in the section above: . Tony’s blog is also a robust source of further information:

This pair of AEIOU project blog posts was the initial inspiration for this guide:

The EVAD project is handling a vast dataset and has blogged about the data and also published a Guide to Using Pivot Tables in Open Office. The team has also shared its thoughts on taking a user-centric approach to the data:

The OU RISE project documented their thoughts about how they could most usefully format the recommender data they plan to release:


  1. This worries me on a number of levels. Firstly, it suggests that activity data is very difficult to do anything with, needing a mix of magic and a sixth sense together with the technical competence of Tony Hirst.

    Secondly it seems to be about playing with activity data for its own sake or in order to see what questions might be answered by it.

    I think if someone saw this as an introduction to activity data they would be put off.

    Perhaps we could turn it upside down and start with what sort of questions / issues can activity data help with? Where might one look for appropriate data? How might one start going about using it?

  2. Hi Tom, thanks for the comments - they would have been even more useful at the internal review stage, lol.

    Perhaps I could split out the guide into a 'starting to deal with data' guide which would cover the questions / issues etc and then a 'data wrangling' guide which would be a bit more gritty? We already have the 'what are the data sources' guide that David drafted, so a 'starting to deal with data' guide would form a logical bridge between that guide and a revised version of this guide. It would also give me an opportunity to bring in the virtuous cycle in data mining more explicitly. In any case we can discuss when we meet tomorrow :)