A project that aims to make use of activity data from sources such as those in the Identifying Activity Data draft Guide will inevitably have to roll its collective sleeves up and get hands-on with various data sources. It is likely that the data you hope to extract and manipulate will be hard to reach, unwieldy, incompatible, incomplete, downright uncooperative, or all of the above. This guide shares some helpful hints from the experiences of the JISC Activity Data projects and the wider world of library data hacking.
Dealing with activity data relies on embracing a pioneering mindset, requiring equal measures of experimentation and hacking, together with a sixth sense of how far down one route you should go before accepting that a different solution is needed. Unfortunately, there are no hard and fast rules you can follow, but here are some helpful principles and pointers that have come out of the JISC AD projects and beyond:
- Tony Hirst blogged about his tactics for dealing with large CSV files in the course of playing with the OpenURL Router project.
- Make the most of existing tools and resources to avoid reinventing the wheel. The AEIOU project was able to make use of resources and expertise from the PIRUS2 project.
- Sometimes data extraction will go much more smoothly than you dared imagine. Cherish these moments and share any triumphs with the wider world.
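To make the point about large CSV files concrete, here is a minimal sketch in Python that reads a file one row at a time instead of loading it all into memory. It is an illustration only, not a description of the tactics any of the projects actually used; the function name `count_by_column` and the idea of tallying values in a single column are assumptions made for the example.

```python
import csv
from collections import Counter


def count_by_column(path, column, encoding="utf-8"):
    """Stream a CSV file row by row and tally the values in one column.

    `path` and `column` are hypothetical inputs: any CSV file with a
    header row, and the name of one of its columns.
    """
    counts = Counter()
    with open(path, newline="", encoding=encoding) as handle:
        # DictReader uses the first row as the header and yields one
        # dict per data row, so only one row is held in memory at a time.
        reader = csv.DictReader(handle)
        for row in reader:
            # Missing or empty values are tallied under the empty string.
            counts[row.get(column) or ""] += 1
    return counts
```

Because the file is never read in full, the same approach works whether the file holds a few hundred rows or many millions; only the tally itself grows with the number of distinct values.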
Taking it further:
If you are releasing open data with the hope that people outside of the project and the institution will do something with that data, it’s worth taking steps to remove any unnecessary barriers. Many of those barriers will be the same things that made it a challenge for you to deal with the data in the first place:
- create small sample files that enable potential end-users to get a feel for the scope and structure of the data you’re sharing.
- use lowest-common-denominator, widely accepted formats, e.g. CSV.
- publish the scripts you yourself used to manipulate the data. If you adapted someone else’s script/code then share what you’ve done with them to create a virtuous cycle of iterative improvements.
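The suggestion to create small sample files can be sketched as follows. This is a hedged example, assuming the data is already in CSV with a header row; the function name `write_sample` and the default of 100 rows are choices made for illustration, not taken from any of the projects.

```python
import csv
from itertools import islice


def write_sample(source_path, sample_path, n_rows=100, encoding="utf-8"):
    """Copy the header and the first `n_rows` data rows to a new CSV file.

    Returns the number of data rows written. All names here are
    hypothetical and exist only for this example.
    """
    written = 0
    with open(source_path, newline="", encoding=encoding) as src, \
         open(sample_path, "w", newline="", encoding=encoding) as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            # Empty source file: nothing to sample.
            return 0
        writer.writerow(header)
        # islice stops after n_rows, so the rest of the file is never read.
        for row in islice(reader, n_rows):
            writer.writerow(row)
            written += 1
    return written
```

Taking the first rows is the simplest option, but they may not be representative if the file is sorted by date or institution; a random selection of rows may give end-users a better feel for the data.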
Tony Hirst’s Online Exchange presentation covers some of the issues mentioned in the section above: http://blog.activitydata.org/2011/07/online-exchange-4-event-recording-21.html . Tony’s blog is also a robust source of further information: http://blog.ouseful.info/
This twinset of AEIOU project blog posts was the initial inspiration for this guide:
The EVAD project is handling a vast dataset. The team have blogged about the data and also published a Guide to Using Pivot Tables in Open Office. They’ve also shared their thoughts on taking a user-centric approach to their data: http://vledata.blogspot.com/2011/04/story-so-far.html
The OU RISE project documented their thoughts about how they could most usefully format the recommender data they plan to release: http://www.open.ac.uk/blogs/RISE/2011/05/06/open-recommender-data/