Activity Data Synthesis

Thursday, 31 March 2011

If wishes were fishes... some blue sea thinking

This photo caught my eye this morning:

[Photo by Julian Istar, taken at the 'Harnessing Transport Data' workshop, Manchester, 30 Mar 2011]

It got me wondering what would be on an open data wishlist for academic libraries and whether that wishlist might stimulate new activity data projects or provide new use/business cases for making data available 'out there'.

So I added a question to Quora: "What data should academic libraries be releasing as open data?" ... I'll let you know if I get any responses but feel free to also add your comments below.

Tuesday, 29 March 2011

Tabbloid #3: 28 Mar 2011

Some noteworthy items from this week's Tabbloid:

#LIDP reported in the minutes from their first project meeting that they're planning some qualitative investigations (in the form of focus groups) to test their hypothesis. That decision was the result of the deep thinking they've been doing about the hypothesis itself and how they might hope to get anywhere close to considering all of the different factors which have an impact on attainment. From the minutes it looks like Dave Pattern is working hard to retain his crown as 'king of the data gatherers' (a shepherding role you are only qualified to tackle once you've graduated from the Cat Herding Academy) and is providing LIDP's partner institutions with the guidance they need in order to get data submitted to him by their fast-approaching deadline of 23rd April.

Richard Nurse has published a couple of thought-provoking blogposts relating to the #OURISE project: one reflecting on what we mean (and don't mean) when we use the term 'recommendation' and another on the importance of considering timeliness when establishing a basis for making recommendations. Richard's points about what a recommendation is reminded me of one of the JISC Conference sessions I attended which quoted a student who was interviewed as part of the OER Impact project:
“I do everything she [indicating a fellow student] tells me to do. If she tells me to look at something, I look at it.”
I come from a sociology background so maybe I'm biased but it feels like these ethnographic nuggets can be powerful reminders of the users' side of the story which might be sitting just below the surface of any activity data.

I spotted an open data event taking place in Birmingham next month which has the question of sustainable business (and funding) models at the heart of its agenda. It's primarily centred around data from national and local government but it will be interesting to see whether there are any solid outcomes in terms of establishing convincing business cases which might translate into the world of HE.

Tuesday, 22 March 2011

Tabbloid #2: 21 Mar 2011

I was particularly excited to see a tweet signposting to this blogpost by Eric Hellman of Gluejar where he's done some heavy-grade analysis on the "motherlode" of data released by the University of Huddersfield in order to look at the impact of HarperCollins' ebook expiration strategy. It's interesting to note how using that data is opening up the debate within the blog comments. The article also got me thinking about data dissemination and wondering what else needs to happen beyond making data open and then telling Lorcan Dempsey about it. Hmmm, food for thought over the coming months. A quick Google search unearthed this useful post from 2008 on ReadWriteWeb which in turn points to the Open Knowledge Foundation's CKAN data hub which in turn holds information on Huddersfield's dataset and other library datasets, both open and not so open. It would be interesting to see something similar for examples of data mashups and visualisations with links to the open data they've used.

[Last week's Tabbloid features updates from the STAR-Trak and AEIOU projects]

A few websites that have caught my attention this last week or so:
[data released by libraries in Australia and New Zealand]

Also from New Zealand, the Reading Rooms project which "used ‘live’ 3D animation to rebuild the architecture of the Design Faculty, Unitec, Auckland according to what students were borrowing from the campus library."

The Guardian reported on David McCandless' 'consensus cloud' visualisation of 100 books everyone should read [based on this collated dataset]

This passed me by at the time but last year Mozilla ran an open data visualization competition based on their Test Pilot data. The winners were announced back in January.

Major impressions at the kick-off

It’s a couple of weeks since our start-up meeting in Birmingham and I’m sure your thinking is developing rapidly. However, some reflections on the ‘sum of the parts’ presented by all 9 projects may still be of value – certainly as a marker for the Synthesis Project. So here are the high-level observations that we shared at the end of the Birmingham meeting:

Variety & Volume – Our group of projects is particularly impressive in terms of the variety of data sources (a wide range of library, learning, repository and admin applications), the available volumes of data (including multi-year) and the potential aggregations (with opportunities in library, repository and VLE spaces). What’s more, many of you had already collected formal commitments to supply data as part of the bidding process. Given this potential feast of possibility, the next points are particularly important …

Time constraints – Given the timeline (5 months from 1 March), it may be best to plan backwards from the end point as well as forwards from the start – it’s always a useful sanity check. Whilst projects using an agile methodology could fit several sprints / iterations of activity into the period, you are likely to be restricted by major milestones such as getting hold of the data (and ingesting / amalgamating it) in the first place.

Technical priorities – It may be wise to park the tech challenges relating to scalability and repeatability and to concentrate on low cost agile experiments that will prove your hypothesis … or not! Provided that something worthwhile emerges, it is highly likely that second order issues of performance and automation can be addressed post-hoc – and with relative ease, given the available tools.

Algorithmic investigations – The experience of projects such as MOSAIC indicates that investigation (theoretical and practical) of the algorithms that underpin data processing (e.g. ingest), analysis, filtering and presentation will be really important, and increasingly so as the data scales beyond an initial experiment. And if you come up with a rule or algorithm (like the Huddersfield example of discarding course activity data where there are fewer than 35 students), please share it.
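To make the idea of a shareable rule a bit more concrete, here is a rough Python sketch of a threshold filter along the lines of the Huddersfield example. It is purely illustrative and not Huddersfield's actual code: the record layout (dictionaries with `course` and `student` keys) and the function name `filter_small_courses` are assumptions made up for the example, and only the figure of 35 comes from the rule quoted above.

```python
from collections import defaultdict

# Threshold taken from the rule quoted in the post; everything else is hypothetical.
MIN_STUDENTS = 35


def filter_small_courses(records, min_students=MIN_STUDENTS):
    """Discard activity records for courses with fewer than min_students distinct students.

    `records` is assumed to be a list of dicts with (hypothetical) keys
    'course' and 'student'; any other keys are passed through untouched.
    """
    # Count distinct students per course, so repeat activity by one student
    # does not push a small course over the threshold.
    students_per_course = defaultdict(set)
    for record in records:
        students_per_course[record["course"]].add(record["student"])

    # Keep only records belonging to courses that meet the threshold.
    return [
        record
        for record in records
        if len(students_per_course[record["course"]]) >= min_students
    ]
```

Counting distinct students rather than raw records is a design choice in this sketch; a project might equally count enrolments from a separate source, so treat it as a starting point for discussion rather than a recommendation.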

Legal concerns – Whilst data protection and other legal issues do not appear to be show stoppers for this work right now, it will benefit us all to catalogue issues and responses, risks and mitigations as they emerge; I’m going to start a legal issues register (without calling it a risk register!) with the points raised in Birmingham, hoping you’ll contribute more as we progress. We also agreed that we should address these challenges (ghost busting?) in a ‘can do’ manner, evidencing where institutions are taking affirmative action (e.g. upgrading privacy statements as highlighted by the Edina project) rather than diving for cover.


Wednesday, 9 March 2011

Tabbloid #1: 7 Mar 2011

Every week [edit: or every other week depending on the volume of new news] we'll be publishing a Tabbloid which is basically an auto-collated magazine of all the project blog feeds. For this first issue I've also included the #jiscad twitter feed but it adds a lot of noise to the magazine so I'm not planning to keep it for future editions.

This week's Tabbloid is largely made up of first blogposts and includes project plans and hypotheses for most of the projects so it's a useful read for anyone wanting to get an early overview of the projects and what their experimentations will be over the next 6 months. Of particular note is the #OURISE project which is already reporting significant progress in terms of their technical build work. Their observation about the additional responsibility of developing code that will be released as open source will no doubt strike a chord with the other project developers.

Monday, 7 March 2011

And they're off ...

Last week the Synthesis team attended a gathering of the JISC Activity Data projects for the programme launch event.

Andy McGregor (the Programme Manager) started the day by setting the context in terms of previous JISC projects which have shaped this programme and highlighted other relevant work going on within JISC and across the sector, including:
- the JISC Business Intelligence strand (which is particularly relevant to the student retention projects);
- the work going on at, particularly around 'para/event data';
- the Harvard Library Innovation Laboratory and its work around the display of live circulation data;
- the Publisher & Institutional Repository Usage Statistics 2 (PIRUS2) project;
- and the Journal Usage Statistics Portal (JUSP) project.

Andy kindly painted a picture of the Synthesis Team as a band of renegade dental extraction enthusiasts which will hopefully prove to be far from the truth - I'd like to think that our approach will be relatively painless; more Derren Brown than Sweeney Todd (with sincere apologies for the barbaric mixing of metaphors). If Andy's introduction to our team did leave you feeling a little trepidatious then a quick read of 'Tips for overcoming a fear of dentists' should set you right (just replace 'dentist' with 'synthesist' for the required effect).

David Kay, representing the synthesis team, started with a quick history lesson (with a particular mention for the MOSAIC project) and then gave an overview of our objectives, namely:
- Identification of widely applicable approaches
- Development of knowledge and skills
- Evidencing of business cases
- Dissemination of information

In practical terms this means we will be writing technical 'cookbooks', 'How-to'/'Exemplar' mini-guides and a final synthesis report. We will be delivering a number of events to enable projects to share knowledge and good practice (the first event will be an Online Technical Exchange taking place in the first week of April). We'll also be a source of information and advice through the live synthesis that takes place here on this blog, and developing online advice/FAQs via Quora. Finally we'll be gathering project knowledge and artefacts into a space online, around which a Community of Practice can draw together.

All of the 9 Activity Data projects had a presenting slot so that they could share the hypothesis that sits at the core of their project and as much other information as they could cram into the 5 minutes they were allotted. This gave projects the opportunity to start making connections between their projects and identify potential areas of overlap in terms of scope and technical challenges.

The rest of the day was spent participating in interactive sessions which were useful in revealing further details about the projects and other areas of overlap/shared pain.

Mark van Harmelen led a session on identifying the key technical challenges faced by projects and the key technical benefits that they will potentially be delivering back to the programme and to the wider community. You can see from the photo below that this resulted in a proliferation of post-it notes which were placed in their 'natural' groupings and occasionally fluttered to the floor in a most delicate manner. Mark collected all of the post-its and plans to produce a document that shows the groupings which emerged.

In the afternoon the project representatives were asked to identify the key IPR challenges their projects face. The discussions on our table reflected some of the other groups and centred around the issue of establishing 'robust anonymity thresholds' (or 'the magic number') for any data that will be released (both internally and externally to the institution).

Tom Franklin ran the last interactive session of the day which was an exploration of the projects' business cases. His definition of a business case as "something you need in order to get money from someone who doesn't want to give you money" seemed a particularly useful way of keeping the correct audience for the business case in mind (i.e. the person with the money, not the person who wants to get hold of the money).

The final session of the day was a synthesis of the day's discussions, which will form the basis of a near-future blogpost here.