Activity Data Synthesis

Tuesday, 15 November 2011

Adieu from JISC AD

The launch of our final report website this week brings the work of the Activity Data synthesis team to a close and, therefore, this blog will now be mothballed and there are no further updates planned.

You can browse the final report using the web interface (expertly designed for us by Dan Moat) or download a pdf of the final report.

There is still activity happening on the individual project blogs and via the twitter hashtag #jiscad which you can keep up to date with by creating a Five Filters news digest on the fly whenever you get the urge.

So for now it's a heartfelt adieu from me and the rest of the project synthesis team. < waves >

Notes and Photos from Library Camp 2011

Back in October I joined a gathering of library workers, geeks, advocates and enthusiasts in Birmingham for the Library Camp 2011 unconference.

There were a few unusual things about the event from my point of view ...

To start with, it was the first time I'd been at a library event with such a mixed crowd - public library folks rubbing shoulders with folks from academic libraries doesn't happen as frequently as it should.

Secondly, I have never seen so much cake.

Thirdly, it was the first unconference event I've been to where everyone introduced themselves at the start of the day.

Fourthly, I've never seen more sessions get proposed than the slots available.

Fifthly, there was a poet-in-residence which is, again, a first for me (though strangely I've been at another event since then which had a poet-in-residence too).

The idea of 150ish people introducing themselves one by one at the start of an event might seem like lunacy but I have to say I found it very moving and uplifting to hear everyone's reasons for being in the same place.

Here are some of the reasons I managed to scribble down as folks introduced themselves:
"... to capture the libgeist."
"I'm here to start the revolution."
"... lured by cake and curiosity."
"I'm looking for library lovers."
"... critique, collaboration and revolution."
"... to steal people's enthusiasm, passion and, hopefully, anger."
"Gratuitous hugging."
"Libation in the library."
"To show the rage and passion for libraries." 

Dave Pattern kindly agreed to be my wing man and ran a session on Activity Data and Recommender Services with me. Dave shared the good work he’s been involved with at Huddersfield University and I talked a bit about this programme and also shared some of the great (open source) applications that have come out of the JISC MOSAIC and Discovery developer competitions. Hopefully my interpretative dance representation of Alex Parker's Book Galaxy serendipitous search interface persuaded a few of those present to take a look at whether they can exploit any of the applications that are sitting there waiting to be plundered for a good cause. Part of the discussion in our session was around the challenge libraries face when they don’t have developers on their staff to take advantage of opportunities like those. One of the solutions we discussed was to check who else is using the same library systems as your institution and to look for opportunities to form alliances around shared development goals.

All in all it was an invigorating day full of positive conversations and rapidly shared ideas. My only regrets are that a) I couldn't stay on into the evening to continue with the conversations and b) I didn't have a large tupperware box with me to take some of the cake home with me :)

Wednesday, 28 September 2011

A round-up of recent JISC Activity Data activity

The synthesis team have been busy honing the content of our final deliverable for this programme, namely a one stop shop which gathers together the projects' collected wisdom on identifying, collecting, managing and sharing activity data within UK HE. The amount of content we have to share means we have a challenge to present it in an intuitive way with easily navigable routes into (and out of) the information for end users but we're hopeful that it's achievable and you'll be able to judge the fruits of our labour next month when we'll be launching it.

Earlier this month we ran a pre-conference workshop at ALT-C which (talking of navigation issues) seemed to go very well once the initial challenge of finding the room itself was conquered. The session was entitled 'Improving processes by using activity data' and featured the following sessions:
- Introduction to activity data [presented by Tom Franklin]
- Challenges raised by activity data [presented by Mark van Harmelen]
- [Case Study] Leeds Met STAR-Trak: NG - using activity data in support of student success [presented by Rob Moores]
- Discussion of potential use case [facilitated by David Kay]
- [workshop session] Building a business case [facilitated by Tom Franklin and myself]
- [workshop session] Working with activity data - technical discussion of value and challenges [facilitated by Mark van Harmelen and David Kay]

I've combined (i.e. wrestled) all the slides from the workshop into one presentation and uploaded it as a pdf which you can view below:

The Tabbloid digest for that week captures the tweets from the workshop:

Here are some of my twitter highlights from the day:

Hidden amongst all the workshop tweets is a gem from the AGtivity project: a blogpost sharing what they've worked out about handling timestamps between UNIX, GNUplot and Excel during the course of their project. Another gem came from the direction of AGtivity in the shape of Martin Turner bringing CritterVRE to my attention. Martin used it to capture the twitter activity from the day and it looks like a useful tool to add into my event amplification / capture toolbox. The service was developed to use alongside Access Grid sessions but it looks useful for other purposes too (if that's allowed, I haven't had a go yet).
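For anyone who hasn't hit this sort of problem before, here is a rough Python sketch of one of the conversions involved. It is an illustration rather than the AGtivity team's own method: it assumes Excel's default 1900 date system, in which 1 January 1970 (the start of UNIX time) is serial day 25569, it ignores time zones, and it doesn't cover GNUplot at all. The function names are made up for this example.

```python
SECONDS_PER_DAY = 86400
# Excel serial number of 1 January 1970 in the default 1900 date system.
EXCEL_UNIX_EPOCH = 25569


def unix_to_excel(unix_seconds):
    """Convert seconds since the UNIX epoch to an Excel serial date (UTC)."""
    return unix_seconds / SECONDS_PER_DAY + EXCEL_UNIX_EPOCH


def excel_to_unix(excel_serial):
    """Convert an Excel serial date back to seconds since the UNIX epoch."""
    return (excel_serial - EXCEL_UNIX_EPOCH) * SECONDS_PER_DAY
```

The fractional part of the Excel value carries the time of day, so half a day after the epoch comes out as 25569.5.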

September was a busy month on the Exposing VLE Activity Data project blog as they reached the end of their project extension period. Their posts included:
- A guide to their Perl analysis tool.
- A guide to using Gephi to visualise a bipartite network of users and websites [including a discussion of their approach and a technical recipe/guide].
- Analysis of their VLE event logs.
- A discussion of releasing anonymised VLE event log data [including a link to the dataset they've released (4GB download)].
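To give a flavour of what preparing a bipartite users-and-websites network might involve, here is a minimal Python sketch. It is not the project's recipe: the event data and function names are invented, and it assumes (as far as I'm aware) that Gephi's CSV import will accept an edge list with Source, Target and Weight columns.

```python
import csv
from collections import Counter


def build_edge_list(events):
    """Count how often each (user, site) pair occurs in the event log."""
    return Counter((user, site) for user, site in events)


def write_edge_csv(edges, path):
    """Write weighted edges as a CSV with Source, Target and Weight columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        for (user, site), weight in sorted(edges.items()):
            writer.writerow([user, site, weight])


# Hypothetical, already anonymised events: (user id, site name).
events = [("u1", "library"), ("u1", "library"), ("u2", "library"), ("u2", "forum")]
edges = build_edge_list(events)
```

Calling write_edge_csv(edges, "edges.csv") would then produce a file to try importing. Users only ever link to sites, never to other users, which is what makes the network bipartite.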

The LIDP project have now released data as well as a Library Impact Data toolkit (both of which are published under open licences).

Wednesday, 14 September 2011

Draft Guide: 'Legal issues relating to sharing data'

The problem

If you want to share activity data with others then you have to make sure that you have the right to do so, that you share it in an appropriate way and that the terms under which you share it are appropriate.

In order to share data you have to have the right to do so. In practice this means that you need to ensure that you have appropriate intellectual property rights (IPR) in the data. If the data subjects might be identifiable (i.e. you are releasing full data rather than statistical data) then they need to have been informed, when they agreed to the data being collected, that sharing can happen (and to have had a real ability to opt out of this). Finally you will need to select an appropriate licence under which to release the data.

The options

Intellectual property rights (IPR)

It is likely that you will own the data from any systems that you are running, though it may be necessary to check the licence conditions in case the supplier is laying any claim to the data. However, if the system is externally hosted then it is also possible that the host may lay some claim to the log-file data, and again you may need to check with them.

  • JISC Legal has a section addressing copyright and intellectual property right law

Data protection

Data protection, which addresses what one may do with personal data, is covered by the Data Protection Act 1998, and there is much advice available including:

An alternative approach to addressing the needs of data protection is to anonymise the data.

Licensing the data

Any data automatically comes with copyright, and therefore you need to license the data in order for other people to legitimately use it. There are a wide variety of types of licence that you can use, though the most common is likely to be some form of Creative Commons licence.

Guidance is available from a wide variety of places including:

- Introduction to licensing and IPR

- Creative Commons licence:

Wednesday, 31 August 2011

Tabbloid: 31 August 2011

A couple of updates from the project blogs this week:
In the wider world I stumbled across a mention that 'big data' has now made it onto the Gartner Hype Cycle for the first time, which seems significant even if, like me, you find yourself wondering where the Gartner Hype Cycle itself falls on their chart.

Next week the synthesis team will be reunited when we head to Leeds to run one of the ALT-C pre-conference workshops, 'Improving processes by using activity data', where we'll be joined by the geographically convenient Rob Moores who'll be sharing knowledge and experience from the Leeds Met STAR-Trak project with those who attend.

Wednesday, 24 August 2011

Tabbloid: 24 August 2011

This week there's an interesting post over on the EVAD project blog about the problem of finding the right 'data munging' tool and how they ended up developing their own custom Perl script instead. They've publicly released the Perl script so it will be interesting to watch and see whether their custom built script suits the needs of another project or whether a new bespoke tool needs to be fashioned for every project going.

The LIDP project have been presenting to, and in attendance at, the Performance Measurement in Libraries and Information Services conference which is a week-long event taking place at York University [#pm9york]. Word on the twittersphere is that the LIDP toolkit will be released next week so I'll probably be linking to that next week.

The OpenURL Router Data project launched their article recommender prototype and it's just as well that I don't have an Athens log-in because I was quickly drawn in to all sorts of intriguing looking material, including an article entitled 'Getting a Grip on Strangles'.

Out in the wider world there have been relevant links flying into my twitterstream from unexpected quarters which suggests to me that either a tipping point is coming our way in terms of a wider awareness of activity data, or I am getting more creative in my interpretation of what is relevant to the programme. In any case here are a few highlights that I've picked out of this week's Tabbloid:

Wednesday, 10 August 2011

Five Filters news digest: 10 August 2011

A Tabbloid did in fact wend its way into my inbox this morning but it was a little bereft of life so I've turned to the trusty Five Filters website to create this week's blog digest. As before, you can generate a digest on the fly but I'll also be sending it out via email.

Just a couple of project updates this week:
  • the UCIAD project published their final project blogpost, including a video which gives a demo of the UCIAD platform, with an accompanying written commentary nestled below the video [and I can confirm that it's in with a good chance of winning both the 'techiest video I've watched' and 'longest video without a soundtrack' awards in my imaginary video award ceremony at the end of the year]. It's a shame we haven't got any more online exchanges planned because it would have been a good opportunity to get Mathieu to talk through the demo. I'll be interested to hear the results of the user feedback that the project plans to gather as part of their post-JISC project activity.
News from the twittersphere:
News from the synthesis team is that we've finalised the programme for the pre-conference ALT-C ['Improving processes by using activity data'] workshop which we're running on 5 September in Leeds. The workshop is free, includes lunch, and you don't need to be going to ALT-C in order to attend.

Wednesday, 3 August 2011

Tabbloid: 3 August 2011

Some more final blogposts have emerged this week:
Some other project blogposts worth visiting if you're interested in the more technical side of what they've achieved:
A couple of other interesting reads I saw flying at high velocity around the twittersphere today:

Tuesday, 2 August 2011

Draft Guide: 'Dealing with Activity Data'

[This is a draft Guide that will be published as a deliverable of the synthesis team's activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the 'Additional Resources' section]

The problem:

A project that aims to make use of activity data from sources such as those in the Identifying Activity Data draft Guide can’t avoid the fact that they will inevitably have to roll their collective sleeves up and get hands on with various data sources. It is likely that the data you hope to extract and manipulate will be either hard to reach, unwieldy, incompatible, incomplete, downright uncooperative or all of the above. This guide shares some helpful hints from the experiences of the JISC Activity Data projects and the wider world of library data hacking.

The solution:

Dealing with activity data relies on embracing a pioneering mindset, requiring equal measures of experimentation and hacking, together with a sixth sense of how far down one route you should go before accepting that a different solution is needed. Unfortunately there are no hard and fast rules you can follow but here are helpful principles and pointers that have come out of the JISC AD projects and beyond:

Taking it further:

If you are releasing open data with the hope that people outside of the project and the institution will do something with that data, it’s worth taking steps to remove any unnecessary barriers. Many of those barriers will be the same things that made it a challenge for you to deal with the data in the first place:

  • create small sample files that enable potential end-users to get a feel for the scope and structure of the data you’re sharing.
  • use lowest common denominator/widely accepted formats e.g. CSV
  • publish the scripts you yourself used to manipulate the data. If you adapted someone else’s script/code then share what you’ve done with them to create a virtuous cycle of iterative improvements.
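As a small illustration of the first two suggestions, here is a Python sketch that cuts a sample file out of a larger CSV. It is only one possible approach and makes some assumptions: the source file has a single header row, the first rows are reasonably representative, and the function names and default of 100 rows are invented for the example.

```python
import csv
from itertools import islice


def take_sample(rows, max_rows=100):
    """Return the header row plus the first max_rows data rows."""
    return list(islice(rows, max_rows + 1))  # +1 keeps the header row


def write_sample(source_path, sample_path, max_rows=100):
    """Copy the header and the first max_rows data rows into a new CSV file."""
    with open(source_path, newline="") as src, open(sample_path, "w", newline="") as dst:
        csv.writer(dst).writerows(take_sample(csv.reader(src), max_rows))
```

If the data is sorted by date or by user then the first rows may not be typical, in which case a random sample would give end-users a better feel for the dataset.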

Additional resources:

Tony Hirst’s Online Exchange presentation covers some of the issues mentioned in the section above. Tony’s blog is also a robust source of further information:

This twinset of AEIOU project blogposts were the initial inspiration for this guide:

The EVAD project is handling a vast dataset and have blogged about the data and also published a Guide to Using Pivot Tables in Open Office. They’ve also shared their thoughts around taking a user-centric approach to their data:

The OU RISE project documented their thoughts about how they could most usefully format the recommender data they plan to release:

Thursday, 28 July 2011

Tabbloid: 27 July 2011

In true tabloid vernacular I think it's fair to say that this week's Tabbloid is 'a whopper' and I can tell that the OU RISE project team are certainly back from their holidays.

Some of the projects have posted their official 'final blogpost' {wipes tear from corner of eye} but I have a feeling that we will continue to see further blogposts from them in the weeks to come. Here are the projects' final blogposts, no doubt there will be another flurry of them before the week is out:
Other newsworthy news (ahem) this week:
  • I know I use the phrase 'thought provoking' a lot in these synthesis posts but that is the perfect descriptor for Leeds Met STAR-Trak's post on the domain knowledge chasm that they discovered in the course of running feedback workshops with students and staff.
  • Both the OU RISE and the SALT projects have been thinking deep thoughts about licensing this week (which is handy for me as I'm just finalising the draft guide on that very topic).
And finally, some other links of interest regarding activity data within academia and out in the wider world:

Monday, 25 July 2011

Online Exchange #4: Event Recording [21 July 2011]

The fourth, and most likely final, Online Exchange took place last week and the topic this time was data visualisation (or 'visualization' depending on which side of the pond you reside).

The session was an opportunity for the JISC AD projects to share information about the data that they're wrangling as part of their project and their thoughts on/experience of the challenge of presenting that data visually. The main attraction though was a presentation from Tony Hirst who gave a very useful (or should I say 'OUseful' {nice pun Helen!}) overview of the tools and techniques you can use to create data visualisations.

You can playback the whole session by following the link below. [Note that you'll need to run the Java application that launches in order to watch it] The playback is slightly crackly on my machine but hopefully it won't detract from your listening pleasure:
You can see Tony's accompanying slides below and the good news is that he hopes to build and openly release a data viz 'uncourse' along the same lines later this year:

Tony's tour of the various data visualisation tools was great and brought the tools to life in a very engaging way with lots of examples showing how Tony's used them with real data. Personally speaking, the really interesting part for me was listening to Tony talk about the purpose and process of data visualisation. Tony is the first to admit that he is not a statistician and when he describes the process of using visualisation tools as 'having a conversation with your data' and 'exposing the hidden shapes, stories and messages within the data' it strikes me that working with data in this way requires an artistic / poetic / craftsperson mind-set as much as it does an analytic skill-set. I'll be mining Tony's talk to improve the data visualisation Draft Guide we've written but please do add your thoughts and tips below.

Wednesday, 20 July 2011

Tabbloid: 20 July 2011

It's been a fairly busy week on the project blogs and no doubt will continue in that manner over the next few weeks as the projects publish their final blogposts.

The AGtivity team in particular have been busy and are producing some interesting stuff, including a couple of hot off the press posts that aren't included in this week's Tabbloid:
  • Ahead of tomorrow's Online Exchange on the subject of Data Visualisation there's a timely post on the different ways that activity data can be visualised and the challenge that presents when choosing which visualisation to show the end user.
  • A breakdown of the numbers of data items the project has processed.
  • A first pass at writing up the project's Wins and Fails - no doubt the various data headaches they've had to deal with will chime strongly with a fair few of the other projects.
  • The 'Tale of Two Rooms' case study the team have compiled gives a good insight into the stories that the AGtivity data can tell - it also demonstrates how important contextual information is for making sensible interpretations of the data.
The LIDP project have been delving further into the data behind *that* graph (you'll recognise it when you see it) and have come up with the interesting conclusion that the differentiating behaviour is replicated year by year. It's got me wondering about what type and scale of intervention would be needed to buck the trend. I'm also wondering whether the students with higher outcomes might also be going to the library earlier in each term (and therefore having a wider choice of books) than their course mates.

These sorts of wonderings are some of the things that the LIDP team have been discussing while they've been out on the road sharing the project outcomes so far.

On their blog there's also a (slightly stolen) guest post from one of the LIDP project partners - Paul Stainthorp looks back at what they had to do to get at their data, how they wrangled it into one giant .csv file and how they discovered one of their datasets was missing.

On Twitter, Amber Thomas shared a link to an interesting article about how some of the for-profit universities in the US, such as Kaplan, APUS and Phoenix are surprisingly open to the idea of sharing data on student success with their not-for-profit competitors.

Monday, 18 July 2011

Online Exchange #3: Event Recording [13 July 2011]

Last week we held the third of our Online Exchange sessions. This time we opted for Elluminate as our conferencing weapon of choice and it served us well.

You can playback the whole session by following the link below. Note that you'll need to run the Java application that launches in order to watch it:

Ross MacIntyre introduced us to the Journal Usage Statistics Portal (JUSP) service and gave a live demo of the JUSP portal itself.

Nicole Harris talked about Cardiff University's Raptor (JISC-funded) project and their recently launched software. Nicole's slides are below.

Thursday, 14 July 2011

Tabbloid: 6 and 14 July 2011

It's a double header blog round up as I look back over the past two weeks of blog and twitter activity within the Activity Data programme. Amazingly there's a Tabbloid for both weeks (wonders will never cease!). It's been a busy couple of weeks for the synthesis team with multiple events in Milton Keynes, plus our third Online Exchange session - I'll talk more about those events in separate posts.

6 July update:

  • The UCIAD project are continuing to do some deep thinking about user-centric activity data and have drawn up some concept diagrams which show (I think) that the organisational-centric activity data is simply an aggregation of user-centric data. Which means that an organisational-centric approach shouldn't preclude the potential that exists for releasing activity data to individual users too. It's got me thinking about what would happen if users were fed metrics about their usage such as 87% of the books/resources you've borrowed are off the reading list; 24% of your returns have been x days late etc - would it feed into a sense of self-responsibility or have a negative impact on under-achieving students. Would students welcome the additional data?
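To make that idea a bit more concrete, here is a hedged Python sketch of how per-user metrics like those might be calculated. It has nothing to do with the UCIAD platform itself: the data layout, the function name and the sample loans are all invented for illustration.

```python
def usage_summary(loans, reading_list):
    """Summarise one user's loans as percentages.

    loans: list of (item_id, days_late) tuples; days_late is 0 if on time.
    reading_list: set of item ids on the user's reading list.
    """
    total = len(loans)
    if total == 0:
        return {"off_reading_list_pct": 0.0, "late_returns_pct": 0.0}
    off_list = sum(1 for item, _ in loans if item not in reading_list)
    late = sum(1 for _, days_late in loans if days_late > 0)
    return {
        "off_reading_list_pct": 100.0 * off_list / total,
        "late_returns_pct": 100.0 * late / total,
    }


# Made-up loans for a single user: one off-list item, one late return.
loans = [("b1", 0), ("b2", 3), ("b3", 0), ("b4", 0)]
summary = usage_summary(loans, reading_list={"b1", "b2", "b3"})
```

Running the same summary over every user and combining the results would give the organisational-centric view, which fits the point that one is an aggregation of the other.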

14 July update:

This week's issue might more accurately be called the OU RISE Weekly, since nearly all of the content comes from their blog:
In addition to these posts on the OU RISE project blog, Richard Nurse was also pondering activity data and open metadata over on his personal blog.

Some other items of (leftfield) interest that I've stumbled across in the last couple of weeks:

Wednesday, 29 June 2011

Draft Guide: 'Anonymising data'

The problem:
Data protection requirements mean that we cannot release personal data to other people without the data subjects' permission. Much of the activity data that is collected and used contains information which can identify the person responsible for its creation. It may contain their username, the IP address from which they were working or other information including patterns of behaviour that can identify them.

Therefore, where information is to be released as open data for anyone to use, consideration needs to be given to anonymising the data. This may also be required for sharing data with partners in a closed manner depending on the reasons for sharing and the nature of the data together with any consent provided by the user.

The options:
Two main options exist if you want to share data.

The first is to only share statistical data. As the Information Commissioner recently wrote:
"Some data sharing doesn’t involve personal data, for example where only statistics that cannot identify anyone are being shared. Neither the Data Protection Act (DPA), nor this code of practice, apply to that type of sharing."

The second is to anonymise the personal data so that it cannot be traced back to an individual. This can take a number of forms. For instance, some log files store user names while others store IP addresses; where a user has a fixed IP address, this could be traced back to them. Anonymising the user name or IP address through some algorithm would prevent this. A further problem may arise where rare data might be used to identify an individual. For instance, a pattern of accessing some rare books could be traced to someone with a particular research interest.
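One possible 'algorithm' of this kind is a keyed hash, sketched below in Python using the standard hmac and hashlib modules. This is an assumption about how it could be done rather than a description of what any of the projects did, and the key, field names and sample log line are invented. A keyed hash is used rather than a plain one because user names and IP addresses are few enough to be guessed by hashing every candidate.

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace a user name or IP address with a keyed SHA-256 digest."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability in log files


# Hypothetical key: in practice keep it secret and out of the released data.
key = b"replace-with-a-long-random-secret"
log_line = {"user": "jsmith", "ip": "192.0.2.10", "resource": "book-123"}
released = {
    "user": pseudonymise(log_line["user"], key),
    "ip": pseudonymise(log_line["ip"], key),
    "resource": log_line["resource"],
}
```

The same user always maps to the same token, so patterns of use survive for analysis. That is also why this step alone does not deal with the rare-data problem described above.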

Taking it further:
If you want to take it further then you will need to consider the following as a starting point:
  • Does the data you are considering releasing contain any personal information?
  • Are the people that you are sharing the data with already covered by the purpose the data was collected for (eg a student’s tutor)?
  • Is the personal information directly held in the data (user name, IP address)?
  • Does the data enable one to deduce who used that data (only x could have borrowed those two rare books – so what else have they borrowed)?
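The last question can be partly checked by machine. The Python sketch below flags items used by fewer than k distinct users and drops them before release. It is a rough aid under stated assumptions, not a guarantee of anonymity: the threshold k, the function names and the record layout of (user, item) pairs are all invented for illustration.

```python
from collections import defaultdict


def rare_items(records, k=5):
    """Return the set of items used by fewer than k distinct users."""
    users_per_item = defaultdict(set)
    for user, item in records:
        users_per_item[item].add(user)
    return {item for item, users in users_per_item.items() if len(users) < k}


def suppress_rare(records, k=5):
    """Drop records for rare items before the data is released."""
    rare = rare_items(records, k)
    return [(user, item) for user, item in records if item not in rare]
```

Combinations of items that are individually common can still single someone out, so a check like this is a starting point only.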
Additional resources:

Friday, 24 June 2011

Draft Guide: 'Developing a Business Case'

[This is a draft Guide that will be published as a deliverable of the synthesis team's activities. Your comments are very much welcomed and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the 'Additional Resources' section]

The problem:
Getting senior management buy in for projects which make use of activity data to enhance the user experience or management of facilities is key if projects are to get the go ahead in the first place and become a sustainable service in the long term. There is a lack of persuasive business cases to refer to in the public realm. This guide gives some high level advice for the effective development of a solid business case.

In the current programme, activity data is being used to enhance the learner experience through recommending additional material, effectively manage resources and increase student success by helping them improve their online practices. Each of these is a powerful strategic benefit.

The solution:
The most important thing to remember when developing a business case is that its purpose is to persuade someone to release resources (primarily money or staff time) for the proposed activity. The person who will have to make the decision has a wide variety of competing requests and demands on the available resources, so what they need to know is how the proposed project will benefit them.

The answer to this question should be that it helps them move towards their strategic goals. So the first thing that you need to find out is what their strategic goals are. Typically these are likely to include delivering cost savings, improving the student experience or making finite resources go further. You should then select one (or at most two) of these goals and explain how the project will help to meet this goal (or goals). Aligning the project to many goals has the danger of diluting each of them and having less impact than a strong case for a single goal.

Structure of a business case:
- Title
- Intended audience
- Brief description
- Alternative options
- Return on investment
- Costs
- Project plan
- Risks
- Recommendation

Do not 'over egg the pudding' by understating the costs and risks or overstating the benefits. If the costs or benefits are not credible then the business case may be rejected because it does not appear to offer realistic alternatives.

The benefits should be realistic and quantifiable and, wherever possible, the benefits should be quantified in monetary terms. This allows the decision maker to compare the benefits and costs (which can usually be expressed in monetary terms), and so clearly see the return on investment, and compare this business case with other calls on their funding and staff.
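As a simple worked example of that comparison, the Python sketch below expresses return on investment as net benefit divided by cost. This is one common convention rather than a method prescribed by the programme, and the figures are made up purely for illustration.

```python
def return_on_investment(total_benefits, total_costs):
    """Return ROI as a fraction: net benefit divided by total cost."""
    if total_costs <= 0:
        raise ValueError("total_costs must be positive")
    return (total_benefits - total_costs) / total_costs


# Made-up figures: staff time plus hosting, against estimated savings.
costs = 20000 + 5000
benefits = 40000
roi = return_on_investment(benefits, costs)
```

With these figures the net benefit is 15,000 on costs of 25,000, a return of 60%. A decision maker can set that alongside the equivalent figure for other calls on their funding.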

Taking it further:
If the sector is to build a higher level picture of the business cases for exploiting activity data and also for pursuing the path towards open data then it is important to share knowledge of what works in terms of convincing key decision makers to give sustained support to using activity data.

The programme has produced some example business cases which can be used to understand the type of information that it is sensible to include, and which may form the basis for your business case. However, the business case must relate to the local circumstances in which you are writing it, and the audience for which you are writing it.

Additional resources:

Guidance and templates

Examples and further reading

Thursday, 23 June 2011

JISC online consultation

JISC is currently undertaking a consultation exercise and wrote:

As part of our institutional engagement work, the JISC Organisation and User Technologies Team is carrying out an online consultation (using moodle) to identify emerging issues and concerns in UK Higher Education that we may, in the future, be looking to develop programmes of activity around. There are five top level areas each with a discussion forum attached, please feel free to either post a new concern or issue or respond to someone else’s post.
The site is at it’s a moodle site so there is a quick and simple 2 part registration before you post
Anything you can contribute will be helpful in shaping our future plans

I have added the following post on analytics - you may wish to comment or add others...

One of the key factors for both students and universities will be student success; though at times they may have different definitions of what this means.

There are two key areas here: retention and outcome (loosely result but also that the student has achieved what they set out to do). Retention is already good by international standards, but this does not give grounds for complacency, and there is much that can be (and is being) done to improve it.

It is arguable that student success is also one of the factors in the student experience.

In this posting I want to look at one tool that can be used to enhance student success, where JISC is already doing some work, but much more could be done and would have a very positive return on investment for institutions. This is data analytics to support student success.

Universities and colleges are already collecting vast amounts of data about their students, but making very little use of it. Every time a student logs on to the VLE, undertakes a search of the library resources, accesses an e-journal, swipes their card through the library turnstile or lecture theatre the event is recorded in logs on servers at the university. Most of the time this information simply sits there gathering electronic dust until it is archived or deleted. However, there is much valuable information that could be used to help students to help themselves.

For example there are patterns of behaviour which may give early indications that a student is at risk of dropping out (non- attendance, declining use of VLE perhaps) where early intervention to support students may help them to achieve the results that they wanted to.

Similarly there are patterns of behaviour which may indicate that students are not studying as effectively as they might, again where early intervention could be of great assistance to the student.
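As a very rough sketch of what spotting 'declining use of VLE' could look like, the Python below flags students whose recent weekly logins have dropped well below their own earlier average. The three-week window, the 50% threshold, the function name and the sample data are all assumptions for illustration. A real system would need validating against actual outcomes and, as noted further down, tuning by discipline and institution.

```python
def flag_declining_use(weekly_logins, recent_weeks=3, threshold=0.5):
    """Flag students whose recent average logins fall below a share of their earlier average.

    weekly_logins: dict mapping student id to weekly login counts, oldest first.
    """
    flagged = []
    for student, counts in weekly_logins.items():
        if len(counts) <= recent_weeks:
            continue  # not enough history to compare against
        earlier = counts[:-recent_weeks]
        recent = counts[-recent_weeks:]
        earlier_avg = sum(earlier) / len(earlier)
        recent_avg = sum(recent) / len(recent)
        if earlier_avg > 0 and recent_avg < threshold * earlier_avg:
            flagged.append(student)
    return sorted(flagged)


# Invented login counts: s1 tails off, s2 is steady, s3 has too little history.
logins = {
    "s1": [10, 12, 11, 9, 3, 2, 1],
    "s2": [5, 6, 5, 6, 5, 6, 5],
    "s3": [4, 0],
}
at_risk = flag_declining_use(logins)
```

Comparing each student with their own earlier behaviour, rather than with a cohort average, avoids flagging students who simply use the VLE less than their course mates.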

There are a number of areas where intervention at the national level would be of great value to the sector. These include:

  • Understanding the information that universities have available to them
  • Identifying patterns associated with success and failure. Note that these are likely to be discipline dependent, as some disciplines make much more use of the library than others. They are also likely to be institution dependent as, for instance, some universities make much more use of VLEs than others
  • Developing algorithms to identify students at risk or with sub-optimal study patterns
  • Researching methods of intervention that actually support students to succeed. There is evidence that some approaches may be counter-productive

These methods can form part of the way in which to enhance student learning and success, and where national support will enable all universities and colleges to achieve more than they could by developing the tools and algorithms for themselves.