Activity Data Synthesis

Monday, 20 June 2011

Draft Guide: 'Strategies for collecting and storing activity data'

[This is a draft Guide that will be published as a deliverable of the synthesis team's activities. Your comments are very welcome and will inform the final published version of this Guide. We are particularly interested in any additional examples you might have for the 'References' section.]

The problem:
Activity data typically comes in large volumes that require processing to be useful. The challenge is deciding where to start: at what stage to become selective (e.g. analyse student transactions but not staff) and at what stage to aggregate (combine transactions together, e.g. one record per day for books borrowed).
If we are driven by information requests or existing performance indicators, we will typically manipulate (select, aggregate) the raw data early. If, instead, we are searching for whatever the data might tell us, then maintaining granularity is essential: aggregating by time period, by event or by cohort may bury vital clues. Data protection adds a further dimension: raw activity datasets probably contain links to individuals, so aggregation can be a useful safeguard (though only a partial one, as you may still need to discard low-incidence groupings that could betray individual identity).
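To make that aggregate-and-suppress step concrete, here is a minimal Python sketch. The loans.csv file, its column names and the suppression threshold are all illustrative assumptions, not taken from any of the projects mentioned here:

import csv
from collections import Counter

K_ANONYMITY = 5  # illustrative threshold for suppressing small groupings

daily_counts = Counter()
with open("loans.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Aggregate early: one count per (date, faculty) instead of one row per loan
        daily_counts[(row["date"], row["faculty"])] += 1

for (date, faculty), n in sorted(daily_counts.items()):
    # Partial safeguard only: drop low-incidence groupings that could
    # betray individual identity
    if n >= K_ANONYMITY:
        print(date, faculty, n)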

The options:
It is therefore important to consider the differences between two approaches before you start burning bridges through selection and aggregation, or unnecessarily filling terabytes of storage.
Approach 1 - Start with a pre-determined performance indicator or other statistical requirement and therefore selectively extract, aggregate and analyse a subset of the data accordingly; for example:
  • Analyse library circulation trends by time period or by faculty or …
  • Analyse VLE logs to identify users according to their access patterns, such as time of day or length of session (a sketch of this follows the list)
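As a concrete illustration of Approach 1, here is a minimal Python sketch of the second bullet, bucketing VLE sessions by time of day. The vle_log.csv file and its login_time column are assumptions made for the sake of the example:

import csv
from collections import Counter
from datetime import datetime

def time_band(hour):
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    return "evening/night"

bands = Counter()
with open("vle_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Classify each session by the hour at which it started
        hour = datetime.fromisoformat(row["login_time"]).hour
        bands[time_band(hour)] += 1

for band, n in bands.most_common():
    print(band, n)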
Approach 2 - Analyse the full set (or sets) of available data in search of patterns using data mining and statistical techniques. This is likely to be an iterative process using established techniques (and tools), leading to cross-tabulation of discovered patterns (sketched after the examples below); for example:
  • Discovery 1 – A very low proportion of lecturers never post content in the VLE
  • Discovery 2 – A very low proportion of students never download content
  • Discovery 3 – These groups are both growing year on year
  • Pattern – The vast majority of both groups are not based in the UK (and, surprisingly, there is very little subject area or course correlation between the lecturers and the students)
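Here is a minimal Python sketch of that cross-tabulation step. It assumes a hypothetical users.csv in which earlier mining passes have already flagged each user with role, never_active and country columns:

import csv
from collections import Counter

crosstab = Counter()
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["never_active"] == "yes":
            location = "UK" if row["country"] == "UK" else "non-UK"
            # Cross-tabulate the inactive groups by role and location
            crosstab[(row["role"], location)] += 1

for (role, location), n in sorted(crosstab.items()):
    print(role, location, n)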
Additional resources:
Approach 1 – The Library Impact Data Project (#LIDP) had a hypothesis and went about collecting data to test it - http://library.hud.ac.uk/blogs/projects/lidp/
Approach 2 - The Exposing VLE Data project (#EVAD) was faced with the availability of around 40 million VLE event records covering 5 years and decided to investigate the patterns - http://vledata.blogspot.com/

Recommender systems (a particular form of data mining used by the likes of supermarkets and online stores) typically adopt Approach 2, looking for patterns using established statistical techniques - http://en.wikipedia.org/wiki/Recommender_system and http://en.wikipedia.org/wiki/Data_Mining
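For a flavour of how such systems work, here is a minimal Python sketch of the co-occurrence counting ("users who borrowed this also borrowed...") that underlies many item-to-item recommenders. The borrowing histories are invented for illustration:

from collections import Counter
from itertools import combinations

# Illustrative borrowing histories, one set of items per user
histories = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_b", "book_c"},
]

# Count how often each unordered pair of items appears in the same history
co_occurrence = Counter()
for items in histories:
    for pair in combinations(sorted(items), 2):
        co_occurrence[pair] += 1

# Rank the items most often borrowed alongside book_a
partners = Counter()
for (x, y), n in co_occurrence.items():
    if x == "book_a":
        partners[y] += n
    elif y == "book_a":
        partners[x] += n
print(partners.most_common())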
