Activity data typically comes in large volumes that require processing to be useful. The challenge is knowing where to start and at what stage to become selective (e.g. analysing student transactions but not staff) and to aggregate (adding transactions together – e.g. one record per day for books borrowed).
If we are driven by information requests or existing performance indicators, we will typically manipulate (select, aggregate) the raw data early. Alternatively, if we are searching for whatever the data might tell us, then maintaining granularity is essential (e.g. if you aggregate by time period, by event or by cohort, you may be burying vital clues). There is, however, the added dimension of data protection: raw activity datasets probably contain links to individuals, so aggregation can be a useful safeguard (though only a partial one, as you may still need to discard low-incidence groupings that could betray individual identity).
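The select-aggregate-suppress sequence above can be sketched in a few lines of Python. Everything here is hypothetical (the record layout, the course names and the suppression threshold are illustrative, not taken from any real system):

```python
from collections import Counter

# Hypothetical raw circulation records: (user_id, role, course, date)
raw_loans = [
    ("u1", "student", "History", "2011-10-03"),
    ("u2", "student", "History", "2011-10-03"),
    ("u3", "staff",   "History", "2011-10-03"),
    ("u4", "student", "Physics", "2011-10-03"),
    ("u5", "student", "History", "2011-10-04"),
]

# Selection: keep student transactions only
student_loans = [r for r in raw_loans if r[1] == "student"]

# Aggregation: one record per (course, date) with a loan count;
# individual identifiers are discarded in the process
daily_counts = Counter((course, date) for _, _, course, date in student_loans)

# Partial data-protection safeguard: suppress low-incidence groupings
# that could betray an individual's identity (threshold is illustrative)
THRESHOLD = 2
published = {k: v for k, v in daily_counts.items() if v >= THRESHOLD}
```

Note that the suppression step is exactly the trade-off described above: the ("Physics", "2011-10-03") grouping with a single loan is thrown away because it could point to one identifiable borrower.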
It is therefore important to consider the differences between the two approaches before you start burning bridges through selection/aggregation or unnecessarily filling terabytes of storage.
Approach 1 - Start with a pre-determined performance indicator or other statistical requirement and therefore selectively extract, aggregate and analyse a subset of the data accordingly; for example:
- Analyse library circulation trends by time period or by faculty or …
- Analyse VLE logs to identify users according to their access patterns (time of day, length of session)
Approach 2 - Take the raw data at full granularity and mine it for whatever patterns emerge; for example:
- Discovery 1 – A very low proportion of lecturers never post content in the VLE
- Discovery 2 – A very low proportion of students never download content
- Discovery 3 – These groups are both growing year on year
- Pattern – The vast majority of both groups are not based in the UK (and the surprise is very low subject area or course correlation between the lecturers and the students)
Approach 1 – The Library Impact Data Project (#LIDP) had a hypothesis and went about collecting data to test it - http://library.hud.ac.uk/blogs/projects/lidp/
Approach 2 - The Exposing VLE Data project (#EVAD) was faced with the availability of around 40 million VLE event records covering 5 years and decided to investigate the patterns - http://vledata.blogspot.com/
Recommender systems (a particular form of data mining used by the likes of supermarkets and online stores) typically adopt Approach 2, looking for patterns using established statistical techniques - http://en.wikipedia.org/wiki/Recommender_system and http://en.wikipedia.org/wiki/Data_Mining
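One of the simplest established techniques behind such systems is item-to-item co-occurrence counting ("people who borrowed X also borrowed Y"). A toy sketch with invented borrowing histories, not a production recommender:

```python
from collections import Counter
from itertools import combinations

# Hypothetical borrowing histories: user -> set of items borrowed
baskets = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
    "u4": {"A", "C"},
}

# Count how often each ordered pair of items appears in the same basket
pair_counts = Counter()
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

def recommend(item, n=2):
    """'People who borrowed X also borrowed...' ranked by co-occurrence."""
    scores = Counter({other: c for (i, other), c in pair_counts.items()
                      if i == item})
    return [other for other, _ in scores.most_common(n)]
```

Note that, like the aggregation example earlier, the co-occurrence matrix contains no user identifiers, so it can be retained and published more safely than the raw transaction log it was derived from.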