24 January 2012

More thoughts about data

Is managing data a bit like designing a fountain, striking a balance between a contained flow and a deluge?

Last week I attended an event about data management, led by the Digital Curation Centre. There's also a new book out about this topic: Managing Research Data, edited by Graham Pryor and published by Facet.

The session I attended here in Sheffield focused particularly on the importance of effective data management planning and introduced us to the DMP Online tool. It offers an interesting way of building a data management plan, particularly for supporting applications for funding from research councils.

One issue in data management that seems to receive different emphasis in different contexts is the question of when not to share or retain data. Are all data equal or are some data more equal than others? For example, in my project I can imagine a possible future use for raw quantitative data from a survey I hope to carry out later this year. However, I don't think the same applies to qualitative interview data. I'm almost tempted to say that these data are just too unique: the process of carrying out the interviews, and my subjective participation in them as a researcher, influence them too strongly for them to be truly reusable by others. I also think that even carefully "anonymised" qualitative data, when they take the form of a full transcript, are rarely truly, fully "anonymous".

It's also interesting to think about this in the context of library collection decisions. How do we assess the potential use or usefulness of an item (or a dataset) before adding it to a collection? Would some version of a 20:80 rule, in which a small share of items accounts for most of the use, apply to large data collections, as it is said to apply to print collections?

Another question which interests me is the division between different types of data. In November it was announced that more government data will be made freely accessible including, controversially, the potential release of anonymised health data. There's more detail here, but it's interesting that the separation between "research data" and "data potentially useful for research" seems so clearly established. And what about the grey data generated by organisations which are neither public sector nor research-led, perhaps including social enterprises?