Science

Sat, 2 May 2009

DCC/RIN data management workshop

Just coming back from the Digital Curation Centre/RIN data management workshop. It was an interesting meeting with lots of discussion; I knew relatively few people there, so it was a new environment for me. A lot of it was about establishing the value of data sharing, something which is hard to do because, by definition, there is a time lag between the lodging of data and the point at which it actually gets used.

One of the problems with establishing value was that the talks seemed to cover two different types of repository. For example, Matthew Woollard of the UK Data Archive really is maintaining a data archive: the datasets there are curated for metadata, but other than that, it's the raw dataset. On the other hand, most of the value that Jenny Walsby of the British Geological Survey talked about came from "value-added", secondary analysis of their primary data sources; for example, rather than releasing raw data about the presence of clay soils, they release risk factors for subsidence to the insurance companies. Likewise, in biology most of our database "curators" are actually annotators; they add more to the data than it started off with.

I found a couple of the talks slightly concerning. Simon Hodson from JISC was talking about institutional repositories, something I have ranted about before. From questions afterward, he is clearly not limiting the proposed support to single institutions, which is a good thing; structuring knowledge along the lines of the current financial and managerial organisation of the universities, rather than along the lines of, well, something sensible and comprehensible, does not seem a good idea to me.

Secondly, Adam Farquhar of the British Library gave a detailed talk about their plans for DOIs for data. I think that the social aspect, that data should be a first-class citizen alongside papers, is a good idea; well, sort of; actually, I think we should value data and not the current publication process, but that's a slightly different argument. I'm not convinced by DOIs though. Technologically, DOIs are handles, with some social conventions layered on top for the publishing industry. The technology is good, but the social conventions are just wrong for data: an individual may want to release thousands of datasets a day, rather than 100 papers a career; they may want multiple versions, or to refer to subsets. Getting this to work with DOIs doesn't make any sense because you have to fight the social conventions. Why not just use handles directly? This side-steps many of the issues (like the cost of DOIs, and the bias in the registration process towards large organisations), while still building on the mature infrastructure base (Handles) that makes DOIs successful.

After all, Nature Precedings did this for preprints; I suspect most people don't even know that they are not using DOIs.
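To make the "DOIs are handles" point a bit more concrete, here is a minimal sketch in Python. It only builds identifiers and resolver URLs and makes no network calls. The two public proxies (hdl.handle.net and doi.org) are real, and DOI prefixes do start with "10."; everything else is my assumption for illustration: the prefix 20.500.99999, the dataset name, and the dataset.vN.subset suffix layout are all hypothetical, not anything proposed at the workshop.

```python
from typing import Optional, Tuple
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/"  # public Handle System proxy
DOI_PROXY = "https://doi.org/"            # public DOI proxy


def split_handle(handle: str) -> Tuple[str, str]:
    """Split a handle into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {handle!r}")
    return prefix, suffix


def is_doi(handle: str) -> bool:
    """A DOI is a handle whose prefix starts with '10.'."""
    prefix, _ = split_handle(handle)
    return prefix.startswith("10.")


def resolver_url(handle: str) -> str:
    """Build the proxy URL for a handle; same syntax, different front door."""
    base = DOI_PROXY if is_doi(handle) else HANDLE_PROXY
    return base + quote(handle, safe="/")


def dataset_handle(prefix: str, dataset: str, version: int,
                   subset: Optional[str] = None) -> str:
    """Mint a handle for one version (and optionally one subset) of a dataset.

    The suffix layout is a made-up convention: the suffix of a handle is
    opaque to the resolver, so the naming authority is free to choose it.
    """
    suffix = f"{dataset}.v{version}"
    if subset is not None:
        suffix += f".{subset}"
    return f"{prefix}/{suffix}"


if __name__ == "__main__":
    # Hypothetical prefix and dataset name, for illustration only.
    h = dataset_handle("20.500.99999", "soil-survey", 3, subset="clay")
    print(h, "->", resolver_url(h))
    # An example DOI, handled by exactly the same code path.
    print("10.1000/182", "->", resolver_url("10.1000/182"))
```

The point of the sketch is that versions and subsets are just more suffixes under a prefix you already control, so minting thousands a day is a local naming decision rather than a registration event.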

One issue which also came up was identifiers, a pet topic of mine; in this case, identifiers for individual scientists. PLoS recently commented on this too ("I'm a number not a name", I think the title was; I'm on a Cross Country train, so can't check). It's about time we got this sorted out, and it's much more tractable than most identification issues in biology. I have some ideas about this, which I may blog about in a few days' time.
