Sat, 2 May 2009
DCC/RIN data management workshop
Just coming back from this Digital Curation Centre/RIN data management
workshop. It was an interesting meeting with lots of discussions; I knew
relatively few people there, so it was a new environment for me. A lot of it
was about establishing value for data sharing — something which is hard to
do because, by definition, there is a time lag between the lodging of data and
the point at which it actually gets used.
One of the problems with establishing value was that the talks seemed to cover
two different types of repository; for example, Matthew Woollard for the UK
Data Archive really is maintaining a data archive. The datasets there are
curated for metadata, but other than that, it's the raw dataset. On the other
hand, most of the value that Jenny Walsby of the British Geological Survey
talked about was from "value-added", secondary analysis of their primary data
sources; for example, rather than releasing raw data about the presence of clay
soils, they release risk factors for subsidence to the insurance companies.
Likewise, in biology most of our database "curators" are actually annotators;
they add more to the data than it started off with.
I found a couple of the talks slightly concerning. Simon Hodson from JISC was
talking about institutional repositories, something I have ranted about
before. From questions afterward, he is clearly not limiting the proposed
support to single institutions, which is a good thing; structuring knowledge
along the lines of the current financial and managerial organisation of the
universities, rather than along the lines of, well, something sensible and
comprehensible, does not seem a good idea to me. Secondly, Adam Farquhar of the
British Library gave a detailed talk about their plans for DOIs for data. I
think that the social aspect, that data should be a first-class citizen
alongside papers, is a good idea; well, sort of. Actually, I think we should
value data and not the current publication process, but that's a slightly different
argument. I'm not convinced by DOIs though; technologically, DOIs are handles,
with some social conventions layered on top for the publishing industry. The
technology is good, but the social conventions are just wrong for data;
an individual may want to release thousands of datasets a day, rather than 100
papers a career; they may want multiple versions, or to refer to subsets. Getting
this to work with DOIs doesn't make any sense because you have to fight the
social conventions. Why not just use handles directly? This side-steps many
of the issues (like the cost of DOIs, and the bias in the registration process
towards large organisations), while still building on the mature infrastructure
base (Handles) that makes DOIs successful.
After all, Nature Precedings did this for preprints; I suspect most people
don't even know that they are not using DOIs.
One issue which also came up was identifiers, a pet topic of mine; in this
case, identifiers for individual scientists. PLoS also recently commented on
this ("I'm a number not a name", I think the title was; I am on a Cross Country
train, so can't check). It's about time we got this sorted out, and it's
much more tractable than most identification issues in biology. I have some
ideas about this, which I may blog about in a few days' time.