DCC/RIN data management workshop

Just coming back from this Digital Curation Centre/RIN data management workshop. It was an interesting meeting with lots of discussions; I knew relatively few people there, so it was a new environment for me. A lot of it was about establishing value for data sharing, something which is hard to do because, by definition, there is a time lag between the lodging of data and the point at which it actually gets used.

One of the problems with establishing value was that the talks seemed to cover two different types of repository. For example, Matthew Wollard of the UK Data Archive really is maintaining a data archive: the datasets there are curated for metadata but, other than that, they are the raw datasets. On the other hand, most of the value that Jenny Walsby of the British Geological Survey talked about came from "value-added", secondary analysis of their primary data sources; for example, rather than releasing raw data about the presence of clay soils, they release risk factors for subsidence for the insurance companies. Likewise, in biology most of our database "curators" are actually annotators; they add more to the data than it started off with.

I found a couple of the talks slightly concerning. Simon Hodgson from JISC was talking about institutional repositories, something I have ranted about before. From questions afterward, he is clearly not limiting the proposed support to single institutions, which is a good thing; structuring knowledge along the lines of the current financial and managerial organisation of the universities, rather than along the lines of, well, something sensible and comprehensible, does not seem a good idea to me. Secondly, Adam Farquar of the British Library gave a detailed talk about their plans for DOIs for data. I think that the social aspect, that data should be a first-class citizen alongside papers, is a good idea; well, sort of; actually, I think we should value data and not the current publication process, but that's a slightly different argument. I'm not convinced by DOIs though. Technologically, DOIs are handles, with some social conventions layered on top for the publishing industry. The technology is good, but the social conventions are just wrong for data: an individual may want to release thousands of datasets a day, rather than 100 papers a career; they may want multiple versions, or to refer to subsets. Getting this to work with DOIs doesn't make any sense because you have to fight the social conventions. Why not just use handles directly? This side-steps many of the issues (like the cost of DOIs, and the bias in the registration process toward large organisations), while still building on the mature infrastructure base (Handles) that makes DOIs successful.

After all, Nature Precedings did this for preprints; I suspect most people don't even know that they are not using DOIs.
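To make the "DOIs are handles" point concrete, here is a minimal sketch rather than anything definitive. It assumes the two public proxy resolvers, hdl.handle.net and doi.org, and that a DOI can be recognised as a handle whose prefix starts with "10."; the function names are my own, the DOI is the one I quote elsewhere on this page, and "12345/dataset-0001.v2" is a purely hypothetical dataset handle.

```python
from urllib.parse import quote

# Public proxy resolvers (assumed here; a local handle server would also do).
HANDLE_PROXY = "https://hdl.handle.net/"
DOI_PROXY = "https://doi.org/"


def split_handle(identifier):
    """Split 'prefix/suffix' at the first slash; the suffix may contain more slashes."""
    prefix, sep, suffix = identifier.partition("/")
    if not (sep and prefix and suffix):
        raise ValueError(f"not a prefix/suffix identifier: {identifier!r}")
    return prefix, suffix


def is_doi(identifier):
    """A DOI is a handle whose prefix begins with '10.'."""
    prefix, _suffix = split_handle(identifier)
    return prefix.startswith("10.")


def resolver_urls(identifier):
    """Return the proxy URLs that should resolve this identifier."""
    split_handle(identifier)  # validate the shape before building URLs
    path = quote(identifier, safe="/")
    urls = [HANDLE_PROXY + path]  # every handle, DOI or not, goes via the handle proxy
    if is_doi(identifier):
        urls.append(DOI_PROXY + path)  # DOIs additionally have their own proxy
    return urls


if __name__ == "__main__":
    # The second identifier is hypothetical: a versioned dataset under a plain handle prefix.
    for identifier in ("10.1371/journal.pmed.0050201", "12345/dataset-0001.v2"):
        print(identifier, resolver_urls(identifier))
```

The only difference between the two cases is the extra doi.org entry; versioning and subsetting are left entirely to whatever convention the data producer adopts for the suffix.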

One issue which also came up was identifiers, a pet topic of mine; in this case, identifiers for individual scientists. PLoS also commented on this recently ("I'm a number not a name", I think the title was; I am on a Cross Country train, so can't check). It's about time we got this sorted out, and it's much more tractable than most identification issues in biology. I have some ideas about this, which I may blog about in a few days' time.


Functioning in the Upper Reaches of Knowledge

When Peter Murray-Rust restarted blogging recently, I must admit to having mixed feelings. His blog is very interesting, often insightful and entertaining. He is, however, rather prolific, making up a considerable chunk of my RSS inbox. This would be okay if he was dull, of course, as I'd just unsubscribe, but he isn't.

As a case in point, he recently talked about Ontological Wars; this led me to the Upper Ontology page on Wikipedia, which I'd not read before. Most of this page is not about upper ontologies but about two sides sniping at each other over why upper ontologies are or are not possible.

Since the whole idea of upper ontologies came into bio-ontologies, I have to admit to being deeply ambivalent about them. I can see the appeal, of course. There is a pleasure in fiddling about at the upper, most abstract levels of knowledge. Career-wise, upper ontologies are high-risk, but think about the potential publication rate if everyone uses your ontology; of course, the actual value might be small to each individual, but if you get a publication out of each, well, it's like writing Maniatis (famous for having a funky name, as well as the book), or BLAST (which gets cited by everyone).

The flip side is that I think there is rather little evidence that using a single common upper ontology actually aids the processes of ontology development, deployment or integration. It can help somewhat, but then people end up spending too much time thinking about the philosophy of upper ontologies; take a look at the BFO mailing list if you want to see how I have fallen into this trap. On the other end of the process, how much use is an upper ontology in terms of querying? For example, it might be good to know that the function of a test tube and the function of beta-galactosidase are actually instances of the same RealizableEntity, but does anyone ever query at this level of abstraction?

I think that the core problem here is that upper ontologies tend to be built without evidence; rather, illustrative examples are chosen and then used to derive general truths. An instance of this approach is Barry Smith's paper on "part of"; the example is a circle, half of which is red, half of which is white. Okay, but where is the evidence that this is a good example? Can we be sure that if we picked a different example, the conclusions would not have been different?

This, I think, covers the key problem. At the moment, we are attempting to build upper ontologies from the top down; those people who are interested in upper ontologies tend not to apply them to large scale projects; those people building lower ontologies tend not to discuss the applicability of upper ontologies for fear of getting shot down in flames; see Wikipedia if you like flame wars. What we need is an arbiter, some way of determining who is right and who is wrong; as a scientist, of course, I know how to do this: I do an experiment. You can argue philosophy all you like, but having one illustrative example can never outdo having several hundred actual uses. We don't entirely know how to do these experiments yet, but that's partly because we are not trying. I once got told on the BFO mailing list (and I paraphrase): YOU can do controlled experiments if you like, but I'm too busy doing science for that.

In ontology building, we need to avoid arguments like "is it correct", "is it true" or "is it reality" and replace them with "does it work". And to do this, we need to take a small step back and ask: how do we know when our ontology works and, most importantly of all, how can we guess when it's likely to work in the future? Only then can we choose with knowledge between the different upper ontologies or, indeed, none at all.

Enough philosophical ramblings; back to work.


Relative Risk

I'm kind of irritated by the response to David Nutt's comments on Ecstasy and horse riding. He's been widely slated by various politicians, desperate to get their "tough on drugs" soundbites.

"There's no comparison", said one. "Horse riding teaches you discipline", he said, making a comparison. Well, yes, maybe it does. But it is also rather likely to kill, cripple or maim you.

Whether horse riding and ecstasy are as dangerous as each other is not an issue of choice or of opinion; it's an issue of evidence and measurement. We can, potentially, answer the question. What we choose to do about it is a different issue but, having measured the risk, to argue against ecstasy on the basis of danger is a poor argument if there are many other equivalently dangerous activities.


Where Pedro has gone

Previously, I asked where Pedro had gone. Well, I'm delighted (although mystified) to find that he seems to have actually read my blog, because he's left a comment on it.

I'm not quite sure why I had such troubles finding him on Google — possibly a bad day; I think mostly it was just that his website doesn't mention Pedro's tools anymore. He's been building CAZy, which I know of, although I don't think I've used it. Poking through his bibliography, he's published on semantic similarity, one of my pet topics.

It's amazing to me how much traction Pedro's tools has got with bioinformaticians/biologists; during the Ontogenesis meeting at which I was present last night, I mentioned the website and the three who were old enough all perked up, saying "yeah, Pedro's tools was excellent". We started comparing Pedro numbers; for the record mine is currently 4 (me, Norman Paton, Mike Cornell, Pedro albeit via a genome paper), although if I get lucky with a paper under submission this will go to 2 (you'll have to wait to find out how...).

As he says, it's rewarding to observe how bioinformatics has moved on to become central to all biology; of course, it's also amazing how the web has become commonplace to the rest of our lives. At the time, it was image-poor, slow and clunky. Most of us hardly knew how to use bookmarks (if they'd been invented then, I don't remember), search engines were in their infancy, URL naming was inconsistent and changeable; it was really hard to navigate, to discover. It's perhaps not surprising that the website was such a success and remembered so fondly; the only question that remains is, how the hell did everybody find out about it in the first place?

His comment finishes with the statement that "And, if there is a final message, [it] is that with some good will, anyone can make a difference in Biology and elsewhere." What a cool bloke!

Anyway, in case my comment engine goes awry, I reproduce the quote here...

Where has Pedro gone? Well, I've been happily busy researching and teaching on my favorite subjects.

After a PhD in (Bio)Chemical Engineering at Iowa State University (ISU) in 1996, I did a Post-Doc in Grenoble and Marseille, France. In 1999 I got a faculty position in Biological Engineering at the Instituto Superior Tecnico, Lisbon, Portugal. From 2002 onwards, I moved to a new a faculty position at the University of Provence in Marseille, France where I presently teach Biocatalysis and Bioinformatics.

Since 1998, I've been developing and maintaining a database on Carbohydrate-Active Enzymes (CAZy, http://www.cazy.org), that presently constitutes a reference resource in Glycobiology and a research tool for Glycogenomics.

The "Pedro's Biomolecular Research Tools" adventure lasted from late 1993 to early 1997. It was a great learning moment for me and I'm proud of leaving my little brick on the wall of Bioinformatics. Initially developed as the web complement to an internal software locker that I maintained, the list grew up in importance thanks to the positive response from the burgeoning community of web-aware Biologists and the many encouragements and suggestions I received from users from all over the world. It was a (hopefully) good index for those pioneering days where I intuitively tried to reveal the potential of the new discipline. Naturally, I used my on research subjects to test the different tools available at the time, and this had some impact on my thesis. However, the most rewarding and interesting was to observe how Bioinformatics moved on in a few years from an obscure and marginal discipline to become absolutely central to almost all aspects of Biology and its applications. As a community, Biologists created since then an impressive dynamic that makes other scientific communities envious. And, if there is a final message, is that with some good will, anyone can make a difference in Biology and elsewhere.


Home from Workshop

Well, it was a good meeting. I enjoyed listening to the talks, although I frequently found myself a little out of my depth; perhaps both a sign of how much biology I have forgotten and how much maths I never knew. Also, I think that the conference was not ideally weighted. Some multi-track sessions with shorter talks would have helped, I think. It felt rather like the early eScience All Hands meetings.

On the way down, almost all the Newcastle people travelled together; for some reason, on the way back, we all scattered and went different routes. I thought I was on my own, going through Sheffield, but bumped into a fellow Newcastle academic on the platform, in the shape of Tom Kirkwood: Professor of Gerontology, former Reith Lecturer, and all-round rather clever chap. What brilliant and incisive observation on the state of systems biology did I make? What stunning analysis of the impact of the RAE results did I posit? "Hello," I said, "did you get the train to Sheffield too?"

Still, it isn't all bad; I did manage to prove conclusively that it is possible to survive for three days eating only two of the major food groups: fat and carbohydrates.


BBSRC Grant Holders Workshop

So far, the BBSRC Grant Holders Workshop has been fascinating. Dennis Noble's talk last night included an entertaining slagging of the Gene Ontology; entertaining, but as wrong as you can be when you confuse a gene name and a function. Nice to hear a new variation of the Sydney Brenner "but an ontology doesn't allow you to understand all of biology" argument.

I also learnt that a) without convection it would take 10,000 years to make a cup of tea (unless you invent a spoon) and b) there are, on average, 8 sausages in a tin of sausage and beans and, further, that the spread in the number of sausages per tin is narrow enough that the machine that puts them in the tin must be counting.

I also learnt that some people have too much time on their hands; that I am blogging about this means I have to include myself in this category.


SWAT4LS

At the SWAT4LS meeting in Edinburgh. After the melee of teaching over the last two months, it's a real delight to be back in a research environment, to have some real quality time writing emails, while someone is talking in the background.

So far, it's been pretty good. I'm surprised by the size (75+ delegates) and the large number of papers submitted (40ish). They seem to have really hit the timing right. The talks so far have been interesting; lots of integration, lots of querying, and far more architecture diagrams than I want to look at in one day.


The impact of Organophosphates

This paper surely has to win the prize for the most entertaining title of the year.


Wow

In the basic biological sciences, statistical considerations are secondary or nonexistent, results entirely unpredicted by hypotheses are celebrated, and there are few formal rules for reproducibility

doi:10.1371/journal.pmed.0050201

Now, that is what I call a real quote.


Where has Pedro gone?

In the dim and distant past, Pedro's list was this amazing resource for biologists. Speak to anyone of my age, and they will remember this list; in the early days of the web it was the best place to go, to find out where to find your bioinformatics tools.

Pedro's list hasn't been updated since 1995. There are still copies of it around, which Google will find for you if you want. It turns out that Pedro was, in fact, Pedro Maldonado Coutinho, who was a graduate student at the time. A little more poking uncovers his thesis from 1996; this explains why he stopped maintaining the list. A little more poking reveals very little. He worked in France for a while but then disappeared from the web record. A later Google hit suggests he might not have left biology altogether, but it's hard to tell for sure.

Pedro, early web pioneer, I salute you!


Fall out from Neuroinformatics

Well, there were a large number of specific outcomes from Neuroinformatics 2008, most of which I won't bore you with. The best idea, though, came out as a piece of humour. I was ranting (yes, I know, it's hard to believe) about public understanding of science. I'm a big fan of this because I think that as scientists we should be able to write about what we do clearly and at a level suitable for an intelligent but uninformed individual. Of course, I believe this because in Neuroinformatics, this covers me; I don't know much about brains, just computers and biology.

The suggestion was that, to every scientific paper we publish, scientists would be forced to add an explanatory paragraph; now, as I say, this was meant as a joke, but I think it's a great idea. It is the beginning of term, so it's going to take a while, but I intend to do exactly this; I shall add explanations to my publications page for each of my papers. I'm slightly worried about this, of course; it's a well-known secret but, like many scientists, I don't actually know what all of my papers are about; some of them were written by other people, some of them were written by me so long ago that I was "other people". So, I might even learn something in the process.

I shall announce releases here; the world will, no doubt, hold its breath till they turn up.


Neuroinformatics 2008 — Day Two

Today, we have neuroinformatics meets bioinformatics. I've been looking forward to this; unfortunately, I'm feeling a bit washed out having slept badly. I went to bed at 10ish (I was tired!) and went to sleep at 2ish. The room was too hot and, by bad design, I left my melatonin at home so I lack even chemical solutions.

We've started off with a talk by Ed Lein from the Allen Brain Atlas. Lots and lots of gene expression analysis!


Neuroinformatics 2008 — Day One

So far, we've had two talks, one from David Essen, one from Mary Kennedy. A nice bit of organisation because they have jumped scales — the first was mostly about brain gross anatomy and the second about molecular modelling.

A bit like its forerunner, Databasing the Brain, there is not that much informatics here. The keynotes have been very much about the neuroscience; this makes it both novel and interesting for me, although fairly heavy going at times.

It confirms my feeling that neuroinformatics is much less mature than bioinformatics; it's not really a separate discipline yet. Not that this is a bad thing; I've been at bioinformatics conferences where the "bio" seems barely relevant. If I am honest about it, I think more about computers these days and sometimes forget the point, which is understanding life, although I guess this is inevitable working in a computer science department. Less mature is another phrase for new, young and fresh. It feels good to be in this environment.


Neuroinformatics 2008

Ah, off to a conference again. Depressingly on a Saturday, so the airport is heaving. I'm going to Neuroinformatics 2008, which is a new one to me, in Stockholm, which is also new. I'm taking a poster, which seems distressingly old. Been a long time since I've done this. It's already been a struggle; some of my colleagues didn't like it, I think because neuroscientists tend toward lots of text, while I do a light-weight, advert-style, if-you-want-more-details-read-the-paper form of poster. And I hate travelling with a poster; it's hard to replace your belt while carrying a bag and an A0 poster tube. My subconscious tried to leave it at a Starbucks in Schiphol, but my better judgement forced me to go back for it.

I'm not in the best of moods: my toe, which I appear to have broken, is nothing but a dull ache, and I was frozen on the flight, having got a bath between the terminal at Newcastle and the plane. Still, the conference should be fun.


CNS'08

At the interoperability workshop. It's small but focused. But there's no network! I don't know what to do. I might even end up listening to the talks now.


Bill Bug

I've just found out the terrible news that Bill Bug has died unexpectedly; this has come as a shock to the community. Bill was a phenomenon and the sort of person that you need in science; he was interested in everything, had ideas and opinions about it all, topped with an almost childlike pleasure in it all. He was a good scientist, a motivation and a reminder why most of us got into science in the first place.

His emails and their length were legendary. He was hard work, as you had to fight through the morass of ideas, but well worth it. I only had the pleasure of meeting him once; I was looking forward to meeting him again, something that now will never be.


Doing research

Yesterday was the board of studies. Day before was the board of examiners. Conclusion: today is the first day of summer, an opportunity to apply myself, mostly fulltime, to research.

So, what have I done today? Erm, teaching. Almost all day. Life can be hard at times.


Ondex

Today is the kick-off meeting for ONDEX. This is a new project which is doing something that I've wanted to do for ages; in a nutshell, it's a large, graph-based data warehouse. It's rather similar to a proposal that I wrote with Mark Wilkinson from BioMOBY a few years back, with one important difference: the system actually exists, produced at Rothamsted over the last few years.

The new project involves integrating some other bits of technology (Taverna, text mining and so on) and a couple of specific biological examples. I think it's going to be a pretty cool project, and we should get some useful biology out of it.

Two things that I have learnt today: firstly, no one is actually sure what "Ondex" stands for and, secondly, some varieties of willow are dodecaploid. Why would any plant need that many genomes?


Ontogenesis

Am in Manchester for an Ontogenesis meeting, which is focused on tools and APIs this time; perhaps less exciting than previous meetings, but also potentially more useful; spanners are not interesting, per se, but where would we be without them?

Sean Bechhofer started off talking about the OWL API. This seems to have been a bit of a slow burn; it was started in 2002 and has taken a long time, but it's starting to get a lot wider use now, and has a bit of a community around it.


CARMEN on Tour

Just given a talk at RIKEN about metadata. People seemed very positive; there is clearly a desire to do this and to get more data types out there. I got the question about requiring too much metadata to understand an experiment; most of the rest were people saying "have you thought about using...?".

The one that I hadn't thought about is providing metadata for gold standard, generated (non-experimental) data. My initial response is to say that we should be storing the service for producing the data, rather than the data, although there are purposes for standard generated data, such as enabling deterministic behaviour of tools over "random" data.


Data Sharing in Neurosciences

There was much amusement in the CARMEN project today. The journal Neuroinformatics published what looked like an interesting article on data sharing.

Sadly, however, no one has been able to read it; it's a Springer article, it's closed access, and it costs $32 to look at. A strange and ironic reflection on the state of data sharing.

Perhaps that is what the paper says: Data is Mine!

Addendum

Immediately after posting this, I started writing some lecture notes. I have so far copied images of Northerns, Westerns and several kinds of immunofluorescence straight off the web, all legal, all thanks to the wonders of PLoS. It's even easy to attribute them because they have given all of the figures individual DOIs. Working in neuroinformatics is interesting and exciting, but it also helps to remind me how wonderful bioinformatics is.


Abbrvs bad for helf

I found out about a fascinating report about abbreviations from BBC News. The practical upshot of it all is that medics commonly use abbreviations in their records, and that a number of fatalities have been traced back to cases where they were misunderstood.

Abbreviations have got a lot of history in medicine. In many cases, they were meant to be confusing: FLK (Funny Looking Kid) or NFN (Normal for Norfolk) were designed to express something that the doctor didn't want the patient to see.

The whole problem here is that the user interface is wrong. The person writing the notes is trying to save themselves effort, to the detriment of the reader. What we really need is something better to interact with, where the data is quick to input but the underlying representation is precise. Difficult to do with the paper and pens that most doctors still seem to use.

At the same time, I saw a blog post about Scrivener. This is a new form of word processor, which is attempting to consider the way that authors work. As a long term LaTeX user, I am somewhat isolated from the horror of Word, but I can still appreciate the desire. To be dealt with by the application like an author rather than a typesetter is something that Word has still failed on. Scrivener has features like proper outlining and the ability to attach notes. LaTeX gets outlining right (Word fails because most people use the physical style markups rather than the "heading" markups), but I love the synopsis idea that Scrivener has. Notes I currently do as comments in LaTeX, but something better would be good.

The irony here is that the problem is backward from the medical notes. The author wants to write much more than the reader actually sees. Word is more scalable than 10 years ago, and has more fonts, but the user interface is basically the same. Perhaps it is time for a change?


Start the Week with Craig Venter

Craig Venter was on Start the Week. I meant to miss it, but ended up listening by random chance. It was strange; he was thoughtful, understated, entirely reasonable and only talked about how great he was once. Unexpected to say the least.


Jim Watson

I was going to see him talk on Sunday at Newcastle but it turns out that he's gone home instead. I'm reasonably irritated about this, to be honest. I mean, I know he keeps on coming out with these daft statements, but I was going to hear what he had to say; more just to experience a piece of history. Maybe a bit pathetic, but his work has helped to define my own working life and it would have been good to see him.

There seems to be a theme running along here. I was hoping to see Bo Diddley earlier this year but then he had a stroke.


Incentives

I notice that the CBI are suggesting that the government should provide £1000 bursaries for students starting science programmes. Don't get me wrong, I think that this would be a good thing, but I can't help but wonder: would it not make more sense to just pay scientists more? After all, a bursary is for a course, and a salary is for life.


ISMB finishing

Finally, ISMB is coming to an end. The database and ontologies track had a couple of interesting talks, with Suzi Lewis' being the day before. To finish off, I am in an Open Science meeting; it is rather smaller than I thought it would be and not very well attended, but then it's at the end of the conference.

Not a bad conference, but too long as always.


ISMB

Yesterday was the SIG co-ordinators meeting for ISMB. One of the big and recurrent issues (besides the timing of coffee breaks) was the timing of ISMB. At 7 days, ISMB is a long, long conference and is a bit of a killer. Of course, bringing it down to 4 days will mean that more events will run concurrently. Live with it, I say.

Bio-Ontologies was a success, but I want to think about the future (Blair-like, perhaps I am thinking of my legacy, as I will not chair it for that much longer). Perhaps, "Bio-Ontologies: knowledge in biology" would be a way to go — I want to move the workshop away from a technology and more toward a function.


10th Annual Bio-Ontologies Meeting

Today is the day of the Bio-Ontologies SIG meeting, which I have now co-organised for 4 years or so. It's a surprisingly large amount of work to do, not least this year because we had 36 submissions. The organisation of this is a large part of the effort, but it has made for a strong programme; it's gratifying to see that we have an audience of size to match.

09:10

We had a moment of worry when the first speaker didn't register, but Mark Musen is a notable replacement, talking about representing OBO to OWL mappings.

09:30

Following Mark's talk about using more rigorous models of OWL, Simon Jupp is talking about using the more light-weight semantics of SKOS, which turns out to be well suited for document navigation.

09:50

Lina Yip covers a familiar problem, mapping between one resource and another (in this case MESH and Swissprot), to support the flow of knowledge from bioinformatics research toward medical practice.

10:10

The mapping theme was continued (you'd almost think it was planned!) by Julie Chabalier, who has mapped a number of resources to build a query warehouse.

11:00

Judy Blake has just spoken on GO annotations and exactly what they mean. It's good to see an increased formality to the relationships between a GO term and the entity that it is describing. This talk has generated the most questions so far, mostly asking for more details.

11:29

Mikel Arungen is now talking about ontology design patterns, which are analogous to software design patterns. These should help to bridge the gap between the desire to write rigorous logical definitions and the difficulty of doing so.

11:51

Daniel Schober is now describing efforts to standardise naming conventions, fitting with the theme of methods to help people produce interoperable and standardised ontologies.

12:10

Lunch, and nearly on time. Most of the lag was from the coffee break, so I don't feel that I, as timekeeper, can be held responsible for this! Next is the poster session, followed by the panel.

14:00

Well, the panel session has an element of self-indulgence about it. Robert has been doing this for much longer than I, but even for me it's four years. After such a long span, it's amazing that we have got to ten years. All of the speakers commented on how big the community has got, and that we are all a little surprised about this. The current religious themes running through bio-ontologies are also here, but so far fairly muted. A good panel all in all, and a nice marker for 10 years.

16:00 (ish)

Larisa Soldatova's talk addressed the need for a tool enabling scientists to add additional semantics to their written work.

16:30

Catia Pesquita is talking about semantic similarity, which is a topic close to my heart. An interesting and careful body of work which covered the ground well, I thought.

16:50

Kieran O'Neil is now showing some interesting research, where he has been investigating novel techniques for query building over integrated databases.

17:10

Irena Spasic talked about building term lists for metabolomics from literature mining. Once again she highlighted the need for access to full papers.

17:30

Daniel Faria took the graveyard slot, and discussed measures for protein clustering using sequence and GO information.

Conclusions

Overall a good day. It was great to have so many papers, and such a lively debate. This also marks the retirement of Robert as co-chair. His presence will be greatly missed; he's taught me everything I know about being relaxed and not faffing too much while conference organising.

Onward till next year.


Preservation for the Future

I've been attacking email systems this week. I've been helping to transfer email from the Nottingham exchange server upto Newcastle. The process has not gone easily. I think that the problem is that university IT departments think mostly about their current users, rather than users coming or going elsewhere. To me this is a real problem: for an academic, their correspondence is an essential ingredient of the historical record, their knowledge of what they have done.

Spurred on by this, I decided to recover all of my mail from the archives where I have kept it, and place it into my current email system. This is made easier for me because I have used Emacs for pretty much my entire time on a computer; I remember a DOS-based application before that. I've moved from RMAIL to Gnus, but that is it. Gnus uses a one-message-per-file, text-based format. It's pretty future-proof; I suspect in 2000 years, when people look back, they will assume that everyone used Gnus and similar applications, as all the PST files will be unreadable. There's a big gap in the middle of my email for 6 months after I got to Newcastle, when I had used Outlook. A pity.
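To illustrate why a one-message-per-file text format ages well, here is a minimal sketch that reads such a mail directory using nothing but the Python standard library. It assumes (my assumption, not something checked against any particular Gnus setup) that each message lives in a file whose name is a plain number and that anything else in the directory is an index file to be skipped; the function name `read_messages` is made up for the example.

```python
from email import message_from_bytes
from email.policy import default
from pathlib import Path


def read_messages(mail_dir):
    """Yield (file name, subject, sender) for each message file under mail_dir.

    Assumption: message files have purely numeric names; index or
    bookkeeping files with other names are ignored.
    """
    for path in sorted(Path(mail_dir).rglob("*")):
        if path.is_file() and path.name.isdigit():
            # Each file is a single plain-text message with ordinary headers.
            msg = message_from_bytes(path.read_bytes(), policy=default)
            yield path.name, str(msg["Subject"]), str(msg["From"])


if __name__ == "__main__":
    # "Mail/archive" is a placeholder path; point it at a real mail directory.
    for name, subject, sender in read_messages("Mail/archive"):
        print(name, sender, subject)
```

Nothing here depends on the mail client that wrote the files, which is really the whole point.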

My total collection of email is 1.4G in size; I've been reasonably careful about dumping 100M attachments over the years. The earliest email sent by me talks about SET domains in a Drosophila gene. The oldest email I can find sent to me comes from 1994. It's from a nice bloke I remember meeting on one of the guitar boards, called Paul R. Leach. At that time he was at Colorado. He was kind enough to send me some Herco Flex 50s from the US. These are guitar plectrums that seemed to have disappeared from the market at the time. I think I still have a few of them left. Thanks Paul! An act of generosity that I now remember 13 years later. The internet was a kinder place in those days.


Aging File Formats

An interesting article on the BBC today about digital preservation. The issue is a well-known one: file formats go out of date very quickly. They have a chap from Microsoft showing that, using a virtual machine, you can still open Word 3.0 documents; this seems to miss the point, to my mind. Great, so I can still read it, with my eyes, by looking at it. But can I compute over it? If we are to take this approach, then it might make more sense to just print out everything that we want to store and save the paper.

I think that it's good that we are moving toward open documentation standards. Microsoft's standardisation of their file formats is welcome, if belated. However, it has to be acknowledged that a large, 6000 page specification is going to be a problem in the future. It's notable that I have 15 year old LaTeX documents on my machine and on the whole they still just work; when they do not, almost all of the knowledge in them is easily recoverable with a text editor. As far as I can see, the only way that you can guarantee that a file format will be usable into the future is to make it as simple as possible.


Lightweight Repositories

Well, my rant of a few days ago did give rise to a useful discovery, which is Nature Proceedings. Lodge your PDF, get back a DOI. Nature are going great guns on this at the moment, although I think it's a pity that we tie ourselves to the mast of a publishing house.

I will definitely consider going this route for next year. Maybe we would just ask all authors to submit there, and send us a DOI.


Institutional and Subject Archives

I've been looking at options for storing papers from bio-ontologies. All I want is a place to lodge PDFs, with some standardised Dublin Core metadata, and get a DOI out. It's turning out to be surprisingly hard.

In the process, I have found that JISC has been funding a repositories programme. If you look at their architecture you see a depressing thing. They have actually got the terrible idea that "institutional" and "subject" repositories should be built into their architecture. The point is that institution and subject should be just a part of the data model that is used to store papers; by making it explicit in the architecture, it becomes fixed, unchangeable.
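A minimal sketch of what I mean by keeping institution and subject in the data model: they become ordinary metadata fields on each paper, and "institutional" or "subject" views are just queries over a single store. The class, field names and example records below are all made up for illustration; they are not taken from the JISC architecture or from any real repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """A stored paper; institution and subject are just metadata fields."""
    title: str
    institutions: frozenset
    subjects: frozenset


def by_institution(papers, name):
    """An 'institutional repository' is a query, not a separate system."""
    return [p for p in papers if name in p.institutions]


def by_subject(papers, name):
    """Likewise, a 'subject repository' is just another view of the same store."""
    return [p for p in papers if name in p.subjects]


# Hypothetical records: a cross-disciplinary paper sits in several views at once.
papers = [
    Paper("Ontology of units",
          frozenset({"Newcastle", "Manchester"}),
          frozenset({"Biology", "Computer Science"})),
    Paper("Spike sorting",
          frozenset({"Newcastle"}),
          frozenset({"Neuroscience"})),
]

if __name__ == "__main__":
    print([p.title for p in by_institution(papers, "Newcastle")])
    print([p.title for p in by_subject(papers, "Biology")])
```

If the organisation of universities or subjects changes, only the metadata needs to change; nothing about where the paper lives has to move.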

Why do I care? Well, first, as a cross-disciplinary scientist, I am scared of anything organised by subject; I always tend to fall between the cracks. As for institution, why would anyone think that the 100 year old, bureaucratic, administrative organisation of the employers of the paper authors is a good basis for organising modern science?

The best I could find is Depot, but this describes itself as a stop-gap till the authors get a proper institutional repository. Also no one is using it. It's got one biological paper, and that's under the subject heading of "Biology not elsewhere classified" — a sin against good classification if ever I saw one.

The subject classification comes from JACS. From their documentation,

C190 Biology not elsewhere classified Miscellaneous grouping which do not fit into the other Biology categories. To be used sparingly.

Entertainingly, this has a subclass (!!)

C191 Biometry Concerned with the quantitative techniques and measurement in the biological Sciences.

Which as well as being a contradiction, is a definition that is wrong.

Perhaps I should just give up and go home.


BFO and Connotea

Well, I am not sure that my brain wave on conductance worked quite as well as I had hoped. Can't win them all.

I've been playing with Connotea, which is an online reference manager with added social networking. It's quite cute actually. The basic idea is sound enough, the interface reasonable. It allows commenting and you can look at other people's stuff also. But it would be a hell of a lot better if it worked all the time. It seems to fail on a lot of DOIs, doesn't seem to work on PubMed as it is advertised to do, and can't work at all with sites that it doesn't know; you would have thought that some heuristics would do the trick for most pages. After all, it works for Google Scholar.


Conductance

Spent a large part of this week arguing about conductance and how to model it ontologically with Pierre Grenon, one of the authors of BFO. The basic scenario is a membrane: is conductance a property of the ions travelling through it, or of the membrane?

I won't repeat the argument here, but I had a blinding flash of light last night and realised what the solution was, which I shall post tomorrow. I even know how I would represent the solution in an OWL ontology. How this maps to BFO, I have no idea, and I'll be interested to find out how it works.

It's been an interesting discussion; I am still rather sceptical about BFO, largely on the grounds of its supposed "realism". I don't understand this. Claiming to be representing reality appears to me to be rather arrogant and, essentially, faith based. Worse, I can't see clear criteria for determining whether something is real or not. Is the notion of a dimension real or not? Or does it depend on whether they are representing space and time or something else? I can't see it. Moreover, if you insist on representing "universals" rather than concepts, I don't think that you can represent multiple (potentially contradictory) descriptions of the same observations; in short, you deny the possibility of an extra layer of abstraction, which I think that you need.

Having said all of this, I've enjoyed the discussion with Pierre. It's been hard at times, and we've worked through the example slowly. He seems to be a nice chap. This seems like a good thing to me. I'd like to understand the realist position better than I do, and it's nice to find someone who I can talk with discursively, even when I am rather dubious about their technical position.


British Neuroscience Meeting

Yeah, this isn't an April fool. It's Sunday, and I am working. Today is the kick-off meeting for the CARMEN project, and tomorrow we move into the British Neurosciences Association meeting. I've never been to a neurosciences meeting so I am looking forward to it. We've spent the last few days getting a demo working for it, which has been fairly stressful — as demos tend to be — but we got it working in the end.

One of the recurrent themes that I've heard before within CARMEN is that people are more than willing to give us their data; if so, it will run counter to many of my expectations. In general, getting anyone to provide data and sane metadata is a hard task; I really hope this works straightforwardly within CARMEN, and that I am proved wrong.


Ontogenesis

Having a wonderful time at the 2nd Ontogenesis meeting. I've just escaped from teaching for the year, and have managed to fill my diary for the next two weeks with research.

There's been a large amount of discussion about ontology building. The practical upshot of this is that the two most important tools are the phone and the plane. It's all about talking to people.

We need more and better tools for allowing collaboration on ontologies; we need easy to use interfaces which encourage people to make small contributions, while retaining formality. We need to make better use of the internet; Skype has turned out to be a boon, but its telecon capabilities are poor. Best of all, we need to be able to put our feet up, share a coffee, beer and scrap paper without being in one place.

I think my proudest moment was when I spent 5 minutes managing to make the point that sometimes people take a long time to actually say anything.


Foundational Ontology

Prompted by Matt Pocock, I've just had a look at the Basic Formal Ontology (BFO). I have to admit to having been left feeling very confused. A SpatialRegion has to be either a Zero, One, Two or ThreeDimensionalRegion, which seems to preclude other dimensions. They are suggested to be immobile, but I can't see that this has any meaning. Also, SpatiotemporalRegion is a sibling of TemporalRegion, but not of SpatialRegion (they share a common grandparent), and all three are disjoint from each other.

More reading is required, I guess. Unfortunately, the documentation seems a bit long. The BFO in a nutshell document is 37 pages in total.


Learning to Teach

Have spent the last three days doing a teaching and learning course. I wouldn't mind doing this course but, like most academics, I'm fairly overloaded and would be more interested in doing my own research, rather than listening to others talk about research that I am not that interested in. Still, there have been sections of the course that were quite interesting. I'm a bit distressed to find that the section on resource allocation (that is, how the finance system of the university works, and where the cash goes) has been by far the most interesting. What am I turning into?


Annoyingly Visible

Just listening to Radio 4 and getting a bit annoyed. A few weeks ago, there was a daft story about invisibility cloaks.

Look, I am sure that it works really well, and potentially it's going to have major technological applications. But it only works with one wavelength. In what sense is this invisibility? Have we really reached the stage where scientific reporting is based purely on how many inappropriate cultural references we can throw in? I like Harry Potter, but this is really starting to put me off.


Schizophrenia as a use case

At a workshop in NESC, looking at data integration in the Neurosciences.

Very interesting talk from Maryann Martone. She showed a slightly depressing slide describing the aims of the various eScience projects, which is basically interchangeable between all projects: data heterogeneity, distribution, autonomy. Like other medical research projects that I have heard of, they spent nearly three years getting the data through the various ethical approval committees before they could even think about hosting the data. The requirement for anonymity is important, of course, but the cost is enormous. It's a pity that this effort can't be shared between different projects.

Neurobase presented an interesting architecture which looks very like ComparaGRID: they have a set of wrappers mapping into a common relational data model; essentially ComparaGRID does the same thing but with an OWL based model.


Ontogenesis

Well, I was a little bit worried about my talk, as the last time I tried it, it wasn't that good. But in the end, it went reasonably well, which was nice.

The Ontogenesis meeting was a good meeting, and only partly because I was enjoying doing research so much. There was lots of discussion on the softer aspects of ontology building. What metadata do we store about ontologies, how do we get information out of domain scientists, and so on.

One slightly embarrassing thing happened: Andy Gibson referred to my talk during his, and then asked me a question about it. But I hadn't been listening, having written email most of the way through his talk. I have a good excuse: first, I'd tried to rearrange the timetable and that had gone horribly wrong, as none of the students heard about it in time; and second, I'd arranged for Keith to cover my practical session, but he ran over a dog on his bike and knocked himself about a bit. Even when I get let out for a bit, it seems teaching still has a hold on my attention.


Ontogenesis

I'm being let out of teaching for a few days to go to a meeting entitled "Ontogenesis", which is about ontology building as far as I can see.

I'm greatly looking forward to it, although being stuffed up in the buffet car of an overpacked, overheated London train is not ideal. I'm going to talk about what appear to be the differences between neuroscience and biology in terms of ontology building; I'm basing the talk on ignorance and supposition, as I haven't been doing this for long enough to know better.

I trialled the talk on Friday. It wasn't very good. I should be working on it now, but the train is too horrible to concentrate on anything serious.


All Hands 2

Finally got back from All Hands. Could have done without the meeting really, as it's left me very tight for the beginning of term and a BBSRC grant deadline. Was a good meeting though. There was an interesting talk on an ontology of units of measurement; perhaps not exciting, but everyone needs it. Peter Buneman gave a talk on why annotation is hard. His conclusion, that you need a reliable identifier system, seems fair, although problematic; reliable identifiers have been discussed before, but they require coordination and probably centralisation. While I don't hold entirely with the "404 is a feature not a bug" argument, it is true that requiring this form of centralisation brings with it many disadvantages.

Ah, term starts; bang goes any chance of science happening for the next few weeks.


All Hands

At the All Hands Meeting in Nottingham. It's changed over the years from a very poor conference when no one had anything to talk about to something more reasonable. Already had a couple of interesting discussions, one of which might help with getting a statistical ontology together for CARMEN.

The talks have been okay, although of widely different quality, from the interesting to the inconsequential. One of the big changes this year is that people are spending much more time talking about their science rather than the technology which was used to achieve it. A very good thing, to my mind. It's important that this work be kept grounded, and if projects can't get someone to talk about the science then I think that there are problems. Also, you get to hear about some new areas of science (crystallography at the moment), which has to be a step up from 15 talks in a row on "what I did with Globus, web services, other buzz word".


PhD programmes

I've been trying to appoint someone onto an EPSRC CASE studentship. The eligibility rules are a nightmare. Apart from the fact that no one knows exactly what they are (I phoned up EPSRC and no one there knew!), they appear to be largely UK only. Other EU citizens can apply, but they need a three year residency in the UK. Stupid! The PhD is an international qualification. PhD students add immeasurably to the research environment. We should be glad that talented people want to come to the UK from abroad.

The core problem is, I think, that the PhD is considered to be an education, rather than a job. Thus, we have PhD students rather than researchers. This is only to the disadvantage of the students — they get treated poorly by the University system, it's harder to get loans or mortgages. Even the tax free status is a disadvantage — it saves the employer money, while the student comes out with a large gap in their stamps.

Ho hum.


ISMB 2006

Pretty much as expected, ISMB was small this year as it was sited in Brazil; while those of us who got there really enjoyed the place, I think it put many people off. The small size of the conference made it very friendly and easy to find people, which was good. The centre itself was excellent in most ways, with the only real problem being the air con, which was fairly noisy and somewhat overwhelmed the AV.

BioOntologies took a particularly heavy hit in terms of submissions; most people needed a main conference publication to justify the travel. It was lucky that we had merged with BioLink for the year, or we would have had to cancel the day. Hopefully next year will be better, as this is our 10th anniversary meeting, a long time for a SIG to be going.

The main conference was quite good; it's noticeable that the days of the microarray normalisation and sequence searching talks are largely over; thank god for small mercies. The ontologies section was quite interesting, as two of the three papers were heavily biological in content: Katy Wolstencroft's paper was excellent (okay, I am an author, which makes me biased), while Larisa Soldatova gave a great talk on their experimental ontology, EXPO, written for the robot scientist, the videos of which were entertaining. The last paper, on an ontology of function, was more theoretical, being about an upper or middle ontology. It seemed sensible at the time, but these things need to be tried out in reality; it's hard to make a critical judgement in the short term.

Next year is Vienna. It should be better attended, but I do wonder about ISMB. Bioinformatics has now reached a point where it is part of most biologists' lives. Those with a more theoretical bent are moving off in a systems biology route; this gives them lots of opportunity to argue and discuss, which probably explains why, 3 years on, no one has a decent, clear and consistent definition of systems biology. Perhaps Brazil will mark the ending of ISMB's day in the sun?


Databasing the Brain II

Interesting day, so far. The talk on the "Cell Centred Database" was a bit of a highlight; looks like an extremely competent and capable system. They are using a very ontology-driven system, and trying to incorporate annotation into the tools which are used to generate the data in the first place. Very sensible, although it hits the problem that the ontological markup can be hard to understand.

One strange thing that I have discovered today is that almost all neuroscientists use "data" and "metadata" as plurals; bioinformaticians use either but tend, these days, much more to the singular.


Databasing the Brain

Am at the "Databasing the Brain" conference in Oslo. So far, we've had a fairly hairy start; the taxi ran out of petrol on the way. We decided to walk the last 1-2km; it turned out to be more like 5-6km, uphill with luggage and a laptop. The guy didn't even apologise or thank us for pushing him off the road.

Still, it gave me the chance for a look at the environment, which was lovely. We're up in the hills, past a ski jump, pine forest, fresh air. What more could you want (other than time to enjoy it, of course)?


Awards for New Academics

I've been writing up a document for the EPSRC CASE for New Academics award today; it's an interesting award, in that it has a fairly low bar for entry, if you can get the CASE component. One of the odd things about it, though, is that you have to submit the details of the student before you have the cash; at this stage, obviously, you can't promise the student anything, and not having the cash you can't advertise for the student. Bit of a Catch-22 really.

Some of the other requirements are a bit odd as well, all of which have what I think are unintended consequences. First, you can't have been PI on any other grant; this means that you can't really do collaborative work until you have got the first grant, because it will make you ineligible for the first grant. Second, there has to be a maximum of ten years since your PhD. This tends to discriminate against people who have not been in academia continuously, either because they have been involved in another career or involved in something else.

The basic idea behind these grants is good; I also understand that the research councils don't want them to be seen as a freebie for new academics. It's a pity that they are causing these slightly strange consequences.


Breaking an Identifiable Silence

After weeks of not much of interest happening, there was a flurry of activity today on the Semantic Web for Life Sciences mailing list. This was largely the fault of Alan Ruttenberg, who used the two words which on their own are most likely to cause an argument between bioinformaticians: "identifier" and "standard".

How depressing it is that we are still having these discussions after so much has been achieved. Bioinformaticians will use a standard when it suits them; people have been active in using GO or MGED. Identifiers, however, still remain a problem.


Cross-Cutting issues

The workshop has today been discussing cross cutting issues between neurosciences and systems biology. Funnily enough, many of them seem fairly familiar: how to visualise complex, multi-dimensional data; how to combine and standardise the representation of data; how to combine models; how to enable scientists to work cross-disciplinary; and, how to train students to work in the area in the future.

One of the main differences seems to be a cultural difference: if you put two bioinformaticians into a room, they will publish a database; in neuroinformatics this tendency doesn't appear to be there. I think that part of the reason for this is the lack of an obvious common standard representation. In bioinformatics, we worked from the DNA and protein sequence outward.


Systems Biology and Neuroinformatics

At a workshop in Edinburgh today. Thought it would be a good idea; the CARMEN project is coming up, so having some understanding of neuroinformatics would be useful. As for systems biology, thought I'd like to fail to understand some more people telling me what it actually is.


Advantages of Open Access Publication

I realised today one of the more obscure advantages of Open Access publishing. It produces a major change in the economics of scientific publishing, which is that the payment happens during publication, rather than before reading. This is entirely wrong, it seems to me. Most scientists spend far too much time publishing and not nearly enough time reading. Making people pay to publish, but allowing cost-free reading, should help to redress this balance.


Contract Law

Was good to see some friends from Manchester up north. Michael Parkin and Dean Kuo came up and talked about a protocol that they are developing which is based around contract law; the idea is that this is a form of negotiation which they should just be able to lift and reapply to computer science.

It was a good talk which caused lots of interest. Indeed, I was surprised that they got through all their slides; there were so many questions; felt like much more of a discussion session.


Teaching creationism

Should we teach creationism in science? I have to say, I think, we should. I don't like the notion that you should separate out science from the rest of the world; is it alright to teach creationism outside science, but not in it; should we not be teaching, within science, the impact that science has on society?

I'm happy for science to stand up on its own merits; by attempting to protect it from creationism, we are also preventing ourselves from describing its strength.


The Economics of Science and Teaching

Had a slightly daft conversation in the pub last night, covering science, industry and economics. As is inevitable from such a conversation, this failed to reach any big conclusions.

Thinking about it later, though, I've decided that research and teaching have fundamentally different economics. Thinking back into the past, my educational experiences have all been valuable to me; just not that valuable, at least not for a given piece of teaching. Teaching, then, seems to pay off in that, for a given course, your chances of getting some return are high, but the return is likely to be small: anything you learn you are going to use, just not that often.

Science and research in general are very different; most of the research done in the world, more or less by definition, comes to nothing at all. Some of it, however, pays off in a huge way. Occasionally, a small piece of research changes the world. So, the chances of getting a return are small, but the potential return is huge.
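To put the contrast in rough arithmetic, here is a small sketch that treats each activity as a single gamble: with some probability you get a payoff, otherwise nothing. The probabilities and payoffs are invented purely for illustration (they are not measurements of anything), and the function names are my own.

```python
def expected_return(probability, payoff):
    """Mean return of a gamble that pays `payoff` with the given probability."""
    return probability * payoff


def variance(probability, payoff):
    """Variance of the same all-or-nothing gamble."""
    return probability * (1 - probability) * payoff ** 2


# Illustrative, made-up numbers: teaching almost always pays a little,
# research almost never pays but occasionally pays a great deal.
teaching = (0.9, 1.0)
research = (0.001, 1000.0)

if __name__ == "__main__":
    for name, (p, v) in (("teaching", teaching), ("research", research)):
        print(name, expected_return(p, v), variance(p, v))
```

With these numbers the average returns are about the same, but the spread for research is several orders of magnitude larger, which is the difference I am trying to describe.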

It's odd that two such different activities have been combined in the education sector. From a practical point of view, the combination seems natural to me; my research provides the foundation to my teaching. But from an economic point of view, is the combination of the two sustainable?


Wave Power

"Unlike solar, or wind power, the tides move all the time"

Interesting story on the news about the first commercial wave power system. This is happening in Portugal, despite the technology being developed in this country because Portugal gives preferential treatment to energy from renewable sources.

It's great to hear that this is happening, regardless of where it is happening. It fits quite nicely with stories earlier in the week about gas prices. Currently, the problem with all renewable energy supplies is their high, up-front costs. But energy supplies are getting less dependable and more expensive with time, and renewable technologies are getting cheaper as they move toward mass production.

The quote is from a listener to the radio. I'm not sure it makes sense. The system was a Pelamis system (hinged pipes which pump hydraulic fluid as they bend). As most waves occur as a result of the wind, rather than tidal movements, the Pelamis system would be susceptible to becalming; just not very often.


20060311 Teaching Clusters

Finished teaching today for the year; well, ignoring the research projects, which might be a mistake. I don't understand exactly why I find the teaching so tiring; probably the main reason is getting on top of so much background material. Still, it's been a good thing; I've needed to get on top of MIAME for quite a while.

The lecture actually went okay. Rather than go through the data model, which would have been dull, I think, I did a "clustering exercise", which I learnt at last week's LSI meeting: everybody wrote down terms on post-it notes; then, they got arranged on the board into related clusters. In the end, we got clusters which fell neatly into the six points from the MIAME checklist. Fairly pleasing, really.


20060306 Strike

"They get great working conditions, extended holidays and commission from all the books they write", a student fumed

This quote was from the local student newspaper.

It's perhaps not surprising that students (like most of the population) are unaware of what academics actually do. Teaching, itself, takes a lot of effort, time and thought. Few students wonder where the knowledge that we try to teach actually comes from; it's on the creation of this knowledge that we spend the rest of our time. It's the reason that we don't go on holiday when the students go home.

There is a lot of cynicism among academics; when you feel part of the degree awarding, paper writing, grant applying treadmill, it's not that surprising. But academics are hamstrung in their industrial dispute not by their cynicism, but by their naivety; most of us still get a thrill and excitement out of our subjects; the pleasure in the knowledge that we teach, the excitement of extending it, is palpable. It's for this reason that most of us work silly hours. It's the reason that most of us will spend the time on strike working at home.

We find it hard to withhold our labour, because in doing so we hurt ourselves as much as we hurt others. In our market driven society, the value we put on the process of science subtracts from the value that society puts on us. Despite this, I will go on strike tomorrow; perhaps I am naive, but perhaps I like it this way.


Top Down vs Emergent Standards

There has been an interesting discussion on data standards for systems biology. This theme seems to repeat itself again and again. Despite the obvious difficulties in getting scientists to work together, slow, steady, building of standards with as broad a consensus as possible has to be the best way of doing things.


20060228 Life Sciences Interface

Went to an interesting workshop on EPSRC's LSI programme. One of the interesting things which came out of this, is that most people who actually have LSI funding are not aware of the fact.

I was a bit surprised about this. So were the people in charge of the LSI. However, they later admitted that the main reason for this was probably that they had made a strategic decision not to tell people when they got funding.


Thoughts on a Thesis

The Semantic Enrichment workshop has left me thinking about the presentation of science. The thesis, or long dissertation, of post-graduate courses is surely one of the oddities of the scientific education system. If you read "On the Origin of Species", and other literature of the time, with its slow, gentlemanly meanderings, then, perhaps, it makes sense. But, in this day and age, almost no scientific research is published in long, book form. Everything happens in the papers. Even our books are normally a collection of papers.

So, why do we force PhD students to write a thesis? It clearly is not the best training for what is to come after.


Semantic Enrichment of the Literature

Was at a workshop on Semantic Enrichment of the Literature. There was a combination of text miners, ontologists and publishers. It was a pretty interesting meeting; however, there was a lack of coherence. The problem is that, at the moment, there are too many interacting possibilities for the way scientific publishing could develop, and too many requirements. The big issues that I can see are:

  • Electronic Publishing
  • High Throughput
  • Open Access

Electronic publishing gives us enormous possibilities, of which we have barely scratched the surface. Should we go "wiki"? Should we enable annotation of papers after publication and, if so, how do we maintain the provenance and the literature curation that we have at the moment? High throughput means that we are suffering from a data deluge (or tsunami; insert your meteorological metaphor here), which means that we have to support computational amenability. Finally, the open access movement offers the possibility that we can investigate these issues, while leaving the raw data free for the future and, frankly, without having to constantly wait for the lawyers and finance people to catch up.

Each of these issues is complex enough, in and of itself. But combine all of them together, and, well, it's unsurprising that we lack coherence.


Page by Phillip Lord
Disclaimer: This is my personal website, and represents my opinion.
Science