Is Academic Publishing a Dutch Tulip Bulb?

About

Note

Abstract

Currently, academic publishing is in flux, with several different narratives colliding. First, there is now a move toward increasing openness within science with open access, open data and open source. Open access means that academic articles are freely available for all to read, and is fast becoming a standard mechanism for publishing. I will describe the roots of this movement. Second, although the web has been around for a while, academic publishing has been largely untouched by it; meanwhile the web itself is increasingly being pushed as a mechanism for large-scale data integration as part of a linked-data environment. I will describe two pieces of my own work, knowledgeblog and greycite, which enable academics to publish natively to the web as linked data. I will also describe some of the emerging publishing models from elsewhere. Finally, I will consider ways in which we as a school might alter our current practice, in anticipation of these changes.

If I remember, I will also explain why academic publishing is a Dutch tulip bulb.

A Question

Who has an article in Lecture Notes in Computer Science?

Open Science

A broad definition would be:

Open Data
Open Source
Open Access

Note

Today I am going to talk about open access, because currently there are a lot of changes happening here politically. But I want to talk about this more generically in the context of open science; the idea that, to quote a recent Royal Society report, Science is an open enterprise.

As well as talking about the background in general, I am also going to talk about some of the work that we have been doing recently, looking at how the publication process might change, most of it enabled by, or predicated on, free access to the material.

Some human interest

Taken from: http://en.wikipedia.org/wiki/File:Jonathan-eisen.jpg

A biologist from UC Davis.
Wanted to collect his father’s (also a biologist) life work
But couldn’t.

Story by David Dobbs Taken from: http://www.wired.com/wiredscience/2011/05/free-science-one-paper-at-a-time-2/

Note

Following a grand journalistic tradition, I thought I would start off with a human interest angle, with a story that I have borrowed from an article in Wired from last year. I do this both so I can warm your hearts, and so I can plug the fact that I got into the article as well.

This is a picture of Jonathan Eisen. After his father’s untimely death in 1987, he wanted to commemorate his father’s work by collecting all his papers. But he couldn’t, because many of them were not available and most had paywalls.

Harvard runs out of cash

"fiscally unsustainable"
"academically restrictive"
"online content from two providers have increased by about 145% over the past six years"
"exacerbated by […] publishers to acquire, bundle, and increase the pricing"

Taken from: http://www.guardian.co.uk/science/2012/apr/24/harvard-university-journal-publishers-prices

Taken from: http://isites.harvard.edu/icb/icb.do?keyword=k77982&tabgroupid=icb.tabgroup143448

Note

This was followed up by this amusing story. These are quotes from a memo sent by the library at Harvard to its faculty. Yes, Harvard are running out of cash.

The basic problem identified by Harvard library is two-fold. First, the publishers keep on increasing their prices. It’s actually hard to know for sure how much they are doing this, of course, because most publishers require an NDA for deals done with individual libraries. Second, there is bundling, also known as "big deals". This is the process whereby a library buys not one journal but many at a discount from a publisher; the problem is that libraries often find that the prices increase steadily over time.

Open Access

Pioneered by BMC — an online journal
Originally took the bottom end of the market
Followed by PLoS
Originally aimed at the top end, and grant supported

Note

One potential solution to the problem is open access. This was pioneered by BioMedCentral from about 2002. This was an online publication house, with the end papers being free for everyone to read. They made their money by charging authors for publishing. Interestingly, this wasn’t seen as terribly novel because in biology, page charges were common anyway and often substantial (1000 a page for instance).

They originally took the low end of the market. Which is why a couple of years later, PLoS was formed. It came about because a number of bioinformaticians were getting disgruntled at their inability to text mine articles; to continue with the human interest angle, one of these was Michael Eisen, brother of Jonathan mentioned earlier. Are your hearts all warm yet?

Now, the first time that I saw Michael Eisen talk about this they had an interesting angle: they wanted to do nothing innovative. The publication process had to be as much like the existing one as possible because, they figured, one change at a time. Good idea.

Open Access (10 years on)

BMC Bioinformatics is now high impact
PLoS has 6 main journals
And PLoS One

Note

So what has happened in the last 10 years? Well, first, open access has become accepted, particularly in some fields. BMC Bioinformatics, for example, is now a high impact journal. In biology, about 20% of papers are open access. PLoS has 6 main journals now (PLoS Biology is edited by Jonathan Eisen).

And PLoS One. It came later, and unlike the "main" journals in PLoS it is turning out to be revolutionary.

PLoS One

Has peer-review
Judges on scientific rigour
Not on perceived importance
Now has impact factor 4.4
In 2010 > 6000 articles
the largest journal in the world

Note

PLoS One is online and open access. It charges for publication. It is peer-reviewed, and this peer-review judges on scientific rigour of the work. Here is the revolutionary part: it does not judge on the basis of perceived importance of the work. Basically, it has removed itself from the last shackle of tree-based publication. The marginal cost of publication is now small. There are no issues, and anyway, these days people get to articles via Google.

It now has an impact factor of 4.4, although, incidentally, PLoS has a publicly stated policy that IFs are nonsensical, which PLoS One shows clearly. In 2010 it published more than 6000 articles, which made it the largest journal in the world. In 2011, this number doubled.

Open Access Mandates

Many funders now mandate OA
RCUK (sort of). NIH. Wellcome.
Welcomed by all!
Research Works Act

Note

Many funders, particularly those with public aims such as Wellcome, or those funded by the public, have started to jump on board. Why should the public pay for things twice, they argue; they pay for the research, why should they pay to read about it. So there are now open access mandates from RCUK, although it’s a little vague what it actually means.

The toll access publishers, of course, have always had providing access to research materials as a key part of their mission statements, so they have been enthusiastic supporters of this. Which is where we come to the Research Works Act.

RWA

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that […] causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work

Sponsored by members of Congress who have received multiple donations from Elsevier
"private-sector research work" would include papers
Because, after all a paper is produced by the private sector

Note

RWA was a bill in the US Congress which caused a lot of grief, especially coming on the back of SOPA and ACTA, to which it is only broadly related. Reading the quote above, you might think what is the big deal, but it was basically aimed at NIH’s open access policy, since "private-sector research work" was seen to include anything published by a private sector publisher, even if all the work had been done within the public sector.

Cost of Knowledge

http://thecostofknowledge.com/
From an original post by Tim Gowers
Fields medalist
polymath, crowd sourcing maths proofs.
Not the first boycott

Note

All of this led to the Cost of Knowledge. Although the website was not set up by him, it was in response to an article by Tim Gowers. He is famous, if you are a mathematician, partly because he won a Fields Medal, partly for using the web to stimulate collaborative mathematics.

Basically, it’s a boycott of Elsevier. He will no longer review, author or edit for Elsevier. This is not the first such boycott; that was a decade ago. Will this one work? Hard to tell, but there is a lot more traction, and Elsevier has not helped itself by continually referring to papers as "their content". They are correct, of course, it is theirs. But many scientists are starting to ask why, given that they are the authors, reviewers, editors and readers.

Arguments for Paywall Journals

Paywall promotes quality
Australasian Journal of Bone and Joint Medicine
Paid for by Merck
Positive stories about Merck products

OA is vanity publishing

Chaos, Solitons and Fractals

Author M. S. El Naschie published >300 single author articles
A big favorite of Editor-in-Chief
M. S. El Naschie
At one point, Chaos, Solitons and Fractals was highest IF maths journal

Paywall access prevents copying

images/springer_molecule.jpg

An image of a molecule
Copyright Springer-Verlag
Actually, produced by Peter Murray-Rust
"A global resource for computational chemistry"
Journal of Molecular Modeling

A Question and an Answer

Who has an article in Lecture Notes in Computer Science?
- None of you. If it’s in LNCS, the article is Springer’s

OA or OA/2

Open Access is, however, expensive
Between 800 and 3000(!) pounds
It’s "half open access"
Ultimately, do I care? It’s not my cash!
There is also a hidden cost

Note

Open Access is, however, expensive, at between 800 and 3000 pounds. This has led a friend of mine, Bijan Parsia, to coin the phrase "half open access". It’s only open from one side of the equation. But, ultimately, do I care? It’s not actually my cash, and besides research is expensive. Publication costs are normally about 1% of the total cost of research at most.

But I am going to argue that there are a number of hidden costs, and that these costs are ones that you will or should care about.

The Costs

The Second Biggest Scientific Publisher
250,000 articles per year
240 million Downloads
Cost: 1.5 Billion Euro
Elsevier

Note	So, let’s consider the costs. First, let’s consider the second biggest scientific publisher in the world. Some basic stats. This is Elsevier. Actually, Elsevier is not that common in Computing, but has an enormous life sciences presence. Springer-Verlag is of a similar size.

The Costs

The First Biggest Scientific Publisher
17 million articles
> 20 languages
365 million readers
Total Cost: 10 million dollars
Wikipedia

Note

And the biggest publisher of scientific literature in the world. Wikipedia. Now, of course, this isn’t an entirely fair comparison: Wikipedia also publishes a lot of non-scientific literature (the figures from Elsevier also include this). But it is not two orders of magnitude unfair either.

So why is it so expensive? Elsevier’s profit margin? Well, no, because even its 40% profit margin isn’t enough to explain this.
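As a rough check on the orders of magnitude, the figures from the two "The Costs" slides can be divided out per article. This is only a back-of-envelope sketch: it assumes euros and dollars are near enough equal for this purpose, and it takes the slide totals at face value even though one is an annual figure and the other a running total.

```python
import math

# Figures as quoted on the two "The Costs" slides.
elsevier_cost = 1.5e9        # euros
elsevier_articles = 250_000  # articles per year
wikipedia_cost = 10e6        # dollars
wikipedia_articles = 17e6    # articles

# Assumption: euros and dollars are treated as roughly equal, which is
# good enough for an order-of-magnitude comparison.
elsevier_per_article = elsevier_cost / elsevier_articles
wikipedia_per_article = wikipedia_cost / wikipedia_articles

# How many powers of ten separate the two per-article costs.
orders_of_magnitude = math.log10(elsevier_per_article / wikipedia_per_article)

print(f"Elsevier:   {elsevier_per_article:,.0f} per article")
print(f"Wikipedia:  {wikipedia_per_article:.2f} per article")
print(f"Difference: about {orders_of_magnitude:.1f} orders of magnitude")
```

On these numbers the gap is around four orders of magnitude, so even a hundred-fold unfairness in the comparison leaves a large difference to explain.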

The process

images/800px-Herrschaftliche_Kutsche.JPG

The process

images/500px-LaTeX_cover.svg.png Taken from: http://commons.wikimedia.org/wiki/File:LaTeX_cover.svg

Note	Let’s consider a typical publishing process. This process is, incidentally, not Elsevier’s; it is from PLoS, in fact PLoS One. It is not unlike that of other journals, however. I generally write my articles in LaTeX, because I happen to like it. You may use Word, but the point remains.

The process

images/500px-Adobe_PDF_Icon.svg.png Taken from: http://en.wikipedia.org/wiki/File:Adobe_PDF_Icon.svg

Note	I convert this into PDF on my machine and then upload it, along with all the TeX source in case they need it for some unspecified reason.

The process

images/500px-Adobe_PDF_Icon.svg.png

Again!

Note	My PDF gets converted into another PDF. As far as I can tell, this happens at the level of a PDF→PDF conversion. I still have no idea what this is hoping to achieve. My belief is that Word docs also get converted to PDF at this point.

The process

images/500px-Microsoft_Word_Icon.svg.png Taken from: http://en.wikipedia.org/wiki/File:Microsoft_Word_Icon.svg

Note	This PDF gets converted into a Word doc. Hmmm. Surely, you say, this makes no sense? Even worse when you discover that this conversion process involves someone copying the PDF into Word.

The process

images/500px-XML.svg.png Taken from: http://en.wikipedia.org/wiki/File:XML.svg

The process

images/HTML5_Logo_512.png Taken from: http://www.w3c.org/html/logo

Note	Actually, it’s HTML4, but I couldn’t find an HTML4 logo.

The process

images/500px-Adobe_PDF_Icon.svg.png

Again, Again!

Hidden Cost

Compare to arXiv

The "physics" pre-print server
Cut and paste in standard metadata
Upload .tex and .bbl file (also images)
View PDF
Click "go"
Cost $7 per paper
Takes about 7 minutes (n=1) from LaTeX to published

Note

Now we can compare this to arXiv which was originally a preprints service for physics and now takes articles from a much wider set of disciplines including computing science. It includes a standard(ish) and stable identifier, allows metadata harvesting and sends out emails and stuff. On a short survey, n=1, this takes an average of 7 minutes to complete from start to finish.

Compare to Wordpress

Open Lab Book
Write in Word
Click "go"
Publication time measured in ms

Note

As part of my commitment to open science, I also use Wordpress to host my open notebook.

Note here that I have carefully phrased this to avoid the word "blog". For the most part I do not write about my life, my hobbies or my cat here. Although I did use to talk about my dinner. Over time, my blog has evolved to become largely professional. And publishing there is very, very easy. Actually, I don’t use Word. But, regardless of what you use, the publication process, as opposed to the authoring process, is very quick, and is best measured in milliseconds.

Ontogenesis

Wanted to publish tutorial information
Writing a book is painful
Open, public peer-review
We now have around 30 articles, published over 2 years
Many articles short, discrete
We have published unpublishable material

Note

So, where does this new-found rapidity leave us? Well, we started experimenting with this a couple of years ago. In my other life, I build ontologies. It isn’t easy, not helped by the total lack of tutorial information available.

We thought about writing a book, but it never happened. It never happened because we had all got so irritated with publishers taking material and then, after a year of no communication, suddenly sending an email saying "here are the proofs, please correct within 5 days." Books also require either very significant pieces of writing, or are multi-multi-author. Either way it’s a pain.

In some cases, we wanted to publish quite short material. So, I wrote a nice article on "do cyclists pay tax?" which turns out to be a good example describing the difference between universal and existential quantification, as well as roles and inheritance. It’s 2 pages, it’s complete. Similarly, "what is disjointness".

In short, ontogenesis is full of unpublishable material. It’s unpublishable not because the material is bad, but because the publication process is bad. We now have 30 articles, and about 30000 page views.

Linked Data

Now we have a simple process
Which we have extended
Knowledgeblog
We can publish linked, semantic articles

Note	But there’s more…. We now also have a simple process, with no human involvement. We can now start to push semantics through this process, we can make an article a part of a linked data environment. This allows us to do interesting things.

References

Academics love in-text citations (Lord, 2012)
Will describe two tools, kcite and greycite

References

Lord, P (2012) Academics love Citations. J.Unsupport.Assert

References (author)

Authors insert primary identifiers
[cite]10.100/100.1[/cite]
This is a DOI.
We also do arXiv, pubmed IDs.

images/kblog-with-edam-with-more-references.png

Note

We are now part of a linked data environment. The article is an active mashup. None of the metadata you see here is embedded. All of it is gathered from other sources. Moreover, every reference has a URL, and can be unambiguously identified (in the sense that we can be sure what we are pointing to; comparison of two references is a little harder).

And the author has gained some advantage. They don’t need to type the metadata in, and if they get the reference wrong they will see it straight away.
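To make the author's side of this concrete, here is a minimal sketch of pulling identifiers out of `[cite]` shortcodes and guessing what kind each one is. It is not kcite's actual code: the regular expressions are illustrative heuristics of my own, and the arXiv-style identifier in the sample text is made up.

```python
import re

# Matches the [cite]...[/cite] shortcode shown on the slide.
CITE_RE = re.compile(r"\[cite\](.*?)\[/cite\]")

def classify_identifier(identifier):
    """Guess the identifier type. These patterns are illustrative heuristics."""
    if re.fullmatch(r"10\.\d+/\S+", identifier):
        return "doi"
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", identifier):
        return "arxiv"
    if re.fullmatch(r"\d+", identifier):
        return "pubmed"
    return "unknown"

def find_citations(text):
    """Return (identifier, type) pairs in order of appearance."""
    found = []
    for raw in CITE_RE.findall(text):
        identifier = raw.strip()
        found.append((identifier, classify_identifier(identifier)))
    return found

# Sample post text; the second identifier is a made-up placeholder.
post = (
    "Academics love citations [cite]10.100/100.1[/cite] "
    "and preprints [cite]1234.5678[/cite]."
)
for identifier, kind in find_citations(post):
    print(kind, identifier)
```

Once the identifier type is known, the metadata itself can be looked up elsewhere, which is why the author never has to type it in.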

References (reader)

Readers can see citation or direct link
Readers can change citation style

images/kblog-with-edam-with-numeric-listing.png

Note	We also have an active document. The reference style is no longer handed down from on high. The reader can choose how they want to see things.
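A minimal sketch of what "the reader chooses" means in practice: the same metadata record is rendered in whichever style is asked for. The record and the two style names are assumptions for illustration, not kcite's real data model.

```python
# Hypothetical metadata record; in practice this would be gathered from
# an external source rather than typed in by the author.
references = [
    {"author": "Lord, P", "year": 2012, "title": "Academics love Citations"},
]

def render_citation(index, ref, style):
    """Render an in-text citation marker in the style the reader picked."""
    if style == "author-year":
        surname = ref["author"].split(",")[0]
        return f"({surname}, {ref['year']})"
    if style == "numeric":
        return f"[{index}]"
    raise ValueError(f"unknown style: {style}")

for style in ("author-year", "numeric"):
    print(style, render_citation(1, references[0], style))
```

Because the style is applied at display time, switching it needs no change to the article itself.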

References (kcite architecture)

The Problem

Inserting links is painful
Fortunately, we can use the bibliographic metadata to help
Will demonstrate this after a brief interlude

Note	Inserting references in this way is a pain. However, in most cases, we can do this in a metadata driven way. We can use the same metadata that is used to generate the bibliography for the authors.

Greycite

Originally, kcite could not support URLs
As well as Publishing the Unpublishable
We want to Cite the Uncitable
We needed a source for metadata
Now we have http://greycite.knowledgeblog.org
Developed by Lindsay Marshall, Computing Science, Newcastle

Note

One substantial problem is the absence of an ability to do this for URIs. So, we wanted to address this with greycite. So, I am going to show how this works. Greycite is a new tool which mines bibliographic metadata from URIs.

Doesn’t work for all URIs. Mining is (deliberately) not too intelligent. We look mostly for things that are intended to be mined. We have also added tools to wordpress to allow flexible insertion of metadata (including with shortcodes, or through a nice GUI).

Greycite

This is greycite

Greycite

Add a URL

Note	Putting in the URL of an article on my blog

Greycite

URL results

Note	In this case, we’ve seen the URL before and greycite knows some basic metadata.

Greycite

Scrolling to more detail

Note	Looking in more detail, we can see that in March, the article had a title "Why Not?", was authored by me and dates from 2010.

Greycite

Provenance (coins)

Note

We have gathered this metadata from a variety of sources, and you can see the provenance. Coins is a dreadful standard which is now quite a few years old, but some people use it. It works by embedding a span tag into the body of the post.

Although it is dreadful in every way, that it is embedded in the body is its most useful feature, as many people don’t control their headers.
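For those who have not met it, a Coins span carries its metadata as URL-encoded key/value pairs in the title attribute of a span with class "Z3988". The sketch below reads one back out using only the Python standard library. The page fragment is invented for illustration and this is not greycite's own code.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs

class CoinsParser(HTMLParser):
    """Collect metadata from Coins spans (class "Z3988") in a page body."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The title attribute holds URL-encoded key=value pairs.
            self.records.append(parse_qs(attrs.get("title") or ""))

# Illustrative page fragment, not copied from a real site.
html = (
    "<p>Some post text.</p>"
    '<span class="Z3988" title="ctx_ver=Z39.88-2004'
    '&amp;rft.title=Why+Not%3F&amp;rft.au=Lord%2C+P&amp;rft.date=2010"></span>'
)

parser = CoinsParser()
parser.feed(html)
record = parser.records[0]
print(record["rft.title"][0], record["rft.au"][0], record["rft.date"][0])
```

Note that each value comes back as a list, since a key such as the author can legitimately appear more than once.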

Greycite

Provenance (OGP)

Note	Also we support Open Graph Protocol. Much nicer "standard", partly developed by Facebook who would appear to be much better at uncovering other people’s bibliographic metadata than, allegedly, they are at uncovering their own financial data.
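Open Graph metadata lives in meta tags in the page header, with property names starting "og:". A minimal sketch of collecting them, again with the standard library only; the page below is an invented example and the URL is a placeholder.

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect Open Graph properties from <meta property="og:..."> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.properties[prop] = attrs.get("content") or ""

# Illustrative page, not copied from a real site.
page = (
    "<html><head>"
    '<meta property="og:title" content="Why Not?" />'
    '<meta property="og:type" content="article" />'
    '<meta property="og:url" content="http://example.org/why-not" />'
    "</head><body></body></html>"
)

ogp = OpenGraphParser()
ogp.feed(page)
for name, value in ogp.properties.items():
    print(name, "=", value)
```

Unlike Coins, this needs the publisher to control the page header, which is the trade-off mentioned above.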

Greycite

And elsewhere

Note	We have used somewhat established ad-hoc standards, so it works on sites outside of our control, and also sites which are not necessarily academic. This is an article from the BBC for instance.

Greycite

And elsewhere

Greycite

Sadly, some websites have no semantics that we can find

Greycite

Preservation

Note	We can also link through to other resources, such as archive.org. We also support archive.org.uk, provided by the British Library, and are working on webcitations.org.

Greycite

Note	So, we can maintain links to the academic record even if the links break. Currently, this is not apparent in kcite generated bibliographies, but we will add redirection in soon for links which appear to have gone 404.
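The planned redirection could look something like the sketch below: if the live link has gone 404 and we hold an archived copy, point at the archive instead. This is an assumption about how it might be done, not a description of kcite. The status lookup is passed in as a function so the logic can be shown without touching the network, and all URLs are placeholders.

```python
def resolve_link(url, archived_url, fetch_status):
    """Return the live URL, or the archived copy if the live one is gone.

    fetch_status is any callable that maps a URL to an HTTP status code;
    it is injected so the decision logic can run without network access.
    """
    status = fetch_status(url)
    if status == 404 and archived_url is not None:
        return archived_url
    return url

# Stand-in for real HTTP requests: a hypothetical table of statuses.
statuses = {
    "http://example.org/live": 200,
    "http://example.org/gone": 404,
}
archive = "http://archive.example.org/copy-of-gone"

print(resolve_link("http://example.org/live", None, statuses.get))
print(resolve_link("http://example.org/gone", archive, statuses.get))
```

A link with no archived copy is left alone, so the reader at least sees where the reference used to point.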

Citing Links

How does citation work?

Note	This is the bio-ontologies website. Lots of papers on it. Many with complex metadata. For those of you interested in this sort of thing, I have now separated the metadata from the wordpress environment. Previously all the authors needed logins. PITA.

Citing Links

We want to cite this page

Citing Links

In an article, I am writing

Note	For my own editing environment, I use asciidoc, bibtex and emacs. I acknowledge that this is a little niche, but it does work with other environments also.

Citing Links

First we take the URL

Note

Actually, this is enough. It is all that you need; however, as an author I find myself citing the same URL repeatedly. Google is very good at getting you to where you want to go, but it is not perfect, particularly when there are a lot of articles on one topic. So, I wanted something quicker, searching over what I am interested in.

Citing Links

Query Greycite

images/emacs-with-edam-url-and-mxgreycite.png

Citing Links

Get back bibtex

Note

The metadata here comes from the web page, so in one sense cannot be wrong, although it can be different from what the author wants it to be. Recently, I’ve added support for Wordpress to advertise its metadata on the page as well, so it’s visible.

We can do similar things with DOIs, arxiv and the like.
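The format shifting involved is small. As a minimal sketch, assuming the mined metadata arrives as a simple dictionary, turning it into a bibtex entry looks like this. The field names, the citation key and the URL are illustrative assumptions, not greycite's actual output.

```python
def to_bibtex(key, metadata):
    """Format a metadata dict as a BibTeX @misc entry."""
    lines = [f"@misc{{{key},"]
    for field in ("author", "title", "year", "url"):
        if field in metadata:
            lines.append(f"  {field} = {{{metadata[field]}}},")
    lines.append("}")
    return "\n".join(lines)

# Hypothetical metadata as it might be mined from a web page.
metadata = {
    "author": "Lord, P",
    "title": "Why Not?",
    "year": 2010,
    "url": "http://example.org/why-not",
}

entry = to_bibtex("lord2010", metadata)
print(entry)
```

Fields that could not be mined are simply left out of the entry rather than guessed.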

Citing Links

So, we search

images/emacs-and-post-ref-with-regexp.png

Citing Links

And select

images/emacs-and-post-ref-with-reftex-dialog.png

Citing Links

And insert

images/emacs-and-post-ref-with-kurl-inserted.png

Citing Links

And publish

Note	All of this is tied together with just a little semantic glue. We needed some heuristics, we needed some format shifting, but that is it.

Accessibility

This is all fairly simple
And works because the content is OA
Greycite can get to the metadata because it is open
BL can archive, because the content is open
Archive also includes metadata because it is open
Much of it works outside academic publishing
Compare DOI, CrossRef, LOCKSS and so on.

Note

All of this works because we have an open resource, based on widely available standards. Compare this to CrossRef and DOI technology which is more complex. Compare this to LOCKSS which is more complex. And, in most cases, we are not using bespoke software specific to the academic publishing industry. Hence it works with BBC news.

Ideas: Glossary

Short 140 word articles with a title
Word, not character!
Fully attributed
Linkable
Publish via email
Displayed inline

Note

So, further ideas. We have already pursued a mashup strategy. We want to push this further. In this publication environment we can do very small-scale publishing. While others are pushing nanopublications, this is more mini-publication. You publish a short article, 140 words long, to operate as a glossary.

Links back will then appear in popups as a glossary, rather than as a hyperlink or in a reference list. The glossary will not have a single name space, so multiple definitions are possible. And all the things we have added so far will help. Greycite will provide bibtex to make the link insertion easier.

Probably going to investigate a publish by email protocol also, so that new articles are very easy to publish while still maintaining a moderation step.
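The moderation step could start with something as simple as checking the limit automatically. A minimal sketch, assuming "140 words" means whitespace-separated words in the body and that a title is required; the function names are hypothetical.

```python
MAX_WORDS = 140  # words, not characters

def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())

def is_valid_glossary_entry(title, body):
    """A glossary entry needs a title and a body of 1 to 140 words."""
    return bool(title.strip()) and 0 < word_count(body) <= MAX_WORDS

print(is_valid_glossary_entry("Disjointness", "Two classes are disjoint if nothing can be in both."))
print(is_valid_glossary_entry("Too long", "word " * 141))
```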

Ideas: Structured Knowledge

Open Disease Reports from David Shotton
Short summaries of disease information published elsewhere
Critical for third world
We want to structure the knowledge for mining.

Ideas: Enhanced Linking

Note	We want to link to other resources out there. We would like to do this intelligently. For instance, this is a tool called chemicalize which inserts implicit links by named-entity recognition. Very nice. But probably not something that you want on a cookery page.

Ideas: Linking

Ideas: The NearCon

Nearly a conference
Cross between a workshop and a special issue
Publish papers and then talk about them!
Like a workshop
- see papers you might miss
Asynchronous!

Note	Also wanted to pursue

Publishing in Flux

eLife - modelled on PLoS One
F1000 - Cross between PLoS One and arXiv
PeerJ - $100 to register, free to publish

Dutch Tulip bulb

Note

We are not the only people playing in this environment. Publishing is in a lot of flux at the moment and there are a lot of new ideas coming out. So, eLife for example, which is modelled on PLoS One and is led by Mark Patterson who previously worked for PLoS Currents. Interestingly, it’s directly supported by the Wellcome Trust.

F1000 are producing something new. This is rather an offshoot of their prepublication, and poster publication service. Probably free to publish initially but we are not sure yet.

Finally, there is PeerJ. If we can sequence the human genome for $1000, why can we not publish a paper for $99? Basically, you register for $100, then can publish for life (1 paper a year, and you have to do 1 review a year, and all authors need to be registered). This is Peter Binfield, Jason Hoyt and, most interestingly, Tim O’Reilly.

The Dutch tulip bulb? First good example of a speculative economic bubble, where the cost of a resource increased totally out of proportion to its value. Currently, academic publishing comes with a lot of costs, but does it come with any value?

What can we do?

Technical Report Series

Replace with arXiv
Recognised
Stable
Standard Metadata
Supplement with Kblog
More experimental
Web First

What can we do?

Grants

Publish all our grants on the web
Successful or not!
Knowledge blog would be a good framework

What can we do?

Online thesis

We can encourage students to publish their theses online
Allyson Lister has just done this
http://themindwobbles.wordpress.com/2012/06/14/thesis-abstract/

What can we do?

Open Notebook Science

all research active staff and students
Publish as we go!

What can we do?

Cash!

Q: Can I get cash for my PhD student to go to a conference?
A: Yes! But conference should come with publication.
All the suggested publication locations are toll access

What can we do?

Teaching

Open Education Resources
Release lecture notes online
Release all recap online
Try before you buy!

Elephant in the Room

REF
Promotion Committees
Lawyers

Note

REF is a problem: there is a tendency toward false metrics such as IF, and this is an undeniable problem.

Promotion Committees: these tend to have a bias toward high quality journals, where "high quality" is defined as "the ones I published/publish in". The older a scientist is, the more likely they are to have the bulk of their work in TA journals.

Lawyers: currently, RECAP is not going online and has, in fact, got more restrictive because of fears that we will uncover the large scale copyright violation by staff. Also, the university does not allow staff to attach CC, OA licenses to their work without permission. Although, strangely, it does allow them to give their work away without compensation and without even retaining the right to use it themselves. Hmmm.

Acknowledgements

Kblog/Ontogenesis
- Robert Stevens, Georgina Moulton (Manchester).
- Dan Swan, Simon Cockell (Newcastle)
Kcite
- Simon Cockell
Greycite
- Lindsay Marshall

Summary/Conclusion

Open Access == freely available papers
Part of the move to Open Science
Happening whether we like it or not
It should be seen as an opportunity not a risk