About

images/permission_to_blog.jpeg
Note

Abstract

Currently, academic publishing is in flux, with several different narratives colliding. First, there is now a move toward increasing openness within science with open access, open data and open source. Open access means that academic articles are freely available for all to read, and is fast becoming a standard mechanism for publishing. I will describe the roots of this movement. Second, although the web has been around for a while, academic publishing has been largely untouched by it; meanwhile the web itself is increasingly being pushed as a mechanism for large-scale data integration as part of a linked-data environment. I will describe two pieces of my own work, knowledgeblog and greycite, which enable academics to publish natively to the web as linked data. I will also describe some of the emerging publishing models from elsewhere. Finally, I will consider ways in which we as a school might alter our current practice, in anticipation of these changes.

If I remember, I will also explain why academic publishing is a Dutch tulip bulb.

A Question

Who has an article in Lecture Notes in Computer Science?

Open Science

A broad definition would be:

  • Open Data

  • Open Source

  • Open Access

Note

Today I am going to talk about open access, because currently there are a lot of changes happening here politically. But I want to talk about this more generically in the context of open science; the idea that, to quote a recent Royal Society report, Science is an open enterprise.

As well as talking about the background in general, I am also going to talk about some of the work that we have been doing recently, looking at how the publication process might change; most of it is enabled by, or predicated on, free access to the material.

Some human interest

images/Jonathan-eisen-cut.jpg
  • A biologist from UC Davis.

  • Wanted to collect his father’s (also a biologist) life work

  • But couldn’t.

Note

Following a grand journalistic tradition, I thought I would start off with a human interest angle, with a story that I have borrowed from an article in Wired from last year. I do this both so I can warm your hearts, and also so I can plug the fact that I got into the article as well.

This is a picture of Jonathan Eisen. After his father’s untimely death in 1987, he wanted to commemorate his father’s work by collecting all his papers. But he couldn’t, because many of them were not available and most had paywalls.

Harvard runs out of cash

  • "fiscally unsustainable"

  • "academically restrictive"

  • "online content from two providers have increased by about 145% over the past six years"

  • "exacerbated by […] publishers to acquire, bundle, and increase the pricing"

Note

This was followed up by this amusing story. These are quotes from a memo sent by the library at Harvard to its faculty. Yes, Harvard are running out of cash.

The basic problem identified by Harvard library is two-fold. First, the publishers keep on increasing their prices. It’s actually hard to know for sure how much they are doing this, of course, because most publishers require an NDA for deals done with individual libraries. Second, there is bundling, also known as "big deals". This is the process whereby a library buys not one journal but many at a discount from a publisher; the problem is that libraries often find that the prices increase steadily over time.

Open Access

  • Pioneered by BMC — an online journal

  • Originally took the bottom end of the market

  • Followed by PLoS

  • Originally aimed at the top end, and grant supported

Note

One potential solution to the problem is open access. This was pioneered by BioMedCentral from about 2002. This was an online publication house, with the end papers being free for everyone to read. They made their money by charging authors for publishing. Interestingly, this wasn’t seen as terribly novel because in biology, page charges were common anyway and often substantial (1000 a page for instance).

They originally took the low end of the market. Which is why a couple of years later, PLoS was formed. It came about because a number of bioinformaticians were getting disgruntled at their inability to text mine articles; to continue with the human interest angle, one of these was Michael Eisen, brother of Jonathan mentioned earlier. Are your hearts all warm yet?

Now, the first time that I saw Michael Eisen talk about this they had an interesting angle: they wanted to do nothing innovative. The publication process had to be as much like the existing one as possible because, they figured, one change at a time. Good idea.

Open Access (10 years on)

  • BMC Bioinformatics is now high impact

  • PLoS has 6 main journals

  • And PLoS One

Note

So what has happened in the last 10 years? Well, first, open access has become accepted, particularly in some fields. BMC Bioinformatics, for example, is now a high impact journal. In biology, about 20% of papers are open access. PLoS has 6 main journals now (PLoS Biology is edited by Jonathan Eisen).

And PLoS One. It came later, and unlike the "main" journals in PLoS it is turning out to be revolutionary.

PLoS One

  • Has peer-review

  • Judges on scientific rigour

  • Not on perceived importance

  • Now has impact factor 4.4

  • In 2010 > 6000 articles

  • the largest journal in the world

Note

PLoS One is online and open access. It charges for publication. It is peer-reviewed, and this peer-review judges the scientific rigour of the work. Here is the revolutionary part: it does not judge on the basis of the perceived importance of the work. Basically, it has removed itself from the last shackle of tree-based publication. The marginal cost of publication is now small. There are no issues, and anyway, these days people get to articles via Google.

It now has an impact factor of 4.4, although, incidentally, PLoS has a publicly stated policy that IFs are nonsensical, which PLoS One shows clearly. In 2010 it published more than 6000 articles, which made it the largest journal in the world. In 2011, this number doubled.

Open Access Mandates

  • Many funders now mandate OA

  • RCUK (sort of). NIH. Wellcome.

  • Welcomed by all!

  • Research Works Act

Note

Many funders, particularly those with public aims such as Wellcome, or those funded by the public, have started to jump on board. Why should the public pay for things twice, they argue; they pay for the research, why should they pay to read about it. So there are now open access mandates from RCUK, although it’s a little vague what it actually means.

The toll access publishers, of course, have always had providing access to research materials as a key part of their mission statements, so they have been enthusiastic supporters of this. Which is where we come to the Research Works Act.

RWA

No Federal agency may adopt, implement, maintain, continue, or otherwise engage in any policy, program, or other activity that […] causes, permits, or authorizes network dissemination of any private-sector research work without the prior consent of the publisher of such work

  • sponsored by representatives who have received multiple donations from Elsevier

  • "private-sector research work" would include papers

  • Because, after all a paper is produced by the private sector

Note

RWA was a bill in the US House of Representatives which caused a lot of grief, especially coming on the back of SOPA and ACTA, to which it is only broadly related. Reading the quote above, you might wonder what the big deal is, but it was basically aimed at NIH’s open access policy, since "private-sector research work" was seen to include anything published by a private sector publisher, even if all the work had been done within the public sector.

Cost of Knowledge

Note

All of this led to the Cost of Knowledge. Although the website was not set up by him, it was in response to an article by Tim Gowers. He is famous, if you are a mathematician, partly because he won a Fields Medal, partly for using the web to stimulate collaborative mathematics.

Basically, it’s a boycott of Elsevier. He will no longer review, author or edit for Elsevier. This is not the first such boycott; that was a decade ago. Will this one work? Hard to tell, but there is a lot more traction, not helped by Elsevier continually referring to papers as "their content". They are correct, of course, it is theirs. But many scientists are starting to ask why, given that they are the authors, reviewers, editors and readers.

Arguments for Paywall Journals

  • Paywall promotes quality

  • Australasian Journal of Bone and Joint Medicine

  • Paid for by Merck

  • Positive stories about Merck products

OA is vanity publishing

  • Chaos, Solitons and Fractals

images/solitons-1.jpg

OA is vanity publishing

  • Chaos, Solitons and Fractals

images/solitons-2.jpg

OA is vanity publishing

  • Chaos, Solitons and Fractals

images/solitons-3.jpg

OA is vanity publishing

  • Chaos, Solitons and Fractals

images/solitons-4.jpg

OA is vanity publishing

  • Author M. S. El Naschie published >300 single author articles

  • A big favorite of Editor-in-Chief

  • M. S. El Naschie

  • At one point, Chaos, Solitons and Fractals was highest IF maths journal

Paywall access prevents copying

images/springer_molecule.jpg

  • An image of a molecule

  • Copyright Springer-Verlag

  • Actually, produced by Peter Murray-Rust

  • "A global resource for computational chemistry"

  • Journal of Molecular Modeling

A Question and an Answer

  • Who has an article in Lecture Notes in Computer Science?

    • None of you. If it’s in LNCS, the article is Springer’s

OA or OA/2

  • Open Access is, however, expensive

  • Between 800 and 3000(!) pounds

  • It’s "half open access"

  • Ultimately, do I care? It’s not my cash!

  • There is also a hidden cost

Note

Open Access is, however, expensive, at between 800 and 3000 pounds. This has led a friend of mine, Bijan Parsia, to coin the phrase "half open access". It’s only open from one side of the equation. But, ultimately, do I care? It’s not actually my cash, and besides research is expensive. Publication costs are normally about 1% of the total cost of research at most.

But I am going to argue that there are a number of hidden costs, and that these costs are ones that you will or should care about.

The Costs

  • The Second Biggest Scientific Publisher

  • 250,000 articles per year

  • 240 million Downloads

  • Cost: 1.5 Billion Euro

  • Elsevier

Note

So, let’s consider the costs. First, let’s consider the second biggest scientific publisher in the world. Some basic stats. This is Elsevier. Actually, Elsevier is not that common in Computing, but has an enormous life sciences presence. Springer-Verlag is of a similar size.

The Costs

  • The First Biggest Scientific Publisher

  • 17 million articles

  • > 20 languages

  • 365 million readers

  • Total Cost: 10 million dollars

  • Wikipedia

Note

And the biggest publisher of scientific literature in the world. Wikipedia. Now, of course, this isn’t entirely fair. Wikipedia also publishes a lot of non-scientific literature (the figures from Elsevier also include this). So this isn’t a fair comparison. But it’s not, however, 2 orders of magnitude unfair either.

So why is it so expensive? Elsevier’s profit margin? Well, no, because even its 40% profit margin isn’t enough to explain this.
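The comparison can be made concrete with a rough back-of-envelope calculation. This is only a sketch using the figures quoted on the two slides; it assumes the euro and dollar amounts can be compared without conversion, and it divides Wikipedia's cost by its total article count while Elsevier's figure is per year, so the result is an order-of-magnitude indication at best.

```python
import math

# Figures as quoted on the slides. Currencies differ (euro vs dollar) and are
# deliberately not converted, so the ratio is only a rough indication.
ELSEVIER_COST = 1.5e9        # euro
ELSEVIER_ARTICLES = 250_000  # articles per year
WIKIPEDIA_COST = 10e6        # dollars
WIKIPEDIA_ARTICLES = 17e6    # total articles


def cost_per_article(total_cost, articles):
    """Return the average cost of one article."""
    return total_cost / articles


elsevier = cost_per_article(ELSEVIER_COST, ELSEVIER_ARTICLES)
wikipedia = cost_per_article(WIKIPEDIA_COST, WIKIPEDIA_ARTICLES)

# Number of orders of magnitude separating the two per-article costs.
orders_of_magnitude = math.log10(elsevier / wikipedia)

print(f"Elsevier:  {elsevier:.2f} per article")
print(f"Wikipedia: {wikipedia:.2f} per article")
print(f"Gap: about {orders_of_magnitude:.1f} orders of magnitude")
```

Even if the comparison is unfair by two orders of magnitude, the gap computed this way leaves plenty to explain.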

The process

images/800px-Herrschaftliche_Kutsche.JPG

The process

Note

Let’s consider a typical publishing process. This process is, incidentally, not Elsevier, but it is from PLoS; in fact PLoS One. It is not unrelated to other journals however. I generally write my articles in LaTeX, because I happen to like it. You may use Word, but the point remains.

The process

Note

I convert this into PDF on my machine and then upload it, along with all the TeX files in case they need them for some unspecified reason.

The process

images/500px-Adobe_PDF_Icon.svg.png

  • Again!

Note

My PDF gets converted into another PDF. As far as I can tell, this happens at the level of a PDF→PDF conversion. I still have no idea what this is hoping to achieve. My belief is that Word docs also get PDF converted at this point.

The process

Note

This PDF gets converted into a Word doc. Hmmm. Surely, you say, this makes no sense? Even worse when you discover that this conversion process involves someone copying the PDF into Word.

The process

The process

Note

Actually, it’s HTML4, but I couldn’t find an HTML4 logo.

The process

images/500px-Adobe_PDF_Icon.svg.png

  • Again, Again!

Hidden Cost

images/800px-Clock_in_Kings_Cross.jpg

Compare to arXiv

  • The "physics" pre-print server

  • Cut and paste in standard metadata

  • Upload .tex and .bbl file (also images)

  • View PDF

  • Click "go"

  • Cost $7 per paper

  • Takes about 7 minutes (n=1) from LaTeX to published

Note

Now we can compare this to arXiv which was originally a preprints service for physics and now takes articles from a much wider set of disciplines including computing science. It includes a standard(ish) and stable identifier, allows metadata harvesting and sends out emails and stuff. On a short survey, n=1, this takes an average of 7 minutes to complete from start to finish.

Compare to Wordpress

  • Open Lab Book

  • Write in Word

  • Click "go"

  • Publication time measured in ms

Note

As part of my commitment to open science, I also use Wordpress to host my open notebook.

Note here, that I have carefully phrased this to avoid the word "blog". For the most part I do not write about my life, my hobbies or my cat here. Although I did used to talk about my dinner. Over time, my blog has evolved to become largely professional. And publishing there is very, very easy. Actually, I don’t use Word. But, regardless of what you use, the publication process, as opposed to the authoring process, is very quick, and is best measured in milliseconds.

Ontogenesis

  • Wanted to publish tutorial information

  • Writing a book is painful

  • Open, public peer-review

  • We now have around 30 articles, published over 2 years

  • Many articles short, discrete

  • We have published unpublishable material

Note

So, where does this new-found rapidity leave us? Well, we started experimenting with this a couple of years ago. In my other life, I build ontologies. It isn’t easy, not helped by the total lack of tutorial information available.

We thought about writing a book, but it never happened. It never happened because we had all got so irritated with publishers taking material and then, after a year of no communication, suddenly sending an email saying "here are the proofs, please correct within 5 days." Books also require either very significant pieces of writing, or are multi-multi-author. Either way it’s a pain.

In some cases, we wanted to publish quite short material. So, I wrote a nice article on "do cyclists pay tax?" which turns out to be a good example describing the difference between universal and existential quantification, as well as roles and inheritance. It’s 2 pages, it’s complete. Similarly, "what is disjointness".

In short, ontogenesis is full of unpublishable material. It’s unpublishable not because the material is bad, but because the publication process is bad. We now have 30 articles, and about 30000 page views.

Linked Data

  • Now we have a simple process

  • Which we have extended

  • Knowledgeblog

  • We can publish linked, semantic articles

Note

But there’s more….

We now also have a simple process, with no human involvement. We can now start to push semantics through this process, we can make an article a part of a linked data environment. This allows us to do interesting things.

References

  • Academics love in-text citations (Lord, 2012)

  • Will describe two tools, kcite and greycite

References

Lord, P (2012) Academics love Citations. J.Unsupport.Assert

References (author)

  • Authors insert primary identifiers

  • [cite]10.100/100.1[/cite]

  • This is a DOI.

  • We also do arXiv, pubmed IDs.

images/kblog-with-edam-with-more-references.png
Note

We are now part of a linked data environment. The article is an active mashup. None of the metadata you see here is embedded. All of it is gathered from other sources. Moreover, every reference has a URL, and can be unambiguously identified (in the sense that we can be sure what we are pointing to; comparison of two references is a little harder).

And the author has gained some advantage. They don’t need to type the metadata in, and if they get the reference wrong they will see it straight away.
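As an illustration of what the author-side markup involves, here is a minimal sketch that finds `[cite]` shortcodes in a post and guesses which kind of primary identifier each one holds. It is not kcite's own code (kcite runs inside Wordpress); the function names and the regular expressions used to tell a DOI from an arXiv or PubMed ID are assumptions made for this example.

```python
import re

# Matches the shortcode form shown on the slide: [cite]identifier[/cite]
CITE_RE = re.compile(r"\[cite\](.*?)\[/cite\]")


def classify_identifier(identifier):
    """Guess the identifier type. These patterns are illustrative heuristics."""
    identifier = identifier.strip()
    if re.fullmatch(r"10\.\d+/\S+", identifier):
        return "doi"
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", identifier):
        return "arxiv"
    if re.fullmatch(r"\d+", identifier):
        return "pubmed"
    if re.match(r"https?://", identifier):
        return "url"
    return "unknown"


def extract_citations(text):
    """Return (identifier, type) pairs in the order they appear in the text."""
    return [(found.strip(), classify_identifier(found))
            for found in CITE_RE.findall(text)]


post = "Academics love citations [cite]10.100/100.1[/cite]."
print(extract_citations(post))
```

Once the identifier type is known, the metadata for the bibliography can be fetched from the matching external source rather than typed in by the author.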

References (reader)

  • Readers can see citation or direct link

  • Readers can change citation style

images/kblog-with-edam-with-numeric-listing.png
Note

We also have an active document. The reference style is no longer handed down from on high. The reader can choose how they want to see things.
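Because the bibliography is generated from metadata rather than typed, switching style is just a matter of rendering the same record differently. The sketch below shows the idea with two styles; the function names and the record layout are hypothetical, and the real rendering happens in the reader's browser rather than in Python.

```python
# The reference from the earlier slide, held as structured metadata.
REFERENCE = {
    "author": "Lord, P",
    "year": 2012,
    "title": "Academics love Citations",
    "venue": "J.Unsupport.Assert",
}


def in_text_marker(reference, position, style):
    """Render the marker that appears in the running text."""
    if style == "author-year":
        family_name = reference["author"].split(",")[0].strip()
        return f"({family_name}, {reference['year']})"
    if style == "numeric":
        return f"[{position}]"
    raise ValueError(f"unknown style: {style}")


def bibliography_entry(reference, position, style):
    """Render the matching entry in the reference list."""
    entry = (f"{reference['author']} ({reference['year']}) "
             f"{reference['title']}. {reference['venue']}")
    if style == "numeric":
        return f"[{position}] {entry}"
    return entry


for style in ("author-year", "numeric"):
    print(in_text_marker(REFERENCE, 1, style), "->",
          bibliography_entry(REFERENCE, 1, style))
```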

References (kcite architecture)

images/architecture-kcite.jpg

The Problem

  • Inserting links is painful

  • Fortunately, we can use the bibliographic metadata to help

  • Will demonstrate this after a brief interlude

Note

Inserting references in this way is a pain. However, in most cases, we can do this in a metadata driven way. We can use the same metadata that is used to generate the bibliography for the authors.

Greycite

  • Originally, kcite could not support URLs

  • As well as Publishing the Unpublishable

  • We want to Cite the Uncitable

  • We needed a source for metadata

  • Now we have http://greycite.knowledgeblog.org

  • Developed by Lindsay Marshall, Computing Science, Newcastle

Note

One substantial problem is the absence of an ability to do this for URIs. We wanted to address this with greycite, so I am going to show how this works. Greycite is a new tool which mines bibliographic metadata from URIs.

Doesn’t work for all URIs. Mining is (deliberately) not too intelligent. We look mostly for things that are intended to be mined. We have also added tools to wordpress to allow flexible insertion of metadata (including with shortcodes, or through a nice GUI).

Greycite

  • This is greycite

images/greycite-home.png

Greycite

  • Add a URL

images/greycite-with-url.png
Note

Putting in a URL to an article on my blog.

Greycite

  • URL results

images/greycite-russet-why-not.png
Note

In this case, we’ve seen the URL before and greycite knows some basic metadata.

Greycite

  • Scrolling to more detail

images/greycite-url-landing.png
Note

Looking in more detail, we can see that in March, the article had a title "Why Not?", was authored by me and dates from 2010.

Greycite

  • Provenance (coins)

images/greycite-with-coins.png
Note

We have gathered this metadata from a variety of sources, and you can see the provenance. COinS is a dreadful standard which is now quite a few years old, but some people use it. It works by embedding a span tag into the body of the post.

Although it is dreadful in every way, the fact that it is embedded in the body is its most useful feature, as many people don’t control their headers.
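To show what mining an embedded span looks like, here is a small sketch using only the Python standard library. COinS puts a URL-encoded set of key/value pairs in the `title` attribute of a `span` with class `Z3988`. The sample HTML, the author value and the function names are made up for illustration; this is not Greycite's own code.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the key/value pairs from every COinS span in a page."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The title attribute is a URL-encoded query string.
            self.records.append(parse_qs(attributes.get("title") or ""))


def extract_coins(html):
    parser = CoinsParser()
    parser.feed(html)
    return parser.records


# Hypothetical post body; the author name is a placeholder.
SAMPLE = ('<p>Post body <span class="Z3988" title="ctx_ver=Z39.88-2004'
          '&amp;rft.title=Why+Not%3F&amp;rft.au=A.+N.+Author'
          '&amp;rft.date=2010"></span></p>')

print(extract_coins(SAMPLE))
```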

Greycite

  • Provenance (OGP)

images/greycite-with-ogp.png
Note

We also support the Open Graph Protocol. It is a much nicer "standard", partly developed by Facebook, who would appear to be much better at uncovering other people’s bibliographic metadata than, allegedly, they are at uncovering their own financial data.
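Open Graph metadata lives in `meta` elements with a `property` attribute starting `og:`, so reading it needs very little code. The following is a sketch under that assumption, again with standard-library Python and hypothetical names; it keeps the first value seen for each property and ignores everything else on the page.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from meta elements."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        content = attributes.get("content")
        if name.startswith("og:") and content is not None:
            # Keep the first value if a property is repeated.
            self.properties.setdefault(name, content)


def extract_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical page header for illustration.
PAGE = ('<head><meta property="og:title" content="Why Not?"/>'
        '<meta property="og:type" content="article">'
        '<meta name="generator" content="ignored"></head>')

print(extract_open_graph(PAGE))
```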

Greycite

  • And elsewhere

images/bbc-telescope-post.png
Note

We have used somewhat established ad-hoc standards, so it works on sites outside of our control, and also sites which are not necessarily academic. This is an article from the BBC for instance.

Greycite

  • And elsewhere

images/greycite-bbc-telescope-post.png

Greycite

  • Sadly, some websites have no semantics that we can find

images/greycite-computing.png

Greycite

  • Preservation

images/greycite--url-metadata-detail.png
Note

We can also link through to other resources, such as archive.org. We also support archive.org.uk (provided by the British Library) and are working on webcitations.org.

Greycite

images/greycite-at-archive-org.png
Note

So, we can maintain links to the academic record even if the links break. Currently, this is not apparent in kcite generated bibliographies, but we will add redirection in soon for links which appear to have gone 404.
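One way such a redirection could work is sketched below: if the original link returns 404, point the reader at an archived copy instead. The Wayback Machine addresses captures as `https://web.archive.org/web/<timestamp>/<url>`; the function names, and the assumption that a capture time is already known from the stored metadata, are hypothetical and do not describe how kcite will actually do it.

```python
from datetime import datetime

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url, captured_at):
    """Build the archive address for a capture made at the given time."""
    stamp = captured_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{stamp}/{url}"


def resolve_link(url, status_code, captured_at):
    """Return the original URL, or the archived copy if the original is gone."""
    if status_code == 404:
        return wayback_url(url, captured_at)
    return url


# Hypothetical URL and capture date for illustration.
print(resolve_link("http://example.org/post", 404, datetime(2012, 6, 18)))
```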

  • How does citation work?

images/bio-ontologies-toc.png
Note

This is the bio-ontologies website. Lots of papers on it. Many with complex metadata. For those of you interested in this sort of thing, I have now separated the metadata from the wordpress environment. Previously all the authors needed logins. PITA.

  • We want to cite this page

images/bio-ontologies-edam.png
  • In an article, I am writing

images/emacs-and-kblog.png
Note

For my own editing environment, I use asciidoc, bibtex and emacs. I acknowledge that this is a little niche, but it does work with other environments also.

  • First we take the URL

images/emacs-with-edam-url.png
Note

Actually, this is enough. It is all that you need; however, as an author I find myself citing the same URL repeatedly. Google is very good at getting you to where you want to go, but it is not perfect, particularly when there are a lot of articles on one topic. So, I wanted something quicker, searching over what I am interested in.

  • Query Greycite

images/emacs-with-edam-url-and-mxgreycite.png
  • Get back bibtex

images/emacs-with-edam-bibtex.png
Note

The metadata here comes from the web page, so in one sense cannot be wrong, although it can be different from what the author wants it to be. Recently, I’ve added support for Wordpress to advertise its metadata on the page as well, so it’s visible.

We can do similar things with DOIs, arxiv and the like.
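The format shifting from mined metadata to bibtex is simple enough to sketch. The function below turns a small metadata record into a `@misc` entry; the field selection, the citation key and the example URL are assumptions made for illustration, and the output is not claimed to match what Greycite itself returns.

```python
def to_bibtex(key, metadata):
    """Format a metadata record as a BibTeX @misc entry."""
    lines = ["@misc{%s," % key]
    for field in ("author", "title", "year", "url"):
        value = metadata.get(field)
        if value:
            # Skip fields that the page did not supply.
            lines.append("  %s = {%s}," % (field, value))
    lines.append("}")
    return "\n".join(lines)


# Hypothetical record; the URL and author are placeholders.
record = {
    "author": "A. N. Author",
    "title": "Why Not?",
    "year": "2010",
    "url": "http://example.org/why-not",
}

print(to_bibtex("author2010", record))
```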

  • So, we search

images/emacs-and-post-ref-with-regexp.png
  • And select

images/emacs-and-post-ref-with-reftex-dialog.png
  • And insert

images/emacs-and-post-ref-with-kurl-inserted.png
  • And publish

images/kblog-with-edam.png
Note

All of this is tied together with just a little semantic glue. We needed some heuristics, we needed some format shifting, but that is it.

Accessibility

  • This is all fairly simple

  • And works because the content is OA

  • Greycite can get to the metadata because it is open

  • BL can archive, because the content is open

  • Archive also includes metadata because it is open

  • Much of it works outside academic publishing

  • Compare DOI, CrossRef, LOCKSS and so on.

Note

All of this works because we have an open resource, based on widely available standards. Compare this to CrossRef and DOI technology which is more complex. Compare this to LOCKSS which is more complex. And, in most cases, we are not using bespoke software specific to the academic publishing industry. Hence it works with BBC news.

Ideas: Glossary

  • Short 140 word articles with a title

  • Word, not character!

  • Fully attributed

  • Linkable

  • Publish via email

  • Displayed inline

Note

So, further ideas. We have already pursued a mashup strategy. Want to push this further. In this publication environment we can do very small-scale publishing. While others are pushing nanopublications, this is more mini-publication. You publish a short article, 140 words long, to operate as a glossary.

Links back will then appear in popups as a glossary, rather than as a hyperlink or in a reference list. The glossary will not have a single name space, so multiple definitions are possible. And all the things we have added so far will help. Greycite will provide bibtex to make the link insertion easier.

Probably going to investigate a publish by email protocol also, so that new articles are very easy to publish while still maintaining a moderation step.
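The moderation step could include a trivial automatic check that a submitted glossary entry respects the limit, counting words rather than characters. This is a sketch only; the function name and the whitespace-based definition of a word are assumptions.

```python
GLOSSARY_WORD_LIMIT = 140


def within_glossary_limit(text, limit=GLOSSARY_WORD_LIMIT):
    """Return True if the text has no more than `limit` whitespace-separated words."""
    return len(text.split()) <= limit


print(within_glossary_limit("Disjointness means two classes share no members."))
```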

Ideas: Structured Knowledge

  • Open Disease Reports from David Shotton

  • Short summaries of disease information published elsewhere

  • Critical for third world

  • We want to structure the knowledge for mining.

Ideas: Enhanced Linking

images/chemicalize.png
Note

We want to link to other resources out there. We would like to do this intelligently. For instance, this is a tool called chemicalize which inserts implicit links by named-entity recognition. Very nice. But probably not something that you want on a cookery page.

Ideas: Linking

images/chemicalize-link.png

Ideas: The NearCon

  • Nearly a conference

  • Cross between a workshop and a special issue

  • Publish papers and then talk about them!

  • Like a workshop

    • see papers you might miss

  • Asynchronous!

Note

Also wanted to pursue

Publishing in Flux

  • eLife - modelled on PLoS One

  • F1000 - Cross between PLoS One and arXiv

  • PeerJ - $100 to register, free to publish

  • Dutch Tulip bulb

Note

We are not the only people playing in this environment. Publishing is in a lot of flux at the moment and there are a lot of new ideas coming out. So, there is eLife for example, which is modelled on PLoS One and is led by Mark Patterson, who previously worked for PLoS Currents. Interestingly, it’s directly supported by the Wellcome Trust.

F1000 are producing something new. This is rather an offshoot of their prepublication, and poster publication service. Probably free to publish initially but we are not sure yet.

Finally, there is PeerJ. If we can sequence the human genome for $1000, why can we not publish a paper for $99? Basically, you register for $100, then can publish for life (1 paper a year, and you have to do 1 review a year, and all authors need to be registered). This is Peter Binfield, Jason Hoyt and, most interestingly, Tim O’Reilly.

The Dutch tulip bulb? First good example of a speculative economic bubble, where the cost of a resource increased totally out of proportion to its value. Currently, academic publishing comes with a lot of costs, but does it come with any value?

What can we do?

  • Technical Report Series

  • Replace with arXiv

  • Recognised

  • Stable

  • Standard Metadata

  • Supplement with Kblog

  • More experimental

  • Web First

What can we do?

  • Grants

  • Publish all our grants on the web

  • Successful or not!

  • Knowledge blog would be a good framework

What can we do?

  • Online thesis

What can we do?

  • Open Notebook Science

  • all research active staff and students

  • Publish as we go!

What can we do?

  • Cash!

  • Q: Can I get cash for my PhD student to conference?

  • A: Yes! But conference should come with publication.

  • All the suggested publication locations are toll access

What can we do?

  • Teaching

  • Open Education Resources

  • Release lecture notes online

  • Release all recap online

  • Try before you buy!

Elephant in the Room

  • REF

  • Promotion Committees

  • Lawyers

Note

REF is a problem — there is a tendency toward the false metrics such as IF, and this is an undeniable problem.

Promotion Committees — tend to have a bias toward high quality journals, where "high quality" is defined as "the ones I published/publish in". The older a scientist is, the more likely they are to have the bulk of their work in TA journals.

Lawyers — currently, RECAP is not going online and has, in fact, got more restrictive because of fears that we will uncover the large scale copyright violation by staff. Also, university does not allow staff to attach CC, OA licenses to their work without permission. Although, strangely, it does allow them to give their work away without compensation and without even retaining the right to use it themselves. Hmmm.

Acknowledgements

  • Kblog/Ontogenesis

    • Robert Stevens, Georgina Moulton (Manchester).

    • Dan Swan, Simon Cockell (Newcastle)

  • Kcite

    • Simon Cockell

  • Greycite

    • Lindsay Marshall

Summary/Conclusion

  • Open Access == freely available papers

  • Part of the move to Open Science

  • Happening whether we like it or not

  • It should be seen as an opportunity not a risk

"In the longer term, the future lies with open access publishing," said Finch at the launch of her report on Monday. "The UK should recognise this change, should embrace it and should find ways of managing it in a measured way."

June 18 2012
— Janet Finch