Three Steps to Heaven

About

Full Text available: http://www.russet.org.uk/blog/2012/04/three-steps-to-heaven/

Three Steps to Heaven

We argue that to make semantic publishing possible we need to:

make life better for the machine to
make life better for the author to
make life better for the reader

Making life better for the machine is a reasonable objective in itself. To some extent, it is the point of semantic publishing. But, it is not an end, because, ultimately, we need the machine to do something for us.

We have to make life better for the author. It is all very well for a bioinformatician to say to a biologist "please mark up your text, so I can mine it", but they will say "no, I don’t have time".

We need to make life better for the reader because that way they may nag the authors to add more, which is something the authors care about.

To begin at the beginning

The birth of kblog (http://knowledgeblog.org)

from:[http://en.wikipedia.org/wiki/File:La_Palmyre_041-crop.jpg]

Or to begin at the end. We are interested in the long tail. Academics produce a lot of content, much of it falling into the category of grey literature: short tutorial information, lab books and so on. We believe that an academic should be able to publish this information straightforwardly, while maintaining the important features of current academic publishing.

The problem with publishing

images/800px-Clock_in_Kings_Cross.jpg from:[http://en.wikipedia.org/wiki/File:Clock_in_Kings_Cross.jpg]

Often time-consuming
Over a long time

There are three big problems with current academic publishing, however, which are going to prevent this. First, it is far too time-consuming; I refer here to the publishing process itself and not to the effort of authoring, which is hard because of the intellectual creativity that it requires. Second, it’s too expensive; to some extent, I don’t care about this, since the money comes out of my grant and it’s still relatively small compared to the cost of science overall. But it does mean that I have to think about what is worth publishing. Third, the publishing industry is one that, to misquote Douglas Adams, thinks that dumping a PDF on the web is a pretty neat idea.

The process

images/500px-LaTeX_cover.svg.png from:[http://commons.wikimedia.org/wiki/File:LaTeX_cover.svg]

The process

images/500px-Adobe_PDF_Icon.svg.png from:[http://en.wikipedia.org/wiki/File:Adobe_PDF_Icon.svg]

The process

images/500px-Adobe_PDF_Icon.svg.png

Again!

The process

images/500px-Microsoft_Word_Icon.svg.png from:[http://en.wikipedia.org/wiki/File:Microsoft_Word_Icon.svg]

The process

images/500px-XML.svg.png from:[http://en.wikipedia.org/wiki/File:XML.svg]

The process

images/HTML5_Logo_512.png from:[http://www.w3c.org/html/logo]

The process

images/500px-Adobe_PDF_Icon.svg.png

Again, Again!

The process

images/wordpress-blue-xl.png

Web First
RSS
Trackbacks
Stats
Gravatars

Here is one journal publishing process. It’s very heavy-weight. Pushing semantics through this is going to be well-nigh impossible.

So, we wanted a web-first, easily accessible process. We wanted to integrate with scientists’ existing workflows and existing tools. We wanted to build on top of commodity software wherever possible: we need tools like RSS, trackbacks, gravatars and commenting; we didn’t want to write them.

So, we used wordpress.

Existing Workflow

Fitting in with people’s workflows
Fitting in with people’s tools
- Word/Email
- Word/Dropbox
- LaTeX/Versioning
- Asciidoc/Dropbox
The wordpress editor is lacking

There are many existing workflows. Most of them use word, email, dropbox and so on. One of the nice consequences of going commodity is that most of this is sorted out for us up front. Most blogging engines provide an XML-RPC interface, and many text production systems can talk to it, or can be induced to.
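As a rough illustration of what "talking XML-RPC" means, the sketch below serialises a `metaWeblog.newPost` call, one of the blogging APIs that wordpress accepts. It only builds the request body and sends nothing. The blog id, user name, password and post content are hypothetical placeholders, and this is not the code any particular authoring tool uses.

```python
import xmlrpc.client


def build_new_post_call(username, password, title, body, publish=False):
    """Serialise a metaWeblog.newPost call without sending it anywhere."""
    # The struct uses the metaWeblog field names: "title" and "description".
    post = {"title": title, "description": body}
    # Blog id 1 is an assumption for a single-site install.
    params = (1, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")


# Hypothetical credentials and content, for illustration only.
payload = build_new_post_call("author", "secret", "Three Steps", "<p>Hello</p>")
```

An authoring tool would POST this payload to the blog's XML-RPC endpoint; any editor that can produce it can publish without going near the wordpress editor.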

Adding Semantics

Adding semantics is hard
The three steps to heaven limit us
Going to describe two exemplars

The three steps place constraints on us. We cannot ask too much of the authors. We want them to see immediate advantage at each point.

In this paper, we describe three use cases. Because of time constraints, I am just going to describe two here, and also want to add some additional stuff we have done more recently.

Maths (author)

The author adds [latex]e=mc^2[/latex] to their document
Which is rendered as:

e = m c^{2}

This works in any authoring environment
The author adds "here is maths" semantics

Maths (reader)

The reader gets a nicely rendered equation

Maths (reader)

Which is NOT image based

Maths (reader)

And has some additional features

Maths (reader)

Access the tex

Maths (reader)

Or transformed to MathML

We’ve used three key technologies here. First, TeX as a markup language (not the processor!). Second, wordpress and short codes. Finally, mathjax, which does the rendering. So, what does this look like to the user? The practical upshot is that you stick a short code in and get maths out.
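The short code step can be sketched roughly as follows. This is not the wordpress plugin itself (which would be PHP); it is a minimal Python illustration, assuming the server only needs to swap the `[latex]` short code for MathJax's default inline delimiters `\(` and `\)` and leave the rendering to the browser. The function name is hypothetical.

```python
import html
import re

# Matches [latex]...[/latex], non-greedily, across line breaks.
LATEX_SHORTCODE = re.compile(r"\[latex\](.+?)\[/latex\]", re.DOTALL)


def render_latex_shortcodes(text):
    """Replace each [latex] short code with MathJax inline delimiters."""
    def replace(match):
        # Escape <, > and & so the TeX survives being embedded in HTML.
        tex = html.escape(match.group(1), quote=False)
        return r"\(" + tex + r"\)"
    return LATEX_SHORTCODE.sub(replace, text)


print(render_latex_shortcodes("Energy: [latex]e=mc^2[/latex]."))
```

Because the TeX is passed through as text rather than converted to an image, it stays available to both the reader and the machine.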

Architecture

images/architecture-mathjax.jpg

Why TeX?

There is a W3C standard

e = m c^{2}

<math>
   <mrow>
     <mi>e</mi>
     <mo>=</mo>
     <mrow>
       <mi>m</mi>
       <msup>
         <mi>c</mi>
         <mn>2</mn>
       </msup>
     </mrow>
   </mrow>
</math>

There is a W3C standard for representation of maths. It’s long and horrible.

Most mathematicians are happy with TeX.

Why [Shortcodes]?

XML is painful
Most of the tools escape it anyway
Short codes are simple and reasonably well supported
We support other syntaxes also

Ironically, anything XML has a major problem. Most scientists do not want to edit in wordpress, or at the HTML level. Most of the tools that people use, such as word, latex (our hack) or asciidoc, assume that when people write XML they want to display it, not embed it as markup. So, it gets escaped.

Short codes are well supported in wordpress, and familiar to at least some people. They largely pass through the various editing environments untouched.

Why client-side rendering

MathJax did not need developing
It looks nice
And has added functionality

Essentially because it looks nice. We can click on equations. We can zoom in. We can get the TeX OR the MathML out.

What have we achieved?

Authors can write TeX.
Readers get a nice view
Machines get a markup saying "maths is here"
- Can tell that markup is being used
- Markup can be converted to MathML
- Authors have (probably) already checked this works

The authors are happy. They can write maths anywhere. Including TeX in word. The readers get a nicer view.

The machines get a markup (TeX). They can either use this directly or use mathjax to convert this to MathML, in the knowledge that the author has already (effectively) done this, and checked it is correct.

Maths

What we have achieved adds limited semantics
But is useful to reader, author and machine

References

Academics love in-text citations (Lord, 2012)
Will describe two tools, kcite and greycite

References

Lord, P (2012) Academics love Citations. J.Unsupport.Assert

Citations are problematic, but relatively high value. We can use them to track impact, knowledge flow, discourse structure and the like. We have followed a similar methodology to before. A citation is inserted. Wordpress gathers the metadata for this in a very linked data way. The browser then renders it with javascript.

References (author)

Authors again insert shortcodes
[cite]10.100/100.1[/cite]

images/kblog-with-edam-with-more-references.png

References (reader)

Readers can see citation or direct link
Readers can change citation style

images/kblog-with-edam-with-numeric-listing.png

References (kcite architecture)

First wordpress detects the identifier.
It either guesses or is told what kind of identifier we have.
It then queries one of four sources: crossref, greycite, arxiv, pubmed
Either by URL query, or content negotiation
CrossRef and Greycite return JSON
Arxiv and pubmed return their own XML, which we transform into JSON
The JSON is actually cached in the database for 2 months
Wordpress serves the HTML
Javascript calls back to wordpress
Wordpress serves the integrated JSON.
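The first few steps above can be sketched as follows. This is an illustration in Python rather than kcite's own code: the guessing rules, the class name and the `fetch` callback are all assumptions, and the two-month cache lifetime is approximated as 60 days.

```python
import re
import time

TWO_MONTHS = 60 * 24 * 60 * 60  # roughly two months, in seconds


def guess_source(identifier):
    """Guess which metadata source should resolve an identifier (heuristic)."""
    if identifier.startswith(("http://", "https://")):
        return "greycite"
    if re.fullmatch(r"10\.\d+/\S+", identifier):
        return "crossref"
    if re.fullmatch(r"\d{4}\.\d{4,5}(v\d+)?", identifier):
        return "arxiv"
    if identifier.isdigit():
        return "pubmed"
    return None


class MetadataCache:
    """Cache fetched metadata records, refetching once they are too old."""

    def __init__(self, fetch, ttl=TWO_MONTHS, clock=time.time):
        self.fetch = fetch  # hypothetical callable: (source, identifier) -> dict
        self.ttl = ttl
        self.clock = clock
        self.store = {}

    def get(self, identifier, source=None):
        # The source is either told to us or guessed from the identifier.
        source = source or guess_source(identifier)
        if source is None:
            raise ValueError("cannot guess source for " + identifier)
        key = (source, identifier)
        now = self.clock()
        hit = self.store.get(key)
        if hit and now - hit[0] < self.ttl:
            return hit[1]
        record = self.fetch(source, identifier)
        self.store[key] = (now, record)
        return record
```

In a real deployment `fetch` would do the URL query or content negotiation and normalise the arxiv and pubmed XML into JSON; here it is left as a parameter so the caching logic can be read on its own.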

The Problem

Inserting links is painful
Fortunately, we can use the bibliographic metadata to help
Will demonstrate this after a brief interlude

Inserting references in this way is a pain. However, in most cases, we can do this in a metadata driven way. We can use the same metadata that is used to generate the bibliography for the authors.

Greycite

Originally, kcite could not support URLs
We needed a source for metadata
Now we have http://greycite.knowledgeblog.org
Developed by Lindsay Marshall, Computing Science, Newcastle

One substantial problem is the absence of any way to do this for URIs, which we wanted to address with greycite. Greycite is a new tool which mines bibliographic metadata from URIs. I am going to show how this works.

Doesn’t work for all URIs. Mining is (deliberately) not too intelligent. We look mostly for things that are intended to be mined. We have also added tools to wordpress to allow flexible insertion of metadata (including with shortcodes, or through a nice GUI).

Greycite

This is greycite

Greycite

Add a URL

Putting in the URL of an article on my blog

Greycite

URL results

In this case, we’ve seen the URL before and greycite knows some basic metadata.

Greycite

Scrolling to more detail

Looking in more detail, we can see that in March, the article had a title "Why Not?", was authored by me and dates from 2010.

Greycite

Provenance (coins)

We have gathered this metadata from a variety of sources, and you can see the provenance. Coins is a dreadful standard which is now quite a few years old, but some people use it. It works by embedding a span tag into the body of the post.

Although it is dreadful in every way, the fact that it is embedded in the body is its most useful feature, as many people don’t control their headers.
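For those who have not met it, a coins record is a `span` with class `Z3988` whose `title` attribute holds the metadata as a URL-encoded query string. The sketch below shows one way such spans could be mined using only the Python standard library. It is not greycite's implementation, and the sample page and its field values are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the key/value pairs from every coins span in a page."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attrs.get("title"):
            # The parser has already turned &amp; back into &.
            self.records.append(parse_qs(attrs["title"]))


def extract_coins(page):
    parser = CoinsParser()
    parser.feed(page)
    return parser.records


# Illustrative page fragment; the values are examples, not real output.
sample = ('<p>Post body <span class="Z3988" '
          'title="ctx_ver=Z39.88-2004&amp;rft.title=Why+Not%3F&amp;rft.date=2010">'
          '</span></p>')
records = extract_coins(sample)
```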

Greycite

Provenance (OGP)

We also support the Open Graph Protocol. This is a much nicer "standard", partly developed by Facebook, who would appear to be much better at uncovering other people’s bibliographic metadata than they allegedly are at uncovering their own financial data.
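Open Graph metadata lives in the page header as `meta` elements whose `property` starts with `og:`. A minimal sketch of reading them with the Python standard library follows; again this is an illustration rather than greycite's code, and the sample header is invented.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect og:* properties from meta tags, keeping the first of each."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if tag == "meta" and prop.startswith("og:") and "content" in attrs:
            self.properties.setdefault(prop, attrs["content"])


def extract_open_graph(page):
    parser = OpenGraphParser()
    parser.feed(page)
    return parser.properties


# Illustrative header; the values are examples only.
sample_head = ('<head><meta property="og:title" content="Why Not?" />'
               '<meta property="og:type" content="article" />'
               '<meta name="author" content="someone" /></head>')
og = extract_open_graph(sample_head)
```

Note that the plain `name="author"` tag is ignored here, which mirrors the limitation described above: OGP sources often give us a title but not an author or date.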

Greycite

And other (hosted) blogs

It does work with other sites outside our control. This is Robert’s blog. He is a raw wordpress user.

Greycite

Although somewhat limited

It advertises some metadata but, sadly, not the author or the date. This is hard for us to fix because we don’t control the website. We’ll probably add a coins generator at some point, to generate the metadata that the website does not provide.

Greycite

And elsewhere

As well as non-blog sites. This is an article from the BBC.

Greycite

Again, no author or date, but otherwise intact.

Greycite

Sadly, some websites have no semantics that we can find

Greycite

Preservation

We can also link through to other resources, such as archive.org. We also support archive.org.uk, provided by the British Library, and are working on webcitations.org.

Greycite

So, we can maintain links to the academic record even if the links break. Currently, this is not apparent in kcite generated bibliographies, but we will add redirection soon for links which appear to have gone 404.
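The planned redirection could look something like the sketch below. It is only an assumption about how it might work: the function name is hypothetical, the check is limited to a 404 status, and the fallback uses the Wayback Machine's address form of prefixing the original URL.

```python
# archive.org serves archived copies at this prefix followed by the original URL.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def link_for(url, status):
    """Return the link to show for a cited URL, given its last HTTP status."""
    if status == 404:
        # The live page has gone, so point at an archived copy instead.
        return WAYBACK_PREFIX + url
    return url


print(link_for("http://knowledgeblog.org", 200))
print(link_for("http://knowledgeblog.org", 404))
```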

Greycite

Finally

Greycite

We can retrieve as JSON (or bibtex)

Citing Links

How does adding links work?

This is the bio-ontologies website. Lots of papers on it. Many with complex metadata.

I have added support

Citing Links

We want to cite this page

Citing Links

In an article I am writing

For my own editing environment, I use asciidoc, bibtex and emacs. I acknowledge that this is a little niche, but it does work with other environments also.

Citing Links

First we take the URL

Actually, this is enough. It is all that you need; however, as an author I find myself citing the same URL repeatedly. Google is very good at getting you to where you want to go, but it is not perfect, particularly when there are a lot of articles on one topic. So, I wanted something quicker, searching over what I am interested in.

Citing Links

Query Greycite

images/emacs-with-edam-url-and-mxgreycite.png

Citing Links

Get back bibtex

Citing Links

So, we search

images/emacs-and-post-ref-with-regexp.png

Citing Links

And select

images/emacs-and-post-ref-with-reftex-dialog.png

Citing Links

And insert

images/emacs-and-post-ref-with-kurl-inserted.png

Citing Links

And publish

All of this is tied together with just a little semantic glue. We needed some heuristics and some format shifting, but that is it.

Citing with Emacs

Most of the tools are easy to connect

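;; Assumed to be defined elsewhere in the author's setup; declared here so
;; the snippet stands alone. Set non-nil to emit kcite short codes.
(defvar phil-reftex-citation-override nil)

;; When the override is set, replace RefTeX's citation with a short code.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (setq ad-return-value (phil-reftex-format-citation entry format))
    ad-do-it))

;; Build the short code from the entry's DOI, in an asciidoc passthrough.
(defun phil-reftex-format-citation (entry format)
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[[cite source='doi'\\]%s[/cite\\]]" doi)))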

Or with CSL

Citation Style Language, supported by Zotero, Mendeley and others

<citation>
   <layout prefix="[cite]" suffix="[/cite]"
     delimiter="[/cite] [cite]">
      <text variable="DOI"/>
   </layout>
</citation>
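To see what this layout produces, the sketch below mimics it in Python: the citations are joined with the delimiter and then wrapped in the prefix and suffix, so each DOI ends up in its own short code. It is an emulation for illustration only, not how a CSL processor is implemented, and the second DOI is a made-up placeholder.

```python
def render_layout(dois, prefix="[cite]", suffix="[/cite]",
                  delimiter="[/cite] [cite]"):
    """Mimic the CSL layout: prefix, delimiter-joined cites, then suffix."""
    return prefix + delimiter.join(dois) + suffix


# Placeholder identifiers, in the style of the example short code.
print(render_layout(["10.100/100.1", "10.100/100.2"]))
```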

What have we achieved?

Reader gets a nicely laid out bibliography
Machine gets a primary identifier
Author gets some reasonable tools

The reader gets something nice. They can reformat the references. We would like to make this a user preference rather than that of the author or the publisher.

The author gets a nice visualisation. Moreover, these references DO something. These IDs need to be correct, or the reference will be wrong, but authors get immediate feedback on this.

Finally, the machines get a marked up primary identifier. There is a clear, unambiguous link from the in-text citation to the reference. There is a clear link from the in-text citation to the outside world and the primary ID. There is a clear link from the reference in the bibliography to the outside world also.
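From the machine's side, recovering those primary identifiers from the source text is straightforward. The sketch below is one possible reading, assuming the two short code forms shown in this talk: the bare `[cite]` form and the form with a `source` attribute. The function name is hypothetical.

```python
import re

# Matches [cite]ID[/cite] and [cite source='doi']ID[/cite].
CITE = re.compile(r"\[cite(?:\s+source=['\"](\w+)['\"])?\](.+?)\[/cite\]")


def extract_citations(text):
    """Return (source, identifier) pairs; source is None when not stated."""
    return [(m.group(1), m.group(2).strip()) for m in CITE.finditer(text)]


# Placeholder identifiers, in the style of the example short code.
found = extract_citations(
    "See [cite]10.100/100.1[/cite] and [cite source='doi']10.100/100.2[/cite]."
)
```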

What have we achieved?

We have made both the primary ID and the metadata useful
We have made authors want them to be right
Since using kcite I have
- Complained about metadata to crossref and datacite
  - Both now supply JSON directly, with the same interface
- Got several crossref metadata records fixed
- Discovered three broken DOIs in one special issue
  - One linked from my website; 404 is not enough
- Fixed author metadata on my blog
- Removed broken characters from bio-ontologies titles
- Picked up http://dropbox.org ref when posting this paper
- It was only after we wrote a tool that I fixed my metadata

What have we achieved?

In short, the total metadata in the world
is slightly better than it was before

Summary

We need to use metadata to improve everyone’s life
A balanced approach is necessary
A little semantics is a good thing

Acknowledgements

Robert Stevens, Manchester (kblog)
Lindsay Marshall, Newcastle (greycite)
Simon Cockell, Newcastle (kcite)
Dan Swan, (ex-)Newcastle (array express — see paper)
Georgina Moulton, Manchester (kblog content)
http://knowledgeblog.org
http://bio-ontologies.knowledgeblog.org
http://greycite.knowledgeblog.org