About

Three Steps to Heaven

We argue that to make semantic publishing possible we need to:

  • make life better for the machine to

  • make life better for the author to

  • make life better for the reader

Making life better for the machine is a reasonable objective in itself. To some extent, it is the point of semantic publishing. But, it is not an end, because, ultimately, we need the machine to do something for us.

We have to make life better for the author. It is all very well for a bioinformatician to say to a biologist "please mark up your text, so I can mine it", but the biologist will say "no, I don’t have time".

We need to make life better for the reader because that way they may nag the authors to add more, which is something the authors care about.

To begin at the beginning

images/537px-La_Palmyre_041-crop.jpg

Or to begin at the end. We are interested in the long tail. Academics produce a lot of content, much of it falling into the category of grey literature: short tutorial information, lab books and so on. We believe that an academic should be able to publish this information straightforwardly, while maintaining the important features of current academic publishing.

The problem with publishing

  • Often time-consuming

  • Over a long time

There are three big problems with current academic publishing, however, which are going to prevent this. First, it is far too time-consuming; I refer here to the publishing process itself and not to the effort of authoring, which is hard because of the intellectual creativity that it requires. Second, it’s too expensive; to some extent, I don’t care about this, because the money comes out of my grant and it’s still relatively small compared to the overall cost of science. But it does mean that I have to think about what is worth publishing. Third, the publishing industry is one that, to misquote Douglas Adams, thinks that dumping a PDF on the web is a pretty neat idea.

The process

images/500px-Adobe_PDF_Icon.svg.png

  • Again!

The process

images/500px-Adobe_PDF_Icon.svg.png

  • Again, Again!

The process

images/wordpress-blue-xl.png

  • Web First

  • RSS

  • Trackbacks

  • Stats

  • Gravatars

Here is one journal publishing process. It’s very heavy-weight. Pushing semantics through this is going to be well-nigh impossible.

So, we wanted a web-first, easily accessible process. We wanted to integrate with scientists’ existing workflows and existing tools. We wanted to build on top of commodity software wherever possible: we need tools like RSS, trackbacks, gravatars and commenting; we didn’t want to write them.

So, we used wordpress.

Existing Workflow

  • Fitting in with people’s workflows

  • Fitting in with people’s tools

    • Word/Email

    • Word/Dropbox

    • LaTeX/Versioning

    • Asciidoc/Dropbox

  • The wordpress editor is lacking

There are many existing workflows. Most of them use Word, email, Dropbox and so on. One of the nice consequences of going commodity is that most of this is sorted out for us up front. Most blogging engines provide an XML-RPC interface, and many text production systems can talk to it, or can be induced to.
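As a rough illustration of what talking XML-RPC to a blogging engine involves, the sketch below builds (but does not send) a request for the `wp.newPost` method of the WordPress XML-RPC API, using only the Python standard library. The user name, password, blog id and post text are placeholders, and this is not the code any of the tools mentioned here actually use.

```python
import xmlrpc.client


def build_new_post_request(username, password, title, body, blog_id=1):
    """Serialise a wp.newPost call; nothing is sent over the network."""
    content = {
        "post_title": title,
        "post_content": body,
        "post_status": "draft",  # leave unpublished until the author has checked it
    }
    return xmlrpc.client.dumps(
        (blog_id, username, password, content), methodname="wp.newPost"
    )


# Placeholder credentials and content, for illustration only.
request_xml = build_new_post_request(
    "author", "secret", "Three Steps to Heaven", "A short post with a shortcode."
)

# Decode the XML again to show what the blogging engine would receive.
params, method = xmlrpc.client.loads(request_xml)
print(method, params[3]["post_title"])
```

To send the post for real, the same arguments could be passed through `xmlrpc.client.ServerProxy`, pointed at the blog's XML-RPC endpoint (conventionally `xmlrpc.php` on a WordPress site).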

Adding Semantics

  • Adding semantics is hard

  • The three steps to heaven limit us

  • Going to describe two exemplars

The three steps place constraints on us. We cannot ask too much of the authors. We want them to see immediate advantage at each point.

In the paper, we describe three use cases. Because of time constraints, I am going to describe just two here, and I also want to add some additional work we have done more recently.

Maths (author)

  • The author adds [latex]e=mc^2[/latex] to their document

  • Which is rendered as:

$e = mc^2$

  • This works in any authoring environment

  • The author adds "here is maths" semantics

Maths (reader)

  • The reader gets a nicely rendered equation

images/mathjax-emc2.png

Maths (reader)

  • Which is NOT image based

images/mathjax-emc2-zoomed.png

Maths (reader)

  • And has some additional features

images/mathjax-emc2-with-popup.png

Maths (reader)

  • Access the tex

images/mathjax-emc2-with-tex.png

Maths (reader)

  • Or transformed to MathML

images/mathjax-emc2-with-mathml.png

We’ve used three key technologies here. First, TeX as a markup language (not the processor!). Second, wordpress and short codes. Finally, mathjax, which does the rendering. So, what does this look like to the user? The practical upshot is that you stick a short code in and get maths out.
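The server-side half of "short code in, maths out" can be sketched in a few lines. This is an illustrative Python stand-in, not the wordpress plugin itself: it assumes the plugin's job is simply to swap each `[latex]...[/latex]` pair for the `\(...\)` inline delimiters that MathJax recognises by default, leaving the rendering to the browser. The function name is made up for the example.

```python
import html
import re

# Matches one [latex]...[/latex] shortcode, non-greedily.
LATEX_SHORTCODE = re.compile(r"\[latex\](.+?)\[/latex\]", re.DOTALL)


def shortcodes_to_mathjax(post_text):
    """Replace each [latex] shortcode with MathJax inline delimiters."""

    def render(match):
        # Escape < and & so the TeX is safe inside an HTML page.
        tex = html.escape(match.group(1), quote=False)
        return r"\(" + tex + r"\)"

    return LATEX_SHORTCODE.sub(render, post_text)


print(shortcodes_to_mathjax("Einstein showed [latex]e=mc^2[/latex]."))
```

The TeX itself is passed through untouched apart from HTML escaping, which is what lets the reader later recover it from the rendered equation.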

Architecture

images/architecture-mathjax.jpg

Why TeX?

  • There is a W3C standard

$e = mc^2$

<math>
  <mrow>
    <mi>e</mi>
    <mo>=</mo>
    <mrow>
      <mi>m</mi>
      <msup>
        <mi>c</mi>
        <mn>2</mn>
      </msup>
    </mrow>
  </mrow>
</math>

There is a W3C standard for representation of maths. It’s long and horrible.

Most mathematicians are happy with TeX.

Why [Shortcodes]?

  • XML is painful

  • Most of the tools escape it anyway

  • Short codes are simple and reasonably well supported

  • We support other syntaxes also

Ironically, anything XML has a major problem. Most scientists do not want to edit in wordpress, or at the HTML level. Most of the tools that people use, such as Word, LaTeX (our hack) or asciidoc, assume that when people write XML they want to present it and not include it. So, it gets escaped.

Short codes are well supported in wordpress, and familiar to at least some people. They largely pass through the various editing environments untouched.
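The escaping problem is easy to reproduce. The sketch below uses Python's `html.escape` as a stand-in for what an authoring tool does on export (an assumption; each tool has its own exporter) and shows that XML markup is mangled while a shortcode survives unchanged.

```python
import html


def survives_escaping(markup):
    """True if HTML escaping leaves the markup exactly as the author typed it."""
    return html.escape(markup) == markup


xml_markup = "<math><mi>e</mi></math>"
shortcode = "[latex]e=mc^2[/latex]"

print(html.escape(xml_markup))       # angle brackets become entities
print(survives_escaping(xml_markup))
print(survives_escaping(shortcode))
```

Square brackets are not special in HTML, which is the whole reason the shortcode reaches wordpress intact.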

Why client-side rendering

  • MathJax did not need developing

  • It looks nice

  • And has added functionality

Essentially because it looks nice. We can click on equations. We can zoom in. We can get the TeX OR the MathML out.

What have we achieved?

  • Authors can write TeX.

  • Readers get a nice view

  • Machines get a markup saying "maths is here"

    • Can tell that markup is being used

    • Markup can be converted to MathML

    • Authors have (probably) already checked this works

The authors are happy. They can write maths anywhere, including TeX in Word. The readers get a nicer view.

The machines get a markup (TeX). They can either use this directly or use mathjax to convert this to MathML, in the knowledge that the author has already (effectively) done this, and checked it is correct.

Maths

  • What we have achieved adds limited semantics

  • But is useful to reader, author and machine

References

  • Academics love in-text citations (Lord, 2012)

  • Will describe two tools, kcite and greycite

References

Lord, P (2012) Academics love Citations. J.Unsupport.Assert

Citations are problematic, but relatively high value. We can use them to track impact, knowledge flow, discourse structure and the like. We have followed a similar methodology to before. A citation is inserted. Wordpress gathers the metadata for this in a very linked-data way. The browser then renders this with JavaScript.

References (author)

  • Authors again insert shortcodes

  • [cite]10.100/100.1[/cite]

images/kblog-with-edam-with-more-references.png

References (reader)

  • Readers can see citation or direct link

  • Readers can change citation style

images/kblog-with-edam-with-numeric-listing.png

References (kcite architecture)

images/architecture-kcite.jpg
  • First wordpress detects the identifier.

  • It either guesses or is told what kind of identifier we have.

  • It then queries one of four sources: crossref, greycite, arxiv, pubmed

  • Either by URL query, or content negotiation

  • CrossRef and Greycite return JSON

  • Arxiv and pubmed return their own XML, which we transform into JSON

  • The JSON is actually cached in the database for 2 months

  • Wordpress serves the HTML

  • Javascript calls back to wordpress

  • Wordpress serves the integrated JSON.
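The first few steps above can be sketched as follows. The heuristics in `guess_identifier_type` are hypothetical and only illustrate the kind of guessing involved; they are not kcite's actual rules. The second function builds (without sending) a content-negotiated request through the DOI resolver, asking for CSL JSON via the `Accept` header.

```python
import re
import urllib.request


def guess_identifier_type(identifier):
    """Hypothetical heuristics for guessing what kind of identifier we have."""
    identifier = identifier.strip()
    if re.match(r"^https?://", identifier):
        return "url"      # would go to greycite
    if re.match(r"^10\.\d+/\S+$", identifier):
        return "doi"      # would go to crossref
    if re.match(r"^\d{4}\.\d{4,5}(v\d+)?$", identifier):
        return "arxiv"
    if identifier.isdigit():
        return "pubmed"
    return "unknown"


def doi_metadata_request(doi):
    """Build a request that asks the DOI resolver for CSL JSON metadata."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )


print(guess_identifier_type("10.100/100.1"))
request = doi_metadata_request("10.100/100.1")
print(request.full_url, request.get_header("Accept"))
```

Being "told" the identifier type corresponds to the author overriding the guess, for instance with a `source` attribute on the shortcode.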

The Problem

  • Inserting links is painful

  • Fortunately, we can use the bibliographic metadata to help

  • Will demonstrate this after a brief interlude

Inserting references in this way is a pain. However, in most cases, we can do this in a metadata driven way. We can use the same metadata that is used to generate the bibliography for the authors.

Greycite

  • Originally, kcite could not support URLs

  • We needed a source for metadata

  • Now we have http://greycite.knowledgeblog.org

  • Developed by Lindsay Marshall, Computing Science, Newcastle

One substantial problem is the absence of an ability to do this for URIs, which we wanted to address with greycite. So, I am going to show how this works. Greycite is a new tool which mines bibliographic metadata from URIs.

It doesn’t work for all URIs. Mining is (deliberately) not too intelligent. We look mostly for things that are intended to be mined. We have also added tools to wordpress to allow flexible insertion of metadata (including with shortcodes, or through a nice GUI).

Greycite

  • This is greycite

images/greycite-home.png

Greycite

  • Add a URL

images/greycite-with-url.png

Putting in the URL of an article on my blog.

Greycite

  • URL results

images/greycite-russet-why-not.png

In this case, we’ve seen the URL before and greycite knows some basic metadata.

Greycite

  • Scrolling to more detail

images/greycite-url-landing.png

Looking in more detail, we can see that in March, the article had a title "Why Not?", was authored by me and dates from 2010.

Greycite

  • Provenance (coins)

images/greycite-with-coins.png

We have gathered this metadata from a variety of sources, and you can see the provenance. COinS is a dreadful standard which is now quite a few years old, but some people use it. It works by embedding a span tag into the body of the post.

Although it is dreadful in every way, the fact that it is embedded in the body is its most useful feature, as many people don’t control their headers.
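To make the span-tag mechanism concrete, here is a minimal sketch of reading COinS with the Python standard library. COinS spans carry the class `Z3988` and pack their metadata into the `title` attribute as URL-encoded key/value pairs. The sample HTML is invented for the example, and this is not greycite's own mining code.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the key/value pairs carried by COinS spans (class Z3988)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The title attribute is a URL-encoded query string.
            self.records.append(parse_qs(attributes.get("title") or ""))


# Invented sample: a post body with one embedded COinS span.
sample = (
    '<p>Post body <span class="Z3988" title="ctx_ver=Z39.88-2004'
    '&amp;rft.title=Why+Not%3F&amp;rft.date=2010"></span></p>'
)

parser = CoinsParser()
parser.feed(sample)
print(parser.records[0]["rft.title"], parser.records[0]["rft.date"])
```

Because the span lives in the post body, an author on a hosted blog can add it without any control over the page headers.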

Greycite

  • Provenance (OGP)

images/greycite-with-ogp.png

We also support the Open Graph Protocol. This is a much nicer "standard", partly developed by Facebook, who would appear to be much better at uncovering other people’s bibliographic metadata than, allegedly, they are at uncovering their own financial data.
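Open Graph metadata lives in `meta` tags whose `property` attribute starts with `og:`. A minimal sketch of collecting them, again with only the Python standard library and an invented sample page rather than greycite's real code:

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = attributes.get("property") or ""
        if tag == "meta" and name.startswith("og:"):
            self.properties[name] = attributes.get("content") or ""


# Invented sample head section.
sample_head = (
    '<head><meta property="og:title" content="Why Not?">'
    '<meta property="og:type" content="article">'
    '<meta name="generator" content="ignored"></head>'
)

ogp = OpenGraphParser()
ogp.feed(sample_head)
print(ogp.properties)
```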

Greycite

  • And other (hosted) blogs

images/rds-expedition-post.png

It does work with other sites outside our control. This is Robert’s blog. He is a raw wordpress user.

Greycite

  • Although somewhat limited

images/greycite-rds-expedition-post.png

This advertises some metadata, but sadly not the author or the date. This is hard for us to fix because we don’t control the website. We’ll probably add a COinS generator at some point, to generate the metadata that the website does not provide.

Greycite

  • And elsewhere

images/bbc-telescope-post.png

It also works with non-blog sites. This is an article from the BBC.

Greycite

images/greycite-bbc-telescope-post.png

Again, no author or date, but otherwise intact.

Greycite

  • Sadly, some websites have no semantics that we can find

images/sepublica-greycite.png

Greycite

  • Preservation

images/greycite--url-metadata-detail.png

We can also link through to other resources, such as archive.org. We also support archive.org.uk (provided by the British Library) and are working on webcitations.org.

Greycite

images/greycite-at-archive-org.png

So, we can maintain links to the academic record even if the links break. Currently, this is not apparent in kcite generated bibliographies, but we will add redirection in soon for links which appear to have gone 404.
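One possible shape for that redirection is sketched below. It is speculative, since the feature is not yet in kcite: the function names are invented, and the only external detail assumed is the Wayback Machine availability endpoint at archive.org, whose lookup URL is built but never fetched here.

```python
from urllib.parse import urlencode


def wayback_lookup_url(url):
    """Address to ask the Wayback Machine whether it holds a copy of URL."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


def choose_link(url, status_code, archived_url=None):
    """Prefer the live page; fall back to an archived copy on a 404."""
    if status_code == 404 and archived_url:
        return archived_url
    return url


print(wayback_lookup_url("http://example.org/post"))
print(choose_link("http://example.org/post", 404, "http://archive.example/copy"))
```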

Greycite

  • Finally

images/greycite--url-metadata-detail.png

Greycite

  • We can retrieve as JSON (or bibtex)

images/greycite-json.png
  • How does adding links work?

images/bio-ontologies-toc.png

This is the bio-ontologies website. Lots of papers on it. Many with complex metadata.

I have added support

  • We want to cite this page

images/bio-ontologies-edam.png
  • In an article I am writing

images/emacs-and-kblog.png

For my own editing environment, I use asciidoc, bibtex and emacs. I acknowledge that this is a little niche, but it does work with other environments also.

  • First we take the URL

images/emacs-with-edam-url.png

Actually, this is enough. It is all that you need; however, as an author I find myself citing the same URL repeatedly. Google is very good at getting you to where you want to go, but it is not perfect, particularly when there are a lot of articles on one topic. So, I wanted something quicker, searching over what I am interested in.

  • Query Greycite

images/emacs-with-edam-url-and-mxgreycite.png
  • Get back bibtex

images/emacs-with-edam-bibtex.png
  • So, we search

images/emacs-and-post-ref-with-regexp.png
  • And select

images/emacs-and-post-ref-with-reftex-dialog.png
  • And insert

images/emacs-and-post-ref-with-kurl-inserted.png
  • And publish

images/kblog-with-edam.png

All of this is tied together with just a little semantic glue. We needed some heuristics, we needed some format shifting, but that is it.

Citing with Emacs

  • Most of the tools are easy to connect

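(require 'reftex)

;; Assumed flag (not shown on the slide): when non-nil, emit kcite shortcodes.
(defvar phil-reftex-citation-override nil)

;; Wrap RefTeX's formatter so that the override can take over.
(defadvice reftex-format-citation (around phil-asciidoc-around activate)
  (if phil-reftex-citation-override
      (setq ad-return-value (phil-reftex-format-citation entry format))
    ad-do-it))

;; Emit the DOI of ENTRY as a [cite] shortcode in an asciidoc passthrough.
(defun phil-reftex-format-citation (entry format)
  (let ((doi (reftex-get-bib-field "doi" entry)))
    (format "pass:[[cite source='doi'\\]%s[/cite\\]]" doi)))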

Or with CSL

  • Citation Style Language, supported by Zotero, Mendeley and others

<citation>
   <layout prefix="[cite]" suffix="[/cite]"
     delimiter="[/cite] [cite]">
      <text variable="DOI"/>
   </layout>
</citation>
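What this layout produces can be mimicked in a few lines of Python: the prefix, then the DOI of each cited item joined by the delimiter, then the suffix. This imitates the effect of the CSL fragment rather than running a CSL processor, and the DOIs are made up.

```python
def render_cite_layout(dois, prefix="[cite]", suffix="[/cite]",
                       delimiter="[/cite] [cite]"):
    """Mimic the CSL layout: prefix + items joined by delimiter + suffix."""
    return prefix + delimiter.join(dois) + suffix


# Two invented DOIs cited together.
print(render_cite_layout(["10.1000/a", "10.1000/b"]))
```

The delimiter closes one shortcode and opens the next, so a multi-item citation becomes a run of independent `[cite]` shortcodes.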

What have we achieved?

  • Reader gets a nicely laid out bibliography

  • Machine gets a primary identifier

  • Author gets some reasonable tools

The reader gets something nice. They can reformat the references. We would like to make this a user preference rather than that of the author or the publisher.

The author gets a nice visualisation. Moreover, these references DO something. These IDs need to be correct, or the reference will be wrong, but authors get immediate feedback on this.

Finally, the machines get a marked-up primary identifier. There is a clear, unambiguous link from the in-text citation to the reference. There is a clear link from the in-text citation to the outside world and the primary ID. There is a clear link from the reference in the bibliography to the outside world also.
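From the machine's side, recovering those primary identifiers from the source text of a post is a small job. The sketch below assumes only the two shortcode forms that appear in this talk, `[cite]...[/cite]` and `[cite source='doi']...[/cite]`; the regular expression and function name are illustrative, and the second DOI is made up.

```python
import re

# A [cite] shortcode with an optional source attribute.
CITE = re.compile(
    r"\[cite(?:\s+source=['\"](?P<source>\w+)['\"])?\](?P<id>.+?)\[/cite\]"
)


def extract_citations(post_text):
    """Return (source, identifier) pairs; source is None when left to guessing."""
    return [
        (match.group("source"), match.group("id").strip())
        for match in CITE.finditer(post_text)
    ]


text = "See [cite]10.100/100.1[/cite] and [cite source='doi']10.1000/b[/cite]."
print(extract_citations(text))
```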

What have we achieved?

  • We have made both the primary ID and the metadata useful

  • We have made authors want them to be right

  • Since using kcite I have

    • Complained about metadata to crossref and datacite

      • Both now supply JSON directly, with the same interface

    • Got several crossref metadata records fixed

    • Discovered three broken DOIs in one special issue

      • One linked from my website; 404 is not enough

    • Fixed author metadata on my blog

    • Removed broken characters from bio-ontologies titles

    • Picked up http://dropbox.org ref when posting this paper

    • It was only after we wrote a tool that I fixed my metadata

What have we achieved?

  • In short, the total metadata in the world

  • is slightly better than it was before

Summary

  • We need to use metadata to improve everyone’s life

  • A balanced approach is necessary

  • A little semantics is a good thing

Acknowledgements