Michael Bell

Ph.D. Students Blog

Revisiting how many databases have become extinct

5 April, 2013 (14:30) | Uncategorized | By: Michael Bell

In my previous post I checked how many of the databases listed in DBcat and the NAR database issue from 1999 were still reachable. I did this as a quick way to gain an indication of how many databases have become extinct.

This approach, as I acknowledged, had a number of limitations. Indeed, a comment by Alex Bateman raised a very good point that I failed to state explicitly: many URLs from 1999 will have suffered URL decay. In other words, a database may still be alive but at a new URL. For example, DBcat has an entry for the Ad5E1A Database, with the URL “http://www.geocities.com/CapeCanaveral/Hangar/2541/”, showing the database was hosted on GeoCities. GeoCities was a popular host for webpages from the late 1990s to the mid 2000s, but Yahoo! closed the GeoCities service in 2009. A quick search shows that the Ad5E1A database is still reachable at a different URL, although it hasn’t been updated since October 2009. Therefore, as pointed out by Alex, this analysis served more as an indicator of URL decay than of how many databases have become extinct.

Whilst reading the 2006 NAR summary/update paper (http://dx.doi.org/10.1093/nar/gkj162), I realised that database summaries are provided with a persistent URL based on their entry number. So, for example, the URL http://www.oxfordjournals.org/nar/database/summary/157 points to the database entry for ABCdb, as ABCdb has the entry number 157. If ABCdb were to be removed from the list (as it previously had been, before later being resurrected) it would retain its previous entry number. Therefore, if this URL were visited and a 404 HTTP response code were returned, it would indicate that the database has been removed from the list. For example, the URL for entry 61 returns 404; it used to point to the IXDB database, which appears to have been shut down.

Therefore, by attempting to connect to each URL for entries 1 up to 1673 we can gain a better idea of how many databases have died. However, this isn’t strictly correct. A database may be removed from the NAR list for a number of reasons. Whilst a database being shut down is one such reason, it may just mean a database requires registration or has become commercialised. The aim of the NAR list was not to be exhaustive, but to contain databases that are publicly available and of high quality. Therefore, I guess it would be fair to say such a result will give an indication as to how many databases have failed to maintain the high standards required for inclusion in the NAR database list.

Anyway, as it was fairly quick to do, I wrote some code to check how many database summaries returned 404. In total, 161 databases have been removed from the NAR database list. For each of these URLs I checked the web archive to try to determine the name of the database that was removed (with mixed success). This list can be downloaded in CSV format.
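The code used for this check isn’t reproduced here. Purely as an illustration, a check of this kind could be sketched in Python as below. The URL pattern is the one quoted above (it may well no longer resolve), while the function names and the injectable get_status parameter are assumptions made for the example.

```python
import urllib.error
import urllib.request

# URL pattern quoted in the post; it may no longer resolve today.
SUMMARY_URL = "http://www.oxfordjournals.org/nar/database/summary/{}"


def summary_url(entry):
    """Build the summary URL for a NAR entry number."""
    return SUMMARY_URL.format(entry)


def http_status(url, timeout=10):
    """Return the final HTTP status code, or None if no connection was possible."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is what matters here.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def removed_entries(entries, get_status=http_status):
    """Return the entry numbers whose summary URL answers with 404."""
    return [e for e in entries if get_status(summary_url(e)) == 404]
```

Calling removed_entries(range(1, 1674)) would walk entries 1 to 1673. Passing a stub for get_status allows the logic to be tried without touching the network.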

Finally, it is worth acknowledging that there is a significant amount of literature looking at this kind of phenomenon. Perhaps the best known is the extensive work by Jonathan Wren, who has looked at factors such as URL and e-mail decay for articles indexed by MEDLINE.

How many biological databases have become extinct?

2 April, 2013 (13:53) | Uncategorized | By: Michael Bell

In my previous post I showed the number of biological databases in existence is growing each year. Here I wish to try and gain an insight into how many databases are now unreachable and thus have likely become extinct.

Although a number of “database of databases” sites exist, they will only contain databases that are currently active. Therefore, I obtained two lists of databases published around 1999/2000 to test how many of these still remain active. The first collection was obtained from the oldest NAR database list available (http://dx.doi.org/10.1093/nar/27.1.1). The second list was extracted from DBcat (http://www.ncbi.nlm.nih.gov/pubmed/10592168), a list that was obtained from the web archive, as the site no longer exists. These lists contain 202 and 511 databases, respectively. I have made the assumption that the databases in these lists were correct, reachable and active at the time of publishing. However, this assumption could be checked by viewing each URL in the web archive. These two lists vary in the information provided, but both contain a name, URL and a category for each database (although some DBcat entries don’t list a URL).

For each database, a connection was attempted for the provided URL. If a connection could not be established, it was registered as unreachable. If a connection was possible, the HTTP response code was noted. If the code was successful (2xx) it was determined as active whilst a client error (4xx) was recorded as unreachable. A couple of links returned a redirection code (3xx), which was followed to the redirected address and then analysed.
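The classification rules described above can be written down compactly. This is a sketch rather than the original code: the function name and the "unclassified" bucket for codes the post doesn’t mention (such as 5xx) are assumptions. Redirects are assumed to have been followed already, so only the final status is classified.

```python
def classify(status):
    """Map a final HTTP status (or None for a failed connection) to a label."""
    if status is None:
        # No connection could be established at all.
        return "unreachable"
    if 200 <= status < 300:
        # Successful response.
        return "active"
    if 400 <= status < 500:
        # Client error, e.g. 404 Not Found.
        return "unreachable"
    # Assumption: anything else is left for manual inspection.
    return "unclassified"


for code in (200, 404, None, 503):
    print(code, classify(code))
```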

Whilst this approach gives a reasonable indication as to the life of a database, it has to be acknowledged and stressed that it doesn’t cover all cases. For example, a database may be marked as “active” as it returns a 200 response code, yet it could just be a page saying the database no longer exists. Similarly, a database may still be active but was unreachable when the analysis was carried out.

In addition to the full URL, the host URL was also tested. For example, this blog has the “full” URL “http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/” whilst the host URL is “http://homepages.cs.ncl.ac.uk/”. This gives another interesting view. For example, it is possible that I may move my blog to, say, “/m.j.bell1/new_blog/” and the old link would then return 404 (assuming no redirection was in place), yet homepages.cs.ncl.ac.uk would remain active. In this case, the fact that homepages.cs.ncl.ac.uk remains alive may mean the database has simply moved. This isn’t always the case though, as the university will likely remove my hosting when I leave, and access to my blog will then return a 404 code whilst homepages.cs.ncl.ac.uk will again remain active. Either way, it is more likely that the database has become extinct if the host is unreachable.
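Deriving the host URL from a full URL is a small string operation. A minimal sketch using Python’s standard urllib.parse module (the function name host_url is an assumption for the example):

```python
from urllib.parse import urlsplit


def host_url(full_url):
    """Strip the path, query and fragment, keeping only scheme and host."""
    parts = urlsplit(full_url)
    return f"{parts.scheme}://{parts.netloc}/"


print(host_url("http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/"))
# http://homepages.cs.ncl.ac.uk/
```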

The results of this analysis for the databases listed in DBcat are shown in the table below:

                                          Full URL                    Host URL
Category                  Total       Alive  Dead  % Alive        Alive  Dead  % Alive
Protein Structure            18           8    10    44.4%           13     5    72.2%
Genomic                      55          20    35    36.4%           35    20    63.6%
DNA                          83          43    40    51.8%           62    21    74.7%
Literature                   38          18    20    47.4%           32     6    84.2%
Mapping                      28           8    20    28.6%           15    13    53.6%
RNA                          29           8    21    27.6%           18    11    62.1%
Protein                      89          31    58    34.8%           59    30    66.3%
Miscellaneous               141          50    91    35.5%           78    63    55.3%

…with the results for databases listed in NAR shown in the following table:

                                          Full URL                    Host URL
Category                  Total   Reachable  Dead  % Alive    Reachable  Dead  % Alive
Protein sequences             7           4     3    57.1%            6     1    85.7%
DNA and cDNA sequences        9           4     5    44.4%            7     2    77.8%
DNA sequence motifs           7           5     2    71.4%            6     1    85.7%
Gene Expression               8           1     7    12.5%            3     5    37.5%
Genome Overview              15           9     6      60%           12     3      80%
Maps                         10           3     7      30%            7     3      70%
RNA sequences                16           2    14    12.5%            8     8      50%
Protein sequence motifs      14           6     8    42.9%            9     5    64.3%
Protein curation             26           8    18    30.8%           17     9    65.4%
Proteomics overview           3           2     1    66.7%            3     0     100%
Structure                    20           8    12      40%           11     9      55%
Mutations                    38          13    25    34.2%           26    12    68.4%
Pathways and regulation       9           2     7    22.2%            3     6    33.3%
Transgenics                   2           0     2       0%            1     1      50%
Anatomy                       4           2     2      50%            3     1      75%
Other                        14           4    10    28.6%            5     9    35.7%

These results show a significant number of databases are now unreachable. In total, only 36% of those listed in NAR (63% for host URLs) and 39% of those listed in DBcat (65% for host URLs) remain reachable. The data for DBcat is also shown graphically below:

Bar chart of full URLs that are either active or unreachable
Bar chart of host URLs that are either active or unreachable
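The overall percentages quoted above can be re-derived from the per-category rows of the two tables. The sketch below simply sums the rows; the variable and function names are assumptions for the example. Note that the DBcat rows sum to 481 rather than 511, presumably because some DBcat entries don’t list a URL.

```python
# (total, alive by full URL, alive by host URL) per category, copied from the tables above.
DBCAT = [(18, 8, 13), (55, 20, 35), (83, 43, 62), (38, 18, 32),
         (28, 8, 15), (29, 8, 18), (89, 31, 59), (141, 50, 78)]
NAR = [(7, 4, 6), (9, 4, 7), (7, 5, 6), (8, 1, 3), (15, 9, 12), (10, 3, 7),
       (16, 2, 8), (14, 6, 9), (26, 8, 17), (3, 2, 3), (20, 8, 11),
       (38, 13, 26), (9, 2, 3), (2, 0, 1), (4, 2, 3), (14, 4, 5)]


def overall(rows):
    """Return (total databases, % alive by full URL, % alive by host URL)."""
    total = sum(r[0] for r in rows)
    full_alive = sum(r[1] for r in rows)
    host_alive = sum(r[2] for r in rows)
    return total, round(100 * full_alive / total), round(100 * host_alive / total)


print("DBcat:", overall(DBCAT))  # DBcat: (481, 39, 65)
print("NAR:", overall(NAR))      # NAR: (202, 36, 63)
```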

As previously stated, there are a number of ways this analysis could be improved, but my purpose here was just to gain an indicator of the rate of database death, rather than to carry out a full-blown analysis. Please bear in mind that this analysis and the generated data were automated. Although I checked a subset of the results, I didn’t check each URL manually.

These results aren’t too unexpected — if anything, it is somewhat surprising that such a significant number of databases actually remain alive after (at least) 13 years. Extending this analysis to confirm this data and to see how many of those databases that remain reachable are still regularly updated would be interesting. It would also be interesting to determine why each database “died” and in which year it died.

The growth of biological databases

28 March, 2013 (12:44) | Uncategorized | By: Michael Bell

As is commonly stated, biological data is constantly increasing. However, although it is well known that the number of databases in existence is growing, it is less common to see discussions regarding this growth.

The growth of biological data is frequently illustrated by showing a graph of the data being added to a database such as GenBank or UniProtKB over time. I was surprised that I wasn’t able to find many similar, up-to-date graphs of the number of biological databases over time. One such graph (from (http://dx.doi.org/10.1093/nar/gkr1099)) shows the number of publications in PubMed with the keyword “database” in the title over time. However, this isn’t entirely accurate: a paper with “database” in the title doesn’t necessarily describe a new database. Similarly, the growth and importance of databases can be highlighted by the introduction of “Database: The Journal of Biological Databases and Curation” in 2009; however, this doesn’t give us a quantitative measure of the growth.

I did find suitable graphs on two pages (see 1 and 2); however, these are a number of years out of date (2 and 6, respectively) and sadly don’t appear to provide the raw data. Therefore, I’ve gone about reproducing and updating these graphs, making the raw data available.

The pages showing this growth are based on databases listed in the Nucleic Acids Research (NAR) online Molecular Biology Database Collection, which currently lists 1512 online databases. The most recent issue (2013) (http://dx.doi.org/10.1093/nar/gks1297) marks the 20th annual database issue of NAR. The first issue in 1993 consisted of 24 articles (there were two issues prior to this, in 1992 and 1991, which contained 18 and 19 articles, respectively, although they were not formally labelled as database issues).

Although there are over 20 years of database issues, it would be too time consuming to calculate the growth of databases prior to 1999. Since 1999, each issue has had a corresponding paper summarising the databases and the number of databases listed, meaning the number of biological databases listed can be easily extracted. Prior to this it would become more complicated to calculate the size of the list, as the number of articles within an issue doesn’t necessarily reflect the number of databases listed. For example, the 2013 version includes 176 articles (corresponding to 88 new databases and 88 updates), yet the collection lists 1512 online databases. Whilst it would be possible to calculate this information for databases prior to 1999, it would add little benefit for my needs: 15 years’ worth of data is sufficient to illustrate that the number of databases is continually growing over time.

I’ve extracted the information from each summary paper and shown it below, supported with the corresponding paper reference and a graph to illustrate this growth:

Year    Number of databases    Reference
1999    202                    (http://dx.doi.org/10.1093/nar/27.1.1)
2000    226                    (http://dx.doi.org/10.1093/nar/28.1.1)
2001    281                    (http://dx.doi.org/10.1093/nar/29.1.1)
2002    335                    (http://dx.doi.org/10.1093/nar/30.1.1)
2003    386                    (http://dx.doi.org/10.1093/nar/gkg120)
2004    548                    (http://dx.doi.org/10.1093/nar/gkh143)
2005    719                    (http://dx.doi.org/10.1093/nar/gki139)
2006    858                    (http://dx.doi.org/10.1093/nar/gkj162)
2007    968                    (http://dx.doi.org/10.1093/nar/gkl1008)
2008    1078                   (http://dx.doi.org/10.1093/nar/gkm1037)
2009    1170                   (http://dx.doi.org/10.1093/nar/gkn942)
2010    1230                   (http://dx.doi.org/10.1093/nar/gkp1077)
2011    1330                   (http://dx.doi.org/10.1093/nar/gkq1243)
2012    1380                   (http://dx.doi.org/10.1093/nar/gkr1196)
2013    1512                   (http://dx.doi.org/10.1093/nar/gks1297)

The growth of biological databases over time.
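For anyone wanting to work with these numbers directly, the yearly increase can be computed from the table in a few lines. The sketch below is illustrative only; the counts are copied from the table, and the names COUNTS and yearly_increase are assumptions for the example.

```python
# Number of databases listed by NAR per year, copied from the table above.
COUNTS = {
    1999: 202, 2000: 226, 2001: 281, 2002: 335, 2003: 386,
    2004: 548, 2005: 719, 2006: 858, 2007: 968, 2008: 1078,
    2009: 1170, 2010: 1230, 2011: 1330, 2012: 1380, 2013: 1512,
}


def yearly_increase(counts):
    """Return the change in the number of listed databases for each year after the first."""
    years = sorted(counts)
    return {y: counts[y] - counts[prev] for prev, y in zip(years, years[1:])}


increase = yearly_increase(COUNTS)
largest = max(increase, key=increase.get)
print(largest, increase[largest])             # 2005 171
print(round(COUNTS[2013] / COUNTS[1999], 1))  # 7.5
```

On these figures the list grew roughly 7.5-fold between 1999 and 2013, with the largest single-year increase (171 databases) in 2005.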

This growth is mostly due to the creation of databases that deal with a specific specialisation. For example, the neXtProt database was created to provide a resource dealing exclusively with human proteins (http://dx.doi.org/10.1093/nar/gkr1179). Given this, the list of databases on NAR divides databases into 15 main groups, which are further sub-divided into 40 sub-groups.

The NAR list provides an ideal platform to analyse growth over time as we can extract the number of databases listed for each year. However, the NAR list doesn’t capture all available databases. The databases listed are only those that are “of high value to the biological community”, whilst inclusion of new databases is by invitation only. Therefore, it is likely many more databases exist.

A number of “database of databases” sites exist, such as a list in Wikipedia, MetaBase (http://dx.doi.org/10.1093/nar/gkr1099) and the Bioinformatics links directory (http://dx.doi.org/10.1093/nar/gkr514), although these are significantly based on the NAR list. However, one such site, DBcat (http://www.ncbi.nlm.nih.gov/pubmed/10592168), listed 513 databases in 2000 compared to the 226 listed by NAR. DBcat went without updates from 2000 until 2006, when it was shut down (archived version).

Sadly a number of resources, such as DBcat, will exist without updates or will be shut down due to factors such as funding running out. An analysis of databases that have become extinct due to such factors is arguably more interesting than one about the growth of databases, and I hope to do such an analysis in a follow-up post. I suspect a significant number of databases will sadly have gone this way. For example, it is hard to imagine that a database such as Swiss-Prot could disappear, yet in 1996 it almost did due to a significant funding crisis. Financial support remains a common challenge amongst databases (http://dx.doi.org/10.1093/database/bap017), with resources such as Protégé, BioMagResBank (BMRB) and REBASE recently having funding difficulties (http://dx.doi.org/10.1038/489019a). This really does highlight the importance of long term preservation of the data held within databases; something that isn’t necessarily guaranteed given the way that the majority of data is currently stored (http://dx.doi.org/10.1093/database/bap017).

Citation reuse in UniProtKB

19 March, 2013 (10:50) | Uncategorized | By: Michael Bell

The majority of the work discussed on this blog has looked at annotations within UniProtKB. In a nutshell, we have shown high levels of annotation reuse, which is essentially done as a matter of protocol. I’m now in the process of writing up, and have just been looking at citation information in UniProtKB. Although not too surprising, it appears that the reuse of citations in UniProtKB is also common.

I started to look at this data as I was curious how many articles have been manually curated by annotators over the lifespan of Swiss-Prot (the last 27 years or so). Reading an article about manual curation in FlyBase, I was interested to see that it can take between two and four months for an article to be fully curated (curation in FlyBase is done on an article, rather than protein/entry, basis) (http://dx.doi.org/10.1093/database/bas039). Given this, I wanted to get a grasp of how much of a bottleneck literature curation in UniProtKB is.

The UniProtKB curation process is well documented (http://dx.doi.org/10.1093/database/bar009) and states that, for each paper identified for curation, the full text is read, with relevant information extracted.

The most current statistics for Swiss-Prot (Version 2013_03 at time of writing) state that there are 1,037,168 reference lines (RL), each of which relates to one reference (citation) in one entry (the RL lines don’t wrap over two or more lines). For example, in P63015 there are three RL lines, meaning that three journal articles are referenced. The statistics also state that an average entry has 1.92 references, with 829,697 of the references being journal references, coming from 2306 journals (the remaining references are made up from sources such as books, theses and Worm Breeder’s Gazette). The latest statistics for TrEMBL show 23,815,989 citations over 21,182,512 entries.

Clearly references are reused; it is unlikely that 829,697 journal articles have been read by UniProt curators (who currently list 43 biocurators on their staff list), whilst TrEMBL has more citations than exist in PubMed. Sadly, these statistics don’t state how many distinct citations there are. However, in a 2010 paper UniProt stated that “currently, there are ∼228 000 distinct PubMed citations associated with ∼4.2 million UniProtKB sequences and 67% of these citations are in UniProtKB/Swiss-Prot.” with “11 external sources contribute ∼350 000 unique PubMed citations not yet annotated in UniProtKB, covering ∼188 000 UniProtKB entries.” (http://dx.doi.org/10.1093/nar/gkp846). Obtaining the statistics for the nearest major release of Swiss-Prot relating to these figures shows a total of 781,540 references, 628,701 of which are journal references.

This suggests that, in 2010, each journal reference was reused approximately 5 times on average (based on citations in Swiss-Prot). Given the growth in references over 3 years, it is likely this reuse has increased. Presumably the number of relevant papers identified in a typical month will be increasing, which may also impact this reuse. The biomedical literature is growing at a double-exponential rate (http://dx.doi.org/10.1016/j.molcel.2006.02.012). Currently, PubMed indexes over 22 million citations, up from 20 million in mid 2010 (see two interesting posts [1, 2] by Duncan Hull on pros/cons of this, and how many journal articles exist).
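The calculation behind the “approximately 5 times” figure isn’t spelled out. One reading that gives a number of this order, offered only as an assumption, divides the Swiss-Prot reference counts quoted above by the share of distinct PubMed citations said to be in Swiss-Prot:

```python
# Figures quoted in the text; how they were combined is an assumption of this sketch.
distinct_pubmed = 228_000   # approximate distinct PubMed citations in UniProtKB (2010 paper)
share_in_swissprot = 0.67   # "67% of these citations are in UniProtKB/Swiss-Prot"
total_refs = 781_540        # all references in the matching Swiss-Prot release
journal_refs = 628_701      # journal references only

distinct_in_swissprot = distinct_pubmed * share_in_swissprot

print(round(total_refs / distinct_in_swissprot, 1))    # 5.1
print(round(journal_refs / distinct_in_swissprot, 1))  # 4.1
```

Either way, the ratio comes out at somewhere between four and five uses per distinct citation, which is the order of magnitude that matters here.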

This citation reuse may be due to factors such as a paper containing multiple sequences. Either way, further analysis would be interesting. For example, it would be interesting to analyse this citation reuse in relation to sentence reuse. I assume there would be some correlation, which we could potentially use to gain further confidence in identifying sentence propagation.

My experience of presenting at ECCB’12 #eccb12

11 September, 2012 (13:32) | Uncategorized | By: Michael Bell

Earlier today I presented our work on annotation quality (http://dx.doi.org/10.1093/bioinformatics/bts372) at ECCB’12. This was my first time presenting outside of an internal lab seminar and it was a great experience. I was in the 525-seat Montreal auditorium at the Congress Center Basel, which was a really nice place to talk. There was a good number of people in attendance; at a guess, I’d say around 200 people.

I think my talk went reasonably well. There were a couple of places I could have been clearer/more flowing, but overall I’m quite happy. It was also great to have some of the UniProt people in the audience asking questions (which I think I could have answered a bit better). It was also nice to catch some of them for a chat afterwards.

Twitter has been quite popular during the conference, and the hashtag #eccb12 has been trending. It was interesting to read the tweets people had sent during the talk, and the whole Twitter experience during ECCB has been pretty cool!

It has been a great experience to be at ECCB. Some of the statistics provided in the opening talk were impressive (sadly I can’t remember them specifically/all). However, there are over 1000 attendees and 341 submissions were received, of which 48 were accepted for oral presentation. I’m really pleased that our work was accepted, and want to extend a thanks to the organisers of ECCB for the opportunity to come and talk.

I’d also like to thank my co-authors Colin Gillespie, Daniel Swan and Phillip Lord, without whom this work would not have been possible.

An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB

9 September, 2012 (10:41) | Uncategorized | By: Michael Bell

Just a quick post to say that our paper has now been published by Bioinformatics:

“An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB”(http://dx.doi.org/10.1093/bioinformatics/bts372)

Presentation for ECCB’12

5 September, 2012 (12:51) | Annotation Quality, Miscellaneous, Uncategorized | By: Michael Bell

As mentioned previously, we (Colin Gillespie, Daniel Swan, Phillip Lord and I) have recently had some of our work on annotation quality accepted at the European Conference on Computational Biology (ECCB’12). Given this, I am due to give an oral presentation at ECCB’12 in Switzerland on the 11th September (ECCB’12 Schedule). I have made this talk available for those who are interested:

The paper from this work will be published in a special open-access issue of the journal “Bioinformatics” in September. However, it is already available in arXiv for those wishing to read it (http://arxiv.org/abs/1208.2175)!

Additional animated power-law graphs

16 August, 2012 (12:44) | Uncategorized | By: Michael Bell

In the previous post I introduced some animated graphs that were produced to aid the analysis of Swiss-Prot and TrEMBL annotations over time. Visual inspection of these graphs is beneficial; however, the main analytical value comes from the extracted alpha values for each graph, something which I could have included. Many thanks to Colin Gillespie for pointing this out. I have re-created the Swiss-Prot and TrEMBL graphs with the corresponding alpha value now included. These are shown below:

Animation showing Swiss-Prot over time, with included alpha value.
Make it slower or even slower still…

Animation showing TrEMBL over time, with included alpha value.
Make it slower or even slower still…

Animation showing both Swiss-Prot and TrEMBL over time, with included alpha values.
Make it slower or even slower still…

Our early analysis of Swiss-Prot found an interesting “kink” in the graph, which was identified by visual inspection. This turned out to be copyright statements and provides an interesting talking point, some of which has been discussed previously. It is hard to believe I forgot to include this graph… therefore the graph for Swiss-Prot versions where copyright statements are still included is shown below:

Animation showing Swiss-Prot over time, with copyright included.
Make it slower or even slower still…

Animation: visualising change over time in power-law graphs

15 August, 2012 (08:51) | Annotation Quality, Miscellaneous | By: Michael Bell

As documented in numerous posts, a key aspect of my research has been the application of power laws to word occurrence in bulk biological annotation. A large portion of this work has recently been accepted at the European Conference on Computational Biology (ECCB) 2012 and the resulting paper will shortly appear in a special issue of Bioinformatics. The acceptance into ECCB’12 requires an oral presentation. Therefore, I am currently in the process of creating a selection of images and slides for this talk.

As is clear from previous posts, this work produces a lot of graphs. One of the challenges of writing the paper was trying to include figures that allow the key points to be clearly made within the page limit enforced upon us. For example, if we show the resulting graph for all versions of Swiss-Prot, TrEMBL and the overlay of both we will comfortably exceed 100 images. This clearly cannot be suitably done in publication format, nor would the resulting paper clearly show the change over time. (Well, I guess we could have produced something along the lines of a “flick book” as part of the publication; however, I’m unsure my supervisor would be happy with the subsequent publication and colour figure costs that OUP would charge.)

However, unlike the paper, the presentation provides an ideal opportunity to show these graphs and the change over time. To do this I have created animated images for Swiss-Prot, TrEMBL and overlay images. The animations run reasonably quickly – 1/10th of a second between transitions. I have provided two further links for each of the animations which are slower (50ms and 1 second between transitions) below each graph. These graphs are shown below:

Animation showing the change over time in Swiss-Prot.
Make it slower or even slower still…

Animation showing the change over time in TrEMBL.
Make it slower or even slower still…

Animation showing the divergence of Swiss-Prot and TrEMBL over time.
Make it slower or even slower still…

These views clearly show, for example, the divergence between Swiss-Prot and TrEMBL over time. These animations are a powerful way to visually analyse data that cannot be replicated by viewing static images side by side, or the extracted alpha values. For example, viewing the change in Swiss-Prot over time we see that while the head of the power law is “structurally” similar, it actually drops significantly (from the second point, i.e. x = 2) over time. This would suggest that over time the richness of individual words appears to be increasing — i.e. more words are occurring only a single time in later versions.
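The alpha values referred to here are the exponents of the fitted power laws. How they were fitted isn’t described in this post; as an assumption for illustration, the widely used approximate maximum-likelihood estimator for discrete data (Clauset, Shalizi and Newman, 2009) can be sketched as follows, where occurrences holds, for each word, how many times it occurs. The function name and the default x_min are choices made for the example.

```python
import math


def estimate_alpha(occurrences, x_min=1):
    """Approximate MLE of a discrete power-law exponent.

    Uses alpha ~= 1 + n / sum(ln(x_i / (x_min - 0.5))) over all x_i >= x_min.
    The approximation is rougher for small x_min, so treat the result as indicative.
    """
    tail = [x for x in occurrences if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1 + len(tail) / log_sum


# Hypothetical word counts: more words occurring only once gives a larger alpha.
print(estimate_alpha([1, 1, 1, 2]) > estimate_alpha([1, 1, 2, 10]))  # True
```

Under this estimator, a corpus dominated by words that occur only once or twice yields a larger alpha than one with many heavily repeated words.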

However, I’m not convinced that the animation of word clouds is quite as meaningful:

Animation of various word clouds.

These word clouds show the various words in Swiss-Prot annotations over time. I have previously blogged about these, as well as making all the images available on my website.

Annotation maturity: Comparison of annotations in new and old sets of UniProtKB entries.

3 February, 2012 (14:57) | Uncategorized | By: Michael Bell

Carrying on from the previous post, we now wish to look at annotation maturity in sets of UniProtKB entries. We have seen that the quality of annotations appears to be decreasing over time, for both Swiss-Prot and TrEMBL. A reasonable explanation for this would be that annotations are constantly being added to the newly incorporated data, which in turn has put additional pressure on curators, meaning that over time the least effort has shifted from the reader to the annotator. Whilst we see a reduction in the overall annotation quality, we suspect that a mature set of entries would improve with age.

To approach this analysis, we compare annotations from entries that are in both Swiss-Prot Version 9 and UniProtKB/Swiss-Prot version 15. We also show the resulting alpha value for annotations from entries in UniProtKB/Swiss-Prot version 15 but not Swiss-Prot Version 9. The resulting graph for this is shown below:

Difference in Alpha value between SP9 and UPSP15

With the assumption that maturity is linked to age, we would expect that the quality of annotations within a set of old entries would improve over time. Interestingly, this doesn’t appear to be the case. Whilst the alpha value does decrease (by roughly 0.1), the alpha value for the remaining entries is significantly lower. This would suggest that annotation quality across the whole database is generally decreasing, although the rate of decrease depends on the initial age of the entry. Given this, it is of interest to see how the quality of annotations in only new entries changes over time. For this, we extract annotations for those entries that appeared for the first time in a given version. The resulting graph for this is shown below:

Alpha values for new annotations in new entries in various Swiss-Prot databases.
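Selecting the entries that first appear in a given version amounts to a running set difference over successive releases. A minimal sketch, in which the function name and the accession values are hypothetical and entries are assumed to be identified by a stable accession:

```python
def first_appearances(releases):
    """Map each version to the accessions not seen in any earlier version.

    `releases` is a list of (version, accessions) pairs in chronological order.
    """
    seen = set()
    new_by_version = {}
    for version, accessions in releases:
        current = set(accessions)
        new_by_version[version] = current - seen
        seen |= current
    return new_by_version


# Hypothetical accessions, for illustration only.
releases = [
    (9, ["A0001", "A0002"]),
    (15, ["A0001", "A0002", "A0003"]),
]
print(first_appearances(releases)[15])  # {'A0003'}
```

The earlier comparison between Swiss-Prot Version 9 and UniProtKB/Swiss-Prot version 15 is the two-release case of the same idea: the intersection of the two sets gives the mature entries, and the difference gives the newer ones.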

This graph also shows a steady decrease over time – a similar pattern to most of our previous analyses. These results are interesting; it would appear that annotation of new data is getting worse over time. It also appears to have a detrimental effect on other annotations. We have discussed the increase in data in relation to annotation quality, but the impact of size alone would not explain the decrease of annotation quality in older entries. One possible explanation for this is the protocol used for annotation curation.

The curation of annotations is a clear process, consisting of 6 key steps. This process is detailed in (http://dx.doi.org/10.1093/database/bar009), with an overview of the process shown in the figure below (taken from (http://dx.doi.org/10.1093/database/bar009)).

Outline of the UniProtKB manual curation process, taken from paper: DOI:10.1093/database/bar009

Part of the protocol is to, for a given sequence, identify similar entries and then standardise and propagate annotation between these entries to ensure data consistency. Presumably, over time the curation process has undergone revisions (worthy of further investigation and another blog post!) due to changes in resources, increase of data, and so on. It is possible that this curation process has been refined to deal with larger amounts of data and quicker release dates (both of which are true for Swiss-Prot over time – early versions of Swiss-Prot saw around 1000 new entries being added, with the later versions seeing around 30,000 new entries, whilst the release cycle is more frequent by a couple of months). Although the increase in manually curated entries and faster release dates could be due to more curators rather than a change in annotation protocol (which will be investigated further), it is plausible that attempts to standardise annotation between similar entries are actually having a detrimental effect on overall annotation quality.
