<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Michael Bell</title>
	<atom:link href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog</link>
	<description>Ph.D. Student's Blog</description>
	<lastBuildDate>Fri, 05 Apr 2013 14:30:29 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Revisiting how many databases have become extinct</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=928</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=928#comments</comments>
		<pubDate>Fri, 05 Apr 2013 14:30:29 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=928</guid>
		<description><![CDATA[In my previous post I checked how many of the databases listed in DBcat and the NAR database issue from 1999 were still reachable. I did this as a quick way to gain an indication of how many databases have become extinct. This approach, as I acknowledged, had a number of limitations. Indeed, [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="928">
<p>In my <a href="?p=872">previous post</a> I checked how many of the databases listed in DBcat and the NAR database issue from 1999 were still reachable. I did this as a quick way to gain an indication of how many databases have become extinct.</p>
<p>This approach, as I acknowledged, had a number of limitations. Indeed, <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=872#comments">a comment</a> by <a href="https://twitter.com/Alexbateman1">Alex Bateman</a> raised a very good point that I failed to state explicitly: many URLs from 1999 will have suffered URL decay. In other words, a database may still be alive but at a new URL. For example, DBcat has an entry for the Ad5E1A Database, with the URL &#8220;<a href="http://www.geocities.com/CapeCanaveral/Hangar/2541/">http://www.geocities.com/CapeCanaveral/Hangar/2541/</a>&#8221;, showing the database was hosted on GeoCities. GeoCities was a popular host for webpages from the late 1990s to the mid 2000s, but Yahoo! closed the GeoCities service in 2009. A quick search shows that the Ad5E1A database is still reachable at a <a href="http://publish.uwo.ca/~jmymryk/E1A/data.html">different URL</a>, although it hasn&#8217;t been updated since October 2009. Therefore, as pointed out by Alex, this analysis served more as an indicator of URL decay than of how many databases have become extinct.</p>
<p>Whilst reading the 2006 NAR summary/update paper <span id="cite_ITEM-928-0" name="citation"><a href="#ITEM-928-0">[1]</a></span>, I realised that database summaries are provided with a persistent URL based on their entry number. So, for example, the URL <a href="http://www.oxfordjournals.org/nar/database/summary/157">http://www.oxfordjournals.org/nar/database/summary/157</a> points to the database entry for ABCdb, as ABCdb has the entry number 157. If ABCdb were to be removed from the list (as it previously had been, before later being resurrected) it would retain its previous entry number. Therefore, if a summary URL is visited and a 404 HTTP response code is returned, it indicates that the database has been removed from the list. For example, the URL for entry 61 returns 404; it used to point to the IXDB database, which appears to have been shut down.</p>
<p>Therefore, by attempting to connect to each URL for entries 1 up to 1673 we can gain a better idea of how many databases have died. However, this isn&#8217;t strictly correct. A database may be removed from the NAR list for a number of reasons. Whilst a database being shut down is one such reason, it may just mean a database requires registration or has become commercialised. The aim of the NAR list was not to be exhaustive, but to contain databases that are publicly available and of high quality. Therefore, I guess it would be fair to say such a result will give an indication as to how many databases have failed to maintain the high standards required for inclusion in the NAR database list.</p>
<p>Anyway, as it was fairly quick to do, I wrote some code to check how many database summaries returned 404. In total, 161 databases have been removed from the NAR database list. For each of these URLs I checked the web archive to try to determine the name of the database that was removed (with mixed success). This list can be <a href="images/NAR404.csv">downloaded in CSV format</a>.</p>
<p>Finally, it is worth acknowledging that there is a significant amount of literature looking at this kind of phenomenon. Perhaps the best known is the <a href="http://scholar.google.co.uk/citations?user=KNC_0MkAAAAJ&#038;hl=en&#038;oi=ao">extensive work</a> by Jonathan Wren, who has looked at factors such as URL and e-mail decay for articles indexed by MEDLINE.</p>
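<p>A minimal sketch of such a check might look like the following. It is illustrative only and is not the code used for this post: the function names, the use of HEAD requests and the timeout are all assumptions, and the summary URL pattern is simply the one quoted above.</p>

```python
import urllib.error
import urllib.request

# URL pattern quoted in the post; the entry number is the only variable part.
SUMMARY_URL = "http://www.oxfordjournals.org/nar/database/summary/{entry}"


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response was received."""
    # Assumption: a HEAD request is enough, since only the status code matters.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as exceptions but still carry a code.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None


def removed_entries(entry_ids, fetch_status=http_status):
    """Return the entry numbers whose summary page answers with 404."""
    return [
        entry
        for entry in entry_ids
        if fetch_status(SUMMARY_URL.format(entry=entry)) == 404
    ]
```

<p>Calling <code>removed_entries(range(1, 1674))</code> would walk entries 1 to 1673. The <code>fetch_status</code> parameter is there so that the logic can be exercised without touching the network.</p>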
<h2>References</h2>
    <ol>
    <li><a name='ITEM-928-0'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2006 update", <i>Nucleic Acids Research</i>, vol. 34, pp. D3-D5, 2006. <a href="http://dx.doi.org/10.1093/nar/gkj162">http://dx.doi.org/10.1093/nar/gkj162</a>


</li>
</ol>

</div> <!-- kcite-section 928 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=928</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How many biological databases have become extinct?</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=872</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=872#comments</comments>
		<pubDate>Tue, 02 Apr 2013 13:53:24 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=872</guid>
		<description><![CDATA[In my previous post I showed the number of biological databases in existence is growing each year. Here I wish to try and gain an insight into how many databases are now unreachable and thus have likely become extinct. Although a number of &#8220;database of databases&#8221; exist, they will only contain databases that are currently [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="872">
<p>In my <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830">previous post</a> I showed the number of biological databases in existence is growing each year. Here I wish to try and gain an insight into how many databases are now unreachable and thus have likely become extinct.</p>
<p>Although a number of &#8220;database of databases&#8221; exist, they will only contain databases that are currently active. Therefore, I obtained two lists of databases published around 1999/2000 to test how many of these still remain active. The first collection was obtained from the oldest NAR database list available <span id="cite_ITEM-872-0" name="citation"><a href="#ITEM-872-0">[1]</a></span>. The second list was extracted from DBcat <span id="cite_ITEM-872-1" name="citation"><a href="#ITEM-872-1">[2]</a></span>, a list that was obtained from the <a href="http://web.archive.org/web/20060709044154/http://www.infobiogen.fr/services/dbcat/">web archive</a>, as the site no longer exists. These lists contain 202 and 511 databases, respectively. I have made the assumption that the databases in these lists were correct, reachable and active at the time of publishing. However, this assumption could be checked by viewing each URL in the web archive. These two lists vary in the information provided, but both contain a name, URL and a category for each database (although some DBcat entries don&#8217;t list a URL).</p>
<p>For each database, a connection was attempted for the provided URL. If a connection could not be established, the database was registered as unreachable. If a connection was possible, the HTTP response code was noted. If the code indicated success (2xx) the database was recorded as active, whilst a client error (4xx) was recorded as unreachable. A couple of links returned a redirection code (3xx); these were followed to the redirected address, which was then analysed.</p>
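<p>The rules above can be sketched roughly as follows. This is a hedged illustration rather than the original code: the names <code>classify</code> and <code>check</code>, the injected <code>fetch</code> callable, the redirect limit and the &#8220;other&#8221; label for codes the post does not mention (such as 5xx) are all assumptions.</p>

```python
def classify(status):
    """Map an HTTP status code (or None for no connection) to a label."""
    if status is None:
        return "unreachable"  # a connection could not be established
    family = status // 100
    if family == 2:
        return "active"
    if family == 3:
        return "redirect"
    if family == 4:
        return "unreachable"
    return "other"  # assumption: the post does not say how e.g. 5xx was handled


def check(url, fetch, max_redirects=5):
    """Classify url, following redirects.

    fetch is a hypothetical callable returning (status, location), where
    location is the redirect target or None.
    """
    for _ in range(max_redirects + 1):
        status, location = fetch(url)
        label = classify(status)
        if label == "redirect" and location:
            url = location  # analyse the redirected address instead
            continue
        return label
    # Assumption: an endless redirect chain is treated as unreachable.
    return "unreachable"
```

<p>Passing the fetcher in as a parameter keeps the classification separate from whichever HTTP library is actually used.</p>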
<p>Whilst this approach gives a reasonable indication as to the life of a database, it has to be acknowledged and stressed that it doesn&#8217;t cover all cases. For example, a database may be marked as &#8220;active&#8221; as it returns a 200 response code, yet it could just be a page saying the database no longer exists. Similarly, a database may still be active but was unreachable when the analysis was carried out.</p>
<p>In addition to the full URL, the host URL was also tested. For example, this blog has the &#8220;full&#8221; URL &#8220;http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/&#8221; whilst the host URL is &#8220;http://homepages.cs.ncl.ac.uk/&#8221;. This gives another interesting view. For example, it is possible that I may move my blog to, say, &#8220;/m.j.bell1/new_blog/&#8221; and the old link would then return 404 (assuming no redirection was in place), yet homepages.cs.ncl.ac.uk would remain active. In this case, the fact that homepages.cs.ncl.ac.uk remains alive may mean the database has simply moved. This isn&#8217;t always the case though, as the university will likely remove my hosting when I leave, and then access to my blog will return a 404 code, whilst homepages.cs.ncl.ac.uk will again remain active. Either way, it is more likely that the database has become extinct if the host is unreachable.</p>
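<p>Deriving the host URL from the full URL is straightforward. A small sketch using Python&#8217;s standard <code>urllib.parse</code> module is shown here; the function name <code>host_url</code> is made up for illustration.</p>

```python
from urllib.parse import urlsplit


def host_url(full_url):
    """Strip the path, query and fragment, keeping only scheme and host."""
    parts = urlsplit(full_url)
    return f"{parts.scheme}://{parts.netloc}/"


print(host_url("http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/"))
# prints http://homepages.cs.ncl.ac.uk/
```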
<p>The results of this analysis for the databases listed in DBcat are shown in the table below:</p>
<table border="1" cellspacing="0">
<colgroup width="119"></colgroup>
<colgroup width="112"></colgroup>
<colgroup span="6" width="85"></colgroup>
<tbody>
<tr>
<td align="LEFT" height="16"></td>
<td align="LEFT"></td>
<td colspan="3" align="CENTER" valign="MIDDLE"><strong>Full URL</strong></td>
<td colspan="3" align="CENTER" valign="MIDDLE"><strong>Host URL</strong></td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Category</strong></td>
<td style="text-align: center;" align="LEFT"><strong>Total</strong></td>
<td style="text-align: center;" align="LEFT"><strong>Alive</strong></td>
<td style="text-align: center;" align="LEFT"><strong>Dead</strong></td>
<td style="text-align: center;" align="LEFT"><strong>% Alive</strong></td>
<td style="text-align: center;" align="LEFT"><strong>Alive</strong></td>
<td style="text-align: center;" align="LEFT"><strong>Dead</strong></td>
<td style="text-align: center;" align="LEFT"><strong>% Alive</strong></td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Protein Structure</strong></td>
<td align="RIGHT">18</td>
<td align="RIGHT">8</td>
<td align="RIGHT">10</td>
<td align="RIGHT">44.4%</td>
<td align="RIGHT">13</td>
<td align="RIGHT">5</td>
<td align="RIGHT">72.2%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Genomic</strong></td>
<td align="RIGHT">55</td>
<td align="RIGHT">20</td>
<td align="RIGHT">35</td>
<td align="RIGHT">36.4%</td>
<td align="RIGHT">35</td>
<td align="RIGHT">20</td>
<td align="RIGHT">63.6%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>DNA</strong></td>
<td align="RIGHT">83</td>
<td align="RIGHT">43</td>
<td align="RIGHT">40</td>
<td align="RIGHT">51.8%</td>
<td align="RIGHT">62</td>
<td align="RIGHT">21</td>
<td align="RIGHT">74.7%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Literature</strong></td>
<td align="RIGHT">38</td>
<td align="RIGHT">18</td>
<td align="RIGHT">20</td>
<td align="RIGHT">47.4%</td>
<td align="RIGHT">32</td>
<td align="RIGHT">6</td>
<td align="RIGHT">84.2%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Mapping</strong></td>
<td align="RIGHT">28</td>
<td align="RIGHT">8</td>
<td align="RIGHT">20</td>
<td align="RIGHT">28.6%</td>
<td align="RIGHT">15</td>
<td align="RIGHT">13</td>
<td align="RIGHT">53.6%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>RNA</strong></td>
<td align="RIGHT">29</td>
<td align="RIGHT">8</td>
<td align="RIGHT">21</td>
<td align="RIGHT">27.6%</td>
<td align="RIGHT">18</td>
<td align="RIGHT">11</td>
<td align="RIGHT">62.1%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Protein</strong></td>
<td align="RIGHT">89</td>
<td align="RIGHT">31</td>
<td align="RIGHT">58</td>
<td align="RIGHT">34.8%</td>
<td align="RIGHT">59</td>
<td align="RIGHT">30</td>
<td align="RIGHT">66.3%</td>
</tr>
<tr>
<td align="LEFT" height="16"><strong>Miscellaneous</strong></td>
<td align="RIGHT">141</td>
<td align="RIGHT">50</td>
<td align="RIGHT">91</td>
<td align="RIGHT">35.5%</td>
<td align="RIGHT">78</td>
<td align="RIGHT">63</td>
<td align="RIGHT">55.3%</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>&#8230;with the results for databases listed in NAR shown in the following table:</p>
<table border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="text-align: center;"></td>
<td style="text-align: center;"><strong> </strong></td>
<td style="text-align: center;" colspan="3"><strong>Full URL</strong></td>
<td style="text-align: center;" colspan="3"><strong>Host URL</strong></td>
</tr>
<tr>
<td style="text-align: left;"><strong>Category</strong></td>
<td style="text-align: center;"><strong>  Total<br />
</strong></td>
<td style="text-align: center;"><strong>  Reachable<br />
</strong></td>
<td style="text-align: center;"><strong>  Dead<br />
</strong></td>
<td style="text-align: center;"><strong> % Alive<br />
</strong></td>
<td style="text-align: center;"><strong>  Reachable<br />
</strong></td>
<td style="text-align: center;"><strong>  Dead<br />
</strong></td>
<td style="text-align: center;"><strong> % Alive</strong></td>
</tr>
<tr>
<td><strong>Protein sequences</strong></td>
<td>7</td>
<td>4</td>
<td>3</td>
<td>57.1%</td>
<td>6</td>
<td>1</td>
<td>85.7%</td>
</tr>
<tr>
<td><strong>DNA and cDNA sequences<br />
</strong></td>
<td>9</td>
<td>4</td>
<td>5</td>
<td>44.4%</td>
<td>7</td>
<td>2</td>
<td>77.8%</td>
</tr>
<tr>
<td><strong>DNA sequence motifs</strong></td>
<td>7</td>
<td>5</td>
<td>2</td>
<td>71.4%</td>
<td>6</td>
<td>1</td>
<td>85.7%</td>
</tr>
<tr>
<td><strong>Gene Expression</strong></td>
<td>8</td>
<td>1</td>
<td>7</td>
<td>12.5%</td>
<td>3</td>
<td>5</td>
<td>37.5%</td>
</tr>
<tr>
<td><strong>Genome Overview</strong></td>
<td>15</td>
<td>9</td>
<td>6</td>
<td>60%</td>
<td>12</td>
<td>3</td>
<td>80%</td>
</tr>
<tr>
<td><strong>Maps</strong></td>
<td>10</td>
<td>3</td>
<td>7</td>
<td>30%</td>
<td>7</td>
<td>3</td>
<td>70%</td>
</tr>
<tr>
<td><strong>RNA sequences</strong></td>
<td>16</td>
<td>2</td>
<td>14</td>
<td>12.5%</td>
<td>8</td>
<td>8</td>
<td>50%</td>
</tr>
<tr>
<td><strong>Protein sequence motifs</strong></td>
<td>14</td>
<td>6</td>
<td>8</td>
<td>42.9%</td>
<td>9</td>
<td>5</td>
<td>64.3%</td>
</tr>
<tr>
<td><strong>Protein curation</strong></td>
<td>26</td>
<td>8</td>
<td>18</td>
<td>30.8%</td>
<td>17</td>
<td>9</td>
<td>65.4%</td>
</tr>
<tr>
<td><strong>Proteomics overview</strong></td>
<td>3</td>
<td>2</td>
<td>1</td>
<td>66.7%</td>
<td>3</td>
<td>0</td>
<td>100%</td>
</tr>
<tr>
<td><strong>Structure</strong></td>
<td>20</td>
<td>8</td>
<td>12</td>
<td>40%</td>
<td>11</td>
<td>9</td>
<td>55%</td>
</tr>
<tr>
<td><strong>Mutations</strong></td>
<td>38</td>
<td>13</td>
<td>25</td>
<td>34.2%</td>
<td>26</td>
<td>12</td>
<td>68.4%</td>
</tr>
<tr>
<td><strong>Pathways and regulation</strong></td>
<td>9</td>
<td>2</td>
<td>7</td>
<td>22.2%</td>
<td>3</td>
<td>6</td>
<td>33.3%</td>
</tr>
<tr>
<td><strong>Transgenics</strong></td>
<td>2</td>
<td>0</td>
<td>2</td>
<td>0%</td>
<td>1</td>
<td>1</td>
<td>50%</td>
</tr>
<tr>
<td><strong>Anatomy</strong></td>
<td>4</td>
<td>2</td>
<td>2</td>
<td>50%</td>
<td>3</td>
<td>1</td>
<td>75%</td>
</tr>
<tr>
<td><strong>Other</strong></td>
<td>14</td>
<td>4</td>
<td>10</td>
<td>28.6%</td>
<td>5</td>
<td>9</td>
<td>35.7%</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>These results show a significant number of databases are now unreachable. In total, only 36% of those listed in NAR (63% for host URLs) and 39% of those listed in DBcat (65% for host URLs) remain reachable. The data for DBcat is also shown graphically below:</p>
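<p>The overall percentages can be recomputed from the two tables above. The sketch below simply sums the rows; the variable and function names are illustrative. Note that the DBcat rows add up to 481 rather than 511, presumably because entries without a URL could not be tested.</p>

```python
# Rows copied from the tables above, in table order:
# (total, reachable full URLs, reachable host URLs)
DBCAT_ROWS = [
    (18, 8, 13), (55, 20, 35), (83, 43, 62), (38, 18, 32),
    (28, 8, 15), (29, 8, 18), (89, 31, 59), (141, 50, 78),
]
NAR_ROWS = [
    (7, 4, 6), (9, 4, 7), (7, 5, 6), (8, 1, 3), (15, 9, 12), (10, 3, 7),
    (16, 2, 8), (14, 6, 9), (26, 8, 17), (3, 2, 3), (20, 8, 11), (38, 13, 26),
    (9, 2, 3), (2, 0, 1), (4, 2, 3), (14, 4, 5),
]


def overall(rows):
    """Return (total, % reachable by full URL, % reachable by host URL)."""
    total = sum(row[0] for row in rows)
    full = sum(row[1] for row in rows)
    host = sum(row[2] for row in rows)
    return total, round(100 * full / total, 1), round(100 * host / total, 1)


print(overall(NAR_ROWS))    # prints (202, 36.1, 62.9)
print(overall(DBCAT_ROWS))  # prints (481, 38.7, 64.9)
```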
<p><a href="images/dbcat_deaths.png"><img alt="Bar chart of full URLs that are either active or unreachable" src="images/dbcat_deaths.png" width="90%" height="90%" /></a><br />
<a href="images/dbcat_deaths_host.png"><img alt="Bar chart of host URLs that are either active or unreachable" src="images/dbcat_deaths_host.png" width="90%" height="90%" /></a></p>
<p>As previously stated, there are a number of ways this analysis could be improved, but my purpose here was just to gain an indication of the rate of database death, rather than to carry out a full-blown analysis. Please bear in mind that the analysis and the data generation were automated. Although I checked a subset of the results, I didn&#8217;t check each URL manually.</p>
<p>These results aren&#8217;t too unexpected &#8212; if anything, it is somewhat surprising that such a significant number of databases actually remain alive after (at least) 13 years. Extending this analysis to confirm this data and to see how many of those databases that remain reachable are still regularly updated would be interesting. It would also be interesting to determine why each database &#8220;died&#8221; and in which year it died. </p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-872-0'></a>
C. Burks, "Molecular Biology Database List", <i>Nucleic Acids Research</i>, vol. 27, pp. 1-9, 1999. <a href="http://dx.doi.org/10.1093/nar/27.1.1">http://dx.doi.org/10.1093/nar/27.1.1</a>


</li>
<li><a name='ITEM-872-1'></a>
C. Discala, X. Benigni, E. Barillot, and G. Vaysseix, "DBcat: a catalog of 500 biological databases.", <i>Nucleic acids research</i>, 2000. <a href="http://www.ncbi.nlm.nih.gov/pubmed/10592168">http://www.ncbi.nlm.nih.gov/pubmed/10592168</a>


</li>
</ol>

</div> <!-- kcite-section 872 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=872</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The growth of biological databases</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830#comments</comments>
		<pubDate>Thu, 28 Mar 2013 12:44:07 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830</guid>
		<description><![CDATA[As is commonly stated, biological data is constantly increasing. However, although it is well known that the number of databases in existence is growing, it is less common to see discussions regarding this growth. The growth of biological data is frequently illustrated by showing a graph of the data being added to a database such [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="830">
<p>As is commonly stated, biological data is constantly increasing. However, although it is well known that the number of databases in existence is growing, it is less common to see discussions regarding this growth.</p>
<p>The growth of biological data is frequently illustrated by showing a graph of the data being added to a database such as GenBank or UniProtKB over time. I was surprised that I wasn&#8217;t able to find many similar, up-to-date graphs of the number of biological databases over time. One such <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245051/figure/gkr1099-F1/">graph</a> (from <span id="cite_ITEM-830-0" name="citation"><a href="#ITEM-830-0">[1]</a></span>) shows the number of publications in PubMed with the keyword &#8220;database&#8221; in the title over time. However, this isn&#8217;t entirely accurate: a paper with &#8220;database&#8221; in the title doesn&#8217;t necessarily describe a new database. Similarly, the growth and importance of databases can be highlighted by the introduction of <a href="http://database.oxfordjournals.org/">&#8220;Database: The Journal of Biological Databases and Curation&#8221;</a> in 2009, but this doesn&#8217;t give us a quantitative measure of the growth.</p>
<p>I did find suitable graphs on two pages (see <a href="http://finchtalk.geospiza.com/2011/01/databases-of-databases.html">1</a> and <a href="http://duncan.hull.name/2007/01/05/nar-database-issue-2007-not-waving-but-drowning/">2</a>); however, these are a number of years out of date (2 and 6, respectively) and sadly don&#8217;t appear to provide the raw data. Therefore, I&#8217;ve gone about reproducing and updating these graphs, making the raw data available.</p>
<p>The pages showing this growth are based on databases listed in the Nucleic Acids Research (NAR) online Molecular Biology Database Collection, which currently lists <a href="http://www.oxfordjournals.org/nar/database/a/">1512 online databases</a>. The most recent issue (2013) <span id="cite_ITEM-830-1" name="citation"><a href="#ITEM-830-1">[2]</a></span> marks the 20th annual database issue of NAR; the first issue, in 1993, consisted of 24 articles (there were two issues prior to this, in 1992 and 1991, which were not formally labelled as database issues and contained <a href="http://nar.oxfordjournals.org/content/19/supplement.toc">18</a> and <a href="http://nar.oxfordjournals.org/content/20/supplement.toc">19</a> articles, respectively).</p>
<p>Although there are over 20 years of database issues, it would be too time consuming to calculate the growth of databases prior to 1999. Since 1999, each issue has had a corresponding paper summarising the databases and the number of databases listed, meaning the number of biological databases listed can be easily extracted. Prior to this it would be more complicated to calculate the size of the list, as the number of articles within an issue doesn&#8217;t necessarily reflect the number of databases listed. For example, the 2013 issue includes 176 articles (corresponding to 88 new databases and 88 updates), yet the collection lists 1512 online databases. Whilst it would be possible to calculate this information for the years prior to 1999, it would add little benefit for my needs: 15 years&#8217; worth of data is sufficient to illustrate that the number of databases is continually growing over time.</p>
<p>I&#8217;ve extracted the information from each summary paper and shown it below, supported with the corresponding paper reference and a graph to illustrate this growth:</p>
<table>
<tbody>
<tr>
<td><strong>Year</strong></td>
<td style="text-align: center;"><strong>Number of Databases</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"><strong>Reference</strong></td>
</tr>
<tr>
<td>1999</td>
<td style="text-align: center;">202</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-2" name="citation"><a href="#ITEM-830-2">[3]</a></span></td>
</tr>
<tr>
<td>2000</td>
<td style="text-align: center;">226</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-3" name="citation"><a href="#ITEM-830-3">[4]</a></span></td>
</tr>
<tr>
<td>2001</td>
<td style="text-align: center;">281</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-4" name="citation"><a href="#ITEM-830-4">[5]</a></span></td>
</tr>
<tr>
<td>2002</td>
<td style="text-align: center;">335</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-5" name="citation"><a href="#ITEM-830-5">[6]</a></span></td>
</tr>
<tr>
<td>2003</td>
<td style="text-align: center;">386</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-6" name="citation"><a href="#ITEM-830-6">[7]</a></span></td>
</tr>
<tr>
<td>2004</td>
<td style="text-align: center;">548</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-7" name="citation"><a href="#ITEM-830-7">[8]</a></span></td>
</tr>
<tr>
<td>2005</td>
<td style="text-align: center;">719</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-8" name="citation"><a href="#ITEM-830-8">[9]</a></span></td>
</tr>
<tr>
<td>2006</td>
<td style="text-align: center;">858</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-9" name="citation"><a href="#ITEM-830-9">[10]</a></span></td>
</tr>
<tr>
<td>2007</td>
<td style="text-align: center;">968</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-10" name="citation"><a href="#ITEM-830-10">[11]</a></span></td>
</tr>
<tr>
<td>2008</td>
<td style="text-align: center;">1078</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-11" name="citation"><a href="#ITEM-830-11">[12]</a></span></td>
</tr>
<tr>
<td>2009</td>
<td style="text-align: center;">1170</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-12" name="citation"><a href="#ITEM-830-12">[13]</a></span></td>
</tr>
<tr>
<td>2010</td>
<td style="text-align: center;">1230</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-13" name="citation"><a href="#ITEM-830-13">[14]</a></span></td>
</tr>
<tr>
<td>2011</td>
<td style="text-align: center;">1330</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-14" name="citation"><a href="#ITEM-830-14">[15]</a></span></td>
</tr>
<tr>
<td>2012</td>
<td style="text-align: center;">1380</td>
<td style="text-align: center;"></td>
<td><span id="cite_ITEM-830-15" name="citation"><a href="#ITEM-830-15">[16]</a></span></td>
</tr>
<tr>
<td>2013</td>
<td style="text-align: center;">1512</td>
<td style="text-align: center;"></td>
<td style="text-align: left;"><span id="cite_ITEM-830-1" name="citation"><a href="#ITEM-830-1">[2]</a></span></td>
</tr>
</tbody>
</table>
<p><a href="images/database_growth.png"><img src="images/database_growth.png" alt="The growth of biological databases over time."></a></p>
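<p>For anyone who wants to work with the raw numbers, the table above translates directly into a small data structure. The sketch below (names are illustrative) computes the year-on-year increase in the number of listed databases.</p>

```python
# Number of databases in the NAR collection per year, copied from the table above.
COUNTS = {
    1999: 202, 2000: 226, 2001: 281, 2002: 335, 2003: 386,
    2004: 548, 2005: 719, 2006: 858, 2007: 968, 2008: 1078,
    2009: 1170, 2010: 1230, 2011: 1330, 2012: 1380, 2013: 1512,
}


def yearly_increases(counts):
    """Return {year: increase over the previous listed year}."""
    years = sorted(counts)
    return {
        later: counts[later] - counts[earlier]
        for earlier, later in zip(years, years[1:])
    }


increases = yearly_increases(COUNTS)
print(min(increases.values()))            # prints 24
print(max(increases, key=increases.get))  # prints 2005
```

<p>Every yearly increase is positive, with the largest single jump (171 databases) between the 2004 and 2005 lists.</p>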
<p>This growth is mostly due to the creation of databases that deal with a specific specialisation. For example, the <a href="http://www.nextprot.org/">neXtProt database</a> was created to provide a resource dealing exclusively with human proteins <span id="cite_ITEM-830-16" name="citation"><a href="#ITEM-830-16">[17]</a></span>. Given this, the list of databases on NAR divides databases into 15 main groups, which are further sub-divided into 40 sub-groups.</p>
<p>The NAR list provides an ideal platform to analyse growth over time as we can extract the number of databases listed for each year. However, the NAR list doesn&#8217;t capture all available databases. The databases listed are only those that are &#8220;<a href="http://www.oxfordjournals.org/our_journals/nar/for_authors/msprep_database.html">of high value to the biological community</a>&#8220;, whilst inclusion of new databases is by <a href="http://www.oxfordjournals.org/our_journals/nar/for_authors/msprep_database.html">invitation only</a>. Therefore, it is likely many more databases exist.</p>
<p>A number of &#8220;database of databases&#8221; sites exist, such as <a href="http://en.wikipedia.org/wiki/List_of_biological_databases">a list in Wikipedia</a>, MetaBase <span id="cite_ITEM-830-0" name="citation"><a href="#ITEM-830-0">[1]</a></span> and the Bioinformatics links directory <span id="cite_ITEM-830-17" name="citation"><a href="#ITEM-830-17">[18]</a></span>, although these are largely based on the NAR list. However, one such site, DBcat <span id="cite_ITEM-830-18" name="citation"><a href="#ITEM-830-18">[19]</a></span>, listed 513 databases in 2000 compared to the 226 listed by NAR. DBcat went without updates from 2000 until 2006, when it was shut down (<a href="http://web.archive.org/web/20060709044154/http://www.infobiogen.fr/services/dbcat/">archived version</a>).</p>
<p>Sadly a number of resources, such as DBcat, will exist without updates or will be shut down due to factors such as funding running out. An analysis of databases that have become extinct due to such factors is arguably more interesting than one about the growth of databases, and I hope to do such an analysis in a follow-up post. I suspect a significant number of databases will sadly have gone this way. For example, it is hard to imagine that a database such as Swiss-Prot could disappear, yet in 1996 it almost did due to a significant <a href="http://web.expasy.org/docs/crisis96/help-sprot.html">funding crisis</a>. Financial support remains a common challenge amongst databases <span id="cite_ITEM-830-19" name="citation"><a href="#ITEM-830-19">[20]</a></span>, with resources such as Protégé, BioMagResBank (BMRB) and REBASE recently having funding difficulties <span id="cite_ITEM-830-20" name="citation"><a href="#ITEM-830-20">[21]</a></span>. This really does highlight the importance of long term preservation of the data held within databases; something that isn&#8217;t necessarily guaranteed given the way that the majority of data is currently stored <span id="cite_ITEM-830-19" name="citation"><a href="#ITEM-830-19">[20]</a></span>.</p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-830-0'></a>
D.M. Bolser, P. Chibon, N. Palopoli, S. Gong, D. Jacob, V.D.D. Angel, D. Swan, S. Bassi, V. Gonzalez, P. Suravajhala, S. Hwang, P. Romano, R. Edwards, B. Bishop, J. Eargle, T. Shtatland, N.J. Provart, D. Clements, D.P. Renfro, D. Bhak, and J. Bhak, "MetaBase--the wiki-database of biological databases", <i>Nucleic Acids Research</i>, vol. 40, pp. D1250-D1254, 2011. <a href="http://dx.doi.org/10.1093/nar/gkr1099">http://dx.doi.org/10.1093/nar/gkr1099</a>


</li>
<li><a name='ITEM-830-1'></a>
X.M. Fernandez-Suarez, and M.Y. Galperin, "The 2013 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection", <i>Nucleic Acids Research</i>, vol. 41, pp. D1-D7, 2012. <a href="http://dx.doi.org/10.1093/nar/gks1297">http://dx.doi.org/10.1093/nar/gks1297</a>


</li>
<li><a name='ITEM-830-2'></a>
C. Burks, "Molecular Biology Database List", <i>Nucleic Acids Research</i>, vol. 27, pp. 1-9, 1999. <a href="http://dx.doi.org/10.1093/nar/27.1.1">http://dx.doi.org/10.1093/nar/27.1.1</a>


</li>
<li><a name='ITEM-830-3'></a>
A.D. Baxevanis, "The Molecular Biology Database Collection: an online compilation of relevant database resources", <i>Nucleic Acids Research</i>, vol. 28, pp. 1-7, 2000. <a href="http://dx.doi.org/10.1093/nar/28.1.1">http://dx.doi.org/10.1093/nar/28.1.1</a>


</li>
<li><a name='ITEM-830-4'></a>
A.D. Baxevanis, "The Molecular Biology Database Collection: an updated compilation of biological database resources", <i>Nucleic Acids Research</i>, vol. 29, pp. 1-10, 2001. <a href="http://dx.doi.org/10.1093/nar/29.1.1">http://dx.doi.org/10.1093/nar/29.1.1</a>


</li>
<li><a name='ITEM-830-5'></a>
A.D. Baxevanis, "The Molecular Biology Database Collection: 2002 update", <i>Nucleic Acids Research</i>, vol. 30, pp. 1-12, 2002. <a href="http://dx.doi.org/10.1093/nar/30.1.1">http://dx.doi.org/10.1093/nar/30.1.1</a>


</li>
<li><a name='ITEM-830-6'></a>
A.D. Baxevanis, "The Molecular Biology Database Collection: 2003 update", <i>Nucleic Acids Research</i>, vol. 31, pp. 1-12, 2003. <a href="http://dx.doi.org/10.1093/nar/gkg120">http://dx.doi.org/10.1093/nar/gkg120</a>


</li>
<li><a name='ITEM-830-7'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2004 update", <i>Nucleic Acids Research</i>, vol. 32, pp. D3-D22, 2004. <a href="http://dx.doi.org/10.1093/nar/gkh143">http://dx.doi.org/10.1093/nar/gkh143</a>


</li>
<li><a name='ITEM-830-8'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2005 update", <i>Nucleic Acids Research</i>, vol. 33, pp. D5-D24, 2004. <a href="http://dx.doi.org/10.1093/nar/gki139">http://dx.doi.org/10.1093/nar/gki139</a>


</li>
<li><a name='ITEM-830-9'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2006 update", <i>Nucleic Acids Research</i>, vol. 34, pp. D3-D5, 2006. <a href="http://dx.doi.org/10.1093/nar/gkj162">http://dx.doi.org/10.1093/nar/gkj162</a>


</li>
<li><a name='ITEM-830-10'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2007 update", <i>Nucleic Acids Research</i>, vol. 35, pp. D3-D4, 2007. <a href="http://dx.doi.org/10.1093/nar/gkl1008">http://dx.doi.org/10.1093/nar/gkl1008</a>


</li>
<li><a name='ITEM-830-11'></a>
M.Y. Galperin, "The Molecular Biology Database Collection: 2008 update", <i>Nucleic Acids Research</i>, vol. 36, pp. D2-D4, 2007. <a href="http://dx.doi.org/10.1093/nar/gkm1037">http://dx.doi.org/10.1093/nar/gkm1037</a>


</li>
<li><a name='ITEM-830-12'></a>
M.Y. Galperin, and G.R. Cochrane, "Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009", <i>Nucleic Acids Research</i>, vol. 37, pp. D1-D4, 2009. <a href="http://dx.doi.org/10.1093/nar/gkn942">http://dx.doi.org/10.1093/nar/gkn942</a>


</li>
<li><a name='ITEM-830-13'></a>
G.R. Cochrane, and M.Y. Galperin, "The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources", <i>Nucleic Acids Research</i>, vol. 38, pp. D1-D4, 2009. <a href="http://dx.doi.org/10.1093/nar/gkp1077">http://dx.doi.org/10.1093/nar/gkp1077</a>


</li>
<li><a name='ITEM-830-14'></a>
M.Y. Galperin, and G.R. Cochrane, "The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection", <i>Nucleic Acids Research</i>, vol. 39, pp. D1-D6, 2010. <a href="http://dx.doi.org/10.1093/nar/gkq1243">http://dx.doi.org/10.1093/nar/gkq1243</a>


</li>
<li><a name='ITEM-830-15'></a>
M.Y. Galperin, and X.M. Fernandez-Suarez, "The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection", <i>Nucleic Acids Research</i>, vol. 40, pp. D1-D8, 2011. <a href="http://dx.doi.org/10.1093/nar/gkr1196">http://dx.doi.org/10.1093/nar/gkr1196</a>


</li>
<li><a name='ITEM-830-16'></a>
L. Lane, G. Argoud-Puy, A. Britan, I. Cusin, P.D. Duek, O. Evalet, A. Gateau, P. Gaudet, A. Gleizes, A. Masselot, C. Zwahlen, and A. Bairoch, "neXtProt: a knowledge platform for human proteins", <i>Nucleic Acids Research</i>, vol. 40, pp. D76-D83, 2011. <a href="http://dx.doi.org/10.1093/nar/gkr1179">http://dx.doi.org/10.1093/nar/gkr1179</a>


</li>
<li><a name='ITEM-830-17'></a>
M.D. Brazas, D.S. Yim, J.T. Yamada, and B.F.F. Ouellette, "The 2011 bioinformatics links directory update: more resources, tools and databases and features to empower the bioinformatics community", <i>Nucleic Acids Research</i>, vol. 39, pp. W3-W7, 2011. <a href="http://dx.doi.org/10.1093/nar/gkr514">http://dx.doi.org/10.1093/nar/gkr514</a>


</li>
<li><a name='ITEM-830-18'></a>
C. Discala, X. Benigni, E. Barillot, and G. Vaysseix, "DBcat: a catalog of 500 biological databases", <i>Nucleic Acids Research</i>, 2000. <a href="http://www.ncbi.nlm.nih.gov/pubmed/10592168">http://www.ncbi.nlm.nih.gov/pubmed/10592168</a>


</li>
<li><a name='ITEM-830-19'></a>
C. Chandras, T. Weaver, M. Zouberakis, D. Smedley, K. Schughart, N. Rosenthal, J.M. Hancock, G. Kollias, P.N. Schofield, and V. Aidinis, "Models for financial sustainability of biological databases and resources", <i>Database</i>, vol. 2009, pp. bap017-bap017, 2009. <a href="http://dx.doi.org/10.1093/database/bap017">http://dx.doi.org/10.1093/database/bap017</a>


</li>
<li><a name='ITEM-830-20'></a>
M. Baker, "Databases fight funding cuts", <i>Nature</i>, vol. 489, pp. 19-19, 2012. <a href="http://dx.doi.org/10.1038/489019a">http://dx.doi.org/10.1038/489019a</a>


</li>
</ol>

</div> <!-- kcite-section 830 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=830</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Citation reuse in UniProtKB</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=794</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=794#comments</comments>
		<pubDate>Tue, 19 Mar 2013 10:50:06 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=794</guid>
		<description><![CDATA[The majority of the work discussed on this blog has looked at annotations within UniProtKB. In a nutshell, we have shown high levels of annotation reuse, which is essentially done as a matter of protocol. I&#8217;m now in the process of writing up, and have just been looking at citation information in UniProtKB. Although not [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="794">
<p>The majority of the work discussed on this blog has looked at annotations within UniProtKB. In a nutshell, we have shown high levels of annotation reuse, which is essentially done as a matter of protocol. I&#8217;m now in the process of writing up, and have just been looking at citation information in UniProtKB. Although not too surprising, it appears that the reuse of citations in UniProtKB is also common.</p>
<p>I started to look at this data as I was curious how many articles have been manually curated by annotators over the lifespan of Swiss-Prot (the last 27 years or so). Reading an article about manual curation in FlyBase, I was interested to learn that it can take between two and four months for an article to be fully curated (curation in FlyBase is done on a per-article, rather than per-protein/entry, basis) <span id="cite_ITEM-794-0" name="citation"><a href="#ITEM-794-0">[1]</a></span>. Given this, I wanted to get a grasp of how much of a bottleneck literature curation in UniProtKB is.</p>
<p>The UniProtKB curation process is well documented <span id="cite_ITEM-794-1" name="citation"><a href="#ITEM-794-1">[2]</a></span>; for each paper identified for curation, the full text is read and the relevant information extracted.</p>
<p>The most <a href="http://web.expasy.org/docs/relnotes/relstat.html">current statistics</a> for Swiss-Prot (Version 2013_03 at time of writing) state that there are 1,037,168 reference lines (RL), each of which relates to one reference (citation) in one entry (the RL lines don&#8217;t wrap over two or more lines). For example, in <a href="http://www.uniprot.org/uniprot/P63015.txt">P63015</a> there are three RL lines, meaning that three journal articles are referenced. The statistics also state that an average entry has 1.92 references, with 829,697 of the references being journal references, coming from 2,306 journals (the remaining references are made up from sources such as books, theses and Worm Breeder&#8217;s Gazette). The latest <a href="http://www.uniprot.org/statistics/TrEMBL">statistics</a> for TrEMBL show 23,815,989 citations over 21,182,512 entries.</p>
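<p>As a minimal sketch of how reference lines could be counted for a single entry, the snippet below scans flat-file text for lines that start with the RL line code. The sample entry is invented for illustration (it is not the real content of P63015), the list of non-journal prefixes is an assumption, and the count relies on the statement above that RL lines do not wrap.</p>
<pre>
```python
# Hypothetical excerpt in UniProtKB flat-file style; the citations are
# invented for illustration and are not the real contents of P63015.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
RN   [1]
RL   J. Example Biol. 1:1-10(1990).
RN   [2]
RL   Example Lett. 2:20-25(1995).
RN   [3]
RL   Thesis (1999), University of Example, Nowhere.
//
"""


def count_reference_lines(flat_text):
    """Return (all RL lines, RL lines that look like journal references)."""
    rl_lines = [line for line in flat_text.splitlines() if line.startswith("RL   ")]
    # Assumption: references that are not journal articles start with one of
    # these markers; anything else is treated as a journal reference.
    non_journal = ("Thesis", "(In)", "Submitted", "Patent", "Unpublished")
    journal = [line for line in rl_lines if not line[5:].startswith(non_journal)]
    return len(rl_lines), len(journal)


total, journal = count_reference_lines(SAMPLE_ENTRY)
print(total, journal)
```
</pre>
<p>Summing the first number over every entry in a release would give the kind of total quoted in the release statistics, although the official figures may well be computed differently.</p>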
<p>Clearly references are reused; it is unlikely that 829,697 journal articles have been read by UniProt&#8217;s curators (43 biocurators are currently named on the <a href="http://www.uniprot.org/help/uniprot%20staff">staff list</a>), whilst TrEMBL has more citations than exist in PubMed. Sadly, these statistics don&#8217;t state how many distinct citations there are. However, in a 2010 paper UniProt stated that &#8220;currently, there are ∼228 000 distinct PubMed citations associated with ∼4.2 million UniProtKB sequences and 67% of these citations are in UniProtKB/Swiss-Prot&#8221; and that &#8220;11 external sources contribute ∼350 000 unique PubMed citations not yet annotated in UniProtKB, covering ∼188 000 UniProtKB entries&#8221; <span id="cite_ITEM-794-2" name="citation"><a href="#ITEM-794-2">[3]</a></span>. The statistics for the major release of <a href="http://www.uniprot.org/statistics/UniProtKB%2015">Swiss-Prot</a> nearest to these figures show a total of 781,540 references, 628,701 of which are journal references.</p>
<p>This suggests that, in 2010, each journal reference was reused approximately 5 times on average (based on citations in Swiss-Prot). Given the growth in references over 3 years, it is likely this reuse has increased. Presumably the number of relevant papers identified in a typical month is also increasing, which may impact this reuse. The biomedical literature is growing at a double-exponential rate <span id="cite_ITEM-794-3" name="citation"><a href="#ITEM-794-3">[4]</a></span>. Currently, PubMed indexes over 22 million citations, up from 20 million in mid-2010 (see two interesting posts [<a href="http://duncan.hull.name/2010/07/15/fifty-million/">1</a>, <a href="http://duncan.hull.name/2010/07/27/pubmed-20-million/">2</a>] by Duncan Hull on pros/cons of this, and how many journal articles exist).</p>
<p>This citation reuse may be due to factors such as a paper containing multiple sequences. Either way, further analysis would be interesting. For example, it would be interesting to analyse this citation reuse in relation to sentence reuse; I assume there would be some correlation, which we could potentially use to gain further confidence in identifying sentence propagation.</p>
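<p>The reuse estimate above can be reproduced with some back-of-the-envelope arithmetic. The sketch below assumes that 67% of the ∼228 000 distinct PubMed citations belong to Swiss-Prot and divides the reference counts from the nearest release by that number. Comparing distinct PubMed citations with reference lines in this way is itself an assumption, and all of the quoted figures are approximate.</p>
<pre>
```python
# Figures quoted in the post; how they are combined here is an assumption.
DISTINCT_PUBMED_CITATIONS = 228_000  # distinct PubMed citations in UniProtKB
SWISS_PROT_SHARE = 0.67              # share of those citations in Swiss-Prot
TOTAL_REFERENCES = 781_540           # all references, nearest major release
JOURNAL_REFERENCES = 628_701         # journal references only, same release

distinct_in_swiss_prot = DISTINCT_PUBMED_CITATIONS * SWISS_PROT_SHARE

reuse_all = TOTAL_REFERENCES / distinct_in_swiss_prot
reuse_journal = JOURNAL_REFERENCES / distinct_in_swiss_prot

print(f"distinct citations in Swiss-Prot: about {distinct_in_swiss_prot:,.0f}")
print(f"uses per citation (all references): {reuse_all:.1f}")
print(f"uses per citation (journal references only): {reuse_journal:.1f}")
```
</pre>
<p>Under these assumptions the ratio comes out at roughly 5 when all references are counted and roughly 4 when only journal references are counted, so the estimate depends on which reference count is used.</p>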
<h2>References</h2>
    <ol>
    <li><a name='ITEM-794-0'></a>
P. McQuilton, "Opportunities for text mining in the FlyBase genetic literature curation workflow", <i>Database</i>, vol. 2012, pp. bas039-bas039, 2012. <a href="http://dx.doi.org/10.1093/database/bas039">http://dx.doi.org/10.1093/database/bas039</a>


</li>
<li><a name='ITEM-794-1'></a>
M. Magrane, and U. Consortium, "UniProt Knowledgebase: a hub of integrated protein data", <i>Database</i>, vol. 2011, pp. bar009-bar009, 2011. <a href="http://dx.doi.org/10.1093/database/bar009">http://dx.doi.org/10.1093/database/bar009</a>


</li>
<li><a name='ITEM-794-2'></a>
"The Universal Protein Resource (UniProt) in 2010", <i>Nucleic Acids Research</i>, vol. 38, pp. D142-D148, 2009. <a href="http://dx.doi.org/10.1093/nar/gkp846">http://dx.doi.org/10.1093/nar/gkp846</a>


</li>
<li><a name='ITEM-794-3'></a>
L. Hunter, and K.B. Cohen, "Biomedical Language Processing: What's Beyond PubMed?", <i>Molecular Cell</i>, vol. 21, pp. 589-594, 2006. <a href="http://dx.doi.org/10.1016/j.molcel.2006.02.012">http://dx.doi.org/10.1016/j.molcel.2006.02.012</a>


</li>
</ol>

</div> <!-- kcite-section 794 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=794</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>My experience of presenting at ECCB&#8217;12 #eccb12</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=770</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=770#comments</comments>
		<pubDate>Tue, 11 Sep 2012 13:32:31 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=770</guid>
		<description><![CDATA[Earlier today I presented our work on annotation quality at ECCB&#8217;12. This was my first time presenting outside of an internal lab seminar and it was a great experience. I was in the 525-seater Montreal auditorium at the Congress Center Basel, which was a really nice place to talk. There was a good number [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="770">
<p>Earlier today I presented our work on annotation quality <span id="cite_ITEM-770-0" name="citation"><a href="#ITEM-770-0">[1]</a></span> at <a href="http://www.eccb12.org/home">ECCB&#8217;12</a>. This was my first time presenting outside of an internal lab seminar and it was a great experience. I was in the 525-seater <a href="http://www.congress.ch/en-US/Raeume/Montreal">Montreal auditorium</a> at the <a href="http://www.congress.ch/">Congress Center Basel</a>, which was a really nice place to talk. There was a good number of people in attendance; at a guess, I&#8217;d say around 200 people.</p>
<p>I think my talk went reasonably well. There were a couple of places I could have been clearer/more flowing, but overall I&#8217;m quite happy. It was also great to have some of the UniProt people in the audience asking questions (which I think I could have answered a bit better). It was also nice to catch some of them for a chat afterwards.</p>
<p>Twitter has been quite popular during the conference, and the hashtag <a href="http://twitter.com/i/#!/search/?q=eccb12">#eccb12</a> has been trending on Twitter. It was interesting to read the tweets people had sent during the talk, and the whole twitter experience during ECCB has been pretty cool!</p>
<p>It has been a great experience to be at ECCB. Some of the statistics provided in the opening talk were impressive; sadly I can&#8217;t remember them all, but there are over 1000 attendees and 341 submissions were received, of which 48 were accepted for oral presentation. I&#8217;m really pleased that our work was accepted, and want to extend my thanks to the organisers of ECCB for the opportunity to come and talk.</p>
<p>I&#8217;d also like to thank my co-authors <a href="http://www.ncl.ac.uk/maths/staff/profile/colin.gillespie">Colin Gillespie</a>, <a href="http://eridanusdotnet.wordpress.com/">Daniel Swan</a> and <a href="http://homepages.cs.ncl.ac.uk/phillip.lord/">Phillip Lord</a>, without whom this work would not have been possible.</p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-770-0'></a>
M.J. Bell, C.S. Gillespie, D. Swan, and P. Lord, "An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB", <i>Bioinformatics</i>, vol. 28, pp. i562-i568, 2012. <a href="http://dx.doi.org/10.1093/bioinformatics/bts372">http://dx.doi.org/10.1093/bioinformatics/bts372</a>


</li>
</ol>

</div> <!-- kcite-section 770 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=770</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=741</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=741#comments</comments>
		<pubDate>Sun, 09 Sep 2012 10:41:45 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=741</guid>
		<description><![CDATA[Just a quick post to say that our paper has now been published by Bioinformatics: &#8220;An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB&#8221;]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="741">
<p>Just a quick post to say that our paper has now been published by Bioinformatics:</p>
<p>&#8220;An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB&#8221;<span id="cite_ITEM-741-0" name="citation"><a href="#ITEM-741-0">[1]</a></span></p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-741-0'></a>
M.J. Bell, C.S. Gillespie, D. Swan, and P. Lord, "An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB", <i>Bioinformatics</i>, vol. 28, pp. i562-i568, 2012. <a href="http://dx.doi.org/10.1093/bioinformatics/bts372">http://dx.doi.org/10.1093/bioinformatics/bts372</a>


</li>
</ol>

</div> <!-- kcite-section 741 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=741</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Presentation for ECCB&#8217;12</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=725</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=725#comments</comments>
		<pubDate>Wed, 05 Sep 2012 12:51:13 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Annotation Quality]]></category>
		<category><![CDATA[Miscellaneous]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=725</guid>
		<description><![CDATA[As mentioned previously, we (Colin Gillespie, Daniel Swan, Phillip Lord and I) have recently had some of our work on annotation quality accepted at the European Conference on Computational Biology (ECCB&#8217;12). Given this, I am due to give an oral presentation at ECCB&#8217;12 in Switzerland on the 11th September (ECCB&#8217;12 Schedule). I have made this [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="725">
<p>As mentioned previously, we (<a href="http://www.ncl.ac.uk/maths/staff/profile/colin.gillespie">Colin Gillespie</a>, <a href="http://eridanusdotnet.wordpress.com/">Daniel Swan</a>, <a href="http://homepages.cs.ncl.ac.uk/phillip.lord/">Phillip Lord</a> and I) have recently had some of our work on annotation quality accepted at the European Conference on Computational Biology (<a href="http://www.eccb12.org/home">ECCB&#8217;12</a>). Given this, I am due to give an oral presentation at ECCB&#8217;12 in Switzerland on the 11th September (<a href="http://www.eccb12.org/schedule/tuesday">ECCB&#8217;12 Schedule</a>). I have made this <a href="http://www.slideshare.net/mj_bell/an-approach-to-describing-and-analysing-bulk-biological-annotation-quality-a-case-study-using-uniprotkb">talk available</a> for those who are interested: </p>
<p><iframe src="http://www.slideshare.net/slideshow/embed_code/14154680" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen> </iframe> </p>
<p>The paper from this work will be published in a special open-access issue of the journal &#8220;Bioinformatics&#8221; in September. However, it is already available in arXiv for those wishing to read it <span id="cite_ITEM-725-0" name="citation"><a href="#ITEM-725-0">[1]</a></span>!</p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-725-0'></a>
M.J. Bell, C.S. Gillespie, D. Swan, and P. Lord, "An approach to describing and analysing bulk biological annotation quality: a case study using UniProtKB", <i>arXiv</i>, 2012. <a href="http://arxiv.org/abs/1208.2175">http://arxiv.org/abs/1208.2175</a>


</li>
</ol>

</div> <!-- kcite-section 725 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=725</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Additional animated power-law graphs</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=690</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=690#comments</comments>
		<pubDate>Thu, 16 Aug 2012 12:44:11 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=690</guid>
		<description><![CDATA[In the previous post I introduced some animated graphs that were produced to aid the analysis of Swiss-Prot and TrEMBL annotations over time. Visual inspection of these graphs is beneficial; however, the main analytical value comes from the extracted alpha values for each graph &#8211; something which I could have included. Many thanks to Colin Gillespie [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="690">
<p>In the <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=610">previous post</a> I introduced some animated graphs that were produced to aid the analysis of Swiss-Prot and TrEMBL annotations over time. Visual inspection of these graphs is beneficial; however, the main analytical value comes from the extracted alpha values for each graph &#8211; something which I could have included. Many thanks to <a href="http://csgillespie.wordpress.com/">Colin Gillespie</a> for pointing this out. I have re-created the Swiss-Prot and TrEMBL graphs with the corresponding alpha value now included. These are shown below:</p>
<p><a href="images/swiss_alpha_1.gif"><img src="images/swiss_alpha_1.gif" alt="Animation showing Swiss-Prot over time, with included alpha value."></a><br /> Make it <a href="images/swiss_alpha_2.gif">slower</a> or even <a href="images/swiss_alpha_3.gif">slower still&#8230;</a></p>
<p><a href="images/tr_alpha_1.gif"><img src="images/tr_alpha_1.gif" alt="Animation showing TrEMBL over time, with included alpha value."></a><br /> Make it <a href="images/tr_alpha_2.gif">slower</a> or even <a href="images/tr_alpha_3.gif">slower still&#8230;</a></p>
<p><a href="images/uptr_alpha_1.gif"><img src="images/uptr_alpha_1.gif" alt="Animation showing both Swiss-Prot and TrEMBL over time, with included alpha values."></a><br /> Make it <a href="images/uptr_alpha_2.gif">slower</a> or even <a href="images/uptr_alpha_3.gif">slower still&#8230;</a></p>
<p>Our early analysis of Swiss-Prot found an interesting &#8220;kink&#8221; in the graph, which was identified by visual inspection. This turned out to be copyright statements and provides an interesting talking point, some of which has been discussed <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=403">previously</a>. It is hard to believe I forgot to include this graph&#8230; therefore the graph for Swiss-Prot versions where copyright statements are still included is shown below:</p>
<p><a href="images/swiss_copyright_1.gif"><img src="images/swiss_copyright_1.gif" alt="Animation showing Swiss-Prot over time, with copyright included."></a><br /> Make it <a href="images/swiss_copyright_2.gif">slower</a> or even <a href="images/swiss_copyright_3.gif">slower still&#8230;</a></p>
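<p>For readers wondering how an alpha value of the kind shown in these graphs can be obtained from word occurrences, here is a minimal sketch. It uses a common approximate maximum-likelihood estimator for discrete power-law data; this is a stand-in chosen for illustration and not necessarily the fitting procedure behind the graphs above. The function names and the sample text are made up, and the approximation is generally regarded as rough when <code>xmin</code> is small.</p>
<pre>
```python
import math
from collections import Counter


def word_occurrences(annotation_text):
    """Return how many times each distinct word occurs (one count per word)."""
    return list(Counter(annotation_text.lower().split()).values())


def estimate_alpha(counts, xmin=1):
    """Approximate maximum-likelihood exponent for discrete power-law data.

    Uses alpha = 1 + n / sum(ln(x / (xmin - 0.5))) over all x >= xmin.
    This is an illustrative approximation, not the method used for the graphs.
    """
    tail = [x for x in counts if x >= xmin]
    log_sum = sum(math.log(x / (xmin - 0.5)) for x in tail)
    return 1 + len(tail) / log_sum


# Made-up annotation text, purely for illustration.
text = "binds atp binds dna binds zinc catalytic activity atp binding protein"
counts = word_occurrences(text)
print(sorted(counts, reverse=True), round(estimate_alpha(counts), 2))
```
</pre>
<p>Running the same estimate on the annotations of each database version would give one alpha per version, which is the kind of series the animations display.</p>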
<!-- kcite active, but no citations found -->
</div> <!-- kcite-section 690 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=690</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Animation: visualising change over time in power-law graphs</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=610</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=610#comments</comments>
		<pubDate>Wed, 15 Aug 2012 08:51:41 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Annotation Quality]]></category>
		<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=610</guid>
		<description><![CDATA[As documented in numerous posts, a key aspect of my research has been the application of power laws to word occurrence in bulk biological annotation. A large portion of this work has recently been accepted at the European Conference on Computational Biology (ECCB) 2012 and the resulting paper will shortly appear in a special issue [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="610">
<p>As documented in numerous posts, a key aspect of my research has been the application of power laws to word occurrence in bulk biological annotation. A large portion of this work has recently been accepted at the <a href="http://www.eccb12.org/home">European Conference on Computational Biology (ECCB) 2012</a> and the resulting paper will shortly appear in a special issue of <a href="http://bioinformatics.oxfordjournals.org/">Bioinformatics</a>. The acceptance into ECCB&#8217;12 requires an oral presentation. Therefore, I am currently in the process of creating a selection of images and slides for this talk. </p>
<p>As is clear from previous posts, this work produces a lot of graphs. One of the challenges of writing the paper was trying to include figures that allow the key points to be clearly made within the page limit enforced upon us. For example, if we show the resulting graph for all versions of Swiss-Prot, TrEMBL and the overlay of both we will comfortably exceed 100 images. This clearly cannot be done suitably in publication format, nor would the resulting paper clearly show the change over time. (Well, I guess we could have produced something along the lines of a &#8220;flick book&#8221; as part of the publication &#8211; however, I&#8217;m unsure my supervisor would be happy with the subsequent <a href="http://www.russet.org.uk/blog/2170">publication and colour figure costs</a> that OUP would charge.)</p>
<p>However, unlike the paper, the presentation provides an ideal opportunity to show these graphs and the change over time. To do this I have created animated images for Swiss-Prot, TrEMBL and overlay images. The animations run reasonably quickly &#8211; 1/10th of a second between transitions. I have provided two further links for each of the animations which are slower (50ms and 1 second between transitions) below each graph. These graphs are shown below:</p>
<p><a href="images/swiss_1.gif"><img src="images/swiss_1.gif" alt="Animation showing the change over time in Swiss-Prot."></a><br />
Make it <a href="images/swiss_2.gif">slower</a> or even <a href="images/swiss_3.gif">slower still&#8230;</a></p>
<p><a href="images/tr_1.gif"><img src="images/tr_1.gif" alt="Animation showing the change over time in TrEMBL."></a><br />
Make it <a href="images/tr_2.gif">slower</a> or even <a href="images/tr_3.gif">slower still&#8230;</a></p>
<p><a href="images/uptr_1.gif"><img src="images/uptr_1.gif" alt="Animation showing the divergence of Swiss-Prot and TrEMBL over time."></a><br /> Make it <a href="images/uptr_2.gif">slower</a> or even <a href="images/uptr_3.gif">slower still&#8230;</a></p>
<p>These views clearly show, for example, the divergence between Swiss-Prot and TrEMBL over time. These animations are a powerful way to analyse the data visually, in a way that cannot be replicated by viewing static images side by side or by looking at the extracted alpha values alone. For example, viewing the change in Swiss-Prot over time we see that while the head of the power law is &#8220;structurally&#8221; similar, it actually drops significantly (from the second point, i.e. x = 2) over time. This would suggest that the richness of individual words is increasing over time, i.e. more words are occurring only a single time in later versions.</p>
<p>However, I&#8217;m not convinced that the animation of word clouds is quite as meaningful:</p>
<p><a href="images/wordle_animation.gif"><img src="images/wordle_animation.gif" alt="Animation of various word clouds."></a></p>
<p>These word clouds show the various words in Swiss-Prot annotations over time. I have <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=55">previously blogged about these</a> as well as making all the images <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/word_clouds.php">available on my website</a>.</p>
<!-- kcite active, but no citations found -->
</div> <!-- kcite-section 610 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=610</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Annotation maturity: Comparison of annotations in new and old sets of UniProtKB entries.</title>
		<link>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=554</link>
		<comments>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=554#comments</comments>
		<pubDate>Fri, 03 Feb 2012 14:57:50 +0000</pubDate>
		<dc:creator>Michael Bell</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=554</guid>
		<description><![CDATA[Carrying on from the previous post, we now wish to look at annotation maturity in sets of UniProtKB entries. We have seen that the quality of annotations appears to be decreasing over time, for both Swiss-Prot and TrEMBL. A reasonable explanation for this would be that annotations are constantly being added to the [...]]]></description>
				<content:encoded><![CDATA[<div class="kcite-section" kcite-section-id="554">
<p>Carrying on from the <a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=442">previous post</a>, we now wish to look at annotation maturity in sets of UniProtKB entries. We have seen that the quality of annotations appears to be decreasing over time, for both Swiss-Prot and TrEMBL. A reasonable explanation for this would be that annotations are constantly being added to the newly incorporated data, which in turn has added additional pressure on curators, meaning that over time the least effort has shifted from the reader to the annotator. Whilst we see a reduction in the overall annotation quality, we suspect that a mature set of entries would improve with age.</p>
<p>To approach this analysis, we compare annotations from entries that are in both Swiss-Prot Version 9 and UniProtKB/Swiss-Prot version 15. We also show the resulting alpha value for annotations from entries in UniProtKB/Swiss-Prot version 15 but not Swiss-Prot Version 9. The resulting graph for this is shown below:</p>
<p><a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/images/time_slice.png"><img class="aligncenter" src="images/time_slice.png" alt="Difference in Alpha value between SP9 and UPSP15" width=100% height=100%/></a></p>
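<p>The split of entries behind this comparison can be sketched with simple set operations. The accession identifiers below are placeholders rather than real entries, and the sketch ignores complications such as accessions being merged or renamed between releases.</p>
<pre>
```python
# Placeholder accession sets; real ones would be read from each release.
swiss_prot_v9 = {"ACC0001", "ACC0002", "ACC0003"}
uniprotkb_sp_v15 = {"ACC0001", "ACC0003", "ACC0004", "ACC0005"}


def partition_entries(old_release, new_release):
    """Split accessions into those in both releases and those only in the newer one."""
    old_release, new_release = set(old_release), set(new_release)
    shared = old_release.intersection(new_release)
    only_new = new_release.difference(old_release)
    return shared, only_new


shared, only_new = partition_entries(swiss_prot_v9, uniprotkb_sp_v15)
print(sorted(shared), sorted(only_new))
```
</pre>
<p>Annotations would then be extracted separately for the shared set (once from each release) and for the entries found only in the newer release, giving the three groups that are compared.</p>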
<p>With the assumption that maturity is linked to age, we would expect that the quality of annotations within a set of old entries would improve over time. Interestingly, this doesn&#8217;t appear to be the case. Whilst the alpha value does decrease (by roughly 0.1), the alpha value for the remaining entries is significantly lower. This would suggest that annotation quality across the whole database is generally decreasing, although the rate of decrease depends on the initial age of the entry. Given this, it is of interest to see how the quality of annotations in only new entries changes over time. For this, we extract annotations for those entries that appeared for the first time in a given version. The resulting graph for this is shown below:</p>
<p><a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/images/new_accessions.png"><img class="aligncenter" src="images/new_accessions.png" alt="Alpha values for new annotations in new entries in various Swiss-Prot databases." width=100% height=100%/></a></p>
<p>This graph also shows a steady decrease over time &#8211; a similar pattern to most of our previous analyses. These results are interesting; it would appear that annotation of new data is getting worse over time. It also appears to have a detrimental effect on other annotations. We have discussed the increase in data in relation to annotation quality, but the impact of size alone would not explain the decrease of annotation quality in older entries. One possible explanation for this is the protocol used for annotation curation.</p>
<p>The curation of annotations is a clear process, consisting of 6 key steps. This process is detailed in <span id="cite_ITEM-554-0" name="citation"><a href="#ITEM-554-0">[1]</a></span>, with an overview of the process shown in the figure below (taken from <span id="cite_ITEM-554-0" name="citation"><a href="#ITEM-554-0">[1]</a></span>).</p>
<p><a href="http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/images/uniprot_protocol.jpg"><img class="aligncenter" src="images/uniprot_protocol.jpg" alt="Outline of the UniProtKB manual curation process, taken from paper: DOI:10.1093/database/bar009" width=100% height=100%/></a></p>
<p>Part of the protocol is to, for a given sequence, identify similar entries and then standardise and propagate annotation between these entries to ensure data consistency. Presumably, over time the curation process has undergone revisions (worthy of further investigation and another blog post!) due to changes in resources, increase of data, and so on. It is possible that this curation process is refined to deal with larger amounts of data and quicker release dates (both of which are true for Swiss-Prot over time &#8211; early versions of Swiss-Prot saw around 1000 new entries being added, with the later versions seeing around 30,000 new entries, whilst the release cycle is more frequent by a couple of months). Although the increase of manually curated entries and faster release dates could be due to more curators rather than a change in annotation protocol (which will be investigated further), it is plausible that attempts to standardise annotation between similar entries are actually having a detrimental effect on overall annotation quality.</p>
<h2>References</h2>
    <ol>
    <li><a name='ITEM-554-0'></a>
M. Magrane, and U. Consortium, "UniProt Knowledgebase: a hub of integrated protein data", <i>Database</i>, vol. 2011, pp. bar009-bar009, 2011. <a href="http://dx.doi.org/10.1093/database/bar009">http://dx.doi.org/10.1093/database/bar009</a>


</li>
</ol>

</div> <!-- kcite-section 554 -->]]></content:encoded>
			<wfw:commentRss>http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?feed=rss2&#038;p=554</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
