# Michael Bell

Ph.D. Students Blog

## Annotation maturity: Comparison of annotations in new and old sets of UniProtKB entries.

3 February, 2012 (14:57) | Uncategorized | By: mj_bell

Carrying on from the previous post, we now wish to look at annotation maturity in sets of UniProtKB entries. We have seen that the quality of annotations appears to be decreasing over time, for both Swiss-Prot and TrEMBL. A reasonable explanation for this would be that annotations are constantly being added to the newly incorporated data, which in turn has added additional pressure on curators, meaning that over time the least effort has shifted from the reader to the annotator. Whilst we see a reduction in the overall annotation quality, we suspect that a mature set of entries would improve with age.

To approach this analysis, we compare annotations from entries that are in both Swiss-Prot Version 9 and UniProtKB/Swiss-Prot version 15. We also show the resulting alpha value for annotations from entries in UniProtKB/Swiss-Prot version 15 but not Swiss-Prot Version 9. The resulting graph for this is shown below:

With the assumption that maturity is linked to age, we would expect that the quality of annotations within a set of old entries would improve over time. Interestingly, this doesn’t appear to be the case. Whilst the alpha value for the old entries does decrease (by roughly 0.1), the alpha value for the remaining entries is significantly lower. This would suggest that annotation quality within the whole database is generally decreasing, although the rate of decrease depends on the initial age of the entry. Given this, it is of interest to see how the quality of annotations in only new entries changes over time. For this, we extract annotations for those entries that appeared for the first time in a given version. The resulting graph for this is shown below:

This graph also shows a steady decrease over time, a similar pattern to most of our previous analyses. These results are interesting; it would appear that annotation of new data is getting worse over time. It also appears to have a detrimental effect on other annotations. We have discussed the increase in data in relation to annotation quality, but the impact of size alone would not explain the decrease of annotation quality in older entries. One possible explanation for this is the protocol used for annotation curation.
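The entry sets used in the comparisons above can be sketched with plain set operations. This is only a minimal illustration, assuming each version has already been reduced to a collection of accession numbers; the function names and inputs are hypothetical, and merged or renamed accessions are not handled.

```python
def split_by_age(old_version, new_version):
    """Split the accessions of a newer release into mature and recent sets.

    Both arguments are iterables of accession numbers (hypothetical inputs).
    """
    old_ids = set(old_version)
    new_ids = set(new_version)
    mature = old_ids & new_ids   # entries present in both releases
    recent = new_ids - old_ids   # entries present only in the newer release
    return mature, recent


def first_appearances(versions):
    """Map each release label to the accessions first seen in that release.

    `versions` is a list of (label, accessions) pairs in release order.
    """
    seen = set()
    first = {}
    for label, accessions in versions:
        current = set(accessions)
        first[label] = current - seen  # not seen in any earlier release
        seen |= current
    return first
```

Annotations would then be extracted separately for each resulting set before fitting an alpha value to it.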

The curation of annotations is a clear process, consisting of 6 key steps. This process is detailed in , with an overview of the process shown in the figure below (taken from ).

Part of the protocol is to, for a given sequence, identify similar entries and then standardise and propagate annotation between these entries to ensure data consistency. Presumably, over time the curation process has undergone revisions (worthy of further investigation and another blog post!) due to changes in resources, increase of data, and so on. It is possible that this curation process has been refined to deal with larger amounts of data and quicker release dates. Both of these are true for Swiss-Prot over time: early versions of Swiss-Prot saw around 1000 new entries being added, with the later versions seeing around 30,000 new entries, whilst the release cycle is more frequent by a couple of months. Although the increase of manually curated entries and faster release dates could be due to more curators rather than a change in annotation protocol (which will be investigated further), it is plausible that attempts to standardise annotation between similar entries are actually having a detrimental effect on overall annotation quality.

## Annotation maturity: Average entry age within UniProtKB

26 January, 2012 (11:04) | Uncategorized | By: mj_bell

As we know, the UniProtKB database is growing exponentially. Our prior quality analysis of whole UniProtKB versions has shown that, over time, annotations are becoming more geared towards the annotator; that is, readers now require more effort to interpret and digest the annotations than in earlier versions of UniProtKB. The exponential growth of the database gives a plausible explanation for this: the initial quality of annotations cannot be maintained. This explanation also fits with other research.

Given this, we would like to abstract away from the growth of the database and see how sets of annotations change over time. For example, how do annotations for a set of entries in both Swiss-Prot Version 9 and Swiss-Prot version 45 compare? Our initial step is to investigate the average age of entries within UniProtKB.

Within each entry there is a set of date stamps, as explained in the UniProtKB manual, one of which indicates when the entry first appeared in the database. For each entry within a database version we extract this date and calculate the average. The figure below shows the average entry age for both TrEMBL and Swiss-Prot:

This graph shows that the average entry age increases over time. These results aren’t unexpected, as new data is constantly being added, which will gradually push the average age up. Although entries in Swiss-Prot come from TrEMBL, they overwrite the initial TrEMBL date and state the date integrated into Swiss-Prot, hence why we see a similar pattern for both Swiss-Prot and TrEMBL. Additionally, older entries can become deleted or merged. Very little information is given about what happens to the dates for merged entries. It isn’t uncommon to see two or more entries becoming merged; when this happens, one accession number is used as a primary accession number with the remaining accession numbers becoming secondary accession numbers. The primary accession isn’t necessarily the oldest entry and we only extract dates from primary accession numbers.
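As a rough sketch of the date extraction described above: the code below assumes flat-file entries in which the first `DT` line carries the date the entry first appeared, written as `DD-MON-YYYY` (for example `DT   21-JUL-1986, ...`). That layout, and the function names, are assumptions for illustration rather than the parser actually used for these graphs.

```python
from datetime import date

# Month abbreviations as they appear on DT lines (assumed upper-case English).
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


def first_dt_date(entry_text):
    """Return the date on the first DT line of an entry, or None if absent."""
    for line in entry_text.splitlines():
        if line.startswith("DT"):
            # Take the first token after the line code and drop a trailing comma.
            token = line[2:].strip().split()[0].rstrip(",")
            day, mon, year = token.split("-")
            return date(int(year), MONTHS[mon.upper()], int(day))
    return None


def average_date(dates):
    """Average a non-empty list of dates via their ordinal day numbers."""
    ordinals = [d.toordinal() for d in dates]
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))
```

Only the primary accession of each entry would be passed through `first_dt_date`, in line with the caveat about merged entries.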

Whilst this graph shows that the average entry age is increasing, the difference between the release date and average entry date also shows an initial increase. We currently have over 20 years’ worth of Swiss-Prot versions, with the average record age for the latest version being around 5 years, compared to version 9, where it is around a year and a half. This difference between the average entry date and the release date is shown below:

To illustrate this, the first graph shows that Swiss-Prot version 9 was released in November 1988 and that the average entry release date was July 1987; the difference between these dates is 1 year and 4 months, which is reflected in the bottom left point in the above graph. This graph is slightly different from what we expected. Initially we see the age difference increasing, but it then starts to decrease (Swiss-Prot) or level off (TrEMBL); we would expect a constant increase. We assume age and maturity are linked; something old is more mature than something young. If this is indeed the case, then this would suggest Swiss-Prot is decreasing in maturity, whilst TrEMBL remains at a steady level of maturity. It could be suggested that the increase in entries hit a maximum level, before having a detrimental effect. A similar graph and suggestion could be shown by, for example, the response time for a server with a constant increase in concurrent users querying a database.
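The worked example for Swiss-Prot version 9 can be reproduced with a couple of lines of arithmetic. This is only a sketch at month resolution; `months_between` is a hypothetical helper, not part of our analysis code.

```python
def months_between(earlier, later):
    """Whole calendar months from one (year, month) pair to a later one."""
    (y1, m1), (y2, m2) = earlier, later
    return (y2 - y1) * 12 + (m2 - m1)


# Average entry date (July 1987) versus release date (November 1988).
gap = months_between((1987, 7), (1988, 11))
years, months = divmod(gap, 12)
print(f"{years} year(s) and {months} month(s)")  # 1 year(s) and 4 month(s)
```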

Although we are viewing entry dates rather than annotation quality, this conclusion fits with that of our other analyses: the rate of data growth is outstripping our ability to deal with it. Following on from this, we now wish to look at maturity of annotations within entry sets. This is discussed in the next blog post.

## Most frequently occurring sentences in UniProtKB and their propagation through the Web

4 November, 2011 (16:16) | Sentence reuse, Uncategorized | By: mj_bell

As already established, sentence reuse is common within UniProtKB. Obviously, some sentences will have higher reuse than others, giving an indication as to their information content. Below we show the top 10 sentences for the first versions of Swiss-Prot and TrEMBL, and also for Version 15 of UniProtKB/TrEMBL and UniProtKB/Swiss-Prot.

Top 10 sentences for Swiss-Prot Version 9:

| Sentence | Occurrences |
| --- | --- |
| dimer of identical chains. | 141 |
| tetramer of identical chains. | 125 |
| to other ef-hand calcium binding proteins. | 94 |
| ferredoxin are iron-sulfur proteins that transfer electrons in a wide variety of metabolic reactions. | 79 |
| these proteins may be involved in dormant spore’s high resistance to uv light. | 73 |
| nadh + ubiquinone = nad(+) + ubiquinol. | 69 |
| 4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c. | 68 |
| cytoplasmic. | 63 |
| avian ovomucoid consist of three homologous tandem kazal family inhibitory domains. | 63 |
| subunit i, ii, and iii form the functional core of the enzyme complex. | 52 |

Top 10 sentences for UniProtKB/Swiss-Prot Version 15:

| Sentence | Occurrences |
| --- | --- |
| cytoplasm (by similarity). | 69247 |
| homodimer (by similarity). | 28182 |
| cytoplasm. | 26648 |
| nucleus (by similarity). | 17013 |
| cytoplasm (potential). | 16029 |
| secreted. | 15020 |
| nucleus. | 14855 |
| monomer (by similarity). | 13922 |
| cytoplasm (probable). | 12340 |
| the rnap catalytic core consists of 2 alpha, 1 beta, 1 beta’ and 1 omega subunit. | 11460 |

Top 10 sentences for TrEMBL Version 1:

| Sentence | Occurrences |
| --- | --- |
| d-ribulose 1,5-bisphosphate + co(2) = 2 3-phospho-d-glycerate. | 538 |
| integral membrane protein. | 488 |
| belongs to the cytochrome p450 family. | 460 |
| 4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c. | 374 |
| subunit i and ii form the functional core of the enzyme complex. | 374 |
| nuclear. | 370 |
| subunit ii binds cu(a) and cytochrome c. | 370 |
| electrons originating in cytochrome c are transferred via heme a and cu(a) to the binuclear center formed by heme a3 and cu(b). | 362 |
| copper a and heme group. | 362 |
| to other mitochondrial or bacterial cox2 subunits. | 362 |

Top 10 sentences for UniProtKB/TrEMBL Version 15:

| Sentence | Occurrences |
| --- | --- |
| the sequence shown here is derived from an embl/genbank/ddbj whole genome shotgun (wgs) entry which is preliminary data. | 1518744 |
| endonucleolytic cleavage to 5′- phosphomonoester. | 262057 |
| mitochondrion inner membrane; multi-pass membrane protein (by similarity). | 96557 |
| contains 1 reverse transcriptase domain. | 84828 |
| cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water. | 83892 |
| subunits 1- 3 form the functional core of the enzyme complex. | 83892 |
| cytoplasm (by similarity). | 81184 |
| contains 1 peptidase a2 domain. | 71851 |
| belongs to the heme-copper respiratory oxidase family. | 64848 |
| 4 ferrocytochrome c + o(2) + 4 h(+) = 4 ferricytochrome c + 2 h(2)o. | 63881 |

It is interesting to view which sentences come out top, and how these change over time. One of the most obvious things to note is the drastic reduction in sentence length for Swiss-Prot over time, whereas TrEMBL sustains a similar sentence length.

However, whilst this view is interesting, the main purpose is as an initial test to see how sentences from UniProtKB propagate through the web, if at all. We would expect annotations to flow to various other places, whether it is blogs and webpages such as this, or to more formal places, such as other databases. Checking for the flow of the most popular sentences will give an indication as to whether further investigation for lesser used sentences is worthwhile.

As an initial and simple test we can search for exact sentences which contain five or more words on Google. We choose sentences with 5 or more words as it is less likely a matching sentence would be independently curated by chance. Additionally, sentences with 5 words or more are seen to hold more information content (this will be discussed more in future blog posts).
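A minimal sketch of how the candidate search strings might be prepared is given below. It only filters sentences by word count and wraps them in double quotes for an exact-phrase search; it does not call any search service, and the function name is hypothetical.

```python
def exact_phrase_queries(sentences, min_words=5):
    """Return quoted exact-phrase queries for sentences with enough words."""
    queries = []
    for sentence in sentences:
        # Whitespace-separated tokens are counted as words.
        if len(sentence.split()) >= min_words:
            queries.append('"' + sentence + '"')
    return queries
```

For example, “cytoplasm (by similarity).” would be skipped, whereas “belongs to the cytochrome p450 family.” would be kept.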

An example result can be viewed for the sentence “cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water.”, which occurs with high frequency in UniProtKB/TrEMBL version 15. It is hard to get a reasonable figure as to how many Google hits these results return. Results are, at best, a rough estimate, and the way these estimates are calculated causes particular errors for some of the sentences we are searching for, given we only want exact sentences (more information about this can be found here and here). Nevertheless, we can see the sentence has spread into a number of other places. These include STRING, PhosphoSite, DrugBank and a paper published in the Neurobiology of Aging journal.

Some of these pages state that the information was pulled from UniProtKB, with some others linking to UniProtKB, whilst others do not make any reference to UniProtKB. We do have to consider that UniProtKB could have pulled this sentence from another location or the sentence was curated elsewhere by chance, albeit unlikely for all cases. Given the sentence appeared in a journal article it is possible this is the source of the sentence, which coincides with the UniProtKB curation protocol.

Being unable to easily tell the original source of this data isn’t uncommon, and highlights one of the major problems we are attempting to address. We know some of the sites pull the data from UniProtKB, and assume those mentioning UniProtKB also do, even if not explicitly stated. This can cause major issues should this annotation be found to be erroneous. Do these sites do regular checks, either automated or manual, for changes to the original annotation/entry, or would this annotation remain unchanged? This simple example highlights the difficulty with detecting and tracing error propagation and provenance, and its subsequent impact on assessing an annotation’s quality and correctness.

## UniProtKB and Benford’s Law

5 October, 2011 (14:36) | Annotation Quality, Miscellaneous | By: mj_bell

In the last blog post we looked at data parsing, and how Zipf’s law could possibly be used to detect parsing errors. Whilst reading a recent blog post by Ben Goldacre I was reminded of Benford’s law – which shares a number of similarities to Zipf’s – and considered how it may also be applicable to detecting parsing errors.

Like Zipf’s Law, Benford’s Law is a rather interesting empirical law. The law states that the occurrences of first digits for a set of numbers aren’t evenly distributed. By first (or leading) digit, we simply mean that we take only the first digit of a number, regardless of its size. So the leading digit of 18362 is 1, whilst the leading digit of 489 is 4. Interestingly, the chance of the first digit being 1 is around 30%, whereas the chance of the leading digit being 9 is around 5%, like the distribution of grid-line widths on a logarithmic scale. For a given set of numbers, if each leading digit $d$ ($d \in \{1, \ldots, 9\}$) occurs with probability $P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\left(1 + \frac{1}{d}\right)$, then the numbers are said to satisfy Benford’s law.
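The formula above is easy to check numerically. The sketch below computes the expected Benford probabilities and the observed leading-digit distribution for a list of positive integers; which counts are fed in is left open here, and the function names are purely illustrative.

```python
import math
from collections import Counter


def benford_expected():
    """Expected probability of each leading digit d: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Leading decimal digit of an integer (sign ignored)."""
    return int(str(abs(int(n)))[0])


def leading_digit_distribution(numbers):
    """Observed proportion of each leading digit among non-zero integers."""
    digits = [leading_digit(n) for n in numbers if int(n) != 0]
    if not digits:
        raise ValueError("need at least one non-zero number")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}
```

Plotting the two dictionaries against each other gives the kind of comparison described for the graphs in this post, with the expected values playing the role of the predicted distribution.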

Also like Zipf’s Law, Benford’s Law has been seen to hold in numerous applications, assuming the data is distributed over multiple orders of magnitude. However, one of the most interesting applications of Benford’s law is the detection of fraud. When numbers are being fabricated, in an attempt to disguise fraud, people appear to distribute the numbers evenly, when naturally this isn’t the case. Given this, we are interested to see if UniProt also follows Benford’s Law and if it can be used for detecting parsing errors.

We have created a number of graphs for Swiss-Prot and TrEMBL for Benford’s law. For each graph we have plotted orange diamonds to show the distribution predicted by Benford’s law. On each graph the X-axis represents leading digits with the corresponding percentage being shown on the Y axis. Below are the graphs showing Swiss-Prot version 9 and UniProtKB/Swiss-Prot Version 15:

These graphs show that Swiss-Prot follows a Benford distribution. This was also true for intermediate versions of Swiss-Prot. Additionally, we did the same for TrEMBL, with graphs for versions 1, 28 and UniProtKB/TrEMBL version 15 shown below:

For TrEMBL the results are much more scattered, with later versions not really following Benford’s Law. We appear to be able to extract similar meaning from these graphs as we did from the Zipf ones. For example, it shows Swiss-Prot is more “natural” than TrEMBL and that early versions of TrEMBL are of better quality than later versions.

Whilst this view gives some further confidence in the underlying annotation, investigation into error detection isn’t as clear. One of the main issues is we don’t have much erroneously parsed data to check: only those datasets that contain copyright and topic header “errors”. Using Benford’s Law we were unable to detect any “errors” in these datasets. Data that produced a major skew for Zipf’s law produced negligible impact on Benford’s law. However, it is likely that different kinds of parsing errors could be detected with Benford’s law that would be unnoticeable by Zipf’s law. A combination of both methods could be employed when checking parsed data, and any subsequent work on this would be interesting to see.

## Have I parsed my data correctly?

22 September, 2011 (16:06) | Miscellaneous | By: mj_bell

Most of our work has been based on data parsed from text files. Extracted data has included single words and whole sentences from gigabytes of raw text. With such overwhelming amounts of data, how can we be confident that we have correctly parsed our data?

Obviously some basic checking was performed. This included counting the number of entries expected vs number parsed, manual checking of random entries that have been parsed, parsing artificially created data and so on. Doing checks such as these helps detect parsing errors; in our case we identified a mismatch between the expected and parsed entry counts (which turned out to be an error in the UniProt release notes that has subsequently been corrected). However, as with all testing, you can only show the presence of bugs/errors, not prove that parsed data is error free. In many cases you also don’t actually know everything you need to test for.
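The first of these checks, expected versus parsed entry counts, can be sketched as follows. It assumes a flat-file layout in which every entry ends with a line containing only `//`; the function names and the shape of the report are hypothetical.

```python
def count_entries(lines):
    """Count entries by their terminator lines (assumed to be '//')."""
    return sum(1 for line in lines if line.strip() == "//")


def check_entry_count(lines, parsed_count, expected_count):
    """Compare the raw terminator count, the parser's count and the expected count."""
    raw_count = count_entries(lines)
    return {
        "raw_count": raw_count,
        "raw_matches_parsed": raw_count == parsed_count,
        "parsed_matches_expected": parsed_count == expected_count,
    }
```

If the raw and parsed counts agree but differ from the expected figure, the problem may lie with the expected figure itself, as happened with the release notes.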

As detailed in previous blog posts, we have been applying power laws to our parsed data. In a number of cases the resulting graphs have given unexpected results – such as artifactual kinks or outlying alpha values. Upon inspection these were due to incorrectly parsed data. In the first instance, a kink in the tail of the graph was due to incorrectly parsing copyright statements. These statements are an example of an error we couldn’t foresee or test for. Similarly, we also detected the incorrect parsing of topic block headers. Another error was due to incorrect escaping of speech marks when reading data dumps (OK, not a parsing error, but still an error detected with this method).

It appears that a side-effect of our original analysis has been the detection of incorrectly parsed data. The usage of Zipf’s Law is common; numerous papers exist that claim a Zipfian distribution has been found in all kinds of natural and man-made phenomena. This is also true of similar empirical laws, such as Pareto’s law. Given this, it isn’t unreasonable to hypothesise that power-law approaches could be applied to the detection of incorrectly parsed data.
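To make the idea concrete, here is a rough sketch of a rank-frequency check. It estimates the slope of the rank-frequency curve by ordinary least squares on log-log values, which is a crude estimator and not necessarily how the alpha values in these posts were fitted; the function names are illustrative only.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two points with distinct ranks.
    """
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope far from the values seen for comparable datasets, or a visible kink when the points are plotted, would be a prompt to inspect the parsed data by hand.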

I am not aiming to investigate this further; rather, I found it an interesting side-effect of our analysis. Should this approach be explored, it would be necessary for the algorithm to distinguish between parsing errors and ‘real’ outliers. It would also have to be available in a way that meant it was easy to use and quick to run (in proportion to the total parsing time). It would be interesting to see if such an approach could reliably detect errors in parsed data, and given the generic nature of these approaches, the amount of literature on the subject and my experience, I would suspect so.

## Analysing selenocysteine sentence reuse and flow.

31 August, 2011 (11:17) | Annotation Quality, Sentence reuse, Website | By: mj_bell

The sentence “Selenocysteine is encoded by the opal codon.” was previously discussed, having been identified through a network view of sentence usage within Swiss-Prot version 9. Our interest was mainly due to it being present in two entries in different interconnected groups. In that blog entry we looked at the similarities (or lack of) between the entries that shared the sentence. Given the development of dynamic graphs (as discussed in the last blog post) we thought it worth revisiting to see what additional information we can extract from the annotation space. Firstly, we can look at the graph for sentence occurrences over time:

This shows a steady increase in reuse of the sentence in Swiss-Prot until Version 44, when it drops steeply from 81 occurrences to 9. We can also see that in TrEMBL there is only ever one occurrence. Previously we knew that the sentence was last seen in UniProtKB/Swiss-Prot version 8, when it had only 2 occurrences; however, we didn’t know about the big decline in its usage before its removal, nor could we easily tell how many of the entries were from TrEMBL and how many from Swiss-Prot. The removal of this sentence appears to be due to the incorporation of selenocysteine information into an entry’s feature table. We are unsure when selenocysteine information was added to the feature table; presumably it was after Swiss-Prot Version 44, with the remaining entries initially being missed. There doesn’t appear to be any other clear reason for this sharp decline.

We can also view which entries the sentence occurs in, and for which database versions:

In this view, each point represents a sentence that was in a given entry (as stated on the x-axis) for a given version (release date on the y-axis). A red point indicates the entry is in the TrEMBL dataset, whilst blue indicates inclusion in Swiss-Prot. This view shows a couple of interesting points:

1. Reappearance within an entry

   In two entries (P18283 and P12079) it is interesting to see the sentence being removed and then later reappearing. In both cases, the sentence is removed after Swiss-Prot Version 23, with it reappearing in version 38 for P18283 and version 42 for P12079. Looking at the history of P18283 and P12079 we can see that the sentence was “replaced” with “THE ACTIVE-SITE SELENOCYSTEINE IS ENCODED BY THE OPAL CODON, UGA (BY SIMILARITY).”. “By similarity” is mainly used in TrEMBL, where the information is inferred computationally. This is illustrated below, with the corresponding graph for the similarity sentence:

   Whilst the sentence occurs mostly in TrEMBL entries, it does appear in some Swiss-Prot entries.

   Returning to the original sentence, in the latest version of P12079 (which has merged with P11352) a comment has been added that states “sequence was originally thought to originate from human.” Looking at the history of both entries, this confusion appears to have led to the uncertainty about the selenocysteine annotation. The sentence reappears in P12079 when it is merged with P11352. There is no clear indication in P18283 why the sentence was reinstated. In the latest version of both these entries, the encoding of selenocysteine is documented in the feature table.

2. Transient appearance

   In one entry (P21765) we see the sentence appearing, but in the subsequent version it is removed. It only appears in a single version of the entry. In the latest version of the entry there is no mention of selenocysteine, so it would appear this was an erroneous annotation. The entry Q49613 also appears only once in the second graph; however, that entry was deleted after one version.

3. Multiple entry merges

   A large majority of the entries that share the sentence in the graph are merged in later versions. For example, in the latest version accessions P22352, O43787, Q86W78, Q9NZ74 and Q9UEL1 are all merged into a single entry. In our graph, they are all independent entries containing the sentence. This is also the case for a number of other entries on the graph, including the final two entries to have the sentence.

The analysis above gave some interesting cases and patterns — but what value can we take from this? Whilst this is only a single example, we can start to see possible uses and applications for this data. However, all of these will require further analysis:

• Annotation error rate in Swiss-Prot has been estimated to be as high as 43%. We have seen that annotations that were transient, or that were removed and later reappeared, appear to be associated with an annotation error or entry error. This could suggest that more stable annotations are more likely to be correct.
• As discussed and explored during the network analysis blog post, this approach allows us to see entries for which a sentence is (or was) included. We have seen that it is possible for entries with limited sequence similarity to share common annotations. This approach allows UniProt entries to be browsed for shared commonality through annotation, as opposed to just sequence similarity.
• The first graph showed levels of sentence reuse within Swiss-Prot and TrEMBL, culminating with a large decline, before complete removal. In this instance the sharp decline signified the inclusion of the sentence into the feature table, but major changes or fluctuations for other sentences could indicate additional information, such as a large erroneous annotation propagation.
• Levels of reuse can give an indication of the quality of the sentence. A heavily reused sentence is typically more generic than a sentence that is unique to a single entry. This suggests unique sentences are more ‘meaningful’ than generic sentences, as quantified using approaches such as Inverse Document Frequency and Term Frequency-Inverse Document Frequency.
• As stated in the curation protocol, annotations are standardised across homologous proteins. This means it is possible that the inclusion of a sentence in an entry will propagate to homologous entries for a given database version. This is shown in the graph, and is expected. Should the sentence be removed from one of these entries, it is also possible that the sentence will be removed from the homologous entries. We can use these approaches to determine if the sentence still exists amongst any of these entries.

• Each sentence has to originate in an individual entry (or group of entries in the same version). In this particular example, the sentence originated in two entries, and it was still in these entries before removal. Should the sentence have been removed from the original entries, it may be worth reassessing other entries that subsequently include this sentence. If the sentence was found to be false in the root entry, then it is possible that an error has propagated into numerous other entries.

• We have seen in this example that the majority of entries sharing this sentence have been merged. It is possible that an entry is merged with another entry, or entries, which do not share a particular sentence. We could hypothesise that a merged entry which shared a common sentence prior to a merge means the sentence is more likely to be correct than one which isn’t shared by the other entry (or entries).
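The Inverse Document Frequency idea mentioned in the list above can be sketched by treating each entry as a document and each exact sentence as a term. This is one plain formulation, idf = log(N / df), chosen for illustration; the input structure and function name are assumptions.

```python
import math


def sentence_idf(entries):
    """Compute idf = log(N / df) for every sentence.

    `entries` maps an accession to the sentences annotated on that entry;
    df is the number of entries containing the sentence at least once.
    """
    n_entries = len(entries)
    doc_freq = {}
    for sentences in entries.values():
        for sentence in set(sentences):  # count each entry once per sentence
            doc_freq[sentence] = doc_freq.get(sentence, 0) + 1
    return {s: math.log(n_entries / df) for s, df in doc_freq.items()}
```

A sentence present in every entry scores 0, whilst a sentence found in a single entry scores log(N), matching the intuition that heavily reused sentences are more generic.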

As a final point, it is worth mentioning sentence similarity. We have purposefully focused on the reuse of exact sentences, as we know they are standardised and copied between entries as part of the curation protocol, thus allowing us to analyse annotation flow and reuse. However, we can see that there are sentences that are very similar in the database:

• the active-site selenocysteine is encoded by the opal codon, uga.
• the active-site selenocysteine is encoded by the opal codon, uga (by similarity).
• the selenocysteine is encoded by the opal codon, uga.
• the active- site selenocysteine is encoded by the opal codon, uga.
• the active-site selenocysteine is encoded by a normal cys codon.

In one case the difference is an extra space (“active- site”), which is only in a single entry and likely a copy and paste typo. The sentence “the selenocysteine is encoded by the opal codon, uga.” appears in entries P26970 and P26971, and is later replaced with “the active-site selenocysteine is encoded by the opal codon, uga.” in Swiss-Prot version 30. These first 4 sentences are semantically similar; that is, the message they convey is pretty much the same, and you could argue they should all be treated as the same sentence.

However, we need to be careful with some sentences that appear highly similar. For example, the last sentence (“the active-site selenocysteine is encoded by a normal cys codon.”) states that selenocysteine is encoded by a cysteine codon, not the opal codon, in the entries it is used in. Whilst it is semantically similar in the sense that selenocysteine is encoded by a particular codon, the codon is a different one. Therefore, investigating sentences that are semantically similar appears to be worthwhile, but care must be taken if treating similar sentences “as one”.
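A simple character-level measure illustrates both the opportunity and the risk. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score; it measures string overlap only and says nothing about meaning, which is exactly why the “cys codon” sentence needs care.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity ratio between two sentences (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


base = "the active-site selenocysteine is encoded by the opal codon, uga."
typo = "the active- site selenocysteine is encoded by the opal codon, uga."
cys = "the active-site selenocysteine is encoded by a normal cys codon."

for other in (typo, cys):
    print(f"{similarity(base, other):.3f}  {other}")
```

The typo variant scores very close to 1, whilst the “cys codon” sentence also shares a long prefix with the original despite stating something different, so any threshold for treating sentences “as one” would need manual review.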

## Producing web-based dynamic graphs

5 August, 2011 (13:05) | Sentence reuse, Uncategorized, Website | By: mj_bell

As part of my work on sentence reuse I have been investigating ways to visualise various sets of data on my website. An obvious requirement of this is that the graphs must be generated dynamically, with the resulting graph depending on a user’s query. We also have to account for various types of data, not just numerical data. For example, we want to produce a graph that shows, for each database version, which entry (or entries) a given sentence occurs in. Below is an early mock-up of such a graph in R, illustrating this example:

From this graph we can see that the given sentence initially only occurs in two database entries, before propagating through the database. We can also see some instances where it has been removed from an entry and then reappears in a later version. This current view has a number of issues, with perhaps the main issue being that it isn’t always clear which point corresponds to which database entry and accession number. The mock-up also doesn’t make it clear which entries come from TrEMBL and which from Swiss-Prot, and the labels should be more meaningful (release dates on the Y axis and actual accession numbers on the X axis).

A range of tools looking to meet these requirements were tried and tested, including:

A number of issues were identified with most approaches. These included:

• The majority of approaches made use of the canvas element, which isn’t supported by Internet Explorer (without using additional scripts)
• Reliance on a third party server (couldn’t run all code locally)
• No ability to handle non-numeric data types
• Problems with licensing

We opted to use the HighCharts project for our charts, as it appeared to meet our full requirements. Benefits of HighCharts are that it is open source, provides numerous chart types, allows zooming into sections of a graph, can display tooltip text, and can export and save graphs. As an example, we have produced a graph similar to the one above, this time dynamically created in HighCharts:

As can be seen, it overcomes some of the drawbacks of the previous approach. TrEMBL entries are clearly visible in red, whilst Swiss-Prot entries are in blue. The axis labels are also more meaningful and we can hover over any point to get a tooltip, clearly identifying which accession number and version it relates to. Data points are also clickable, providing links to further information, and you can zoom into areas of the graph.

The production of these graphs is part of an overall view we are producing to allow information about sentence reuse in UniProtKB to be analysed and browsed. We will blog about further developments in future posts.

## Levels of sentence reuse in UniProtKB

26 July, 2011 (12:35) | Sentence reuse, Uncategorized | By: mj_bell

As is frequently highlighted, data being added to biological databases is ever increasing; typically at an exponential rate. This is true for the number of entries added over time to both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as illustrated below:

UniProtKB offers a number of detailed statistics for each release of Swiss-Prot and TrEMBL, including the total number of species represented (including which species are most represented), sequence sizes, total number of references, and so on. These statistics are used for various analyses, such as analysing database growth and total coverage. As discussed previously we have started to analyse sentence reuse within UniProtKB, and can provide additional statistics from a sentence viewpoint. For example, we can see that the total number of sentences within Swiss-Prot and TrEMBL is also increasing at an exponential rate:

This level of growth isn’t unexpected. As more entries are added to UniProtKB, more sentences are either added or copied to these entries, as each entry contains at least a single sentence of annotation. It is interesting to see that TrEMBL and Swiss-Prot had similar numbers of sentences until the last four versions, where TrEMBL convincingly overtook Swiss-Prot. Perhaps a more interesting perspective is to look at how many unique sentences occur within each dataset:

This shows a steady linear increase of unique sentences within Swiss-Prot. For TrEMBL the number of unique sentences appears to stay at a similar level, but a clearer indication is shown below:

Whilst the levels initially show a constant increase, they then fluctuate erratically. Finally, we can view the percentage of unique sentences in both TrEMBL and Swiss-Prot:

This shows that the percentage of unique sentences decreases over time. It shows a steady decline for Swiss-Prot, whilst TrEMBL seems to jump down in steps. Given that TrEMBL is created automatically, these jumps and the erratic levels of unique sentences perhaps relate to changes in the annotation algorithm. One of our reasons for analysing sentence reuse is to determine which sentences have more information content than others.

The more frequently a sentence occurs, the more generic it is. Therefore, a sentence that is unique to an individual entry would be expected to have higher information content than one that appears in numerous entries. For example, the sentences “atp + h(2)o + h(+)(in) = adp + phosphate + h(+)(out).”, “contains 1 ring-type zinc finger” and “monomer (by similarity).” all occur in numerous entries, whereas the sentences “has weak antifungal activity toward c.comatus and p.piricola”, “the ligand for sev is the boss (bride of sevenless) protein on the surface of the neighboring r8 cell.” and “brain; in the cal region of hippocampus, the medial habenula, and raphe nuclei.” occur in only one entry within a dataset. Given this, and the levels of reuse illustrated earlier, it would seem that sentences that are unique to an entry, or show very low levels of reuse, provide better quality information to a reader.
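The counts discussed in this post can be illustrated with a minimal sketch. It assumes that entries are held as a mapping from accession to a list of sentences, and that "unique" can mean either a distinct sentence in the dataset or a sentence found in only one entry; both are reported. The function name, the input format and the toy accessions are hypothetical, not the pipeline actually used for the graphs.

```python
from collections import Counter


def sentence_reuse_stats(entries):
    """Summarise sentence reuse for one dataset.

    `entries` maps an accession to its list of annotation sentences
    (a hypothetical input format, not the author's actual pipeline).
    """
    counts = Counter()
    for sentences in entries.values():
        # Normalise case and whitespace, and count a sentence once per entry.
        counts.update({s.strip().lower() for s in sentences})

    total = sum(counts.values())  # sentence occurrences across all entries
    distinct = len(counts)        # different sentences in the dataset
    single_entry = sorted(s for s, n in counts.items() if n == 1)
    return {
        "total": total,
        "distinct": distinct,
        "percent_distinct": 100.0 * distinct / total if total else 0.0,
        "single_entry": single_entry,
        "most_reused": counts.most_common(1),
    }


# Toy data: the accessions are made up; the sentences are quoted from the post.
toy_entries = {
    "ENTRY_A": ["monomer (by similarity).", "contains 1 ring-type zinc finger"],
    "ENTRY_B": ["monomer (by similarity).",
                "has weak antifungal activity toward c.comatus and p.piricola"],
    "ENTRY_C": ["monomer (by similarity).", "contains 1 ring-type zinc finger"],
}
stats = sentence_reuse_stats(toy_entries)
print(stats["total"], stats["distinct"], stats["percent_distinct"])
```

Running the same function over each database version in turn would give the kind of per-version totals and percentages plotted above.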

## Selenocysteine is encoded by the opal codon.

6 May, 2011 (13:17) | Networks, Sentence reuse | By: mj_bell

As mentioned previously, sentence reuse is common in both manual and automated annotation curation. Although our analysis is in its early stages, we have already noticed that, in most cases, sentences are only shared between clusters of proteins. In Swiss-Prot version 9 we noticed one sentence, “THE ACTIVE-SITE SELENOCYSTEINE IS ENCODED BY THE OPAL CODON, UGA.”, which was only linked between two proteins (P07203 and P07658) that were in separate clusters. To illustrate this, the image below shows an example of a sentence linked between two clusters. In the image each node is a protein entry and each edge is a sentence that appears in both its connected nodes.
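One way such a sentence could be found automatically is sketched here, under stated assumptions. The post does not say how clusters were identified, so this sketch treats a sentence as linking two clusters when it is the only sentence shared by two entries, both entries have other neighbours, and removing that edge leaves them disconnected. All function names and the toy accessions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_sentence_edges(entries):
    """Map each pair of entries to the set of sentences they share."""
    by_sentence = defaultdict(set)
    for accession, sentences in entries.items():
        for sentence in sentences:
            by_sentence[sentence].add(accession)
    edges = defaultdict(set)
    for sentence, accessions in by_sentence.items():
        for a, b in combinations(sorted(accessions), 2):
            edges[(a, b)].add(sentence)
    return edges


def connected(start, goal, adjacency):
    """Depth-first search: is `goal` reachable from `start`?"""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return False


def bridging_sentences(entries):
    """Return (sentence, entry_a, entry_b) for sentences that join two clusters."""
    edges = shared_sentence_edges(entries)
    bridges = []
    for (a, b), sentences in edges.items():
        if len(sentences) != 1:
            continue  # the pair is linked by more than one sentence
        # Rebuild the graph without the edge under test.
        adjacency = defaultdict(set)
        for x, y in edges:
            if (x, y) != (a, b):
                adjacency[x].add(y)
                adjacency[y].add(x)
        # Both ends must belong to a cluster of their own (assumption).
        if adjacency[a] and adjacency[b] and not connected(a, b, adjacency):
            bridges.append((next(iter(sentences)), a, b))
    return bridges


# Toy data with made-up accessions: two clusters joined by one sentence.
toy_entries = {
    "A1": ["cluster a sentence", "linking sentence"],
    "A2": ["cluster a sentence"],
    "B1": ["cluster b sentence", "linking sentence"],
    "B2": ["cluster b sentence"],
}
bridges = bridging_sentences(toy_entries)
print(bridges)
```

Rebuilding the graph for every candidate edge is slow on a full release; a dedicated bridge-finding algorithm would scale better, but the brute-force version keeps the idea visible.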

Due to the rarity of linked groups this sentence ‘jumped out’. Whilst this was enough to warrant further investigation, the fact that selenocysteine was apparently encoded by a stop codon (UGA) within these two proteins made it even more interesting. Upon investigation I found out that opal codons can do more than just terminate translation (8132075), and that the co-translational incorporation of selenocysteine is confirmed in a number of papers (e.g. (8744353)).

It is still interesting to see no clear link between the two entries other than this single sentence. One is a protein from a prokaryotic organism (Escherichia coli) and the other is from a eukaryotic organism (Homo sapiens). A BLAST search showed very little sequence similarity between the two entries (even using the various versions of sequences). Additionally, the stated functions of the two proteins are very different. These differences would suggest that a manual curator using the approach given by UniProt would not consider these two proteins at the same time and that they would be curated independently.

A possible explanation for the reuse of the sentence would be that both entries reference the same paper. However the supporting papers ( and for P07203 and (2941757) for P07658) are different. None of the papers state the sentence as shown in the annotation, share any authors, or reference each other. This would suggest the sentence was written manually and then copied between the two entries.

In later versions of the database the sentence occurs in a total of 84 entries; it was last seen in UniProtKB/Swiss-Prot Version 8 (P83564) and UniProtKB/TrEMBL Version 24 (P83564). The sentence has, presumably, been removed due to the mutation being added to each entry's feature table. Browsing annotations in this way is currently cumbersome and labour intensive, so we are developing tools to aid this; these tools will be discussed in the blog as they near (beta) completion.

Whilst our analysis is in the very early stages, we can already show a link between two proteins that would be seen as unrelated in the traditional sense via functional annotation.

## UniProtKB and Sentence Reuse

25 April, 2011 (12:26) | Annotation Quality, Sentence reuse | By: mj_bell

Previously we have looked at the occurrences of words within UniProtKB and seen varying degrees of reuse. What about whole sentences? When annotations are created, either manually or automatically, sentences are frequently reused (i.e. copied and pasted from one entry to another), as detailed in the curation protocol. By investigating this reuse, we can look at tracking annotation flow through database versions, analyse sentence distribution and provide a useful interface to this data by offering, for example, a concordance view.
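Tracking a sentence through database versions could look something like the following minimal sketch. It assumes each release is available as a pair of a version label and a mapping from accession to sentences; the function name, the input format and the toy data are hypothetical.

```python
def sentence_history(releases, sentence):
    """List the entries containing `sentence` in each release.

    `releases` is a list of (version_label, {accession: [sentences]}) pairs
    in release order (a hypothetical input format).
    """
    target = sentence.strip().lower()
    history = []
    for version, entries in releases:
        accessions = sorted(
            accession
            for accession, sentences in entries.items()
            if target in (s.strip().lower() for s in sentences)
        )
        if accessions:
            history.append((version, accessions))
    return history


# Toy releases with made-up version labels and accessions.
toy_releases = [
    ("v1", {"X1": ["Monomer."]}),
    ("v2", {"X1": ["Monomer."], "X2": ["monomer."]}),
    ("v3", {"X2": ["Homodimer."]}),
]
history = sentence_history(toy_releases, "monomer.")
first_seen = history[0][0]
last_seen = history[-1][0]
print(history, first_seen, last_seen)
```

The first and last items of the returned list give the versions in which a sentence was first and last seen, and the accession lists show how far it had spread at each point.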

Having extracted sentences for use in this analysis, it required little additional effort to extract the number of occurrences of each sentence for each database version. With these counts, we thought it would be interesting to see initially whether fitting a power law to the sentences would give similar results to those from fitting to words. This was done for both Swiss-Prot and TrEMBL. Graphs for UniProtKB/TrEMBL versions 1 and 13 are shown below:
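As a rough illustration of what fitting a power law to sentence occurrences involves, the sketch below estimates the exponent alpha from a list of per-sentence occurrence counts. It uses a common approximate maximum-likelihood estimator for discrete data, which is an assumption here: the post does not state the fitting procedure used, and the function name and the default `x_min` are hypothetical.

```python
import math


def estimate_alpha(occurrences, x_min=1):
    """Approximate maximum-likelihood estimate of a power-law exponent.

    `occurrences` holds one integer per sentence: how often it occurs.
    Uses alpha = 1 + n / sum(ln(x / (x_min - 0.5))) over all x >= x_min,
    an approximation for discrete data (assumed, not the author's method).
    """
    if x_min < 1:
        raise ValueError("x_min must be at least 1")
    tail = [x for x in occurrences if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Toy occurrence counts: many rare sentences and a few heavily reused ones.
toy_counts = [1, 1, 1, 1, 1, 2, 2, 3, 5, 8, 20]
alpha = estimate_alpha(toy_counts)
print(round(alpha, 3))
```

A heavier tail of reused sentences gives a smaller estimate, which is the direction of change described for later versions. The approximation is coarse when `x_min` is small, so the value is best read as indicative rather than exact.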

The TrEMBL graphs show the same pattern as the word graphs; over time, the amount of reuse increases heavily. Graphs for Swiss-Prot version 20 and UniProtKB/Swiss-Prot version 11 are shown below:

Like TrEMBL, these graphs also follow the pattern of the word graphs. Over time it is clear that Swiss-Prot becomes more mature, with the clear development of two slopes.

These graphs show that sentence reuse occurs in both TrEMBL and Swiss-Prot and the body of reused sentences also increases. Given this information, it would appear highly likely that analysing the sentence distribution would be worthwhile.