Michael Bell

Ph.D. Students Blog


Selenocysteine is encoded by the opal codon.

6 May, 2011 (13:17) | Networks, Sentence reuse | By: mj_bell

As mentioned previously, sentence reuse is common in both manual and automated annotation curation. Although our analysis is in its early stages, we have already noticed that, in most cases, sentences are shared only among the proteins within a cluster. In Swiss-Prot version 9 we noticed one sentence, “THE ACTIVE-SITE SELENOCYSTEINE IS ENCODED BY THE OPAL CODON, UGA.”, which linked just two proteins (P07203 and P07658) that were in separate clusters. To illustrate this, the image below shows an example of a sentence linked between two clusters. In the image, each node is a protein entry and each edge is a sentence that appears in both of its connected nodes.

Clusters of genes and sentences
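The network view described above can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used for the analysis: the function names, the placeholder entries (`EXAMPLE_A2`, `EXAMPLE_A3`, `EXAMPLE_B2`) and the placeholder sentences are all assumptions made up for the example. Only the two accessions and the opal sentence come from the post. A sentence is treated here as "jumping out" if removing its edges splits a connected group of entries, which is one possible definition among several.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(annotations):
    """Map each sentence to the pairs of entries that both contain it."""
    entries_by_sentence = defaultdict(set)
    for entry, sentences in annotations.items():
        for sentence in sentences:
            entries_by_sentence[sentence].add(entry)
    return {
        sentence: set(combinations(sorted(entries), 2))
        for sentence, entries in entries_by_sentence.items()
        if len(entries) > 1
    }


def components(nodes, edges):
    """Return the connected components of the graph as a list of sets."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        result.append(group)
    return result


def bridging_sentences(annotations):
    """Sentences whose removal splits a connected group of entries."""
    edges = build_edges(annotations)
    nodes = set(annotations)
    all_edges = set().union(*edges.values()) if edges else set()
    baseline = len(components(nodes, all_edges))
    bridges = []
    for sentence in sorted(edges):
        remaining = set()
        for other, pairs in edges.items():
            if other != sentence:
                remaining |= pairs
        if len(components(nodes, remaining)) > baseline:
            bridges.append(sentence)
    return bridges


OPAL = "THE ACTIVE-SITE SELENOCYSTEINE IS ENCODED BY THE OPAL CODON, UGA."

# Toy data: only P07203, P07658 and the OPAL sentence are from the post.
# The other entries and sentences are invented placeholders.
TOY_ANNOTATIONS = {
    "P07203": {"PLACEHOLDER A1", "PLACEHOLDER A2", OPAL},
    "EXAMPLE_A2": {"PLACEHOLDER A1", "PLACEHOLDER A2"},
    "EXAMPLE_A3": {"PLACEHOLDER A1", "PLACEHOLDER A2"},
    "P07658": {"PLACEHOLDER B1", "PLACEHOLDER B2", OPAL},
    "EXAMPLE_B2": {"PLACEHOLDER B1", "PLACEHOLDER B2"},
}

print(bridging_sentences(TOY_ANNOTATIONS))
```

On this toy input the only sentence reported is the opal one, because each cluster is held together by two other sentences while the opal sentence is the sole link between the clusters.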

Due to the rarity of linked groups this sentence ‘jumped out’. Whilst this was enough to warrant further investigation, the fact that selenocysteine was apparently encoded by a stop codon (UGA) within these two proteins made it even more interesting. Upon investigation I found out that opal codons can do more than just terminate translation (8132075), and that the co-translational incorporation of selenocysteine is confirmed in a number of papers (e.g. (8744353)).
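To make the dual role of UGA concrete, here is a small, hedged sketch of translation in which selected UGA codons are read as selenocysteine (one-letter code U) instead of as a stop. The codon table is deliberately partial, the sequence is a made-up toy rather than either of the real proteins, and the `sec_codon_indices` parameter is an assumption of this example: it simply marks which codon positions are recoded and does not model how the cell makes that decision.

```python
# Partial codon table: only the codons used in the toy sequence below.
CODON_TABLE = {
    "AUG": "M", "GGU": "G", "AAA": "K", "UGC": "C",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons; UGA is the opal codon
}


def translate(mrna, sec_codon_indices=frozenset()):
    """Translate mRNA, reading UGA as selenocysteine (U) at flagged codons.

    sec_codon_indices holds zero-based codon positions (a hypothetical
    input for this sketch) at which UGA is recoded rather than terminating.
    """
    protein = []
    for index in range(0, len(mrna) - 2, 3):
        codon = mrna[index:index + 3]
        if codon == "UGA" and index // 3 in sec_codon_indices:
            protein.append("U")
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":
            break  # an ordinary stop codon ends translation
        protein.append(residue)
    return "".join(protein)


# Made-up toy sequence: AUG GGU UGA AAA UGC UAA
TOY_MRNA = "AUGGGUUGAAAAUGCUAA"

print(translate(TOY_MRNA))                          # UGA read as stop
print(translate(TOY_MRNA, sec_codon_indices={2}))   # UGA read as selenocysteine
```

Read conventionally, the toy sequence stops at the UGA and gives `MG`. With the third codon flagged, translation continues through it and gives `MGUKC`.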

It is still interesting to see no clear link between the two entries other than this single sentence. One is a protein from a prokaryotic organism (Escherichia coli) and the other is from a eukaryotic organism (Homo sapiens). A BLAST search showed very little sequence similarity between the two entries (even when using the various versions of the sequences). Additionally, the stated functions of the two proteins are very different. These differences suggest that a manual curator using the approach given by UniProt (10.1093/database/bar009) would not consider these two proteins at the same time, and that they would be curated independently.

A possible explanation for the reuse of the sentence would be that both entries reference the same paper. However, the supporting papers ((10.1093/nar/15.17.7178) and (10.1093/nar/15.13.5484) for P07203 and (2941757) for P07658) are different. None of the papers states the sentence as it appears in the annotation, and they neither share any authors nor reference each other. This suggests the sentence was written by hand and then manually copied from one entry to the other.

In later versions of the database the sentence occurs in a total of 84 entries; it was last seen in UniProtKB/Swiss-Prot Version 8 (P83564) and UniProtKB/TrEMBL Version 24 (P83564). The sentence has, presumably, been removed due to the mutation being added to each entry's feature table. Browsing annotations in this way is currently cumbersome and labour intensive, so we are developing tools to help. These tools will be discussed on the blog as they near (beta) completion.

Whilst our analysis is in the very early stages, we can already show a link between two proteins that would be seen as unrelated in the traditional sense, that is, via functional annotation.



Pingback from Michael Bell » Anaylsing selenocysteine sentence reuse and flow.
Time August 31, 2011 at 11:17 am

[...] sentence “Selenocysteine is encoded by the opal codon.” was previously discussed, having been identified through a network view of sentence usage within Swiss-Prot version 9. Our [...]
