Michael Bell

Ph.D. Students Blog


Most frequently occurring sentences in UniProtKB and their propagation through the Web

4 November, 2011 (16:16) | Sentence reuse, Uncategorized | By: mj_bell

As already established, sentence reuse is common within UniProtKB. Some sentences are reused far more often than others, which gives an indication of their information content. Below we show the top 10 sentences for the earliest versions considered here, Swiss-Prot Version 9 and TrEMBL Version 1, and also for Version 15 of UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
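
A ranking of this kind can be produced by counting normalised sentences. The following is only a minimal sketch, assuming the annotation text has already been split into sentences; the function names and the toy input are hypothetical, and this is not necessarily how the figures in this post were generated.

```python
from collections import Counter


def normalise(sentence):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(sentence.lower().split())


def top_sentences(sentences, n=10):
    """Return the n most frequent normalised sentences with their counts."""
    counts = Counter(normalise(s) for s in sentences if s.strip())
    return counts.most_common(n)


# Toy input for illustration only, not real UniProtKB data.
example = [
    "Cytoplasm.",
    "cytoplasm.",
    "Homodimer (By similarity).",
    "Cytoplasm.",
    "Secreted.",
]

for sentence, occurrences in top_sentences(example, n=2):
    print(sentence, occurrences)
```

Lowercasing is an assumption made here because the sentences listed in this post all appear in lowercase.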

Top 10 sentences for Swiss-Prot Version 9:

Sentence | Occurrences
dimer of identical chains. | 141
tetramer of identical chains. | 125
to other ef-hand calcium binding proteins. | 94
ferredoxin are iron-sulfur proteins that transfer electrons in a wide variety of metabolic reactions. | 79
these proteins may be involved in dormant spore’s high resistance to uv light. | 73
nadh + ubiquinone = nad(+) + ubiquinol. | 69
4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c. | 68
cytoplasmic. | 63
avian ovomucoid consist of three homologous tandem kazal family inhibitory domains. | 63
subunit i, ii, and iii form the functional core of the enzyme complex. | 52


Top 10 sentences for UniProtKB/Swiss-Prot Version 15:

Sentence | Occurrences
cytoplasm (by similarity). | 69247
homodimer (by similarity). | 28182
cytoplasm. | 26648
nucleus (by similarity). | 17013
cytoplasm (potential). | 16029
secreted. | 15020
nucleus. | 14855
monomer (by similarity). | 13922
cytoplasm (probable). | 12340
the rnap catalytic core consists of 2 alpha, 1 beta, 1 beta’ and 1 omega subunit. | 11460


Top 10 sentences for TrEMBL Version 1:

Sentence | Occurrences
d-ribulose 1,5-bisphosphate + co(2) = 2 3-phospho-d-glycerate. | 538
integral membrane protein. | 488
belongs to the cytochrome p450 family. | 460
4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c. | 374
subunit i and ii form the functional core of the enzyme complex. | 374
nuclear. | 370
subunit ii binds cu(a) and cytochrome c. | 370
electrons originating in cytochrome c are transferred via heme a and cu(a) to the binuclear center formed by heme a3 and cu(b). | 362
copper a and heme group. | 362
to other mitochondrial or bacterial cox2 subunits. | 362


Top 10 sentences for UniProtKB/TrEMBL Version 15:

Sentence | Occurrences
the sequence shown here is derived from an embl/genbank/ddbj whole genome shotgun (wgs) entry which is preliminary data. | 1518744
endonucleolytic cleavage to 5′- phosphomonoester. | 262057
mitochondrion inner membrane; multi-pass membrane protein (by similarity). | 96557
contains 1 reverse transcriptase domain. | 84828
cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water. | 83892
subunits 1- 3 form the functional core of the enzyme complex. | 83892
cytoplasm (by similarity). | 81184
contains 1 peptidase a2 domain. | 71851
belongs to the heme-copper respiratory oxidase family. | 64848
4 ferrocytochrome c + o(2) + 4 h(+) = 4 ferricytochrome c + 2 h(2)o. | 63881


It is interesting to see which sentences come out on top, and how these change over time. One of the most obvious changes is the drastic reduction in sentence length for Swiss-Prot over time, whereas the top TrEMBL sentences remain similar in length.
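
The change in sentence length for Swiss-Prot can be checked roughly by comparing the mean number of words in the two Swiss-Prot top 10 lists. This is a sketch only: it assumes a "word" is any whitespace-separated token (so "+" and "=" in reaction equations count as words), and the helper name mean_words is hypothetical.

```python
def mean_words(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Top 10 sentences for Swiss-Prot Version 9, as listed in this post.
swissprot_v9 = [
    "dimer of identical chains.",
    "tetramer of identical chains.",
    "to other ef-hand calcium binding proteins.",
    "ferredoxin are iron-sulfur proteins that transfer electrons in a wide "
    "variety of metabolic reactions.",
    "these proteins may be involved in dormant spore’s high resistance to uv light.",
    "nadh + ubiquinone = nad(+) + ubiquinol.",
    "4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c.",
    "cytoplasmic.",
    "avian ovomucoid consist of three homologous tandem kazal family "
    "inhibitory domains.",
    "subunit i, ii, and iii form the functional core of the enzyme complex.",
]

# Top 10 sentences for UniProtKB/Swiss-Prot Version 15, as listed in this post.
swissprot_v15 = [
    "cytoplasm (by similarity).",
    "homodimer (by similarity).",
    "cytoplasm.",
    "nucleus (by similarity).",
    "cytoplasm (potential).",
    "secreted.",
    "nucleus.",
    "monomer (by similarity).",
    "cytoplasm (probable).",
    "the rnap catalytic core consists of 2 alpha, 1 beta, 1 beta’ and 1 omega subunit.",
]

print("Swiss-Prot Version 9: ", mean_words(swissprot_v9))
print("Swiss-Prot Version 15:", mean_words(swissprot_v15))
```

The same function can be applied to the two TrEMBL lists to compare them in the same way.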

However, whilst this view is interesting, its main purpose is to serve as an initial test of how sentences from UniProtKB propagate through the web, if at all. We would expect annotations to flow to various other places, whether blogs and webpages such as this one or more formal resources such as other databases. Checking the flow of the most popular sentences will indicate whether further investigation of less frequently used sentences is worthwhile.

As an initial and simple test, we can search Google for exact sentences that contain five or more words. We choose sentences with five or more words because a matching sentence of that length is less likely to have been written independently by chance. Additionally, sentences with five or more words appear to hold more information content (this will be discussed further in future blog posts).
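
The filtering and query-building step can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, words are counted as whitespace-separated tokens, and the exact-phrase search is expressed by wrapping the sentence in double quotes in an ordinary Google search URL. The sketch only builds the URL and does not send any request.

```python
from urllib.parse import urlencode

MIN_WORDS = 5  # threshold taken from the text: five or more words


def is_searchable(sentence, min_words=MIN_WORDS):
    """True if the sentence has at least min_words whitespace-separated tokens."""
    return len(sentence.split()) >= min_words


def exact_phrase_url(sentence):
    """Build a search URL that asks for the sentence as an exact phrase."""
    # Double quotes inside the sentence would end the phrase early, so drop them.
    phrase = sentence.replace('"', " ").strip()
    return "https://www.google.com/search?" + urlencode({"q": '"' + phrase + '"'})


candidates = [
    "cytoplasm (by similarity).",
    "cytochrome c oxidase is the component of the respiratory chain "
    "that catalyzes the reduction of oxygen to water.",
]

for sentence in candidates:
    if is_searchable(sentence):
        print(exact_phrase_url(sentence))
```

Here only the second candidate passes the five-word filter, so only one URL is printed.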

An example result can be viewed for the sentence “cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water.”, which occurs with high frequency in UniProtKB/TrEMBL Version 15. It is hard to get a reasonable figure for how many Google hits this search returns: result counts are, at best, a rough estimate, and the way these estimates are calculated causes particular errors for some of the sentences we are searching for, given that we only want exact matches (more information about this can be found here and here). Nevertheless, we can see that the sentence has spread to a number of other places. These include STRING, PhosphoSite, DrugBank and a paper published in the journal Neurobiology of Aging.

Some of these pages state that the information was pulled from UniProtKB, others link to UniProtKB, and others make no reference to UniProtKB at all. We do have to consider that UniProtKB could have taken this sentence from another location, or that the sentence was written elsewhere independently by chance, although that is unlikely to explain every case. Given that the sentence appears in a journal article, it is possible that the article is the source of the sentence, which would be consistent with the UniProtKB curation protocol.

Being unable to easily identify the original source of such data isn’t uncommon, and it highlights one of the major problems we are attempting to address. We know some of the sites pull the data from UniProtKB, and we assume that those mentioning UniProtKB also do, even if this is not explicitly stated. This can cause major issues should the annotation be found to be erroneous. Do these sites run regular checks, either automated or manual, for changes to the original annotation or entry, or would the annotation remain unchanged? This simple example highlights the difficulty of detecting and tracing error propagation and provenance, and the subsequent impact on assessing an annotation’s quality and correctness.
