Most frequently occurring sentences in UniProtKB and their propagation through the Web
As already established, sentence reuse is common within UniProtKB. Some sentences are reused far more often than others, which gives an indication of their information content. Below we show the top 10 sentences for the first versions of Swiss-Prot and TrEMBL, and also for Version 15 of UniProtKB/TrEMBL and UniProtKB/Swiss-Prot.
Top 10 sentences for Swiss-Prot Version 9:
|dimer of identical chains.||141|
|tetramer of identical chains.||125|
|to other ef-hand calcium binding proteins.||94|
|ferredoxin are iron-sulfur proteins that transfer electrons in a wide variety of metabolic reactions.||79|
|these proteins may be involved in dormant spore’s high resistance to uv light.||73|
|nadh + ubiquinone = nad(+) + ubiquinol.||69|
|4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c.||68|
|avian ovomucoid consist of three homologous tandem kazal family inhibitory domains.||63|
|subunit i, ii, and iii form the functional core of the enzyme complex.||52|
Top 10 sentences for UniProtKB/Swiss-Prot Version 15:
|cytoplasm (by similarity).||69247|
|homodimer (by similarity).||28182|
|nucleus (by similarity).||17013|
|monomer (by similarity).||13922|
|the rnap catalytic core consists of 2 alpha, 1 beta, 1 beta’ and 1 omega subunit.||11460|
Top 10 sentences for TrEMBL Version 1:
|d-ribulose 1,5-bisphosphate + co(2) = 2 3-phospho-d-glycerate.||538|
|integral membrane protein.||488|
|belongs to the cytochrome p450 family.||460|
|4 ferrocytochrome c + o(2) = 2 h(2)o + 4 ferricytochrome c.||374|
|subunit i and ii form the functional core of the enzyme complex.||374|
|subunit ii binds cu(a) and cytochrome c.||370|
|electrons originating in cytochrome c are transferred via heme a and cu(a) to the binuclear center formed by heme a3 and cu(b).||362|
|copper a and heme group.||362|
|to other mitochondrial or bacterial cox2 subunits.||362|
Top 10 sentences for UniProtKB/TrEMBL Version 15:
|the sequence shown here is derived from an embl/genbank/ddbj whole genome shotgun (wgs) entry which is preliminary data.||1518744|
|endonucleolytic cleavage to 5′- phosphomonoester.||262057|
|mitochondrion inner membrane; multi-pass membrane protein (by similarity).||96557|
|contains 1 reverse transcriptase domain.||84828|
|cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water.||83892|
|subunits 1- 3 form the functional core of the enzyme complex.||83892|
|cytoplasm (by similarity).||81184|
|contains 1 peptidase a2 domain.||71851|
|belongs to the heme-copper respiratory oxidase family.||64848|
|4 ferrocytochrome c + o(2) + 4 h(+) = 4 ferricytochrome c + 2 h(2)o.||63881|
It is interesting to see which sentences come out on top, and how they change over time. The most obvious change is the drastic reduction in sentence length for Swiss-Prot over time, whereas sentence length in TrEMBL stays roughly the same.
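A ranking like the ones above, together with a rough sentence-length figure, could be produced along the following lines. This is only a minimal sketch: the function names are hypothetical, and it assumes the annotation text has already been split into sentences and that lowercasing is the only normalisation applied (the lists above are lowercase, but the exact processing is not described here).

```python
from collections import Counter


def top_sentences(sentences, n=10):
    """Return the n most frequent sentences as (sentence, count) pairs.

    Assumption: sentences are compared after stripping surrounding
    whitespace and lowercasing; empty strings are ignored.
    """
    normalised = [s.strip().lower() for s in sentences if s.strip()]
    return Counter(normalised).most_common(n)


def mean_word_count(ranked):
    """Unweighted mean number of whitespace-separated words per ranked sentence."""
    if not ranked:
        return 0.0
    return sum(len(sentence.split()) for sentence, _ in ranked) / len(ranked)


if __name__ == "__main__":
    # Toy input, not real UniProtKB data.
    toy = [
        "Cytoplasm (by similarity).",
        "cytoplasm (by similarity).",
        "monomer (by similarity).",
    ]
    ranked = top_sentences(toy, n=2)
    print(ranked)
    print(mean_word_count(ranked))
```

Comparing `mean_word_count` across database versions is one simple way to put a number on the change in sentence length; weighting by frequency would be an equally reasonable choice.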
However, whilst this view is interesting, its main purpose is to serve as an initial test of whether, and how, sentences from UniProtKB propagate through the web. We would expect annotations to flow to various other places, whether blogs and webpages such as this one, or more formal destinations such as other databases. Checking the flow of the most popular sentences will indicate whether further investigation of less frequently used sentences is worthwhile.
As an initial and simple test we can search Google for exact sentences that contain five or more words. We choose sentences with five or more words because a matching sentence of that length is less likely to have been written independently by chance. Additionally, sentences with five or more words appear to hold more information content (this will be discussed further in future blog posts).
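The filtering and exact-phrase step can be sketched as follows. The helper names are hypothetical, words are assumed to be whitespace-separated tokens, and the URL simply wraps the sentence in double quotes and passes it as the `q` parameter of a Google search; it is an illustration of the idea rather than the script actually used.

```python
from urllib.parse import quote_plus

MIN_WORDS = 5  # threshold described in the text


def is_searchable(sentence, min_words=MIN_WORDS):
    """True if the sentence has at least min_words whitespace-separated words."""
    return len(sentence.split()) >= min_words


def exact_phrase_query(sentence):
    """Wrap the sentence in double quotes so it is searched as an exact phrase.

    Embedded double quotes are dropped so they cannot end the phrase early.
    """
    phrase = sentence.strip().replace('"', "")
    return f'"{phrase}"'


def search_url(sentence, base="https://www.google.com/search"):
    """Build a search URL for the exact phrase (base URL is an assumption)."""
    return f"{base}?q={quote_plus(exact_phrase_query(sentence))}"


if __name__ == "__main__":
    candidates = [
        "cytoplasm (by similarity).",
        "belongs to the cytochrome p450 family.",
    ]
    for sentence in candidates:
        if is_searchable(sentence):
            print(search_url(sentence))
```

Only the second candidate passes the filter here, since the first has three words.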
An example result can be viewed for the sentence “cytochrome c oxidase is the component of the respiratory chain that catalyzes the reduction of oxygen to water.”, which occurs with high frequency in UniProtKB/TrEMBL version 15. It is hard to get a reasonable figure for how many Google hits this search returns. Reported result counts are, at best, a rough estimate, and the way these estimates are calculated causes particular errors for some of the sentences we are searching for, given that we only want exact matches. (More information about this can be found here and here.) Nevertheless, we can see that the sentence has spread to a number of other places. These include STRING, PhosphoSite, DrugBank and a paper published in the Neurobiology of Aging journal.
Some of these pages state that the information was pulled from UniProtKB, others link to UniProtKB, and others make no reference to UniProtKB at all. We do have to consider that UniProtKB could have pulled this sentence from another location, or that the sentence was written elsewhere by chance, although the latter is unlikely to hold in every case. Given that the sentence appears in a journal article, it is possible that the article is the source of the sentence, which would be consistent with the UniProtKB curation protocol.
Being unable to easily identify the original source of this data isn’t uncommon, and it highlights one of the major problems we are attempting to address. We know some of the sites pull the data from UniProtKB, and we assume those mentioning UniProtKB also do, even if this is not explicitly stated. This can cause major issues should the annotation be found to be erroneous. Do these sites carry out regular checks, either automated or manual, for changes to the original annotation or entry, or would the copied annotation remain unchanged? This simple example highlights the difficulty of detecting and tracing error propagation and provenance, and the subsequent impact on assessing an annotation’s quality and correctness.