Michael Bell

Ph.D. Students Blog


UniProtKB and Sentence Reuse

25 April, 2011 (12:26) | Annotation Quality, Sentence reuse | By: mj_bell

Previously we looked at the occurrences of words within UniProtKB and saw varying degrees of reuse. What about whole sentences? Whether an annotation is created manually or automatically, its sentences are frequently reused (i.e. copied and pasted from one entry to another), as detailed in the curation protocol (10.1093/database/bar009). By investigating this reuse, we can track annotation flow through database versions, analyse sentence distribution and provide a useful interface to this data by offering, for example, a concordance view.

Having extracted sentences for use in this analysis, we needed little additional effort to count the occurrences of each sentence in each database version. As a first step, we thought it would be interesting to see whether fitting a power law to the sentences would give similar results to those from fitting to words. This was done for both Swiss-Prot and TrEMBL. Graphs for UniProtKB/TrEMBL versions 1 and 13 are shown below:


The TrEMBL graphs show the same pattern as the word graphs: over time, the amount of reuse increases heavily. Graphs for Swiss-Prot version 20 and UniProtKB/Swiss-Prot version 11 are shown below:


Like TrEMBL, these graphs also follow the pattern of the word graphs. Over time Swiss-Prot becomes more mature, with the clear development of two slopes.

These graphs show that sentence reuse occurs in both TrEMBL and Swiss-Prot, and that the body of reused sentences increases over time. Given this, analysing the sentence distribution in more detail appears worthwhile.
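For readers who want to try something similar, the counting and fitting steps described above can be sketched roughly as follows. This is only an illustration, not the code used for the graphs in this post: the sentence splitting is deliberately naive, the function names `sentence_counts` and `loglog_slope` and the example entries are made up, and the fit is a plain least-squares line through the log-log rank-frequency data, which is one of several ways to estimate a power-law exponent and not necessarily the one used here.

```python
import math
from collections import Counter


def sentence_counts(entries):
    """Count how often each sentence occurs across a list of annotation texts."""
    counts = Counter()
    for text in entries:
        # Naive split on full stops (an assumption); real annotation
        # text would need more careful sentence extraction.
        for sentence in text.split("."):
            sentence = sentence.strip()
            if sentence:
                counts[sentence] += 1
    return counts


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct sentences to fit a line")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical annotation texts standing in for one database version.
entries = [
    "Belongs to the kinase family. Binds ATP.",
    "Belongs to the kinase family.",
    "Binds ATP. Belongs to the kinase family.",
    "Secreted.",
]
counts = sentence_counts(entries)
slope = loglog_slope(counts)
print(counts.most_common())
print(slope)
```

Running this once per database version gives one slope per version, which can then be compared across versions. A single straight-line fit like this would not capture the two slopes seen in the later Swiss-Prot graphs; that would need a piecewise fit.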
