UniProtKB and Sentence Reuse

25 April, 2011 (12:26) | Annotation Quality, Sentence reuse | By: mj_bell

Previously we looked at the occurrence of words within UniProtKB and saw varying degrees of reuse. What about whole sentences? When an entry is annotated, whether manually or automatically, annotations are frequently reused (i.e. sentences are copied and pasted from one entry to another), as detailed in the curation protocol (DOI: 10.1093/database/bar009). By investigating this reuse, we can track annotation flow through database versions, analyse sentence distribution and provide a useful interface to this data by offering, for example, a concordance view.

Having already extracted sentences for this analysis, it took little additional effort to count the occurrences of each sentence in each database version. As a first step, we thought it would be interesting to see whether fitting a power law to the sentences would give similar results to those from fitting to words. This was done for both Swiss-Prot and TrEMBL. Graphs for UniProtKB/TrEMBL versions 1 and 13 are shown below:

The TrEMBL graphs show the same pattern as the word graphs: over time, the amount of reuse increases heavily. Graphs for Swiss-Prot version 20 and UniProtKB/Swiss-Prot version 11 are shown below:

Like TrEMBL, these graphs also follow the pattern of the word graphs. Over time, Swiss-Prot appears to become more mature, with the clear development of two slopes.

These graphs show that sentence reuse occurs in both TrEMBL and Swiss-Prot, and that the body of reused sentences grows over time. Given this, analysing the sentence distribution in more detail looks worthwhile.
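
For readers who want to try something similar, the counting and fitting step described earlier can be sketched roughly as follows. This is only an illustration and not the code used to produce the graphs: the function names are made up for this example, sentences are assumed to have already been extracted into a plain list of strings, and the fit is a simple least-squares line in log-log space, which is a crude stand-in for whatever fitting method you prefer (maximum-likelihood fitting is generally considered more reliable for power laws).

```python
import math
from collections import Counter


def sentence_counts(sentences):
    """Map each distinct sentence to the number of times it occurs.

    Assumes `sentences` is an iterable of already-extracted sentence strings.
    """
    return Counter(s.strip() for s in sentences if s.strip())


def frequency_distribution(counts):
    """Map an occurrence count k to the number of distinct sentences seen k times."""
    return Counter(counts.values())


def fit_power_law(distribution):
    """Least-squares fit of log10(n) = intercept + slope * log10(k).

    `distribution` maps k (occurrences) to n (number of sentences with k
    occurrences). Returns (slope, intercept). This is a rough log-log
    regression, not a maximum-likelihood estimate.
    """
    xs = [math.log10(k) for k in distribution]
    ys = [math.log10(n) for n in distribution.values()]
    if len(xs) < 2:
        raise ValueError("need at least two distinct occurrence counts")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy input; real input would be the sentences of one database version.
    sentences = [
        "Belongs to the protein kinase superfamily.",
        "Belongs to the protein kinase superfamily.",
        "Belongs to the protein kinase superfamily.",
        "Binds 1 zinc ion per subunit.",
        "Binds 1 zinc ion per subunit.",
        "Expressed in liver.",
    ]
    dist = frequency_distribution(sentence_counts(sentences))
    print(fit_power_law(dist))
```

Running the same functions over each database version in turn would give one slope per version, which is one simple way to compare how reuse changes over time.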



