Levels of sentence reuse in UniProtKB
As is frequently highlighted, the volume of data added to biological databases is ever increasing, typically at an exponential rate. This is true of the number of entries added over time to both UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, as illustrated below:
UniProtKB offers a number of detailed statistics for each release of Swiss-Prot and TrEMBL, including the total number of species represented (and which species are most represented), sequence sizes, total number of references, and so on. These statistics are used for various analyses, such as analysing database growth and total coverage. As discussed previously, we have started to analyse sentence reuse within UniProtKB, and can provide additional statistics from a sentence viewpoint. For example, we can see that the total number of sentences within Swiss-Prot and TrEMBL is also increasing at an exponential rate:
This level of growth isn’t unexpected. As more entries are added to UniProtKB, more sentences are either added or copied to these entries, as each entry contains at least a single sentence of annotation. It is interesting to see that TrEMBL and Swiss-Prot had similar numbers of sentences until the last four versions, where TrEMBL convincingly overtook Swiss-Prot. Perhaps a more interesting perspective is to look at how many unique sentences occur within each dataset:
This shows a steady linear increase of unique sentences within Swiss-Prot. For TrEMBL the number of unique sentences appears to stay at a similar level, but a clearer indication is shown below:
Whilst the levels initially show a constant increase, they then fluctuate erratically. Finally, we can view the percentage of unique sentences in both TrEMBL and Swiss-Prot:
This shows that the percentage of unique sentences decreases over time. It shows a steady decline for Swiss-Prot, whilst TrEMBL seems to jump down in steps. Given that TrEMBL is created automatically, these jumps and the erratic levels of unique sentences perhaps relate to changes in the annotation algorithm. One of our reasons for analysing sentence reuse is to determine which sentences have more information content than others.
The more frequently a sentence occurs, the more generic it tends to be. Therefore, a sentence that is unique to an individual entry would be expected to have a higher information content than one that appears in numerous entries. For example, the sentences “atp + h(2)o + h(+)(in) = adp + phosphate + h(+)(out).”, “contains 1 ring-type zinc finger” and “monomer (by similarity).” all occur in numerous entries, whereas the sentences “has weak antifungal activity toward c.comatus and p.piricola”, “the ligand for sev is the boss (bride of sevenless) protein on the surface of the neighboring r8 cell.” and “brain; in the cal region of hippocampus, the medial habenula, and raphe nuclei.” each occur in only one entry within a dataset. Together with the levels of reuse illustrated earlier, this suggests that sentences that are unique to an entry, or show very low levels of reuse, provide better quality information to a reader.
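To make the counting behind these statistics concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline used for the analysis above. It assumes each entry's annotation has already been split into lower-cased sentences, that sentences are matched as exact strings, and that "unique" means "distinct". The names `sentence_entry_counts`, `reuse_summary` and `self_information`, the entry labels, and the pairing of sentences to entries in the toy data are all hypothetical. The last function uses self-information, log2(N / n), as one possible way to put a number on "more generic"; the text above does not say which measure of information content, if any, was used.

```python
import math
from collections import Counter


def sentence_entry_counts(entries):
    """Map each sentence to the number of entries it appears in.

    `entries` is a hypothetical mapping of entry label -> list of
    annotation sentences; sentences are compared as exact strings.
    """
    counts = Counter()
    for sentences in entries.values():
        # set() so a sentence repeated inside one entry is counted once.
        counts.update(set(sentences))
    return counts


def reuse_summary(entries):
    """Totals comparable to the per-release statistics discussed above."""
    counts = sentence_entry_counts(entries)
    total = sum(counts.values())      # sentence occurrences across entries
    distinct = len(counts)            # assumption: "unique" means distinct
    single_entry = sum(1 for n in counts.values() if n == 1)
    return {
        "total": total,
        "distinct": distinct,
        "single_entry": single_entry,
        "percent_distinct": 100.0 * distinct / total if total else 0.0,
    }


def self_information(counts, num_entries):
    """Bits of self-information per sentence: log2(num_entries / n).

    Assumed illustrative measure: a sentence found in every entry scores 0,
    and rarer sentences score higher.
    """
    return {s: math.log2(num_entries / n) for s, n in counts.items()}


# Toy data: the sentences are quoted from the text, but which entry holds
# which sentence is invented purely for illustration.
toy_entries = {
    "entry_1": ["monomer (by similarity).", "contains 1 ring-type zinc finger"],
    "entry_2": ["monomer (by similarity).",
                "has weak antifungal activity toward c.comatus and p.piricola"],
    "entry_3": ["monomer (by similarity).", "contains 1 ring-type zinc finger"],
}

summary = reuse_summary(toy_entries)
bits = self_information(sentence_entry_counts(toy_entries), len(toy_entries))
print(summary)
for sentence, score in sorted(bits.items(), key=lambda item: item[1]):
    print(f"{score:.2f} bits  {sentence}")
```

Running the same summary over each release in turn would give one data point per release for the total, distinct and percentage series. Exact string matching is the simplest choice here; any normalisation beyond lower-casing (whitespace, trailing punctuation) would change the counts and is left out of this sketch.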