Analysing selenocysteine sentence reuse and flow.
The sentence “Selenocysteine is encoded by the opal codon.” was previously discussed, having been identified through a network view of sentence usage within Swiss-Prot version 9. Our interest was mainly due to it being present in two entries in different interconnected groups. In that blog entry we looked at the similarities (or lack thereof) between the entries that shared the sentence. Given the development of dynamic graphs (as discussed in the last blog post), we thought it worth revisiting to see what additional information we can extract from the annotation space. Firstly, we can look at the graph for sentence occurrences over time:
The graph shows a steady increase in reuse of the sentence in Swiss-Prot until version 44, when it drops steeply from 81 occurrences to 9. We can also see that in TrEMBL there is only ever one occurrence. Previously we knew that the sentence was last seen in UniProtKB/Swiss-Prot version 8, when it occurred only twice; however, we didn’t know about the big decline in its usage before its removal, nor could we easily tell how many of the entries were from TrEMBL and how many from Swiss-Prot. The removal of this sentence appears to be due to the incorporation of selenocysteine information into an entry’s feature table. We are unsure when selenocysteine information was added to the feature table; presumably it was after Swiss-Prot version 44, with the remaining entries initially being missed. There doesn’t appear to be any other clear reason for this sharp decline.
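As a rough illustration of how such a time series could be derived, the sketch below counts, per release and per database, the entries containing a sentence, and flags releases where the count falls sharply. The records are toy placeholders rather than real UniProtKB data, and the function names and the 50% threshold are our own assumptions for illustration only.

```python
from collections import Counter

# Toy records, NOT real UniProtKB data: (release version, database, accession).
# Each tuple means "the sentence was present in this entry in this release".
occurrences = [
    (1, "Swiss-Prot", "ENTRY_A"),
    (1, "Swiss-Prot", "ENTRY_B"),
    (2, "Swiss-Prot", "ENTRY_A"),
    (2, "Swiss-Prot", "ENTRY_B"),
    (2, "Swiss-Prot", "ENTRY_C"),
    (2, "TrEMBL", "ENTRY_D"),
    (3, "Swiss-Prot", "ENTRY_A"),
]


def count_by_version(records):
    """Count the entries containing the sentence per (version, database)."""
    return Counter((version, database) for version, database, _ in records)


def sharp_drops(counts, database, ratio=0.5):
    """Return versions whose count is below `ratio` times the previous count.

    Only releases in which the sentence occurs at least once are compared,
    so a complete removal is not reported by this simple check.
    """
    versions = sorted(v for v, db in counts if db == database)
    drops = []
    for prev, curr in zip(versions, versions[1:]):
        if counts[(curr, database)] < ratio * counts[(prev, database)]:
            drops.append(curr)
    return drops


counts = count_by_version(occurrences)
print(counts[(2, "Swiss-Prot")])
print(sharp_drops(counts, "Swiss-Prot"))
```

A fixed ratio is a crude choice; for real data a threshold relative to the overall growth of the database may be more appropriate.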
We can also view which entries the sentence occurs in, and for which database versions:
In this view, each point represents a sentence that was in a given entry (as stated on the x-axis) for a given version (release date on the y-axis). A red point indicates the entry is in the TrEMBL dataset, whilst blue indicates inclusion in Swiss-Prot. This view shows a couple of interesting points:
- Reappearance within an entry
- Transient appearance
- Multiple entry merges
In two entries (P18283 and P12079) the sentence is removed and then later reappears. In both cases, the sentence is removed after Swiss-Prot version 23, reappearing in version 38 for P18283 and version 42 for P12079. Looking at the history of P18283 and P12079, we can see that the sentence was “replaced” with “THE ACTIVE-SITE SELENOCYSTEINE IS ENCODED BY THE OPAL CODON, UGA (BY SIMILARITY).” The “by similarity” qualifier is mainly used in TrEMBL, where the information is inferred computationally. This is illustrated below, with the corresponding graph for the similarity sentence:
Whilst the sentence occurs mostly in TrEMBL entries, it does appear in some Swiss-Prot entries.
Returning to the original sentence, in the latest version of P12079 (which has merged with P11352) a comment has been added that states “sequence was originally thought to originate from human.” Looking at the history of both entries, this confusion appears to have led to the uncertainty about the selenocysteine annotation. The sentence reappears in P12079 when it is merged with P11352. There is no clear indication in P18283 why the sentence was reinstated. In the latest version of both these entries, the encoding of selenocysteine is documented in the feature table.
In one entry (P21765) the sentence appears in a single version and is removed in the subsequent one. In the latest version of the entry there is no mention of selenocysteine, so it would appear this was an erroneous annotation. The entry Q49613 also appears only once in the second graph; however, that entry was deleted after one version.
A large majority of the entries that share the sentence in the graph are merged in later versions. For example, in the latest version accessions P22352; O43787; Q86W78; Q9NZ74 and Q9UEL1 are all merged into a single entry. In our graph, they are all independent entries containing the sentence. This is also the case for a number of other entries on the graph, including the final two entries to have the sentence.
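The first two patterns above (reappearance and transient appearance) could be detected automatically from the set of versions in which an entry carries the sentence. The following is a minimal sketch, assuming releases are numbered with consecutive integers and that the entry itself exists in every release in between; the version sets in the usage lines are hypothetical, and the labels are our own.

```python
def presence_runs(versions_present):
    """Group version numbers into runs of consecutive releases."""
    runs = []
    for version in sorted(versions_present):
        if runs and version == runs[-1][-1] + 1:
            runs[-1].append(version)  # extends the current run
        else:
            runs.append([version])    # a gap (or the first version) starts a new run
    return runs


def classify(versions_present):
    """Label the presence pattern of a sentence within one entry."""
    runs = presence_runs(versions_present)
    if not runs:
        return "absent"
    if len(runs) > 1:
        return "reappearing"  # removed and later reinstated
    if len(runs[0]) == 1:
        return "transient"    # present in a single release only
    return "stable"


# Hypothetical version sets, for illustration only.
print(classify({20, 21, 22, 23, 38, 39, 40}))
print(classify({30}))
print(classify({5, 6, 7}))
```

Note that an entry deleted after one version (as with Q49613) would also be labelled transient by this sketch, so the lifetime of the entry itself needs to be checked separately.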
The analysis above gave some interesting cases and patterns — but what value can we take from this? Whilst this is only a single example, we can start to see possible uses and applications for this data. However, all of these will require further analysis:
- Annotation error rate in Swiss-Prot has been estimated to be as high as 43% (http://dx.doi.org/10.1093/bioinformatics/bti1206). We have seen that transient or reappearing annotations appear to be associated with an annotation error or entry error. This could suggest that more stable annotations are more likely to be correct (http://dx.doi.org/10.1007/978-3-642-02879-3_7).
- As discussed and explored during the network analysis blog post, this approach allows us to see entries for which a sentence is (or was) included. We have seen that it is possible for entries with limited sequence similarity to share common annotations. This approach allows UniProt entries to be browsed for shared commonality through annotation, as opposed to just sequence similarity.
- The first graph showed levels of sentence reuse within Swiss-Prot and TrEMBL, culminating in a large decline before complete removal. In this instance the sharp decline appears to reflect the incorporation of the information into the feature table, but major changes or fluctuations for other sentences could indicate additional information, such as a large erroneous annotation propagation.
- Levels of reuse can give an indication of the quality of the sentence. A heavily reused sentence is typically more generic than a sentence that is unique to a single entry. This suggests unique sentences are more ‘meaningful’ than generic sentences, as quantified using approaches such as Inverse Document Frequency and Term Frequency-Inverse Document Frequency (http://dx.doi.org/10.1108/00220410410560582).
- As stated in the curation protocol (http://dx.doi.org/10.1093/database/bar009), annotations are standardised across homologous proteins. This means it is possible that the inclusion of a sentence in an entry will propagate to homologous entries for a given database version. This is shown in the graph, and is expected. Should the sentence be removed from one of these entries, it is also possible that the sentence will be removed from the homologous entries. We can use these approaches to determine if the sentence still exists amongst any of these entries.
Each sentence has to originate in an individual entry (or group of entries in the same version). In this particular example, the sentence originated in two entries, and it was still in these entries before removal. Should the sentence have been removed from the original entries, it may be worth reassessing other entries that subsequently include this sentence. If the sentence was found to be false in the root entry, then it is possible that an error has propagated into numerous other entries.
- We have seen in this example that the majority of entries sharing this sentence have been merged. It is possible that an entry is merged with another entry, or entries, which do not share a particular sentence. We could hypothesise that a sentence shared by all of the entries prior to a merge is more likely to be correct than one which isn’t shared by the other entry (or entries).
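To make the point about reuse levels in the list above more concrete, the sketch below computes an Inverse Document Frequency score per sentence, treating each entry as a “document” and each whole sentence as a “term”. This is one common unsmoothed variant, idf = log(N / df); the entries and sentences are placeholders rather than real annotations, and treating sentences as terms is our assumption for illustration.

```python
import math


def sentence_idf(entries):
    """Compute IDF per sentence.

    entries: mapping of accession -> set of sentences in that entry.
    Returns a mapping of sentence -> log(N / df), where N is the number of
    entries and df is the number of entries containing the sentence.
    """
    n_entries = len(entries)
    doc_freq = {}
    for sentences in entries.values():
        for sentence in sentences:
            doc_freq[sentence] = doc_freq.get(sentence, 0) + 1
    return {s: math.log(n_entries / df) for s, df in doc_freq.items()}


# Placeholder corpus, NOT real annotations.
toy_entries = {
    "ENTRY_A": {"generic sentence", "unique sentence"},
    "ENTRY_B": {"generic sentence"},
    "ENTRY_C": {"generic sentence"},
    "ENTRY_D": {"generic sentence"},
}

idf = sentence_idf(toy_entries)
for sentence, score in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{score:.3f}  {sentence}")
```

A sentence present in every entry scores zero, whilst a sentence unique to one entry receives the highest score, which matches the intuition that heavily reused sentences are more generic.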
As a final point, it is worth mentioning sentence similarity. We have purposefully focused on the reuse of exact sentences, as we know they are standardised and copied between entries as part of the curation protocol (http://dx.doi.org/10.1093/database/bar009), thus allowing us to analyse annotation flow and reuse. However, we can see that there are sentences that are very similar in the database:
- the active-site selenocysteine is encoded by the opal codon, uga.
- the active-site selenocysteine is encoded by the opal codon, uga (by similarity).
- the selenocysteine is encoded by the opal codon, uga.
- the active- site selenocysteine is encoded by the opal codon, uga.
- the active-site selenocysteine is encoded by a normal cys codon.
In one case the difference is an extra space (“active- site”), which is only in a single entry and is likely a copy-and-paste error. The sentence “the selenocysteine is encoded by the opal codon, uga.” occurs in entries P26970 and P26971, and was later replaced with “the active-site selenocysteine is encoded by the opal codon, uga.” in Swiss-Prot version 30. The first four sentences are semantically similar; that is, the message they convey is pretty much the same, and you could argue they should all be treated as the same sentence.
However, we need to be careful with some sentences that appear highly similar. For example, the last sentence (“the active-site selenocysteine is encoded by a normal cys codon.”) states that, in the entries it is used in, selenocysteine is encoded by a cysteine codon, not the opal codon. Whilst it is semantically similar in the sense that selenocysteine is encoded by a particular codon, the codon itself is different. Therefore, investigating sentences that are semantically similar appears to be worthwhile, but care must be taken if treating similar sentences “as one”.
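A small sketch can illustrate both sides of this. Below, a hypothetical `normalise` function collapses whitespace differences and drops the “(by similarity)” qualifier, and Python's standard `difflib.SequenceMatcher` gives a character-level similarity score. Whether the qualifier should really be ignored is an assumption made here purely for illustration, and string similarity is only a stand-in for semantic similarity.

```python
import re
from difflib import SequenceMatcher

sentences = [
    "the active-site selenocysteine is encoded by the opal codon, uga.",
    "the active-site selenocysteine is encoded by the opal codon, uga (by similarity).",
    "the selenocysteine is encoded by the opal codon, uga.",
    "the active- site selenocysteine is encoded by the opal codon, uga.",
    "the active-site selenocysteine is encoded by a normal cys codon.",
]


def normalise(sentence):
    """Reduce superficial differences between sentences (illustrative rules only)."""
    s = sentence.lower()
    s = re.sub(r"\s*\(by similarity\)", "", s)  # assumption: ignore the qualifier
    s = re.sub(r"-\s+", "-", s)                  # "active- site" -> "active-site"
    s = re.sub(r"\s+", " ", s).strip()           # collapse repeated whitespace
    return s


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalised sentences."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


reference = sentences[0]
for sentence in sentences[1:]:
    same = normalise(sentence) == normalise(reference)
    print(f"{similarity(reference, sentence):.2f}  identical after normalising: {same}  {sentence}")
```

The “normal cys codon” sentence still scores highly against the reference despite describing a different codon, which is exactly why a similarity threshold alone should not be used to merge sentences.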