UniProtKB and Benford’s Law

In the last blog post we looked at data parsing, and how Zipf’s law could possibly be used to detect parsing errors. Whilst reading a recent blog post by Ben Goldacre I was reminded of Benford’s law – which shares a number of similarities with Zipf’s – and considered how it may also be applicable […]

Analysing selenocysteine sentence reuse and flow.

The sentence “Selenocysteine is encoded by the opal codon.” was previously discussed, having been identified through a network view of sentence usage within Swiss-Prot version 9. Our interest was mainly due to it being present in two entries in different interconnected groups. In that blog entry we looked at the similarities (or lack thereof) between […]

UniProtKB and Sentence Reuse

Previously we have looked at the occurrences of words within UniProtKB and seen varying degrees of reuse. What about whole sentences? When annotations are created, either manually or automatically, they are frequently reused (i.e. sentences copied and pasted from one entry to another), as detailed in the curation protocol. By investigating this reuse, […]

Word Clouds (Swiss-Prot and TrEMBL)

During my analysis of Swiss-Prot and TrEMBL datasets I have extracted all the words from each version of each dataset and counted their occurrences. A neat way of looking at this data is to create word clouds. I have done this for all versions of Swiss-Prot and TrEMBL. These can be seen with common words […]

Interpreting Power-Law Graphs and Quality

In previous posts we have looked at annotation quality, yet failed to clearly define what we mean by quality. This oversight is fairly common, with many authors stating that something is of ‘high quality’, or of better quality than something else, without stating what makes it high quality or how they quantify quality. Our […]

Swiss-Prot Vs TrEMBL: Annotation Quality

As discussed in the previous post, applying Zipf’s law to annotation and noting the resulting exponent α could hold promise as the basis for a quality metric. UniProtKB/Swiss-Prot is a manually curated and reviewed knowledgebase holding protein information. It is often regarded as the ‘gold standard’ for annotation. Additionally, UniProt offer […]

How to determine annotation quality?

As mentioned in the last post, how do we go about determining annotation quality? We have seen other approaches use, for example, the underlying data structure. This means we cannot use the same approach on all databases. The only thing we can guarantee is that all databases that store annotations do so in text; either structured […]

Why look at annotation quality?

As I continue adding content to my website, I have started to detail a project that took up the majority of the first year of my PhD. Most of the work is complete, and we have a paper well underway, but we are still polishing off the latter parts – details of which will […]