In previous posts we have looked at annotation quality, yet failed to define clearly what we mean by quality. This oversight is fairly common: many authors state that something is of ‘high quality’, or of better quality than something else, without saying what makes it high quality or how they quantify quality.

Our definition of quality is linked to Zipf’s principle of least effort. An annotation that requires less effort from the reader than from the annotator is deemed to be of high quality; an annotation of lesser quality is one that is harder for the reader to interpret. So a high quality annotation is one that is well curated and uses specific terms: the annotator goes against the principle of least effort so that the reader needs the least effort in the overall process. The value of α derived from fitting Zipf’s law allows us to quantify quality and compare datasets.

Another oversight has been the lack of discussion about the graphs used: what are the values on the x and y axes? How do we tell which dataset has a greater reuse of words? We can address these oversights by revisiting the graph for Swiss-Prot Version 9:

The data shown on the graph is represented as a cumulative distribution function (CDF), a form better known as Pareto’s Law. (Strictly speaking it is the complementary CDF, since it gives the probability of x or more occurrences.) Pareto’s Law, like Zipf’s Law, is a type of power law, and the two are closely linked, providing slightly different ways of looking at the same thing. One of the main differences is that the plotted data is cumulative, which tidies up ‘noisy’ tails: all the words that occur once are shown as a single point on the graph, rather than each being plotted separately (which is what makes the tail noisy in Zipf’s Law). The other main difference, also due to it being a CDF, is that we look at the probability that a word occurs x or more times, rather than stating a word’s exact rank.
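The link between the two laws can be made explicit with a short derivation. This is only a sketch, and the notation is assumed rather than taken from the post: r is a word's rank, x its number of occurrences, s the Zipf exponent, and N the number of distinct words. Ties between words are ignored.

```latex
% Zipf's Law: the word of rank r occurs x times, with
x(r) \propto r^{-s}

% The word of rank r is the r-th most frequent word, so exactly
% r of the N distinct words occur x(r) or more times:
P(X \ge x) = \frac{r}{N}

% Inverting the first relation gives r \propto x^{-1/s}, hence Pareto's form:
P(X \ge x) \propto x^{-1/s}
```

Under these assumptions a straight line on a log-log rank/frequency plot corresponds to a straight line on the cumulative plot, with the two slopes being reciprocals of one another.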

Along the x-axis of the graph we have x, the number of times a word occurs. Along the y-axis we have probabilities. Each data point (indicated by blue circles on the above graph) therefore shows the probability of a word occurring x or more times. If we look at the top left of the graph, we can see that the probability that a word occurs 1 or more times is 1 (as we only look at words that appear in the dataset). If we look to the bottom right of the graph, we can see that the most frequent word in the dataset occurs roughly 10^4 (10,000) times, and the chance of a word in the dataset occurring that many times is roughly 10^-4 (0.0001 or, as a percentage, 0.01%).
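To make the axes concrete, here is a minimal Python sketch of how points like these could be computed from a list of words. The function name `empirical_ccdf` and the toy sentence are illustrative assumptions, not the code or data behind the graphs in this post.

```python
from collections import Counter


def empirical_ccdf(words):
    """Return sorted (x, P(X >= x)) pairs, where X is a word's occurrence count."""
    occurrences = Counter(words)                     # word -> times it occurs
    count_of_counts = Counter(occurrences.values())  # x -> words occurring exactly x times
    total = len(occurrences)                         # number of distinct words

    points = []
    remaining = total  # distinct words occurring x or more times
    for x in sorted(count_of_counts):
        points.append((x, remaining / total))
        remaining -= count_of_counts[x]
    return points


# Toy input (hypothetical): "protein" x3, "binding" x2, "kinase" x1, "domain" x1
toy = "protein binding protein kinase binding protein domain".split()
points = empirical_ccdf(toy)
for x, p in points:
    print(x, p)
```

The first point always has probability 1, matching the top left of the graph, and all words sharing the same count collapse into a single point.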

Additionally, as is often the case, a regression line (a line of best fit) is fitted to the graph. The value of α is extracted from this line (its slope). The power law has to fit the data reasonably well, i.e. the regression line has to follow the data points closely, for the extracted α values to be meaningful. As mentioned in prior posts, we have used a framework to give confidence in the fitting of our data. In some cases, the fit obtained with the current approach is not good enough to allow us to use the extracted α values. We are currently investigating ways to improve this, but in the meantime we can still get useful information by looking at the graphs.
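As a rough illustration of extracting a slope, the sketch below fits an ordinary least-squares line to the points in log-log space and reports R² as a crude indication of fit. This is an assumed, simplified stand-in: it is not the fitting framework mentioned above, and how α is derived from the slope (sign and any offset) depends on the convention in use. The name `loglog_fit` and the synthetic points are hypothetical.

```python
import math


def loglog_fit(points):
    """Least-squares line through (log10 x, log10 p); returns slope, intercept, R^2."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(p) for _, p in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2: share of the variance in log10 p explained by the line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Synthetic points lying exactly on P(X >= x) = x^-1 (not real data)
synthetic = [(x, x ** -1.0) for x in (1, 10, 100, 1000)]
slope, intercept, r_squared = loglog_fit(synthetic)
print(slope, intercept, r_squared)
```

A high R² on a log-log plot is a weak test on its own, which is one reason to prefer a dedicated framework before trusting the extracted values.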

Unsurprisingly, the distribution of the data points on the graphs relates to the underlying data. If a body of data shows a high reuse of words, then the probability that a word occurs x or more times will be higher than in datasets with less reuse. We can illustrate this by revisiting the graph for UniProtKB/Swiss-Prot Vs. UniProtKB/TrEMBL version 15:

You can see for TrEMBL (red triangles) that the probability of a word occurring 100 or more times is almost the same as the probability of it occurring 10 or more times: the initial slope is very flat. In other words, if we pick a word at random from the dataset, the probability of it occurring 10 times or more is almost the same as the probability of it occurring 100 times or more. If you look at the underlying data you see that the most popular words occur very often, and that words with large numbers of occurrences are themselves common. Conversely, if you look at the frequently occurring words in Swiss-Prot you see that their frequency is also high, but this high reuse is not maintained for slightly less common words. Additionally, unlike TrEMBL, Swiss-Prot has more words that occur only a few times (1, 2, 3…20 times). You can see from the graph that the probability of a word occurring x or more times is always higher in TrEMBL than the same value in Swiss-Prot.
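The "flat initial slope" reading can be checked numerically by comparing the probability at two thresholds. The sketch below uses invented occurrence counts purely for illustration; they are not taken from TrEMBL or Swiss-Prot, and `prob_at_least` is a hypothetical helper.

```python
def prob_at_least(counts, x):
    """Fraction of distinct words whose occurrence count is x or more."""
    return sum(1 for c in counts if c >= x) / len(counts)


# Invented occurrence counts, one entry per distinct word (NOT UniProt data)
high_reuse = [300, 200, 150, 120, 5]
low_reuse = [150, 15, 3, 2, 1]

for name, counts in (("high reuse", high_reuse), ("low reuse", low_reuse)):
    p10 = prob_at_least(counts, 10)
    p100 = prob_at_least(counts, 100)
    # A ratio close to 1 means the curve is almost flat between x=10 and x=100
    print(f"{name}: P(X>=10)={p10:.2f}  P(X>=100)={p100:.2f}  ratio={p100 / p10:.2f}")
```

A ratio near 1 corresponds to the flat start of the TrEMBL curve, while a smaller ratio corresponds to a curve that falls away sooner.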

If you look towards the end of the TrEMBL graph, around 10^6, you will see the opposite: an almost vertical line. This shows that the probability of a word occurring slightly more frequently than the previous one is much lower, even though there is little difference in size between the most frequently occurring words. If we look at the data we see that the 3rd to 6th most common words occur 1574337, 1558969, 1439234, and 1428770 times. The difference between these counts is small but, as they are so highly ranked, a word is noticeably less likely to occur 1574337 or more times than 1558969 or more times. Together, these two points (a number of words occurring very frequently, with a smaller subset occurring 1,2,3..10 times) show a high reuse of words within TrEMBL, which isn’t unexpected given that TrEMBL is made up of automated annotations.

Comparing Swiss-Prot Version 9 to UniProtKB/Swiss-Prot Version 15, apart from the size difference one would expect from a growing knowledgebase, a clear two-slope behaviour can be seen developing over time (much clearer when you see the graphs for all versions). This two-slope behaviour has been said to be evident in mature datasets. One way of explaining these slopes is to say we have a ‘core’ body of text, containing standard English terms used in most annotations, which is represented by the second slope (those words that occur more frequently). Over time, this slope moves right and downwards to reflect the increase in size of the database, with the usage growing proportionally. The first slope deals with the new words that are incorporated over time, which would reflect a mature dataset.