## UniProtKB and Benford’s Law

In the last blog post we looked at data parsing, and how Zipf’s law could be used to detect parsing errors. Whilst reading a recent blog post by Ben Goldacre I was reminded of Benford’s law, which shares a number of similarities with Zipf’s, and considered whether it might also be applicable to detecting parsing errors.

Like Zipf’s Law, Benford’s Law is a rather interesting empirical law. The law states that the first digits of a set of numbers aren’t evenly distributed. By first (or leading) digit, we simply mean that we take only the first digit of a number, regardless of its size. So the leading digit of 18362 is 1, whilst the leading digit of 489 is 4. Interestingly, the chance of the leading digit being 1 is around 30%, whereas the chance of the leading digit being 9 is around 5%, mirroring the spacing of grid lines on a logarithmic scale. For a given set of numbers, if each leading digit $d$ (for $d = 1, \dots, 9$) occurs with probability $P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$, then the numbers are said to satisfy Benford’s law.
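As a quick sanity check of the percentages quoted above, the usual statement of Benford’s law, P(d) = log10(1 + 1/d), can be evaluated directly. The snippet below is only a minimal sketch; the function names `leading_digit` and `benford_probability` are our own and purely illustrative.

```python
import math


def leading_digit(n: int) -> int:
    """Return the first (most significant) decimal digit of a non-zero integer."""
    n = abs(n)
    if n == 0:
        raise ValueError("zero has no leading digit")
    # Strip trailing digits until only the most significant one remains.
    while n >= 10:
        n //= 10
    return n


def benford_probability(d: int) -> float:
    """Probability of leading digit d (1-9) under Benford's law."""
    return math.log10(1 + 1 / d)


# Expected share of each leading digit, as a percentage.
expected = {d: 100 * benford_probability(d) for d in range(1, 10)}

for d, pct in expected.items():
    print(f"{d}: {pct:.1f}%")
```

Here `leading_digit(18362)` gives 1 and `leading_digit(489)` gives 4, matching the examples in the text, and the expected share comes out at roughly 30.1% for digit 1 and 4.6% for digit 9.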

Also like Zipf’s Law, Benford’s Law has been seen to hold in numerous applications, provided the data spans multiple orders of magnitude. However, one of the most interesting applications of Benford’s law is the detection of fraud. When numbers are fabricated in an attempt to disguise fraud, people tend to spread the leading digits evenly, whereas naturally occurring numbers do not behave this way. Given this, we are interested to see whether UniProt also follows Benford’s Law and whether it can be used for detecting parsing errors.

We have created a number of graphs for Swiss-Prot and TrEMBL for Benford’s law. On each graph we have plotted orange diamonds to show the distribution predicted by Benford’s law. The X-axis represents the leading digit, with the corresponding percentage shown on the Y-axis. Below are the graphs showing Swiss-Prot version 9 and UniProtKB/Swiss-Prot version 15:

These graphs show that Swiss-Prot follows a Benford distribution. This was also true for intermediate versions of Swiss-Prot. Additionally, we did the same for TrEMBL, with graphs for versions 1, 28 and UniProtKB/TrEMBL version 15 shown below:

For TrEMBL the results are much more scattered, with later versions deviating noticeably from Benford’s Law. We appear to be able to extract similar meaning from these graphs as we did from the Zipf ones. For example, they suggest that Swiss-Prot is more “natural” than TrEMBL and that early versions of TrEMBL are of better quality than later versions.
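For readers who want to reproduce this kind of comparison, the observed percentages can be tallied along the following lines. This is a sketch under assumptions rather than the code used for the graphs above: it assumes the numbers have already been extracted from the database as a list of integers, and it uses powers of two as stand-in data because they are well known to follow Benford’s law closely. The function name `leading_digit_percentages` is hypothetical.

```python
import math
from collections import Counter


def leading_digit_percentages(numbers):
    """Percentage of values starting with each digit 1-9 (zeros are skipped)."""
    digits = [int(str(abs(n))[0]) for n in numbers if n != 0]
    if not digits:
        raise ValueError("no non-zero numbers supplied")
    counts = Counter(digits)
    # Counter returns 0 for digits that never occur.
    return {d: 100 * counts[d] / len(digits) for d in range(1, 10)}


# Stand-in data (an assumption, not UniProt data): the first 1000 powers of two.
sample = [2 ** k for k in range(1, 1001)]
observed = leading_digit_percentages(sample)

for d in range(1, 10):
    expected = 100 * math.log10(1 + 1 / d)
    print(f"{d}: observed {observed[d]:5.1f}%  Benford {expected:5.1f}%")
```

Plotting `observed` against the Benford column gives the same kind of picture as the graphs described above, with the Benford values playing the role of the orange diamonds.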

Whilst this view gives some further confidence in the underlying annotation, our investigation into error detection is less conclusive. One of the main issues is that we don’t have much erroneously parsed data to check, only datasets that contain copyright and topic header “errors”. Using Benford’s Law we were unable to detect any “errors” in these datasets. Data that produced a major skew for Zipf’s law had negligible impact on Benford’s law. However, it is likely that different kinds of parsing errors could be detected with Benford’s law that would go unnoticed with Zipf’s law. A combination of both methods could be employed when checking parsed data, and any subsequent work on this would be interesting to see.
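One way such a check might be automated is to reduce the comparison to a single deviation score and compare it between a trusted dataset and a freshly parsed one. The sketch below is only an illustration of that idea, not the method used in this post: the choice of mean absolute deviation, the function name `benford_deviation`, and both sample datasets are assumptions.

```python
import math
from collections import Counter


def benford_deviation(numbers):
    """Mean absolute gap, in percentage points, between the observed
    leading-digit percentages and those predicted by Benford's law."""
    digits = [int(str(abs(n))[0]) for n in numbers if n != 0]
    if not digits:
        raise ValueError("no non-zero numbers supplied")
    counts = Counter(digits)
    total = len(digits)
    return sum(
        abs(100 * counts[d] / total - 100 * math.log10(1 + 1 / d))
        for d in range(1, 10)
    ) / 9


# Illustrative data only: powers of two follow Benford's law closely,
# whereas the integers 1-999 have evenly spread leading digits.
benford_like = [2 ** k for k in range(1, 1001)]
evenly_spread = list(range(1, 1000))

print(f"Benford-like sample:  {benford_deviation(benford_like):.2f}")
print(f"Evenly spread sample: {benford_deviation(evenly_spread):.2f}")
```

A noticeably higher score for a newly parsed dataset than for a known-good one would be a hint to look more closely, although what counts as “noticeably higher” would need to be calibrated against real parsing errors.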