How to determine annotation quality?
As mentioned in the last post, how do we go about determining annotation quality? We have seen other approaches use, for example, the underlying data structure, which means the same approach cannot be applied across all databases. The only thing we can guarantee is that every database storing annotations does so as text, whether structured or unstructured. Given this, we aim to create a generic quality metric for annotations based purely on free text.
As we are looking at, in essence, structured language, it is no surprise we came across linguistic work by George Zipf that appeared promising. Zipf's Law, named (unsurprisingly) after him, is an empirical power law. In essence, it states that the frequency of a word is inversely proportional to its rank (i.e. the most common word is ranked 1st, the second most common 2nd, and so on); more precisely, frequency ∝ rank^(−α), so when the data is plotted on a log-log plot we typically see a straight line with slope −α. Zipfian distributions have been observed in various natural and man-made phenomena, such as earthquakes, city sizes and web pages on the Internet.
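To make this concrete, here is a minimal Python sketch of how α might be estimated from a piece of text. The tokenisation is deliberately naive and the plain least-squares fit on the log-log plot is for illustration only; the function name and approach here are our own, not from the paper discussed below.

```python
import re
from collections import Counter

import numpy as np

def zipf_exponent(text: str) -> float:
    """Estimate the Zipf exponent alpha of a text.

    Fits a straight line to the log-log rank-frequency data;
    since freq ~ rank^(-alpha), the slope of that line is -alpha.
    """
    # Naive tokenisation: lowercase alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())
    # Frequencies sorted most common first; rank 1 is the most common word.
    freqs = np.array(sorted(Counter(words).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope
```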
An obvious observation is that the graph will vary slightly depending on the text used, and thus so will the exponent of the line (α). Indeed, one paper has shown a pattern emerging. Text of very poor quality or that is incomprehensible (e.g. writings by small children) has α values below 1.6 or above 2.4. Between these values we have text of readable quality, but of differing character. Text that favors the reader (e.g. very well written text using specific terms) gets a different score from text that favors the writer (e.g. text written with generic terms). These bands are summarised below, with a sketch of the classification after the list:
- α < 1.6 – Text is incomprehensible
- 1.6 < α < 2 – Favors the writer
- α = 2 – Equal effort levels
- 2 < α < 2.4 – Favors the reader
- α > 2.4 – Text is incomprehensible
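Translating these bands into code is straightforward. A minimal sketch follows, assuming the thresholds above; note that the exact α = 2 case is essentially unreachable with real-valued fits, so boundary values are simply assigned to the nearest readable band:

```python
def classify_alpha(alpha: float) -> str:
    """Map a fitted Zipf exponent onto the quality bands listed above.

    An exact alpha = 2 ("equal effort levels") will essentially never
    occur with real data, so it is folded into the adjacent bands.
    """
    if alpha < 1.6 or alpha > 2.4:
        return "incomprehensible"
    if alpha < 2:
        return "favors the writer"
    return "favors the reader"
```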
This ties in with the Principle of Least Effort, also formulated by George Zipf, which states that humans will typically take the path requiring the least effort to reach their goal. Given this, we would expect computer-generated annotations (which tend towards generic terms, the least-effort path for the writer) to show an α value less than 2, whilst manually curated annotations should obtain a score greater than 2. Should we obtain these results from some sample database, then this approach would appear to hold promise as the basis of a quality metric.
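Putting the two sketches above together, a first test of this expectation over a sample database might look something like the following; the file names here are purely hypothetical:

```python
# Hypothetical comparison of computer-generated vs manually curated annotations.
for label, path in [("generated", "generated_annotations.txt"),
                    ("curated", "curated_annotations.txt")]:
    with open(path) as f:
        alpha = zipf_exponent(f.read())
    print(f"{label}: alpha = {alpha:.2f} ({classify_alpha(alpha)})")
```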