Michael Bell

Ph.D. Students Blog

Have I parsed my data correctly?

22 September, 2011 (16:06) | Miscellaneous | By: mj_bell

The foundation for most of our work has been data parsed from text files. The extracted data has included single words and whole sentences from gigabytes of raw text. With such overwhelming amounts of data, how can we be confident that we have parsed it correctly?

Obviously some basic checking was performed. This included comparing the number of entries expected with the number parsed, manually checking random parsed entries, parsing artificially created data, and so on. Checks such as these help detect parsing errors; in our case we identified a mismatch between the expected and parsed entry counts (which turned out to be an error in the UniProt release notes, which they have subsequently corrected). However, as with all testing, you can only show the presence of bugs/errors, not prove that parsed data is error free. In many cases you also don’t actually know everything you need to test for.
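As a rough illustration of the first two checks, here is a minimal Python sketch. It assumes the parsed entries are held in a list; the function names `check_entry_count` and `sample_for_manual_review` are hypothetical and are not the code used for this work.

```python
import random


def check_entry_count(parsed_entries, expected_count):
    """Compare the number of parsed entries with the expected number.

    Returns (ok, difference), where difference is parsed minus expected.
    """
    difference = len(parsed_entries) - expected_count
    return difference == 0, difference


def sample_for_manual_review(parsed_entries, k=5, seed=0):
    """Pick up to k random entries to inspect by hand.

    The seed is fixed so the same sample can be drawn again later.
    """
    rng = random.Random(seed)
    k = min(k, len(parsed_entries))
    return rng.sample(parsed_entries, k)
```

A non-zero difference only says that something disagrees; as with the release notes above, the expected figure itself may be the one that is wrong.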

As detailed in previous blog posts, we have been applying power laws to our parsed data. In a number of cases the resulting graphs have given unexpected results – such as artifactual kinks or outlying alpha values. Upon inspection these were due to incorrectly parsed data. In the first instance, a kink in the tail of the graph was due to incorrectly parsing copyright statements. These statements are an example of an error we couldn’t foresee or test for. Similarly, we also detected the incorrect parsing of topic block headers. Another error was due to incorrect escaping of speech marks when reading data dumps (OK, not a parsing error, but still an error detected with this method).

It appears that a side-effect of our original analysis has been the detection of incorrectly parsed data. The usage of Zipf’s Law is common; numerous papers exist that claim a Zipfian distribution has been found in all kinds of natural and man-made phenomena. This is also true of similar empirical laws, such as Pareto’s law. Given this, it isn’t unreasonable to hypothesise that power-law approaches could be applied to the detection of incorrectly parsed data.
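To make the idea concrete, here is a minimal Python sketch of one way such a check might look. It is an assumption-laden stand-in, not the analysis from the earlier posts: it fits a straight line to the log-log rank-frequency data by ordinary least squares (cruder than a proper power-law fit) and flags tokens that sit far from that line. The names `fit_zipf` and `flag_outliers` and the `threshold` default are hypothetical, and the slope reported here is the rank-frequency slope, which is not necessarily the same quantity as the alpha values mentioned above.

```python
import math
from collections import Counter


def fit_zipf(tokens):
    """Least-squares fit of log(frequency) against log(rank).

    Returns (ranked, slope, intercept), where ranked is a list of
    (token, frequency) pairs sorted by descending frequency.
    """
    ranked = Counter(tokens).most_common()
    if len(ranked) < 2:
        raise ValueError("need at least two distinct tokens to fit a line")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for _, freq in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return ranked, slope, intercept


def flag_outliers(tokens, threshold=1.0):
    """Return (token, frequency, residual) for tokens far from the fitted line.

    The residual is log(observed frequency) minus the fitted value;
    the threshold is an arbitrary illustrative cut-off.
    """
    ranked, slope, intercept = fit_zipf(tokens)
    flagged = []
    for rank, (token, freq) in enumerate(ranked, start=1):
        predicted = intercept + slope * math.log(rank)
        residual = math.log(freq) - predicted
        if abs(residual) > threshold:
            flagged.append((token, freq, residual))
    return flagged
```

Anything flagged would still need inspecting by hand, since a large residual may be a parsing artefact (such as a repeated copyright statement) or a genuine feature of the data.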

I am not aiming to investigate this further; rather, I found it an interesting side-effect of our analysis. Should this approach be explored, the algorithm would need to distinguish between parsing errors and ‘real’ outliers. It would also need to be easy to use and quick to run (in proportion to the total parsing time). It would be interesting to see whether such an approach could reliably detect errors in parsed data; given the generic nature of these approaches, the amount of literature on the subject, and my own experience, I suspect it could.

Comments

Pingback from Michael Bell » UniProtKB and Benford’s Law
Time October 5, 2011 at 2:36 pm

[...] the last blog post we looked at data parsing, and how Zipf’s law could possibly be used to detect parsing errors. Whilst reading a recent [...]
