Michael Bell

Ph.D. Students Blog

Skip to: Content | Sidebar | Footer

Word Clouds (Swiss-Prot and TrEMBL)

21 March, 2011 (16:37) | Annotation Quality, Miscellaneous, Website | By: mj_bell

Example word cloud from Swiss-Prot Version 9, with common words and numbers removed

During my analysis of Swiss-Prot and TrEMBL datasets I have extracted all the words from each version of each dataset and counted their occurrences. A neat way of looking at this data is to create word clouds. I have done this for all versions of Swiss-Prot and TrEMBL. These can be seen with common words and numbers removed, just numbers removed or nothing removed. I have created a gallery for each of these:

  1. Common words and numbers removed. Click for the Swiss-Prot or TrEMBL gallery.
  2. Numbers removed. Click for the Swiss-Prot or TrEMBL gallery.
  3. Nothing removed. Click for the Swiss-Prot or TrEMBL gallery.

More information about the word clouds can be found on my website.

Interpreting Power-Law Graphs and Quality

15 March, 2011 (14:02) | Annotation Quality, Projects | By: mj_bell

In previous posts we have looked at annotation quality, yet failed to clearly define our definition of quality. This oversight is fairly common, with many authors stating something is of ‘high quality’ or that something is of better quality than something else without stating what makes something high quality, or how they quantify quality.

Our definition of quality is linked to Zipf’s principle of least effort. An annotation that requires lesser amounts of effort for the reader than the annotator being deemed to be of high quality (as opposed to an annotation of lesser quality, which means it is harder for the reader to interpret). So a high quality annotation is one that is well curated, using specific terms, requiring the annotator to go against the principle of least effort and thus making the reader require the least effort in the overall process. The value of α derived from the fitting of Zipf’s law allows us to quantify quality and compare datasets.

Another oversight has been the lack of discussion about the graphs used: what are the values on the x and y axis? How do we interpret which has a greater reuse of words? We can explain and discuss these oversights by revisiting the graph for Swiss-Prot Version 9:

Graph for Swiss-Prot Version 9

The data shown on the graph is represented as a cumulative distribution function (CDF) and is better known as Pareto’s Law. Pareto’s Law, like Zipf’s Law, is a type of Power-Law and the two are closely linked, providing slightly different ways of looking at the same thing. One of the main differences is that the plotted data is cumulative, which means ‘noisy’ tails are tidied up, as all the words that occur once are shown as a single point on the graph, as opposed to plotting them all (which makes the tail noisy in Zipf’s Law). The other main differences, due to being a CDF, is that we look at the probability that a word occurs X or more times, rather than stating what a words exact ranking is.

Along the x-axis of the graph we have the size of x, or the amount of times the word occurs. Along the y-axis we have probabilities. Given this, each data point within the graph (indicated by blue circles on the above graph) shows the probability of a word occurring x or more times. So if we look at the top left of the graph, we can see the probability that a word occurs 1 or more times is 1 (as we only look at words that appear in the data set). If we look to the bottom right of the graph, we can see that, roughly, the largest word in the dataset is 10^4 (10,000 times) and the chance of a word in the dataset occurring that many times is, roughly, 10^-4 (0.0001 or, as a percentage, .01%).

Additionally, as is often the case, a regression line (a line of best fit) is fitted to the graph. The value of α is extracted from this line (its slope). The fitting of the power-law to the data – i.e. how well the regression line fits the data points – has to provide a decent fit for the extracted α values to be meaningful. As mentioned in prior posts, we have used a framework to give confidence to the fitting of our data. In some cases, the fitting using the current approach is not good enough to allow us to use the extracted α values. However, whilst we are currently investigating ways to improve this, we can still get useful information by looking at the graphs.

Unsurprisingly, the distributions of the data points on the graphs relate to the underlying data. If we have a body of data that shows a high reuse of words, then the probability a word will occur will be higher than those datasets with less reuse. We can illustrate this by revisiting the graph for UniProtKB/Swiss-Prot Vs. UniProtKB/TrEMBL version 15:

Graph for UniProtKB Version 15

You can see for TrEMBL (red triangles) that the probability of a word occurring 100 times is almost the same as the probability of it occurring 10 times – the initial slope is very flat. This means if we randomly pick a word out at random from the dataset, the probability of it occurring 10 times or more in the dataset is almost the same as it occurring 100 times or more. If you look at the underlying data you see that the most popular words occur very often, and words with large numbers of occurrences occur very frequently. Conversely, if you look at the data for frequently occurring words in Swiss-Prot you see the frequency is also high, but it doesn’t maintain this high reuse for words slightly less common. Additionally Swiss-Prot has more words that occur less times (1, 2, 3…20 times) unlike TrEMBL. You can see from the graph that the probability a word occurring X or more times always has a higher probability than the same value in Swiss-Prot.

If you look towards the end of the TrEMBL graph, around 10^6, you will see the opposite – an almost flat vertical line. This shows the probability a word occurs slightly more frequently than the previous one is much lower. This shows that the most occurring words have little difference between them. If we look at the data we see the 3rd-6th most common words occur 1574337, 1558969, 1439234, and 1428770 times. The difference between these sizes is small, but as they are so highly ranked, the chance a word occurs 1574337 times compared to 1558969 is quite a bit less likely. These two points (a number of words occurring very frequently with a smaller subset occurring 1,2,3..10 times) highlighted shows a high reuse of words within TrEMBL, which isn’t unexpected given that TrEMBL is made up of automated annotations.

Comparing Swiss-Prot Version 9 to UniProtKB/Swiss-Prot Version 15, apart from the size difference as one would expect from a growing knowledgebase, a clear two slope behaviour can be seen developing over time (much clearer when you see graphs for all versions). This two slope behaviour has been said to be evident for mature datasets. One way of explaining these slopes is to say we have have a ‘core’ body of text, containing standard English terms used in most annotations, which represents the second slope (those words that occur more frequently). Over time, this slope moves right and downwards to reflect the increase of size within the database, with the usage growing proportionally. The first slope deals with the new words that are incorporated over time, which would reflect a mature dataset.

Blog and Website Updates

28 February, 2011 (16:39) | Website | By: mj_bell

Having discussed the progress of my website in various blog posts, it has now got to a stage where it can go live. You can view it by clicking the link in the top right, or go to http://www.michaeljbell.co.uk. I have also set myself up a twitter page (@mjbell). In addition to this I’ve also updated to the latest version of WordPress, so it’s good to see the update has gone smoothly.

In addition to updating, I also installed some new plug-ins, one of which was KCite (See KnowledgeBlog for more info). It allows me to cite papers in my blog, simply by stating their DOI in between cite tags (or pubmed ID). So, for example, in previous posts I have mentioned Zipf’s principle of least effort (10.1073/pnas.0335980100) and a framework used to fit the power-laws (10.1137/070710111). However, now I can have in-line citations and a bibliography at the end of the post. Pretty neat!


Turing Lecture with Prof Donald Knuth

11 February, 2011 (15:24) | Lecture, Miscellaneous, String Searching | By: mj_bell

Yesterday I attended the BCS and IET Turing Lecture in Glasgow. This year the lecture was given by Donald Knuth. Saying it was a lecture was slightly mis-leading, as it was basically a question and answer session. This was part 4 of the lectures, having also given the lecture at Manchester, Cardiff and London.

I’d imagine it would be hard to find a computer scientist who hasn’t heard of Don or been impacted by his work. Having done an algorithm course I was exposed to algorithm analysis, where he is often credited as being the father, as well as the Knuth-Morris-Pratt string searching algorithm which also impacted my Dissertation. As we look to publish our first paper, I have also made use of TeX (well, LaTeX) which was written by Don. Finally, his Art of Computer Programming books, which I really must find time to read some day!

Before the question and answer session, Don was awarded an honorary degree from Glasgow. Following this, the question and answer session started. The questions asked were generally pretty good, and covering subjects such as the functional programming paradigm, art vs. science of programming and how to be a good teacher. At times it was hard to hear Don, and likewise Don struggling to clearly hear the questions. In some cases his answers were a tad rambly and went around the subject with no clear answer to the question. Nevertheless it was an interesting lecture and glad I managed to attend. Some of the main (or humorous) points/quotes, with help from the twitter hashtag:

  • On the subject of teaching: “Say everything at least twice, once formally and once informally.”
  • On the subject of functional programming: “Out of the ~250 programs I wrote last year, 2-3 would have benefited from being written in a functional style.”
  • On the subject of functional programming: “With 1 or even 2 hands tied behind your back it’s hard to do anything dangerous.”
  • On the subject of TAOCP: “I started working on the art of computer programming 49 years ago, January 1962″
  • On the subject of text books becoming less popular: “Almost anything is out of fashion compared to Facebook.”
  • On the subject of NP Problems “Knuth believes P=NP, but that the proof will be nonconstructive and won’t help us write efficient programs.”
  • On Combinatorial Algorithms: “Almost no combinatorial algorithms were known in 1962.”

Twitter Integration

9 February, 2011 (12:49) | Website | By: mj_bell

As I continue to improve my web presence, I am hoping each blog post will generate a new Twitter post…so this should appear on my twitter feed! (@mj_bell)

Swiss-Prot Vs TrEMBL: Annotation Quality

9 February, 2011 (12:04) | Annotation Quality, Projects | By: mj_bell

As discussed in the previous post the application of Zipf’s law to annotation and noting the subsequence exponent α could hold promise as the basis for use as a quality metric. UniProtKB/Swiss-Prot is a manually curated and reviewed knowledgebase holding protein information. It is often regarded as the ‘gold standard’ for annotation. Additionally UniProt offer TrEMBL (Translated EMBL) which is automatically annotated and not reviewed. UniProt also allow previous releases of both Swiss-Prot and TrEMBL to be downloaded, which will allow us to also see the change in quality of both datasets over time.

For these reasons, Swiss-Prot and TrEMBL are an ideal set of databases to apply power-laws too. Obtaining the datasets was trivial, and were downloaded from the UniProt FTP site. Following this, we parsed each entry in each version, extracting each word and counting how many times it occurred over a dataset. Given this, we now had, for each version of both databases, a list of words that occurred and their frequency.

We then wanted to apply a power law to this data and note the exponents. Although we have mentioned Zipf’s law, we actually used Pareto’s law which is slightly different. The main differences is it a cumulative distribution function which provides a better fit of the power law, as it tidies up noisy tails. We also used a framework to provide confidence to the fitting of the power laws. Applying this to the first version of Swiss-Prot available (Swiss-Prot Version 9, December 1988) and the latest major version at the time of performing the experiment (UniProtKB/Swiss-Prot Version 15) we obtained the following graphs:

Swiss-Prot Version 9
UniProtKB/Swiss-Prot Version 15

The corresponding α’s are 2.07 for Swiss-Prot version 9 and 1.58 for UniProtKB/Swiss-Prot version 15. However, whilst the line provides a good fit Swiss-Prot version 9 the line for UniProtKB/Swiss-Prot 15 is poor, as there is a clear two slope behaviour. This two slope behaviour is apparent in mature resources. Whilst the framework cannot handle this, it is clear if we visually fit two lines (Segmented regression) we get a good fit, and in doing so we get an average α value of around 2.1 for UniProtKB/Swiss-Prot version 15. Having done this, we also applied the law to TrEMBL. Below shows the first version of TrEMBL compared to version 34 of Swiss-Prot (closest release to TrEMBL) and Version 15 of UniProtKB (both Swiss-Prot and TrEMBL packaged in the same release):

Swiss-Prot Version 34 Vs. TrEMBL Version 1
Swiss-Prot Vs. TrEMBL for UniProtKB Version 15

The α values for TrEMBL are given as 1.88 for version 1 and 1.56 for version 15. Again, like later versions of Swiss-Prot these aren’t perfect fits and using visual fitting we roughly get are 1.8 for version 1 and 1.6 for version 15. Although these are done by visual inspection, they do appear to follow our original hypothesis, and it does look like an approach that could give an indication to a knowledgebase’s annotation quality. It is also interesting to see TrEMBL and Swiss-Prot diverging over time, with TrEMBL showing less maturity over time with may words occurring with a high frequency.

We are currently looking at ways to improve the fitting the power law.

How to determine annotation quality?

8 February, 2011 (17:46) | Annotation Quality, Projects | By: mj_bell

As mentioned in the last post, how do go about determining annotation quality? We have seen other approaches use, for example, the underlying data structure. These mean we cannot use the same approach on all databases. The only thing we can guarantee is that all databases that store annotations do so in text; either structured or unstructured. Given this, we aim to create a generic quality metric for annotation based purely on free text.

As we are looking at, in essence, structured language it is no surprise we came across linguist work, done by George Zipf, that appeared promising. Zipf’s Law, named unsurprisingly after George Zipf, is an empirical power law. Basically it shows that the occurrence of a word is inversely proportional to its rank (i.e. the most common word is ranked 1st, second most common ranked 2nd, and so on). When this data is plotted on a log-log plot, we typically see a straight line. Zipfian distributions have been seen in various natural and man made phenomena, such as earthquakes, city sizes and web pages on the Internet.

An obvious observation is that depending on the text used the graph will vary slightly, and thus, so will the exponent of the line (α). Indeed, one paper has shown a pattern emerging. Text that is seen to be of very poor quality or incomprehensible (e.g. writings by small children) has &alpha values below 1.6 and above 2.4. In between these values we have text that is of readable quality, but differs slightly. We see that text that favors the reader (e.g. the text is very well written, using specific terms) gets a different score to those that favor the writer (e.g. they write with generic terms). These are summarised below:

  • α < 1.6 – Text is incomprehensible
  • 1.6 < α < 2 – Favors the writer
  • α = 2 – Equal effort levels
  • 2 < α < 2.4 – Favors the reader
  • α > 2.4 – Text is incomprehensible

This ties in with The Principle of Least Effort, also formulated by George Zipf, that states humans will typically take the path requiring least effort to obtain their goal. Given this, we would expect annotations that computer generated to show an α value less than 2, whilst manually curated annotations to obtain a score greater than 2. Should we obtain these results from some sample database, then this approach would appear to hold promise as the basis of a quality metric.

Why look at annotation quality?

8 February, 2011 (17:08) | Annotation Quality, Projects | By: mj_bell

As I continue with adding content to my website, I have started to detail a project that took up the majority of my first year of my PhD. The majority of the work is complete, and we a paper well underway, but we are still polishing off the latter parts – details of which will be posted once we complete them!

As discussed on my website, annotations are important. They can vary from expertly curated (e.g. UniProtKB/Swiss-Prot) to computer generated (e.g. TrEMBL). There are numerous databases (with varying annotation rules and standards), and the annotations from one database can extracted and incorporated into another. It has been shown that users often incorrectly assume annotations are of equal quality and correctness. Annotations tend to come with little, if any, meta data. It is hard, if not impossible, to track the provenance of an annotation, detect where an error propagated from, track annotation changes to an annotation, etc.

Some work has looked at trying to determine annotation quality. All annotations contain free text, but the surveyed papers all require additional information – such as exploiting the structure of the underlying structure (the Gene Ontology in one paper) – to determine the quality. This means data resources cannot be equally compared. Thus we have taken an approach to assess quality based purely on the free text annotation.

String Searching

3 February, 2011 (14:56) | Projects, String Searching, Website | By: mj_bell

Continuing with my adding of content to my website, I added my undergraduate dissertation project earlier today. My project, titled “Animating String Searching Algorithms”, looked at building a teaching tool for string searching algorithms. The development of the tool was done in Java and as a couple of years have now passed since I created the tool, and I can see a number of enhancements I would like to make. A few of these were clear upon completion, however as my programming has improved I can see a number of ways to improve it “under the hood”.  However, I’m still pretty happy with the tool and the research I did into it, and enjoyed the project as a whole .

I started demonstrating on CSC8002 today – Advanced Programming on the MSc Computing Science course – which will cover event driven programming, so will be good to brush up on my GUI skills.

Coming along nicely…

2 February, 2011 (16:41) | Website | By: mj_bell

The webpage is coming along nicely. The design is pretty much sorted, just need to add the content now. I’ve also setup a twitter account (@mj_bell) which I have integrated into the webpage, along with updates from my CiteULike library. Once the website is fully up and running, I will start tweeting! The blog and tweets will start being work based and give updates and progress on my research and progress as I go along hopefully in a day or two, once the core content is up on the webpage.