Annotation maturity: Average entry age within UniProtKB
As we know, the UniProtKB database is growing exponentially. Our prior quality analysis of whole UniProtKB versions has shown that, over time, annotations are becoming more geared towards the annotator; that is, readers now require more effort to interpret and digest the annotations than in earlier versions of UniProtKB. The exponential growth of the database offers a plausible explanation for this: the initial quality of annotations cannot be maintained. This explanation also fits with other research (http://dx.doi.org/10.1093/bioinformatics/btm229).
Given this, we would like to abstract away from the growth of the database and see how sets of annotations change over time. For example, how do the annotations for a set of entries present in both Swiss-Prot version 9 and Swiss-Prot version 45 compare? Our initial step is to investigate the average age of entries within UniProtKB.
Within each entry there is a set of date stamps, as explained in the UniProtKB manual, one of which indicates when the entry first appeared in the database. For each entry within a database version we extract this date and calculate the average. The figure below shows the average entry age for both TrEMBL and Swiss-Prot:
This graph shows that the average entry age increases over time. These results aren’t unexpected, as new data is constantly being added, which will gradually push the average age up. Although entries in Swiss-Prot come from TrEMBL, they overwrite the initial TrEMBL date with the date of integration into Swiss-Prot, which is why we see a similar pattern for both Swiss-Prot and TrEMBL. Additionally, older entries can be deleted or merged, and very little information is given about what happens to the dates of merged entries. It isn’t uncommon to see two or more entries merged; when this happens, one accession number is kept as the primary accession number and the remaining accession numbers become secondary accession numbers. The primary accession isn’t necessarily that of the oldest entry, and we only extract dates from primary accession numbers.
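The date extraction and averaging described above can be sketched roughly as follows. This is only a minimal illustration, not the code used for the analysis: it assumes the UniProtKB flat-file layout, in which each entry ends with a "//" line and the first DT line of an entry carries the date it first appeared, written as DD-MMM-YYYY. The function names first_dates and average_date are hypothetical.

```python
from datetime import date

# Month abbreviations as they appear on DT lines (assumed to be upper-case English).
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def first_dates(lines):
    """Yield the date on the first DT line of each entry (hypothetical helper)."""
    seen = False
    for line in lines:
        if line.startswith("//"):          # end of entry: reset for the next one
            seen = False
        elif line.startswith("DT") and not seen:
            # The date is the first token after the line code; newer releases
            # follow it with a comma, older ones with a bracketed comment.
            token = line[2:].split()[0].rstrip(",")
            day, mon, year = token.split("-")
            yield date(int(year), MONTHS[mon.upper()], int(day))
            seen = True


def average_date(dates):
    """Return the mean of the given dates, rounded to the nearest day."""
    ordinals = [d.toordinal() for d in dates]
    if not ordinals:
        raise ValueError("no entry dates found")
    return date.fromordinal(round(sum(ordinals) / len(ordinals)))
```

Averaging day ordinals keeps the arithmetic simple. Because only the first DT line of each entry is read, an entry that has absorbed older, merged entries contributes just the date recorded for its primary accession, as discussed above.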
Whilst this graph shows that the average entry age is increasing, the difference between the release date and the average entry date also shows an initial increase. We currently have over 20 years' worth of Swiss-Prot versions, with the average record in the latest version being around 5 years old, compared to around a year and a half in version 9. This difference between the average entry date and the release date is shown below:
To illustrate this, the first graph shows that Swiss-Prot version 9 was released in November 1988 and that its average entry date was July 1987; the difference between these dates is 1 year and 4 months, which is reflected in the bottom left point in the above graph. This graph is slightly different from what we expected. Initially we see the age difference increasing, but it then starts to decrease (Swiss-Prot) or level off (TrEMBL); we would have expected a constant increase. We assume age and maturity are linked; something old is more mature than something young. If this is indeed the case, then this would suggest that Swiss-Prot is decreasing in maturity, whilst TrEMBL remains at a steady level of maturity. It could be suggested that the increase in entries hit a maximum level before having a detrimental effect. A similar curve could be seen in, for example, the response time of a server as the number of concurrent users querying a database steadily increases.
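The version 9 figure quoted above can be reproduced with a few lines of date arithmetic. This is just an illustrative sketch: the function name age_gap is hypothetical, and because only the month and year are given above, both dates are assumed to fall on the first of the month.

```python
from datetime import date


def age_gap(release, average):
    """Return (years, months) between the average entry date and the release date."""
    months = (release.year - average.year) * 12 + (release.month - average.month)
    if release.day < average.day:
        months -= 1                      # the last month is not yet complete
    return divmod(months, 12)


# Swiss-Prot version 9: released November 1988, average entry date July 1987.
# The day of the month is an assumption; only month and year are stated.
years, months = age_gap(date(1988, 11, 1), date(1987, 7, 1))
print(f"{years} year(s) and {months} month(s)")
```

Applying the same calculation to every release gives one point per version, which is what the second graph plots.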
Although we are looking at entry dates here, rather than annotation quality, this conclusion fits with that of our other analyses: the rate at which data arrives is outstripping our ability to deal with it. Following on from this, we now wish to look at the maturity of annotations within entry sets. This is discussed in the next blog post (link).