In my previous post I checked how many of the databases listed in DBcat and the NAR database issue from 1999 were still reachable. I did this as a quick way to get an indication of how many databases have become extinct.
This approach, as I acknowledged, had a number of limitations. Indeed, a comment by Alex Bateman raised a very good point that I failed to state explicitly: many URLs from 1999 will have suffered URL decay. In other words, a database may still be alive but at a new URL. For example, DBcat has an entry for the Ad5E1A Database with the URL “http://www.geocities.com/CapeCanaveral/Hangar/2541/”, showing that the database was hosted on GeoCities. GeoCities was a popular host for webpages from the late 1990s to the mid-2000s, but Yahoo! closed the service in 2009. A quick search shows that the Ad5E1A database is still reachable at a different URL, although it hasn’t been updated since October 2009. Therefore, as Alex pointed out, this analysis served more as an indicator of URL decay than of how many databases have become extinct.
Whilst reading the 2006 NAR summary/update paper (http://dx.doi.org/10.1093/nar/gkj162), I realised that database summaries are provided with a persistent URL based on their entry number. So, for example, the URL http://www.oxfordjournals.org/nar/database/summary/157 points to the database entry for ABCdb, as ABCdb has the entry number 157. If ABCdb were to be removed from the list (as it previously had been, before later being resurrected), it would retain its previous entry number. Therefore, if this URL were visited and a 404 HTTP response code were returned, it would indicate that the database had been removed from the list. For example, the URL for entry 61 returns 404: it used to point to the IXDB database, which appears to have been shut down.
Therefore, by attempting to connect to the URL for each entry from 1 to 1673, we can get a better idea of how many databases have died. However, this isn’t strictly correct. A database may be removed from the NAR list for a number of reasons. Whilst being shut down is one such reason, removal may simply mean that a database now requires registration or has become commercialised. The aim of the NAR list was not to be exhaustive, but to contain databases that are publicly available and of high quality. It is therefore fairer to say that the result indicates how many databases have failed to maintain the standards required for inclusion in the NAR database list.
Anyway, as it was fairly quick to do, I wrote some code to check how many database summaries returned 404. In total, 161 databases have been removed from the NAR database list. For each of these URLs I checked the web archive to try to determine the name of the database that was removed (with mixed success). This list can be downloaded in CSV format.
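For anyone wanting to repeat the check, here is a minimal sketch of the idea in Python. It is not the code I originally ran. The function names (summary_url, http_status, removed_entries) are made up for illustration, it uses a HEAD request to avoid downloading each page, and it assumes the summary URLs still follow the pattern described above, which may no longer be true.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List

# URL pattern taken from the post; the site may no longer serve these pages.
BASE_URL = "http://www.oxfordjournals.org/nar/database/summary/{entry}"


def summary_url(entry: int) -> str:
    """Build the summary URL for a given NAR entry number."""
    return BASE_URL.format(entry=entry)


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is what we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout: not evidence of removal.
        return 0


def removed_entries(
    entries: Iterable[int],
    fetch_status: Callable[[str], int] = http_status,
) -> List[int]:
    """Return the entry numbers whose summary URL responds with 404."""
    return [e for e in entries if fetch_status(summary_url(e)) == 404]


# Hypothetical usage, covering entries 1 to 1673 inclusive:
# removed = removed_entries(range(1, 1674))
# print(len(removed), "entries returned 404")
```

Only a 404 is counted as a removal here; network errors are reported as 0 so that a flaky connection does not inflate the count. Passing the status function in as a parameter also makes it easy to test the logic without hitting the server.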
Finally, it is worth acknowledging that there is a significant amount of literature looking at this kind of phenomenon. Perhaps the best known is the extensive work by Jonathan Wren, who has looked at factors such as URL and email decay for articles indexed by MEDLINE.