CSC8312 -- Bioinformatics Theory and Applications

Sequence Features and Databases

The purpose of this exercise is:

To walk through some of the most commonly used databases in bioinformatics.
To see the most common kinds of data.
To understand how bioinformatics data is structured and annotated.

Some Biology

In this practical, we're going to walk through an investigation of the PAX6 gene. I've picked this gene because it's well studied, its function has been extensively investigated and, most importantly for a biologist, it's associated with a disease.

This practical assumes a reasonable biological background; for those from a non-bio background, please ask the demonstrator about anything that you haven't covered in the lectures yet.

Biological Databases

As well as tools, most bioinformatics databases ¹ are available on the web. You can download the entire human genome, if you wish. In this case, we are going to investigate the enormous amount of information that is available about one gene.

The core of molecular biology is DNA, and it is well served by databases. The EMBL database is a repository for nucleic acid sequences. It is maintained by the EMBL organisation and produced in collaboration with DDBJ and GenBank. New entries are exchanged between EMBL, GenBank and DDBJ to ensure uniformity, as part of the International Nucleotide Sequence Database Collaboration.

Entries in these databases are submitted by many external sources, including individual researchers and sequencing programmes. The amount of annotation available for a given gene is highly variable and depends on the source, so functional annotations should be treated with caution. TrEMBL contains protein sequences derived by translating entries from the EMBL database. It is designed to be comprehensive, which means that its contents are not rigorously checked; TrEMBL annotation is, therefore, not to be entirely trusted.
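To make "derived by translating" concrete, here is a minimal sketch of turning a coding DNA sequence into a protein sequence using the standard genetic code. It is not how TrEMBL itself is produced; the function name translate and the example sequence are made up for illustration, and the sketch assumes the input is already a coding sequence in the correct reading frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon.

    Hypothetical helper for this practical: assumes frame 0 and no introns.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unrecognised codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence, not a real database entry.
example_protein = translate("ATGGCCTAA")
```

A translation like this is entirely automatic, which is why it says nothing about whether the predicted protein is real or what it does.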

The NIH is the home of the GenBank database, which holds essentially the same information as EMBL. Currently, GenBank holds 11 billion bases from over 100,000 species. The NIH also has a database called nr (non-redundant) that contains sequences collated from many different databases in a non-redundant fashion (i.e. each sequence occurs only once). This database is the default for searches with tools such as BLAST.
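The idea of "each sequence occurs only once" can be sketched as follows. This is an illustration only, not a description of how nr is actually built: it assumes that two records are redundant exactly when their sequences are identical (ignoring case), and the function name and record identifiers are invented for the example.

```python
def make_non_redundant(records):
    """Collapse (identifier, sequence) pairs so each sequence occurs once.

    Returns a dict mapping each distinct sequence to the list of
    identifiers that carried it, in the order they were seen.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # treat upper and lower case as the same sequence
        merged.setdefault(key, []).append(identifier)
    return merged


# Invented identifiers and sequences, purely for illustration.
records = [
    ("dbA:1", "MKV"),
    ("dbB:7", "mkv"),
    ("dbA:2", "MKL"),
]
non_redundant = make_non_redundant(records)
```

Keeping every source identifier alongside the single stored sequence means that a search hit can still be traced back to each database the sequence came from.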

The highest quality protein sequence information can be found in the Uniprot ² database. This database is jointly maintained by the EBI and the Swiss Institute of Bioinformatics. Its contents are of high quality because they are maintained by teams of expert curators, and entries are accompanied by information derived from the literature. Annotating sequences takes a lot of time and resources ³, so the Uniprot database represents only a fraction of the known proteins. Currently, it documents around 240,000 different proteins.

To make the best use of these databases, you will need to understand how the data is organised and stored.

1. Those with a computational background may think of "database" as meaning relational. It really doesn't have this connotation in bioinformatics. An increasingly large number of the databases are backed by an RDBMS, but not all.

2. Uniprot is a combined version of the older databases Swissprot and PIR. You'll probably see the name Swissprot a lot, as many people still call it this. It is the same thing.

3. There are currently about 100 annotators, programmers and sys admins working on Uniprot. At a quick guess, this would mean funding in the order of £20,000,000 per annum.
