CSC8312 -- Bioinformatics Theory and Applications

Flat Files

A large amount of bioinformatics is based on flat files, which those of you from a computer science background might find rather surprising. In the past, flat files were the primary means of storing data. Things have advanced a lot since then, and relational databases and XML are now in much more common use. Still, many tools depend on one or more flat file formats.

What are the main advantages of flat files over other forms of database?
What are their main problems?

File Formats

Most bioinformatics databases will provide their data in multiple formats. At the moment, you should be looking at a pretty picture of an EMBL record.

What are the main sections of the EMBL record?
What information is stored in each section?

This format is good for viewing, but not for computational use.

Click on "Text Entry" at the top. You should get a text entry which looks like this.

This file uses a two-letter abbreviation syntax: each line starts with a two-letter code that indicates the specific kind of information the line holds. For example:

ID — Identifiers
FT — Features

The explanations, descriptions, classifications and other comments are in ordinary English, and the symbols and formatting employed for the base sequences themselves have been chosen for readability. Wherever possible, symbols familiar to molecular biologists have been used. At the same time, the structure is systematic enough to allow computer programs to easily read, identify, and manipulate the various types of data included. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. EMBL entries are composed to be usable by humans as well as by programs.
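To see why this line-based structure is easy for programs to handle, here is a minimal Python sketch that groups the lines of one entry by their two-letter code. The sample record is made up for illustration and is not a real database entry. The function name group_by_line_code and the "seq" key for the sequence lines are our own choices, and the sketch assumes the layout visible in the text entry: a two-letter code, some padding, data starting at the sixth character, "XX" spacer lines, and "//" closing the entry.

```python
from collections import defaultdict

# A tiny, made-up record in EMBL-style layout (not a real database entry).
SAMPLE = """\
ID   X00001; SV 1; linear; mRNA; STD; PLN; 12 BP.
XX
DE   Hypothetical example entry
XX
FT   source          1..12
FT   CDS             1..12
XX
SQ   Sequence 12 BP;
     acgtacgtac gt                                                            12
//
"""


def group_by_line_code(text):
    """Group the lines of a single entry by their two-letter line code."""
    groups = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        if code == "//":          # end-of-entry marker
            break
        if code == "XX" or not line.strip():
            continue              # spacer lines carry no data
        # Sequence data lines start with blanks instead of a code.
        key = code if code.strip() else "seq"
        groups[key].append(line[5:].rstrip())
    return dict(groups)


record = group_by_line_code(SAMPLE)
print(record["ID"])
print(record["FT"])
```

A real parser would need to handle many more details (multi-line features, qualifiers, several entries per file), so in practice you would normally use an existing library rather than write your own.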

Can you find a list of what all the prefixes mean?

XML

Next, try the EMBLXML format. Again, select this at the top.

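XML is a very common language which can be used to represent many things; in this case, it represents an EMBL record. It is harder for humans to read and manipulate, but it is much easier for computers to process, because standard XML parsers can read it without any format-specific code.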
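As a rough illustration of why XML suits programs, the Python sketch below reads a small XML record using only the standard library. The element and attribute names (entry, accession, description, feature, sequence) are hypothetical and chosen for this example; they are not the actual EMBLXML schema, so compare them with what you see on screen.

```python
import xml.etree.ElementTree as ET

# A made-up XML record; the tag names are assumptions, not the EMBLXML schema.
SAMPLE_XML = """\
<entry accession="X00001">
  <description>Hypothetical example entry</description>
  <feature name="source" location="1..12"/>
  <feature name="CDS" location="1..12"/>
  <sequence>acgtacgtacgt</sequence>
</entry>
"""

root = ET.fromstring(SAMPLE_XML)

# Attributes and child elements are addressed by name, not by column position.
accession = root.get("accession")
description = root.findtext("description")
features = [(f.get("name"), f.get("location")) for f in root.findall("feature")]
sequence = root.findtext("sequence")

print(accession, description)
print(features)
print(len(sequence))
```

Notice that no code here depends on line prefixes or column positions: the generic XML parser does the structural work, which is the main advantage over a flat file format.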