A large part of bioinformatics is based on flat files. Those of you with a computer science background might find this rather surprising. In the past, flat files were the primary means of storing data. Things have advanced a lot since then, and relational databases and XML are now in much more common use. Still, many tools depend on one or more flat file formats.
Most bioinformatics databases will provide their data in multiple formats. At the moment, you should be looking at a pretty picture of an EMBL record.
This view is good for reading, but not for computational use.
This file uses a two-letter abbreviation syntax: each line begins with a two-letter code that indicates the specific kind of information the line holds.
The explanations, descriptions, classifications and other comments are in ordinary English, and the symbols and formatting employed for the base sequences themselves have been chosen for readability. Wherever possible, symbols familiar to molecular biologists have been used. At the same time, the structure is systematic enough to allow computer programs to easily read, identify, and manipulate the various types of data included. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data which make up the entry. EMBL entries are thus structured to be usable by humans as well as by programs.
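To see why this line-based layout is convenient for programs, here is a minimal sketch in Python that groups the lines of one entry by their two-letter code. The record below is made up and heavily shortened for illustration, and the function name `group_lines_by_code` is hypothetical. The sketch assumes the usual EMBL layout: the code sits in the first two columns, the content starts at column 6, `XX` lines are spacers, sequence lines start with blanks, and `//` ends the entry.

```python
from collections import defaultdict

# Made-up, heavily shortened record for illustration only; not real data.
SAMPLE_RECORD = """\
ID   X00001; SV 1; linear; mRNA; STD; PLN; 12 BP.
XX
AC   X00001;
XX
DE   Hypothetical example entry
DE   spanning two description lines.
XX
SQ   Sequence 12 BP;
     acgtacgtac gt        12
//
"""


def group_lines_by_code(record_text):
    """Group the content of each line under its two-letter line code."""
    grouped = defaultdict(list)
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # "//" marks the end of the entry
        code = line[:2]
        if code == "XX":
            continue  # spacer line, carries no data
        if not code.strip():
            code = "(sequence)"  # sequence data lines have a blank code
        # Assumption: the content starts at column 6 (index 5).
        grouped[code].append(line[5:].strip())
    return dict(grouped)


if __name__ == "__main__":
    for code, contents in group_lines_by_code(SAMPLE_RECORD).items():
        print(code, contents)
```

Because every line announces its own type, a program can pick out just the lines it cares about without understanding the rest of the entry.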
Can you find a list of what all the prefixes mean?
Next, try the EMBLXML format; again, select this at the top.
XML is a very common markup language that can be used to represent many things; in this case, an EMBL record. It is harder for humans to read and manipulate, but much easier for computers to process.
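As a rough illustration of why XML suits programs, the sketch below reads a small XML document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`entry`, `accession`, `description`, `sequence`, `length`) are invented for this example and are not the real EMBLXML schema; the function name `summarise_entry` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: element and attribute names are invented for
# illustration and do not follow the real EMBLXML schema.
SAMPLE_XML = """\
<entry accession="X00001">
  <description>Hypothetical example entry</description>
  <sequence length="12">acgtacgtacgt</sequence>
</entry>
"""


def summarise_entry(xml_text):
    """Pull a few named fields out of one XML entry."""
    root = ET.fromstring(xml_text)
    sequence = root.find("sequence")
    return {
        "accession": root.get("accession"),
        "description": root.findtext("description"),
        "length": int(sequence.get("length")),
        "sequence": sequence.text.strip(),
    }


if __name__ == "__main__":
    print(summarise_entry(SAMPLE_XML))
```

Here the parser handles all of the structure, so the program asks for fields by name instead of counting columns or matching line prefixes.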