
CSC8312 -- Bioinformatics Theory and Applications

Fasta Format

Move back to the EmblEntry format.

At the bottom is the DNA sequence, which should look something like this:

>embl|AF457141|AF457141 Mus musculus Pax6 paired-less isoform mRNA, complete cds.
ttaaactctgggcaggtcctcgcgtagaacccggttgtcagatctgctacttccccccga
gaagcggctttgagaagtgtgggaaccagcgccaccagactcacctgacaccccagcctc
ggctcacagatggctgccagcaacaggaaggagggggagagaacaccaactccatcagtt
ctaacggagaagactcggatgaagctcagatgcgacttcagctgaagcggaagctgcaaa
gaaatagaacatcttttacccaagagcagattgaggctctggagaaagagtttgagagga
cccattatccagatgtgtttgcccgggaaagactagcagccaaaatagatctacctgaag
caagaatacaggtatggttttctaatcgaagggccaaatggagaagagaagagaaactga
ggaaccagagaagacaggccagcaacactcctagtcacattcctatcagcagcagcttca
gtaccagtgtctaccagccaatcccacagcccaccacacctgtctcctccttcacatcag
gttccatgttgggccgaacagacaccgccctcaccaacacgtacagtgctttgccaccca
tgcccagcttcaccatggcaaacaacctgcctatgcaacccccagtccccagtcagacct
cctcatactcgtgcatgctgcccaccagcccgtcagtgaatgggcggagttatgatacct
acacccctccgcacatgcaaacacacatgaacagtcagcccatgggcacctcggggacca
cttcaacaggactcatttcacctggagtgtcagttcccgtccaagttcccgggagtgaac
ctgacatgtctcagtactggcctcgattacagtaaagagagaaggagagagcatgtgatc
gagagaggaaattgtgttcactctgccaatgactatgtggacacagcagttgggtattca
ggaaagaaagagaaatggcggt

FASTA is a widely used text format for sequence files. Each entry begins with a single-line description of the sequence, followed by one or more lines of sequence data. The sequence data can be a nucleic acid (DNA or RNA) sequence or an amino acid (protein) sequence.

The description line is marked by a greater-than (">") symbol in the first column. Each line of text is usually 80 characters or shorter and ends with a line break.
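The layout described above can be read with a few lines of code. The following is a minimal sketch in Python, not a reference parser: the function name read_fasta is made up for this illustration, and it assumes that blank lines can be ignored and that any text before the first ">" line is not part of an entry.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines.

    read_fasta is a hypothetical helper written for this example.
    """
    description = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new description line closes the previous entry, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence lines are joined back into one string.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


# The first two sequence lines of the entry shown above.
example = """>embl|AF457141|AF457141 Mus musculus Pax6 paired-less isoform mRNA, complete cds.
ttaaactctgggcaggtcctcgcgtagaacccggttgtcagatctgctacttccccccga
gaagcggctttgagaagtgtgggaaccagcgccaccagactcacctgacaccccagcctc
"""

records = list(read_fasta(example.splitlines()))
for description, sequence in records:
    print(description)
    print(len(sequence), "bases")
```

To read a file instead of a string, the open file object can be passed directly, since iterating over it yields lines.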

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes. The nucleic acid codes are:

        A --> adenosine           M --> A C (amino)
        C --> cytidine            S --> G C (strong)
        G --> guanosine           W --> A T (weak)
        T --> thymidine           B --> G T C
        U --> uridine             D --> G A T
        R --> G A (purine)        H --> A C T
        Y --> T C (pyrimidine)    V --> G C A
        K --> G T (keto)          N --> A G C T (any)
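The table above can be written down as a lookup table, which is handy for checking a sequence or expanding an ambiguity code. This is only a sketch: the names IUPAC_NUCLEOTIDES, invalid_nucleotides and possible_bases are invented for this example, and it assumes that upper and lower case letters mean the same thing.

```python
# Each code maps to the bases it can stand for, as listed in the table above.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA",   # purine
    "Y": "TC",   # pyrimidine
    "K": "GT",   # keto
    "M": "AC",   # amino
    "S": "GC",   # strong
    "W": "AT",   # weak
    "B": "GTC",
    "D": "GAT",
    "H": "ACT",
    "V": "GCA",
    "N": "AGCT",  # any
}


def invalid_nucleotides(sequence):
    """Return a sorted list of the characters that are not in the table."""
    return sorted(set(sequence.upper()) - set(IUPAC_NUCLEOTIDES))


def possible_bases(code):
    """Return the set of bases that a single code can represent."""
    return set(IUPAC_NUCLEOTIDES[code.upper()])


print(invalid_nucleotides("ttaaactctgggcaggtcct"))
print(invalid_nucleotides("acgtxz"))
print(sorted(possible_bases("r")))
```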

The accepted amino acid codes are:

    A  alanine                         P  proline
    B  aspartate or asparagine         Q  glutamine
    C  cysteine                        R  arginine
    D  aspartate                       S  serine
    E  glutamate                       T  threonine
    F  phenylalanine                   U  selenocysteine
    G  glycine                         V  valine
    H  histidine                       W  tryptophan
    I  isoleucine                      Y  tyrosine
    K  lysine                          Z  glutamate or glutamine
    L  leucine                         X  any
    M  methionine                      *  translation stop
    N  asparagine                      -  gap of indeterminate length
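The amino acid codes can be handled in the same way. The sketch below is illustrative only: AMINO_ACID_CODES, spell_out and unknown_codes are names made up for this example, and the table holds just the codes listed above.

```python
# One-letter codes and their meanings, as listed in the table above.
AMINO_ACID_CODES = {
    "A": "alanine", "B": "aspartate or asparagine", "C": "cysteine",
    "D": "aspartate", "E": "glutamate", "F": "phenylalanine",
    "G": "glycine", "H": "histidine", "I": "isoleucine",
    "K": "lysine", "L": "leucine", "M": "methionine",
    "N": "asparagine", "P": "proline", "Q": "glutamine",
    "R": "arginine", "S": "serine", "T": "threonine",
    "U": "selenocysteine", "V": "valine", "W": "tryptophan",
    "Y": "tyrosine", "Z": "glutamate or glutamine", "X": "any",
    "*": "translation stop", "-": "gap of indeterminate length",
}


def unknown_codes(protein):
    """Return a sorted list of the characters that are not in the table."""
    return sorted(set(protein.upper()) - set(AMINO_ACID_CODES))


def spell_out(protein):
    """Return the meaning of each code in the sequence, in order."""
    return [AMINO_ACID_CODES[code] for code in protein.upper()]


print(spell_out("MK*"))
print(unknown_codes("MKJ"))
```

A letter such as J is reported as unknown here simply because it is not in the table above.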
