An introduction to sequence similarity searching


The first questions a biologist usually asks when looking at a new DNA or protein sequence are: what is the function of this molecule, and what is its structure?

One of the major goals of bioinformaticists is to be able to predict protein function and structure from the sequence alone.  Scientists have been working towards this goal for many years and still have a long way to go.

However, some prediction methods are successful, and all of them are based on the same knowledge-based principle.

They rely on the fact that the best way to predict structure or function is to find similar sequences in existing databases and to use the information about them to make conclusions about properties of the new sequence.

Special algorithms have been developed to perform these similarity searches and have been implemented as computer programs.

Three major algorithms have been most widely used to identify regions of local similarity (the best-matching regions) between sequences: the Smith-Waterman approach, FASTA (pronounced "Fast A") and BLAST.

The Smith-Waterman algorithm is the most exhaustive approach and is guaranteed to find the best-scoring local alignment to a query sequence under a given scoring scheme. However, it is relatively slow, and hence web-based tools that use it are not commonly available.
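
To make the idea of an exhaustive local alignment concrete, here is a minimal Python sketch of the Smith-Waterman approach. The function name and the scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions chosen for illustration; real tools use substitution matrices and more elaborate gap penalties. Every cell of the score table is filled in, which is why the method is thorough but slow on large databases.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions, not a standard scheme.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below 0: a local alignment may start anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGACGTCC"))
```

The work grows with the product of the two sequence lengths, so comparing one query against every entry in a large database this way quickly becomes expensive.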

FASTA and BLAST have been developed to perform the same kind of searches but much faster at the expense of some loss of sensitivity.  Although FASTA and BLAST differ slightly in the way that they work, they essentially produce the same results. In this exercise we will just concentrate on using the BLAST tool.

BLAST is a heuristic method for finding the highest-scoring locally optimal alignments (the bits that best match up!) between a query sequence and a database. In this section we will look at some of the ways to find similar sequences in sequence databases using BLAST and PSI-BLAST.
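
The kind of shortcut a heuristic search takes can be sketched in a few lines of Python. This is a deliberately simplified illustration of a "seed and extend" strategy, not the actual BLAST implementation: it only looks for exact shared words, extends them without gaps, and ignores substitution matrices and statistical significance. All function names and parameter values (`k`, `match`, `mismatch`, `x_drop`) are assumptions made for this example.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every word of length k in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Extend without gaps in one direction (step = +1 or -1).

    Returns (score_gain, positions_kept); stops once the running score
    falls x_drop below the best score seen so far.
    """
    score = best = 0
    best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= x_drop:
            break
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, x_drop=2):
    """Return (score, query_start, subject_start, length) of the best hit, or None."""
    index = build_word_index(subject, k)
    best_hit = None
    for q in range(len(query) - k + 1):
        # Only positions sharing an exact word with the query are examined.
        for s in index.get(query[q:q + k], ()):
            right, r_len = extend(query, subject, q + k, s + k, 1, match, mismatch, x_drop)
            left, l_len = extend(query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            hit = (k * match + left + right, q - l_len, s - l_len, l_len + k + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


if __name__ == "__main__":
    print(seed_and_extend("TTTACGTAAA", "GGACGTCC"))
```

Because only regions that share a word with the query are ever examined, most of the database is skipped. That is where the speed comes from, and also why a weak similarity with no shared word can be missed.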



