Some points about similarity searching


 

Care must be taken when interpreting the results of a database search:


 

Global vs local similarity

·          Sometimes the similarity between two sequences covers only part of each sequence. This may be because the two sequences share a common domain but are not similar globally.

·          Such sequences are unlikely to have the same function (e.g. proteins of many different functions may have an ATP binding domain).

·          You should always look at the sequence alignment to see whether two sequences share global similarity.
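One way to make the global versus local check concrete is to ask how much of each sequence the alignment actually covers. The following is a minimal sketch in Python, not part of any BLAST tool: the function names and the 0.8 threshold are assumptions chosen for illustration, and the coordinates are assumed to be 1-based and inclusive, as they are usually shown in alignment reports.

```python
def alignment_coverage(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Fraction of a sequence covered by an aligned region (1-based, inclusive)."""
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not (1 <= aln_start <= aln_end <= seq_length):
        raise ValueError("alignment coordinates fall outside the sequence")
    return (aln_end - aln_start + 1) / seq_length


def looks_global(query_cov: float, subject_cov: float, threshold: float = 0.8) -> bool:
    """Hypothetical rule of thumb: the alignment spans most of BOTH sequences."""
    return query_cov >= threshold and subject_cov >= threshold


# A 100-residue query aligned over its full length to residues 51-100
# of a 400-residue database sequence: a shared domain, not global similarity.
query_cov = alignment_coverage(1, 100, 100)
subject_cov = alignment_coverage(51, 100, 400)
print(query_cov, subject_cov, looks_global(query_cov, subject_cov))
```

A low coverage on either sequence suggests a local match only, which is the cue to inspect the alignment by eye before inferring a shared function.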


 

Sequence similarity doesn’t guarantee an evolutionary relationship

·          Although an evolutionary relationship between two sequences is the most common reason for sequence similarity, two sequences may be similar but totally unrelated.

·          It may be that the characteristics necessary to form a common structure have arisen more than once in the course of evolution, or that the two sequences are biased towards the same composition.

·          Sequences that are of biased composition (e.g. SSSGSSEGEGSSSSSS is biased towards serine) are called low complexity segments. These will tend to align with other regions of similar low complexity in proteins of unrelated function.

·          Programs like SEG and XNU are built into most BLAST servers and, usually by default, mask regions of low complexity by replacing them with a string of X’s.
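To illustrate the idea of masking, here is a minimal sketch in Python that flags windows with a low Shannon entropy and replaces them with X's. This is not the SEG or XNU algorithm; it is a simplified stand-in, and the window size of 12 and the cutoff of 2.2 bits are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12, cutoff: float = 2.2) -> str:
    """Replace every residue that falls in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(seq[start:start + window]) < cutoff:
            masked[start:start + window] = "X" * window
    return "".join(masked)


print(mask_low_complexity("SSSGSSEGEGSSSSSS"))
print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY"))
```

The serine-rich example from the text contains only three residue types, so every window falls below the cutoff and the whole segment is masked, while a segment with a varied composition passes through unchanged.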


 

Don’t believe everything in the databases

·         The functional annotations of genes in the databases are done by a variety of methods and are not always correct.

·         Since much annotation itself results from similarity searches, an incorrect annotation has a tendency to propagate through the database.


 

What is a significant match? How do I interpret it?

This is a difficult question to answer – it really should be done on a case-by-case basis. But to generalise:

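Use the E score (expectation value) to judge the significance of a match: it estimates the number of matches with a score at least as good that would be expected by chance in a database of that size. Although this will vary according to the size and contents of the database, generally the lower this value, the more likely your two sequences are to be related.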
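The dependence on database size can be seen from the standard relationship between a normalised bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n is the total length of the database. The sketch below, in Python, applies that formula directly. It is a simplification: real BLAST programs also correct m and n for edge effects, and the function name and the example numbers are assumptions chosen for illustration.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E = m * n * 2**(-S'), ignoring the effective-length corrections BLAST applies."""
    if query_length <= 0 or database_length <= 0:
        raise ValueError("lengths must be positive")
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment (same bit score) searched against databases of different sizes.
for db_residues in (1_000_000, 100_000_000):
    e = expected_hits(bit_score=40.0, query_length=300, database_length=db_residues)
    print(f"database of {db_residues} residues: E = {e:.2e}")
```

The same alignment is therefore less convincing in a larger database, which is why E scores from searches against different databases should not be compared directly.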

Don’t be afraid to change the input parameters and the score matrices. BLOSUM62 is the usual default matrix and is a good general purpose starting point. However, other matrices like PAM250 may give better results, particularly when looking for more distantly related sequences.

 

