
CSC8312 -- Bioinformatics Theory and Applications

Variants of BLAST

There are many different variants of BLAST; you should have a reasonable idea of what each of them does.

Questions
  • How does gapped BLAST differ from the original BLAST?
  • What is MEGA-BLAST? What might you use it for?

Here, we are going to try PSI-BLAST.

Iterated profile searching using PSI-BLAST

Many proteins are functionally and evolutionarily related, but their sequences have diverged so much that the relationship is not easily identified by direct sequence comparison. Thus, simple pair-wise sequence comparisons only detect a small proportion of distant evolutionary relationships. Many potentially interesting relationships are missed in simple FASTA, BLAST and Smith-Waterman searches.

In these cases the sequence identity falls below the level required for inferring homology by "normal" sequence comparison methods, into the so-called twilight zone at about 25-30% sequence identity. Comparison of three-dimensional protein structures would normally reveal the relationship between two such proteins, but 3D structures aren't available for all proteins.

More sensitive methods of searching for functional homologues of a protein are now starting to be developed. Position-Specific Iterated BLAST (PSI-BLAST) is one such tool; it uses a technique called profile searching to find relatives of a protein that a single pair-wise search would miss. PSI-BLAST is much better than normal BLAST at detecting sequences that are distantly related to your query sequence.

How does PSI-BLAST work?

A single database search might locate some sequences that are related to the query sequence. Information from these related sequences can then be used in further searches to locate yet more related sequences. Essentially we are using intermediate sequences to infer similarity between two sequences that are too dissimilar to link directly.

When using PSI-BLAST, the results of a normal BLAST search are aligned and used to construct a profile: a position-specific scoring matrix that records which residues are conserved at each position of the alignment. This profile is used for the next round of searching instead of the original query sequence. The process is repeated (iterated) until a database search finds no new related sequences. When the process ends in this fashion, it is said to have converged.
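The idea of turning a set of aligned hits into a position-specific pattern can be sketched in a few lines of Python. This is only a toy illustration, not PSI-BLAST's actual algorithm: the aligned sequences are made up, the background frequencies are assumed to be uniform, and simple add-one pseudocounts stand in for the more careful weighting that PSI-BLAST uses. The function names are hypothetical.

```python
import math
from collections import Counter

# Toy ungapped alignment of hits from a first search round (made-up sequences).
alignment = [
    "HVHVH",
    "HIHVH",
    "HVHLH",
    "HVHVH",
]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
PSEUDOCOUNT = 1.0                    # assumption: simple add-one smoothing


def build_profile(seqs):
    """Return one dict per alignment column mapping residue -> log-odds score in bits."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + PSEUDOCOUNT * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + PSEUDOCOUNT) / total
            scores[aa] = math.log2(freq / BACKGROUND)
        profile.append(scores)
    return profile


def score(profile, segment):
    """Sum the position-specific scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


profile = build_profile(alignment)
for candidate in ("HVHVH", "HIHVH", "AAAAA"):
    print(candidate, round(score(profile, candidate), 2))
```

A segment that matches the conserved positions scores higher than one that does not, and a residue seen only occasionally in a column (here the isoleucine in the second column) is still rewarded more than one never seen there. In a real iterated search, sequences scoring well against the profile would be added to the alignment and the profile rebuilt for the next round.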

Tutorial

NCBI already produce a perfectly good tutorial, so we are just going to use that one.

Using PSI-BLAST with a real-world example

A paper by Holm & Sander (1997) demonstrated the similarity between histidine triad proteins (HIT) and galactose-1-phosphate uridyltransferase (GalT) protein by superimposing their three-dimensional structures.

However, their sequence similarity is very weak, and a standard BLAST search using a HIT sequence reveals no significant hits to GalT sequences.

See if you can establish a relationship between HIT protein and GalT using PSI-BLAST.

Here is the HIT sequence:

>gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-TRIPHOSPHATASE
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLFQTTQRVGTVVE
KHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQKHDKEDFPASWRSEEEMAAEAA
ALRVYFQ

Try searching with this sequence using PSI-BLAST.

Make sure that you:

  1. Change the database to SwissProt to keep search times low
  2. Switch on the low complexity filter to remove regions of low complexity
  3. Switch off the graphical overview
  4. Try an inclusion threshold of around 4
  5. Keep your eyes open for GalT hits.
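If you would rather run the search from the command line than through the web form, the settings above map roughly onto options of the `psiblast` program in NCBI BLAST+. The sketch below only reads the FASTA record and builds the argument list; it does not run anything. The query file name `fhit.fasta`, the database name `swissprot`, the number of iterations and the threshold value passed in are all placeholders chosen for illustration, so substitute whatever your installation and the exercise call for.

```python
FASTA = """>gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-TRIPHOSPHATASE
MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLFQTTQRVGTVVE
KHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQKHDKEDFPASWRSEEEMAAEAA
ALRVYFQ
"""


def read_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return lines[0][1:], "".join(lines[1:])


def psiblast_command(query_file, db, inclusion_ethresh, iterations):
    """Build (but do not run) an argument list for the BLAST+ psiblast program."""
    return [
        "psiblast",
        "-query", query_file,                          # placeholder file name
        "-db", db,                                     # placeholder database name
        "-num_iterations", str(iterations),
        "-inclusion_ethresh", str(inclusion_ethresh),  # E-value cutoff for profile inclusion
        "-seg", "yes",                                 # low complexity filtering of the query
    ]


header, sequence = read_fasta(FASTA)
print(header.split("|")[4].split()[0], len(sequence), "residues")

cmd = psiblast_command("fhit.fasta", "swissprot", 0.005, 5)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` on a machine where BLAST+ and a suitably formatted protein database are installed.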

Things to watch for

Care must be taken when interpreting the results of a database search:

Global vs local similarity

Sometimes the similarity between two sequences only covers parts of the sequence. This may be because the two sequences only share a common domain but are not similar globally. Such sequences are unlikely to have the same function (e.g. proteins of many different functions may have an ATP binding domain). You should always look at the sequence alignment to see whether two sequences share global similarity.
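One quick way to judge this is to work out how much of each sequence the alignment actually covers. The sketch below assumes 1-based, inclusive alignment coordinates as shown in BLAST output; the coordinates in the example and the 80% cutoff are invented for illustration and are not a recognised standard.

```python
def coverage(aligned_start, aligned_end, sequence_length):
    """Fraction of a sequence covered by an alignment (1-based, inclusive coordinates)."""
    return (aligned_end - aligned_start + 1) / sequence_length


def looks_global(query_cov, subject_cov, threshold=0.8):
    """True if the alignment covers most of BOTH sequences (threshold is arbitrary)."""
    return query_cov >= threshold and subject_cov >= threshold


# Hypothetical hit: a 147-residue query aligned over residues 10-140,
# matching residues 300-430 of a 600-residue database sequence.
query_cov = coverage(10, 140, 147)
subject_cov = coverage(300, 430, 600)

print(f"query coverage:   {query_cov:.0%}")
print(f"subject coverage: {subject_cov:.0%}")
print("global similarity?", looks_global(query_cov, subject_cov))
```

Here most of the query is aligned, but the match accounts for only a small part of the database sequence, which is the pattern you would expect from a shared domain rather than global similarity.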

Sequence similarity doesn't guarantee an evolutionary relationship

Although an evolutionary relationship is the most common reason for sequence similarity, two sequences may be similar but totally unrelated. It may be that the characteristics necessary to form a common structure have arisen more than once in the course of evolution, or that the two sequences are biased towards the same composition. Sequences of biased composition (e.g. SSSGSSEGEGSSSSSS is biased towards serine) are called low complexity segments. These tend to align with other regions of similar low complexity in proteins of unrelated function. Programs like SEG and XNU are built into most BLAST servers and, usually by default, remove regions of low complexity by replacing them with a string of X's.
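To get a feel for what low complexity masking does, here is a much simplified stand-in written in Python. It is not the SEG or XNU algorithm: it just slides a window along the sequence, measures the Shannon entropy of the residues in each window, and replaces any window below a cutoff with X's. The window length of 12 and the cutoff of 2.2 bits are assumptions chosen for this example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, cutoff=2.2):
    """Replace every window whose entropy is below the cutoff with X's."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < cutoff:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


print(mask_low_complexity("SSSGSSEGEGSSSSSS"))      # the serine-rich example
print(mask_low_complexity("MSFRFGQHLIKPSVVFLKTE"))  # start of the FHIT sequence
```

The serine-rich example contains only three different residues, so every window falls below the cutoff and the whole segment is masked, whereas the start of the FHIT sequence is varied enough to be left untouched.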

Don't believe everything in the databases

The functional annotations of genes in the databases are produced by a variety of methods and are not always correct. Since much annotation itself results from similarity searches, an incorrect annotation has a tendency to multiply in the database.

What is a significant match? How do I interpret it?

This is a difficult question to answer - it really should be done on a case-by-case basis. But to generalise:

Use the E-value (expect value) to judge the significance of the similarity between sequences. Although this will vary according to the size and contents of the database, generally the lower this value, the more likely your two sequences are to be related.

Don't be afraid to change the input parameters and the score matrices. BLOSUM62 is the usual default matrix and is a good general purpose starting point. However, other matrices like PAM250 may give better results.
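The dependence on database size can be made concrete with the standard relationship between a BLAST bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n the total database length in residues. The short Python sketch below uses raw lengths rather than the edge-corrected "effective" lengths that BLAST itself reports, and the bit score and database sizes are invented for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score: E = m * n * 2^(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


QUERY_LENGTH = 147       # length of the FHIT query used in this practical
SMALL_DB = 100_000_000   # hypothetical database size in residues
LARGE_DB = 1_000_000_000 # hypothetical database ten times larger

e_small = expect_value(40, QUERY_LENGTH, SMALL_DB)
e_large = expect_value(40, QUERY_LENGTH, LARGE_DB)

print(f"bit score 40, small database: E = {e_small:.3g}")
print(f"bit score 40, large database: E = {e_large:.3g}")
print(f"bit score 50, large database: E = {expect_value(50, QUERY_LENGTH, LARGE_DB):.3g}")
```

The same alignment becomes ten times less significant when the database is ten times larger, which is why E-values from searches against different databases should not be compared directly.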

