Multiple Sequence Alignments and InterPro

This section of the tutorial is split into two parts. The first part will show you how to perform multiple alignments of DNA and protein sequences using ClustalW and how to format the output for inclusion in a Word document. This process allows you to identify the regions of conserved sequence by eye.

The second part will describe how to search for patterns and motifs in a protein sequence, comparing them to motif entries in a database for functional analysis purposes.

Clustal

In this section, we will perform a multiple sequence alignment using ClustalW, which is probably the best-known multiple sequence alignment program.

In this exercise we will align protein sequences of the enzyme alcohol dehydrogenase from four different organisms. Click here to bring up the sequences in FASTA format in a new window, then copy the entire set of sequences to the clipboard.
We are going to use the web-based interface for ClustalW at the European Bioinformatics Institute (EBI). There are many other interfaces for Clustal, but this one is relatively local.
Paste the protein sequences into the large text box. Notice that you can have the results emailed to you if you wish; in this case we will just view them on screen. Leave all the other parameters at their defaults, but take note of the range of options available on this page. Click on the grey Run button to start the alignment.
The results are correct but hard to read. Try clicking on the "Run JalView" button, which should display your alignment in a Java application that you can interact with.
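
If you would like to check by program what you are spotting by eye, the following is a minimal Python sketch that reads an aligned set of sequences in FASTA format and marks the fully conserved columns with a `*`, in the spirit of the consensus line under a Clustal alignment. The function names and the three toy sequences are made up for illustration (they are not the alcohol dehydrogenase sequences used in the exercise), and only fully identical columns are marked.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: take the first word after '>' as the identifier.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the current record.
            records[name] += line
    return records


def conservation_line(aligned):
    """Return '*' for each fully conserved column and ' ' for all others."""
    seqs = list(aligned.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*seqs):
        residues = set(column)
        # A column is conserved if every sequence has the same residue
        # and that residue is not a gap character.
        conserved = len(residues) == 1 and "-" not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)


# Made-up aligned toy sequences, not real alcohol dehydrogenase data.
ALIGNED = """>seq1
MKT-AGH
>seq2
MKTLAGH
>seq3
MRT-AGH
"""

records = parse_fasta(ALIGNED)
line = conservation_line(records)
for name, seq in records.items():
    print(f"{name:6}{seq}")
print(f"{'':6}{line}")
```

With your own alignment, you would save the aligned sequences in FASTA format and pass the file contents to `parse_fasta` instead of the toy text.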

Protein Motifs — some revision

When aligning protein sequences it is often apparent that certain regions, or specific amino acids, are more conserved than others. Such regions are often conserved because they encode a part of the protein that is functionally important. The term motif is used to refer to a part of a protein sequence that is associated with a particular biological function.

For example, a region of a protein that binds ATP is characterised by an ATP binding motif in the protein sequence. Since these regions are conserved, they may be recognisable by the presence of a particular sequence of amino acids called a pattern. A pattern is thus a qualitative description of a motif in terms of amino acid sequence.
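
To make the idea of a pattern concrete, here is a minimal Python sketch that translates a PROSITE-style pattern into a regular expression and searches a sequence with it. The helper name `prosite_to_regex` is hypothetical, the translation covers only the common syntax elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors), and the test sequence is made up. The pattern string is the widely quoted ATP/GTP-binding "P-loop" pattern; check the PROSITE entry itself before relying on it.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # '<' anchors the match to the N-terminus, '>' to the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split the element into its core and an optional repeat count,
        # for example 'x(4)' or 'x(2,3)'.
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":
            core = "."  # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any amino acid except these
        count = "{" + repeat + "}" if repeat else ""
        parts.append(prefix + core + count + suffix)
    return "".join(parts)


# Commonly quoted P-loop pattern (assumption: verify against PROSITE).
P_LOOP = "[AG]-x(4)-G-K-[ST]."
regex = prosite_to_regex(P_LOOP)

# Made-up test sequence for illustration only.
sequence = "MKLVVVGAGGVGKSALT"
hit = re.search(regex, sequence)
if hit:
    print(f"pattern {regex} matches {hit.group()} at position {hit.start() + 1}")
```

Note that a pattern either matches or it does not: a single substitution at a fixed position is enough to lose the hit.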

The concept of a profile extends this idea, allowing a quantitative description of a motif by assigning probabilities to the occurrence of each amino acid at each position of the motif. Profiles can therefore be used to describe very divergent motifs.
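
As a rough illustration of how a profile assigns probabilities position by position, the Python sketch below builds a simple position-specific probability table from a few aligned motif instances and scores candidate segments against it. Everything here is an assumption made for the example: the function names, the toy three-residue instances, the pseudocount of 1 and the uniform background frequency of 1/20. Real profile databases use considerably more sophisticated weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(instances, pseudocount=1.0):
    """Return one {residue: probability} dict per motif position."""
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("motif instances must all have the same length")
    profile = []
    for column in zip(*instances):
        counts = Counter(column)
        # The pseudocount keeps unseen residues from getting probability 0.
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(profile, segment, background=1 / 20):
    """Sum the per-position log2 odds of the segment against a background."""
    if len(segment) != len(profile):
        raise ValueError("segment length must match profile length")
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(segment)
    )


# Made-up aligned motif instances, for illustration only.
instances = ["GKS", "GKT", "GKS", "AKT"]
profile = build_profile(instances)

for candidate in ["GKS", "AKS", "WWW"]:
    print(candidate, round(log_odds_score(profile, candidate), 2))
```

Unlike a pattern, the profile still gives a graded score to a segment such as AKS that was never seen in exactly that form, which is why profiles cope better with divergent motifs.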

The presence of a particular motif within a protein sequence can be used to suggest functions for uncharacterised proteins.

A number of databases have been constructed that attempt to describe particular protein motifs in terms of patterns and profiles. They allow you to search for patterns or profiles that are indicative of particular functional motifs within a query protein.

These databases all have different areas of optimum application, and it is difficult to tell in advance which one will give the best results. Each has particular strengths and weaknesses, so ideally you would need to use them all.

However, a database called InterPro has recently been established that combines information from PRINTS, PROSITE, ProDom and Pfam (click here for the reference describing InterPro). Using InterPro saves a lot of work, since we can essentially search many databases in one go.

The following exercise will guide you through the use of InterPro to look for motifs in some example protein sequences.

InterPro

In this exercise we will analyse some example protein sequences for motifs, to get the hang of using InterProScan.

These sequences are mystery sequences. Use InterPro to see whether you can assign a putative function to them based on the motifs that you find. Start with the first sequence, then pick some of the others.

Bring up the InterProScan web interface at EBI in a new window. Paste your sequence into the text box labelled "Enter or cut and paste protein sequence here". You may want to add your email address to the relevant box so that the results are sent back to you. When you are ready, click on the "Run" button. This can take quite a long time, so be prepared to wait.

You will be presented with the results in tabular form. If you are unsure what they mean, try the tutorial, which explains them.

Although many biologists still use BLAST as their first port of call, InterPro is probably a better place to start; in most cases, it gives a more definitive answer about the function of a protein, with less effort.

CSC8312 -- Bioinformatics Theory and Applications
