A Brief Look at EPCLUST
This section introduces EPCLUST, a tool for clustering and visualising data from DNA arrays. EPCLUST is part of a larger toolset called Expression Profiler, written by Dr. Jaak Vilo, formerly of the EBI. A list of the tools contained within Expression Profiler can be seen here.
EPCLUST takes data generated from DNA arrays, usually microarrays, and allows genes that show similar expression profiles to be grouped together (clustered). The program allows the application of a number of different clustering techniques.
EPCLUST has a large amount of data built in from publicly available datasets (from the yeast Saccharomyces cerevisiae, and human). The program also allows you to upload your own datasets for analysis.
In this part of the tutorial we will use one of the built-in datasets to carry out a simple clustering exercise. The analysis of microarray data is a complex and rapidly evolving subject, so this tutorial will only give you a taste. Hopefully, though, it will encourage you to explore the other options yourself!
Click here to bring up the EPCLUST program.
Step 1: The first option in the form lets you choose between using an existing dataset and uploading your own. We will use existing data, so click on the link that reads 'Select data for analysis from the public data sets'. (show me)
Step 2: You will be presented with a form from which you can choose your datasets. We are going to use the Test_and_Demo set for Saccharomyces cerevisiae.
Click on the Saccharomyces cerevisiae option. Check the box on the third row, containing the test_and_demo link.
This dataset contains a number of experiments. We'll just look at how gene expression varies as yeast sporulates, using the sporulation dataset. Enter 'Sporulation_*' into the text box on the third row to select all the sporulation experiments in the test_and_demo dataset.
Finally, click on the grey button labelled 'Select the corresponding experiments'. (show me)
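To see what a pattern such as 'Sporulation_*' does, here is a minimal Python sketch. It assumes the text box treats '*' as a shell-style wildcard meaning "any run of characters", which is how the pattern is used in this step but is not something we have checked against EPCLUST itself. The experiment names are made up for illustration and are not the real names in the test_and_demo dataset.

```python
from fnmatch import fnmatchcase

# Hypothetical experiment names, invented for this illustration only.
experiment_names = [
    "Sporulation_t0",
    "Sporulation_t2",
    "CellCycle_alpha_t0",
    "Diauxic_shift_t1",
]

# Assumption: '*' matches any run of characters, as in shell wildcards.
pattern = "Sporulation_*"

# Keep only the names that match the pattern (case-sensitive).
selected = [name for name in experiment_names if fnmatchcase(name, pattern)]

print(selected)
```

Only the names that begin with 'Sporulation_' are kept; everything else in the dataset is left out of the analysis.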
Step 3: You'll then be presented with a complicated-looking screen that summarises the contents of the dataset and allows you to refine your selection. You'll be glad to hear that, for the sake of simplicity, we are not going to alter any of these settings. Simply click on the grey button at the bottom labelled 'Select the data'. (show me)
Step 4: EPCLUST then puts your data in a unique folder for you to come back to if you wish. However, we are just going to cluster it. There are two possible choices for clustering techniques:
i) Clustering using hierarchical clustering
ii) Clustering using K-means
For this tutorial we'll just try K-means clustering. K-means assigns each gene expression profile to one of a given number of clusters.
At the beginning we need to tell the program how many clusters to use. The default value is ten. We are going to change this to twenty.
Enter the value 20 into the box labelled 'k=' next to the grey button labelled 'Cluster and visualise with K-means!'. Then, leaving all the other options unchanged, click on the 'Cluster and visualise with K-means!' button. (show me)
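If you are curious about what happens when you press the button, the general idea of K-means can be sketched in a few lines of Python. This is the textbook algorithm, not EPCLUST's own code: EPCLUST's distance measure, starting points and stopping rule may well differ. The six expression profiles, the function names and the parameters below are all invented for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(profiles):
    """Average profile of a group, computed time point by time point."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]

def kmeans(profiles, k, iterations=100, seed=0):
    """Textbook K-means: returns (cluster label per profile, cluster centres)."""
    rng = random.Random(seed)
    # Start from k randomly chosen profiles as the initial centres.
    centres = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every profile to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centres[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has settled
        assignment = new_assignment
        # Move each centre to the mean of the profiles assigned to it.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # an empty cluster keeps its old centre
                centres[c] = mean_profile(members)
    return assignment, centres

# Toy data: six made-up genes measured at four time points.
profiles = [
    [0.1, 0.9, 2.0, 3.1],
    [0.0, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.1, 0.0],
    [2.9, 2.1, 0.9, 0.2],
    [1.0, 1.0, 1.1, 0.9],
    [1.1, 0.9, 1.0, 1.0],
]

labels, centres = kmeans(profiles, k=3)
print(labels)
```

Because the starting centres are picked at random, different seeds can give different groupings. That is one reason why it is worth inspecting the clusters by eye rather than taking them at face value.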
Step 5: You will now be presented with a page that details the gene expression clusters and their members. The expression profiles of the genes assigned to each cluster are shown as graphs.
Scroll down the page and look at the expression profiles in each cluster. Do all the clusters contain similar profiles?
Go back and set the number of clusters to 40. Examine the results and compare the effectiveness of the clustering using 40 clusters to that using 20 clusters.
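When you compare 20 and 40 clusters, bear in mind that using more clusters will usually make each cluster look tighter, simply because each one holds fewer genes. The sketch below illustrates this with a common measure, the within-cluster sum of squares (smaller means tighter). It is only an illustration: the profiles and the two groupings are made up by hand, and EPCLUST does not necessarily report this number.

```python
def within_cluster_ss(profiles, labels):
    """Sum of squared distances from each profile to its cluster's mean."""
    clusters = {}
    for profile, label in zip(profiles, labels):
        clusters.setdefault(label, []).append(profile)
    total = 0.0
    for members in clusters.values():
        # Mean profile of this cluster, time point by time point.
        centre = [sum(values) / len(members) for values in zip(*members)]
        for profile in members:
            total += sum((x - c) ** 2 for x, c in zip(profile, centre))
    return total

# Toy data: six made-up genes measured at four time points.
profiles = [
    [0.1, 0.9, 2.0, 3.1],
    [0.0, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.1, 0.0],
    [2.9, 2.1, 0.9, 0.2],
    [1.0, 1.0, 1.1, 0.9],
    [1.1, 0.9, 1.0, 1.0],
]

# Two hand-made groupings: the finer one splits the second coarse cluster.
coarse_labels = [0, 0, 1, 1, 1, 1]
fine_labels = [0, 0, 1, 1, 2, 2]

wss_coarse = within_cluster_ss(profiles, coarse_labels)
wss_fine = within_cluster_ss(profiles, fine_labels)

print("2 clusters:", round(wss_coarse, 3))
print("3 clusters:", round(wss_fine, 3))
```

The finer grouping gives the smaller value. So when judging 40 clusters against 20, ask not only whether the profiles within a cluster look more alike, but also whether genes that clearly belong together have been split across several clusters.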
Look at the functional annotation of the genes. Do genes of similar function fall into the same cluster?
Wondering how the different clustering methods work? It's quite a complex subject but there are some explanations dotted around the web:
A Simple Introduction to Principal Component and Cluster Analysis
Some explanations in the Cluster and Treeview software manual
What difference do the clustering techniques make? Have a look at a simple dataset clustered using different methods here.