Now we need to install an analysis package to deal with this data. We are going to use affylmGUI.
If you are using Ubuntu, this may require some additional installation. See the Bwidget and TkTable sections of http://bioinf.wehi.edu.au/affylmGUI/ for details.
Start R and, at the R command line, enter:
biocLite("affylmGUI", dependencies=TRUE)
library(affylmGUI)
affylmGUI()
You should understand what each of these steps achieves. This should pop up a
fairly ugly-looking interface, which is affylmGUI. Although this GUI is
rather horrible, it does take you through the individual steps in a
relatively straightforward way.
The first step is to load the raw data into the system for processing.
From the 'File' menu, select 'New' (or press CTRL-N). If you are not
already in the directory containing the CEL files, navigate to it
now with the 'Browse' button and click 'OK'.
You will now be asked for a 'Targets file'. This is the tab-delimited text
file which describes the experiment. We called it phenodata.txt.
You can check that the experimental information was loaded by selecting 'RNA Targets' from the menu and confirming that it replicates the contents of your phenodata.txt file.
If you select the 'Raw Affy Data' checkbox, it should say 'Available': affylmGUI has found your input files. If you select the 'Normalised Affy Data' checkbox, it should say 'Not Available', because the raw data has not yet been normalised.
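As a rough illustration of what such a targets file looks like, the sketch below builds a small tab-delimited table inline and reads it with base R. The column headings Name, FileName and Target are an assumption about the layout affylmGUI expects, and the sample and CEL file names are hypothetical; check them against your own phenodata.txt.

```r
# Toy targets file written inline; the column headings are assumed,
# and the sample and file names are hypothetical.
targets_text <- paste(
  "Name\tFileName\tTarget",
  "iris1\tiris1.CEL\tiris",
  "retina1\tretina1.CEL\tretina",
  "huvec1\thuvec1.CEL\thuvec",
  "choroid1\tchoroid1.CEL\tchoroid",
  sep = "\n"
)

# read.delim() reads tab-delimited text with a header row.
targets <- read.delim(text = targets_text, stringsAsFactors = FALSE)

# Check that every array has a file name and a target label.
print(targets)
table(targets$Target)
```

To inspect a real file, the same call with a file path in place of `text = ...` should work.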
The purpose of this step is to adjust the data for technical variation, as opposed to biological differences between the samples. There will always be slight discrepancies between the hybridisation processes for each array, and these variations tend to lead to scaling differences between the overall fluorescence intensity levels of the arrays. For example, the quantity of RNA in a sample, the length of time a sample spends hybridising, or the volume of a sample can all introduce significant variation. Even subtle physical differences between arrays, or between the scanners used to read them, can have an effect.
Put simply, normalisation ensures that when comparing expression levels between different arrays we are, as far as possible, comparing like with like. Studies have shown that the normalisation method used has a significant effect on the final differential expression levels, so it is vital to choose an appropriate method.
To normalise the data, select the 'Normalization' menu, and click 'Normalize'. Select GCRMA for the normalisation method and click 'OK'. GCRMA is effectively a standard method for Affymetrix expression arrays of this type.
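GCRMA combines background correction, quantile normalisation and probe set summarisation. The sketch below illustrates only the quantile normalisation idea, on a made-up intensity matrix in base R. It is not the GCRMA implementation, and the function name quantile_normalise is hypothetical.

```r
# Toy intensity matrix: rows are probes, columns are arrays (values are made up).
x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8),
            nrow = 4,
            dimnames = list(paste0("probe", 1:4),
                            c("arrayA", "arrayB", "arrayC")))

quantile_normalise <- function(m) {
  # The mean of the sorted values across arrays gives a reference distribution.
  reference <- rowMeans(apply(m, 2, sort))
  # Replace each value by the reference value of the same rank within its array.
  ranks <- apply(m, 2, rank, ties.method = "first")
  normalised <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalised) <- dimnames(m)
  normalised
}

x_norm <- quantile_normalise(x)

# After normalisation, every array contains the same set of values.
apply(x_norm, 2, sort)
```

Each array keeps its own ranking of probes, but the overall intensity distributions become identical, which removes array-wide scaling differences.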
If you have launched R from a terminal, you will see that in the console additional packages have been downloaded and installed:
hgu133plus2cdf_2.6.0.tgz
hgu133plus2probe_2.6.0.tgz
The CDF package is the 'Chip Description Format'. For technical information about this file format for Affymetrix arrays you can have a look at this website: http://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cdf.html
The probe package has information about the probe sequences. It is tied to the positional information in the CEL file and the CDF, and contains the probe set names, probe positions, and target strand of each probe in addition to its sequence. The probe information from the PM (PerfectMatch) and MM (MisMatch) probes is used by the GCRMA algorithm.
What is the difference between the PM and MM probes? If you do not know, please find out.
When the GCRMA normalisation has finished, 'Normalized Affy Data' will be shown as available in the left-hand pane.
Before we can analyse the data, we need to compute the linear model fit.
Select the 'Linear Model' option from the menu and then 'Compute Linear Model Fit'. When this is done, you will see that the 'Linear Model Fit' is marked as available in the left-hand pane.
The linear model and subsequent steps are provided by the Bioconductor package limma (the 'lm' part of affylmGUI). limma is used to fit a linear model for each gene for a given set of arrays. The coefficients of the fitted models describe the differences between the RNA sources hybridised to the arrays. We have already described the coefficients. In this experiment there are four: iris, retina, huvec and choroid. In statistical terms, we have a single factor with four levels.
These levels are used to create a design matrix which links the factors to the data in the arrays. This information is then used to fit the linear model for each gene.
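To make the design matrix concrete, here is a minimal base R sketch for a single factor with these four levels. The number of arrays per tissue (two) and the expression values are made up for illustration, and affylmGUI builds its own design matrix internally, so this shows only the general idea.

```r
# One factor with four levels; two arrays per level is an assumption.
tissue <- factor(rep(c("iris", "retina", "huvec", "choroid"), each = 2),
                 levels = c("iris", "retina", "huvec", "choroid"))

# "~ 0 + tissue" gives one column (coefficient) per tissue, with no intercept.
design <- model.matrix(~ 0 + tissue)
colnames(design) <- levels(tissue)
design

# Made-up log2 expression values for a single gene across the eight arrays.
expr <- c(7.1, 6.9, 8.0, 8.2, 5.0, 5.2, 7.5, 7.7)

# Least-squares fit: each coefficient is the mean log2 expression for that tissue.
fit <- lm.fit(design, expr)
round(fit$coefficients, 2)
```

With this parameterisation, each row of the design matrix marks which tissue an array belongs to, and the fitted coefficients are simply the per-tissue means for the gene.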
Limma then requires you to create 'contrasts'. These are comparisons between the factors (conditions) in the experiment in order to determine differentially expressed genes.
Under the 'Linear Model' menu, you can 'Compute Contrasts'. What we are interested in is the comparison of each of our cell lines from the eye against the HUVEC cell line. Set up your contrasts appropriately. Click 'OK' and enter a name for this set of contrasts.
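A contrast is just a weighted combination of the fitted coefficients. The sketch below assumes one contrast per eye tissue against HUVEC, which matches the three output files produced at the end of this section. The coefficient values and contrast names are hypothetical.

```r
# Hypothetical per-tissue coefficients (mean log2 expression) for one gene.
coefs <- c(iris = 7.0, retina = 8.1, huvec = 5.1, choroid = 7.6)

# Each column is one contrast: an eye tissue minus HUVEC.
contrasts <- cbind(
  iris_vs_huvec    = c(iris = 1, retina = 0, huvec = -1, choroid = 0),
  retina_vs_huvec  = c(iris = 0, retina = 1, huvec = -1, choroid = 0),
  choroid_vs_huvec = c(iris = 0, retina = 0, huvec = -1, choroid = 1)
)

# Contrast estimates are log2 fold changes relative to HUVEC.
log_fc <- drop(coefs %*% contrasts)
log_fc
```

A positive value means the gene is more highly expressed in the eye tissue than in HUVEC, and a negative value means the opposite.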
At this point limma will calculate differentially expressed genes for the listed set of contrasts. The experiment is effectively analysed at this point.
To view the lists of differentially expressed genes, you can now use
the 'TopTable' menu to get a table of genes ranked by differential expression.
It will ask you to select from a list of contrast parameterizations (in this
case there will be only one), and then you can select the specific contrast
you are interested in.
Because we are conducting tens of thousands of statistical tests, we need to adjust for multiple testing, a process called, strangely enough, 'Multiple Testing Correction'. BH is selected by default; this is shorthand for Benjamini-Hochberg, which is used quite widely for microarray analysis. Sort the TopTable output by p-value, and then select 'All genes'. As this table is very large, it will be displayed in a text window, and you can save it to a file. Name the file after the contrast. Repeat this for the other two contrasts, so that you have three files, each sorted by adjusted p-value.
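The Benjamini-Hochberg adjustment can be tried out in base R with p.adjust(). The sketch below uses made-up p-values for six hypothetical genes and repeats the calculation by hand to show what the adjustment does. The column name adj.P.Val is chosen here only as a label.

```r
# Made-up raw p-values for six hypothetical genes.
p <- c(geneA = 0.001, geneB = 0.008, geneC = 0.039,
       geneD = 0.041, geneE = 0.27, geneF = 0.60)

# Built-in Benjamini-Hochberg adjustment.
p_bh <- p.adjust(p, method = "BH")

# The same calculation by hand: p * n / rank, then enforce monotonicity
# from the largest p-value downwards and cap the result at 1.
n <- length(p)
o <- order(p, decreasing = TRUE)
manual <- pmin(1, cummin(p[o] * n / (n:1)))[order(o)]

# Rank genes by adjusted p-value, as in a table of top genes.
data.frame(p = p, adj.P.Val = p_bh)[order(p_bh), ]
```

Adjusted p-values are never smaller than the raw ones, and the ordering of genes is preserved, so sorting by raw or adjusted p-value gives the same ranking.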