CSC8309 -- Gene Expression and Proteomics

The GUI is fine, but it's pretty slow, and painful if you have to do lots of chips. This is particularly true as R tends to slow your computer down — it's a lot easier to make a cup of tea, rather than working, if you have a single machine.

Fortunately, BioConductor allows us to do the same analysis at the comment line.

## load the libraries
source ("http://bioconductor.org/biocLite.R")
biocLite("GEOquery")
library(GEOquery)
library( simpleaffy )


## Download raw files:
getGEOSuppFiles("GSE20986")

## Uncompress them:

untar("GSE20986/GSE20986_RAW.tar")
cels<-list.files(".", pattern = "[gz]")
for (celfile in cels) {
   gunzip (celfile)
}

If this doesn't work, try and figure out why.

Now we need to read in the CEL files into an 'AffyBatch' object - this is a particular data structure for storing Affymetrix data. We are going to find out some information about the data and normalise it.

celfiles<-read.affy(covdesc="phenodata.txt")

## this should give you output to read. Remember the type of object,
## in case you need to look up the documentation.
str(celfiles)

celfiles.gcrma<-gcrma(celfiles)

## To get information about the CEL filenames we read in you can do:

celfiles.gcrma$FileName

How would you extract the information about which tissue is on which chip?

In this case, we want to look at the expression values, so another step is required.

celfiles.exprs<-exprs( celfiles.gcrma )

## now lets look at the expression levels across chips.
head( celfiles.exprs )

One of the things that we failed to address using affylmGUI was that of filtering, even though we did some QC checks. This analysis is not supported directly in the afflmGUI, and is an additional useful step.

Filtering is able to remove the AFFX control probesets and other internal controls as well as removing genes with low variance, that would be unlikely to pass statistical tests for differential expression, or are expressed uniformly close to background detection levels.

As this step relies in part on annotation information, you will have to install the annotation library files for the Affymetrix chip. The modifiers to nsFilter in this case just tell nsFilter not to ditch probesets without Entrez Gene identifiers, or have duplicated Entrez Gene identifiers.

## install annotation files
biocLite("hgu133plus2.db")
library(hgu133plus2.db)
celfiles.filtered<-nsFilter(celfiles.gcrma, require.entrez=FALSE, remove.dupEntrez=FALSE)

## In order to see what probesets were removed, and why:

celfiles.filtered$filter.log

Does this operation create the new object as an ExpressionSet?
How many probesets are removed from the analysis?
What percentage is this of all the probesets on the chip?

It is common to remove 40-60% of probesets from a microarray experiment at the filtering stage. Now we have a filtered dataset, we can send the information to limma for differential gene expression analysis. First of all we need to extract information about the samples:

samples<-celfiles.gcrma$Target

## You can check the results by running:

samples


# which will return:

# [1] "iris"    "retina"  "retina"  "iris"    "retina"  "iris"    "choroid"
# [8] "choroid" "choroid" "huvec"   "huvec"   "huvec"

# We now convert these into factors for limma:
samples<-as.factor(samples)

# Check the values again
samples

# should give
# [1] iris    retina  retina  iris    retina  iris    choroid choroid choroid
# [10] huvec   huvec   huvec
# Levels: choroid huvec iris retina

Next we set up the experimental design.

design = model.matrix(~0 + samples)
colnames(design)<-c("choroid", "huvec", "iris", "retina")

If you inspect the design it shows that each one of the 12 chips has a 1 in the appropriate column for the tissue type.

design

# This should output the following:
#
#    choroid huvec iris retina
# 1        0     0    1      0
# 2        0     0    0      1
# 3        0     0    0      1
# 4        0     0    1      0
# 5        0     0    0      1
# 6        0     0    1      0
# 7        1     0    0      0
# 8        1     0    0      0
# 9        1     0    0      0
# 10       0     1    0      0
# 11       0     1    0      0
# 12       0     1    0      0
#
# attr(,"assign")
#
# attr(,"contrasts")
# attr(,"contrasts")$samples

# Now we can load up limma and start the analysis:

library(limma)

# First we fit the linear model, using the filtered celfiles Expression Set object

fit=lmFit(exprs(celfiles.filtered$eset), design)

# Now we set up a contrast matrix to compare all the tissues against the HUVEC line:

contrast.matrix = makeContrasts(huvec_choroid = huvec - choroid, huvec_retina = huvec - retina, huvec_iris = huvec - iris, levels=design)

# You can inspect this with:

contrast.matrix

# which will display:
#
#          Contrasts
# Levels    huvec_choroid huvec_retina huvec_iris
#   choroid            -1            0          0
#   huvec               1            1          1
#   iris                0            0         -1
#   retina              0           -1          0

# Now the contrast matrix is combined with the per-probeset linear model fit.

huvec_fits<-contrasts.fit(fit, contrast.matrix)
huvec_ebFit=eBayes(huvec_fits)

# And we can return the top 10 results for any given contrast (where coef=1 is
# huvec_choroid, coef=2 is huvec_retina etc.):

topTable(huvec_ebFit, number=10, coef=1)

Lets get topTable to return the full result set.

## To write the full results set to a file:
## top table also applies the multiple test correction by default (using
## "adjust.method")
results<-topTable(huvec_ebFit, coef=1, number=100000000, p.value=0.05)
write.table(results, "huvec_choroid_limma.txt",  sep="\t")

Now, we want to compare the lists we have derived from this process with the equivalent lists from afflmGUI.