Fetching and Installing

The first thing that we need is a data set. In this case, we are going to use GEO, the Gene Expression Omnibus. This is a database storing MIAME¹-compliant microarray data. It is not the only such resource; ArrayExpress, which is hosted at the EBI, does a similar job. We are using GEO because this is what our paper does. You can read some background information in a recent paper.

Create a new directory. All of your work will be done within this directory. You should use the hard drive of your local machine, as you are going to put a large amount of data in it, so your h-drive is not appropriate.

Please make sure you back up your own work, however.
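If you prefer to stay inside R, the directory can be created from the R prompt as well. The following is only a minimal sketch: the helper `make_workdir` and the directory name `microarray-practical` are hypothetical, chosen for illustration, and you should substitute a location on your local hard drive.

```r
## Create a working directory if it does not exist yet and
## return its full path. (make_workdir is a hypothetical helper.)
make_workdir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  normalizePath(path)
}

## Example use (the directory name is an assumption):
## workdir <- make_workdir("microarray-practical")
## setwd(workdir)   # do all further work from inside this directory
```

Calling `setwd()` afterwards means that relative paths used later resolve inside this directory.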

The paper provides a GEO accession number for their data set. Find the GEO website. You will need to use their query interface to retrieve the entry. The files that you want are in the supplementary files, providing the raw data.

There is also a link in the paper to a .zip file with all the data and the code. Do NOT download this. We will be discussing the contents of this .zip file during the Data Standards section.

Unpack the downloaded .tar file into a subdirectory called "geo-data".
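The unpacking step can be done with any archive tool; as one possible approach, R's own `untar()` function will do it. The sketch below assumes the archive has already been downloaded into your working directory. The helper name `unpack_geo_archive` and the file name in the usage comment are placeholders, not the real name of the file you will get from GEO.

```r
## Unpack a downloaded .tar archive into a subdirectory and list
## what was extracted. (unpack_geo_archive is a hypothetical helper.)
unpack_geo_archive <- function(tar_file, exdir = "geo-data") {
  if (!file.exists(tar_file)) {
    stop("archive not found: ", tar_file)
  }
  ## create the target directory if it is not there yet
  dir.create(exdir, showWarnings = FALSE, recursive = TRUE)
  untar(tar_file, exdir = exdir)
  list.files(exdir)
}

## Example use (replace the placeholder with your downloaded file):
## unpack_geo_archive("the-downloaded-archive.tar")
```

The returned file listing gives a quick check that the raw data files really ended up in "geo-data".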

Installing Bioconductor

You should already have R installed on your machines, but we need additional resources to perform any microarray analysis, in this case the Bioconductor base packages.

Bioconductor provides a nice script to automate this process for you, as there are lots of packages to install. Start the R GUI and evaluate the following:

## install the basic Bioconductor packages
## (on current Bioconductor releases, biocLite has been replaced by
## BiocManager::install())
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite("soybeanprobe")


## ensure installation before carrying on.


This takes a while, as it requires network access. You only need to do this once, however. If you do it more than once, no harm will be done, but the packages will be downloaded and installed again.

We are also installing the soybeanprobe library, which we will use later.
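To ensure that the installation succeeded before carrying on, you can ask R which packages it can find. This is a minimal sketch rather than part of the official installation script: `missing_packages` is a hypothetical helper, and the assumption that Biobase is among the base packages installed above should be checked against your own installation.

```r
## Return the names of the packages that are NOT installed.
## (missing_packages is a hypothetical helper.)
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

## Biobase is assumed to be part of the base install;
## soybeanprobe is the extra package requested above.
not_installed <- missing_packages(c("Biobase", "soybeanprobe"))
if (length(not_installed) > 0) {
  message("Still missing: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

`requireNamespace()` only checks that a package can be loaded; it does not attach it, so your session is left unchanged.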

1. We will be covering MIAME in more detail in the Data Standards section of this module.

CSC8309 -- Gene Expression and Proteomics
