Preparing the dataset

In this practical we are going to work from the raw data files. If you return to the GEO page for GSE20986, you can download a compressed tar archive of the data. You can do this in several ways, although the command line tends to give you the most control. If you don't know how to use the command line to fetch files, you can use a browser instead. Either of the following commands will download the archive.

curl -O ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/GSE20986_RAW.tar

wget ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/GSE20986_RAW.tar

Now you need to untar and unzip the files. If you have not come across tar and gzip before, please investigate them briefly. They are very common file formats and you should know about them.

Unpack the tar ball and then unzip the files that it contains.
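
As a minimal sketch of this step, the helper below extracts a tar archive into a directory and then decompresses any gzip members it finds there. It assumes the archive members end in .gz. The function name unpack_geo_archive and the idea of a separate target directory are illustrative choices, not part of the practical.

```shell
# Hypothetical helper: unpack a tar archive and decompress its .gz members.
# Usage: unpack_geo_archive ARCHIVE.tar [TARGET_DIR]
unpack_geo_archive() {
    archive=$1
    target=${2:-.}
    mkdir -p "$target" || return 1
    # -x extracts, -f names the archive, -C extracts into the target directory
    tar -xf "$archive" -C "$target" || return 1
    # Decompress every .gz file in the target directory; do nothing if there are none
    for f in "$target"/*.gz; do
        [ -e "$f" ] || continue
        gunzip -f "$f" || return 1
    done
}
```

For the archive downloaded above, this could be called as: unpack_geo_archive GSE20986_RAW.tar .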

You should now have a directory containing .CEL files. These CEL files are binary files produced by the array scanner for Affymetrix chips. If you are given CEL files but do not know which chip they originally came from, you can interrogate them by looking inside the binary file.

On Linux, the following works. On Windows, you may need to investigate alternatives.

strings GSM524665.CEL | less

Looking at some of the other files in the same way, can you spot the original filenames of the files as they came off the machine? GEO renamed all the files when they were uploaded.

We are now going to do some analysis on the chips. To do this, however, we first need to capture the experimental information. This is just a text file that lists the chip names and the biological samples they correspond to.

Start by creating a text file with the CEL file names. You need a list of all the files, one per line. Call the file phenodata.txt.
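
One way to produce that list from the command line is sketched below. It assumes the unpacked files end in .CEL (upper case) and sit together in one directory. The function name list_cel_files is made up for illustration.

```shell
# Hypothetical helper: write the names of all .CEL files in DIR to OUTFILE, one per line.
# Usage: list_cel_files [DIR] [OUTFILE]
list_cel_files() {
    dir=${1:-.}
    out=${2:-phenodata.txt}
    # The subshell keeps the cd local; the glob expands in sorted order
    (
        cd "$dir" || exit 1
        for f in *.CEL; do
            # Skip the unexpanded pattern when no .CEL files are present
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    ) > "$out"
}
```

Run from the directory holding the CEL files, list_cel_files . phenodata.txt would write the list to phenodata.txt.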

Now open the file in a text editor of your choice. You need to create a tab-separated file with three columns: 'Name', 'FileName' and 'Target'. The FileName and Name columns will be identical in this case. The Target column holds information taken from the GEO page. Label each sample as 'iris', 'retina', 'choroid' or 'huvec', as appropriate.
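
If you would rather not type the tabs by hand, the sketch below builds the three-column layout from a one-per-line list of file names. It deliberately writes the placeholder FILL_IN in the Target column, because the tissue for each sample has to be looked up on the GEO page and entered by you. The function name make_phenodata_skeleton and the FILL_IN placeholder are assumptions made for this example.

```shell
# Hypothetical helper: turn a list of CEL file names into a tab-separated
# table with the header Name, FileName, Target and a placeholder Target value.
# Usage: make_phenodata_skeleton LIST_FILE [OUT_FILE]   (OUT_FILE defaults to LIST_FILE)
make_phenodata_skeleton() {
    list=$1
    out=${2:-$1}
    tmp_out="$out.tmp"
    {
        printf 'Name\tFileName\tTarget\n'
        # Read each name; the extra test keeps a final line that lacks a newline
        while IFS= read -r name || [ -n "$name" ]; do
            [ -n "$name" ] || continue
            # Name and FileName are identical; Target is filled in later by hand
            printf '%s\t%s\tFILL_IN\n' "$name" "$name"
        done < "$list"
    } > "$tmp_out" && mv "$tmp_out" "$out"
}
```

Calling make_phenodata_skeleton phenodata.txt rewrites the file in place. Each FILL_IN then needs replacing with the correct label for that sample.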

To repeat: this is TAB separated. This means separated with TABS, NOT WITH SPACES. The file will be read by R, and R is really very picky about formats. If you use spaces, IT WILL NOT WORK. It has to be TABS.
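
Tabs and spaces look the same in most editors, so it can be worth checking the file before handing it to R. The sketch below uses awk to confirm that every line has exactly three non-empty tab-separated fields. The function name check_phenodata is hypothetical, and this is only a sanity check on the layout, not on the labels themselves.

```shell
# Hypothetical check: succeed only if every line of FILE has exactly
# three non-empty, tab-separated fields. Problem lines are reported on stdout.
# Usage: check_phenodata FILE
check_phenodata() {
    awk -F '\t' '
        NF != 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: expected 3 non-empty tab-separated fields (found %d fields)\n", NR, NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}
```

For example, check_phenodata phenodata.txt && echo "looks tab-separated" prints the message only when every line passes.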

CSC8309 -- Gene Expression and Proteomics