
CSC8309 -- Gene Expression and Proteomics

In this practical we are going to work from the raw data files. If you return to the GEO page for GSE20986, you can download a tar archive of the compressed data files. You can do this in several ways, although the command line tends to give you the most control. If you don't know how to use the command line to fetch files, you can use a browser instead.

act curl -O ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/GSE20986_RAW.tar

or

wget ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/GSE20986_RAW.tar

Now you need to untar and unzip the files. If you have not learnt about tar and gzip before, then please investigate them briefly. They are very common file formats and you should know of them.

act Unpack the tar ball and then unzip the files that it contains.
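
If you get stuck, the two steps can be sketched as a small shell function. This is only one way to do it: the function name unpack_raw and the optional target directory are made up for illustration, and it assumes the archive contains gzip-compressed files ending in .gz.

```shell
# Unpack a GEO RAW archive and decompress the gzipped files inside it.
# unpack_raw is a hypothetical helper name, not part of any tool.
# Usage: unpack_raw GSE20986_RAW.tar [target_dir]
unpack_raw() {
    archive=$1
    target=${2:-.}
    mkdir -p "$target" || return 1
    # -x extract, -f read the named archive, -C change into target first
    tar -xf "$archive" -C "$target" || return 1
    # Assumption: each array is stored as a .gz file inside the archive
    for f in "$target"/*.gz; do
        [ -e "$f" ] || continue    # no .gz files matched the pattern
        gunzip "$f" || return 1    # replaces X.gz with X
    done
}
```

Calling unpack_raw GSE20986_RAW.tar cel_files would then leave the decompressed files in a directory called cel_files.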

You should now have a directory containing .CEL files. These CEL files are binary files produced by the array scanner for Affymetrix chips. If you are given CEL files but do not know which chip they originally came from, you can interrogate them by looking inside the binary file.

act On Linux, the following works. On Windows, you might like to investigate alternatives.

strings GSM524665.CEL | less

Can you see where the chip type is represented?

Looking at some of the other files in the same way, can you spot the original filenames of the files as they came off the machine? GEO has renamed all the files when they were uploaded.
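
Once you have spotted a recognisable fragment in one file, a loop saves paging through every file by hand. The sketch below runs strings over each file and prints the first line matching a pattern you supply. The helper name cel_grep is hypothetical, and the pattern is whatever text you noticed in the less output.

```shell
# Print the first readable string in each file that matches a pattern.
# cel_grep is a hypothetical helper name, not a standard command.
# Usage: cel_grep 'PATTERN' *.CEL
cel_grep() {
    pattern=$1
    shift
    for f in "$@"; do
        # strings pulls runs of printable characters out of a binary file;
        # grep -i ignores case, -m 1 stops at the first matching line
        match=$(strings "$f" | grep -i -m 1 -e "$pattern" || true)
        printf '%s\t%s\n' "$f" "${match:-<no match>}"
    done
}
```

The output is one line per file, with the file name and the matching string separated by a tab.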

We are now going to do some analysis on the chips. To do this, however, we first need to capture the experimental information. This is just a text file that describes the chip names and their corresponding biological samples.

act Start by creating a text file with the CEL file names. You need a list of all the files one per line. Call the file phenodata.txt.
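
One possible way to build that list from the command line is sketched here. The function name list_cel_files is made up for illustration, and it assumes the files end in .CEL (upper case) and sit in a single directory.

```shell
# Print the name of every .CEL file in a directory, one per line.
# list_cel_files is a hypothetical helper name.
# Usage: list_cel_files . > phenodata.txt
list_cel_files() {
    for f in "${1:-.}"/*.CEL; do
        [ -e "$f" ] || continue    # directory contains no .CEL files
        basename "$f"              # strip the directory part
    done
}
```

The shell expands the *.CEL pattern in sorted order, so the names come out in the same order as the GSM accession numbers.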

Now open the file in a text editor of your choice. You need to create a tab-separated file. This should have 3 columns: 'Name', 'FileName' and 'Target'. The FileName and Name columns will be identical in this case. The Target column is information taken from the GEO page. Label each sample appropriately as 'iris', 'retina', 'choroid' or 'huvec'.

To repeat. This is TAB separated. This means separated with TABS. NOT WITH SPACES. This will be read by R. R is really very picky about formats. If you use spaces, IT WILL NOT WORK. It has to be TABS.

The final result should be:

Name    FileName        Target
GSM524662.CEL   GSM524662.CEL   iris
GSM524663.CEL   GSM524663.CEL   retina
GSM524664.CEL   GSM524664.CEL   retina
GSM524665.CEL   GSM524665.CEL   iris
GSM524666.CEL   GSM524666.CEL   retina
GSM524667.CEL   GSM524667.CEL   iris
GSM524668.CEL   GSM524668.CEL   choroid
GSM524669.CEL   GSM524669.CEL   choroid
GSM524670.CEL   GSM524670.CEL   choroid
GSM524671.CEL   GSM524671.CEL   huvec
GSM524672.CEL   GSM524672.CEL   huvec
GSM524673.CEL   GSM524673.CEL   huvec
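
Tabs and spaces look the same in most editors, so it can be worth checking the file before handing it to R. The following is a minimal sketch of such a check using awk. The function name check_phenodata is hypothetical, and the only rule it enforces is that every line has exactly three tab-separated fields.

```shell
# Report any line that does not have exactly 3 tab-separated fields.
# check_phenodata is a hypothetical helper name.
# Usage: check_phenodata phenodata.txt && echo "looks tab-separated"
check_phenodata() {
    awk -F '\t' '
        NF != 3 {
            printf "line %d: expected 3 tab-separated fields, found %d\n", NR, NF
            bad = 1
        }
        END { exit bad }    # exit status 0 only if every line passed
    ' "$1"
}
```

A line typed with spaces instead of tabs is seen by awk as a single field, so it is reported with its line number. A blank line at the end of the file is reported too.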
