In this practical we are going to work from the raw data files. If you return to the GEO page for GSE20986, you can download a tar archive of the compressed data files. You can do this in a variety of ways, although a command-line approach tends to give you the most control. If you don't know how to use the command line to access files, you can also use a browser.
curl -O ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/series/GSE20986/GSE20986_RAW.tar
or fetch the same URL with another download tool or with your browser.
Now you need to untar and unzip the files. If you have not learnt about tar and gzip before, then please investigate them briefly. They are very common file formats and you should know of them.
Unpack the tarball and then unzip the files that it contains.
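As a minimal sketch of the unpacking step, the two operations can be wrapped in a small shell function. The function name unpack_geo_archive is hypothetical, and the sketch assumes the archive members are gzip-compressed files named *.CEL.gz that should be extracted into the current directory.

```shell
# Hypothetical helper: extract a GEO tar archive and decompress its CEL files.
# Assumes the archive members are named *.CEL.gz (an assumption, check with `tar -tf`).
unpack_geo_archive() {
  # $1: path to the tar archive (default: GSE20986_RAW.tar)
  archive="${1:-GSE20986_RAW.tar}"
  # -x extracts, -f names the archive file; members land in the current directory
  tar -xf "$archive" || return 1
  # gunzip replaces each .gz file with its decompressed version; -f overwrites old copies
  gunzip -f ./*.CEL.gz
}
```

Run it from the directory that holds the download, for example `unpack_geo_archive GSE20986_RAW.tar`. Listing the archive first with `tar -tf GSE20986_RAW.tar` shows what it contains without extracting anything.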
You should now have a directory containing .CEL files. These CEL files are binary files produced by the array scanner for Affymetrix chips. If you are given CEL files but do not know which chip they originally came from, you can find out by looking inside the binary file.
On Linux, standard command-line tools can do this. On Windows, you might like to investigate alternatives.
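One way to look inside a binary file is to print only its readable text fragments. The sketch below is one possible approach and not necessarily the command originally used in this practical. The name peek_cel is hypothetical, and the 4-character minimum and the 20-line default are arbitrary choices.

```shell
# Hypothetical helper: show the first readable text fragments in a binary file.
peek_cel() {
  # $1: path to a CEL file; $2: how many fragments to show (default: 20)
  # tr turns every run of non-printable bytes into a single newline,
  # awk keeps fragments of at least 4 characters, head limits the output.
  LC_ALL=C tr -cs '[:print:]' '\n' < "$1" \
    | awk 'length($0) >= 4' \
    | head -n "${2:-20}"
}
```

For example, `peek_cel GSM524662.CEL` prints the first text fragments found in that file. Where it is installed, the `strings` utility does a similar job.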
Can you see where the chip type is represented?
Looking at some of the other files in the same way, can you spot the original filenames of the files as they came off the machine? GEO has renamed all the files when they were uploaded.
We are now going to do some analysis on the chips. In order to do this, we first need to capture the experimental information. This is just a text file that lists the chip names and the biological sample each one corresponds to.
Start by creating a text file with the CEL file names. You need a list of all the files, one per line. Call the file phenodata.txt.
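A minimal sketch of this step, assuming the unpacked files all end in .CEL and sit in one directory. The function name list_cel_files is hypothetical.

```shell
# Hypothetical helper: print the names of all .CEL files in a directory, one per line.
list_cel_files() {
  # $1: directory holding the CEL files (default: current directory)
  for f in "${1:-.}"/*.CEL; do
    # skip the unexpanded pattern if no .CEL files are present
    if [ -e "$f" ]; then
      basename "$f"
    fi
  done
}
```

Redirect the output to create the file: `list_cel_files > phenodata.txt`. Because the pattern only matches names ending in .CEL, phenodata.txt itself is not included in the list.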
Now open the file in a text editor of your choice. You need to create a tab-separated file with 3 columns: 'Name', 'FileName' and 'Target'. The FileName and Name columns will be identical in this case. The Target column is information taken from the GEO page. Label each sample appropriately as 'iris', 'retina', 'choroid' or 'huvec'.
To repeat: this is TAB separated. This means separated with TABS, NOT WITH SPACES. This will be read by R, and R is really very picky about formats. If you use spaces, IT WILL NOT WORK. It has to be TABS.
The final result should be:
Name	FileName	Target
GSM524662.CEL	GSM524662.CEL	iris
GSM524663.CEL	GSM524663.CEL	retina
GSM524664.CEL	GSM524664.CEL	retina
GSM524665.CEL	GSM524665.CEL	iris
GSM524666.CEL	GSM524666.CEL	retina
GSM524667.CEL	GSM524667.CEL	iris
GSM524668.CEL	GSM524668.CEL	choroid
GSM524669.CEL	GSM524669.CEL	choroid
GSM524670.CEL	GSM524670.CEL	choroid
GSM524671.CEL	GSM524671.CEL	huvec
GSM524672.CEL	GSM524672.CEL	huvec
GSM524673.CEL	GSM524673.CEL	huvec
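Since tabs and spaces look the same in most editors, it can help to check the finished file before handing it to R. The sketch below is one possible check, with a hypothetical function name check_phenodata. It assumes that every line, including the header, should have exactly three tab-separated fields.

```shell
# Hypothetical helper: report lines that do not have exactly three tab-separated fields.
check_phenodata() {
  # $1: file to check (default: phenodata.txt)
  # -F'\t' makes awk split on tabs only, so space-separated lines count as one field
  awk -F'\t' '
    NF != 3 { printf "line %d has %d tab-separated field(s)\n", NR, NF; bad = 1 }
    END     { exit bad + 0 }
  ' "${1:-phenodata.txt}"
}
```

Running `check_phenodata phenodata.txt` prints nothing and returns success when every line is well formed. Any line it reports, including a stray blank line at the end of the file, needs fixing in your editor.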