CSC8309 -- Gene Expression and Proteomics

There are a number of issues with this analysis. We will address the biggest one first: we have not checked the quality of the data. Quality control matters for microarray data because the technique is still experimentally error-prone. Some QC work can be done within affylmGUI under the 'Plot' menu.

The second major issue is the amount of manual intervention that is required to download the data, produce the results and interrogate the gene lists; we will address this next.

First of all, generate a raw intensity boxplot for each array using the 'Plot' menu.

A boxplot is a convenient way to compare the probe intensity levels between the arrays of a dataset. The two ends of the box represent the upper and lower quartiles, and the line in the middle of the box represents the median. Horizontal lines, connected to the box by “whiskers”, indicate the largest and smallest values not considered outliers. Outliers are values that lie more than 1.5 times the interquartile range beyond the first or third quartile (the edges of the box) and are represented by small circles. You will notice that the medians and the spreads of the data vary from array to array. This is the data prior to normalisation, shown as log2-transformed intensity values.
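To make the boxplot rules above concrete, here is a minimal Python sketch that computes the quartiles, the 1.5 x IQR fences, the whisker ends and the outliers for a single array. The intensity values are made up for illustration, and the quartiles use simple linear interpolation, so the numbers may differ slightly from those drawn by the GUI.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return the quartiles, whisker ends and outliers for one array."""
    data = sorted(values)
    # Quartiles by linear interpolation over the observed values.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme values that are not outliers.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical log2 intensities for one array (illustrative numbers only).
log2_intensities = [6.0, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 15.0]
stats = boxplot_stats(log2_intensities)
print(stats)
```

Running the same function on every array and comparing the medians and box heights gives the same comparison that the plot shows visually.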

Generate an 'Intensity Density Plot' for the PM probes.

This displays similar information to the boxplot: the spread of intensities for all the CEL files. The x-axis represents probe intensity (on the log2 scale) and the y-axis indicates the density of probes at that intensity. This plot gives a slightly more detailed picture, and a number of inferences can be made from it. A bimodal distribution in the raw data, for example, is often indicative of an array containing a spatial artefact, and an array whose curve is shifted to the right often has abnormally high background interference. The data in this case look fine.
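As a rough illustration of what a density plot summarises, the sketch below bins log2 intensities into a fixed-width histogram and scales the counts so that the bars integrate to 1. This is a simplification: plotting tools usually draw a smoothed kernel density curve rather than a histogram. The function name, bin width and input values are assumptions made for the example.

```python
from math import log2


def histogram_density(raw_intensities, bin_width=1.0):
    """Estimate the density of log2 intensities with a fixed-width histogram.

    Returns a list of (bin_midpoint, density) pairs.
    """
    logged = [log2(v) for v in raw_intensities]
    start = min(logged)
    n_bins = int((max(logged) - start) / bin_width) + 1
    counts = [0] * n_bins
    for value in logged:
        counts[int((value - start) / bin_width)] += 1
    # Dividing by (n * bin_width) makes the total area under the bars equal 1.
    total = len(logged)
    return [
        (start + (i + 0.5) * bin_width, count / (total * bin_width))
        for i, count in enumerate(counts)
    ]


# Hypothetical raw PM intensities for one array (illustrative numbers only).
pm_intensities = [64, 128, 128, 256, 256, 256, 512, 1024]
density = histogram_density(pm_intensities)
peak = max(density, key=lambda pair: pair[1])
print(density)
```

A single clear peak, as here, is what a well-behaved array looks like; two separate peaks in the output would correspond to the bimodal shape described above.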

Now generate a normalised intensity boxplot for the data.

Notice that the spread of the data is now similar across all the arrays. This is essential for making meaningful comparisons between the groups.
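One way the distributions can be made to line up like this is quantile normalisation, which is the normalisation step used by methods such as RMA. Which method was applied to this dataset is not stated here, so the following is only a sketch under that assumption, with made-up values.

```python
def quantile_normalise(arrays):
    """Give every array the same distribution: the mean of the sorted arrays.

    arrays is a list of equal-length lists, one list of intensities per array.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: the mean across arrays at each rank.
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays)
        for rank in range(n_probes)
    ]
    normalised = []
    for a in arrays:
        # Probe indices ordered by intensity (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalised.append(out)
    return normalised


# Three hypothetical arrays measuring the same three probes.
raw_arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0], [8.0, 4.0, 6.0]]
normalised_arrays = quantile_normalise(raw_arrays)
print(normalised_arrays)
```

After this step every array contains exactly the same set of values, only assigned to different probes, which is why the normalised boxes look alike.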

Generate an RNA degradation plot.

This plot gives a good indication of the quality of the sample that has been hybridised to the array. mRNA degradation occurs when the molecule begins to break down, and degraded mRNA is therefore unreliable for determining gene expression. Because this kind of degradation starts at the 5’ end of the molecule and progresses towards the 3’ end, it can be measured easily using oligonucleotide arrays, where each PM probe is numbered sequentially from the 5’ end of the targeted mRNA transcript.

When RNA degradation is advanced, PM probe intensity at the 3’ end of a probeset should be elevated compared with the 5’ end. For high-quality RNA, a slope of between 0.5 and 1.7 is typical, depending on the type of array; slopes that exceed these values by a factor of 2 or more could indicate excessive degradation. The actual value is, however, less important than agreement between the chips: if all the arrays have similar slopes, then comparisons within genes across the arrays may still be valid.
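The idea behind the slope can be sketched as follows: average the PM intensity at each probe position across all probesets on an array, then fit a straight line through those averages from the 5’ end to the 3’ end. This is a simplified illustration with made-up values; the plotting routine may rescale the averages before fitting, so the slopes from this sketch should not be compared directly with the typical values quoted above.

```python
def degradation_slope(probesets):
    """Least-squares slope of mean log2 PM intensity against probe position.

    probesets is a list of equal-length lists; within each list the probes
    are ordered from the 5' end (index 0) to the 3' end.
    """
    n_positions = len(probesets[0])
    # Mean intensity at each probe position, averaged over all probesets.
    means = [
        sum(ps[k] for ps in probesets) / len(probesets)
        for k in range(n_positions)
    ]
    positions = list(range(n_positions))
    x_bar = sum(positions) / n_positions
    y_bar = sum(means) / n_positions
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(positions, means))
    denominator = sum((x - x_bar) ** 2 for x in positions)
    return numerator / denominator


# Two hypothetical arrays, each with two probesets of four probes.
array_a = [[8.0, 8.1, 8.2, 8.3], [9.0, 9.1, 9.2, 9.3]]
array_b = [[6.0, 7.0, 8.0, 9.0], [7.0, 8.0, 9.0, 10.0]]
slope_a = degradation_slope(array_a)
slope_b = degradation_slope(array_b)
print(slope_a, slope_b)
```

In this toy example the second array has a much steeper slope than the first, which is the kind of disagreement between chips that would deserve a closer look.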

Now generate a 'Relative Log Expression' plot. Do any of the samples look like they might be outliers?

The Relative Log Expression (RLE) plot shows, for each array, the deviation of gene expression level from the median gene expression level for that gene across all arrays. An array with quality problems may show significantly different values from the majority of arrays, resulting in an RLE box with greater spread or a median which deviates from 0.
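The RLE calculation described above is simple enough to sketch directly. The example below assumes a small matrix of log2 expression values (rows are genes, columns are arrays); all names and numbers are invented for illustration.

```python
from statistics import median


def relative_log_expression(expr):
    """Subtract each gene's median across arrays from its log2 expression.

    expr[g][a] is the log2 expression of gene g on array a.
    """
    return [[value - median(row) for value in row] for row in expr]


def per_array_medians(matrix):
    """Median of each column, i.e. the centre line of each array's box."""
    n_arrays = len(matrix[0])
    return [median(row[a] for row in matrix) for a in range(n_arrays)]


# Hypothetical data: four genes on three arrays; the third array reads high.
expression = [
    [8.0, 8.0, 9.0],
    [6.0, 6.0, 7.0],
    [10.0, 10.0, 11.0],
    [7.0, 7.0, 8.0],
]
rle = relative_log_expression(expression)
rle_medians = per_array_medians(rle)
print(rle_medians)
```

Here the first two arrays are centred on 0 while the third is not, which is the pattern to look for in the plot.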

The Normalized Unscaled Standard Error (NUSE) plot portrays the chip-wise distribution of standard error estimates, obtained for each gene on each array. To account for the fact that variability differs considerably between genes, the error estimates are standardised so that the median standard error across arrays is 1 for each gene.
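The standardisation step can be sketched in the same way as RLE, with division in place of subtraction. The standard errors themselves come from a model fit that is not reproduced here, so the input matrix below is a made-up stand-in (rows are genes, columns are arrays).

```python
from statistics import median


def nuse(standard_errors):
    """Divide each standard error by that gene's median across arrays.

    standard_errors[g][a] is the standard error for gene g on array a.
    After scaling, the median across arrays is 1 for every gene.
    """
    return [[se / median(row) for se in row] for row in standard_errors]


# Hypothetical standard errors: three genes on three arrays.
se_matrix = [
    [0.10, 0.10, 0.20],
    [0.05, 0.05, 0.10],
    [0.20, 0.20, 0.30],
]
nuse_values = nuse(se_matrix)
# Median NUSE per array: an array sitting well above 1 is suspect.
nuse_medians = [median(row[a] for row in nuse_values) for a in range(3)]
print(nuse_medians)
```

In this toy example the third array has a median well above 1, meaning its estimates are consistently less precise than those of the other arrays.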

Does a NUSE plot identify any other potential outlier chips?

It is possible to generate chip pseudo-images to look for spatial artefacts on the chips. First generate a weights plot and a residuals plot for the GSM524662 chip. Notice the distribution of these statistical measures across the chip: they should be relatively uniform.

Now generate the same plots for any chips that you might have identified as potential outliers. Are there any clues as to why they might be outliers?

Although we appear to have one or more outlier chips, the majority of the data appears to be OK, apart from some spatial artefacts. We will trust the normalisation routines to have corrected this systematic error, and we will not discard any chips from the experiment.