• If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.


Microarray Normalization and Gene Expression Index

Page history last edited by PBworks 14 years, 8 months ago




Microarray recap



insert link to previous topic here


Importance of microarray normalization


Our motivation is trifold:

(1) Preserve biological variation between different samples

(2) Minimize experimental variation

(3) Allow comparison of different experiments


Normalization methods are significant as these may have a large effect on analysis between data samples that need to be corrected before future downstream steps such as group comparisons can be performed. We need to take several things into consideration when looking at a data set:


  • Median Scaling
  • Within / across slide variation
  • Dye Bias
  • Location bias
  • Probe bias


Microarray Normalization


Median Scaling


We often use linear scaling to ensure that different arrays have the same median value and dynamic range in order to allow future comparisons. The mathematical transformation is



where x’ is the new best fit line after applying a new range and median value.



However, the data set is often not linear. In these cases, linear scaling is not a good idea. One can apply linear scaling for small sets of data, but for other cases, we can use Loess.


Loess (LOcally WEighted Scatterplot Smoothing)


This is designed to fit a smooth curve and aggregates local linear fits into such. It effectively applies different scaling factors at different intensity levels to transform



Transform X to X’=f(X)


where Y and X’ are now comparable.



See Cleveland, W.S. and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. __J. Am. Stat Assoc__., 83, 596-610.


Reference for Normalization


Both Loess and Median scaling require a reference sample for normalization. There are two ways of doing this:


(1) “Middle chip” – use the median of the median of the samples

(2) Pooled reference RNA sample


The selection of the reference sample will influence the results.




cDNA Normalization


The key to this normalization is to balance the fluorescence intensities of two dyes, the green Cy3 and the red Cy5 dye as well as to allow comparisons of expression levels across experiments.


Within-Slide Normalization


Within-slide normalization is very similar to the cyclic loess method described for the Affymetrix chip below. In fact, the original idea was modified off the within-slide normalization.


A little bit more background on the normalization calculations:


It is actually preferable to work with logged intensities for the following reasons:


  1. the variation of logged intensities and ratio of intensities is less dependent on absolute magnitude
  2. normalization is usually additive for logged intensities
  3. taking logs evens out highly skewed distributions
  4. taking logs gives a more realistic sense of variation


Logarithms base 2 are used because intensities are integers between 0 and (2^16) – 1. Finally, the log ratios and mean log intensity measured by M and A are used instead to reveal more interesting features of the data. Below is another example on how M vs. A plots are more revealing than their log based counterparts.



Three Methods of Normalization at Probe Intensity Level for Affymetrix


‘’Complete data methods’’ – make use of data from all arrays to perform normalization.

Point of normalization is to eliminate ‘’obscuring variation’’ – variation that is introduced during the process of carrying out the experiment. This includes differences in


• sample preparation (such as labeling differences)

• production of arrays

• processing of arrays (different scanners)


Affymetrix’s method of normalization is by scaling the array to the same average value in intensities. This approach works best with linear relationships, (see Li and Wong, 2001).


Introduction to these three methods


  • Independent of choice of baseline array
  • Carried out at probe level for all probes on array
  • Treats PM (perfect match probes) and MM (mismatch probes) as part of intensities that need to be normalized
  • Does not account for saturation
  • First two methods based on cDNA microarray normalization methods
  • Assess performance of three methods using:
    • Scaling method
    • Non-linear method


Cyclic Loess


This method uses the comparison of the difference in log expression values (M) and the average of log expression values (A) of probe intensity values. A normalized data plot should look like the transformed one below where the data set is based around the M = 0 axis. R and G are the original probe sets representing red and green, respectively.





and .


For any two arrays, i, j with probe intensities Xki and Xkj, where k = 1,…, p (the probe), we can calculate


and . Next, a normalization curve is fitted to this new M versus A plot using loess. Below are two example graphs of five arrays for a PM probe sample set using liver (at concentration 10) dilution series data.



Normalized curve can be described as and hence, the normalization adjustment is . Adjusted probe intensities are given by and . Normalization is carried out in a pairwise manner. After looking at all pairs of arrays for any array k where 1<=k<=n, we have adjustments for chip k relative to arrays 1,…,k-1,k+1,…,n. The adjustments are equally weighed and applied to the set of arrays. Due to the pairwise manner of comparison, this is the most time consuming method.


Second Method: Contrast Based Method


This is another extension of the M versus A method. This is carried out by placing the data on a log-scale and transforming the basis. In the transformed basis, a series of ‘‘n’’-1 curves are fit in a similar manner to that of the cyclic loess method. The data is then adjusted using a smooth transformation which adjusts the normalization curve so that it lies along the horizontal. Data in the normalized state is obtained by transforming back to the original basis and exponentiating, making this method a bit faster than the cyclic method, but still overall, time consuming.


    • Quantile normalization


One of the first questions one may ask is what’s a quantile? Good question! Let’s use an example out of the Mathematical Statistics and Data Analysis, 2nd ed. by John Rice. If we use a cumulative distribution function (cdf) of a uniform random variable on [0,1], where


F(x) = {


0, x≤0
x, 0≤x≤1
1, x ≥1

Then, the pth quantile of the distribution F is defined to be that value Xp such that F(Xp) = p, or P(Y≤ Xp) = p. Special cases are p = 0.50, which corresponds to the median of F, and p = 0.25 and p = 0.75, which correspond to the lower and upper quartiles of F.


Summary: The pth quantile of the distribution hence divides a data set into p sets, with the same number of individual samples in a set.


Hence, quantile – quantile (Q-Q) plots are useful for comparing distribution functions. In a Q-Q plot, the quantiles of one distribution are plotted against those of another. This is an easy way to see if the distribution of two data sets (control versus test sample) is the same: if so, all the data sets will match up and form a straight diagonal line.


The goal of this normalization method is to make the distribution of probe intensities for each array in a set of arrays the same. The concept can be extended to ‘‘n’’ dimensions so that if all ‘‘n’’ data vectors have the same distribution, plotting the quantiles in ‘‘n’’ dimensions gives a straight line along the unit vector. Hence, we could make a set of data have the same distribution if we project the points of our ‘‘n’’ dimension quantile plot onto the diagonal.


We can use the projection of q onto d where q represents the ‘’k’’ th quantiles for all ‘‘n’’ arrays and d represents the unit diagonal:



This implies the following algorithm for normalizing a set of data vectors by giving them the same distribution:


  1. Given arrays of length ‘‘p’’, form X of dimension ‘‘p’’ x ‘‘n’’ where each array is a column
  2. Sort each column of X to give Xsort
  3. Take the means across rows of Xsort and assign the mean to each element in the row to get X’sort
  4. Get Xnormalized by rearranging each column of X’sort to have the same ordering as the original X


See below for graphical representation:




Two Sets of Experiments to obtain data


Dilution/Mixture Experiment


  1. RNA sample is diluted into five different concentrations
  2. The experiment is replicated five times
  3. The results of each replication is scanned on 5 different scanners, from higher values to lower values
  4. Below is an illustration of the before and after quantile normalization:



Spike-In Experiment


  1. cRNA fragments have been spiked in at various concentrations


Probeset Measure Comparisons


Variance ComparisonsPairwise ComparisonsBias ComparisonsSpeed
Cyclic loess Reduces variationDistance remains fairly constantPerforms comparablySlowest
Contrast based method Reduces variationPerforms comparablyDistance remains fairly constant across itensitiesSecond slowest
Quantile normalizationReduces variation slightly betterGives smallest distance between arraysSlight AdvantageFastest
Non-linear methodReduces variationPerforms poorer
Scaling MethodbaselineHad slightly higher slopes


Summary: quantile normalization is the most popular normalization method. One of its assumptions is, however, that most of the probes/genes don’t change between samples.


Microarray Expression Value (Expression Index)


cDNA Microarrays


  • Fold change: ratio Cy5 / Cy3
  • When fold change is negative, take





Affymetrix Microarray Expression Index




MAS 4 was the older version of GeneChip’s software and it uses the average over probe pairs.



A is defined as the set of suitable pairs chosen by the software with the following characteristics to maintain a robust average:


  • the removal of the highest and lowest difference
  • the mean and standard deviation were calculated from the remaining probes
  • probes more than 3 standard deviations from the mean were removed


Drawbacks to this naïve algorithm were:


  • can omit 30-40% of the probes
  • can give negative values




This is the latest version where the calculations apply a log to reduce the dependence of the variance on the mean.


A two-step procedure determines the detection p-value for a given probe set:


  1. Calculate the discrimination score R for each probe pair.
  2. Test the discrimination scores against the user-definable threshold Tau.


The discrimination score describes the ability of a probe pair to detect its intended target. It measures the target-specific intensity difference of the probe pair (PM-MM) relative to its overall hybridization intensity (PM+MM):


R = (PM -MM) / (PM + MM)


The next step toward the calculation of a Detection p-value is the comparison of each Discrimination score to the user-definable threshold Tau. Tau is a small positive number that can be adjusted to increase or decrease sensitivity.


The One-Sided Wilcoxon’s Signed Rank test is the statistical method employed the generate the detection p-value. It assigns each probe pair a rank based on how far the probe pair discrimination score is from Tau.



In this hypothetical probe set, the PM intensity is 80 and the MM intensity for each probe pair increases from 10 to 100. The probe pairs are numbered from 1 to 10. As the MM probe cell intensity (x-axis) increases and becomes equal or greater than the PM intensity, the discrimination score decreases as plotted on the y-axis. More specifically, as the intensity of the MM increases, our ability to discriminate between the PM and MM decreases. The dashed line is the user-definable parameter Tau (default = 0.015).


The intensity signal is a quantitative metric calculated for each probe set, which represents the relative level of expression of a transcript. The one-step Tukey’s Biweight is used to calculate the signal.


Tukey’s Biweight is a type of sin curve that is often used in robust estimation techniques, an estimation technique that is insensitive to small departures from the idealized assumptions and hence minimizes the weight of outliers. It is given by the equation:



and in graph form looks as this:



Each probe pair in a probe set is considered as having a potential vote in determining the signal value. The vote, in this case, is defined as an estimate of the real signal due the hybridization of the target. the MM intensity is used to estimate the stray signal. The real signal is estimated by taking the log of the PM intensity after subtracting the stray signal estimate. The probe pair vote is weighted more strongly if this probe pair signal value is closer to the median value for a probe set. Once the weight of each probe pair is determined, the mean of the weighted intensity values for a probe set is identified. This mean value is converted back to a linear scale and is outputted as the signal.


The software computes the probe set intensity signal as the following:



CT, or change threshold, is a version of MM that is never bigger than PM. The following rules avoid taking the log of negative numbers.


MM≥PMestimate typical case of MM for PM
If typical MMs>PM, set CT = PM – ε


Otherwise, if PM > MM, this is useful data as it provides an estimate of the stray data.



The white bars, Idealized Match (IM), are the intensities of the MM based on the signal rules above. In this sample offered by the Affymetrix Manual, a lot of the MM intensities are lower than the PM, and hence, the MM values can be used directly. When the PM < MM, the IM value is used.


Global scaling strategy sets the average signal intensity of the array to a target signal of 100. The key assumption of the global scaling strategy is that there are few changes in the gene expression between the arrays being analyzed.






Li and Wong published the first paper on the software dCHIP in 2001. Their motivation was that with one data set, there was still significant variation in expression level produced by different probes. The researchers used a dataset of 21 HuGeneFl arrays and the significant differences found in expression level were more due to probe effects rather than variations in array. Hence, they had an invested interest in finding a statistical model, (Model-based expression index), to normalize probe-effects in data.


Statistical Model


The individual probe response is based on the PM and MM difference and is modeled as such:



There were two reasons given in the literature by Li and Wong to measure the difference in PM and MM


  1. Computational advantage in determining the difference first, then fit the full data
  2. Earlier experimentation showed the average difference is linear to the true expression level


Normalization of arrays based on an "invariant set"


For a group of arrays, one needs to first normalize all the arrays to a common baseline array having the median overall brightness, as measured by the median CEL intensity in an array.


Model-Based Expression Index


One looks at multiple samples at one time and assign different probes a different weight. Each probe signal is then proportion to the amount of target sample (theta) and to the affinity of the specific probe sequence (gamma) to the target (see statistical model above). The error for MBEI acts as a confidence interval. Hence, we iteratively estimate theta and gamma to minimize the error. Specifically, we first fit the model to one probe set, and identify arrays with large standard error. Next, with these array outliers excluded, we work on the data table with fewer rows and fit the model again. This time, the standard errors and magnitudes of gamma's are examined in order to exclude probe outlier. If gamma results in a negative value, we regard it as a probe outlier and exclude the corresponding probe. As a result, data table shrinks in columns and we fit the model again to the new table. After probe outliers are excluded, we evaluate all arrays for outliers again and compare them to the set of array outliers in the previous round to see if there is any change. Procedure is repeated until the set of probe and array outliers do not change anymore.


(a) Two arrays plotted against each other (b) Green circles represent the invariant set (c) After normalization (d) Q-Q plot shows that these two probe sets have nearly same distribution


RMA: Robust Multi-chip analysis


RMA outperforms dCHIP which outperforms MAS 5, and so on.



Therefore, RMA is of some significance to us. Its model is that it eliminates MM probes in its statistical model. The main idea is that there's heavy probe intensity background adjustment using quantile normalization with adjusted PM. The log of PM is then taken.


RMA Background Subtraction


  • Signal + BG = PM
  • Signal is approximately exponential, while the BG is approximately normal.



Log (PM)


  • We use the log of PM to capture the fact that the higher value probes are more variable and this assumes that the probe noise is comparable on the log scale


Statistical model**


For each probe set, we set



and then take the log of it to get




The model is then:



where a is the expression index and b is the probe effect. The log of n stands for the log after quantile normalization of n samples. What follows is similar to dCHIP in that a and b are iteratively refitted to the data set. Essentially, the new formula is modified from multiplication to addictive on the log scale. The main difference is the minimization of error at the log PM.


Method Comparison Conclusion


  • No one uses MAS 4 now
  • With fold change, RMA > dCHIP > MAS 5
  • With p-value, RMA ~ MAS 5 > dCHIP
  • MAS 5 does a good job on abundant genes
  • dCHIP and RMA do a better job on less abundant genes
  • All four models are implemented in BioConductor (open source R package)




  • Affymetrix Manual.
  • Astrand, M. Normalizing oligonucleotide arrays. Unpublished Manuscript. http://www.math.chalmers.se/~magnusaa/maffy.pdf
  • Bolstad, B.M., Irizarry, R.A. et al, A Comparison of Normalization Methods for High Density Olignucleotide Array Data Based on Variance and Bias. Bioinformatics 2003. 19 (2): 185 – 193.
  • Cleveland, W.S., and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. Am. Stat. Assoc. 1998. 83: 596 – 610.
  • Dudoit, S., Yang, Y.H., et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002. 12: 111-139
  • Hartemink, A., Gifford, D, et al. Maximum likelihood estimation of optimal scaling factors for expression array normalization. SPIE BIOS 2001).
  • Irizarry, R.; Bolstad, B.; Collin, F. et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 2003, 31 (4).
  • Li, C.; Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2001, 2(8).
  • Li, C.; Wong, W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001, 98 (1): 31-36.
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA Microarray Data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002. 30 (4): 15 – 27.
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA Microarray Data. SPIE BIOS 2001 powerpoint presentation.



For more on Tukey’s Biweight, see Mathworld.wolfram.com

Comments (0)

You don't have permission to comment on this page.