Microarray Normalization and Expression Index


Microarray recap

 

 

insert link to previous topic here

 

Importance of microarray normalization

 

The first step of microarray analysis is normalization, which adjusts and balances individual hybridization intensities so that comparisons between arrays are biologically meaningful. Data must be normalized for a number of reasons, including different starting quantities of RNA, differences in the labeling efficiencies of the dyes, and unequal probe hybridization.

 

In essence, microarray normalization has three intended outcomes:

(1) Preserve biological variation between different samples

(2) Minimize experimental variation

(3) Allow comparison of different experiments

 

The choice of normalization method matters, because it can have a large effect on which downstream comparisons can be performed. We need to take several things into consideration when looking at a data set:

 

  • Median Scaling
  • Within / across slide variation
  • Dye Bias
  • Location bias
  • Probe bias

 

Microarray Normalization

 

Dye Bias

 

Often, the green Cy3 and the red Cy5 dyes do not label with equal efficiency.

 

If dyes are swapped in a replicate experiment, and there is no dye bias, then the two measured ratios are reciprocal (writing R1 = Cy5/Cy3 from the first hybridization and R2 = Cy5/Cy3 from the dye-swapped replicate):

 

R1 = 1 / R2, i.e., log2(R1) + log2(R2) = 0

 

However, if there is dye bias, then an average expression ratio can be obtained by calculating the geometric mean of the two estimates of the same biological ratio:

 

ratio = sqrt( R1 / R2 )

 
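As a quick sketch of why the geometric mean works (toy numbers and variable names are ours), a multiplicative dye-bias factor cancels out of the geometric mean of the two ratio estimates:

    import numpy as np

    # Hypothetical measured Cy5/Cy3 ratios for one gene.
    # r1: original labeling, r2: dye-swapped replicate.
    # If the true ratio is A/B and k is the dye-bias factor,
    # then r1 ~ k * (A/B) and r2 ~ k * (B/A).
    r1 = 2.6
    r2 = 0.65

    # Geometric mean of r1 and 1/r2: the factor k cancels.
    ratio = np.sqrt(r1 / r2)                        # estimate of A/B
    log_ratio = 0.5 * (np.log2(r1) - np.log2(r2))   # same on the log2 scale

    print(ratio, 2 ** log_ratio)                    # both 2.0
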

Median Scaling

 

Linear Scaling

 

We often use linear scaling to ensure that different arrays have the same median value and dynamic range, which allows future comparisons. The mathematical transformation is

 

x' = c2 (x - c1)

 

where x' is the transformed intensity, c1 is the amount the data is shifted so that the line of best fit goes through the origin, and c2 is the ratio of the arrays' dynamic ranges.

 

 
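A minimal sketch of this kind of scaling (our own code, not from a particular package): shift each array so its median matches a reference, and scale by the ratio of a robust dynamic-range measure, here the interquartile range:

    import numpy as np

    def linear_scale(x, reference):
        """Shift and scale x so its median and interquartile range
        match those of the reference array."""
        c1 = np.median(x)                                   # shift term
        iqr_x = np.subtract(*np.percentile(x, [75, 25]))
        iqr_ref = np.subtract(*np.percentile(reference, [75, 25]))
        c2 = iqr_ref / iqr_x                                # range ratio
        return c2 * (x - c1) + np.median(reference)

    rng = np.random.default_rng(0)
    ref = rng.gamma(2.0, 500.0, size=10_000)                # toy reference array
    arr = 1.7 * rng.gamma(2.0, 500.0, size=10_000) + 40.0   # biased array
    scaled = linear_scale(arr, ref)
    print(np.median(scaled), np.median(ref))                # now equal
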

However, the relationship between arrays is often not linear. In these cases, linear scaling is not a good idea. One can apply linear scaling to small data sets, but in other cases we use loess.

 

Loess (LOcally WEighted Scatterplot Smoothing)

 

Loess is designed to fit a smooth curve to the data by calculating the best linear fit in a local window around each point and aggregating these local fits into a nonlinear curve. It effectively applies a different scaling factor at each intensity level. We fit

 

Y = f(X)

 

and then transform X to X' = f(X),

 

so that Y and X' are now comparable.
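A sketch using the lowess implementation in statsmodels (one widely available choice; the smoothing fraction frac is a tuning parameter, not a standard value):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(1)
    x = rng.uniform(4, 14, 2_000)                           # array 1 (log scale)
    y = x + 0.3 * np.sin(x) + rng.normal(0, 0.2, x.size)    # array 2, nonlinear bias

    # Fit the smooth curve f with Y = f(X).
    fit = lowess(y, x, frac=0.3, return_sorted=True)        # columns: x, f(x)

    # Transform X to X' = f(X) by interpolating the fitted curve.
    x_prime = np.interp(x, fit[:, 0], fit[:, 1])

    # Y and X' are now comparable: residuals shrink versus the raw comparison.
    print(np.mean(np.abs(y - x_prime)), np.mean(np.abs(y - x)))
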

 

 

See Cleveland, W.S. and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc., 83, 596-610.

 

Reference for Normalization

 

Both loess and median scaling require a reference sample for normalization. You can either pick one treatment as your reference sample (e.g., the "control" condition) or pick a subset of genes to estimate the scaling factor.

 

When using one treatment as a reference sample, a common choice is the "middle" chip: the chip whose overall intensity is the median of all the samples. Note that the selection of the reference sample will influence the results.

 

When using a subset of genes to estimate the scaling factor, what genes do you choose? The two best groups of genes to use are housekeeping genes, which are present at constant levels, and genes which have an invariant rank across all samples (this indicates the gene is not differentially expressed).

 

Quantile Normalization

 

Quantile normalization is currently considered the best normalization method.

 

One of the first questions one may ask is: what's a quantile? Good question! Let's use an example from Mathematical Statistics and Data Analysis, 2nd ed., by John Rice. Consider the cumulative distribution function (cdf) of a uniform random variable on [0,1]:

 

F(x) = 0 for x ≤ 0
F(x) = x for 0 ≤ x ≤ 1
F(x) = 1 for x ≥ 1

Then, the pth quantile of the distribution F is defined to be that value Xp such that F(Xp) = p, or P(Y≤ Xp) = p. Special cases are p = 0.50, which corresponds to the median of F, and p = 0.25 and p = 0.75, which correspond to the lower and upper quartiles of F.

 

Summary: the p = k/q quantiles (k = 1, ..., q-1) hence divide a data set into q subsets with the same number of observations in each.

 

Hence, quantile-quantile (Q-Q) plots are useful for comparing distribution functions. In a Q-Q plot, the quantiles of one distribution are plotted against those of another. This is an easy way to see whether two data sets (say, control versus test sample) have the same distribution: if they do, the plotted quantiles fall along the straight diagonal line y = x.

 

The goal of quantile normalization is to make the distribution of probe intensities for each array in a set of arrays the same. The concept can be extended to n dimensions: if all n data vectors have the same distribution, plotting the quantiles in n dimensions gives a straight line along the unit diagonal vector. Hence, we can make a set of data have the same distribution by projecting the points of our n-dimensional quantile plot onto the diagonal.

 

We can use the projection of q onto d, where q = (q_1, ..., q_n) holds the kth quantile of each of the n arrays and d = (1/sqrt(n), ..., 1/sqrt(n)) is the unit diagonal:

 

proj_d(q) = (q · d) d = ( (1/n) Σ q_i, ..., (1/n) Σ q_i )

 

That is, each array's kth quantile is replaced by the mean of the kth quantiles across all arrays.

This implies the following algorithm for normalizing a set of data vectors by giving them the same distribution (a code sketch follows the list):

 

  1. Given n arrays of length p, form the matrix X of dimension p x n where each array is a column
  2. Sort each column of X to give X_sort
  3. Take the mean across each row of X_sort and assign that mean to every element in the row to get X'_sort
  4. Get X_normalized by rearranging each column of X'_sort to have the same ordering as the original X

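A minimal NumPy sketch of this algorithm (our own implementation; note that ties are broken arbitrarily here, whereas production implementations average tied ranks):

    import numpy as np

    def quantile_normalize(X):
        """Quantile-normalize a p x n matrix (p probes, n arrays):
        force every column to the same distribution, namely the
        mean of the sorted columns."""
        X_sorted = np.sort(X, axis=0)                 # step 2: sort columns
        row_means = X_sorted.mean(axis=1)             # step 3: row means
        ranks = np.argsort(np.argsort(X, axis=0), axis=0)
        return row_means[ranks]                       # step 4: restore order

    X = np.array([[5., 4., 3.],
                  [2., 1., 4.],
                  [3., 4., 6.],
                  [4., 2., 8.]])
    print(quantile_normalize(X))   # every column now has the same values
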
 

No experiment will retain its original values, but all the experiments will have the same distribution. The advantage of quantile normalization is that no reference sample has to be chosen: all samples are treated as equally important. The disadvantage is that every time a new sample is added to the data, normalization has to be run all over again. However, if the database is very large to start with, you can calculate the mean distribution once and then apply it to future samples.

 

Here is a visual representation:

 

 

Dilution Series Illustrates Effectiveness of Quantile Normalization

 

  1. An RNA sample is diluted to five different concentrations
  2. The experiment is replicated five times
  3. The replicates are scanned on 5 different scanners, which produce systematically higher or lower values
  4. Below is an illustration of before and after quantile normalization. Different concentrations are in different shades, and each concentration is scanned on a different scanner:

 

Below is the data set before normalization:

 

 

...and after normalization:

 

 

 

Normalization Quality Check

 

A little bit more background on the normalization calculations:

 

It is actually preferable to work with logged intensities for the following reasons:

 

  1. the variation of logged intensities and ratios of intensities is less dependent on absolute magnitude
  2. normalization is usually additive for logged intensities
  3. taking logs evens out highly skewed distributions
  4. taking logs gives a more realistic sense of variation
  5. logs capture the fact that higher-value probes are more variable, while making the probe noise comparable across the intensity range

 

Logarithms base 2 are used because intensities are 16-bit integers, i.e., between 0 and 2^16 - 1.

 

Below are scatter plots of probe intensities against their standard deviation (SD), before and after log transformation. Notice how in the left graph probes with higher values have higher SD, while after log transformation the SDs are more comparable.

 

 

MvA Plot

If normalization has worked, a plot of log2(R) vs. log2(G) (where R and G are the original red and green probe intensities) should fall on the diagonal.

 

 

MvA plots are a visual way of seeing whether normalization worked. The method compares the difference of the log expression values (M) against the average of the log expression values (A) for each probe. Mathematically,

M = log2(R) - log2(G)

A = (log2(R) + log2(G)) / 2

 

M should be close to 0 for all probes, because the expression of most genes does not change.
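Computing M and A takes one line each; a sketch with toy data (our own variable names):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    true = rng.gamma(2.0, 400.0, 5_000)              # toy expression levels
    R = true * rng.lognormal(0.0, 0.1, true.size)    # red channel
    G = true * rng.lognormal(0.1, 0.1, true.size)    # green channel, slight dye bias

    M = np.log2(R) - np.log2(G)                      # difference of logs
    A = (np.log2(R) + np.log2(G)) / 2                # average of logs

    plt.scatter(A, M, s=2, alpha=0.3)
    plt.axhline(0, color="red")                      # after good normalization, M hugs 0
    plt.xlabel("A"); plt.ylabel("M")
    plt.show()
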

 

 

Below are two example plots of five arrays for a PM probe sample set, using the liver (concentration 10) dilution series data.

 

 

 

Normalization Techniques Comparison

 

Method | Variance comparisons | Pairwise comparisons | Bias comparisons | Speed
Quantile normalization | Reduces variation slightly better | Gives smallest distance between arrays | Slight advantage | Fastest
Non-linear method | Reduces variation | Performs more poorly | |
Scaling method | Baseline | Had slightly higher slopes | |

 

 

Microarray Expression Value (Expression Index)

 

The per-gene summary computed from all of a microarray's probe-level outputs is called the expression index.

 

cDNA Microarrays

 

Fold change can be measured by taking a direct ratio, Cy5 / Cy3; data analyzed this way will always be positive. Fold change can also be calculated as log2(Cy5/Cy3); in this case, down-regulated genes give negative outputs.

 

The chart listed below is an example of the output you might see when using the latter method:

 

 

However, these two simple fold-change calculations are seldom used, because they are too naive and do not take into consideration the complexities of microarray data. The techniques listed below employ more sophisticated algorithms for analyzing microarray data and are more commonly used.

 

Affymetrix Microarray Expression Index

 

The goal now is to summarize the pattern contained in a probe set. As a general rule, perfect match probes carry more information, but this may not always be the case due to cross-hybridization.

 

Affymetrix created a solution to this problem by designing a "mismatch" (MM) probe for every perfect match (PM) probe. The mismatch probe contains a single mismatch located in the middle of the 25-base probe sequence, and serves as a control for its perfect match partner because it is as likely to bind to nonspecific sequences as the perfect match probe. This allows us to detect cross hybridization events and other false signals. These spurious signals can then be subtracted from the perfect match signal to obtain the true gene expression measurement.

 

Below is an illustration of a possible PM/MM profile for hypothetical gene x.

 

 

The software packages described below are tools that can be used to compute and compare microarray expression indices.

 

MAS 4

 

MAS 4 (Microarray Suite 4.0) is Affymetrix's older GeneChip analysis software. It calculates the average difference between PM and MM over a set A of suitable probe pairs chosen by the software:

 

AvgDiff = (1 / |A|) · Σ_{j in A} (PM_j - MM_j)

The set A is chosen in the following way to maintain a robust average:

 

  • highest and lowest differences are removed
  • the mean and standard deviation are calculated from the remaining probes
  • probes more than 3 standard deviations from the mean are removed

 

Drawbacks to this naïve algorithm are:

 

  • can omit 30-40% of the probes
  • can give negative values

 

MAS 5

 

MAS 5 (Microarray Suite 5.0) is the successor software for microarray analysis. The algorithm applies a log transform to reduce the dependence of the variance on the mean.

 

A two-step procedure determines the detection p-value for a given probe set:

 

  1. Calculate the discrimination score R for each probe pair.
  2. Test the discrimination scores against the user-definable threshold Tau.

 

The discrimination score describes the ability of a probe pair to detect its intended target. It measures the target-specific intensity difference of the probe pair (PM-MM) relative to its overall hybridization intensity (PM+MM):

 

R = (PM - MM) / (PM + MM)

 

The next step toward the calculation of a Detection p-value is the comparison of each Discrimination score to the user-definable threshold Tau. Tau is a small positive number that can be adjusted to increase or decrease sensitivity.

 

The one-sided Wilcoxon signed rank test is the statistical method employed to generate the detection p-value. It assigns each probe pair a rank based on how far the probe pair's discrimination score is from Tau.
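A sketch of this test with scipy (it mirrors the structure of the MAS 5 call, not Affymetrix's exact implementation or its handling of ties and saturation):

    import numpy as np
    from scipy.stats import wilcoxon

    pm = np.full(10, 80.0)                 # toy probe set: PM fixed at 80
    mm = np.linspace(10.0, 100.0, 10)      # MM rising past PM

    r = (pm - mm) / (pm + mm)              # discrimination scores
    tau = 0.015                            # user-definable threshold

    # One-sided Wilcoxon signed rank test of H0: median(r) = tau
    # against H1: median(r) > tau (i.e., target detected).
    stat, p_value = wilcoxon(r - tau, alternative="greater")
    print(p_value)                         # small p-value => call "Present"
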

 

 

In this hypothetical probe set, the PM intensity is 80 and the MM intensity for each probe pair increases from 10 to 100. The probe pairs are numbered from 1 to 10. As the MM probe cell intensity (x-axis) increases and becomes equal to or greater than the PM intensity, the discrimination score (y-axis) decreases. More specifically, as the intensity of the MM increases, our ability to discriminate between the PM and MM decreases. The dashed line is the user-definable parameter Tau (default = 0.015).

 

The signal is a quantitative metric calculated for each probe set; it represents the relative expression level of a transcript. A one-step Tukey's biweight estimate is used to calculate the signal.

 

Tukey's biweight is a weighting function often used in robust estimation, i.e., estimation that is insensitive to small departures from the idealized assumptions and hence minimizes the weight of outliers. Its curve (which resembles one cycle of a sine wave) is given by:

 

ψ(u) = u (1 - u²)² for |u| ≤ 1, and ψ(u) = 0 otherwise

 

and in graph form looks like this:

 

 

Each probe pair in a probe set is considered as having a potential vote in determining the signal value. The vote, in this case, is an estimate of the real signal due to hybridization of the target. The MM intensity is used to estimate the stray signal, and the real signal is estimated by taking the log of the PM intensity after subtracting the stray-signal estimate. A probe pair's vote is weighted more strongly if its signal value is closer to the median value for the probe set. Once the weight of each probe pair is determined, the weighted mean of the intensity values for the probe set is computed. This mean value is converted back to a linear scale and output as the signal.
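A sketch of a one-step Tukey biweight mean (our own implementation of the standard recipe; the tuning constant c = 5 and the small epsilon follow commonly cited Affymetrix defaults):

    import numpy as np

    def tukey_biweight(x, c=5.0, eps=1e-4):
        """One-step Tukey biweight mean: values far from the median
        get small weights; values beyond c MADs get weight zero."""
        med = np.median(x)
        mad = np.median(np.abs(x - med))            # robust spread estimate
        u = (x - med) / (c * mad + eps)             # scaled distance from median
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
        return np.sum(w * x) / np.sum(w)

    values = np.array([4.1, 4.3, 4.0, 4.2, 9.5])    # one outlier
    print(tukey_biweight(values), np.mean(values))  # ~4.16 vs 5.22
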

 

The software computes the probe-set signal as follows:

 

SignalLogValue = TukeyBiweight( log2(PM_j - CT_j) ), taken over the probe pairs j in the set

Signal = 2 ^ SignalLogValue

CT, or change threshold, is a version of MM that is never bigger than PM. The following rules avoid taking the log of negative numbers (a code sketch follows the list):

 

  • If MM < PM, use CT = MM (the typical case).
  • If MM ≥ PM, replace MM with an estimate of a typical MM for this probe set.
  • If even the typical MM estimate is greater than PM, set CT = PM - ε, where ε is a small value.
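As a simplified sketch of these rules (our own reading; MAS 5's actual "typical MM" estimate is a biweight of the probe set's log(PM/MM) ratios, which we replace with a median here):

    import numpy as np

    def change_threshold(pm, mm, eps=1e-3):
        """A stand-in for MM that is guaranteed to stay below PM."""
        ok = mm < pm                                 # well-behaved pairs
        # "Typical" PM/MM ratio estimated from the well-behaved pairs
        # (assumes the probe set has at least one such pair).
        typical_ratio = np.median(pm[ok] / mm[ok])
        ct = np.where(ok, mm, pm / typical_ratio)
        return np.minimum(ct, pm - eps)              # enforce CT < PM

    pm = np.array([120.0, 90.0, 200.0, 60.0])
    mm = np.array([40.0, 100.0, 50.0, 30.0])         # 2nd pair: MM > PM
    print(change_threshold(pm, mm))
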

 

Below is an illustration from Affymetrix of PM and MM values for a sample probeset.

 

 

The white bars are the change threshold (CT): the stand-in MM intensities produced by the rules above. In this sample, many of the MM intensities are lower than the PM, and hence the MM values can be used directly. Where PM < MM, the CT value is used instead.

 

The global scaling strategy sets the average signal intensity of the array to a target signal of 100. The key assumption of global scaling is that there are few changes in gene expression between the arrays being analyzed.

 

dCHIP

 

Motivation

 

Li and Wong published the first paper on the software dCHIP in 2001. Their motivation was that, even within a single data set, there was significant variation in the expression level reported by different probes. Using a data set of 21 HuGeneFL arrays, they discovered that most of the significant differences in expression level were due to probe effects rather than to variation between arrays. Hence, they had a vested interest in finding a statistical model (the model-based expression index) to normalize out probe effects.

 

The graphs below highlight Li and Wong's key observation: the relative values of probes within a probe set are very stable across multiple samples. Each line in the graph depicts one sample.

 

 

Li and Wong gave two reasons in the literature for working with the difference PM - MM:

 

  1. There is a computational advantage in taking the difference first and then fitting the model to the full data
  2. Earlier experiments showed the average difference is linear in the true expression level

 

Normalization of arrays based on an "invariant set"

 

For a group of arrays, one first normalizes all the arrays to a common baseline array: the one with the median overall brightness, as measured by the median CEL intensity.

 

Model-Based Expression Index

 

MBEI looks at multiple samples at a time and assigns different probes different weights. Each probe's signal is then proportional to the amount of target sample and to the affinity of the specific probe sequence for the target.

 

 

The error term in MBEI acts as a confidence measure, and θ (the expression index) and γ (the probe effect) are iteratively estimated to minimize it (a fitting sketch follows the list):

 

  1. Fit the model to one probe set and identify arrays with large standard errors.
  2. With these array outliers excluded, fit the model again to the data table with fewer rows.
  3. Examine the standard errors and magnitudes of the γ's to exclude probe outliers; if a γ is negative, regard it as a probe outlier and exclude the corresponding probe. The data table shrinks in columns, and we fit the model again.
  4. After probe outliers are excluded, re-evaluate all arrays for outliers and compare with the previous round's set of array outliers.
  5. Repeat until the sets of probe and array outliers no longer change.
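A minimal sketch of fitting this model by alternating least squares (a toy version; dCHIP additionally performs the outlier exclusion above):

    import numpy as np

    def fit_mbei(y, n_iter=50):
        """Fit y[i, j] ~ theta[i] * gamma[j] for one probe set,
        where rows are arrays and columns are probes."""
        n_arrays, n_probes = y.shape
        gamma = np.ones(n_probes)
        for _ in range(n_iter):
            theta = y @ gamma / (gamma @ gamma)           # expression update
            gamma = theta @ y / (theta @ theta)           # probe-effect update
            gamma *= np.sqrt(n_probes / (gamma @ gamma))  # sum(gamma^2) = J
        return theta, gamma

    rng = np.random.default_rng(3)
    true_theta = rng.uniform(50, 500, 6)               # 6 arrays
    true_gamma = rng.uniform(0.5, 2.0, 10)             # 10 probes
    y = np.outer(true_theta, true_gamma) + rng.normal(0, 5, (6, 10))
    theta, gamma = fit_mbei(y)
    print(np.corrcoef(theta, true_theta)[0, 1])        # close to 1
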

 

(a) Two arrays plotted against each other (b) Green circles represent the invariant set (c) After normalization (d) Q-Q plot shows that these two probe sets have nearly same distribution

 

Statistical Model

 

The individual probe response is based on the PM - MM difference and is modeled in the following way:

 

PM_ij - MM_ij = θ_i γ_j + ε_ij

 

θ_i (the expression index for array i) and γ_j (the affinity of probe j) are iteratively estimated to minimize the squared error ε:

 

Σ_ij (PM_ij - MM_ij - θ_i γ_j)²

RMA: Robust Multi-chip analysis

 

RMA outperforms dCHIP, which outperforms MAS 5, which outperforms MAS 4.

 

 

RMA was created by Irizarry and Speed in 2003. The authors observed that dedicating an MM probe to every PM probe wastes microarray space: if every MM probe were instead a PM probe, there would be much more information to work with. RMA therefore eliminates MM probes from its statistical model. The main idea is a heavy background adjustment of the PM probe intensities, followed by quantile normalization of the adjusted PM values and then taking logs.

 

RMA Background Subtraction

 

  • Signal + BG = PM
  • The signal is approximately exponentially distributed, while the background (BG) is approximately normal (a background-correction sketch follows below).
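A sketch of the closed form commonly cited for this convolution model, E[signal | observed] with signal ~ Exp(alpha) and background ~ N(mu, sigma^2) truncated at zero; estimating mu, sigma, and alpha from the data is a separate step omitted here:

    import numpy as np
    from scipy.stats import norm

    def rma_bg_correct(pm, mu, sigma, alpha):
        """Expected true signal given the observed PM intensity."""
        a = pm - mu - sigma**2 * alpha
        b = sigma
        num = norm.pdf(a / b) - norm.pdf((pm - a) / b)
        den = norm.cdf(a / b) + norm.cdf((pm - a) / b) - 1.0
        return a + b * num / den        # always positive

    pm = np.array([50.0, 120.0, 400.0, 2000.0])
    print(rma_bg_correct(pm, mu=100.0, sigma=30.0, alpha=0.005))
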

 

 

Background noise turns out to have the predicted shape:

 

 

Log (PM)

 

  • We use log(PM) to capture the fact that higher-value probes are more variable; this assumes the probe noise is comparable on the log scale

 

Statistical model

 

For each probe set, we set n_ij to be the background-adjusted, quantile-normalized PM intensity for probe j on array i, and then take the log of it to get

 

log2(n_ij).

 

The model is then:

 

log2(n_ij) = a_i + b_j + ε_ij

 

where a_i is the expression index for array i and b_j is the probe effect; n_ij is the PM value after quantile normalization across the n samples. What follows is similar to dCHIP in that a and b are iteratively refit to the data set. Essentially, the model changes from multiplicative to additive on the log scale, and the main difference is that the error is minimized on the log(PM) scale (a fitting sketch follows).
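RMA fits this additive model robustly using median polish; a minimal sketch of the idea (our own implementation):

    import numpy as np

    def median_polish(x, n_iter=10):
        """Fit x[i, j] ~ overall + row[i] + col[j] by alternately
        sweeping out row and column medians (Tukey's median polish)."""
        resid = x.copy()
        row = np.zeros(x.shape[0])
        col = np.zeros(x.shape[1])
        for _ in range(n_iter):
            rm = np.median(resid, axis=1)      # row medians
            row += rm
            resid -= rm[:, None]
            cm = np.median(resid, axis=0)      # column medians
            col += cm
            resid -= cm[None, :]
        overall = np.median(row)               # move shared level into overall
        row -= overall
        return overall, row, col, resid

    # Toy log2 PM matrix: rows are arrays (i), columns are probes (j).
    rng = np.random.default_rng(4)
    log_pm = rng.uniform(6, 8, 5)[:, None] + rng.uniform(-1, 1, 8)[None, :]
    overall, a, b, resid = median_polish(log_pm)
    print(overall + a)    # expression index a_i for each of the 5 arrays
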

 

Testing Methods

 

Spike-ins can be used to introduce known concentrations of marker transcripts into the RNA samples. These spike-ins should cover a broad range of concentrations. Each of the algorithms discussed here can then be run to see which most accurately detects the spike-ins.

 

Dilutions were introduced earlier as a test of quantile normalization. Here, a series of dilutions is used to see whether an algorithm is sensitive to concentration differences.

 

Latin square designs combine spike-ins and dilutions: the columns are various spike-ins, and the rows are different dilutions of each spike-in:

 

 

Robust algorithms should be able to normalize the data, but also detect various spike-ins (shown in grey):

 

 

 

Method Comparison Conclusion

 

  • No one uses MAS 4 now
  • With fold change, RMA > dCHIP > MAS 5
  • With p-value, RMA ~ MAS 5 > dCHIP
  • MAS 5 does a good job on abundant genes
  • dCHIP and RMA do a better job on less abundant genes
  • All four models are implemented in BioConductor (open source R package)

 

In Summary:

Feature | MAS 4 | MAS 5 | dCHIP | RMA
Uses raw MM/PM | Yes | | Yes |
Uses log PM | | Yes | | Yes
Expression index from a single array | Yes | Yes | |
Uses multiple arrays | | | Yes | Yes
Weighs probes in different probe sets differently | | Yes | Yes | Yes
Robust at low levels | | | | Yes
Can give negative expression index | Yes | | |
Uses iterative method to estimate probe effect and expression index | | | Yes | Yes

 

 

References:

 

  • Affymetrix Manual.
  • Astrand, M. Normalizing oligonucleotide arrays. Unpublished manuscript. http://www.math.chalmers.se/~magnusaa/maffy.pdf
  • Bolstad, B.M., Irizarry, R.A., et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19 (2): 185-193.
  • Cleveland, W.S., and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 1988, 83: 596-610.
  • Dudoit, S., Yang, Y.H., et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12: 111-139.
  • Hartemink, A., Gifford, D., et al. Maximum likelihood estimation of optimal scaling factors for expression array normalization. SPIE BiOS 2001.
  • Irizarry, R., Bolstad, B., Collin, F., et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 2003, 31 (4).
  • Li, C., and Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2001, 2 (8).
  • Li, C., and Wong, W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001, 98 (1): 31-36.
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002, 30 (4).
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA Microarray Data. SPIE BiOS 2001 presentation.

 

 

For more on Tukey’s Biweight, see Mathworld.wolfram.com
