Microarray Normalization and Expression Index


Microarray recap

 

 

insert link to previous topic here

 

Importance of microarray normalization

 

The first step of microarray analysis is normalization, which adjusts and balances individual hybridization intensities so that comparisons between arrays are biologically meaningful. Data must be normalized for a number of reasons, including different starting quantities of RNA, differences in the labeling efficiencies of the dyes, and unequal probe hybridization.

 

In essence, microarray normalization has three intended outcomes:

(1) Preserve biological variation between different samples

(2) Minimize experimental variation

(3) Allow comparison of different experiments

 

The choice of normalization method matters, because it can have a large effect on which downstream comparisons can be performed. We need to take several things into consideration when looking at a data set:

 

  • Median Scaling
  • Within / across slide variation
  • Dye Bias
  • Location bias
  • Probe bias

 

Microarray Normalization

 

Dye Bias

 

Often, the green Cy3 and the red Cy5 dyes do not label with equal efficiency.

 

If dyes are swapped in a replicate experiment, and there is no dye bias, then the two measured ratios are reciprocal (writing R1 = Cy5/Cy3 from the first hybridization and R2 = Cy5/Cy3 from the dye-swapped replicate):

 

R1 = 1 / R2, i.e., log2(R1) + log2(R2) = 0

 

However, if there is dye bias, then an average expression ratio can be obtained by calculating the geometric mean of the two estimates of the same biological ratio:

 

ratio = sqrt( R1 / R2 )

 
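As a quick sketch of why the geometric mean works (toy numbers and variable names are ours), a multiplicative dye-bias factor cancels out of the geometric mean of the two ratio estimates:

    import numpy as np

    # Hypothetical measured Cy5/Cy3 ratios for one gene.
    # r1: original labeling, r2: dye-swapped replicate.
    # If the true ratio is A/B and k is the dye-bias factor,
    # then r1 ~ k * (A/B) and r2 ~ k * (B/A).
    r1 = 2.6
    r2 = 0.65

    # Geometric mean of r1 and 1/r2: the factor k cancels.
    ratio = np.sqrt(r1 / r2)                        # estimate of A/B
    log_ratio = 0.5 * (np.log2(r1) - np.log2(r2))   # same on the log2 scale

    print(ratio, 2 ** log_ratio)                    # both 2.0
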

Median Scaling

 

Linear Scaling

 

We often use linear scaling to ensure that different arrays have the same median value and dynamic range, which allows future comparisons. The mathematical transformation is

 

x' = c2 (x - c1)

 

where x' is the transformed intensity, c1 is the amount the data is shifted so that the line of best fit goes through the origin, and c2 is the ratio of the arrays' dynamic ranges.

 

 
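A minimal sketch of this kind of scaling (our own code, not from a particular package): shift each array so its median matches a reference, and scale by the ratio of a robust dynamic-range measure, here the interquartile range:

    import numpy as np

    def linear_scale(x, reference):
        """Shift and scale x so its median and interquartile range
        match those of the reference array."""
        c1 = np.median(x)                                   # shift term
        iqr_x = np.subtract(*np.percentile(x, [75, 25]))
        iqr_ref = np.subtract(*np.percentile(reference, [75, 25]))
        c2 = iqr_ref / iqr_x                                # range ratio
        return c2 * (x - c1) + np.median(reference)

    rng = np.random.default_rng(0)
    ref = rng.gamma(2.0, 500.0, size=10_000)                # toy reference array
    arr = 1.7 * rng.gamma(2.0, 500.0, size=10_000) + 40.0   # biased array
    scaled = linear_scale(arr, ref)
    print(np.median(scaled), np.median(ref))                # now equal
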

However, the relationship between arrays is often not linear. In these cases, linear scaling is not a good idea. One can apply linear scaling to small data sets, but in other cases we use loess.

 

Loess (LOcally WEighted Scatterplot Smoothing)

 

Loess is designed to fit a smooth curve to the data by calculating the best linear fit in a local window around each point and aggregating these local fits into a nonlinear curve. It effectively applies a different scaling factor at each intensity level. We fit

 

Y = f(X)

 

and then transform X to X' = f(X),

 

so that Y and X' are now comparable.
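A sketch using the lowess implementation in statsmodels (one widely available choice; the smoothing fraction frac is a tuning parameter, not a standard value):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    rng = np.random.default_rng(1)
    x = rng.uniform(4, 14, 2_000)                           # array 1 (log scale)
    y = x + 0.3 * np.sin(x) + rng.normal(0, 0.2, x.size)    # array 2, nonlinear bias

    # Fit the smooth curve f with Y = f(X).
    fit = lowess(y, x, frac=0.3, return_sorted=True)        # columns: x, f(x)

    # Transform X to X' = f(X) by interpolating the fitted curve.
    x_prime = np.interp(x, fit[:, 0], fit[:, 1])

    # Y and X' are now comparable: residuals shrink versus the raw comparison.
    print(np.mean(np.abs(y - x_prime)), np.mean(np.abs(y - x)))
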

 

 

See Cleveland, W.S. and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc., 83, 596-610.

 

Reference for Normalization

 

Both loess and median scaling require a reference sample for normalization. You can either pick one treatment as your reference sample (e.g., the "control" condition) or pick a subset of genes to estimate the scaling factor.

 

When using one treatment as a reference sample, a common choice is the "middle" chip: the chip whose overall intensity is the median of all the samples. Note that the selection of the reference sample will influence the results.

 

When using a subset of genes to estimate the scaling factor, what genes do you choose? The two best groups of genes to use are housekeeping genes, which are present at constant levels, and genes which have an invariant rank across all samples (this indicates the gene is not differentially expressed).

 

Quantile Normalization

 

Quantile normalization is currently considered the best normalization method.

 

One of the first questions one may ask is: what's a quantile? Good question! Let's use an example from Mathematical Statistics and Data Analysis, 2nd ed., by John Rice. Consider the cumulative distribution function (cdf) of a uniform random variable on [0,1]:

 

F(x) = 0 for x ≤ 0
F(x) = x for 0 ≤ x ≤ 1
F(x) = 1 for x ≥ 1

Then, the pth quantile of the distribution F is defined to be that value Xp such that F(Xp) = p, or P(Y≤ Xp) = p. Special cases are p = 0.50, which corresponds to the median of F, and p = 0.25 and p = 0.75, which correspond to the lower and upper quartiles of F.

 

Summary: the p = k/q quantiles (k = 1, ..., q-1) hence divide a data set into q subsets with the same number of observations in each.

 

Hence, quantile-quantile (Q-Q) plots are useful for comparing distribution functions. In a Q-Q plot, the quantiles of one distribution are plotted against those of another. This is an easy way to see whether two data sets (say, control versus test sample) have the same distribution: if they do, the plotted quantiles fall along the straight diagonal line y = x.

 

The goal of quantile normalization is to make the distribution of probe intensities for each array in a set of arrays the same. The concept can be extended to n dimensions: if all n data vectors have the same distribution, plotting the quantiles in n dimensions gives a straight line along the unit diagonal vector. Hence, we can make a set of data have the same distribution by projecting the points of our n-dimensional quantile plot onto the diagonal.

 

We can use the projection of q onto d, where q = (q_1, ..., q_n) holds the kth quantile of each of the n arrays and d = (1/sqrt(n), ..., 1/sqrt(n)) is the unit diagonal:

 

proj_d(q) = (q · d) d = ( (1/n) Σ q_i, ..., (1/n) Σ q_i )

 

That is, each array's kth quantile is replaced by the mean of the kth quantiles across all arrays.

This implies the following algorithm for normalizing a set of data vectors by giving them the same distribution (a code sketch follows the list):

 

  1. Given n arrays of length p, form the matrix X of dimension p x n where each array is a column
  2. Sort each column of X to give X_sort
  3. Take the mean across each row of X_sort and assign that mean to every element in the row to get X'_sort
  4. Get X_normalized by rearranging each column of X'_sort to have the same ordering as the original X

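A minimal NumPy sketch of this algorithm (our own implementation; note that ties are broken arbitrarily here, whereas production implementations average tied ranks):

    import numpy as np

    def quantile_normalize(X):
        """Quantile-normalize a p x n matrix (p probes, n arrays):
        force every column to the same distribution, namely the
        mean of the sorted columns."""
        X_sorted = np.sort(X, axis=0)                 # step 2: sort columns
        row_means = X_sorted.mean(axis=1)             # step 3: row means
        ranks = np.argsort(np.argsort(X, axis=0), axis=0)
        return row_means[ranks]                       # step 4: restore order

    X = np.array([[5., 4., 3.],
                  [2., 1., 4.],
                  [3., 4., 6.],
                  [4., 2., 8.]])
    print(quantile_normalize(X))   # every column now has the same values
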
 

No experiment will retain its original values, but all the experiments will have the same distribution. The advantage of quantile normalization is that no reference sample has to be chosen: all samples are treated as equally important. The disadvantage is that every time a new sample is added to the data, normalization has to be run all over again. However, if the database is very large to start with, you can calculate the mean distribution once and then apply it to future samples.

 

Here is a visual representation:

 

 

Dilution Series Illustrates Effectiveness of Quantile Normalization

 

  1. An RNA sample is diluted to five different concentrations
  2. The experiment is replicated five times
  3. The replicates are scanned on 5 different scanners, which produce systematically higher or lower values
  4. Below is an illustration of before and after quantile normalization. Different concentrations are in different shades, and each concentration is scanned on a different scanner:

 

Below is the data set before normalization:

 

 

...and after normalization:

 

 

 

Normalization Quality Check

 

A little bit more background on the normalization calculations:

 

It is actually preferable to work with logged intensities for the following reasons:

 

  1. the variation of logged intensities and ratios of intensities is less dependent on absolute magnitude
  2. normalization is usually additive for logged intensities
  3. taking logs evens out highly skewed distributions
  4. taking logs gives a more realistic sense of variation
  5. logs capture the fact that higher-value probes are more variable, while making the probe noise comparable across the intensity range

 

Logarithms base 2 are used because intensities are 16-bit integers, i.e., between 0 and 2^16 - 1.

 

Below are scatter plots of probe intensities against their standard deviation (SD), before and after log transformation. Notice how in the left graph probes with higher values have higher SD, while after log transformation the SDs are more comparable.

 

 

MvA Plot

If normalization has worked, a plot of log2(R) vs. log2(G) (where R and G are the original red and green probe intensities) should fall on the diagonal.

 

 

MvA plots are a visual way of seeing whether normalization worked. The method compares the difference of the log expression values (M) against the average of the log expression values (A) for each probe. Mathematically,

M = log2(R) - log2(G)

A = (log2(R) + log2(G)) / 2

 

M should be close to 0 for all probes, because the expression of most genes does not change.
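Computing M and A takes one line each; a sketch with toy data (our own variable names):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    true = rng.gamma(2.0, 400.0, 5_000)              # toy expression levels
    R = true * rng.lognormal(0.0, 0.1, true.size)    # red channel
    G = true * rng.lognormal(0.1, 0.1, true.size)    # green channel, slight dye bias

    M = np.log2(R) - np.log2(G)                      # difference of logs
    A = (np.log2(R) + np.log2(G)) / 2                # average of logs

    plt.scatter(A, M, s=2, alpha=0.3)
    plt.axhline(0, color="red")                      # after good normalization, M hugs 0
    plt.xlabel("A"); plt.ylabel("M")
    plt.show()
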

 

 

Below are two example plots of five arrays for a PM probe sample set, using the liver (concentration 10) dilution series data.

 

 

 

Normalization Techniques Comparison

 

Method | Variance comparisons | Pairwise comparisons | Bias comparisons | Speed
Quantile normalization | Reduces variation slightly better | Gives smallest distance between arrays | Slight advantage | Fastest
Non-linear method | Reduces variation | Performs more poorly | |
Scaling method | Baseline | Had slightly higher slopes | |

 

 

Microarray Expression Value (Expression Index)

 

The per-gene summary computed from all of a microarray's probe-level outputs is called the expression index.

 

cDNA Microarrays

 

Fold change can be measured by taking a direct ratio, Cy5 / Cy3; data analyzed this way will always be positive. Fold change can also be calculated as log2(Cy5/Cy3); in this case, down-regulated genes give negative outputs.

 

The chart listed below is an example of the output you might see when using the latter method:

 

 

However, these two simple fold-change calculations are seldom used, because they are too naive and do not take into consideration the complexities of microarray data. The techniques listed below employ more sophisticated algorithms for analyzing microarray data and are more commonly used.

 

Affymetrix Microarray Expression Index

 

The goal now is to summarize the pattern contained in a probe set. As a general rule, perfect match probes carry more information, but this may not always be the case due to cross-hybridization.

 

Affymetrix created a solution to this problem by designing a "mismatch" (MM) probe for every perfect match (PM) probe. The mismatch probe contains a single mismatch located in the middle of the 25-base probe sequence, and serves as a control for its perfect match partner because it is as likely to bind to nonspecific sequences as the perfect match probe. This allows us to detect cross hybridization events and other false signals. These spurious signals can then be subtracted from the perfect match signal to obtain the true gene expression measurement.

 

Below is an illustration of a possible PM/MM profile for hypothetical gene x.

 

 

The software packages described below are tools that can be used to compute and compare microarray expression indices.

 

MAS 4

 

MAS 4 (Microarray Suite 4.0) is Affymetrix's older GeneChip analysis software. It calculates the average difference between PM and MM over a set A of suitable probe pairs chosen by the software:

 

AvgDiff = (1 / |A|) · Σ_{j in A} (PM_j - MM_j)

The set A is chosen in the following way to maintain a robust average:

 

  • highest and lowest differences are removed
  • the mean and standard deviation are calculated from the remaining probes
  • probes more than 3 standard deviations from the mean are removed

 

Drawbacks to this naïve algorithm are:

 

  • can omit 30-40% of the probes
  • can give negative values

 

MAS 5

 

MAS 5 (Microarray Suite 5.0) is the successor software for microarray analysis. The algorithm applies a log transform to reduce the dependence of the variance on the mean.

 

A two-step procedure determines the detection p-value for a given probe set:

 

  1. Calculate the discrimination score R for each probe pair.
  2. Test the discrimination scores against the user-definable threshold Tau.

 

The discrimination score describes the ability of a probe pair to detect its intended target. It measures the target-specific intensity difference of the probe pair (PM-MM) relative to its overall hybridization intensity (PM+MM):

 

R = (PM - MM) / (PM + MM)

 

The next step toward the calculation of a Detection p-value is the comparison of each Discrimination score to the user-definable threshold Tau. Tau is a small positive number that can be adjusted to increase or decrease sensitivity.

 

The one-sided Wilcoxon signed rank test is the statistical method employed to generate the detection p-value. It assigns each probe pair a rank based on how far the probe pair's discrimination score is from Tau.
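A sketch of this test with scipy (it mirrors the structure of the MAS 5 call, not Affymetrix's exact implementation or its handling of ties and saturation):

    import numpy as np
    from scipy.stats import wilcoxon

    pm = np.full(10, 80.0)                 # toy probe set: PM fixed at 80
    mm = np.linspace(10.0, 100.0, 10)      # MM rising past PM

    r = (pm - mm) / (pm + mm)              # discrimination scores
    tau = 0.015                            # user-definable threshold

    # One-sided Wilcoxon signed rank test of H0: median(r) = tau
    # against H1: median(r) > tau (i.e., target detected).
    stat, p_value = wilcoxon(r - tau, alternative="greater")
    print(p_value)                         # small p-value => call "Present"
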

 

 

In this hypothetical probe set, the PM intensity is 80 and the MM intensity for each probe pair increases from 10 to 100. The probe pairs are numbered from 1 to 10. As the MM probe cell intensity (x-axis) increases and becomes equal to or greater than the PM intensity, the discrimination score (y-axis) decreases. More specifically, as the intensity of the MM increases, our ability to discriminate between the PM and MM decreases. The dashed line is the user-definable parameter Tau (default = 0.015).

 

The signal is a quantitative metric calculated for each probe set; it represents the relative expression level of a transcript. A one-step Tukey's biweight estimate is used to calculate the signal.

 

Tukey's biweight is a weighting function often used in robust estimation, i.e., estimation that is insensitive to small departures from the idealized assumptions and hence minimizes the weight of outliers. Its curve (which resembles one cycle of a sine wave) is given by:

 

ψ(u) = u (1 - u²)² for |u| ≤ 1, and ψ(u) = 0 otherwise

 

and in graph form looks like this:

 

 

Each probe pair in a probe set is considered as having a potential vote in determining the signal value. The vote, in this case, is an estimate of the real signal due to hybridization of the target. The MM intensity is used to estimate the stray signal, and the real signal is estimated by taking the log of the PM intensity after subtracting the stray-signal estimate. A probe pair's vote is weighted more strongly if its signal value is closer to the median value for the probe set. Once the weight of each probe pair is determined, the weighted mean of the intensity values for the probe set is computed. This mean value is converted back to a linear scale and output as the signal.
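A sketch of a one-step Tukey biweight mean (our own implementation of the standard recipe; the tuning constant c = 5 and the small epsilon follow commonly cited Affymetrix defaults):

    import numpy as np

    def tukey_biweight(x, c=5.0, eps=1e-4):
        """One-step Tukey biweight mean: values far from the median
        get small weights; values beyond c MADs get weight zero."""
        med = np.median(x)
        mad = np.median(np.abs(x - med))            # robust spread estimate
        u = (x - med) / (c * mad + eps)             # scaled distance from median
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
        return np.sum(w * x) / np.sum(w)

    values = np.array([4.1, 4.3, 4.0, 4.2, 9.5])    # one outlier
    print(tukey_biweight(values), np.mean(values))  # ~4.16 vs 5.22
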

 

The software computes the probe-set signal as follows:

 

SignalLogValue = TukeyBiweight( log2(PM_j - CT_j) ), taken over the probe pairs j in the set

Signal = 2 ^ SignalLogValue

CT, or change threshold, is a version of MM that is never bigger than PM. The following rules avoid taking the log of negative numbers (a code sketch follows the list):

 

  • If MM < PM, use CT = MM (the typical case).
  • If MM ≥ PM, replace MM with an estimate of a typical MM for this probe set.
  • If even the typical MM estimate is greater than PM, set CT = PM - ε, where ε is a small value.
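As a simplified sketch of these rules (our own reading; MAS 5's actual "typical MM" estimate is a biweight of the probe set's log(PM/MM) ratios, which we replace with a median here):

    import numpy as np

    def change_threshold(pm, mm, eps=1e-3):
        """A stand-in for MM that is guaranteed to stay below PM."""
        ok = mm < pm                                 # well-behaved pairs
        # "Typical" PM/MM ratio estimated from the well-behaved pairs
        # (assumes the probe set has at least one such pair).
        typical_ratio = np.median(pm[ok] / mm[ok])
        ct = np.where(ok, mm, pm / typical_ratio)
        return np.minimum(ct, pm - eps)              # enforce CT < PM

    pm = np.array([120.0, 90.0, 200.0, 60.0])
    mm = np.array([40.0, 100.0, 50.0, 30.0])         # 2nd pair: MM > PM
    print(change_threshold(pm, mm))
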

 

Below is an illustration from Affymetrix of PM and MM values for a sample probeset.

 

 

The white bars are the change threshold (CT): the stand-in MM intensities produced by the rules above. In this sample, many of the MM intensities are lower than the PM, and hence the MM values can be used directly. Where PM < MM, the CT value is used instead.

 

The global scaling strategy sets the average signal intensity of the array to a target signal of 100. The key assumption of global scaling is that there are few changes in gene expression between the arrays being analyzed.

 

dCHIP

 

Motivation

 

Li and Wong published the first paper on the software dCHIP in 2001. Their motivation was that, even within a single data set, there was significant variation in the expression level reported by different probes. Using a data set of 21 HuGeneFL arrays, they discovered that most of the significant differences in expression level were due to probe effects rather than to variation between arrays. Hence, they had a vested interest in finding a statistical model (the model-based expression index) to normalize out probe effects.

 

The graphs below highlight Li and Wong's key observation: the relative values of probes within a probe set are very stable across multiple samples. Each line in the graph depicts one sample.

 

 

Li and Wong gave two reasons in the literature for working with the difference PM - MM:

 

  1. There is a computational advantage in taking the difference first and then fitting the model to the full data
  2. Earlier experiments showed the average difference is linear in the true expression level

 

Normalization of arrays based on an "invariant set"

 

For a group of arrays, one first normalizes all the arrays to a common baseline array: the one with the median overall brightness, as measured by the median CEL intensity.

 

Model-Based Expression Index

 

MBEI looks at multiple samples at a time and assigns different probes different weights. Each probe's signal is then proportional to the amount of target sample and to the affinity of the specific probe sequence for the target.

 

 

The error term in MBEI acts as a confidence measure, and θ (the expression index) and γ (the probe effect) are iteratively estimated to minimize it (a fitting sketch follows the list):

 

  1. Fit the model to one probe set and identify arrays with large standard errors.
  2. With these array outliers excluded, fit the model again to the data table with fewer rows.
  3. Examine the standard errors and magnitudes of the γ's to exclude probe outliers; if a γ is negative, regard it as a probe outlier and exclude the corresponding probe. The data table shrinks in columns, and we fit the model again.
  4. After probe outliers are excluded, re-evaluate all arrays for outliers and compare with the previous round's set of array outliers.
  5. Repeat until the sets of probe and array outliers no longer change.
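A minimal sketch of fitting this model by alternating least squares (a toy version; dCHIP additionally performs the outlier exclusion above):

    import numpy as np

    def fit_mbei(y, n_iter=50):
        """Fit y[i, j] ~ theta[i] * gamma[j] for one probe set,
        where rows are arrays and columns are probes."""
        n_arrays, n_probes = y.shape
        gamma = np.ones(n_probes)
        for _ in range(n_iter):
            theta = y @ gamma / (gamma @ gamma)           # expression update
            gamma = theta @ y / (theta @ theta)           # probe-effect update
            gamma *= np.sqrt(n_probes / (gamma @ gamma))  # sum(gamma^2) = J
        return theta, gamma

    rng = np.random.default_rng(3)
    true_theta = rng.uniform(50, 500, 6)               # 6 arrays
    true_gamma = rng.uniform(0.5, 2.0, 10)             # 10 probes
    y = np.outer(true_theta, true_gamma) + rng.normal(0, 5, (6, 10))
    theta, gamma = fit_mbei(y)
    print(np.corrcoef(theta, true_theta)[0, 1])        # close to 1
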

 

(a) Two arrays plotted against each other (b) Green circles represent the invariant set (c) After normalization (d) Q-Q plot shows that these two probe sets have nearly same distribution

 

Statistical Model

 

The individual probe response is based on the PM - MM difference and is modeled in the following way:

 

PM_ij - MM_ij = θ_i γ_j + ε_ij

 

θ_i (the expression index for array i) and γ_j (the affinity of probe j) are iteratively estimated to minimize the squared error ε:

 

Σ_ij (PM_ij - MM_ij - θ_i γ_j)²

RMA: Robust Multi-chip analysis

 

RMA outperforms dCHIP, which outperforms MAS 5, which outperforms MAS 4.

 

 

RMA was created by Irizarry and Speed in 2003. The authors observed that dedicating an MM probe to every PM probe wastes microarray space: if every MM probe were instead a PM probe, there would be much more information to work with. RMA therefore eliminates MM probes from its statistical model. The main idea is a heavy background adjustment of the PM probe intensities, followed by quantile normalization of the adjusted PM values and then taking logs.

 

RMA Background Subtraction

 

  • Signal + BG = PM
  • The signal is approximately exponentially distributed, while the background (BG) is approximately normal (a background-correction sketch follows below).
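A sketch of the closed form commonly cited for this convolution model, E[signal | observed] with signal ~ Exp(alpha) and background ~ N(mu, sigma^2) truncated at zero; estimating mu, sigma, and alpha from the data is a separate step omitted here:

    import numpy as np
    from scipy.stats import norm

    def rma_bg_correct(pm, mu, sigma, alpha):
        """Expected true signal given the observed PM intensity."""
        a = pm - mu - sigma**2 * alpha
        b = sigma
        num = norm.pdf(a / b) - norm.pdf((pm - a) / b)
        den = norm.cdf(a / b) + norm.cdf((pm - a) / b) - 1.0
        return a + b * num / den        # always positive

    pm = np.array([50.0, 120.0, 400.0, 2000.0])
    print(rma_bg_correct(pm, mu=100.0, sigma=30.0, alpha=0.005))
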

 

 

Background noise turns out to have the predicted shape:

 

 

Log (PM)

 

  • We use log(PM) to capture the fact that higher-value probes are more variable; this assumes the probe noise is comparable on the log scale

 

Statistical model

 

For each probe set, we set n_ij to be the background-adjusted, quantile-normalized PM intensity for probe j on array i, and then take the log of it to get

 

log2(n_ij).

 

The model is then:

 

log2(n_ij) = a_i + b_j + ε_ij

 

where a_i is the expression index for array i and b_j is the probe effect; n_ij is the PM value after quantile normalization across the n samples. What follows is similar to dCHIP in that a and b are iteratively refit to the data set. Essentially, the model changes from multiplicative to additive on the log scale, and the main difference is that the error is minimized on the log(PM) scale (a fitting sketch follows).
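RMA fits this additive model robustly using median polish; a minimal sketch of the idea (our own implementation):

    import numpy as np

    def median_polish(x, n_iter=10):
        """Fit x[i, j] ~ overall + row[i] + col[j] by alternately
        sweeping out row and column medians (Tukey's median polish)."""
        resid = x.copy()
        row = np.zeros(x.shape[0])
        col = np.zeros(x.shape[1])
        for _ in range(n_iter):
            rm = np.median(resid, axis=1)      # row medians
            row += rm
            resid -= rm[:, None]
            cm = np.median(resid, axis=0)      # column medians
            col += cm
            resid -= cm[None, :]
        overall = np.median(row)               # move shared level into overall
        row -= overall
        return overall, row, col, resid

    # Toy log2 PM matrix: rows are arrays (i), columns are probes (j).
    rng = np.random.default_rng(4)
    log_pm = rng.uniform(6, 8, 5)[:, None] + rng.uniform(-1, 1, 8)[None, :]
    overall, a, b, resid = median_polish(log_pm)
    print(overall + a)    # expression index a_i for each of the 5 arrays
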

 

Testing Methods

 

Spike-ins can be used to introduce known concentrations of marker transcripts into the RNA samples. These spike-ins should cover a broad range of concentrations. Each of the algorithms discussed here can then be run to see which most accurately detects the spike-ins.

 

Dilutions were introduced earlier as a test of quantile normalization. Here, a series of dilutions is used to see whether an algorithm is sensitive to concentration differences.

 

Latin square designs combine spike-ins and dilutions: the columns are various spike-ins, and the rows are different dilutions of each spike-in:

 

 

Robust algorithms should be able to normalize the data, but also detect various spike-ins (shown in grey):

 

 

 

Method Comparison Conclusion

 

  • No one uses MAS 4 now
  • With fold change, RMA > dCHIP > MAS 5
  • With p-value, RMA ~ MAS 5 > dCHIP
  • MAS 5 does a good job on abundant genes
  • dCHIP and RMA do a better job on less abundant genes
  • All four models are implemented in BioConductor (open source R package)

 

In Summary:

Feature | MAS 4 | MAS 5 | dCHIP | RMA
Uses raw MM/PM | Yes | | Yes |
Uses log PM | | Yes | | Yes
Expression index from a single array | Yes | Yes | |
Uses multiple arrays | | | Yes | Yes
Weighs probes in different probe sets differently | | Yes | Yes | Yes
Robust at low levels | | | | Yes
Can give negative expression index | Yes | | |
Uses iterative method to estimate probe effect and expression index | | | Yes | Yes

 

 

References:

 

  • Affymetrix Manual.
  • Astrand, M. Normalizing oligonucleotide arrays. Unpublished manuscript. http://www.math.chalmers.se/~magnusaa/maffy.pdf
  • Bolstad, B.M., Irizarry, R.A., et al. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19 (2): 185-193.
  • Cleveland, W.S., and Devlin, S.J. Locally-weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. 1988, 83: 596-610.
  • Dudoit, S., Yang, Y.H., et al. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 2002, 12: 111-139.
  • Hartemink, A., Gifford, D., et al. Maximum likelihood estimation of optimal scaling factors for expression array normalization. SPIE BiOS 2001.
  • Irizarry, R., Bolstad, B., Collin, F., et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Research 2003, 31 (4).
  • Li, C., and Wong, W. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2001, 2 (8).
  • Li, C., and Wong, W. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. PNAS 2001, 98 (1): 31-36.
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002, 30 (4).
  • Yang, Y.H., Dudoit, S., et al. Normalization for cDNA Microarray Data. SPIE BiOS 2001 presentation.

 

 

For more on Tukey’s Biweight, see Mathworld.wolfram.com
