# Introduction

If several objects are measured on a single variable, it is a trivial matter to place the measurements on a single line and then to compare them. Likewise, if the objects are measured on two variables, we can plot each object's position in the data space on a simple scatterplot. But now imagine a microarray experiment in which the expression levels of many thousands of genes are measured in dozens or hundreds of samples. It is impossible to visualize data of such high dimension, and thus our understanding of the data in terms of substantive biology may be seriously hindered. High dimension also poses a problem for detecting errors in experimental procedure or data entry, as points that are severe outliers in the high-dimensional space may seem not at all unusual in univariate and bivariate plots. However, the structure of the data is often such that a "summary" in lower dimension is adequate for purposes of visualization and interpretation. Three important techniques employed to bring about this kind of dimension reduction will be summarized in this article.

# Principal Components Analysis

## Motivation and Intuition

**Principal components analysis** is a multivariate technique for summarizing many variables in terms of fewer components with minimal loss of information. The first principal component is the linear combination of the variables with maximum variance; the subsequent principal components each have maximum variance, subject to the condition that they are uncorrelated with all preceding components. Although the usefulness of principal components analysis may not be immediately apparent from this terse definition, it is a powerful method for summarizing the structure of a large dataset.

Suppose that we have two variables, 1 and 2, as shown in the figure to the right (Stat 115 Microarray Analysis Lecture Slide 7). We can imagine that variable 1 is the expression level of one gene and variable 2 the expression level of another. In the figure we observe that variables 1 and 2 are highly correlated; that is, the expression levels of these genes move together in close synchrony across samples. We know this from inspection of the figure because the sample points cluster around an upward-sloping line rather than being dispersed across the data space in a random cloud. If we insert a red coordinate axis that coincides with this line, then the position of each point in the *two*-space is well summarized by a *single* number representing its projection onto the inserted axis. That is, in describing our two-dimensional data, we do almost as well when we reduce the dimension to one. This simple example captures one of the essential merits of principal components analysis: a great deal of the information in the data is well summarized by the projections of the sample points on **principal axes** (or **principal components**) that are fewer in number than the variables.

What we desire is a mathematical criterion that will optimize the placement of the new coordinate axis with respect to summarization of the data. It turns out that the construct of **variance** is central to this criterion. We motivate this assertion with an appeal to visual intuition. Examining the figure to the right, note the variance of the projections of the sample points on the red axis sloping upward from left to right. As the projections are spread out all along the axis, this variance is fairly large. Now imagine rotating the red axis about the origin so that it departs from its "ideal" orientation. It is clear that the variance of the projections becomes smaller; that is, the projections of the sample points on the red coordinate axis become ever closer together. It thus seems reasonable to seek a vector on which the variance of the projections is maximized.
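This variance criterion can be checked with a short numerical sketch. The points below are invented for illustration, and the 45-degree axis stands in for the figure's upward-sloping red axis:

```python
import math

# Hypothetical correlated 2-D data: points clustered around an upward-sloping line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

def projection_variance(angle):
    """Variance of the projections of the points onto an axis at `angle` radians."""
    ux, uy = math.cos(angle), math.sin(angle)
    projs = [x * ux + y * uy for x, y in points]
    mean = sum(projs) / len(projs)
    return sum((p - mean) ** 2 for p in projs) / (len(projs) - 1)

# The axis along the trend of the data should carry far more variance
# than an axis rotated away from it.
var_along = projection_variance(math.pi / 4)       # axis along the point cloud
var_across = projection_variance(3 * math.pi / 4)  # perpendicular axis
print(var_along > var_across)
```

Rotating the axis away from the cloud's long direction squeezes the projections together, exactly as the figure suggests.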

Notice that a second red axis has been inserted into the figure above. We have now selected a new coordinate system for this two-dimensional data space by rotating the original one that employed variables 1 and 2 as coordinate axes. Two features of the data with respect to these new axes are worthy of comment. First, the second principal axis is orthogonal to the first one. This is obviously a desirable feature. Imagine teaching children Cartesian coordinates with *x* and *y* axes that are not orthogonal! To be somewhat more formal, a non-orthogonal second axis would mean that the projections on this axis would convey information that is redundant to some extent with that conveyed by the first axis. Second, the second principal axis is placed in such a way as to maximize the variance of the projections, subject to the condition that it is orthogonal to the first principal axis. In the two-dimensional case this is a rather trivial coupling of constraints, as there is only one orientation that is orthogonal to the first principal axis. But imagine a space of higher dimension such that the location of a second coordinate axis (orthogonal to the first one) is no longer trivial. Now it is obvious that the placement of the second axis must be *chosen* so as to maximize the variance of the projections.

A higher-dimensional example may clarify the importance of the second point. If we tossed a number of darts at a dartboard and measured the *x*, *y*, and *z* positions of the tops of the darts, then we would have a point cloud in three-space. Suppose that measurements on *x* and *y* are somewhat correlated; for some reason we have a tendency to land our darts in a narrow strip of the dartboard running from southwest to northeast. If we have to summarize the positions of dart tops with a single number, obviously a good choice is the projection of each dart top onto a new coordinate axis running from southwest to northeast. But what if we are allowed two numbers? There are then two degrees of freedom for the placement of the second axis. Because the variation in the *z* direction (i.e., the variation in the millimeters of penetration into the board) is negligible relative to the centimeter-scale variation in distance along the dartboard from the first axis, an orientation that pierces the board would be a very poor choice. An orientation in the *xy* plane, running from northwest to southeast, would be much more informative about the structure of the data.

## Mathematical Foundations

Let us begin by summarizing where our examination of the figure above has led us. In order to capture the structure of the data with fewer dimensions than the number of variables, we would like to rotate the coordinate axes to a new orientation. The projections of the sample points on one of the new axes created in this way should have maximum variance. The projections of the sample points on the successive axes should also have maximum variance, subject to the constraint that the projections of the sample points on all of the axes are mutually uncorrelated. The following derivations will provide us with a mathematical formalism that satisfies these requirements.
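The criterion can also be stated in matrix terms: the principal axes are the eigenvectors of the sample covariance matrix, ordered by their eigenvalues, and each eigenvalue equals the variance of the projections on the corresponding axis. A numerical sketch with invented data:

```python
import numpy as np

# Toy data matrix: 5 samples (rows) measured on 2 variables (columns).
# These values are illustrative, not taken from the article's dataset.
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.2],
              [4.0, 3.8],
              [5.0, 5.1]])

# The principal axes are the eigenvectors of the covariance matrix,
# ordered by their eigenvalues (the variances of the projections).
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: projections of the centered data on the principal axes.
scores = (X - X.mean(axis=0)) @ eigvecs

# The variance of each column of scores equals the corresponding eigenvalue,
# and the two sets of scores are uncorrelated.
print(np.allclose(scores.var(axis=0, ddof=1), eigvals))
```

The printed check confirms both requirements above: maximal (eigenvalue-sized) variances and mutually uncorrelated projections.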

It is also possible to carry out a principal components analysis of standardized variables. This procedure amounts to replacing the *covariance* matrix in the derivations above with the *correlation* matrix. However, the reader is warned that principal components derived from the correlation matrix are generally different from those derived from the covariance matrix. One set is not a simple function of the other. The choice of whether to standardize the variables can be a subtle one. Variables should probably be standardized if they are measured on scales with widely differing ranges or if the units of measurement are not commensurate. In most microarray experiments there is probably no urgent rationale for standardization, but it may be worthwhile to perform principal components analyses of both the raw and standardized data in order to determine whether the results are broadly congruent.
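The effect of standardization can be seen in a small sketch. The data below are invented for illustration, with one variable on a hundredfold larger scale than the other:

```python
import numpy as np

# Two correlated variables on very different scales.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
X = np.column_stack([a * 100 + rng.normal(size=50),  # large-scale variable
                     a + rng.normal(size=50)])       # small-scale variable

def first_axis(M):
    """Leading eigenvector of a symmetric matrix M, with a fixed sign convention."""
    vals, vecs = np.linalg.eigh(M)
    v = vecs[:, np.argmax(vals)]
    return v if v[0] >= 0 else -v

v_cov = first_axis(np.cov(X, rowvar=False))       # components of raw variables
v_cor = first_axis(np.corrcoef(X, rowvar=False))  # components of standardized variables

# On the covariance matrix the large-scale variable dominates the first axis;
# on the correlation matrix the two variables contribute equally.
print(v_cov, v_cor)
```

The first axis of the covariance matrix points almost entirely along the large-scale variable, while the first axis of the correlation matrix weights the two variables evenly, illustrating why the two analyses are not interchangeable.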

Principal components can also be obtained directly from the data matrix rather than from the covariance matrix by the use of the **singular value decomposition**. Since the singular value decomposition is typically not taught in undergraduate courses (or even in graduate courses), the reader is advised to consult an advanced text on either linear algebra or multivariate statistics for a more complete explication of this decomposition and its application to principal components analysis.
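For readers already acquainted with the decomposition, the connection can at least be stated compactly: the right singular vectors of the centered data matrix are the principal axes, and the squared singular values, divided by *n* - 1, are the eigenvalues of the covariance matrix. A sketch with invented data:

```python
import numpy as np

# Illustrative 5-sample, 2-variable data matrix (not the article's dataset).
X = np.array([[1.0, 1.1],
              [2.0, 1.9],
              [3.0, 3.2],
              [4.0, 3.8],
              [5.0, 5.1]])
Xc = X - X.mean(axis=0)   # center each column
n = X.shape[0]

# Singular value decomposition of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# The squared singular values, scaled by 1/(n - 1), are the eigenvalues
# of the covariance matrix (sorted in descending order).
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
print(np.allclose(s**2 / (n - 1), eigvals))
```

The rows of `Vt` are the principal axes, so the decomposition delivers the components without ever forming the covariance matrix explicitly.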

## An Application to Microarray Data

The data used in the following example may be found at the Statistics 115 website under Homework B2. The data themselves consist of expression levels of 120 genes in 14 samples (7 cancer patients, 4 normal controls, 3 patients of uncertain diagnosis). What we would like to know is whether the profiles of gene expression in the unknowns more closely resemble those of the cancer patients or the normal controls. Because the gene space has a dimension of 120, it is impossible for us to plot the sample points in the gene space. Fortunately, the first two eigenvalues of the covariance matrix of the genes (as calculated by the statistical computing package R) are quite large relative to the rest, indicating that the first two principal components capture much of the variation in the data. The plot of the samples on the first and second principal components is given to the left.

As can be seen, the cancer and normal samples differ sharply in their profiles of scores on the first two principal components. On these data it seems quite reasonable to infer that Unknown2 and Unknown3 are likely to be cancerous.

## A Note Regarding Interpretation

It is often tempting to interpret a principal component as a latent source of variance with causal influences on the variables that are highly correlated with it. For example, suppose that a distinct group of genes show high correlation with a particular principal component of the gene space. We might interpret this observation as evidence that the regulation of these genes shares some common cause represented by the principal component--perhaps something tangible like a common transcription factor, perhaps something more abstract such as a higher-order biological process. Strictly speaking, this kind of causal modeling belongs in the domain of **factor analysis**, a multivariate technique that in some versions is very similar to principal components analysis in manner of calculation. In many circumstances principal components analysis, because of its mathematical elegance, is used as a proxy for factor analysis and yields quite acceptable results as such. However, the two techniques are conceptually distinct. Principal components analysis is best thought of as a method for deriving optimal indices for the summarization of data. Factor analysis, on the other hand, is employed to discover or confirm explicit models accounting for the covariances between variables as the result of mutual dependencies on unobserved causal factors. It is important not to confuse the two. Whether factor analysis should be used explicitly more often in bioinformatics and computational biology involves subtle issues in the philosophy of science that will not be discussed here. For more on the distinction between principal components and factor analysis, the reader is advised to consult sources that discuss both techniques.

## Extensions

One application of principal components analysis is known as **gene shaving**. The goal of gene shaving is displayed schematically in the figure to the right (Stat 115 Microarray Analysis Lecture Slide 17). Essentially, gene shaving is a method for identifying blocks of genes, possibly overlapping, with similar expression profiles. This is done by finding blocks of genes that show large between-sample variance in expression levels but small between-gene variance. The absence of variation in expression levels within a block implicates the genes in a common biological process, whereas the presence of variation in expression levels across samples indicates meaningful variation along some dimension of the putative biological process.

We take the data matrix, representing the expression levels of genes as deviation scores. We then find scores of the genes on the first principal component of the sample space. A low absolute value for this component score indicates that the expression of the gene is not particularly variable across subjects. Thus, some specified percentage of genes with the lowest absolute values of scores on the first principal component are "shaved." This process is repeated with the shaved data matrix, which now has fewer rows (or columns) representing genes. The iterations continue until only one gene is left. The result of this process is a nested sequence of gene blocks, each a smaller subset of the previous one. One of these is picked out according to a gap statistic summarizing the between-gene and between-sample variances of a block. The entire process is then repeated, except this time the rows (or columns) of the data matrix representing the genes are centered on the mean expression level of the genes in the block chosen previously. Iterations of the entire algorithm continue until the desired number of blocks have been extracted.
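The shaving loop itself can be sketched as follows. This is a simplified illustration of the procedure just described, not the authors' implementation: the gap-statistic selection of a block and the re-centering step for subsequent blocks are omitted, and the 10% shaving fraction is an arbitrary choice:

```python
import numpy as np

def shave_sequence(X, frac=0.1):
    """One pass of the shaving loop. X has genes as rows and samples as
    columns. Repeatedly drop the `frac` of remaining genes whose centered
    expression has the smallest absolute score on the first principal
    component of the current block, until one gene remains. Returns the
    nested sequence of gene-index blocks."""
    blocks = []
    idx = np.arange(X.shape[0])
    while len(idx) > 1:
        blocks.append(idx.copy())
        block = X[idx]
        bc = block - block.mean(axis=1, keepdims=True)   # deviation scores per gene
        # First principal component of the sample space: leading right
        # singular vector of the centered gene-by-sample block.
        _, _, Vt = np.linalg.svd(bc, full_matrices=False)
        scores = np.abs(bc @ Vt[0])                      # each gene's |score| on PC1
        keep = max(1, min(len(idx) - 1,
                          int(np.ceil(len(idx) * (1 - frac)))))
        idx = idx[np.argsort(scores)[::-1][:keep]]       # keep the highest scorers
    blocks.append(idx.copy())
    return blocks

# Example: 20 hypothetical genes x 6 samples of noise, with a coherent
# sub-block (genes 0-4) sharing a strong expression profile.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
X[:5] += np.linspace(-2, 2, 6)
blocks = shave_sequence(X)
print([len(b) for b in blocks])   # a strictly shrinking nested sequence
```

Each element of `blocks` is a subset of its predecessor, which is the nested sequence from which the gap statistic would then pick a winner.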

# Multidimensional Scaling

Principal components analysis incidentally permits the display of multivariate data in a low-dimensional space. We now consider a technique known as **multidimensional scaling**, where the primary objective is to project the data onto a low-dimensional space in such a way that the structure of the data is distorted as little as possible.

## Motivation and Intuition

Consider the locations of various points in the country of Chile. Chile is about 2,650 miles long from north to south, about 110 miles wide from east to west, and roughly 4 miles from its lowest point of elevation to its highest. We thus require three dimensions for an accurate determination of location within Chile: longitude, latitude, and altitude. Another way to put it is that the distance between any two locations in Chile has three components (*x*, *y*, and *z* if we like) that must be individually squared, then summed, and finally taken to the power one-half in order to produce a single number capturing the magnitude of the distance.

Clearly, most models of Chile are not three-dimensional replicas. Most are flat maps where the dimension of altitude is discarded entirely. The effect of this reduction is to distort the distances between the various locations by treating variation in *z* as if it did not exist. The distance between points A and B is declared to be (*x*^2 + *y*^2)^(1/2) rather than the true (*x*^2 + *y*^2 + *z*^2)^(1/2). How serious are the consequences of this distortion for our understanding of Chile's layout? Hopefully, the reader has enough intuition for maps to understand immediately that for most purposes the distortion is utterly trivial. Take locations A, B, and C. There are three pairwise distances among these locations: the distances between A and B, A and C, and B and C. There is so little variation in *z* that ignoring it in our calculation of these distances has a very weak effect on their magnitudes. It is very probable that the relative ordering of these magnitudes will be preserved. If we had to solve the Traveling Salesman Problem for various locations scattered across Chile, a solution computed on a flat map would probably be just as good as that computed on a realistic three-dimensional topographical model. (Even if the latter solution were superior, it would be difficult to implement because of practical constraints on the precision of our movement in *z*!)
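The claim about rank-order preservation is easy to verify numerically. The coordinates below are invented, with altitude (*z*) varying over a couple of miles while ground distances span hundreds:

```python
import math

# Three hypothetical locations: (x, y) in miles along the ground, z in miles
# of altitude. Altitude varies by only a couple of miles.
A, B, C = (0, 0, 0.1), (60, 80, 1.2), (200, 150, 2.0)

def dist3(p, q):
    """True three-dimensional distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dist2(p, q):
    """Flat-map distance: z is discarded entirely."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

pairs = [(A, B), (A, C), (B, C)]
true_order = sorted(range(3), key=lambda i: dist3(*pairs[i]))
flat_order = sorted(range(3), key=lambda i: dist2(*pairs[i]))
print(true_order == flat_order)   # the rank order of the distances survives
```

Because the *z* differences contribute at most a few square miles under the square root, every pairwise distance shrinks only negligibly and the ordering is untouched.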

In fact, the geography of Chile is such that we would not do all that badly if we discarded *both* altitude and longitude and relied solely on latitude. In other words, the variation in *y* is so great relative to that in the other two dimensions that modeling the various locations as points arrayed along a single line captures nearly all of the distance relationships among the locations. If we specified the location of a particular city in Chile with a single number, say the number of miles from the country's southernmost point, uncertainty about where the city lies along the country's 2,650-mile north-south axis would be eliminated entirely, whereas even a bare guess as to where the city lies along the east-west axis would rarely be off by more than a hundred miles.

Now suppose that we want to construct a model not only of locations on the ground but also of the satellites that happen to be passing over Chile. In this case discarding altitude has very damaging consequences for our understanding of location. Suppose that two cities, A and B, are both slightly displaced from our north-south axis but that their orthogonal projections on the axis are only 100 miles apart. Suppose also that a satellite happens to be in geosynchronous orbit 22,240 miles above the midpoint between the projections of A and B on the north-south axis. (The slight displacements are necessary to ensure that the three objects are not located in the same plane. If they are located in the same plane, this example becomes somewhat trivial. An exercise for the reader: Why does the example become trivial?) Taking only the latitudes and longitudes of these three points gives the seriously misleading impression that the satellite is closer to both towns than the towns are to each other and that all three locations are clustered in a relatively small area. Obviously, this representation will not suffice. But notice that it is not *impossible* to project the three locations onto a flat map so as to preserve the rank order of their relative distances from each other. We might put the satellite at the tip of a very long isosceles triangle with A and B at the congruent angles. In this representation A and B are closest, the satellite is equally distant from A and B, and the three locations are spread out across a large space. In all of these aspects the flat map more or less faithfully conveys the corresponding reality of the higher-dimensional space.

## Implementation

The examples above capture the essence of multidimensional scaling. What we seek is the arrangement of data points in a lower-dimensional space that minimizes the **stress** resulting from distortion or disordering of the distances between the points in the higher-dimensional space. The nature of the algorithm that performs this minimization is sketched out below.
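One version of this minimization admits a closed-form solution: **classical (metric) scaling**, which recovers low-dimensional coordinates directly from the matrix of pairwise distances by double centering. The sketch below is a generic textbook implementation, assuming Euclidean distances and invented points:

```python
import numpy as np

def classical_mds(D, q=2):
    """Classical (metric) multidimensional scaling: given an n x n matrix of
    pairwise Euclidean distances D, recover coordinates in q dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:q]    # keep the q largest eigenvalues
    L = np.sqrt(np.maximum(vals[order], 0))
    return vecs[:, order] * L             # coordinates, one row per object

# Round trip: plant four points in 2-D, then scale their distance matrix
# back down to 2-D and check that all pairwise distances are reproduced.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
Y = classical_mds(D, q=2)
D_rec = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_rec))
```

When the target dimension is smaller than the true dimension, the discarded eigenvalues measure exactly the distortion that the stress criterion seeks to minimize.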

## An Application to Microarray Data

We examine the same data on gene expression levels across patients as those subjected to principal components analysis in the previous section. The objective is to project the 14 samples from their original 120-dimensional gene space onto a mere two dimensions while preserving the original distances between samples as much as possible. The representation found by the metric multidimensional scaling algorithm in R is given to the right.

Note the striking visual similarity between the respective plots given by principal components analysis and multidimensional scaling. Both cleanly separate cancer samples from normals and group Unknown2 and Unknown3 with the cancer samples. It is always reassuring to find that methods with distinct conceptual bases converge on the same results, but here the similarity of the plots is attributable to more than just the robustness of the findings. It turns out that metric multidimensional scaling to *q* dimensions is equivalent to plotting scores on the first *q* principal components. Here R happened to choose opposite signs for the corresponding eigenvectors, so initially the two solutions appear to be somewhat different. However, a reflection of both axes in the multidimensional scaling plot recovers the plot of the first two principal components exactly. Given the distinct conceptual approaches through which we arrived at principal components and multidimensional scaling, their equivalence in the case of metric scaling probably takes most readers by surprise. This result illustrates the mathematical profundity of principal components.
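This equivalence is easy to confirm numerically: the metric-scaling coordinates and the principal-component scores agree up to a possible reflection of each axis. The data below are randomly generated purely for illustration:

```python
import numpy as np

# Illustrative data: 10 samples measured on 5 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 5))
Xc = X - X.mean(axis=0)

# Scores on the first two principal components, via the SVD of the
# centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc_scores = Xc @ Vt[:2].T

# Classical (metric) scaling of the pairwise Euclidean distances.
D = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
n = X.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J               # equals Xc @ Xc.T for Euclidean D
vals, vecs = np.linalg.eigh(B)
order = np.argsort(vals)[::-1][:2]
mds_coords = vecs[:, order] * np.sqrt(vals[order])

# Equal up to a possible reflection (sign flip) of each axis.
print(np.allclose(np.abs(pc_scores), np.abs(mds_coords), atol=1e-6))
```

The sign ambiguity in the eigenvectors is precisely the "opposite signs" phenomenon noted above for the R output.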

# Fisher's Linear Discriminant

In the application that we have been using as an example, dimension reduction has allowed us to attempt the classification of unknown samples as either normal or cancerous. This is because the samples that are already known to be either normal or cancerous preferentially inhabit certain regions of the low-dimensional representation of the data, allowing us to draw a line that clearly defines the regions and to classify the unknown samples that fall on either side. We now examine a technique, called **Fisher's linear discriminant**, specifically designed for the purpose of separating groups and classifying new observations whose group membership is unknown. Just as techniques for dimension reduction proved useful for separation and classification, Fisher's linear discriminant turns out to be a useful tool for dimension reduction. In this article we focus on the properties of Fisher's linear discriminant relevant to dimension reduction. Readers interested in its use as a classificatory technique should consult Kerry Geiler's article on microarray classification methods.

## Motivation and Intuition

The observations in a study are often known to belong to two or more classes. This situation is depicted in the figure below (Stat 115 Microarray Analysis Lecture Slide 26). The observations belong to three groups that are labeled by color.

If we wish to reduce the dimension of the data from two to one while preserving the separation of the groups as much as possible, we see that this can be easily done. That is, we observe a vector in the span of the two variables (i.e., a linear combination of the two variables) on which the projections of the data retain much of the original separation among groups. This is in contrast to the vector nearly orthogonal to this optimal vector, labeled "worst 1D subspace," on which the projection of a data point tells us very little about its class membership. This optimal vector for the projection of high-dimensional data, at least with respect to group separation, is Fisher's linear discriminant. Incidentally, the retention of information on group membership in the projection to the lower-dimensional space is what allows the discriminant to be used for the classification of new observations. In the figure above, imagine setting two cut scores on the optimal projection, one at the point separating the red points from the blue, and another at the point separating the blue from the green. In the simplest case, the position of a new observation's projection relative to the cut points determines its classification.

## Mathematical Foundations
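Although no formal derivation is given here, the two-group case can be stated and checked compactly: Fisher's discriminant is the direction *w* maximizing the ratio of between-group to within-group variance of the projections, which works out to *w* proportional to the inverse of the pooled within-group scatter matrix times the difference of the group means. A sketch with invented data:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher discriminant direction: w = Sw^{-1} (m1 - m2),
    where Sw is the pooled within-class scatter matrix. A sketch for two
    groups; the multi-group case solves a generalized eigenproblem."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S1 + S2, m1 - m2)
    return w / np.linalg.norm(w)

# Two illustrative groups separated along a diagonal direction.
rng = np.random.default_rng(3)
G1 = rng.normal(size=(30, 2)) + np.array([2.0, 2.0])
G2 = rng.normal(size=(30, 2)) - np.array([2.0, 2.0])
w = fisher_direction(G1, G2)

# Projections on w separate the groups; a midpoint cut score classifies.
p1, p2 = G1 @ w, G2 @ w
cut = (p1.mean() + p2.mean()) / 2
print((p1 > cut).mean(), (p2 < cut).mean())   # fractions correctly classified
```

The midpoint cut score here plays exactly the role of the cut scores described in the intuition above, with one discriminant sufficing because there are only two groups.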

## An Application to Intelligence Test Data

The reduced space onto which the data are projected by Fisher's linear discriminant will often appear similar to what is produced by the two methods reviewed earlier. For example, when used as a classifier, Fisher's linear discriminant groups Unknown1 with the normal samples and Unknown2 and Unknown3 with the cancers, a result qualitatively similar to those found by principal components analysis and multidimensional scaling.

For a change of pace we look at an application of Fisher's linear discriminant in a different discipline. The data in the following example are taken from Capron and Duyme (1996). These researchers selected as subjects thirty-eight adoptees, all relinquished at birth and adopted soon thereafter, on the basis of the socioeconomic status (SES) of their biological parents (B) and that of their adoptive parents (A). Only adoptees with biological and adoptive parents of the highest and lowest SES were selected; thus, the study consisted of a simple 2-by-2 design, with two levels (+/-) of two factors (SES of biological parents, SES of adoptive parents). At the age of fourteen all subjects were administered an IQ test consisting of ten subtests. An analysis of variance found that both factors exerted significant effects on Full Scale IQ with no interaction.

The figure to the left shows a plot of the subjects' scores on the first two linear discriminants derived from the ten subtests. Group 1 is B+/A+, Group 2 is B+/A-, Group 3 is B-/A-, and Group 4 is B-/A+. The separation of the four groups is moderately good. If not for the B-/A+ group, the separation would be very good. The first discriminant appears roughly to track overall level of performance, as it most sharply separates the highest-scoring group (B+/A+) from the lowest (B-/A-). The second discriminant is more difficult to interpret; perhaps it distinguishes children who grow up in environments congruent with their heredity from those who find themselves out of place. These results are broadly in agreement with those derived from ANOVA; a child's genetic and environmental backgrounds have substantial effects on the level and profile of performance on mental ability tests.

# Conclusion

The complexities of the phenomena investigated by molecular and computational biologists necessitate the collection of observations on many different variables. The need to understand the relationships among this welter of variables can present a daunting challenge to human beings, who are typically limited to spatial comprehension of three dimensions. The techniques discussed in this article can be thought of as means for simplifying a complex multivariate dataset through dimension reduction while simultaneously preserving one or more of its important features. Principal components analysis aims for the preservation of the prominent *dimensions of variation*; multidimensional scaling, the *relative distances between objects in the data space*; and, finally, Fisher's linear discriminant, *the separation of distinct groups*. All three techniques are valuable tools for assisting our efforts to transform raw data into a deeper understanding of biological phenomena.

# Additional Links and Reading

Johnson RA & Wichern DW (2002). *Applied Multivariate Statistical Analysis* (5th ed). Upper Saddle River, NJ: Prentice Hall.

**A comprehensive textbook on multivariate methods, from which most of the mathematical material in this article is adapted. This book eschews derivations based on matrix calculus, but nevertheless does require a strong motivation to tackle dense quantitative material. On the whole an excellent balance of depth, thoroughness, and accessibility.**

Cox TF (2005). *An Introduction to Multivariate Data Analysis*. London, UK: Hodder Education.

**Covers most of the same material as Johnson and Wichern, but is much lighter and cheaper.**

Friedberg SH, Insel AJ & Spence LE. *Linear Algebra* (4th ed). Upper Saddle River, NJ: Prentice Hall.

**A standard textbook on theoretical linear algebra. Contains some interesting applications.**

Wikipedia articles on principal components analysis, multidimensional scaling, linear discriminant analysis, eigenvalue, eigenvector, and eigenspace, the spectral theorem, and the singular value decomposition.

**The quality of Wikipedia articles varies. Generally speaking, articles about mathematical topics, mathematicians, and mathematical history tend to be good.**

The R Project for Statistical Computing.

**Free software and support for the statistical computing package R, which can carry out all of the statistical techniques discussed in this article.**

The BIO280 online lecture video.

Wall ME, Rechsteiner A & Rocha LM (2002). Singular value decomposition and principal component analysis. In DP Berrar, W Dubitzky & M Granzow (Eds.), *A Practical Approach to Microarray Analysis* (pp. 91-109). Norwell, MA: Kluwer.

Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D & Brown P (2000). 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. *Genome Biology*, 1(2):RESEARCH0003.

PMID 11178228

Oldham MC, Horvath S & Geschwind DH (2006). Conservation and evolution of gene coexpression networks in human and chimpanzee brains. *Proceedings of the National Academy of Sciences*, 103(47), 17973-17978

PMID 17101986

**An application of multidimensional scaling.**

Korbel JO, Doerks T, Jensen LJ, Perez-Iratxeta C, Kaczanowski S, Hooper SD, Andrade MA & Bork P (2005). Systematic association of genes to phenotypes by genome and literature mining. *Public Library of Science Biology*, 3(5): e134.

PMID 15799710

**An application of principal components analysis.**

MA, Hodar C, Vulpe C, Gonzalez M & Cambiazo V (2002). Discriminant analysis to evaluate clustering of gene expression data. *Federation of European Biochemical Societies Letters*, 522(1), 24-28.

PMID 12095613

**An application of Fisher's linear discriminant.**
