Microarray Classification
Introduction
Microarray classification involves assigning groups of genes with common expression patterns to biologically or clinically relevant classes. For example, a researcher may ask whether knowledge about expression patterns for a set of genes can be used to tell whether an individual has a particular cancer. The overall goal is to establish a set of criteria that, when given an unknown sample or microarray, can be used to classify the sample in a biologically meaningful way. Clustering – the sorting of genes studied in a microarray into groups with similar expression patterns – is often a useful step prior to classification.
When is microarray classification used?
Classification comes into play when you want to make predictions. For example, this could be incredibly important in medicine: if a person could be diagnosed with cancer or another disease based on expression data, perhaps the condition could be detected earlier. To do this we need a very clear understanding of which genes are expressed under the disease condition versus the healthy condition. Gene expression is rarely simply on or off; instead it varies by small degrees, so the differences in expression between two conditions are not obvious, and we need statistical methods to help us out. In addition, expression pattern differences between two conditions are not concrete, and there is no set pattern that indicates with certainty whether a person has the disease. Therefore, we use statistical comparisons of expression patterns to estimate the probability that a person with a given expression pattern has the disease. For many diseases, our ability to place a particular expression pattern into one disease group versus another is still too unreliable.
Supervised Learning vs. Unsupervised Learning
There are two types of classification: supervised and unsupervised. In unsupervised learning, the class labels of the known samples are ignored: the clustering is done blindly first, and the researcher then looks at the labels to see how well the clustering worked. Often this produces clusters that do not correctly separate the known samples by class, and there is no guarantee that the known samples will cluster together. In practice, unsupervised learning can become semi-supervised, because the researcher can try many parameters and clustering techniques, tweaking the algorithm until the known samples cluster together nicely, and then choose the clustering method that produces the best results to classify the unknown samples.
Supervised learning involves finding the features of the known samples (samples with known class labels) that are most useful in separating the classes. These features are then used to classify the unknown samples.
Performance Assessment for Supervised Learning
A constant concern with supervised learning is over-fitting of the data. Since the researcher can see every difference between the two data sets, he may be tempted to distinguish the samples by factors that are not inherently unique to either class but are differentially expressed in the learning data by chance. This results in classification criteria that distinguish the current data sets very precisely but perform poorly on future observations with new data.
The most common kind of performance assessment is to divide the "known" data into a training set and a verification set. The classification criteria are built using the training data and then applied to the verification set to obtain an error rate. If the error rate for the verification set is low, the classification criteria look promising. The training and verification data sets have to be independent and identically distributed, which is often achieved by randomly assigning samples to the two sets.
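For concreteness, here is a minimal sketch of this holdout procedure in Python with scikit-learn, using randomly generated stand-in data and an arbitrary placeholder classifier (a nearest-neighbor model); the numbers of samples and genes are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))      # hypothetical: 60 known arrays x 500 gene expression values
y = rng.integers(0, 2, size=60)     # hypothetical class labels, e.g. 0 = normal, 1 = cancer

# Randomly split the known data into a training set and a verification set
X_train, X_verify, y_train, y_verify = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
error_rate = 1.0 - clf.score(X_verify, y_verify)    # fraction of verification samples misclassified
print(f"verification error rate: {error_rate:.2f}")
```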
A variation on the train/verify procedure above is N-fold cross-validation. Instead of two data sets, the known data is divided into N equal-sized subsets. The classifier is built on N−1 of the subsets, and the remaining subset serves as the verification set. The error rate is computed for the verification set, and the procedure is repeated so that each of the N subsets serves as the verification set once. If, after the procedure has been repeated N times, the error rates are favorable across all N repetitions, the final classification criteria used to classify the unknown data are built using all of the known data (all N subsets).
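A corresponding N-fold cross-validation sketch, again with hypothetical data and a placeholder classifier:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))      # hypothetical: 60 known arrays x 500 genes
y = rng.integers(0, 2, size=60)     # hypothetical class labels

N = 5
errors = []
for train_idx, verify_idx in KFold(n_splits=N, shuffle=True, random_state=0).split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors.append(1.0 - clf.score(X[verify_idx], y[verify_idx]))   # error rate on this fold
print("per-fold error rates:", np.round(errors, 2))

# If the error rates look favorable, build the final classifier on all of the known data
final_clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
```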
Dimension Reduction Microarray Classification Methods
What is dimension reduction and why is it relevant to microarray classification?
Techniques in genomics are improving such that large quantities of data are produced, and microarrays are a particular example of this phenomenon. The reason that microarrays necessitate dimension reduction is not only the quantity of data available but also its redundancy. For example, it is now common for a researcher to conduct multi-array experiments across different time points or to compare many different tissues. Most experiments are also done in replicate, i.e. repeated multiple times to account for the chance of contamination or a bad chip read in any one trial. In addition, most microarray chips are not printed with each probe sequence appearing only once; instead, individual probes are printed multiple times onto each chip. Furthermore, many probes are for genes that share similar expression patterns. The resulting redundant data set with many variables is difficult to interpret, and we need to classify or sort some of these data points into groups.
Any single spot on a microarray chip has a probe, which is a short oligonucleotide sequence that may match a sequence in the cDNA or DNA sample applied to the chip. This same probe is present multiple times: it is repeated on the same chip, on multiple chips that are hybridized to the same pool of cDNA, and even on other chips to which a pool of cDNA from a different tissue or time point is applied. This one probe sequence can give us a lot of information, but we need to sort out the data properly. Correctly interpreting the information we receive from each probe will allow us to compare probes in a meaningful way and cluster similar probes together based on their expression pattern. The formation of meaningful clusters gives us a better chance of classifying those clusters into biologically relevant groups, such as "genes that are expressed only in the liver", or "genes that are expressed only during the 8 cell stage of development", or "genes that are expressed only when an individual has cancer."
Imagine you measure the height of 100 people in inches and in centimeters, and then you want to correlate a measurement of height with nutrition. You have three variables: height in inches, height in centimeters, and nutrition. But inches and centimeters correlate perfectly, so you probably want to condense these two variables into one so that you can more easily interpret your data. Now that the three variables have become two, you can make a 2D plot. Note: it is always easier to plot data in 2D because it is easier to detect or confirm the relationships among data points visually.
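As a small illustration of this idea, the sketch below applies principal component analysis (PCA) to a hypothetical version of the inches/centimeters example; since the two variables are perfectly correlated, a single component captures essentially all of the variation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height_in = rng.normal(67, 4, size=100)   # hypothetical heights of 100 people, in inches
height_cm = height_in * 2.54              # the same heights in centimeters (perfectly correlated)

X = np.column_stack([height_in, height_cm])
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)      # roughly [1.0, 0.0]: one component carries all the variation

height_combined = pca.transform(X)[:, 0]  # a single combined "height" variable to plot against nutrition
```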
Now you understand the meaning of dimension reduction. It is exactly as it sounds, reducing the number of variables (and therefore, the number of dimensions on a graph) such that a better understanding of the data can be reached. Think of each probe as a data point with many dimensions. By reducing these dimensions and combining similar probes we can better understand the data. High dimensional data points are difficult to visualize:
Three Dimensions are harder to understand than Two! Four Dimensions are impossible to graph at all!
Professor Jun Liu points out that there are 2 ways to reduce the dimensions of our data. We can combine genes into groups when the genes seem to share similar properties (i.e. the same expression profiles), or combine experiments when they produce the same results (i.e. perhaps in some human tissues there is no difference in expression between morning and night, so the time variable can be eliminated and probes from time point experiments can be combined.)
The three forms of dimension reduction microarray classification covered in class are Multidimensional Scaling, Principal Component Analysis, and Linear Discriminant Analysis. More information on dimension reduction techniques can be found at http://compbio.pbwiki.com/Microarray+Dimension+Reduction
Non-dimension Reduction Microarray Classification Methods
K Nearest neighbor (KNN)
KNN, introduced earlier for missing value estimation, can also be used to classify unknown microarrays. KNN involves calculating the correlation coefficients between the unknown sample and all of the known samples. The K observations in the training data with the highest correlation to (i.e. closest to) the unknown sample are then obtained, and the unknown sample is classified by majority vote among the K nearest neighbors. KNN is technically unsupervised, since the correlation coefficients are objectively determined. However, the method could be considered semi-supervised, since the choice of the value for K can be tuned based on how well it predicts the known samples, which depends on the total sample size. KNN is a straightforward classification method, but it offers little insight into mechanism, i.e. it cannot tell which genes are the distinguishing factors between cancer and non-cancer. More information on KNN can be found at http://compbio.pbwiki.com/Microarray+Clustering+Methods+and+Gene+Ontology
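A minimal sketch of correlation-based KNN as described above, using made-up expression profiles and K = 3 (the function name and data are purely illustrative):

```python
import numpy as np

def knn_classify(unknown, known, labels, k=3):
    """Classify one unknown expression profile by majority vote of its
    k most-correlated known profiles."""
    # Pearson correlation of the unknown profile with every known profile
    corrs = np.array([np.corrcoef(unknown, profile)[0, 1] for profile in known])
    nearest = np.argsort(corrs)[::-1][:k]      # indices of the k highest correlations
    votes = labels[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]           # majority vote among the k nearest neighbors

rng = np.random.default_rng(0)
known = rng.normal(size=(40, 200))             # hypothetical: 40 known arrays x 200 genes
labels = rng.integers(0, 2, size=40)           # hypothetical labels, e.g. 0 = normal, 1 = cancer
unknown = rng.normal(size=200)                 # one unknown expression profile
print(knn_classify(unknown, known, labels, k=3))
```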
Classification and Regression Tree (CART)
Under the CART method of classification, the root node is the starting point with all the data. From this starting point, the data is split based on a series of conditions with 2 (or more) possible values for each condition. For example, the first split can be whether GeneX has an expression index over 2, and the data set of microarrays will be split into "yes" and "no" categories. The goal is to reduce the impurities in the data set, which means making each smaller group of data more similar within itself than to the other groups. The results can be visualized as a tree or a graph:
CART Results
Among a set of microarray data, there are many differentially expressed genes, but some will provide more insight into classification and better separate the microarrays into representative groups than others. To choose which of the differentially expressed genes to use as criteria for splitting, we must first consider two measurements of impurity: entropy and Gini index impurity. The mathematical definitions of these two measurements of impurity are:
Entropy = −Σᵢ pᵢ log₂(pᵢ)   and   Gini index impurity = 1 − Σᵢ pᵢ², where pᵢ is the fraction of samples at the node belonging to class i.
Entropy increases when the data set at a node is a mixture; entropy decreases when the data set at a node is pure. The same is true for Gini index impurity. Thus, the overall goal is to reduce entropy or Gini index impurity. The basic procedure for choosing which gene to split by is to test each of the differentially expressed genes and see how splitting according to that gene would affect entropy or Gini index impurity. Then, split according to the gene that gives the maximum reduction in impurity, and repeat for the new data sets at the new nodes.
For example, consider a root node with 8 normal and 14 cancer microarrays. We want to split the root node to minimize Gini index impurity. The initial weighted impurity of 10.18 is obtained from the following calculation: 22 × (1 − (8/22)² − (14/22)²) ≈ 10.18.
Note that, to weigh the impurity of a node accordingly, we multiply the impurity by the number of samples in the node. When we split by the differential expression of GeneX, we get a much better separation between the normal and the cancer samples. Suppose 13 cancer samples show x ≥ 0, while 1 cancer and 8 normal show x < 0. The weighted Gini index impurity for the result of this split would be 1.78, from the following calculation: 13 × (1 − (13/13)²) + 9 × (1 − (1/9)² − (8/9)²) = 0 + 1.78 = 1.78.
This is a great reduction in impurity from 10.18 to 1.78. If this is the greatest reduction among all the differentially expressed genes, then we would first split at GeneX.
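The sketch below reproduces this worked example, computing the sample-weighted Gini index impurity before and after the hypothetical GeneX split:

```python
def weighted_gini(*class_counts):
    """Gini index impurity of a node, weighted by the number of samples in the node."""
    n = sum(class_counts)
    return n * (1.0 - sum((c / n) ** 2 for c in class_counts))

root = weighted_gini(8, 14)                           # 8 normal + 14 cancer at the root: 10.18
split = weighted_gini(0, 13) + weighted_gini(8, 1)    # pure cancer node (0) + mixed node (1.78)
print(round(root, 2), round(split, 2))                # 10.18 1.78
```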
When this procedure is repeated for the second split and so forth, we assume independence among all the partitions. This leads to ambiguities when, at some level, multiple genes give the same reduction in impurity and there is a choice of which gene to use to split that level. Another issue with CART is deciding when to stop splitting. Two factors in this decision are whether the impurity at a node is already small enough and whether the number of samples at the node is already so small that further splitting would be unhelpful. It is possible to over-fit the data by splitting too much. As with performance assessment for supervised learning, one way to combat over-fitting is to separate the data into a training set used for splitting and a test set used for pruning. Pruning involves finding the right balance between the cost and the benefit of further splits, such as between tree size and goodness of fit.
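For reference, a CART-style tree can be fit with scikit-learn's DecisionTreeClassifier; the sketch below uses hypothetical data, and the stopping and pruning parameters shown (minimum leaf size, cost-complexity pruning) are illustrative choices, not prescribed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))       # hypothetical: 60 known arrays x 500 gene expression values
y = rng.integers(0, 2, size=60)      # hypothetical labels, e.g. 0 = normal, 1 = cancer

tree = DecisionTreeClassifier(criterion="gini",       # split by Gini index impurity
                              min_samples_leaf=5,     # stop splitting when nodes get too small
                              ccp_alpha=0.01,         # cost-complexity pruning strength
                              random_state=0).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```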
Support Vector Machines (SVM)
Consider the candidate hyperplanes in the classification problem figure (Figure 6, linked below): all of them successfully separate the gray from the black samples, but which one would serve as the best criterion to classify new unknowns as either gray or black?
The intuitive answer (D) can be achieved by using Support Vector Machines (SVM). First, the support vectors are the samples that lie on the edge of each class; they are the points closest to the other class. Only the support vectors determine the margin, the gap between the two class edges; the points in the interior of the classes are irrelevant. The following figure illustrates the concept of support vectors and the margin:
Support Vectors and Margin
The arrows point to the support vectors of Class 1 and Class 2, while the orthogonal line marked “m” indicates the margin. The SVM method of microarray classification finds the hyperplane that maximizes the margin. The hyperplane cuts in the middle of the margin area (also shown in the above figure). Although the 2D example presented is easy to intuit and visualize, SVM is especially helpful when there are too many dimensions for the best hyperplane to be obvious.
One variation on the SVM method is to relax the definition of the support vectors. Instead of a strict edge, where the support vectors represent the absolute boundary of all samples in a class, a soft edge can be used, where some outliers may fall beyond the boundary created by the support vectors. For example, the bolded square appears to be a mistake, and including it among the support vectors would skew the hyperplane:
In this case, a tradeoff has to be established between omitting a sample and increasing the margin between the two support vectors or sample classes. One way to account for this tradeoff is to charge a slack penalty for every mistake, so that the optimization becomes: maximize (margin − slack penalty × number of mistakes).
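A minimal soft-margin SVM sketch in scikit-learn with made-up 2D data; the C parameter plays the role of the slack penalty described above (small C tolerates more margin violations, large C penalizes every mistake heavily).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
class1 = rng.normal(loc=-2, size=(20, 2))   # hypothetical samples of class 0
class2 = rng.normal(loc=+2, size=(20, 2))   # hypothetical samples of class 1
X = np.vstack([class1, class2])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C controls the slack-penalty tradeoff
print(clf.support_vectors_)                   # only these edge points determine the hyperplane
print(clf.predict([[0.5, -0.3]]))             # classify a new unknown sample
```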
Lastly, in contrast to the dimension reduction microarray classification techniques, SVM can sometimes better separate and classify the data by increasing dimensionality. The data may be projected into a higher-dimensional space so that the classes can be better separated by a hyperplane. For example:
In the first graph, there is no way to linearly separate the circles from the crosses. The SVM cut is always a (linear) hyperplane, but the projection does not have to be linear. By increasing the dimensionality (including the radius as an additional dimension), the new projection separates the inside of the circle (circle markings) from the outside (cross markings), and a hyperplane that separates the two classes becomes evident. Matlab and BioConductor both have functions that can help find non-linear projections, but the choice of projection is usually made through trial and error and personal experience.
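The sketch below illustrates the radius-projection idea on synthetic concentric-circle data: a linear SVM fails in 2D, but adding the radius as a third feature makes a linear cut possible (an RBF kernel achieves a similar effect without the explicit projection).

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic stand-in data: one class inside a circle, the other outside it
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_2d = SVC(kernel="linear").fit(X, y)
print("2D linear accuracy:", linear_2d.score(X, y))      # poor: no separating line exists in 2D

radius = np.sqrt((X ** 2).sum(axis=1)).reshape(-1, 1)    # project to 3D: (x, y, r)
X3 = np.hstack([X, radius])
linear_3d = SVC(kernel="linear").fit(X3, y)
print("3D linear accuracy:", linear_3d.score(X3, y))     # near perfect: a plane now separates the classes

# In practice the same effect is often obtained with a non-linear kernel instead:
rbf = SVC(kernel="rbf").fit(X, y)
```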
Microarray Classification Resources
When classifying microarrays, a variety of different sequence IDs could be used to identify the genes or nucleotide sequences in the arrays. Some of the most common include:
GenBank – most comprehensive database of all submitted sequences
Expressed Sequence Tags (EST) – short sequences derived from the mRNA of expressed (coding) genes
UniGene – computationally derived gene-based transcribed sequence clusters
Entrez Gene – comprehensive catalog of genes with associated information
RefSeq – curated reference sequences for mRNAs (NM_...) and proteins (NP_...)
Of these widely used sequence IDs, Entrez Gene and RefSeq are generally the most useful.
There are also several public microarray databases:
Stanford Microarray Database (SMD) – contains mostly cDNA arrays from Stanford and collaborators
Gene Expression Omnibus (GEO) – a NCBI repository for gene expression and hybridization data
Oncomine – a database of published cancer-related microarrays; includes many different array types (cDNA, Affy, and longer oligo arrays); all raw data has already been processed
Conclusion
Sorting out microarray data is complicated. Thankfully, there are many programs and algorithms that can help utilize this data to study biologically meaningful differences between sample sets. Microarray classification can be achieved through dimension reduction techniques (e.g. MDS, PCA, and LDA) and non-dimension reduction techniques (KNN, CART, SVM). Microarray classification methods can be unsupervised (KNN, MDS, PCA) or supervised (LDA, CART, SVM).
Links:
Image References and other online lecture notes:
Microarray Resources at London's Global university (Image 4): http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
Example of PCA data (Image 3): http://images.google.com/imgres?imgurl=http://www.biologie.ens.fr/lgmgml/publication/comp3d/Images_Article/PCA_Pombe.gif&imgrefurl=http://www.biologie.ens.fr/lgmgml/publication/comp3d/supfigure1.php&h=344&w=600&sz=16&hl=en&start=8&tbnid=Y_xxc5uVqY0jlM:&tbnh=77&tbnw=135&prev=/images%3Fq%3Dpca%2Bmicroarray%26svnum%3D10%26hl%3Den%26lr%3D%26client%3Dsafari%26rls%3Den%26sa%3DN
Simple LDA explanation(Figure 5): http://www.vias.org/tmdatanaleng/cc_lda_intro.html
Classification problem (Figure 6): http://compbio.pbwiki.com/f/g903.png
Related Articles that use Microarray Classification Techniques:
1: Chung TP, Laramie JM, Meyer DJ, Downey T, Tam LH, Ding H, Buchman TG, Karl I, Stormo GD, Hotchkiss RS, Cobb JP.
Molecular diagnostics in sepsis: from bedside to bench. J Am Coll Surg. 2006 Nov;203(5):585-598.
2: Kalbas M, Lueking A, Kowald A, Muellner S.
New analytical tools for studying autoimmune diseases. Curr Pharm Des. 2006;12(29):3735-42.
3: Dietmann S, Aguilar D, Mader M, Oesterheld M, Ruepp A, Stuempflen V, Mewes HW.
Resources and tools for investigating biomolecular networks in mammals. Curr Pharm Des. 2006;12(29):3723-34.
4: Heber S, Sick B.
Quality assessment of Affymetrix GeneChip data. OMICS. 2006 Fall;10(3):358-68.
5: Modlich O, Prisack HB, Bojar H.
Breast cancer expression profiling: the impact of microarray testing on clinical decision making. Expert Opin Pharmacother. 2006 Oct;7(15):2069-78. Review.
6: Coppola G, Geschwind DH.
Technology Insight: querying the genome with microarrays--progress and hope for neurological disease. Nat Clin Pract Neurol. 2006 Mar;2(3):147-58. Review.
7: Janga SC, Lamboy WF, Huerta AM, Moreno-Hagelsieb G.
The distinctive signatures of promoter regions and operon junctions across prokaryotes. Nucleic Acids Res. 2006;34(14):3980-7. Epub 2006 Aug 12.
8: Mehlmann M, Dawson ED, Townsend MB, Smagala JA, Moore CL, Smith CB, Cox NJ, Kuchta RD, Rowlen KL.
Robust sequence selection method used to develop the FluChip diagnostic microarray for influenza virus. J Clin Microbiol. 2006 Aug;44(8):2857-62.
9: Rainer J, Sanchez-Cabo F, Stocker G, Sturn A, Trajanoski Z.
CARMAweb: comprehensive R- and bioconductor-based web service for microarray data analysis. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W498-503.
10: Loots GG, Chain PS, Mabery S, Rasley A, Garcia E, Ovcharenko I.
Array2BIO: from microarray expression data to functional annotation of co-regulated genes. BMC Bioinformatics. 2006 Jun 16;7:307.
11: Derome N, Duchesne P, Bernatchez L.
Parallelism in gene transcription among sympatric lake whitefish (Coregonus clupeaformis Mitchill) ecotypes. Mol Ecol. 2006 Apr;15(5):1239-49.
12: Fang H, Xie Q, Boneva R, Fostel J, Perkins R, Tong W.
Gene expression profile exploration of a large dataset on chronic fatigue syndrome. Pharmacogenomics. 2006 Apr;7(3):429-40.
13: Hsu KL, Pilobello KT, Mahal LK.
Analyzing the dynamic bacterial glycome with a lectin microarray approach. Nat Chem Biol. 2006 Mar;2(3):153-7. Epub 2006 Feb 5.
14: Alexe G, Bhanot G, Venkataraghavan B, Ramaswamy R, Lepre J, Levine AJ, Stolovitzky G.
A Robust Meta-classification Strategy for Cancer Diagnosis from Gene Expression Data. Proc IEEE Comput Syst Bioinform Conf. 2005;:322-5.
15: Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A.
Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics. 2006 Jan 26;7:41.
16: Wu W, Liu X, Xu M, Peng JR, Setiono R.
A hybrid SOM-SVM approach for the zebrafish gene expression analysis. Genomics Proteomics Bioinformatics. 2005 May;3(2):84-93.
17: Garrido P, Blanco M, Moreno-Paz M, Briones C, Dahbi G, Blanco J, Blanco J, Parro V.
STEC-EPEC oligonucleotide microarray: a new tool for typing genetic variants of the LEE pathogenicity island of human and animal Shiga toxin-producing Escherichia coli(STEC) and enteropathogenic E. coli (EPEC) strains. Clin Chem. 2006 Feb;52(2):192-201. Epub 2005 Dec 29.
18: Schneider KL, Pollard KS, Baertsch R, Pohl A, Lowe TM.
The UCSC Archaeal Genome Browser. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D407-10.
19: Carroll WL, Bhojwani D, Min DJ, Moskowitz N, Raetz EA. Childhood acute lymphoblastic leukemia in the age of genomics. Pediatr Blood Cancer. 2006 May 1;46(5):570-8. Review.
20: Bhanot G, Alexe G, Levine AJ, Stolovitzky G.
Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories. Genome Inform. 2005;16(1):233-44.