Introduction
The Bioconductor project was started in the fall of 2001 and originally included 25 core developers in the US, Europe and Australia (Dudoit and Jewell, 2004). Bioconductor is open-source and open development software used to analyze genomic data using R packages. There are at least five purposes of the Bioconductor project:
1. Provide access to statistical and graphical methods to analyze biomedical and genomic data
2. Assist in integrating biological data spread across the Internet in a variety of databases including GenBank, GO, LocusLink, PubMed new iron
3. Allow the rapid development of software that can be modified and scaled up
4. Promote high-quality documentation and encourage reproducible research
5. Provide training in methods used to analyze biomedical and genomic data (Dudoit and Jewell, 2004)
To unify these purposes, the Bioconductor project established the principal web site, www.bioconductor.org where you can find the software, data and documentation as well as training materials from short courses. Working papers related to the Bioconductor project are available at www.bepress.com/bioconductor.
Bioconductor Packages
Bioconductor software is comprised of packages usually written for R. One of the enormous advantages of the bioconductor project is the diversity of software written for it in due to the open-source and open-documentation nature. There are packages for cluster analysis, linear and nonlinear regression, resampling, visualizations, analyzing DNA microarrays and analyzing biological meta-data from the Internet (e.g. GenBank, GO, KEGG, PubMed, etc.). A rather comprehensive list of the functions of these packages include:
• General infrastructure: Biobase, Biostrings, DynDoc, reposTools, rhdf5 ,ruuid, tkWidgets, widgetTools.
• Annotation: annotate, AnnBuilder + metadata packages.
• Graphics: geneplotter, hexbin.
• Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, affylmGUI , affyPLM, annaffy, gcrma, makecdfenv, vsn.
• Pre-processing two-color spotted DNA microarray data: arrayMagic, arrayQuality, limma, limmaGUI , marray, vsn.
• Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer, SAGElyzer.
• Differential gene expression: EBarrays, edd, factDesign, genefilter, limma, limmaGUI , multtest, ROC.
• Graphs and networks: graph, RBGL, Rgraphviz .
• Gene Ontology: GOstats, goTools.
• MAGE: RMAGEML.
cclust: convex clustering methods.
• class: self-organizing maps (SOM).
• cluster: AGglomerative NESting (agnes), Clustering LARge Applications
(clara), DIvisive ANAlysis (diana), Fuzzy Analysis (fanny), MONothetic
Analysis (mona), Partitioning Around Medoids (pam).
• e1071: bagged clustering (bclust), fuzzy C-means clustering (cmeans).
• flexmix: flexible mixture modeling.
• fpc: fixed point clusters, clusterwise regression and discriminant plots.
• hopach: Hierarchical Ordered Partitioning and Collapsing Hybrid
(HOPACH) - to appear on CRAN and Bioconductor.
• mclust: model-based cluster analysis.
• stats: hierarchical clustering (hclust, heatmap, cophenetic), k-means
• class: k-nearest neighbor (knn).
• classPP: projection pursuit.
• e1071: support vector machines (svm).
• ipred: bagging, resampling based estimation of prediction error.
• knnTree: k-nn classification with variable selection inside leaves of a tree.
• MASS: linear and quadratic discriminant analysis (lda, qda).
• mlbench: machine learning benchmark problems.
• nnet: feed-forward neural networks and multinomial log-linear models.
• pamr: prediction analysis for microarrays.
• randomForest: random forests.
• rpart: classification and regression trees.
There are over 80 packages, so this tutorial reviews only the ones deemed critical by extant tutorials for classes focusing on the analysis of microarrays and other genomic data. The packages reviewed in this tutorial include:
• biobase and genefilter
• multtest package
• LIMMA
• clustering methods-- Manhattan method, partitioning methods, PCA, MDS, K-means, SOM; supervised classification--LDA, KNN, SVM, Neural Network
Before these packages are reviewed, however, both R and Bioconductor must be installed. First, one must install the latest version of R available at cran.r-project.org. Assuming Windows 95 or later is being used, download R-2.2.1-win32.exe and run the program. Once R has been installed, use the getBioC installation script:
> source("http://www.bioconductor.org/getBioC.R")
> getBioC()
All packages, including the ones reviewed in this tutorial can be installed using Windows pull-down menus or the R functions install.packages and installDataPackage. Alternatively, one can install a subset of the most frequently is packages using the script:
> source("http://www.bioconductor.org/biocLite.R")
> biocLite()
Biobase, Genefilter (Pre-processing packages)
Perhaps before any formal genomic analysis can be performed, the data must be processed into appropriate formats and Biobase and Genefilter are powerful tools to do so (Dudoit and Gentleman, 2002). To operate them, one will need to load the following R and bioconductor packages:
> library(XML)
> library(Biobase)
> library(annotate)
> library(genefilter)
> library(golubEsets)
> library(ctest)
> library(MASS)
> library(cluster)
These two packages will be illustrated using a dataset from Golub et al. (1999) available at http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html and the dataset is also included in an R data package, golubEsets. The data also come from a study of gene expression used in the forthcoming section on clustering methodologies. In this study, gene expression was examined in two types of leukemias: ALL and AML. In this study,
Gene expression levels were measured using Affymetrix high–density oligonucleotide arrays (HU6800 chip) containing probes for approximately 6, 800 human genes and ESTs. The chip actually contains 7,129 different probe sets; some of these map to the same genes and others are there for quality control purposes. The data comprise 47 cases of ALL (38 B–cell ALL and 9 T–cell ALL) and 25 cases of AML.Samples are divided into a learning set with 38 observations and a test set of 34 observations. (Dudoit and Gentleman, 2002, page 2).
In this first package, Biobase, one can obtain the gene expression measures in the form of exprSets by using:
> data(golubTrain)
> data(golubTest)
> data(golubMerge)
To find information on an object, in this case golubTrain,
> class(golubTrain)
[1] "exprSet"
> slotNames(golubTrain)
[1] "exprs" "se.exprs" "phenoData" "description" "annotation"
[6] "notes"
> golubTrain
Expression Set (exprSet) with
7129 genes
38 samples
phenoData object with 11 variables and 38 cases
varLabels
Samples: Samples
ALL.AML: ALL.AML
<RAW>BM.PB: BM.PB
T.B.cell: T.B.cell
FAB: FAB
Date: Date
Gender: Gender
pctBlasts: pctBlasts
Treatment: Treatment
PS: PS
Source: Source
Then, to extract information on the mRNA samples:
> phenoTrain <- phenoData(golubTrain)
> class(phenoTrain)
[1] "phenoData"
> slotNames(phenoTrain)
[1] "pData" "varLabels"
> phenoTrain
phenoData object with 11 variables and 38 cases
varLabels
Samples: Samples
ALL.AML: ALL.AML
BM.PB: BM.PB
T.B.cell: T.B.cell
FAB: FAB
Date: Date
Gender: Gender
pctBlasts: pctBlasts
Treatment: Treatment
PS: PS
Source: Source
If one is interested in accessing the matrix of expression measures one can use the slot function:
> X <- golubTrain@exprs
> X <- slot(golubTrain, "exprs")
> X <- exprs(golubTrain)
> dim(X)
Once one has accessed the gene expression measures of interest, one should next choose filters one wants to use, then assemble them using filterfun, and then apply them using genefilter (Dudoit and Gentleman, 2002). Using the same dataset, there appear to be at least three steps in pre-processing of data:
1. Establish a threshold of the floor of 100 and ceiling of 16,000
> X <- exprs(golubTrain)
> X[X < 100] <- 100
> X[X > 16000] <- 16000
2. Exclude genes with max / min _ 5 or (max−min) _ 500, where maximum and minimum of four to the intensities for particular gene across mRNA samples.
> mmfilt <- function(r = 5, d = 500, na.rm = TRUE) {
+ function(x) {
+ minval <- min(x, na.rm = na.rm)
+ maxval <- max(x, na.rm = na.rm)
+ (maxval/minval > r) && (maxval - minval > d)
+ }
+ }
> mmfun <- mmfilt()
> ffun <- filterfun(mmfun)
> good <- genefilter(X, ffun)
> sum(good)
[1] 3051
> X <- X[good, ]
3. Perform a logarithmic transformation
> X <- log10(X)
4. Standardize the data
> X <- scale(X)
> golubTrainSub <- golubTrain[good, ]
> slot(golubTrainSub, "exprs") <- X
> golubTrainSub
Expression Set (exprSet) with
3051 genes
38 samples
phenoData object with 11 variables and 38 cases
varLabels
Samples: Samples
ALL.AML: ALL.AML
BM.PB: BM.PB
T.B.cell: T.B.cell
FAB: FAB
Date: Date
Gender: Gender
pctBlasts: pctBlasts
Treatment: Treatment
PS: PS
Source: Source
Multtest package
After the data has been processed, a useful package for a huge variety of multiple hypothesis testing is available in “multtest” (Dudoit and Ge, 2004). This package is useful for multiple hypothesis testing in microarray settings—for example, the package contains functions that can be used to identify differentially expressed genes in microarray experiments. The mt.teststat and mt.teststat.num.denum functions provide a way to compute test statistics for each row of the data frame (e.g. paired t-statistics). The syntax to compare, for each team, the expression in dataset 1 to dataset 2 is:
> teststat <- mt.teststat(dataset 1, dataset 2)
To produce a QQ plot has a visual aid to identify genes with unusual test statistics, one can use the following code:
> postscript("mtQQ.eps")
> qqnorm(teststat)
> qqline(teststat)
> dev.off()
If one is concerned about multiple testing and the effect on p-values (i.e. the need for adjustment), one may use the mt.rawp2adjp function to adjust the p-values for simple multiple testing procedures from a vector of raw p-values. Before one can use this function, however, one must first confuse the vector of raw p-values for the test statistics using the standard Gaussian distribution:
> rawp0 <- 2 * (1 - pnorm(abs(teststat)))
Adjusted p-values for these seven multiple testing procedures can be computed as follows and stored in the original gene order in adjp using:
> procs <- c("Bonferroni", "Holm", "Hochberg", "SidakSS", "SidakSD", + "BH", "BY")
> res <- mt.rawp2adjp(rawp0, procs)
> adjp <- res$adjp[order(res$index), ]
> round(adjp[1:10, ], 2)
Notice that there are seven multiple testing procedures (for more information, consult Dudoit and Ge, 2004).
Once one uses the mt.rawp2adjp function to sufficiently adjust the p-values, one can use the mt.reject function to return the identity and number of rejected hypotheses for several multiple testing procedures. These can be obtained by using:
> mt.reject(cbind(rawp, maxT), seq(0, 1, 0.1))$r
To discern which genes have, for example, p - values less than or equal to 0.01, use:
> which <- mt.reject(cbind(rawp, maxT), 0.01)$which[, 2]
> matrix 1 [which, 2]
Where "matrix 1" is the matrix of gene identifiers.
Lastly, to produce a number of graphical summaries for the results of the multiple testing procedures just described, one can use the mt.plot function. Plots of the sorted permutation raw p-values and adjusted p - values for three of the aforementioned seven adjusting procedures use:
> res <- mt.rawp2adjp(rawp, c("Bonferroni", "BH", "BY"))
> adjp <- res$adjp[order(res$index), ]
> allp <- cbind(adjp, maxT)
> dimnames(allp)[<raw>[2]] <- c(dimnames(adjp)[<raw>[2]], "maxT")
> procs <- dimnames(allp)[<raw>[2]]
> procs <- procs[c(1, 2, 5, 3, 4)]
> cols <- c(1, 2, 3, 5, 6)
> ltypes <- c(1, 2, 2, 3, 3)
LIMMA
The LIMMA package enables the users to execute linear models for microarrays, unlike multtest which is for multiple hypothesis testing (Dudoit and Jewell, 2004). In addition, users can normalize and track the backgrounds to pre-process the data and analyze complex experimental designs (e.g. multi-factorial). To identify differentially expressed genes, LIMMA uses empirical Bayes methods (t-statistics, F-statistics). In addition, there are methods for identifying duplicate spots and replicates. Lastly, the package includes a rich set of graphical output:
(from Dudoit and Jewell, 2004)
To illustrate how powerful LIMMA can be in its ability to combine high level powerful commands with low-level simple commands, an extensive example is presented. In this example, the LIMMA package is used to analyze microarray data.
After LIMMA has been installed, the microarray data used can be found at http://bioinf.wehi.edu.au/marray/genstat2002/swirl.zip. There are three files in this microarray zip file: a .gal file that contains information about the spots printed on the arrays, the .spot file is the output from the image analysis program, and the .txt contains a description of the experiment. The purpose of the experiment was to identify genes with altered expression in the "world mutants" compared to the wild zebrafish (the swirl mutant had genes that affected the dorsal/ventral body axis) (Dudoit and Jewell, 2004).
To analyze these microarray data:
1. Activate the LIMMA command:
> library (limma)
2. Read the filenames of the files containing the intensity data into the object files:
> files <- dir(pattern="*.spot")
3. Now read the data to an object RG:
> RG <- read.maimages(files, source="spot")
4. Now that the data are in object form, one can choose to name the objects in whatever fashion and do basic descriptive statistics. For example, to discover the red signal values from the four microarray's and place them into a new matrix A:
A <- RG$R
And to find the mean of the values in this matrix:
mean(A)
5. To add information about the genes to the RG object:
> RG$genes <- readGAL()
In which the readGAL function reads its information about the spots printed on the arrays.
6. Now, to add information about the layout of the microarrays--which genes are printed where and in what order--to the RG object:
> RG$printer <- getLayout(RG$genes)
If one is interested in plotting images of the background values for the microarray is, here's an example to plot them for the red intensity is from the first microarray:
> imageplot(log2(RG$Rb[,1]), RG$printer)
7. To normalize the data, one may be interested in using LOESS normalization. One of the advantages of using this type of normalization is that this method does not require the analyst of the microarray data to identifying a function to describe all the data--instead, LOESS fits a model to each segment of the data. LOESS is used by the following command:
> MA <- normalizeWithinArrays(RG)
Then, to normalize between arrays:
> MA <- normalizeBetweenArrays(MA)
8. Now, to estimate the number of times the expression level of the gene has increased or decreased by fitting a linear model:
> fit <- lmFit(MA, design=c(-1,1,-1,1))
The lmFit command a straight forum, but the "assigned" function may need some explanation. Basically, it designates that, in this case, "the colors have been switched between the types of samples for every second array." (Dudoit and Jewell, 2004).
9. Lastly, to produce a list of the 20 machines most likely to be differentially expressed:
> options(digits=3); topTable(fit, n=20, adjust="fdr")
The "fdr" the notes the false discovery rate using the "adjust" parameter. At the rate at which one could expect false positives--for example, if we were to pick a fdr-adjusted rate of 10%, then we can expect that approximately 10% of genes in this table are false positives and that their differential expression is the result of chance rather than an actual expression.
Cluster analysis packages
When analyzing gene expression data, as in the example above, it is important to consider how one will calculate similarities across a set of samples. Fortunately, there are many methods of "clustering". It has been acknowledged that the decision of which clustering method to use as an important one (Dudoit and Gentleman, 2002).
The general goal is to use the dist R function to compute distances, and it does this by calculating the distances between the rows of an expression data matrix. In the example below, hierarchical clustering using the Manhattan distance methodology is used where golubTrainSub is the expression data matrix, which was obtained from the exprSet object golubTrainSub using the accessor function exprs (Dudoit and Gentleman, 2002). ALL and ALM refer to the data that come from a study of gene expression in two types of leukemia.
> dgTr <- dist(t(exprs(golubTrainSub)), method = "manhattan")
> hcgTr <- hclust(dgTr, method = "average")
> plot(hcgTr, labels = golubTrainSub$ALL.AML, main = "Hierarchical clustering dendrogram + sub = "Average linkage, Manhattan distance for scaled arrays, G=3,051 genes")
In the second method involving partitioning, the user must specify how many clusters are wanted in the algorithms used and different strategies to distribute the samples to the clusters (Dudoit and Gentleman, 2002). Using data from the previous example, to generate a partitioning around "medoids", the user uses the following syntax:
> set.seed(12345)
> pmgTr <- pam(dgTr, k = 2, diss = TRUE)
> kmgTr <- kmeans(t(exprs(golubTrainSub)), centers = 2)
> table(pmgTr$clust, golubTrainSub$ALL.AML)
> table(kmgTr$clust, golubTrainSub$ALL.AML)
And this gives us the following tables:
ALL AML
1 23 0
2 4 11
ALL AML
1 2 11
2 25 0
"The two tables allow us to compare the clustering results to the truth," but for more interpretation consult (Dudoit and Gentleman, 2002). Lastly, to create a silhouette plot:
> sils <- pmgTr$silinfo[<raw>[1]][, 3]
> ord <- as.numeric(row.names(pmgTr$silinfo[<raw>[1]]))
> barplot(sils, col = c(2, 4)[as.numeric(cl<raw>[ord])], main = "Title + sub = "Title")
These are two of the largest clustering methods--Manhattan distances and partitioning--but for more detailed syntax about using PCA, MDS, K-means, SOM as well as supervised classification--LDA, KNN, SVM, Neural Network, please consult the Appendix.
Conclusion
With over 80 packages, Bioconductor can present in almost intimidating array of tools for the analysis of MicroArray and other genomic data. This tutorial has reviewed the basics of preprocessing packages to order the data in an appropriate format, and packages to perform multiple hypothesis testing, analyze complex designs, execute linear models as well as perform a variety of clustering algorithms.
Appendix
(entirely from STAT 115: Hands-on Training on Machine Learning Methods in High-Throughput Biological Data Analysis)
To work through these examples, go to
http://my.harvard.edu/icb/icb.do?keyword=k8324&pageid=icb.page31390
and download SampleMicroarray.zip to get 5 CEL files.
PCA
PCA.all <- prcomp( t(alldata) )
names(PCA.all)
plot(PCA.all$sdev, type="h", main="PCA std")
SamplePCA <- t(alldata) %*% PCA.all$rotation[,1:2]
plot( SamplePCA, main="PCA" )
text( SamplePCA, samplename, col=c(rep("black",4), rep("red",7), rep("blue",3) ) )
MDS
dist <- dist( t(alldata ), diag=T, upper=T )
SampleMDS <- cmdscale(dist, k=2)
plot( SampleMDS, main="MDS")
text( SampleMDS, samplename, col=c(rep("black",4), rep("red",7), rep("blue",3) ) )
K-means, with cluster number equal to 3
KmeanRes <- kmeans(t(alldata), centers=3, iter.max = 100)
KmeanRes
table(KmeanRes$cluster)
plot(SamplePCA, main="PCA" )
text(SamplePCA, samplename, col=c(“black”, “red”, “green”,)[ ~KmeanRes$cluster] )
SOM
SOM clustering of samples:
library(som)
library(help=som)
NormData <- normalize(t(alldata))
SOMres <- som(NormData, xdim=3, ydim=1, init=”random” )
plot( SOMres )
SOM.Prj <- som.project( SOMres, NormData )
plot(SamplePCA, main="PCA" )
text(SamplePCA, samplename, col=c(“black”, “red”, “green”)[SOM.Prj$x+1])
SOM clustering of genes:
NormData <- normalize(alldata)
SOMres <- som(NormData, xdim=4, ydim=5, init=”random” )
plot( SOMres )
Supervised classification
LDA:
library(MASS)
SampleLDA <- lda( t(traindata), grouping=TrainingClass )
predict( SampleLDA, t(testdata))$class
Use a subset of genes to make a prediction:
SampleLDA <- lda( t(traindata[1:5,]), grouping=TrainingClass )
predict( SampleLDA, t(testdata[1:5,]))$class
KNN
library(class)
knn( t(traindata), t(testdata), k=1, TrainingClass )
Use different number of nearest neighbor to make prediction:
knn( t(traindata), t(testdata), k=3, TrainingClass )
Check the distances of the first Unknown test sample to other training samples:
dist(t(alldata))
SVM
library(e1071)
SampleSVM <- svm( t(traindata), TrainingClass, scale=F, kernel="linear" )
predict( SampleSVM, t(testdata) )
Neural Network
library(neural)
Output <- matrix( c(rep(0,4), rep(1,7)), ncol=1)
NeuralTrain <- mlptrain( t(traindata), neurons=c(5), Output)
Click the “START” button in the GUI interface to start training the ANN, and click “EXIT” when finished.
Test using just the training data:
mlp( t(traindata), NeuralTrain$weigth, NeuralTrain$dist, neurons=NeuralTrain$neurons, actfns=NeuralTrain$actfns )
Test with Unknown samples:
mlp( t(testdata), NeuralTrain$weigth, NeuralTrain$dist, neurons=NeuralTrain$neurons, actfns=NeuralTrain$actfns )
Works Cited
Dudoit, S. and Gentleman, R., Bioconductor basics tutorial, June 24, 2002.
Dudoit, S and Jewell, N., Lecture 1: Introduction to Bioconductor and R, 2004.
Dudoit, S. and Ge, Y., Bioconductor’s multtest package, February 24, 2004.
http://bio.lundberg.gu.se/courses/vt04/ma.html
Liu, J. and Liu, X. Hand-on Training on Machine Learning Methods in High-Throughput Biological Data Analysis, Statistics 115, Harvard University, Fall 2006.
Comments (0)
You don't have permission to comment on this page.