Expectation Maximization and Gibbs Sampling
Contributed by Ying Lei
Introduction
Expectation maximization (EM) and Gibbs sampling are two different methods for predicting the hidden variables of a statistical model. When given a set of DNA sequences, expectation maximization and Gibbs sampling can determine the possible locations of a transcription factor motif plus a position weight matrix that describes the probable motif sequence. In other words, EM and Gibbs sampling use the observable DNA sequences to find the hidden motif locations and the hidden matrix.
Transcription Motif Finding
To understand expectation maximization and Gibbs sampling, we will consider a specific example of how they are used to discover the sequence motifs of transcription factor binding sites.
Stating the Objective
Let us take a simple input of two short DNA sequences: 5'-GATTACA-3' and 5'-ACATTAG-3'. We are told that a 4-bp motif appears frequently in these sequences, but we do not know where the motif appears or what the motif looks like. Our goal is to determine the motif locations as well as the motif sequence.
Defining 'Location' and 'Sequence'
We begin by formulating concrete definitions of "motif location" and "motif sequence":
- What are the locations? The motif locations are the places where the 4-bp motif may sit within the input sequences. Each 7-bp sequence contains four candidate 4-bp subsequences, giving eight candidates in total: subseq1 = GATT, subseq2 = ATTA, subseq3 = TTAC, and subseq4 = TACA from the first sequence, and subseq5 = ACAT, subseq6 = CATT, subseq7 = ATTA, and subseq8 = TTAG from the second. To describe how plausible each candidate is, we assign it a location likelihood pi = P(subseqi | motif matrix) / P(subseqi | background), i.e., the likelihood that subseqi was generated by the motif rather than by the background.
- What is the matrix? The position weight matrix is a mathematical representation of the motif sequence. Our matrix will have 4 rows—one for each position in the 4-bp motif—and 4 columns—one for each DNA base (A, T, C, G). Within the matrix, an element in the ith row and the jth column represents the probability that the ith position in the motif is the jth DNA base. For example, consider the following matrix:
        A     T     C     G
Pos 1   0.70  0.10  0.10  0.10
Pos 2   0.10  0.70  0.10  0.10
Pos 3   0.10  0.70  0.10  0.10
Pos 4   0.70  0.10  0.10  0.10
Given this position weight matrix, the 3rd base pair of the motif has a 70% probability of being T and a 10% probability of being each of A, G, and C.
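A position weight matrix is easy to hold in code as a small table indexed by motif position and base. The sketch below (Python, using the illustrative numbers from the matrix above) shows how such a lookup works; the variable name pwm is ours.

```python
# Position weight matrix: one dictionary of base probabilities per motif position.
# Rows correspond to motif positions 1-4; keys are the bases A, T, C, G.
pwm = [
    {'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},  # position 1
    {'A': 0.1, 'T': 0.7, 'C': 0.1, 'G': 0.1},  # position 2
    {'A': 0.1, 'T': 0.7, 'C': 0.1, 'G': 0.1},  # position 3
    {'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},  # position 4
]

# Probability that the 3rd position of the motif is T (0.7 in this example).
print(pwm[2]['T'])
```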
Calculating Location Likelihoods
To demonstrate the relationship between location likelihoods and the position weight matrix, take subseq2 = ATTA as an example. We can calculate the location likelihood p2 if we are given: [1] the position weight matrix shown above and [2] the assumption that all DNA bases are equally probable in non-motif locations.
To calculate the numerator of the location likelihood formula, we need P(A is in pos. 1), P(T is in pos. 2), P(T is in pos. 3), and P(A is in pos. 4):
P(A is in pos. 1) = 0.70, P(T is in pos. 2) = 0.70, P(T is in pos. 3) = 0.70, P(A is in pos. 4) = 0.70
The product of the four individual probabilities gives us the likelihood that subseq2 is a motif:
P(ATTA | matrix) = 0.70 × 0.70 × 0.70 × 0.70 ≈ 0.24
To calculate the denominator of the location likelihood formula, we calculate the probability that a random 4-bp DNA sequence is ATTA:
P(ATTA | background) = (1/4) × (1/4) × (1/4) × (1/4) = 1/256 ≈ 0.0039
Note that we are using the assumption that each DNA base (A, T, G, C) has a one-fourth probability of appearing at a given base pair.
Now we find p2 by simple division:
p2 = P(ATTA | matrix) / P(ATTA | background) ≈ 0.24 / 0.0039 ≈ 61.5
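The same arithmetic can be wrapped in a small function. The sketch below (Python) assumes the illustrative matrix above and a uniform background; the function name location_likelihood is ours, not part of any standard library.

```python
# Likelihood ratio that a 4-bp subsequence is an instance of the motif,
# relative to a background in which every base has probability 1/4.
def location_likelihood(subseq, pwm, background=0.25):
    p_motif = 1.0
    p_background = 1.0
    for position, base in enumerate(subseq):
        p_motif *= pwm[position][base]   # numerator: P(subseq | motif matrix)
        p_background *= background       # denominator: P(subseq | background)
    return p_motif / p_background

pwm = [
    {'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},
    {'A': 0.1, 'T': 0.7, 'C': 0.1, 'G': 0.1},
    {'A': 0.1, 'T': 0.7, 'C': 0.1, 'G': 0.1},
    {'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},
]

print(location_likelihood('ATTA', pwm))  # roughly 61.5 with these illustrative numbers
```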
Applying the EM Algorithm
When we apply the EM algorithm we will encounter two mutually dependent steps. The expectation step generates new location likelihoods based on the previous motif matrix. In turn, the maximization step generates a new motif matrix based on the previous location likelihoods.
We initialize our first motif matrix by taking an average of all subsequences. 3 of 8 subsequences have A in the first position, 3 of 8 have T in the first position, 1 of 8 has G in the first position, and 1 of 8 has C in the first position. Thus, the first row of our motif matrix is given by (0.375, 0.375, 0.125, 0.125). The other rows are calculated in a similar fashion to yield the initial motif matrix:
        A      T      C      G
Pos 1   0.375  0.375  0.125  0.125
Pos 2   0.375  0.500  0.125  0.000
Pos 3   0.375  0.500  0.125  0.000
Pos 4   0.375  0.375  0.125  0.125
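As a sketch of this initialization in code (Python; the variable names are ours), the initial matrix is simply a per-position tally over all eight subsequences, divided by the number of subsequences:

```python
# Build the initial motif matrix by simple averaging of all 4-bp subsequences.
sequences = ['GATTACA', 'ACATTAG']
motif_len = 4

# Every 4-bp window of every input sequence (8 subsequences in total).
subseqs = [seq[i:i + motif_len] for seq in sequences
           for i in range(len(seq) - motif_len + 1)]

matrix = [{base: 0.0 for base in 'ATCG'} for _ in range(motif_len)]
for subseq in subseqs:
    for position, base in enumerate(subseq):
        matrix[position][base] += 1.0 / len(subseqs)

print(matrix[0])  # {'A': 0.375, 'T': 0.375, 'C': 0.125, 'G': 0.125}
```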
Once we have initialized a motif matrix, we enter our first expectation step. The calculations for the expectation step are identical to those shown in "Calculating Location Likelihoods": to obtain new location likelihoods for the 8 subsequences, we simply use the most recent motif matrix (in this first pass, the initial matrix above). We find that (p1, ..., p8) = (2.25, 9.00, 2.25, 1.69, 1.69, 2.25, 9.00, 2.25) after the first expectation step.
In the subsequent maximization step, we generate a new motif matrix to replace our initial motif matrix. The calculation is similar to the initialization, but instead of taking a simple average of the subsequences, we take a weighted average with p1 ... p8 as the weights. Thus, subseq1 carries a weight of 2.25, subseq2 carries a weight of 9.00, and so on. Carrying through with the calculations, we find that the new motif matrix is:
        A      T      C      G
Pos 1   0.648  0.204  0.074  0.074
Pos 2   0.204  0.741  0.056  0.000
Pos 3   0.204  0.741  0.056  0.000
Pos 4   0.648  0.204  0.074  0.074
Now that we have a new motif matrix, we use it in the expectation step to find new location likelihoods. Once we have new location likelihoods, we return to the maximization step to find another motif matrix. And then a third expectation step, and a third maximization step... And a fourth, a fifth, a sixth... etc. etc.
By repeatedly applying alternating steps of expectation and maximization, we will converge upon an equilibrium set of location likelihoods and an equilibrium motif matrix. For our particular example, the location likelihoods rapidly approach (p1, ..., p8) = (0, 256, 0, 0, 0, 0, 256, 0) and the motif matrix converges to:
        A      T      C      G
Pos 1   1.000  0.000  0.000  0.000
Pos 2   0.000  1.000  0.000  0.000
Pos 3   0.000  1.000  0.000  0.000
Pos 4   1.000  0.000  0.000  0.000
Our example is so simple that the convergence is virtually complete by the fourth iteration.
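For concreteness, here is a compact sketch of the full EM loop for this toy example (Python; the function and variable names are ours, and a uniform background is assumed). It reproduces the behaviour described above: the weights for the two ATTA subsequences approach 256 while all others approach 0.

```python
# Expectation maximization for the toy motif-finding example.
def em_motif(sequences, motif_len=4, iterations=10):
    subseqs = [seq[i:i + motif_len] for seq in sequences
               for i in range(len(seq) - motif_len + 1)]

    weights = [1.0] * len(subseqs)  # uniform weights reproduce the simple average
    for _ in range(iterations):
        # Maximization step: motif matrix = weighted average of the subsequences.
        total = sum(weights)
        matrix = [{base: 0.0 for base in 'ATCG'} for _ in range(motif_len)]
        for subseq, weight in zip(subseqs, weights):
            for position, base in enumerate(subseq):
                matrix[position][base] += weight / total

        # Expectation step: new location likelihoods from the current matrix,
        # assuming a uniform background of 1/4 per base.
        weights = []
        for subseq in subseqs:
            p_motif = 1.0
            for position, base in enumerate(subseq):
                p_motif *= matrix[position][base]
            weights.append(p_motif / (0.25 ** motif_len))

    return matrix, weights

matrix, weights = em_motif(['GATTACA', 'ACATTAG'])
print([round(w, 2) for w in weights])  # the two ATTA entries converge toward 256
```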
Applying the Gibbs Sampling Algorithm
The Gibbs sampling algorithm involves one step per input sequence. In our case, we are given only two input sequences. For the sake of exposition, let us make a copy of each and consider four sequences: [1] GATTACA, [2] GATTACA, [3] ACATTAG, and [4] ACATTAG. Under these conditions, there will be four corresponding steps in our algorithm.
We begin by randomly selecting a 4-bp subsequence from every sequence in the input. These random selections serve as our initial guesses of the motif location in each sequence. Let us assume that GATT, TACA, ATTA, and CATT are selected as our initial motif locations. With these we enter our first step, in which we isolate sequence 1 and consider it alone.
Our first step involves three sub-steps:
- Generating a Motif Matrix. We generate a motif matrix by considering all other sequences and our guesses of their motif locations. Currently, our guesses for sequences 2, 3, and 4 are the subsequences TACA, ATTA, and CATT. To calculate the motif matrix, we take a simple average of these subsequences to yield:
        A      T      C      G
Pos 1   0.333  0.333  0.333  0.000
Pos 2   0.667  0.333  0.000  0.000
Pos 3   0.000  0.667  0.333  0.000
Pos 4   0.667  0.333  0.000  0.000
- Calculating Location Likelihoods. We calculate location likelihoods for all subsequences of sequence 1 based on the motif matrix from the first sub-step. This follows the same procedure that was described above. Since sequence 1 has four subsequences, we end up with four location likelihoods: (p1, ..., p4) = (0.000, 12.642, 0.000, 12.642).
- Sampling a Motif Location. We select a 4-bp subsequence from sequence 1 by sampling according to the location likelihoods p1 ... p4. Thus, the probability of selecting ATTA and TACA will each be 0.500, and the probability of selecting GATT or TTAC will each be 0.000. The selected subsequence will become our new guess of the sequence 1 motif location, thereby displacing the initial guess of GATT.
All in all, the first step updates sequence 1 with our new guess of the motif location. We record this new guess and proceed to the second step, in which we isolate sequence 2 and consider it alone.
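The three sub-steps are straightforward to express in code. The sketch below (Python; the function name gibbs_update and all other names are ours) isolates one sequence, builds the motif matrix from the other sequences' current guesses, scores every 4-bp window, and samples a new location.

```python
import random

# One Gibbs update for a single sequence, given the current motif guesses
# (each a 4-bp subsequence) for all of the other sequences.
def gibbs_update(sequence, other_guesses, motif_len=4):
    # Sub-step 1: motif matrix = simple average of the other guesses.
    matrix = [{base: 0.0 for base in 'ATCG'} for _ in range(motif_len)]
    for guess in other_guesses:
        for position, base in enumerate(guess):
            matrix[position][base] += 1.0 / len(other_guesses)

    # Sub-step 2: location likelihoods for every 4-bp window of this sequence,
    # assuming a uniform background of 1/4 per base.
    windows = [sequence[i:i + motif_len]
               for i in range(len(sequence) - motif_len + 1)]
    likelihoods = []
    for window in windows:
        p = 1.0
        for position, base in enumerate(window):
            p *= matrix[position][base]
        likelihoods.append(p / (0.25 ** motif_len))

    # Sub-step 3: sample a new motif location in proportion to the likelihoods.
    return random.choices(windows, weights=likelihoods, k=1)[0]

# Sequence 1 updated against the guesses TACA, ATTA, and CATT: the result is
# ATTA or TACA, each with probability 0.5, as in the walkthrough above.
print(gibbs_update('GATTACA', ['TACA', 'ATTA', 'CATT']))
```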
The second step is completely analogous to the first step. There are again three sub-steps: First, we generate a motif matrix with the motif location guesses from the other sequences. Second, we use this motif matrix to calculate location likelihoods within sequence 2. Third, we use these location likelihoods to sample a new guess of the sequence 2 motif location.
We continue to the third step and the fourth step and perform the same three sub-steps on sequence 3 and sequence 4, respectively. We then return to sequence 1 and proceed again through the four steps. Then, we cycle through the four steps once again... And then again... etc. etc.
Like the EM algorithm, Gibbs sampling will allow our guesses of the motif locations to converge towards an equilibrium. Once we reach this equilibrium, we can extract an overall position weight matrix to describe our predicted motif.
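Putting the steps together, here is a minimal sketch of the whole sampler for the toy example (Python; a uniform background is assumed and all names are ours). Because the procedure is stochastic, individual runs differ, but the guesses typically settle on ATTA for every sequence.

```python
import random

# Gibbs sampling for the toy motif-finding example.
def gibbs_motif(sequences, motif_len=4, sweeps=50, seed=0):
    random.seed(seed)
    # Step 1: initialize one motif location (window) per sequence at random.
    windows = [[seq[i:i + motif_len] for i in range(len(seq) - motif_len + 1)]
               for seq in sequences]
    guesses = [random.choice(w) for w in windows]

    for _ in range(sweeps):
        for i, seq_windows in enumerate(windows):
            # Motif matrix from the guesses of all other sequences.
            others = guesses[:i] + guesses[i + 1:]
            matrix = [{b: 0.0 for b in 'ATCG'} for _ in range(motif_len)]
            for guess in others:
                for pos, base in enumerate(guess):
                    matrix[pos][base] += 1.0 / len(others)
            # Location likelihoods for every window of sequence i.
            likelihoods = []
            for window in seq_windows:
                p = 1.0
                for pos, base in enumerate(window):
                    p *= matrix[pos][base]
                likelihoods.append(p / (0.25 ** motif_len))
            # Resample the motif location for sequence i.
            if sum(likelihoods) > 0:
                guesses[i] = random.choices(seq_windows, weights=likelihoods)[0]
            else:
                guesses[i] = random.choice(seq_windows)
    return guesses

# Duplicating each input sequence, as in the walkthrough above.
print(gibbs_motif(['GATTACA', 'GATTACA', 'ACATTAG', 'ACATTAG']))
```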
Notes
In general, EM and Gibbs sampling can handle inputs that are far more complex than the example shown. Many programs available on the Web can manage hundreds of sequences with thousands of base pairs each. Furthermore, there are several variations of these two algorithms that attempt to improve predictive value based on prior knowledge of the motif's characteristics. For example, when calculating background probabilities in the EM algorithm, the assumption that all DNA bases are equally likely is seldom used; in most cases, this simplistic premise is replaced with a stochastic model that more accurately reflects the background base frequencies.
The use of EM for motif discovery has been described and implemented by C. E. Lawrence and T. L. Bailey, among others. The Gibbs sampling approach has been developed by C. E. Lawrence, J. S. Liu, and A. F. Neuwald, among others. For further reading, please see the listing provided below under "References". For motif-finding applications that are available on the Web, please follow the links provided below under "Online Resources".
Theoretical Background
Expectation Maximization
Expectation maximization can find the local maxima of likelihood functions that depend on latent variables. Consider a probability distribution p(γ, φ | θ), where γ are latent variables, φ are observable variables, and θ are model parameters. Expectation maximization finds the values of θ that maximize the marginal distribution p(φ | θ).
In a General Setting
To find the θ that maximizes p(φ | θ), begin with an initial estimate θ0. Then, for t = 1, 2, 3, ..., calculate the conditional expectation E(log p(γ, φ | θ) | φ, θt-1), and set θt to be the value of θ that maximizes this expectation. Note that the expectation is calculated as a weighted average over p(γ | φ, θt-1), i.e., E(log p(γ, φ | θ) | φ, θt-1) = ∫ log p(γ, φ | θ) p(γ | φ, θt-1) dγ. Also note that log p(γ, φ | θ) is used instead of p(γ, φ | θ) for ease of calculation. The sequence θ0, θ1, θ2, ... will converge to a local maximum of p(φ | θ).
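Written out as displayed equations (in LaTeX, writing Q for the conditional expectation introduced above), one iteration of the algorithm is:

```latex
\begin{aligned}
\text{E-step:}\quad & Q(\theta \mid \theta_{t-1})
  = \mathrm{E}\!\left[\log p(\gamma, \varphi \mid \theta) \,\middle|\, \varphi, \theta_{t-1}\right]
  = \int \log p(\gamma, \varphi \mid \theta)\, p(\gamma \mid \varphi, \theta_{t-1})\, d\gamma \\
\text{M-step:}\quad & \theta_t = \operatorname*{arg\,max}_{\theta}\; Q(\theta \mid \theta_{t-1})
\end{aligned}
```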
Applied to Motif Finding
When expectation maximization is applied to motif finding, the observable φ refers to the sequences, the latent γ refers to the positions of the motif within each sequence, and the parameter θ refers to the position weight matrix of the motif. Additionally, we include a constant θb that represents the background frequencies of DNA bases. The algorithm works as follows...
- Initialize θ0 by sampling from either a random or an informed prior.
- For t = 1, 2, 3, ..., until convergence:
- For each subsequence i, calculate the odds pi that subsequence i is an instance of the motif. That is, let pi = P(subseqi | θt-1) / P(subseqi | θb).
- Set θt to be an average of the subsequences weighted by their motif odds, i.e., θt = (∑i pi subseqi) / (∑i pi).
- Repeat Step 2 as needed until convergence.
Note that in the maximization step (Step 2b), θt is obtained from a closed-form expression. This removes the need for a direct search and greatly speeds up the algorithm.
Gibbs Sampling
Gibbs sampling generates samples from a joint probability distribution of random variables. The algorithm only involves sampling from the conditional distributions of the random variables and does not require direct sampling from the joint distribution.
In Two Dimensions
Let p(x, y) be a joint probability density function. To generate a sequence of samples (x0, y0), (x1, y1), (x2, y2), ... from p(x, y), begin by drawing (x0, y0) from a starting distribution p0(x, y). Then for t = 1, 2, 3, ..., if t is odd, draw xt from p(x | y = yt-1) and set yt = yt-1; if t is even, draw yt from p(y | x = xt-1) and set xt = xt-1.
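As a concrete illustration of this recipe, the sketch below (Python) draws samples from a bivariate normal distribution, for which both conditional distributions are themselves normal; the choice of target distribution, the correlation value, and all names are ours. Each pass through the loop performs one odd step and one even step in the notation above.

```python
import random

# Gibbs sampling from a bivariate normal with means (0, 0), unit variances,
# and correlation rho. The conditionals are x | y ~ N(rho*y, 1 - rho^2) and
# y | x ~ N(rho*x, 1 - rho^2).
def gibbs_bivariate_normal(n_samples, rho=0.8, burn_in=500, seed=0):
    random.seed(seed)
    x, y = 0.0, 0.0                      # arbitrary starting point
    cond_sd = (1.0 - rho ** 2) ** 0.5    # standard deviation of each conditional
    samples = []
    for t in range(burn_in + n_samples):
        x = random.gauss(rho * y, cond_sd)  # draw x from p(x | y)
        y = random.gauss(rho * x, cond_sd)  # draw y from p(y | x)
        if t >= burn_in:                    # discard the burn-in samples
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(10000)
print(round(sum(x for x, _ in samples) / len(samples), 2))  # close to the true mean of 0
```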
In n Dimensions
Let p(θ) be a joint probability density function, where θ = (θ1, ..., θn). To generate a sequence of samples θ0, θ1, θ2, ..., begin by drawing θ0 from a starting distribution p0(θ). Then for t = 1, 2, 3, ..., draw the ith component of θt from p(θi | θ1, ..., θi-1, θi+1, ..., θn), keeping the remaining components at their most recent values, where i cycles from 1 through n. (Note that θ1, ..., θi-1, θi+1, ..., θn represents all components of θ except for θi.)
Convergence Issues
The sequence of samples will converge only after an initial burn-in period. The samples generated during burn-in should be discarded because their distribution depends on the starting distribution instead of the final joint distribution. To assess convergence, generate multiple sequences from over-dispersed starting points. Initially, the variance among the sequences will be relatively high because the starting distribution is over-dispersed. Furthermore, the variance within each sequence will initially be low because each sequence has not had time to roam across the entire distribution. As the samples converge, the among-sequence variance will decrease and the within-sequence variance will increase until both approach the same final variance.
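A minimal sketch of this diagnostic (Python; the function names are ours), in the spirit of the Gelman-Rubin comparison of among-sequence and within-sequence variance:

```python
# Compare among-sequence and within-sequence variance for several chains of
# scalar samples. Values of the returned ratio near 1 suggest convergence;
# values well above 1 suggest the burn-in period is not yet over.
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def convergence_diagnostic(chains):
    n = len(chains[0])                                        # samples per chain
    chain_means = [sum(chain) / n for chain in chains]
    between = n * variance(chain_means)                       # among-sequence variance
    within = sum(variance(chain) for chain in chains) / len(chains)  # within-sequence
    pooled = (n - 1) / n * within + between / n               # pooled variance estimate
    return (pooled / within) ** 0.5                           # shrinks toward 1

# Usage: convergence_diagnostic([chain_a, chain_b, chain_c]) where each chain
# is a list of samples from a run started at an over-dispersed point.
```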
Applied to Motif Finding
When used to find transcription motifs, Gibbs sampling works as follows...
- Initialize one motif location for each sequence. This can be done by sampling from either a random or an informed prior.
- For sequence i:
- Temporarily remove sequence i from the set and calculate a motif matrix θ-i with the motif locations from all the remaining sequences.
- For each location π in the sequence, use θ-i to calculate the probability that location π is a motif. Normalize these probabilities to create the conditional distribution p(π | θ-i).
- Update the motif location for sequence i by sampling from p(π | θ-i).
- Cycle to the next sequence in the set.
- Repeat Step 2 as needed until the motif matrix converges.
Online Resources
References
- Bailey, T. L., and Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 28-36.
- Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Boca Raton: CRC Press LLC.
- Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F., and Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262, 208-214.
- Lawrence, C. E., and Reilly, A. A. (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Structure, Function, and Genetics 7, 41-51.