Pairwise Sequence Alignment
By Hala Iqbal
Introduction
Molecular Biology Background
Nucleotides are the building blocks of DNA. They consist of a sugar, a phosphate group and one of 4 nitrogenous bases attached: Adenine (A), Guanine (G), Cytosine (C) or Thymine (T).
Amino acids are the building blocks of proteins. There are 20 different amino acids used in protein synthesis in cells. Each amino acid is represented by a single letter, as shown in Figure 1.
Figure 1: The 20 Amino Acid's and their 3 and 1 letter codes
A gene is defined as transcribed DNA (usually coding for a protein) and any flanking sequences required for the regulation of its transcription. Genes are comprised of introns, which are non-coding regions, and exons that code for the proteins produced. Introns must be spliced out before proteins are produced.
Pairwise Sequence Alignment
What is Pairwise Sequence Alignment?
The goal of pairwise sequence alignment is to come up with the best possible alignment of two sequences. Given a scoring system for matches, mismatches and gaps (See Figure 2), the best possible alignment is defined as the alignment that optimizes the total score.
Figure2: Match, Mismatch, Gap in sequence alignment
Proteins sequences match the 20 amino acids and a gap placeholder.
DNA sequences match the 4 bases and a gap placeholder. DNA sequences can also be matched using additional strings of 2 or 3 bases. See Figure 3 for some strings defined by IUPAC.
Figure 3: Nucleotide sequences defined by IUPAC
Different scoring systems can be used depending on the sequence and the type of application in terms of the weight assigned to matches, mismatches and gaps. See Figure 4 for examples of the alignment that results from using different scoring systems.
Figure 4: Scoring schemes resulting in different matches
Applications of Pairwise Sequence Alignment
Pairwise sequence alignment is a useful tool in many fields of biology. For example, the similarity between sequences can used be in evolutionary analysis to find out what organisms share a common ancestor. Alternatively, the similarity between amino acid sequences can be used to predict the structure and functions of protein domains. Another use of pairwise sequence analysis is in genome sequencing assembly, where matches are used to find overlaps in the shorter pieces of DNA sequenced. Genome Assembly Page
Global vs. Local Alignment
Global alignment matches two whole sequences, whereas local alignment finds high scoring subsequence alignments between two sequences. An example of where local alignment would be more useful than global alignment is in finding similar function domains in proteins.
Approaches to Pairwise Sequence Alignment
Dot Matrix Approach
In this naïve and simplistic approach, the two sequences to be matched are lined as axes on a grid, and a dot is placed wherever there is an exact match. Diagonal lines can then be visually traced, which represent stretches of matched sequence (Figure 5).
Figure 5: Dot Matrix Approach
The downside to this approach is that it relies on visual analysis, and therefore it is hard to find optimal alignments. Also, it doesn’t make a distinction between gaps and mismatches, as necessitated by some scoring schemes.
Dynamic Programming Approach
Rundown of Strategy:
The essence of dynamic programming is the storing of sub-problem solutions for further use in the problem. Usually, this stored sub-problem solution is compared with the next sub-problem solution; the better solution is then compared with the next solution and so forth until the end of the sequence is reached. Mathematically, this means that the best alignment at a particular position (i,j) is calculated as the best alignment previous to (i,j) plus aligning these two (Figure 6).
Figure 6: Dynamic Programming approach to sequence alignment
Steps in Dynamic Programming
- Must be given the 2 sequences to be matched, the scores corresponding to matches and mismatches, and the gap penalty (gamma). Note that gap penalty may be omitted in Needleman-Wunch method.
E.g. Given: sequences AATGC and AGGC; gamma = 1; match score = 1; mismatch score = 0
- Initialize an N x M matrix, where N and M are the lengths of the two sequences, respectively. Use sigma (S) at the beginning of sequences as a placeholder for gaps. See Figure 7
Figure 7: N x M Matrix
- Fill the matrix sequentially, keeping track of the previous best alignment by drawing arrows (Figure 8). The formula used is: BestScorei,j = max{BestScorei-1,j – gamma, BestScorei,j-1 – gamma, BestScorei-1,j-1 + match score}. Gap penalty can be constant, or the gap opening and gap extension penalties can be different. Another option to be considered is whether or not end gaps are to be penalized. This penalty is significant when the two sequences are supposed to be of similar lengths, but if not, end gap penalties can be ignored.
Figure 8: Filling in the N x M Matrix. The green and purple arrows represent optimal alignments as per the next step
- Trace back from the maximum score to obtain optimal alignment. See Figure 9
Figure 9: Alignments with maximum score. The colors refer back to the arrows in Figure 8
Sequence Alignment algorithms employing dynamic programming:
Needleman-Wunch Algorithm:
This utilizes dynamic programming to globally align two whole sequences. It was developed in 1969, but is still the most sensitive as well as optimal algorithm for pair-wise alignment.
When filling the matrix in the Needleman-Wunch algorithm, only the previous column and row are searched for the best subscore (Figure 10). The gap penalty and mismatch scores in the Needleman-Wunch algorithm are usually 0.
Figure 10: Filling the alignment matrix using the Needleman-Wunch algorithm. Note that this assumes that the gap penalty and mismatch scores are 0.
Because the Needleman-Wunch algorithm is used for whole sequence alignments, the final trace back step must be from the lower right corner, which will by definition have the maximum score, to the upper left corner (Figure 11).
Figure 11: Final trace back step in the Needleman-Wunch algorithm
Smith-Waterman Algorithm:
This is useful for finding local high scoring subsequences. It involves slight modifications to the Needleman-Wunch method:
- Use negative mismatch score and gap penalties
- When filling the matrix, look through the cells in the column and row of the cell being filled, as well as the cell diagonal to it for the best possible subscore. (Figure 12)
Figure 12: Filling the alignment matrix using the Smith-Waterman algorithm.
- The minimum score for i,j is 0. If Si,j < 0, rewrite it to 0 – pointer is useless. If previous column or row is all 0, point Si,j to self
- The best score is sought anywhere in the matrix, not just at the lower right corner. Keep a global pointer to the best score, and trace back until a cell pointing to itself is obtained. (Figure 13).
Figure 13: Trace back step in Smith-Waterman Algorithm
Convex/Affine Gap Penalty:
This algorithm assumes a different penalty for opening gaps and extending gaps. It involves a slight modification= to the Smith-Waterman algorithm: the convex/affine gap penalty function.
Convex/Affine Gap Penalty Function
gamma(g) = a + (g-1)b
gamma = penalty for length of gap
a = penalty of gap opening
b = penalty of gap extending
g = length of gap
Usually the penalty for gap opening is much higher than b, such that the extension of gaps that have already been introduced is much favored. See this page for more information and a step-by-step example.
Figure 14: Sequence alignment employing convex penalty function.
Conclusions
Pairwise sequence alignment is an extremely useful tool for DNA and protein sequence analysis. The best approach to this employs dynamic programming. Some algorithms developed include the Needleman-Wunch algorithm for global alignment, and the Smith-Waterman algorithm (with or with convex penalty function) for local alignment.
References
Sources
- STATS 115 Lecture
- BIO 280 Lecture
- BIO 280 Lecture Video
Useful Software/Web-based applications
- ALION
- Kestrel
Papers
- Needleman and Wunsch, JMB 1970
- Smith and Waterman, JMB 1981
Comments (0)
You don't have permission to comment on this page.