| 
  • If you are citizen of an European Union member nation, you may not use this service unless you are at least 16 years old.

  • You already know Dokkio is an AI-powered assistant to organize & manage your digital files & messages. Very soon, Dokkio will support Outlook as well as One Drive. Check it out today!

View
 

Pairwise Sequence Alignment

Page history last edited by PBworks 15 years, 11 months ago

Pairwise Sequence Alignment

 

By Hala Iqbal

 


 

Introduction

 

Molecular Biology Background

 

Nucleotides are the building blocks of DNA. They consist of a sugar, a phosphate group and one of 4 nitrogenous bases attached: Adenine (A), Guanine (G), Cytosine (C) or Thymine (T).

 

Amino acids are the building blocks of proteins. There are 20 different amino acids used in protein synthesis in cells. Each amino acid is represented by a single letter, as shown in Figure 1.

 

Figure 1: The 20 Amino Acid's and their 3 and 1 letter codes

 

A gene is defined as transcribed DNA (usually coding for a protein) and any flanking sequences required for the regulation of its transcription. Genes are comprised of introns, which are non-coding regions, and exons that code for the proteins produced. Introns must be spliced out before proteins are produced.

 

Pairwise Sequence Alignment

 

What is Pairwise Sequence Alignment?

The goal of pairwise sequence alignment is to come up with the best possible alignment of two sequences. Given a scoring system for matches, mismatches and gaps (See Figure 2), the best possible alignment is defined as the alignment that optimizes the total score.

 

Figure2: Match, Mismatch, Gap in sequence alignment

 

Proteins sequences match the 20 amino acids and a gap placeholder.

DNA sequences match the 4 bases and a gap placeholder. DNA sequences can also be matched using additional strings of 2 or 3 bases. See Figure 3 for some strings defined by IUPAC.

 

Figure 3: Nucleotide sequences defined by IUPAC

 

Different scoring systems can be used depending on the sequence and the type of application in terms of the weight assigned to matches, mismatches and gaps. See Figure 4 for examples of the alignment that results from using different scoring systems.

 

Figure 4: Scoring schemes resulting in different matches

 

Applications of Pairwise Sequence Alignment

Pairwise sequence alignment is a useful tool in many fields of biology. For example, the similarity between sequences can used be in evolutionary analysis to find out what organisms share a common ancestor. Alternatively, the similarity between amino acid sequences can be used to predict the structure and functions of protein domains. Another use of pairwise sequence analysis is in genome sequencing assembly, where matches are used to find overlaps in the shorter pieces of DNA sequenced. Genome Assembly Page

 

Global vs. Local Alignment

Global alignment matches two whole sequences, whereas local alignment finds high scoring subsequence alignments between two sequences. An example of where local alignment would be more useful than global alignment is in finding similar function domains in proteins.

 

Approaches to Pairwise Sequence Alignment

 

Dot Matrix Approach

 

In this naïve and simplistic approach, the two sequences to be matched are lined as axes on a grid, and a dot is placed wherever there is an exact match. Diagonal lines can then be visually traced, which represent stretches of matched sequence (Figure 5).

 

Figure 5: Dot Matrix Approach

 

The downside to this approach is that it relies on visual analysis, and therefore it is hard to find optimal alignments. Also, it doesn’t make a distinction between gaps and mismatches, as necessitated by some scoring schemes.

 

Dynamic Programming Approach

 

Rundown of Strategy:

The essence of dynamic programming is the storing of sub-problem solutions for further use in the problem. Usually, this stored sub-problem solution is compared with the next sub-problem solution; the better solution is then compared with the next solution and so forth until the end of the sequence is reached. Mathematically, this means that the best alignment at a particular position (i,j) is calculated as the best alignment previous to (i,j) plus aligning these two (Figure 6).

 

Figure 6: Dynamic Programming approach to sequence alignment

 

Steps in Dynamic Programming

 

  • Must be given the 2 sequences to be matched, the scores corresponding to matches and mismatches, and the gap penalty (gamma). Note that gap penalty may be omitted in Needleman-Wunch method.

 

E.g. Given: sequences AATGC and AGGC; gamma = 1; match score = 1; mismatch score = 0

 

  • Initialize an N x M matrix, where N and M are the lengths of the two sequences, respectively. Use sigma (S) at the beginning of sequences as a placeholder for gaps. See Figure 7

 

Figure 7: N x M Matrix

 

  • Fill the matrix sequentially, keeping track of the previous best alignment by drawing arrows (Figure 8). The formula used is: BestScorei,j = max{BestScorei-1,j – gamma, BestScorei,j-1 – gamma, BestScorei-1,j-1 + match score}. Gap penalty can be constant, or the gap opening and gap extension penalties can be different. Another option to be considered is whether or not end gaps are to be penalized. This penalty is significant when the two sequences are supposed to be of similar lengths, but if not, end gap penalties can be ignored.

 

Figure 8: Filling in the N x M Matrix. The green and purple arrows represent optimal alignments as per the next step

 

  • Trace back from the maximum score to obtain optimal alignment. See Figure 9

 

Figure 9: Alignments with maximum score. The colors refer back to the arrows in Figure 8

 

Sequence Alignment algorithms employing dynamic programming:

 

Needleman-Wunch Algorithm:

This utilizes dynamic programming to globally align two whole sequences. It was developed in 1969, but is still the most sensitive as well as optimal algorithm for pair-wise alignment.

 

When filling the matrix in the Needleman-Wunch algorithm, only the previous column and row are searched for the best subscore (Figure 10). The gap penalty and mismatch scores in the Needleman-Wunch algorithm are usually 0.

 

Figure 10: Filling the alignment matrix using the Needleman-Wunch algorithm. Note that this assumes that the gap penalty and mismatch scores are 0.

 

Because the Needleman-Wunch algorithm is used for whole sequence alignments, the final trace back step must be from the lower right corner, which will by definition have the maximum score, to the upper left corner (Figure 11).

 

Figure 11: Final trace back step in the Needleman-Wunch algorithm

 

Smith-Waterman Algorithm:

This is useful for finding local high scoring subsequences. It involves slight modifications to the Needleman-Wunch method:

 

  • Use negative mismatch score and gap penalties
  • When filling the matrix, look through the cells in the column and row of the cell being filled, as well as the cell diagonal to it for the best possible subscore. (Figure 12)

 

Figure 12: Filling the alignment matrix using the Smith-Waterman algorithm.

 

  • The minimum score for i,j is 0. If Si,j < 0, rewrite it to 0 – pointer is useless. If previous column or row is all 0, point Si,j to self
  • The best score is sought anywhere in the matrix, not just at the lower right corner. Keep a global pointer to the best score, and trace back until a cell pointing to itself is obtained. (Figure 13).

 

Figure 13: Trace back step in Smith-Waterman Algorithm

 

Convex/Affine Gap Penalty:

This algorithm assumes a different penalty for opening gaps and extending gaps. It involves a slight modification= to the Smith-Waterman algorithm: the convex/affine gap penalty function.

 

Convex/Affine Gap Penalty Function

gamma(g) = a + (g-1)b

 

gamma = penalty for length of gap

a = penalty of gap opening

b = penalty of gap extending

g = length of gap

 

Usually the penalty for gap opening is much higher than b, such that the extension of gaps that have already been introduced is much favored. See this page for more information and a step-by-step example.

 

Figure 14: Sequence alignment employing convex penalty function.

 

Conclusions

Pairwise sequence alignment is an extremely useful tool for DNA and protein sequence analysis. The best approach to this employs dynamic programming. Some algorithms developed include the Needleman-Wunch algorithm for global alignment, and the Smith-Waterman algorithm (with or with convex penalty function) for local alignment.

 

References

 

Sources

- STATS 115 Lecture

- BIO 280 Lecture

- BIO 280 Lecture Video

 

Useful Software/Web-based applications

- ALION

- Kestrel

 

Papers

- Needleman and Wunsch, JMB 1970

- Smith and Waterman, JMB 1981

Comments (0)

You don't have permission to comment on this page.