Substitutions Matrices and Fast Sequence Similarity Search Algorithms
Contributed by Kevin Bartz
Scoring Matrices
The methods of pairwise sequence alignment we have seen so far are agnostic to the constituents of the sequence. Swapping an A for a T has cost the same as swapping a G for a C. This assumption is reasonable for DNA because there are only four base pairs. But in proteins there are 20 amino acids, some more and some less compatible than others. It seems appropriate to penalize candidate substitutions based on known chemical likelihoods when aligning protein sequences. We'd like to stiffen the penalty on unlikely matches, such as aspartic acid (Asp) and tryptophan (Trp), and lighten the penalty on chemically interchangeable pairs such as phenylalanine (Phe) and tyrosine (Tyr).
Substitution matrices provide a scoring mechanism for protein amino acid pairs based on the acids' chemical properties. The element in the j^th roth and k^th column of a substitution matrix represents the score associated with substituting the j^th amino acid for the k^th. The scores can be negative or positive, and we can interpret them as penalties (or, conversely, bonuses) associated with particular substitutions. Substitution matrices are always square and most often symmetric, meaning that a forward substitution typically gets the same score as the reverse substitution, although this assumption can be loosened.
There are several chemical factors determining the magnitude of the score for a given substitution pair. Among the most common are
- Size difference
- Shape difference
- Electric charge difference
- Van der Waals interaction
- Ability to ionize
- Hydrophobia or hydrophilia
- Acidity or basicity
- Crystal lattice structure
- Evolutionary history
- Hydrogen bonding
Once constructed, substitution matrices are quite straightforward to use. When aligning a sequence, one extracts the score for an aligned pair from the substitution matrix rather than from a global parameter. Adding these scores over all aligned pairs gives an overall score, which is minimized to find the optimal alignment.
PAM Matrices
Credited to Margaret Dayhoff (1978), the PAM (Point Accepted Mutation) matrix was the first such substitution matrix. It is highly empirical in nature, its scores based on observed point mutation rates of amino acids in closely related proteins. Dayhoff tabulated 1,572 changes in 71 groups of proteins that started at least 85% similar, grouping them by the amino acid pairs changed.
Dayhoff's approach has an intuitive evolutionary basis. Over time, small mutations that do not disturb function are more likely than radical mutations that cripple the organism. A polar, basic amino acid like arginine is most likely to be substituted for a compound like lysine with similar properties. Consequently, when aligning sequences, matching arginine with lysine should be considered less expensive than matching it with valine.
Dayhoff measured the empirical probability of such substitutions and constructed PAM1, each cell of which represents the likely rate of substitutions if 1% of the amino acids in a protein were to change. Now imagine that another 1% change, possibly including some of the same amino acids that have already changed once. The new rates represent the PAM2 matrix. Repeating this process 250 times results in the PAM250 matrix, the likely rates of change after 250% amino acid turnover. In fact, only PAM1 need be measured, and all higher-order matrices are simple powers of PAM1. Mathematically, multiplying a transition matrix by itself n times gives the n-step transition matrix.
PAM250, shown below, is the most popular PAM matrix. The values appearing in the PAM matrix are in fact the logs of the rate, averaged over the forward and backward substitution to make the matrix symmetric.
| | Ala | Arg | Asn | Asp | Cys | Gln | Glu | Gly | His | Ile | Leu | Lys | Met | Phe | Pro | Ser | Thr | Trp | Tyr | Val |
| | A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
| Ala | A | 13 | 6 | 9 | 9 | 5 | 8 | 9 | 12 | 6 | 8 | 6 | 7 | 7 | 4 | 11 | 11 | 11 | 2 | 4 | 9 |
| Arg | R | 3 | 17 | 4 | 3 | 2 | 5 | 3 | 2 | 6 | 3 | 2 | 9 | 4 | 1 | 4 | 4 | 3 | 7 | 2 | 2 |
| Asn | N | 4 | 4 | 6 | 7 | 2 | 5 | 6 | 4 | 6 | 3 | 2 | 5 | 3 | 2 | 4 | 5 | 4 | 2 | 3 | 3 |
| Asp | D | 5 | 4 | 8 | 11 | 1 | 7 | 10 | 5 | 6 | 3 | 2 | 5 | 3 | 1 | 4 | 5 | 5 | 1 | 2 | 3 |
| Cys | C | 2 | 1 | 1 | 1 | 52 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 1 | 4 | 2 |
| Gln | Q | 3 | 5 | 5 | 6 | 1 | 10 | 7 | 3 | 7 | 2 | 3 | 5 | 3 | 1 | 4 | 3 | 3 | 1 | 2 | 3 |
| Glu | E | 5 | 4 | 7 | 11 | 1 | 9 | 12 | 5 | 6 | 3 | 2 | 5 | 3 | 1 | 4 | 5 | 5 | 1 | 2 | 3 |
| Gly | G | 12 | 5 | 10 | 10 | 4 | 7 | 9 | 27 | 5 | 5 | 4 | 6 | 5 | 3 | 8 | 11 | 9 | 2 | 3 | 7 |
| His | H | 2 | 5 | 5 | 4 | 2 | 7 | 4 | 2 | 15 | 2 | 2 | 3 | 2 | 2 | 3 | 3 | 2 | 2 | 3 | 2 |
| Ile | I | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 10 | 6 | 2 | 6 | 5 | 2 | 3 | 4 | 1 | 3 | 9 |
| Leu | L | 6 | 4 | 4 | 3 | 2 | 6 | 4 | 3 | 5 | 15 | 34 | 4 | 20 | 13 | 5 | 4 | 6 | 6 | 7 | 13 |
| Lys | K | 6 | 18 | 10 | 8 | 2 | 10 | 8 | 5 | 8 | 5 | 4 | 24 | 9 | 2 | 6 | 8 | 8 | 4 | 3 | 5 |
| Met | M | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 6 | 2 | 1 | 1 | 1 | 1 | 1 | 2 |
| Phe | F | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 3 | 5 | 6 | 1 | 4 | 32 | 1 | 2 | 2 | 4 | 20 | 3 |
| Pro | P | 7 | 5 | 5 | 4 | 3 | 5 | 4 | 5 | 5 | 3 | 3 | 4 | 3 | 2 | 20 | 6 | 5 | 1 | 2 | 4 |
| Ser | S | 9 | 6 | 8 | 7 | 7 | 6 | 7 | 9 | 6 | 5 | 4 | 7 | 5 | 3 | 9 | 10 | 9 | 4 | 4 | 6 |
| Thr | T | 8 | 5 | 6 | 6 | 4 | 5 | 5 | 6 | 4 | 6 | 4 | 6 | 5 | 3 | 6 | 8 | 11 | 2 | 3 | 6 |
| Trp | W | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 55 | 1 | 0 |
| Tyr | Y | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 3 | 2 | 2 | 1 | 2 | 15 | 1 | 2 | 2 | 3 | 31 | 2 |
| Val | V | 7 | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 4 | 15 | 10 | 4 | 10 | 5 | 5 | 5 | 72 | 4 | 17 |
PAM scoring works best when aligning sequences from organisms that are the same or evolutionarily similar. Radically different organisms have mutations that emerged over evolutionary time scales and are poorly modelled by the short-term amino acid changes in the PAM matrix.
BLOSUM Matrices
PAM's weaknesses inspired Henikoff and Henikoff's 1992 BLOSUM (Blocks Amino Acid Substitution Matrices). They aligned 2,000 blocks of aligned segments from over 500 protein families in the Prosite database representing a wide variety of organisms. Their alignments were robust, allowing for up to 60-aa gaps. From these alignments they tabulated the substitution frequencies for all 20 x 20 amino acid pairs. The basic BLOSUM matrix consists of the logarithms of these relative substitution frequencies over all of the 2,000 aligned blocks.
But in the construction above, one might be suspicious of the choice of sequences. After all, if most of the protein sequences come from human-like organisms, wouldn't amino acid substitutions common to humans be favored in the BLOSUM matrix? In this case, BLOSUM would have the same problems as PAM. Concerns like these motivated the BLOSUMN matrix, which downweights substitution counts from sequences very similar to many other sequences. The BLOSUMN process:
- Assign weights to aligned sequences such that each collection of sequences more than N% identical to each other have the collective weight 1.
- Tally substitutions from each pair of aligned sequences, weighting by the sequence weights.
In this framework, increasing N has the effect of collapsing overrepresented substitution probabilities, a process illustrated graphically below.
The most commonly used BLOSUM matrix is BLOSUM62, shown below.
| A | R | N | D | C | Q | E | G | H | I | L | K | M | F | P | S | T | W | Y | V |
| A | 4 | -1 | -2 | -2 | 0 | -1 | -1 | 0 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 0 | -3 | -2 | 0 |
| R | -1 | 5 | 0 | -2 | -3 | 1 | 0 | -2 | 0 | -3 | -2 | 2 | -1 | -3 | -2 | -1 | -1 | -3 | -2 | -3 |
| N | -2 | 0 | 6 | 1 | -3 | 0 | 0 | 0 | 1 | -3 | -3 | 0 | -2 | -3 | -2 | 1 | 0 | -4 | -2 | -3 |
| D | -2 | -2 | 1 | 6 | -3 | 0 | 2 | -1 | -1 | -3 | -4 | -1 | -3 | -3 | -1 | 0 | -1 | -4 | -3 | -3 |
| C | 0 | -3 | -3 | -3 | 9 | -3 | -4 | -3 | -3 | -1 | -1 | -3 | -1 | -2 | -3 | -1 | -1 | -2 | -2 | -1 |
| Q | -1 | 1 | 0 | 0 | -3 | 5 | 2 | -2 | 0 | -3 | -2 | 1 | 0 | -3 | -1 | 0 | -1 | -2 | -1 | -2 |
| E | -1 | 0 | 0 | 2 | -4 | 2 | 5 | -2 | 0 | -3 | -3 | 1 | -2 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| G | 0 | -2 | 0 | -1 | -3 | -2 | -2 | 6 | -2 | -4 | -4 | -2 | -3 | -3 | -2 | 0 | -2 | -2 | -3 | -3 |
| H | -2 | 0 | 1 | -1 | -3 | 0 | 0 | -2 | 8 | -3 | -3 | -1 | -2 | -1 | -2 | -1 | -2 | -2 | 2 | -3 |
| I | -1 | -3 | -3 | -3 | -1 | -3 | -3 | -4 | -3 | 4 | 2 | -3 | 1 | 0 | -3 | -2 | -1 | -3 | -1 | 3 |
| L | -1 | -2 | -3 | -4 | -1 | -2 | -3 | -4 | -3 | 2 | 4 | -2 | 2 | 0 | -3 | -2 | -1 | -2 | -1 | 1 |
| K | -1 | 2 | 0 | -1 | -3 | 1 | 1 | -2 | -1 | -3 | -2 | 5 | -1 | -3 | -1 | 0 | -1 | -3 | -2 | -2 |
| M | -1 | -1 | -2 | -3 | -1 | 0 | -2 | -3 | -2 | 1 | 2 | -1 | 5 | 0 | -2 | -1 | -1 | -1 | -1 | 1 |
| F | -2 | -3 | -3 | -3 | -2 | -3 | -3 | -3 | -1 | 0 | 0 | -3 | 0 | 6 | -4 | -2 | -2 | 1 | 3 | -1 |
| P | -1 | -2 | -2 | -1 | -3 | -1 | -1 | -2 | -2 | -3 | -3 | -1 | -2 | -4 | 7 | -1 | -1 | -4 | -3 | -2 |
| S | 1 | -1 | 1 | 0 | -1 | 0 | 0 | 0 | -1 | -2 | -2 | 0 | -1 | -2 | -1 | 4 | 1 | -3 | -2 | -2 |
| T | 0 | -1 | 0 | -1 | -1 | -1 | -1 | -2 | -2 | -1 | -1 | -1 | -1 | -2 | -1 | 1 | 5 | -2 | -2 | 0 |
| W | -3 | -3 | -4 | -4 | -2 | -2 | -3 | -2 | -2 | -3 | -2 | -3 | -1 | 1 | -4 | -3 | -2 | 11 | 2 | -3 |
| Y | -2 | -2 | -2 | -3 | -2 | -1 | -2 | -3 | 2 | -1 | -1 | -2 | -1 | 3 | -3 | -2 | -2 | 2 | 7 | -1 |
| V | 0 | -3 | -3 | -3 | -1 | -2 | -2 | -3 | -3 | 3 | 1 | -2 | 1 | -1 | -2 | -2 | 0 | -3 | -1 | 4 |
Like the PAM matrices, the BLOSUM matrix is built on empirical substitution frequencies. BLOSUM alleviates PAM's shortcomings, however, by incorporating a diversity of protein families in its construction. Henikoff and Henikoff hoped that BLOSUM would be more useful than PAM in comparing sequences from evolutionarily divergent organisms.
Fast Sequence Similarity Search Algorithms
Given an observed protein sequence, there are many reasons to find similar sequences in the same organism.
- Perhaps our sequence contains a few errors, and we can retrieve the true version and information about its function.
- We can identify a family of proteins within an organism.
- We can identify the input sequence's point mutations, variations or SNPs.
There are also many reasons to find similar sequences in other organisms.
- We can infer a sequence's function by finding similar sequences in other organisms whose functions are known.
- We can identify homologs and orthologs in other organisms.
On first thought, we may think to run a sequence comparison method like Needleman-Wunsch repetitively to compare the given sequence with every other one in the database. Unfortunately, this is computationally prohibitive since it is O(N^2), where N is the number of candidate sequences. A fast sequence similarity search algorithm is necessary to parse through the possibilities more quickly and identify the most similar. There are several such techniques, each with its own strengths and weaknesses, but they all use similar tricks to reduce the complexity of the problem.
- Split the given sequence into a small number T of subsequences.
- Identify the top K candidate sequences sharing one of these T tokens. The great advantage of this is that it can be done in T scans of the candidate sequences, which is O(N) instead of O(N^2). Filter out those sequences not sharing the tokens.
- Extend the matched tokens for the top K sequences. Use Smith-Waterman or Needleman-Wunsch to compare the input sequence to each of the remaining candidate sequences.
FASTA
The FASTA algorithm, developed by Pearson and Lipman (1988), proceeds along almost exactly these lines. It first finds runs of identities from the input sequence among the set of candidate sequences. It scores all the runs with the PAM matrix, and limits its universe to those sequences sharing top-scoring segments; this is analogous to the second step, which takes only the top K candidate sequences. Finally, these matching segments are joined when possible for each sequence, and optimally aligned through dynamic programming. These alignments are sorted by similarity to the input sequence and top candidate sequences returned. The diagram below illustrates the process.
BLAST
First published by Altschul (1990), the BLAST algorithm is the fastest of the popular sequence similarity search mechanisms. It first removes low-complexity regions from the input and candidate sequences, such as repetitive strings like TTTTTT or ACACAC. At this point BLAST proceeds much along the general lines above, first breaking the candidate sequences into tokens (k-mers, for some k) hashed to identify candidates sharing tokens present in the input token. k for BLAST is typicallly chosen to be 11.
Carrying out Step 2, the candidates' matched tokens are scored using a substitution matrix and the top K are chosen. Then each matched token is extended in the manner of FASTA, with the extended alignments of the top candidate sequences aligned to the input sequence using Smith-Waterman. Finally, in Step 3, BLAST selects the candidate sequences with the highest similarity scores. At this point These similarities can be shown to follow an extreme-values distribution, which can be used to judge the significance of their similarity to the input sequence.
BLAST for Coding DNA
If the input sequence is known to be coding DNA, it should be translated to a protein before seeking similar sequences. Here, a zero gap-extension penalty should be used.
Psi-BLAST
Psi-BLAST (position-specific iterative BLAST) is an iterative version of BLAST. Sequences deemed similar to the input sequence are re-entered into BLAST to find additional similar sequences. Psi-BLAST is useful for finding entire protein families and related homologs.
Reciprocal BLAST
Reciprocal BLAST is an extension of BLAST aimed at finding orthologous sequences between two species. The method is very simple:
- Use BLAST to find sequences in the second species similar to a sequence in the first
- Place the top results back into BLAST to search for similar sequences in the first species
- If the second BLAST reproduces the original input sequence, reciprocal BLAST verifies that the two sequences are orthologous.
BLAT
BLAT (BLAST-like alignment tool) is very similar to BLAST, but even more efficient for longer sequences. It uses a similar approach, breaking the input and candidate sequences into k-mer tokens, but aggressively strains the candidate space by requiring very high similarity: at least 95% for DNA and 80% for proteins.
Conclusion
Finding similar sequences quickly is a major end in computational biology. Substitution matrices provide good metrics for comparing sequences, while search algorithms like FASTA, BLAST and BLAT provide means of evaluating comparison metrics over large numbers of sequences.
References:
Comments (0)
You don't have permission to comment on this page.