Fast Sequence Similarity Search Algorithms
Contributed by Alana Mendelsohn
Motivation
Fast Sequence Similarity Search Algorithms use biological sequence information to find similar sequences in the same organism. These algorithms have many uses, such as:
- Mapping a given sequence to a sequenced genome
- Inferring unknown sequence function
- Finding families of proteins in an organism
- Finding homologs and orthologs in another organism
- Finding sequence mutations or variations (SNPs)
These algorithms do not give as perfect of alignments as the Smith-Waterman algorithm, but they are much faster. The general approach employed by all fast sequence similarity search algorithms is to
- Break the query sequence and database sequences into shorter sequences to match subsequences
- Extend the matched subsequences and filter out hopeless sequences
- Use dynamic programming to get optical final alignment
FASTA
Pronounced "FAST-A," The FASTA algorithm, developed by Pearson and Lipman (1988), proceeds along almost exactly these lines. It breaks up the query and database into smaller words, observes the pattern of word-to-word matches of a given length, and marks potential matches before performing a more sensitive search using local sequence alignment.
First, FASTA first breaks the query and database sequence into k-mer words, then hashes the locations of the database words to speed up later searches. For each k-mer in the query, FASTA finds an exact match in the hashed database and and extends it in both directions, allowing for some mismatches. It then scores these matches using BLOSUM or PAM and uses dynamic programming to co-linearly stitch the high-scoring matches together. The diagram below illustrates the process.
FASTA sequence format starts with <, ID and an optional description. Each sequence line should be shorter than 80 characters.
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>Example2 synthetic DNA
ACCGTGTGRGCCTCGANGACKCTAWC
BLAST
Basic Local Alignment Search Tool
First published by Altschul et. al in 1990, the BLAST algorithm is one of the most widely used bioinformatics applications and is supported by NCBI.
BLAST provides an initial optional step to remove low-complexity regions from the query and database sequences, such as repetitive strings like TTTTTT or ACACAC, after which it proceeds along the same basic lines outlined above. It breaks the database sequence into k-mer words and hashes their location to speed up future searches. It then matches possible k-mers in the query sequence and evaluates them using substitution matrices. Only words that meet a T cutoff score are kept, and T is usually 11-13. this provides for approximately 50 words at each query position. BLAST then extends each database sequence with a high scoring words in both directions and uses BLOSUM to score the extended alignment. It then uses Smith-Waterman to join the high scoring extended segment pairs to get optimal alignment.
Statistical significance
BLAST local similarity scores follow the extreme value distribution (Altschul et al., 1994). Because it only considers the best values, the value distribution is skewed to the right.
The probability that a random alignment gets a given score or better is the pvalue.
The actual alignment score S can be normalized according to the equations = λS − ln Kmn
- m and n are the query and database lengths, respectively
- K and λ are constants
- Depend on the substitution matrix and sequence composition
- For a typical amino acid and PAM250 matrix
- K = 0.009, λ = 0.229
The normalized score than then be used to get the pvalue according to the pvalue equation above.
When x > 2 the probability can be approximated according to the equation
Another quick check is by looking at the raw score S divided by 3
The reported database sequences must also be above a threshold. This threshold is measured by the E value, the number (rather than probability) of matches expected merely by chance.
There is usually a [.05,10] threshold, where pvalue is less than .05 and E is less than 10.
The smaller the E value, the more stringent the test is.
BLAST Programs
There are a number of different BLAST programs that can be used depending on the type of query and database being used.
There are five main BLAST programs to choose from.
- Nucleotide blast: Search nucleotide database with nucleotide query
- Protein blast: Search protein database with protein query
- blastx: Search protein database with translated nucleotide query
- tblastn: Search translated nucleotide database with protein query
- tblastx: Search translated nucleotide database with translated nucleotide query
There are also a number of databases types available, including:
- swissprot and pdb (protein database): protein databases
- est (expressed sequence tag): short sequence tags used to identify gene transcripts
- refseq: a set of sequences, including DNA, RNA, and protein products
- nr (nonredundant): all non-redundant GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF
Psi-BLAST
Psi-BLAST (position-specific iterative BLAST) is an iterative version of BLAST. It aligns the high scoring hits in the initial round of BLAST to construct a profile for the hits, then uses this profile for the next iteration of BLAST. Psi-BLAST is especially useful for finding remote homologs and entire protein sequences. The drawback of psi-blast is that bad sequences can degrade the search quickly.
Reciprocal BLAST
Reciprocal BLAST is an extension of BLAST aimed at finding orthologous sequences between two species and is also called bi-directional best hit. Orthologs are genes related by vertical descent from a common ancestor and encode proteins with the same function in different species. Paralogs are homologous genes in evolved by duplication and code for proteins with similar but not identical functions in the same organism.
The method is very simple:
- Use BLAST to find sequences in the second species similar to a sequence in the first
- Place the top results back into BLAST to search for similar sequences in the first species
- If the second BLAST reproduces the original input sequence, reciprocal BLAST verifies that the two sequences are orthologous.
BLAT
BLAT (BLAST-like alignment tool) is very similar to BLAST, but is more efficient for longer sequences. It uses a similar approach, breaking the input and candidate sequences into k-mer words, searching for matches and stitching the matches together for the final alignment. The difference is that BLAT requires very high similarity for aligned species: at least 95% for DNA and 80% for proteins.
Conclusion
Finding similar sequences quickly is a major end in computational biology. Search algorithms like FASTA, BLAST and BLAT provide means of evaluating comparison metrics over large numbers of sequences.
References
Course lecture notes
Pearson WR, Lipman DJ. (1988). Improved tools for biological sequence comparison. PNAS, 85 (8): 2444-8.
Altschul SF, Lipman DJ. (1990). Protein database searches for multiple alignments. PNAS, 87(14):5509-13.
Altschul SF, Boguski MS, Gish W, Wootton JC. (1994). Issues in searching molecular sequence databases. Nat Genet., 6(2):119-29.
BLAST: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
BLAT: http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
Comments (0)
You don't have permission to comment on this page.