

Genome Sequencing and Annotation




Contributed by David Robinson





Biological Background


Some key terms


Gene: The Human Gene Nomenclature Committee defines a gene as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”. In essence, a gene comprises transcribed DNA (usually coding for a protein) and any flanking sequences required for the regulation of its transcription.


Genome: The genome of an organism refers to all its genetic information, present in the form of DNA. This includes both genes, as well as non-coding sequences.


Genome Sequencing: This refers to the identification of the bases present at each position of the genome of a species. Usually, the DNA to be sequenced is taken from various samples, because most of the sequence, especially the parts encoding critical proteins such as enzymes and ligands, is the same across individuals of a species. In humans, for example, more than 99% of the genome is identical in different individuals.


Applications of Genome Sequencing: why sequence the genome anyway?


Elucidation of the genome sequence is important for various reasons, the most important of which are gene mapping and prediction. These have applications in numerous fields, for example:


  • Medicine, including the diagnosis and early detection of genetically inherited diseases; development of gene therapy and custom-fit drugs (pharmacogenomics)
  • Alternative Energy sources, including the development of biofuels utilizing photosynthetic and microbial systems, gleaned from the study of bacterial genomics
  • Evolutionary studies, including lineage studies and phylogenetics
  • DNA Forensics, involving the use of DNA to identify individuals for crime tracking.




Sequenced Organisms


The genomes of many species have already been sequenced (see comprehensive list). The most famous case, however, is the sequencing of the human genome by the Human Genome Project. It began in 1990 as a government-funded project and used the clone-by-clone technique detailed below. In 1998, the firm Celera Genomics joined the race to sequence the genome; using whole-genome shotgun sequencing, it aimed to proceed at a much faster and more cost-effective pace. A rough draft of the sequence was announced in 2000, and the final draft was published in April 2003.


Genome Sequencing Strategies


Clone-by-clone shotgun sequencing


The general principle behind this strategy is to first construct a library of the DNA, and then sequence parts of the library by first mapping their location and then randomly cutting and sequencing multiple copies of the sequences. Because of the large number of overlaps this generates, the sequences can be reassembled into the genome. See Figure 1 for a pictorial representation.


Figure 1: The clone-by-clone shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.


Steps in Clone-by-Clone Shotgun Sequencing


1. Map Construction

This involves the cloning of parts of the genome into Yeast Artificial Chromosomes (YACs) for sequences up to ~1 Mb, and Bacterial Artificial Chromosomes (BACs) for smaller sequences of ~200 kb, for easy amplification and manipulation. To map the location of these smaller sequences to the genome, Sequence-tagged sites (STS) are used. An STS consists of DNA unique to a particular region, which is identified via PCR or probe hybridization. Multiple STS can be used to identify exactly where in the genome a clone is located: a hit matrix of the STS present in each clone is constructed, and the clones are ordered such that all clones containing the same STS are arranged sequentially (see Figure 2).


Figure 2. Mapping the relative location of clones to genomic DNA


Another way to map the location of the clones is to use restriction enzymes, which cut DNA at specific sequences. This can be used to generate a restriction fingerprint of each clone, which refers to the pattern obtained from running them on a gel. Clones with shared aspects of their restriction fingerprint overlap, and this can be used to map them onto the genome.
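As a rough illustration of fingerprint comparison, the sketch below counts the fragments that two clones share. The fragment sizes, the 2% size tolerance and the function name are all hypothetical choices made for this example; real fingerprinting software uses more careful statistics.

```python
def shared_fragments(fp_a, fp_b, tolerance=0.02):
    """Count bands that two restriction fingerprints have in common.

    fp_a, fp_b: lists of fragment sizes in bp read off a gel.
    Two bands are treated as the same fragment if their sizes differ by
    at most `tolerance` (an assumed relative measurement error).
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            shared += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


# Hypothetical fingerprints (fragment sizes in bp) for three clones.
clone_1 = [1200, 3400, 5100, 800, 2500]
clone_2 = [3410, 5090, 800, 9000, 150]
clone_3 = [700, 1900, 4300]

print(shared_fragments(clone_1, clone_2))  # many shared bands: likely overlap
print(shared_fragments(clone_1, clone_3))  # no shared bands: no evidence of overlap
```

Clones that share many bands are candidates for neighbours on the map; how many shared bands count as significant depends on the number of bands per clone and is not addressed here.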


The map construction phase is usually the most time-consuming part of genome sequencing. In the Human Genome Project, for example, it took 8 years.


2. Clone selection

The clone map generated is used to select the clones that form a minimum tiling pathway, that is, the smallest set of clones that needs to be sequenced to cover the entire genome (Figure 3). The criterion used to select clones is that they must be authentic: they must map to the right location with absolute certainty.


Figure 3: An example of a minimum tiling pathway, highlighted in red, resulting from clone selection.
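Once every clone has been placed on the map, choosing a minimum tiling pathway can be treated as an interval-cover problem. The sketch below uses the standard greedy approach (always take the clone that reaches furthest to the right among those overlapping what is already covered). The clone names and coordinates are invented, and this is not necessarily how any particular sequencing project made its selection.

```python
def minimum_tiling_path(clones, target_length):
    """Pick the fewest clones that cover positions 0..target_length.

    clones: dict mapping clone name -> (start, end) on the physical map.
    Raises ValueError if the clone map leaves a gap.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1][0])
    path, covered, i = [], 0, 0
    while covered < target_length:
        best_name, best_end = None, covered
        # Among clones starting inside the covered region, keep the one
        # extending furthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered:
            name, (_, end) = ordered[i]
            if end > best_end:
                best_name, best_end = name, end
            i += 1
        if best_name is None:
            raise ValueError(f"gap in clone map at position {covered}")
        path.append(best_name)
        covered = best_end
    return path


# Hypothetical clone map (coordinates in kb).
clone_map = {
    "A": (0, 150), "B": (50, 200), "C": (100, 320),
    "D": (180, 300), "E": (300, 450), "F": (310, 500),
}
print(minimum_tiling_path(clone_map, 500))
```

In this toy map, three of the six clones are enough to span the whole region; the rest are redundant and need not be sequenced.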


3. Subclone library construction

The YACs or BACs are copied multiple times, and then physically fragmented into 2-5 kb sequences using either sonication, which refers to the shearing of DNA using sound waves, or restriction enzymes (Figure 4). These random fragments are then subcloned into vectors; because every subclone sits in the same vector, all of them can be amplified using the same primers.


Figure 4: Subclone library construction


4. Random shotgun phase

This consists of reading the sequence of the subclone fragments via the dideoxy termination reaction, and then using bioinformatics programs to reassemble the sequences from their overlaps (Figure 5).


Figure 5: Random shotgun phase


Dideoxy Termination Reaction:

The dideoxy termination reaction was invented by Fred Sanger. It is so called because DNA is synthesized in the presence of a small percentage of fluorescently labeled dideoxynucleotides, which lack the 3’ OH group required for further elongation of the nucleotide chain. Chain elongation therefore ceases when one of them is incorporated, and because each type of dideoxynucleotide is labeled with a distinct fluorophore, the base present at that position can be identified (Figure 6).


The protocol for the dideoxy termination reaction involves melting double stranded DNA to form single stranded DNA, to which the primer common to all the subclones can then hybridize. DNA Polymerase I, the various dNTPs used for regular replication and fluorescently labeled ddNTPs (dideoxy NTPs) are added, and the DNA fragments of different lengths obtained are run on a gel. The read out process can be automated by the addition of a laser to detect the fluorophores running off the gel, a process developed by Leroy Hood and Michael Hunkapiller.


Figure 6: Dideoxy termination reaction
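The logic of the read-out can be mimicked in a few lines. The sketch below is a toy model, not a simulation of the chemistry: it assumes that a terminated product exists for every position, and the template sequence is made up. Each product is represented only by its length and the label on its terminal dideoxynucleotide; sorting by length, as the gel does, recovers the sequence.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def terminated_fragments(template_3to5):
    """Return (length, terminal_base) for every chain-terminated product.

    The template is written 3'->5', so the new strand grows left to right.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template_3to5)
    return [(n, new_strand[n - 1]) for n in range(1, len(new_strand) + 1)]


def read_gel(fragments):
    """Shorter fragments run faster; reading labels by size gives the sequence."""
    return "".join(base for _, base in sorted(fragments))


template = "TACGGATC"  # hypothetical template, written 3'->5'
mixture = terminated_fragments(template)
random.shuffle(mixture)  # products are mixed together in the reaction tube
print(read_gel(mixture))  # sequence of the newly synthesized strand, 5'->3'
```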


Bioinformatics Programs Involved:

Phred: base calling – assigns a base according to the signal received, along with a confidence score

Phrap: assembly – aligns sequences

Consed: editing
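The confidence score that Phred attaches to each base is conventionally expressed on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion (the function names are ours, not part of Phred):

```python
import math


def phred_quality(error_probability):
    """Quality score Q for a given probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse conversion: probability of error for a given quality score."""
    return 10 ** (-quality / 10)


for p in (0.1, 0.01, 0.001, 0.0001):
    print(f"P(error) = {p:<7} ->  Q = {phred_quality(p):.0f}")
```

On this scale, Q20 corresponds to one expected error in 100 bases and Q40 to one in 10,000.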


Coverage and Contigs

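Coverage = total number of bases sequenced / size of the target sequence (the clone or genome being sequenced).

Contigs = contiguous stretches of sequence that can be assembled from overlapping reads without gaps.


The relationship between coverage and the number of contigs is described by the Lander-Waterman curve (Figure 7), named after the two researchers who elucidated it.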


Figure 7: Lander-Waterman curve displaying the relationship between coverage and contigs
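For readers who want numbers behind the curve, the sketch below evaluates the standard Lander-Waterman expectations under their idealized assumptions (reads of equal length placed uniformly at random). With N reads of length L on a target of length G, coverage is c = NL/G, the expected number of contigs is N·e^(-cσ) with σ = 1 - T/L for a minimum detectable overlap T, and the expected fraction of the target left unsequenced is e^(-c). The read length and target size used are illustrative only.

```python
import math


def lander_waterman(n_reads, read_length, target_length, min_overlap=0):
    """Return (coverage, expected contigs, expected uncovered fraction)."""
    coverage = n_reads * read_length / target_length
    sigma = 1 - min_overlap / read_length
    expected_contigs = n_reads * math.exp(-coverage * sigma)
    fraction_uncovered = math.exp(-coverage)
    return coverage, expected_contigs, fraction_uncovered


# Illustrative values: 500 bp reads on a 100 kb clone.
for n in (100, 200, 400, 1000, 2000):
    c, contigs, gap = lander_waterman(n, 500, 100_000)
    print(f"{c:5.1f}x coverage: ~{contigs:6.1f} contigs, {gap:.2%} uncovered")
```

The number of contigs first rises as reads are added (most reads start new islands) and then falls as the islands merge, which is the shape of the curve in Figure 7.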


5. Directed finishing phase and sequence authentication

The finishing phase involves the closing of any gaps in the sequence by the generation of primers at the end and beginning of fragments that may be contiguous, and running subsequent sequencing reactions until the sequences overlap. This is followed by sequence authentication, via the verification of STS and restriction enzyme fingerprints. A finished genome is defined by the presence of < 1 error or ambiguity in 10,000 bp with sequences assembled in the right order and orientation along a chromosome with no gaps.


Advantages of the Clone-by-Clone Shotgun Sequencing: thorough, accurate

Disadvantages: time consuming, costly


Whole-genome shotgun sequencing


The general principle behind this strategy is to cut multiple copies of the whole genome directly into random fragments and sequence them. The numerous overlaps between the resulting sequences are then used to reassemble the genome. See Figure 8 for a pictorial representation.


Figure 8: The whole-genome shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.


Unlike Clone-by-Clone Shotgun Sequencing, this strategy skips the time consuming step of physical mapping, and uses jigsaw puzzle assembly (Figure 9). This is a bioinformatics-heavy approach, because of the lack of a clone map.


Figure 9: Assembly phase of whole-genome shotgun sequencing method


The specific components of shotgun assembly are:



Screening: the identification and elimination of low quality reads, contaminations and repeats.



Overlap detection: the high quality reads are searched for overlaps; the criterion used is an overlap of >= 40 bp with <= 6% mismatches.



Unique assembly: the easy subset (the part that assembles uniquely from the high quality reads) is assembled first.


Scaffold and Repeat Resolution

YAC and BAC clones are generated to aid in creating a scaffold for the genome. The two ends of each cloned fragment are sequenced; these read pairs indicate how far apart the two reads lie and, together with available physical map information, help to locate and orient the assembled sequences on chromosomes.
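The overlap criterion quoted above (at least 40 bp with at most 6% mismatches) can be sketched as a simple suffix-prefix comparison. This is a naive illustration that considers substitutions only and tries every overlap length; production assemblers use indexed, gap-aware alignment. The random test genome and the read positions are invented.

```python
import random


def suffix_prefix_overlap(read_a, read_b, min_length=40, max_mismatch=0.06):
    """Longest overlap of the end of read_a with the start of read_b.

    Returns the overlap length, or 0 if no overlap of at least
    `min_length` bases has a mismatch fraction <= `max_mismatch`.
    """
    for k in range(min(len(read_a), len(read_b)), min_length - 1, -1):
        mismatches = sum(x != y for x, y in zip(read_a[-k:], read_b[:k]))
        if mismatches <= max_mismatch * k:
            return k
    return 0


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(300))  # toy genome
read_1 = genome[0:100]
read_2 = genome[50:150]    # true overlap with read_1: 50 bp
read_3 = genome[80:180]    # true overlap with read_1: 20 bp (too short)

print(suffix_prefix_overlap(read_1, read_2))
print(suffix_prefix_overlap(read_1, read_3))
```

Requiring a minimum length and tolerating a few mismatches balances two risks: chance matches between unrelated reads, and missed overlaps caused by sequencing errors.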




Advantages: faster, less expensive

Disadvantages: requires physical map from clone-by-clone approach to make scaffold


Hybrid method


This approach incorporates both the Clone-by-Clone Shotgun and Whole-Genome Shotgun strategies. The optimal ratio of the two is still being determined, and in the meantime 8-10x coverage is needed. Higher eukaryotes, for example, require more clone-by-clone sequencing (although this is on the decline with the advent of comparative genomics), whereas the genomes of simpler organisms such as bacteria can be determined by the whole-genome shotgun method alone.


Figure 10: Hybrid method combining Clone-by-Clone Shotgun sequencing and Whole-Genome Shotgun strategies


Automatic Annotation


The sequencing of many new model organisms in the last decade has made automatic annotation more important than ever. Annotated genomes are highly useful to scientists studying the biology or evolution of a species, as they allow pieces of the genome to be mapped to their function.


Automatic annotation software helps predict genes, regulatory elements, and coding sequences by recognizing structural or functional features in the nucleotide sequence of the genome.


Since prokaryotes and eukaryotes differ in their post-transcriptional modifications, the gene prediction strategies used for each also differ.


Key Biological Terms


Introns: Sequences spliced out of mRNA to form mature mRNA


Exons: Sequences that are not spliced out of mRNA. Only exons are present in mature mRNA


Codon: A three-nucleotide sequence of bases that codes for a particular amino acid. The genetic code is degenerate, that is, multiple codons may code for the same amino acid. In addition, there are stop codons, which do not code for any amino acid and cause termination of the nascent peptide.


Open Reading Frames (ORFs): An ORF is defined as a sequence of codons that starts with the start codon AUG, ends with a stop codon (UAA, UAG, or UGA), and has no stop codons in between.


Intrinsic: Without using reference data outside of the genome being studied


Extrinsic: Using data from other sources when studying a genome


There are two categories of automatic annotation: ab initio methods, and comparative genomics methods.


Ab initio (intrinsic)


Ab initio methods predict genes and other functional elements by matching the sequence against profiles of what such elements typically look like.




Advantages:

  • Does not require reference genomes or other extrinsic data
  • Can recognize nearly universal rules governing functional elements, such as coding sequences starting with start codons.




Disadvantages:

  • Generally requires tailoring to different types of genomes, particularly eukaryotic vs. prokaryotic
  • Is limited to functional elements that are known and can be programmed into the algorithm


Ab initio techniques differ greatly between prokaryotes and eukaryotes, because differences in post-transcriptional modification give their genes very different structures and appearances.


Prokaryote Methods

This involves a combination of the following, which are used to create a coherent gene model complete with intergenic DNA. GLIMMER software, in particular, can be used to predict genes with 98-99% accuracy.


Identification of ORFs


Possible ORFs are identified. Note that each DNA sequence has 6 reading frames: 3 on the forward strand and 3 on the reverse strand. The software can eliminate ORFs that are too short, as they are unlikely to be coding sequences.
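A minimal sketch of six-frame ORF identification follows. It works on DNA, so the start codon appears as ATG and the stop codons as TAA, TAG and TGA. The length cutoff and the example sequence are arbitrary, and real gene finders handle details this ignores (alternative start codons, ORFs running off the end of the sequence, nested starts).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Yield (strand, frame, start, end, orf) for each ATG...stop ORF.

    All six reading frames are scanned. Coordinates refer to the strand
    being scanned. `min_codons` counts codons before the stop codon.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None


example = "CCATGAAATTTGGGTAACC"  # hypothetical sequence
for orf in find_orfs(example):
    print(orf)
```

Raising `min_codons` implements the filter described above: short ORFs arise frequently by chance and are unlikely to be real coding sequences.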


Scoring windows with coding statistics


A coding statistic is defined as a function that calculates the probability that a given sequence is coding for a protein. Most coding statistics measure one or more of the following:


  • Codon bias: This is the probability that a certain codon, rather than one of its synonymous codons, is used to encode an amino acid in a gene. The bias is broadly consistent within a species, in that the relative abundance of the codons used for each amino acid is more or less similar from gene to gene. It also varies with expression level: because bias is correlated with tRNA abundance through selection, highly expressed genes display a different codon bias from genes expressed at a lower level.
  • Codon pair (hexamer) enrichment: This bases the coding statistic on the occurrence of specific pairs of codons in the sequence.
  • Amino acid usage: This depends on the prevalence of certain amino acids over others in the ORF.
  • Amino acid pair preference: This takes into account the pattern of occurrence of pairs of amino acids next to each other.
  • GC content: This measures the percentage of the sequence that is composed of G and C bases.
  • Third position in codon: This position is usually variable, with different bases at this position coding for the same amino acid. Many genomes display a sequence-wide preference for a particular base at this third position.


Most of the measured features are conserved across the genome. A training set of known genes from the same genome is needed to be able to compare the different measurements to find the coding statistics, such that the ORF that is most likely to code for a protein can be identified.
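As one concrete example of a coding statistic, the sketch below scores a sequence by codon usage: codon frequencies are learned from a training set of known genes, and a candidate is scored by its log-likelihood ratio against uniform codon usage. The uniform background, the pseudocount and the tiny training set are simplifying assumptions made for illustration; sequences are assumed to be uppercase A/C/G/T.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def split_codons(seq):
    """Split a sequence into in-frame codons, dropping any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_model(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from known genes of the same genome."""
    counts = Counter(c for gene in known_genes for c in split_codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, model):
    """Log-likelihood ratio: trained codon usage vs. uniform usage (1/64)."""
    return sum(math.log(model[c] * 64) for c in split_codons(seq))


# Hypothetical training genes with a strong preference for GCT and AAA.
training = ["ATGGCTGCTGCTAAAGCTTAA", "ATGGCTAAAGCTGCTGCTTAA"]
model = train_codon_model(training)
print(coding_score("GCTGCTAAAGCT", model))  # uses preferred codons
print(coding_score("CGACGACGACGA", model))  # uses codons unseen in training
```

A positive score means the window looks more like the training genes than like random sequence; sliding this score along each reading frame gives the "scoring windows" approach named in the heading.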


Identification of individual gene architecture elements


All genes contain certain structural elements aiding in the various stages of transcription, termination, translation, etc. These can be used to pinpoint the location of genes. Some examples include:


  • Promoter Sequence: This is found near the start site of transcription, and represents binding sites for various transcription factors. The sigma-dependent promoter element at the –10 position, called the Pribnow box (the bacterial counterpart of the eukaryotic TATA box), is particularly useful. The consensus sequence of prokaryotic promoter motifs is shown in Figure 11.


Figure 11: Typical Bacterial Promoter. The consensus sequence is shown in black, with the relative abundance of each base in red.


  • Transcription Termination Signals: These are found near the end of transcribed sequences, and can be rho-dependent, which are 70-80 nucleotide long regions rich in C and poor in G, or rho-independent, which form hairpin loops.
  • Shine-Dalgarno Sequences: these are found near translation start sites, and represent ribosome binding sites.
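Signals such as these are often located by scanning for approximate matches to a consensus sequence. The sketch below does this with a simple mismatch count. The consensus strings used (TTGACA for the -35 element, TATAAT for the -10 element) are the commonly quoted E. coli consensus sequences; the test sequence, the 17 bp spacer and the mismatch threshold are assumptions for illustration. Real tools usually score sites with position weight matrices rather than a fixed consensus.

```python
def consensus_hits(seq, consensus, max_mismatches=1):
    """Return (position, matched_text, mismatches) for each approximate hit."""
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


# Hypothetical upstream region: -35 element, 17 bp spacer, -10 element.
upstream = "GCG" + "TTGACA" + "A" * 17 + "TATAAT" + "GCGC"
print(consensus_hits(upstream, "TTGACA", max_mismatches=0))
print(consensus_hits(upstream, "TATAAT", max_mismatches=0))
print(consensus_hits("CCTACAATCC", "TATAAT", max_mismatches=1))
```

The same function could be pointed at a Shine-Dalgarno-like consensus upstream of a candidate start codon; allowing more mismatches finds weaker sites at the cost of more false positives.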



Eukaryote Specific


Eukaryotic gene prediction differs from prokaryotic gene prediction in that it must take intron splicing and post-transcriptional modifications into account too (Figure 12).


Figure 12: Various steps leading to translation in Eukaryotes.


  • Coding Statistics: Similar to that in prokaryotes.
  • Model Gene Structure: Structural elements to look for include:

- Transcription factor binding motifs

- CpG islands: The dinucleotide CpG is depleted over most of the genome, because the C in CpG is usually methylated and methyl-C is easily mutated to T. Near promoter regions, however, C is undermethylated under selective pressure, which preserves clusters of CpG known as CpG islands. These are typically a few hundred bp long, and have a high abundance of C’s and G’s.

- Transcription Initiation Signature: This consists of TATA boxes, similar to prokaryotic promoters, but these are less common.

- RNA Cap signature: The cap is usually added after a CAG sequence.

- Translation Initiation: The Kozak consensus is found 20 nts from the cap. See Figure 13.


Figure 13: Kozak consensus sequence at translation initiation site


- Splicing Signature: This includes conserved sites defining exons and directing splicing

- Poly-A Signal Sequence: This is a conserved sequence of AAT(U)AAA that occurs near the end of the transcript, which contributes to transcription termination.
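Of the signals listed above, CpG islands are the easiest to quantify. A common way to do so (not described in the text, so treat the details as an assumption) is to compute, for a window of sequence, the G+C fraction and the ratio of observed to expected CpG dinucleotides, where the expected count is (number of C × number of G) / window length. Widely quoted thresholds are a G+C fraction above 0.5 and an observed/expected ratio above 0.6.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA window."""
    n = len(window)
    c_count = window.count("C")
    g_count = window.count("G")
    cpg_count = window.count("CG")
    gc_fraction = (c_count + g_count) / n
    if c_count == 0 or g_count == 0:
        return gc_fraction, 0.0
    observed_over_expected = cpg_count * n / (c_count * g_count)
    return gc_fraction, observed_over_expected


# Two toy windows with identical base composition but different CpG content.
print(cpg_stats("CGCGCGCGAT"))  # CpG-rich
print(cpg_stats("CCCCGGGGAT"))  # same G+C, only one CpG
```

The two windows have the same G+C fraction but very different CpG ratios, which is why the ratio, and not GC content alone, is used to flag islands. Real windows would be hundreds of bases long.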



    • GenScan is a program used for Eukaryotic gene prediction, which uses a Hidden Markov Model with hidden states including exons, introns, intergenic regions, promoters, etc. It also takes hexamer coding sequences and matrix profiles for gene structure into account. Genscan requires training sequences with known coding/noncoding regions to determine probabilities, as well as to train for coding statistics, before it can predict with accuracy.


    • GeneMark also performs gene prediction using a hidden Markov model. Its hidden states include states for initial, internal, and terminal exons, as well as introns, intergenic regions, and donor and acceptor splice sites. GeneMark provides a hidden Markov model for both prokaryotic and eukaryotic genomes, as well as a self-training version, GeneMarkS.
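To show how a hidden Markov model turns a sequence into a labeling, the sketch below runs the Viterbi algorithm on a deliberately tiny model with two hidden states. The states, transition probabilities and emission probabilities are invented for illustration and bear no relation to the trained parameters of GenScan or GeneMark, whose models have many more states and richer emissions.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (computed in log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for symbol in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            best_prev = max(
                states,
                key=lambda r: scores[-1][r] + math.log(trans_p[r][s]),
            )
            current[s] = (scores[-1][best_prev]
                          + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(current)
        backpointers.append(pointers)
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model: "exon" emits mostly G/C, "intergenic" emits mostly A/T.
states = ["intergenic", "exon"]
start_p = {"intergenic": 0.5, "exon": 0.5}
trans_p = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
           "exon": {"intergenic": 0.1, "exon": 0.9}}
emit_p = {"intergenic": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "exon": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

sequence = "AAAAA" + "GCGCGCGC" + "TTTTT"
path = viterbi(sequence, states, start_p, trans_p, emit_p)
print("".join("E" if s == "exon" else "-" for s in path))
```

Training, as described for GenScan above, amounts to estimating the transition and emission tables from sequences whose coding and noncoding regions are already known.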


Comparative Genomics (extrinsic)


A popular recent strategy has been to use comparative genomics to annotate functional regions. While implementations differ, the basic strategy of such programs is to use either alignment or BLAST algorithms to map annotations from a reference genome to a newly sequenced genome.




Advantages:

  • Since coding sequences evolve more slowly than non-coding sequences, there are usually significant homologies in genes between related genomes, which can be exploited to recognize them and discard non-coding sequences
  • Is less species-specific: the algorithm can be defined and then applied to different species using appropriate reference libraries
  • Can incorporate features that appear in the reference sequences but that are hard to recognize or to build into an ab initio algorithm




Disadvantages:

  • Requires extrinsic data (annotated reference genomes), which may not be available or reliable
  • Genomic alignments are highly computationally intensive
  • Difficult to find the right reference genome
    • Using a genome that has too large an evolutionary distance can mean there are few useful homologies, but choosing a reference sequence that is too close can hurt the algorithm's ability to distinguish between coding and non-coding sequences (since coding sequences evolve more slowly).




Sequences can be compared against several kinds of extrinsic data:

  • To Expressed Sequence Tags (ESTs): Mature mRNAs are converted to cDNAs and used as EST sequences, to which the sequences being tested are compared to look for similarities. Downsides to this approach include a 3’ bias, because the primer used is against the poly-A tail, and the presence of contaminating sequences. The lack of uniformity is also a problem, as highly expressed genes are enriched, and not all genes are seen due to different expression patterns.


  • To known gene or protein: Programs to do this include BLASTN, BLASTP, Psi-BLAST (yields protein families). Downsides include the fact that novel genes can’t be found, and pseudogenes can’t be distinguished from active genes.


  • Conservation between genomes: TwinScan is a program that uses BLAST comparisons between sequenced genomes. It is built upon GenScan, with the addition of the probability of generating the observed conservation from each state. Highly conserved regions are especially useful for making gene predictions, since exon sequences in particular are usually conserved.




There is a wide variety of software that performs automatic annotation using these comparative genomics methods. A table of such software can be found below.




Program | Organism | Databank or required input | Alignment | Gene reconstruction

AAT (66) (http://genome.cs.mtu.edu/aat.html) | Primates, rodents, other | cDNA, protein | DDS (improved BLASTX), DPS (improved BLASTN) | NAP, GAP2
ALN (62) | - | Protein | Tron code, PAM 250 | -
CEM (81) | - | Two genomic sequences | BLASTX output, WMM for sites | DP
EbEST (73) (http://ares.ifrc.mcw.edu/EBEST/ebest.html) | Human, other | dbEST | BLASTN, EST clustering, Smith–Waterman-based gapped alignment | 3'-UTR detection, assembly of EST-tagged exons
Est2genome (74) | - | EST or cDNA, preferably BLASTN output | Modified Smith–Waterman Needleman–Wunsch algorithm | No
GeneSeqer (67,68) (http://bioinformatics.iastate.edu/cgi-bin/gs.cgi) | Arabidopsis, maize, generic plant | dbEST or EST database or proteins | Spliced alignment, splice recognition with SplicePredictor if missing EST match | Yes
GeneWise (60) (http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml) | Human | One protein or a HMM profile | Global alignment translated ORF/protein | DP (dynamite)
GENQUEST (http://compbio.ornl.gov/Grail-bin/EmptyGenquestForm) | - | dbEST, SwissProt, Prosite, BLOCKS, GSDB | Smith–Waterman, Blast, Fasta | -
ICE (64) (http://theory.lcs.mit.edu/ice) | - | dbEST, OWL | Look-up | DP
INFO (63) | - | Nr | 25mer look-up table, protein/protein alignments scored with PAM 40, PAM 120, PAM 250, BLO62 | No
ORFgene2 (61) (http://l25.itba.mi.cnr.it/~webgene/wwworfgene2.html) | Human, mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis | SwissProt | BlastP, WAM for splice sites, identity score on frequencies of dipeptides | Compatibility graph, DP
PredictGenes (http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html) | Invertebrates, vertebrates, prokaryotes, plants | SwissProt | PAM 250 | DP
PROCRUSTES (59) (http://www-hto.usc.edu/software/procrustes/wwwserv.html) | Vertebrates | One homologous protein | Protein/protein alignments scored with PAM 120 | DP
Pro-Gen (83) (http://www.anchorgen.com/pro_gen/pro_gen.html) | - | Two genomic sequences | Alignment of translated sequences scored with PAM 120 | DP
ROSETTA (80) (http://crossspecies.lcs.mit.edu/) | Human, mouse | Two genomic sequences | GLASS (global alignment system), PAM 20, Genscan method for splice sites | DP
SGP-1 (82) (http://soft.ice.mpg.de/sgp-1) | Vertebrates, angiosperms | Two genomic sequences or a pairwise local alignment output | Local alignment | DP
SIM4 (69) | All eukaryotes | cDNA/genomic | HSP from Blast | No
SLAM (85) (http://baboon.math.berkeley.edu/~syntenic/slam.html) | Human, mouse | Two genomic sequences | Generalized pair HMM | DP
Spidey (70) (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html) | Vertebrates, Drosophila, C.elegans, plant | One genomic sequence/set of mRNAs | Two Blasts: high stringency and low stringency | -
SYNCOD (72) (http://l25.itba.mi.cnr.it/~webgene/wwwsyncod.html) | Human, mouse, Drosophila, Arabidopsis, Aspergillus, Caenorhabditis | BLASTN output | Silent/replacement ratio, Monte Carlo simulations | No
TAP (75) (http://sapiens.wustl.edu/~zkan/TAP/) | Human, mouse, Drosophila | dbEST | WU-BLASTN, SIM4 | Yes
Utopia (84) | All eukaryotes | Two genomic sequences | Local alignment | Yes


Combination of different approaches


Programs such as GeneMachine perform both predictive and comparative gene predictions, using:

  • GRAIL/MZEF: coding exon prediction
  • GENSCAN/HMMGenes/FGenes: gene structure prediction
  • RepeatMasker/Sputnik: complexity and interspersed repeat prediction
  • BLAST: searches for sequence homology


Annotation Complications


Each method, and even each application and piece of software, has its own advantages and disadvantages, but there are complications and exceptions that distort the predictions of all the methods. These include:


  • Sequencing errors
  • Pseudogenes
    • Pseudogenes lack protein-coding ability, usually because of a relatively recent mutation or insertion of a transposable element
    • Despite not coding or functioning, they still have many features of typical genes
  • Short exons coding for only a few amino acids
  • RNA coding genes
    • Structured like protein coding genes but function differently
  • Genes within genes
  • Unusual start codons
    • Codons besides AUG have been found to function as start codons
  • Ribosomal frameshifting
    • Translational recoding means that the coding sequence will not necessarily resemble the protein produced
  • Introns in the 5’ UTR
  • Alternative Splicing
    • Makes gene structures difficult for ab initio algorithms to recognize




Genome sequencing can be done by the clone-by-clone shotgun method, the whole-genome shotgun method or more recently, by a combination of the two. The finished sequence has a wide variety of applications, including gene prediction.


Gene annotation differs in prokaryotes and eukaryotes because of the differences in the structure of their genomes. It consists of both ab initio, or intrinsic, methods, and comparative genomics, or extrinsic, methods. Both have their advantages, and it is clear that there are genes that can be found by each method that cannot be found by the other. For example, recently evolved genes are difficult to find using extrinsic, homology-based methods, and genes that are missing common features or that do not fit common profiles of genes are difficult to find using ab initio methods. Therefore, the most likely future of the field consists of combinations of both methods.


Genome sequencing and annotation, by offering an extraordinarily "low-level" view of biological processes and a map of known biological functions to the genome, promises to be one of the foundations for future work in biology, medicine, and genetic engineering.






