Genome Sequencing and Annotation

Saved by PBworks
on May 10, 2008 at 5:49:10 am
 


 

Contributed by David Robinson

 


 

Introduction

 

Biological Background

 

Some key terms

 

Gene: The Human Gene Nomenclature Committee defines a gene as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”. In essence, a gene is comprised of transcribed DNA (usually coding for a protein) and any flanking sequences required for the regulation of its transcription.

 

Genome: The genome of an organism refers to all its genetic information, present in the form of DNA. This includes both genes, as well as non-coding sequences.

 

Genome Sequencing: This refers to the identification of the base present at each position of the genome of a species. Usually, the DNA to be sequenced is taken from only a few individuals, because most of the sequence, especially regions encoding critical proteins such as enzymes and ligands, is nearly identical among members of the same species. In humans, for example, more than 99% of the genome is identical in different individuals. Source

 

Applications of Genome Sequencing: why sequence the genome anyway?

 

Elucidation of the genome sequence is important for various reasons, the most important of which is gene mapping and prediction. This has application in numerous fields, for example:

 

  • Medicine, including the diagnosis and early detection of genetically inherited diseases; development of gene therapy and custom-fit drugs (pharmacogenomics)
  • Alternative Energy sources, including the development of biofuels utilizing photosynthetic and microbial systems, gleaned from the study of bacterial genomics
  • Evolutionary studies, including lineage studies and phylogenetics
  • DNA Forensics, involving the use of DNA to identify individuals for crime tracking.

 

Source

 

Sequenced Organisms

 

The genomes of many species have already been sequenced (see comprehensive list). The most famous case, however, is the sequencing of the human genome in the Human Genome Project. This began in 1990 as a government-funded project and used the clone-by-clone technique detailed below. In 1998, the firm Celera Genomics joined the race to sequence the genome; using whole-genome shotgun sequencing, they aimed to proceed at a much faster and more cost-effective pace. A rough draft of the sequence was announced in 2000, and the final draft was published in April 2003.

 

Genome Sequencing Strategies

 

Clone-by-clone shotgun sequencing

 

The general principle behind this strategy is to first construct a library of the DNA, and then sequence parts of the library by first mapping their location and then randomly cutting and sequencing multiple copies of the sequences. Because of the large number of overlaps this generates, the sequences can be reassembled into the genome. See Figure 1 for a pictorial representation.

 

Figure 1: The clone-by-clone shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.

 

Steps in Clone-by-Clone Shotgun Sequencing

 

1. Map Construction

This involves cloning parts of the genome into Yeast Artificial Chromosomes (YACs), for inserts up to ~1 Mb, or Bacterial Artificial Chromosomes (BACs), for smaller inserts of ~200 kb, for easy amplification and manipulation. To map the location of these clones in the genome, sequence-tagged sites (STSs) are used. An STS is a short stretch of DNA unique to a particular region, which is identified via PCR or probe hybridization. Multiple STSs can be used to identify exactly where in the genome a clone is located: a hit matrix of the STSs present in each clone is constructed, and the clones are ordered such that all clones containing the same STS are arranged sequentially (see Figure 2).

 

Figure 2. Mapping the relative location of clones to genomic DNA
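
The hit-matrix ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular mapping project: it assumes the order of the STS markers along the chromosome is already known, and the clone and STS names are hypothetical.

```python
def order_clones_by_sts(clone_sts, sts_order):
    """Order clones left to right using the STS markers each clone contains.

    clone_sts: dict mapping clone name -> set of STS names found in that clone
    sts_order: list of STS names in their (assumed known) genomic order
    """
    position = {sts: i for i, sts in enumerate(sts_order)}

    def span(clone):
        # Leftmost and rightmost marker positions hit by this clone.
        hits = [position[s] for s in clone_sts[clone]]
        return (min(hits), max(hits))

    return sorted(clone_sts, key=span)


def clones_overlap(clone_sts, a, b):
    """Two clones are inferred to overlap if they share at least one STS."""
    return bool(clone_sts[a] & clone_sts[b])


# Hypothetical hit matrix: which STS markers were detected in which clone.
hits = {
    "cloneC": {"sts3", "sts4"},
    "cloneA": {"sts1", "sts2"},
    "cloneB": {"sts2", "sts3"},
}
ordered = order_clones_by_sts(hits, ["sts1", "sts2", "sts3", "sts4"])
print(ordered)
```

Clones that share a marker, such as cloneA and cloneB here, end up adjacent in the ordering, which is the property the hit matrix is arranged to expose.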

 

Another way to map the location of the clones is to use restriction enzymes, which cut DNA at specific sequences. This can be used to generate a restriction fingerprint of each clone, which refers to the pattern obtained from running them on a gel. Clones with shared aspects of their restriction fingerprint overlap, and this can be used to map them onto the genome.

 

The map construction phase is usually the most time consuming part of genome sequencing. In the Human Genome Project for example, it took 8 years!

 

2. Clone selection

The clone map generated is used to select the clones that will generate a minimum tiling pathway such that the fewest possible clones are needed to be sequenced to cover the entire genome (Figure 3). The criterion used to select clones is that they must be authentic: they must map to the right location with absolute certainty.

 

Figure 3: An example of a minimum tiling pathway, highlighted in red, resulting from clone selection.
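
Choosing a minimum tiling path is an interval-covering problem, and a greedy choice is enough to solve it when clone positions are known. The sketch below assumes each clone has already been mapped to start and end coordinates; the clone names and coordinates are made up for illustration.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick the fewest clones that cover [region_start, region_end).

    clones: list of (name, start, end) tuples with mapped coordinates.
    Returns the chosen clone names in order; raises ValueError on a gap.
    """
    remaining = sorted(clones, key=lambda c: c[1])
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered position,
        # keep the one that reaches furthest to the right.
        while i < len(remaining) and remaining[i][1] <= covered_to:
            if best is None or remaining[i][2] > best[2]:
                best = remaining[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone map at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clone map (coordinates in kb).
clone_map = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200), ("D", 150, 300)]
tiling = minimum_tiling_path(clone_map, 0, 300)
print(tiling)
```

Clone B is skipped because clone C starts inside the region already covered by A and extends further, so B adds nothing to the path.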

 

3. Subclone library construction

The YACs or BACs are copied multiple times, and then physically fragmented into 2-5 kb sequences using either sonication, which refers to the shearing of DNA using sound waves, or restriction enzymes (Figure 4). These random fragments are then subcloned into vectors, which can be amplified using the same primers.

 

Figure 4: Subclone library construction

 

4. Random shotgun phase

This consists of reading the sequence of the subclone fragments via the dideoxy termination reaction, and then using bioinformatics programs to reassemble the sequences from the overlaps between fragments (Figure 5).

 

Figure 5: Random shotgun phase
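
The reassembly idea can be illustrated with a toy greedy assembler that repeatedly merges the two reads with the longest exact overlap. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); it is not how Phrap or any production assembler works, and the reads are invented.

```python
def overlap_length(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by longest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # remaining reads stay as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"])
print(contigs)
```

If some reads share no overlap with the rest, the function returns more than one sequence, which corresponds to an assembly left in several contigs.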

 

Dideoxy Termination Reaction:

The dideoxy termination reaction was invented by Fred Sanger. It is so called because it involves DNA synthesis in the presence of a small percentage of dideoxynucleotides, which lack the 3’ OH group required for further elongation of the nucleotide chain. Chain elongation therefore ceases whenever one is incorporated. In the fluorescent version of the method, each of the four dideoxynucleotides is labeled with a distinct fluorophore, so the base present at the terminating position can be identified (Figure 6).

 

The protocol for the dideoxy termination reaction involves melting double stranded DNA to form single stranded DNA, to which the primer common to all the subclones can then hybridize. DNA Polymerase I, the various dNTPs used for regular replication and fluorescently labeled ddNTPs (dideoxy NTPs) are added, and the DNA fragments of different lengths obtained are run on a gel. The read out process can be automated by the addition of a laser to detect the fluorophores running off the gel, a process developed by Leroy Hood and Michael Hunkapiller.

 

Figure 6: Dideoxy termination reaction

 

Bioinformatics Programs Involved:

Phred: base calling – assigns a base according to the signal received, along with a confidence score

Phrap: assembly – aligns sequences

Consed: editing
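
The confidence score attached to each base call is conventionally expressed as a Phred quality value Q, related to the estimated error probability P by Q = -10 log10(P). A small sketch of the conversion follows; the function names are illustrative and are not part of the Phred program itself.

```python
import math


def error_probability(q):
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def phred_quality(p_error):
    """Quality value corresponding to an estimated error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 40):
    print(q, error_probability(q))
```

On this scale, one error in 10,000 bases corresponds to a quality value of 40.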

 

Coverage and Contigs

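Coverage = total number of bases sequenced / size of the region being sequenced (that is, number of reads × read length / genome or clone size).

Contig = a contiguous stretch of sequence assembled from overlapping reads without gaps.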

 

The relationship between coverage and the number of contigs is described by the Lander-Waterman curve (Figure 7), named after the two researchers who derived it.

 

Figure 7: Lander-Waterman Curve displaying the relationship between coverage and contig
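
The shape of the curve can be reproduced from the Lander-Waterman model, in which N reads of length L drawn at random from a region of size G give coverage c = NL/G and an expected number of contigs of N·exp(-cσ), where σ = 1 - T/L and T is the minimum overlap needed to detect that two reads join. The sketch below assumes idealized, uniformly random reads; the numbers in the example are arbitrary.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def expected_contigs(num_reads, read_length, genome_size, min_overlap=0):
    """Expected number of contigs under the Lander-Waterman model."""
    c = coverage(num_reads, read_length, genome_size)
    sigma = 1 - min_overlap / read_length
    return num_reads * math.exp(-c * sigma)


def fraction_uncovered(num_reads, read_length, genome_size):
    """Expected fraction of bases not covered by any read."""
    return math.exp(-coverage(num_reads, read_length, genome_size))


# Arbitrary example: a 100 kb clone sequenced with 500 bp reads.
for n in (200, 600, 1000, 2000):
    print(n, coverage(n, 500, 100_000), expected_contigs(n, 500, 100_000, 40))
```

The number of contigs first rises as reads are added, because isolated reads start new contigs, and then falls as the contigs merge at higher coverage.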

 

5. Directed finishing phase and sequence authentication

The finishing phase involves the closing of any gaps in the sequence by the generation of primers at the end and beginning of fragments that may be contiguous, and running subsequent sequencing reactions until the sequences overlap. This is followed by sequence authentication, via the verification of STS and restriction enzyme fingerprints. A finished genome is defined by the presence of < 1 error or ambiguity in 10,000 bp with sequences assembled in the right order and orientation along a chromosome with no gaps.

 

Advantages of the Clone-by-Clone Shotgun Sequencing: thorough, accurate

Disadvantages: time consuming, costly

 

Whole-genome shotgun sequencing

 

The general principle behind this strategy is to directly cut multiple sequences at random and sequence them. The fact that numerous overlapping sequences will result is then used to reassemble the genome. See Figure 8 for pictorial representation.

 

Figure 8: The whole-genome shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.

 

Unlike Clone-by-Clone Shotgun Sequencing, this strategy skips the time consuming step of physical mapping, and uses jigsaw puzzle assembly (Figure 9). This is a bioinformatics-heavy approach, because of the lack of a clone map.

 

Figure 9: Assembly phase of whole-genome shotgun sequencing method

 

The specific components of shotgun assembly are:

 

Screener

The identification and elimination of low quality reads, contaminations and repeats.

 

Overlapper

This searches the high quality reads for overlaps. The criterion used is an overlap of >= 40 bp with <= 6% mismatches.
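
The overlap criterion can be written down directly. The sketch below checks whether the end of one read matches the start of another over at least 40 bp with at most 6% mismatches. It compares bases position by position and so ignores insertions and deletions, which a real overlapper would have to handle; the sequences are invented.

```python
def best_overlap(a, b, min_len=40, max_mismatch_rate=0.06):
    """Longest suffix-of-a / prefix-of-b overlap meeting the criteria, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-n:], b[:n]))
        if mismatches <= max_mismatch_rate * n:
            return n
    return 0


# Two hypothetical reads sharing a 50 bp stretch.
shared = "ATGCGTACCTGAAGTCCGATTGCAGGCTAACGTTAGCCATGGATCCGTAC"
read_a = "G" * 30 + shared
read_b = shared + "C" * 30
print(best_overlap(read_a, read_b))
```

Lowering `min_len` or raising `max_mismatch_rate` finds more overlaps but also more spurious ones, which is why repeats are screened out before this step.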

 

Unitigger

The assembly of the easy subset (unique assembly from the high quality reads) first.

 

Scaffold and Repeat Resolution

Generation of YAC and BAC clones to aid in creating a scaffold for the genome. The two ends of the fragments, called read pairs, are sequenced, which indicates how far apart the fragments are, as well as locating and orienting them on chromosomes, using available physical map information.

 

Consensus

 

Advantages: faster, less expensive

Disadvantages: requires physical map from clone-by-clone approach to make scaffold

 

Hybrid method

 

This approach incorporates both the Clone-by-Clone Shotgun sequencing and Whole-Genome Shotgun strategies. The optimal ratio of the use of the two is still being determined, and meanwhile 8-10 x coverage is needed. Higher eukaryotes for example, require more clone-by-clone (although this is on the decline with the advent of comparative genomics), but the genomes of simpler organisms such as bacteria can be determined by the whole-genome shotgun method alone.

 

Figure 10: Hybrid method combining Clone-by-Clone Shotgun sequencing and Whole-Genome Shotgun strategies

 

Automatic Annotation

 

Automatic annotation software helps predict genes, regulatory elements, and coding sequences by recognizing structural or functional features in the nucleotide sequence of the genome.

 

Since prokaryotes and eukaryotes differ in their post-transcriptional modifications, gene prediction strategies for each are also different.

 

Key Biological Terms

 

Introns: Sequences spliced out of mRNA to form mature mRNA

 

Exons: Sequences that are not spliced out of mRNA. Only exons are present in mature mRNA

 

Codon: A three-nucleotide sequence of bases that codes for a particular amino acid. The genetic code is degenerate, that is, multiple codons may code for the same amino acid. In addition, there are stop codons, which do not code for any amino acid and cause termination of the nascent peptide.

 

Open Reading Frames (ORFs): An ORF is defined as a sequence of codons that starts with the start codon AUG, ends with one of the stop codons UAA, UAG, or UGA, and has no stop codons in between.

 

There are two categories of automatic annotation: ab initio methods, and comparative genomics methods.

 

Ab initio (intrinsic)

 

Ab initio methods recognize gene annotations based on profiles of genes and other functional elements.

 

Advantages

 

  • Does not require reference genomes or other extrinsic data
  • Can recognize nearly universal rules governing functional elements, such as coding sequences starting with start codons.

 

Disadvantages

 

  • Generally requires tailoring to different types of genomes- particularly eukaryotic vs. prokaryotic
  • Is limited to functional elements that are known and can be programmed into the algorithm

 

Methods

 

Ab initio techniques differ greatly between prokaryotes and eukaryotes, because differences in post-transcriptional modification cause their genes to have very different structures and appearances.

 

Prokaryote Methods

This involves a combination of the following, which are used to create a coherent gene model complete with intergenic DNA. GLIMMER software, in particular, can be used to predict genes with 98-99% accuracy.

 

Identification of ORFs

 

Possible ORFs are identified. Note that each DNA sequence has 6 possible reading frames: 3 forward, and 3 on the reverse strand. The software can eliminate ORFs that are too short, since they are unlikely to be coding sequences.
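
A six-frame ORF scan is short enough to sketch. The version below works on DNA rather than mRNA, so the start codon is ATG and the stop codons are TAA, TAG and TGA. It is a simplified illustration: it only reports ORFs that begin at the first ATG after the previous stop, and the `min_codons` cutoff is an arbitrary parameter, not a value from any specific gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) for ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand being
    read (for "-" they index the reverse complement). The codon count
    includes the start and stop codons.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


example = "CCATGGCTGCTGCTTAAG"
print(find_orfs(example, min_codons=3))
```

With the default cutoff of 30 codons the short example ORF is discarded, which mirrors the length filter described above.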

 

Scoring windows with coding statistics

 

A coding statistic is defined as a function that calculates the probability that a given sequence is coding for a protein. Most coding statistics measure one or more of the following:

 

  • Codon bias: This is defined as the probability that a certain codon will be used to represent an amino acid in a gene, in preference to the other synonymous codons for the same amino acid. The bias is broadly shared by the genes of a species, in that the relative abundance of the specific codons used to represent an amino acid is more or less similar. Codon bias also varies between genes: because bias is correlated with tRNA abundance through selection, highly expressed genes display a different codon bias from genes expressed at lower levels.
  • Codon pair (hexamer) enrichment: This bases the coding statistic on the occurrence of specific pairs of codons in the sequence.
  • Amino acid usage: Depending on the presence of certain amino acids over others in the ORF of the sequence
  • Amino acid pair preference: This takes into account the pattern of occurrence of pairs of amino acids next to each other
  • GC content: This measures percentage of the sequence that is composed of GC base pairs.
  • Third Position in codon: These are usually variable, with different bases at this position coding for the same amino acid. Many genomes display a sequence-wide preference for which particular base is at this third position.

 

Most of the measured features are conserved across the genome. A training set of known genes from the same genome is needed to be able to compare the different measurements to find the coding statistics, such that the ORF that is most likely to code for a protein can be identified.
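
One simple coding statistic built from a training set is a codon-usage log-likelihood ratio: codon frequencies are estimated from known genes and compared with a uniform background. The sketch below is a toy version of that idea, not the statistic used by GLIMMER or any other named program; the training sequence and the pseudocount are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codons_of(seq):
    """Split a sequence into complete in-frame codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_model(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from a training set of known genes."""
    counts = Counter()
    for gene in known_genes:
        counts.update(codons_of(gene))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(orf, model):
    """Log-likelihood ratio of the ORF under the model vs. uniform usage."""
    background = 1 / len(CODONS)
    return sum(math.log(model[c] / background)
               for c in codons_of(orf) if c in model)


# Tiny, made-up training set; a real one would hold many known genes.
model = train_codon_model(["ATGGCTGCTGCTTAA"])
print(coding_score("GCTGCTGCT", model), coding_score("CCCCCCCCC", model))
```

A positive score means the ORF uses codons the way the training genes do; a negative score means its codon usage looks more like background.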

 

Identification of individual gene architecture elements

 

All genes contain certain structural elements aiding in the various stages of transcription, termination, translation, etc. These can be used to pinpoint the location of genes. Some examples include:

 

  • Promoter Sequence: This is found near the start site of transcription, and represents binding sites for various transcription factors. The sigma-factor-dependent promoter element at the –10 position, called the Pribnow box (consensus TATAAT), is particularly useful. The consensus sequence of prokaryotic promoter motifs is shown in Figure 11.

 

Figure 11: Typical Bacterial Promoter. The consensus sequence is shown in black, with the relative abundance of each base in red.

 

  • Transcription Termination Signals: These are found near the end of transcribed sequences, and can be rho-dependent, involving 70-80 nucleotide long regions rich in C, or rho-independent, which form hairpin loops.
  • Shine-Dalgarno Sequences: these are found near translation start sites, and represent ribosome binding sites.
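
Searching for such elements often starts with a scan for near matches to a consensus sequence. The sketch below allows a fixed number of mismatches and uses the commonly cited E. coli consensus sequences TATAAT (–10 element) and AGGAGG (Shine-Dalgarno). Real gene finders usually use position weight matrices instead, so treat this as a simplified stand-in; the test sequence is invented.

```python
def find_motif(seq, consensus, max_mismatches=1):
    """Return (position, matched_text, mismatches) for near-consensus hits."""
    hits = []
    width = len(consensus)
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


MINUS_10 = "TATAAT"
SHINE_DALGARNO = "AGGAGG"

upstream = "CCCCTATGATCCCCCCCCAGGAGGCCCCATG"
print(find_motif(upstream, MINUS_10))
print(find_motif(upstream, SHINE_DALGARNO, max_mismatches=0))
```

Allowing mismatches matters because real promoters rarely match the consensus exactly, but it also raises the number of chance hits, so motif hits are normally combined with the coding statistics rather than used alone.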

 

 

Eukaryote Specific

 

Eukaryotic gene prediction differs from prokaryotic gene prediction in that it must take intron splicing and post-transcriptional modifications into account too (Figure 12).

 

Figure 12: Various steps leading to translation in Eukaryotes.

 

  • Coding Statistics: Similar to that in prokaryotes.
  • Model Gene Structure: Structural elements to look for include:

- Transcription factor binding motifs

- CpG islands: CG dinucleotides occur throughout the genome. The C is usually methylated, leading to depletion of CG over time, as methyl-C is easily mutated to T. Near promoter regions, however, C remains undermethylated under selective pressure, leading to CpG islands. These are typically a few hundred base pairs long or more, and have a high abundance of C’s and G’s.

- Transcription Initiation Signature: This consists of TATA boxes, similar to prokaryotic promoters, but these are less common.

- RNA Cap signature: The cap is usually added after a CAG sequence.

- Translation Initiation: The Kozak consensus is found 20 nts from the cap. See Figure 13.

 

Figure 13: Kozak consensus sequence at translation initiation site

 

- Splicing Signature: This includes conserved sites defining exons and directing splicing

- Poly-A Signal Sequence: This is a conserved sequence, AAUAAA in the transcript (AATAAA in the DNA), that occurs near the end of the transcript and contributes to transcription termination.
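
Of the signals listed above, CpG islands are the easiest to quantify. A common approach compares the observed number of CG dinucleotides in a window with the number expected from the C and G content alone. The thresholds in the sketch below (GC content above 50% and an observed/expected ratio above 0.6) are widely used conventions and are an assumption here, not values taken from this page.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_content = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to a candidate window."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content > min_gc and obs_exp > min_obs_exp


print(cpg_stats("CGCGCGCG"))
print(cpg_stats("CCCCGGGG"))
```

In practice the test is applied to sliding windows along the genome, and neighbouring windows that pass are merged into a single island.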

 

 

GenScan is a program used for Eukaryotic gene prediction, which uses a Hidden Markov Model with hidden states including exons, introns, intergenic regions, promoters, etc. It also takes hexamer coding sequences and matrix profiles for gene structure into account. Genscan requires training sequences with known coding/noncoding regions to determine probabilities, as well as to train for coding statistics, before it can predict with accuracy.

 

GeneMark also performs gene prediction using a hidden Markov model. Its hidden states include states for initial, internal, and terminal exons, as well as introns, intergenic regions, and donor and acceptor splice sites. GeneMark provides hidden Markov models for both prokaryotic and eukaryotic genomes, as well as a self-training version, GeneMarkS.
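
The way a hidden Markov model labels a sequence can be shown with a two-state toy model and the Viterbi algorithm, which finds the most probable sequence of hidden states. This is far simpler than the models in GenScan or GeneMark: the two states ("N" for noncoding, "C" for coding) and all probabilities below are invented for illustration.

```python
import math

STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},  # AT-rich noncoding
    "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},  # GC-rich coding
}


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for a non-empty observation sequence."""
    # Log-probabilities avoid numerical underflow on long sequences.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
              for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((p, score[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1],
            )
            score[t][s] = best + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


labels = "".join(viterbi("AAAAAAGGGGGG", STATES, START, TRANS, EMIT))
print(labels)
```

Training a real gene finder amounts to estimating the transition and emission probabilities from sequences with known coding and noncoding regions, as described for GenScan above.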

 

Comparative Genomics (extrinsic)

 

A common recent strategy has been to use comparative genomics to annotate functional regions. While implementations differ, the basic strategy of such programs is to use either alignment or BLAST algorithms to map annotations from a reference genome to a newly sequenced genome.

 

Advantages

 

  • Since coding sequences evolve more slowly than non-coding sequences, there are usually significant homologies in genes between related genomes, which can be exploited to recognize them
  • Isn't species specific- the algorithm can be defined and then applied to different species using appropriate reference libraries

 

Disadvantages

 

  • The algorithm is not de novo, so it requires extrinsic data (annotated reference genomes)
  • Genomic alignments are highly computationally intensive

 

Methods

 

  • To Expressed Sequence Tags (ESTs): Mature mRNAs are converted to cDNAs and used as EST sequences, to which sequences to be tested are compared to look for similarities. Downsides to this approach include a 3’ bias, because the primer used is against the poly-A tail, and the presence of contaminating sequences. The lack of uniformity is also a problem, as highly expressed genes are enriched, and not all genes are seen due to different expression patterns.

 

  • To known gene or protein: Programs to do this include BLASTN, BLASTP, Psi-BLAST (yields protein families). Downsides include the fact that novel genes can’t be found, and pseudogenes can’t be distinguished from active genes.

 

  • Conservation between genomes: TwinScan is a program that uses BLAST comparisons between sequenced genomes. It is built upon GenScan, with the addition of the probability of generating conservation from each state. Highly conserved regions in particular can be used to make gene predictions, since many exon sequences are conserved.
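
Several of the comparisons above rest on local alignment. The sketch below computes a Smith-Waterman local alignment score with a simple linear gap penalty. The scoring values are arbitrary assumptions, and BLAST itself uses a faster heuristic rather than this exhaustive dynamic programming.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, so an alignment can start anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("GGACGTGG", "CCACGTCC"))
```

A high score between a query region and an annotated gene or EST is the kind of evidence the comparative methods use to transfer an annotation.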

 

Software

 

A wide variety of software that implements these methods exists. A table of such software can be found below, or here.

 

 

 

Program | Organism | Databank or required input | Alignment | Gene reconstruction

AAT (66) (http://genome.cs.mtu.edu/aat.html) | Primates, rodents, other | cDNA, protein | DDS (improved BLASTX), DPS (improved BLASTN) | NAP, GAP2
ALN (62) | | Protein | Tron code, PAM 250 |
CEM (81) | | Two genomic sequence | BLASTX output, WMM for sites | DP
EbEST (73) (http://ares.ifrc.mcw.edu/EBEST/ebest.html) | Human, other | dbEST | BLASTN, EST clustering, Smith–Waterman-based gapped alignment | 3'-UTR detection, assembly of EST-tagged exons
Est2genome (74) | | EST or cDNA, preferably BLASTN output | Modified Smith–Waterman Needleman–Wunsch algorithm | No
GeneSeqer (67,68) (http://bioinformatics.iastate.edu/cgi-bin/gs.cgi) | Arabidopsis, maize, generic plant | dbEST or EST database or proteins | Spliced alignment, splice recognition with SplicePredictor if missing EST match | Yes
GeneWise (60) (http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml) | Human | One protein or a HMM profile | Global alignment translated ORF/protein | DP (dynamite)
GENQUEST (http://compbio.ornl.gov/Grail-bin/EmptyGenquestForm) | | dbEST, SwissProt, Prosite, BLOCKS, GSDB | Smith–Waterman, Blast, Fasta |
ICE (64) (http://theory.lcs.mit.edu/ice) | | dbEST, OWL | Look-up | DP
INFO (63) | | Nr | 25mer look-up table, protein/protein alignments scored with PAM 40, PAM 120, PAM 250, BLO62 | No
ORFgene2 (61) (http://l25.itba.mi.cnr.it/~webgene/wwworfgene2.html) | Human, mouse, Drosophila, Aspergillus, Arabidopsis, Caenorhabditis | SwissProt | BlastP, WAM for splice sites, identity score on frequencies of dipeptides | Compatibility graph, DP
PredictGenes (http://cbrg.inf.ethz.ch/Server/subsection3_1_8.html) | Invertebrates, vertebrates, prokaryotes, plants | SwissProt | PAM 250 | DP
PROCRUSTES (59) (http://www-hto.usc.edu/software/procrustes/wwwserv.html) | Vertebrates | One homologous protein | Protein/protein alignments scored with PAM 120 | DP
Pro-Gen (83) (http://www.anchorgen.com/pro_gen/pro_gen.html) | | Two genomic sequences | Alignment of translated sequences scored with PAM 120 | DP
ROSETTA (80) (http://crossspecies.lcs.mit.edu/) | Human, mouse | Two genomic sequences | GLASS (global alignment system), PAM 20, Genscan method for splice sites | DP
SGP-1 (82) (http://soft.ice.mpg.de/sgp-1) | Vertebrates, angiosperms | Two genomic sequences or a pairwise local alignment output | Local alignment | DP
SIM4 (69) | All eukaryotes | cDNA/genomic | HSP from Blast | No
SLAM (85) (http://baboon.math.berkeley.edu/~syntenic/slam.html) | Human, mouse | Two genomic sequences | Generalized pair HMM | DP
Spidey (70) (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/index.html) | Vertebrates, Drosophila, C.elegans, plant | One genomic sequence/set of mRNAs | Two Blasts: high stringency and low stringency |
SYNCOD (72) (http://l25.itba.mi.cnr.it/~webgene/wwwsyncod.html) | Human, mouse, Drosophila, Arabidopsis, Aspergillus, Caenorhabditis | BLASTN output | Silent/replacement ratio, Monte Carlo simulations | No
TAP (75) (http://sapiens.wustl.edu/~zkan/TAP/) | Human, mouse, Drosophila | dbEST | WU-BLASTN, SIM4 | Yes
Utopia (84) | All eukaryotes | Two genomic sequences | Local alignment | Yes

 

Combination of different approaches

 

Programs such as GeneMachine perform both predictive and comparative gene predictions, using:

  • GRAIL/MZEF: coding exon prediction
  • GENSCAN/HMMGenes/FGenes
  • RepeatMasker/Sputnik: complexity and interspersed repeat prediction
  • BLAST: searches for sequence homology

 

Gene Prediction Complications

 

Complicating factors include:

  • Sequencing errors
  • Pseudogenes: genes without protein-coding ability, but with typical structural elements of genes
  • Short exons coding for only a few amino acids
  • RNA coding genes
  • Genes within genes
  • Unusual start codons
  • Ribosomal frameshifting
  • Introns in the 5’ UTR
  • Alternative Splicing

 

Conclusions

 

Genome sequencing can be done by the clone-by-clone shotgun method, the whole-genome shotgun method or, more recently, by a combination of the two. The finished sequence has a wide variety of applications, including gene prediction. Gene prediction differs in prokaryotes and eukaryotes because of the differences in structure of the genomes. It is aided by looking at codon statistics, and looking for particular structural motifs associated with genes.

 

Although the genomes of many organisms, including humans, have been sequenced, many more remain to be sequenced in the future. In addition, gene prediction is and will continue to be an extremely useful tool in research and medicine.

 

References

 

Sources:

 

Useful Software/Web-based Applications:

 

Selected Papers:

*

Current methods of gene prediction, their strengths and weaknesses
