Genome Sequencing and Assembly
by Hala Iqbal
Introduction
Biological Background
Some key terms
Gene: The Human Gene Nomenclature Committee defines a gene as “a DNA segment that contributes to phenotype/function. In the absence of demonstrated function a gene may be characterized by sequence, transcription or homology”. In essence, a gene is comprised of transcribed DNA (usually coding for a protein) and any flanking sequences required for the regulation of its transcription.
Genome: The genome of an organism refers to all its genetic information, present in the form of DNA. This includes both genes, as well as non-coding sequences.
Genome Sequencing: This refers to the identification of the bases present at each position of the genome of a species. Usually, the DNA to be sequenced is taken from various samples, because most of the sequences, especially those encoding critical proteins such as enzymes, ligands, etc., are exactly the same across individuals of a species. In humans, for example, more than 99% of the genome is identical in different individuals (Source).
Applications of Genome Sequencing: why sequence the genome anyway?
Elucidation of the genome sequence is important for various reasons, the most important of which is gene mapping and prediction. This has application in numerous fields, for example:
- Medicine, including the diagnosis and early detection of genetically inherited diseases; development of gene therapy and custom-fit drugs (pharmacogenomics)
- Alternative Energy sources, including the development of biofuels utilizing photosynthetic and microbial systems, gleaned from the study of bacterial genomics
- Evolutionary studies, including lineage studies and phylogenetics
- DNA Forensics, involving the use of DNA to identify individuals for crime tracking.
Source
Sequenced Organisms
The genomes of many species have already been sequenced (See comprehensive list). The most famous case, however, is the sequencing of the human genome, called the Human Genome Project. This began in 1990 as a government-funded project, and used the clone-by-clone technique detailed below. In 1998, the firm Celera Genomics joined the race to sequence the genome; using the technique of whole-genome shotgun sequencing, they aimed to proceed at a much faster and more cost-effective pace. A rough draft of the sequence was announced in 2000, and the final draft was published in April 2003.
Genome Sequencing Strategies
Clone-by-clone shotgun sequencing
The general principle behind this strategy is to first construct a library of the DNA, and then sequence parts of the library by first mapping their location and then randomly cutting and sequencing multiple copies of the sequences. Because of the large number of overlaps this generates, the sequences can be reassembled into the genome. See Figure 1 for a pictorial representation.
Figure 1: The clone-by-clone shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.
Steps in Clone-by-Clone Shotgun Sequencing
1. Map Construction
This involves the cloning of parts of the genome into Yeast Artificial Chromosomes (YACs) for sequences up to ~1 Mb, and Bacterial Artificial Chromosomes (BACs) for smaller sequences of ~200 kb, for easy amplification and manipulation. To map the location of these smaller sequences to the genome, Sequence-tagged sites (STS) are used. These consist of DNA unique to a particular region, which is identified via PCR or probe hybridization. Multiple STS can be used to identify exactly where in the genome the sequence is located: a hit matrix of the STS present in each clone is constructed, and the clones are ordered such that all clones containing the same STS are arranged sequentially (See figure 2).
Figure 2. Mapping the relative location of clones to genomic DNA
Another way to map the location of the clones is to use restriction enzymes, which cut DNA at specific sequences. This can be used to generate a restriction fingerprint of each clone, which refers to the pattern obtained from running them on a gel. Clones with shared aspects of their restriction fingerprint overlap, and this can be used to map them onto the genome.
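The fingerprint comparison can be sketched in a few lines of Python. This is only an illustration: the fragment sizes, the 2% size tolerance and the function name `shared_bands` are assumptions made for the example, not values from any particular mapping project.

```python
def shared_bands(fp_a, fp_b, tolerance=0.02):
    """Count fragments of fp_a that match a not-yet-used fragment of fp_b.

    fp_a, fp_b: lists of restriction fragment sizes (bp) read off a gel.
    tolerance:  relative size difference still treated as "the same band"
                (an assumed value; gels cannot resolve sizes exactly).
    """
    remaining = sorted(fp_b)
    shared = 0
    for size in sorted(fp_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared += 1
                del remaining[i]  # each band may be matched only once
                break
    return shared


# Hypothetical fingerprints for three clones
clone_a = [1200, 3400, 5100, 800, 2250]
clone_b = [3410, 5090, 800, 7600, 950]
clone_c = [400, 610, 9900]

print(shared_bands(clone_a, clone_b))  # several shared bands: likely overlap
print(shared_bands(clone_a, clone_c))  # no shared bands: no evidence of overlap
```

Clones sharing many bands are candidates for neighbours on the map; real mapping software also weighs how likely such matches are by chance.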
The map construction phase is usually the most time consuming part of genome sequencing. In the Human Genome Project for example, it took 8 years!
2. Clone selection
The clone map is used to select the clones that form a minimum tiling pathway, that is, the smallest set of clones that must be sequenced to cover the entire genome (Figure 3). The criterion used to select clones is that they must be authentic: they must map to the right location with absolute certainty.
Figure 3: An example of a minimum tiling pathway, highlighted in red, resulting from clone selection.
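Once every clone has start and end coordinates on the map, choosing a minimum tiling pathway is an interval-covering problem. The sketch below uses the standard greedy approach (always take the clone that reaches furthest to the right among those starting inside the region already covered). The clone names and coordinates are invented, and a real project would also demand a minimum overlap between neighbouring clones.

```python
def minimum_tiling_path(clones, start, end):
    """Pick the fewest clones that cover the map interval [start, end].

    clones: list of (name, left, right) map coordinates.
    Raises ValueError if the clone map has a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # by left coordinate
    path = []
    covered_to = start
    i = 0
    while covered_to < end:
        best = None
        # Among clones starting inside the covered region, keep the one
        # extending furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone map at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clone map (coordinates in kb)
clones = [
    ("A", 0, 150), ("B", 50, 200), ("C", 100, 320),
    ("D", 180, 330), ("E", 300, 480), ("F", 310, 500),
]
print(minimum_tiling_path(clones, 0, 500))
```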
3. Subclone library construction
The YACs or BACs are copied multiple times, and then physically fragmented into 2-5 kb sequences using either sonication, which refers to the shearing of DNA using sound waves, or restriction enzymes (Figure 4). These random fragments are then subcloned into vectors, which can be amplified using the same primers.
Figure 4: Subclone library construction
4. Random shotgun phase
This consists of reading the sequence of the subclone fragments via the dideoxy termination reaction, and then using bioinformatics programs to reassemble the sequences from their overlaps (Figure 5).
Figure 5: Random shotgun phase
Dideoxy Termination Reaction:
The dideoxy termination reaction was invented by Fred Sanger. It is so called because DNA is copied by a polymerase, in a reaction similar to PCR, in the presence of a small percentage of fluorescently labeled dideoxynucleotides, which lack the 3’ OH group required for further elongation of the nucleotide chain. Chain elongation therefore ceases wherever one of them is incorporated, and because each type of dideoxynucleotide is labeled with a distinct fluorophore, the base present at that position can be identified (Figure 6).
The protocol for the dideoxy termination reaction involves melting double stranded DNA to form single stranded DNA, to which the primer common to all the subclones can then hybridize. DNA Polymerase I, the various dNTPs used for regular replication and fluorescently labeled ddNTPs (dideoxy NTPs) are added, and the DNA fragments of different lengths obtained are run on a gel. The read out process can be automated by the addition of a laser to detect the fluorophores running off the gel, a process developed by Leroy Hood and Michael Hunkapiller.
Figure 6: Dideoxy termination reaction
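The logic of reading a sequence from terminated chains can be illustrated with a small simulation. This is a toy model: it assumes exactly one terminated fragment for every position of the newly synthesized strand and a perfectly resolved gel, and the function names are made up for the example.

```python
import random


def terminated_fragments(new_strand):
    """Simulate chain termination: one fragment per position, recorded as
    (fragment length, fluorescent label of the terminating ddNTP)."""
    return [(i + 1, base) for i, base in enumerate(new_strand)]


def read_ladder(fragments):
    """Sort fragments by length, as a gel does, and read the labels in order."""
    return "".join(base for _, base in sorted(fragments))


fragments = terminated_fragments("GATTACA")
random.Random(0).shuffle(fragments)  # fragments are mixed in the reaction tube
print(read_ladder(fragments))        # sorting by size recovers the sequence
```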
Bioinformatics Programs Involved:
Phred: base calling – assigns a base according to the signal received, along with a confidence score
Phrap: assembly – aligns sequences
Consed: editing
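The confidence score that Phred attaches to each base is a logarithmic transform of the estimated error probability, Q = -10 log10(P). A short sketch of the conversion in both directions (the function names are ours, not part of Phred):

```python
import math


def phred_quality(p_error):
    """Quality score for an estimated base-calling error probability."""
    return -10 * math.log10(p_error)


def error_probability(quality):
    """Inverse transform: error probability for a given quality score."""
    return 10 ** (-quality / 10)


for q in (10, 20, 30, 40):
    print(f"Q{q}: 1 error expected in {round(1 / error_probability(q))} bases")
```

A score of 20 therefore corresponds to a 1% chance that the base call is wrong, and a score of 30 to a 0.1% chance.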
Coverage and Contigs
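Coverage = total bp sequenced / size of the target sequence (clone or genome). For example, 10x coverage of a 200 kb BAC means that 2 Mb of reads have been collected.
Contig = a set of overlapping sequences that can be assembled into one contiguous stretch without gaps
The relationship between coverage and the number of contigs is called the Lander-Waterman curve (Figure 7), after Eric Lander and Michael Waterman, who derived it.
Figure 7: Lander-Waterman Curve displaying the relationship between coverage and number of contigs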
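The shape of the curve can be reproduced from the Lander-Waterman model. With N reads of length L from a target of length G, coverage is c = NL/G, the expected number of contigs is N·exp(-c·(1 - T/L)), where T is the minimum overlap needed to detect that two reads join, and the expected fraction of the target left unsequenced is exp(-c). The sketch below assumes idealized, uniformly random reads; the 200 kb target and 500 bp read length are illustrative values only.

```python
import math


def coverage(n_reads, read_len, target_len):
    """Average number of times each base of the target is sequenced."""
    return n_reads * read_len / target_len


def expected_contigs(n_reads, read_len, target_len, min_overlap=0):
    """Lander-Waterman expected number of contigs."""
    c = coverage(n_reads, read_len, target_len)
    sigma = 1 - min_overlap / read_len
    return n_reads * math.exp(-c * sigma)


def fraction_uncovered(c):
    """Expected fraction of the target not covered by any read."""
    return math.exp(-c)


target, read_len = 200_000, 500  # e.g. one BAC, Sanger-length reads
for n_reads in (200, 400, 800, 2000, 4000):
    c = coverage(n_reads, read_len, target)
    contigs = expected_contigs(n_reads, read_len, target, min_overlap=40)
    print(f"{c:4.1f}x  contigs ~ {contigs:7.1f}  uncovered ~ {fraction_uncovered(c):.5f}")
```

The number of contigs first rises (more reads, few overlaps) and then falls sharply as coverage increases, which is why high coverage is needed before the finishing phase.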
5. Directed finishing phase and sequence authentication
The finishing phase involves the closing of any gaps in the sequence by the generation of primers at the end and beginning of fragments that may be contiguous, and running subsequent sequencing reactions until the sequences overlap. This is followed by sequence authentication, via the verification of STS and restriction enzyme fingerprints. A finished genome is defined by the presence of < 1 error or ambiguity in 10,000 bp with sequences assembled in the right order and orientation along a chromosome with no gaps.
Advantages of the Clone-by-Clone Shotgun Sequencing: thorough, accurate
Disadvantages: time consuming, costly
Whole-genome shotgun sequencing
The general principle behind this strategy is to directly cut multiple sequences at random and sequence them. The fact that numerous overlapping sequences will result is then used to reassemble the genome. See Figure 8 for pictorial representation.
Figure 8: The whole-genome shotgun sequencing method for sequencing genomes. The genome is depicted as an encyclopedia, with each volume representing an individual chromosome.
Unlike Clone-by-Clone Shotgun Sequencing, this strategy skips the time consuming step of physical mapping, and uses jigsaw puzzle assembly (Figure 9). This is a bioinformatics-heavy approach, because of the lack of a clone map.
Figure 9: Assembly phase of whole-genome shotgun sequencing method
The specific components of shotgun assembly are:
Screener
The identification and elimination of low quality reads, contaminations and repeats.
Overlapper
This searches the high quality reads for overlaps. The criteria used are an overlap of >= 40 bp with <= 6% mismatches.
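The overlap criteria can be illustrated with a naive suffix-prefix comparison. This sketch counts substitutions only (no insertions or deletions) and compares a single pair of reads by brute force; a production overlapper uses indexing to avoid comparing every pair and aligns with gaps. The test reads are cut from a random sequence generated for the example.

```python
import random


def best_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Longest suffix of read a that matches a prefix of read b.

    Returns (overlap length, mismatches), or None if no overlap of at
    least min_len bases has a mismatch fraction <= max_mismatch.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_mismatch * length:
            return length, mismatches
    return None


# Hypothetical reads cut from a random 120 bp "genome"
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(120))
read_1 = genome[0:80]
read_2 = genome[30:110]   # shares 50 bp with the end of read_1
read_3 = genome[60:120]   # shares only 20 bp: below the 40 bp threshold

# Introduce one sequencing error into the overlapping part of read_2
wrong = "A" if read_2[10] != "A" else "C"
read_2_err = read_2[:10] + wrong + read_2[11:]

print(best_overlap(read_1, read_2))
print(best_overlap(read_1, read_2_err))
print(best_overlap(read_1, read_3))
```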
Unitigger
The assembly of the easy subset (unique assembly from the high quality reads) first.
Scaffold and Repeat Resolution
Generation of YAC and BAC clones to aid in creating a scaffold for the genome. The two ends of the fragments, called read pairs, are sequenced, which indicates how far apart the fragments are, as well as locating and orienting them on chromosomes, using available physical map information.
Consensus
Advantages: faster, less expensive
Disadvantages: requires physical map from clone-by-clone approach to make scaffold
Hybrid method
This approach incorporates both the Clone-by-Clone Shotgun sequencing and Whole-Genome Shotgun strategies. The optimal ratio of the use of the two is still being determined, and meanwhile 8-10 x coverage is needed. Higher eukaryotes for example, require more clone-by-clone (although this is on the decline with the advent of comparative genomics), but the genomes of simpler organisms such as bacteria can be determined by the whole-genome shotgun method alone.
Figure 10: Hybrid method combining Clone-by-Clone Shotgun sequencing and Whole-Genome Shotgun strategies
Gene Prediction Algorithms
Differences in gene topology, including the number of genes and the distribution of intergenic sequences, necessitate the use of gene prediction software. In addition, prokaryotic and eukaryotic gene prediction strategies differ because of differences in post-transcriptional modifications.
Key Biological Terms
Introns: Sequences spliced out of mRNA to form mature mRNA
Exons: Sequences that are not spliced out of mRNA. Only exons are present in mature mRNA
Codon: A three-nucleotide sequence that codes for a particular amino acid. The genetic code is degenerate, that is, multiple codons may code for the same amino acid. There are also stop codons, which do not code for any amino acid and cause termination of the nascent peptide.
Open Reading Frames (ORFs): An ORF is defined as a sequence of codons that starts with a start codon, ends with a stop codon, and has no stop codons in between.
Prokaryotic gene prediction
This involves a combination of the following, which are used to create a coherent gene model complete with intergenic DNA. GLIMMER software, in particular, can be used to predict genes with 98-99% accuracy.
Identification of ORFs
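Possible ORFs are identified. Note that each DNA sequence can be read in 3 reading frames on each strand (6 in total for double-stranded DNA), and each frame is searched for ORFs. Short ORFs are unlikely to code for proteins.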
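A minimal ORF scan over all six reading frames might look like the following. It assumes ATG as the only start codon and the standard stop codons, reports coordinates on the strand being read, and uses a tiny made-up sequence; the `min_codons` cut-off is a parameter of this sketch, not a recommended value.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) for every ATG...stop ORF.

    min_codons counts the start and stop codons. start/end are 0-based,
    end-exclusive positions on the strand that was read.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the stop codon
    return orfs


print(find_orfs("AAATGGCCTAAGG"))
```

In practice the length cut-off is set much higher, since short ORFs arise frequently by chance.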
Scoring windows with coding statistics
A coding statistic is defined as a function that calculates the probability that a given sequence is coding for a protein. Most coding statistics measure one or more of the following:
- Codon bias: This is defined as the probability that a certain codon will be used to represent an amino acid in a gene, over its other degenerate codons coding for the same amino acid. This is seen to exist among organisms within the same species, in that the relative abundance of the specific codons used to represent an amino acid is more or less similar. Codon bias is also seen in the sequence of genes: because bias is correlated with tRNA abundance through selection, highly transcribed genes display a different codon bias to genes displaying a lower amount of transcription.
- Codon pair (hexamer) enrichment: This bases the coding statistic on the occurrence of specific pairs of codons in the sequence.
- Amino acid usage: Depending on the presence of certain amino acids over others in the ORF of the sequence
- Amino acid pair preference: This takes into account the pattern of occurrence of pairs of amino acids next to each other
- GC content: This measures percentage of the sequence that is composed of GC base pairs.
- Third Position in codon: These are usually variable, with different bases at this position coding for the same amino acid. Many genomes display a sequence-wide preference for which particular base is at this third position.
Most of the measured features are conserved across the genome. A training set of known genes from the same genome is needed to be able to compare the different measurements to find the coding statistics, such that the ORF that is most likely to code for a protein can be identified.
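As a rough sketch of how a training set turns into a coding statistic, the code below estimates codon frequencies from known genes and scores a candidate ORF by the summed log-odds of its codons against a uniform background. The two "training genes" and the candidate ORFs are invented, and real gene finders use far richer models than this single statistic.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a sequence into complete, in-frame codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_frequencies(training_genes, pseudocount=1.0):
    """Codon frequencies in the training set, smoothed with a pseudocount."""
    counts = Counter()
    for gene in training_genes:
        counts.update(codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(orf, freqs):
    """Log-odds of the ORF's codons versus a uniform 1/64 background."""
    return sum(math.log(freqs[c] * 64) for c in codons(orf) if c in freqs)


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical training genes and candidate ORFs
training = ["ATGGCTGCTGAAGCTTAA", "ATGGAAGCTGCTGAATAA"]
freqs = codon_frequencies(training)
typical = "ATGGCTGAAGCTTAA"   # uses codons common in the training set
atypical = "ATGCGACGACGATAA"  # uses a codon never seen in training
print(coding_score(typical, freqs), coding_score(atypical, freqs))
```

The ORF built from codons that are common in the training genes receives the higher score.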
Identification of individual gene architecture elements
All genes contain certain structural elements aiding in the various stages of transcription, termination, translation, etc. These can be used to pinpoint the location of genes. Some examples include:
- Promoter Sequence: This is found near the start site of transcription, and represents binding sites for various transcription factors. The sigma-dependent promoter element at the –10 position, called the Pribnow box (consensus TATAAT, the bacterial counterpart of the eukaryotic TATA box), is particularly useful. The consensus sequence of prokaryotic promoter motifs is shown in figure 11.
Figure 11: Typical Bacterial Promoter. The consensus sequence is shown in black, with the relative abundance of each base in red.
- Transcription Termination Signals: These are found near the end of transcribed sequences, and can be rho-dependent, which rely on 70-80 nucleotide long regions rich in C, or rho-independent, which form hairpin loops.
- Shine-Dalgarno Sequences: these are found near translation start sites, and represent ribosome binding sites.
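Searching for such elements usually means scanning for approximate matches to a consensus. The sketch below looks for the Shine-Dalgarno consensus AGGAGG in a window upstream of a candidate start codon; the 20 nt window, the one-mismatch allowance and the example sequences are assumptions chosen for illustration.

```python
def best_motif_match(window, motif="AGGAGG"):
    """Return (mismatches, offset) of the best ungapped match of motif
    in window, or None if the window is shorter than the motif."""
    best = None
    for i in range(len(window) - len(motif) + 1):
        mismatches = sum(x != y for x, y in zip(window[i:i + len(motif)], motif))
        if best is None or mismatches < best[0]:
            best = (mismatches, i)
    return best


def has_ribosome_binding_site(seq, start, upstream=20, max_mismatches=1):
    """True if a near-consensus Shine-Dalgarno motif lies just upstream
    of the start codon at position start."""
    window = seq[max(0, start - upstream):start]
    hit = best_motif_match(window)
    return hit is not None and hit[0] <= max_mismatches


with_sd = "TTTAGGAGGTAACATATGGCT"
without_sd = "TTTTTTTTTTTTTTTATGGCT"
print(has_ribosome_binding_site(with_sd, with_sd.index("ATG")))
print(has_ribosome_binding_site(without_sd, without_sd.index("ATG")))
```

The same scanning idea applies to the promoter consensus; gene finders normally replace the mismatch count with a position weight matrix built from known sites.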
Eukaryotic gene prediction:
Eukaryotic gene prediction differs from prokaryotic gene prediction in that it must take intron splicing and post-transcriptional modifications into account too (Figure 12).
Figure 12: Various steps leading to translation in Eukaryotes.
There are three different approaches to eukaryotic gene prediction:
Ab initio (predictive)
Methods include the use of:
- Coding Statistics: Similar to that in prokaryotes.
- Model Gene Structure: Structural elements to look for include:
- Transcription factor binding motifs
- CpG islands: CG dinucleotides occur all over the genome. The C is usually methylated, and because methyl-C is easily mutated to T, CG dinucleotides have been depleted over time. Near promoter regions, however, C is undermethylated under selective pressure, leading to CpG islands. These are typically a few hundred bp long (a commonly used definition requires at least 200 bp), and have a high abundance of C’s and G’s.
- Transcription Initiation Signature: This consists of TATA boxes, similar to prokaryotic promoters, but these are less common.
- RNA Cap signature: The cap is usually added after a CAG sequence.
- Translation Initiation: The Kozak consensus is found 20 nts from the cap. See Figure 13.
Figure 13: Kozak consensus sequence at translation initiation site
- Splicing Signature: This includes conserved sites defining exons and directing splicing
- Poly-A Signal Sequence: This is a conserved sequence of AAT(U)AAA that occurs near the end of the transcript, which contributes to transcription termination.
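Of the structural elements listed above, CpG islands are the simplest to test for computationally. A widely used rule of thumb compares the observed number of CG dinucleotides with the number expected from the base composition, (number of C × number of G) / length, and calls a region island-like when GC content exceeds 50% and the observed/expected ratio exceeds 0.6. The sketch below applies those thresholds to a whole sequence; in practice they are applied to sliding windows, and the toy sequences here are artificial.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_gc=0.5, min_obs_exp=0.6):
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp


island_like = "CG" * 50      # artificial extreme: every C is followed by G
depleted = "CATG" * 25       # contains C and G but no CG dinucleotide
print(cpg_stats(island_like), looks_like_cpg_island(island_like))
print(cpg_stats(depleted), looks_like_cpg_island(depleted))
```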
GENSCAN is a program used for eukaryotic gene prediction, which uses a Hidden Markov Model with hidden states including exons, introns, intergenic regions, promoters, etc. It also takes hexamer coding statistics and matrix profiles for gene structure into account. GENSCAN requires training sequences with known coding/noncoding regions to determine probabilities, as well as to train for coding statistics, before it can predict with accuracy.
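The idea of decoding hidden states from a sequence can be shown with a toy two-state model and the Viterbi algorithm, which finds the most probable state path. This is not GenScan's model: the states ("N" for noncoding, "C" for coding) and all probabilities below are invented so that the coding state simply favours G and C.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for the observations (log space)."""
    if not observations:
        return []
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for s in states:
            # best previous state from which to enter s
            prev = max(states, key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            current[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs]))
            pointers[s] = prev
        scores.append(current)
        back.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
sequence = "ATATATGCGCGCGCATATAT"
print("".join(viterbi(sequence, states, start_p, trans_p, emit_p)))
```

A gene finder uses many more states (exon phases, introns, splice sites) and emission models trained on known genes, but the decoding step follows the same principle.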
Similarity (comparative)
This involves the comparison of sequences to known coding sequences. Strategies involve comparisons to:
- Expressed Sequence Tags (ESTs): Mature mRNAs are converted to cDNAs and used as EST sequences, to which the sequences being tested are compared to look for similarities. Downsides to this approach include a 3’ bias, because the primer used is against the poly-A tail, and the presence of contaminating sequences. The lack of uniformity is also a problem, as highly expressed genes are enriched, and not all genes are seen due to different expression patterns.
- Known genes or proteins: Programs to do this include BLASTN, BLASTP and Psi-BLAST (which yields protein families). Downsides include the fact that novel genes can’t be found, and pseudogenes can’t be distinguished from active genes.
- Conservation between genomes: TwinScan is a program that compares sequenced genomes. It is built upon GenScan, with the addition of the probability of generating the observed conservation from each state. Highly conserved regions in particular can be used to make gene predictions, since many exon sequences are conserved.
Combination of different approaches
Programs such as GeneMachine perform both predictive and comparative gene predictions, using:
- GRAIL/MZEF: coding exon prediction
- GENSCAN/HMMGenes/FGenes
- RepeatMasker/Sputnik: complexity and interspersed repeat prediction
- BLAST: searches for sequence homology
Gene Prediction Complications
Complicating factors include:
- Sequencing errors
- Pseudogenes: genes without protein-coding ability, but with typical structural elements of genes
- Short exons coding for only a few amino acids
- RNA coding genes
- Genes within genes
- Unusual start codons
- Ribosomal frameshifting
- Introns in the 5’ UTR
- Alternative Splicing
Conclusions
Genome sequencing can be done by the clone-by-clone shotgun method, the whole-genome shotgun method or, more recently, by a combination of the two. The finished sequence has a wide variety of applications, including gene prediction. Gene prediction differs in prokaryotes and eukaryotes because of the differences in the structure of their genomes. It is aided by looking at coding statistics, and by looking for particular structural motifs associated with genes.
Although the genomes of many organisms, including humans, have been sequenced, several remain whose sequencing will be done in the future. In addition, gene prediction is and will continue to be an extremely useful tool in research and medicine.
References
Sources:
Useful Software/Web-based Applications:
Selected Papers: