Contributed by Josh Paulk, Jacob Carlson, and Linjiao Luo
Epigenetics can be defined as “heritable changes in gene function that occur without a change in the DNA sequence.” These changes include covalent modification of histone proteins, methylation of DNA, and alterations in higher-order chromatin structure. Gene expression can be greatly influenced by these epigenetic changes, so deciphering the ‘epigenome’ is key to understanding how certain cell states are maintained. Bioinformatics has played an important role in the advancement of this field. Below is an overview of the current state of epigenetics and some of the technology developed to advance the field. Epigenetic control operates on three major levels: nucleosome dynamics, histone modifications, and DNA methylation.
DNA in eukaryotic organisms is packaged in a nucleic acid-protein complex called chromatin. All genetic processes depend on DNA in the chromatin context and are regulated through the controlled access of cellular machinery to free DNA. The nucleosome (shown below) represents the fundamental repeating unit of chromatin (occurring every 157–240 bp, depending on the organism; ref. 10). The nucleosome core is composed of an octamer of histone proteins: two histone H3–H4 dimers and two histone H2A–H2B dimers, complexed to 147 bp of DNA. However, these canonical histones can be replaced by histone variants (e.g. H2A.X, H3.3) for purposes such as DNA repair and replication (ref. 9). In the chromatin fiber, this core is accompanied by an additional histone, H1, that ‘locks’ the complexed 147 bp of DNA onto the nucleosome core (and complexes with ~50 bp of linker DNA).
Nucleosomes can block RNA polymerase access to the promoter regions of genes during transcription initiation, which allows expression to be regulated at this level. Although the promoters of many genes (mostly housekeeping genes) have low nucleosome occupancy, genes that are regulated at the level of chromatin have their promoters effectively blocked. Chromatin remodeling proteins can relieve this block by actively changing the position of the nucleosomes to allow the assembly of the initiation complex. In order to identify which genes are regulated in this way, many studies have aimed to understand how and where these nucleosomes are positioned.
Nucleosome positioning at a specific locus can be determined by digesting chromatin with micrococcal nuclease (MNase), which cuts the exposed linker DNA, and then analyzing the nucleosome-protected fragments by microarray (see figure below). DNA microarrays with tiling oligonucleotide probes allow nucleosome positions to be determined at a genome scale. Unfortunately, the resolution of this method depends on the spacing between adjacent oligonucleotide probes, so it is very difficult to generate a high-resolution genome-wide map of nucleosome positioning for organisms with very large genomes (e.g. human).
In a 2008 paper published in Cell, Schones and colleagues demonstrated that high-resolution genome-wide maps of nucleosome positions could be constructed by using Solexa high-throughput sequencing technology in place of microarrays. The authors first started from human T cells and used MNase digestion to obtain protected DNA fragments of approximately 150 bp. They then sequenced the ends of the mononucleosome-sized DNA using the Solexa sequencing technique to map genome-wide nucleosome positions (see figure below). Figure 1 (from Schones et al., 2008) has been included to illustrate some of the data obtained.
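The step from sequenced fragment ends to nucleosome positions can be illustrated with a minimal sketch. It assumes (this is not taken from the paper) that each read is summarized by the position of its 5′ end and its strand, and that the nucleosome center lies about half a protected fragment (73 bp here) downstream of a forward-strand read and upstream of a reverse-strand read. The function names, the shift, and the read threshold are hypothetical choices for illustration only.

```python
from collections import Counter

# Assumed value: roughly half of a ~147 bp nucleosome-protected fragment.
HALF_NUCLEOSOME = 73


def nucleosome_center_counts(reads, shift=HALF_NUCLEOSOME):
    """Count reads supporting each inferred nucleosome center.

    reads: iterable of (five_prime_position, strand), strand is '+' or '-'.
    A '+' read marks the left end of a fragment, a '-' read the right end,
    so both are shifted toward the middle of the fragment.
    """
    counts = Counter()
    for pos, strand in reads:
        if strand == "+":
            counts[pos + shift] += 1
        elif strand == "-":
            counts[pos - shift] += 1
        else:
            raise ValueError(f"unknown strand: {strand!r}")
    return counts


def call_centers(counts, min_reads=3):
    """Return the sorted center positions supported by at least min_reads reads."""
    return sorted(pos for pos, n in counts.items() if n >= min_reads)


if __name__ == "__main__":
    toy_reads = [(100, "+"), (100, "+"), (246, "-"), (300, "+")]
    counts = nucleosome_center_counts(toy_reads)
    print(call_centers(counts))
```

In this toy input the two forward reads at 100 and the reverse read at 246 all point to the same center, while the isolated read at 300 falls below the threshold. Real pipelines typically smooth or score these counts rather than apply a hard cutoff.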
Using data from nucleosome position mapping (including sequences with both low and high affinity for nucleosomes), Segal and colleagues (2006) aimed to discover the genomic code for nucleosome positioning. By combining experimental and computational approaches to determine the DNA sequence preferences of nucleosomes and the genome-wide nucleosome organization that arises from these preferences, the authors concluded that eukaryotic genomes use a nucleosome positioning code to allow for specific chromatin-level regulation at certain genes (see Figure 1 from Segal et al., 2006 below).
Post-translational modification of the core histone subunits of nucleosomes is a fundamental mechanism by which the transcriptional activity of an associated gene locus can be regulated. Typically, specific residues (lysine, arginine, serine, threonine) at the positively charged N-terminal histone ‘tail’ are subject to modifications that include acetylation, methylation, phosphorylation, and ubiquitination. Within a histone tail, more than 100 single modifications are possible, which allows an enormous diversity of unique multiple-modification states.
Though the regulatory consequences of a given modification state can be subtle, some clear trends regarding specific modifications have begun to emerge. For example, acetylation of histone tail lysines often correlates with an increase in accessibility of an associated chromatin region to transcriptional machinery, presumably because the neutralization of positive charge diminishes charge-based DNA-histone interactions, thus relaxing the condensed nucleosome structure. The impact of other modifications, such as lysine methylation, depends on the residue being modified, as illustrated below.
Studies in which a particular histone modification is associated with a chromatin state have become increasingly common, and are enabled by the ChIP-chip and ChIP-seq technologies. The workflow for these methods is described below.
For ChIP-chip, cells are treated with formaldehyde to cross-link DNA with associated proteins, after which the DNA is harvested and fragmented by enzymatic (e.g. nuclease digestion) or physical (e.g. sonication) means and incubated with an antibody that is specific for the histone modification of interest. The antibody/histone/DNA complexes are recovered (ChIP), and the associated DNA is liberated, amplified using fluorescently tagged primers, and hybridized to a DNA microarray (chip). Tiling arrays are particularly useful for this analysis because they provide fairly high-resolution whole genome coverage that includes non-coding regions (intergenic regions and introns) which often contain regulatory sequence elements.
The ChIP-seq method uses a similar antibody immunoprecipitation technique, but after this step the liberated DNA is ligated to small adapter sequences, amplified, and identified by high-throughput parallel sequencing (e.g. Solexa).
The result of a ChIP-chip or ChIP-seq experiment is a profile of the enrichment of the histone modification of interest across the entire genome. In ChIP-chip this enrichment is observed as an increased ratio of a given DNA fragment derived from ChIP isolation relative to untreated nuclear DNA. In ChIP-seq the absolute number of fragments of a given sequence is counted and used to determine abundance. In either setup the results reveal DNA loci that are hot-spots for a particular histone modification. Shown below is a comparison of the results obtained from ChIP-chip and ChIP-seq for a specific histone modification in a particular region of the genome. Though both methods can introduce subtle biases, in general the agreement between methods is quite strong.
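The two readouts described above can be sketched side by side. The example below is a simplified illustration, not the analysis used in any of the cited studies: ChIP-chip enrichment is expressed as a log2 ratio of ChIP signal to input signal per probe, and ChIP-seq abundance as a count of reads per fixed-size window. The function names, the pseudocount, and the 200 bp window are assumptions made for the example.

```python
import math
from collections import Counter


def chip_chip_log_ratios(ip_signal, input_signal, pseudocount=1.0):
    """Per-probe log2(ChIP / input) enrichment.

    The pseudocount (an assumed value) avoids division by zero and damps
    ratios computed from very low intensities.
    """
    return [
        math.log2((ip + pseudocount) / (inp + pseudocount))
        for ip, inp in zip(ip_signal, input_signal)
    ]


def chip_seq_window_counts(read_positions, window=200):
    """Count ChIP-seq reads in fixed, non-overlapping windows.

    Returns a dict mapping each window start coordinate to its read count.
    """
    counts = Counter(pos // window for pos in read_positions)
    return {index * window: n for index, n in sorted(counts.items())}


if __name__ == "__main__":
    print(chip_chip_log_ratios([3.0, 1.0], [1.0, 1.0]))
    print(chip_seq_window_counts([10, 150, 210, 399, 400]))
```

Probes with a positive log ratio, or windows with many reads, are candidate hot-spots for the modification. In practice both kinds of data are normalized against a control before loci are compared.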
This information can be used at a rudimentary level to annotate the sequence loci occupied by various modified histones, and in particular the localization of these nucleosomes relative to the transcription start sites and other regulatory sequence elements of genes. A higher-level analysis can also correlate the frequency of these modifications with expression levels. Shown below are data that correlate the abundance of particular histone modifications with their position relative to transcription start sites, and with the expression levels of downstream genes obtained from classical microarray experiments on total cellular transcripts. The results indicate that some modifications (H3K4me) are correlated with increased gene expression, while others (H3K27me3) correlate with decreased gene expression. The peaks observed in the H3K4me3 profile for genes at high expression levels occur at +50, +210, and +360 bases, which correlates well with the known spacing interval for nucleosome positioning. Furthermore, the dip in abundance at the transcription start site is consistent with local nucleosome depletion at actively expressed genes.
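A profile of modification abundance relative to transcription start sites can be built by re-expressing each read position as an offset from the nearest gene start and pooling the offsets over many genes. The sketch below is an assumed, simplified version of such an aggregation: the function name, the 500 bp flank, and the 50 bp bins are hypothetical, and genes on the reverse strand are flipped so that positive offsets always point downstream.

```python
from collections import Counter


def tss_profile(tss_list, read_positions, flank=500, bin_size=50):
    """Pool read positions into bins of offset relative to each TSS.

    tss_list: iterable of (tss_position, strand), strand is '+' or '-'.
    read_positions: iterable of genomic positions of ChIP reads.
    Returns {bin_start_offset: read_count}, with negative offsets upstream.
    """
    profile = Counter()
    for tss, strand in tss_list:
        for pos in read_positions:
            # Flip the sign for reverse-strand genes so '+' means downstream.
            offset = pos - tss if strand == "+" else tss - pos
            if -flank <= offset < flank:
                profile[(offset // bin_size) * bin_size] += 1
    return dict(sorted(profile.items()))


if __name__ == "__main__":
    genes = [(1000, "+"), (5000, "-")]
    reads = [1060, 4940, 990]
    print(tss_profile(genes, reads))
```

Computing such a profile separately for groups of genes with high, medium, and low expression gives the kind of comparison described above. The nested loop is quadratic and only suitable for toy data; an indexed lookup would be needed at genome scale.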
The enzymes that install these modifications are themselves regulated by specific upstream signaling processes and therefore serve as intermediates between a cellular input and a desired transcriptional effect. The resulting histone modification state reflects the integration of input from multiple cellular processes, thereby enabling the sophisticated coordination of complex epigenetic states. There are various mechanisms by which a histone modification state can be ‘translated’. One direct mechanism involves the simple electrostatic interactions of DNA with histones; positively charged lysine residues will interact more tightly with negatively charged DNA than will neutral (acetylated lysine) or negatively charged (phospho-serine, phospho-threonine) residues. In this way histone modifications can impact the extent of condensation of chromatin and thus the accessibility of an associated gene to transcriptional machinery.
Another mechanism involves the ‘reading’ of histone tails by specialized protein domains that bind to specific modified residues. A listing of these domains and their recognition elements is shown below. Proteins that contain these domains can be specifically recruited to nucleosomes with a particular histone modification which results in the localization of the protein’s activity to specific chromosome regions. These activities include chromatin remodeling, further modification of histones, and recruitment or blocking of transcriptional machinery.
Another well-characterized epigenetic mechanism is DNA methylation. It involves the addition of a methyl group (CH3) to selected sites on DNA. This process is usually carried out by a group of enzymes called DNA methyltransferases. The figure below shows the most common eukaryotic DNA modification, which occurs at the number 5 carbon of the cytosine pyrimidine ring.
DNA methylation mainly happens at CpG sites, where a cytosine is followed by a guanine and the two are separated by a phosphate. In the human genome, the frequency of CpG is generally low, except in some very small “islands” called CpG islands. Although ~80% of the CpG sites in low-CpG-frequency regions are methylated, most of the CpG sites in CpG islands are not. Genes in low-CpG-frequency regions are often inactive, while genes located at CpG islands are often very important for cell development or cell function. DNA methylation is believed to play an important role in this pattern. To see why, imagine the opposite situation, in which all the important genes are methylated while all the inactive genes are unmethylated. Over evolutionary time scales, methylated CG is converted to TG by accidental deamination and is thus progressively eliminated from the genome, whereas unmethylated CG is conserved. The CpG sites of the important genes would then gradually be lost. The observed pattern, with CpG-rich islands at unmethylated genes, is consistent with this mechanism.
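The depletion of CpG described above is often quantified with an observed/expected CpG ratio, which compares the number of CG dinucleotides in a sequence with the number expected from its C and G content alone. This measure is not part of the original text; the sketch below uses the commonly cited form (number of CpG × length) / (number of C × number of G), and the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    Values near 1 mean CpG occurs about as often as base composition
    predicts; values well below 1 indicate CpG depletion.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    island_like = "CGCGCGCG"
    deaminated = "TGTGTGCG"  # same sequence after three C-to-T changes
    print(cpg_obs_exp(island_like), gc_content(island_like))
    print(cpg_obs_exp(deaminated), gc_content(deaminated))
```

A widely used working definition of a CpG island (again, not from this article) requires a stretch of at least 200 bp with GC content above 50% and an observed/expected ratio above 0.6. The toy sequences here are far too short for such thresholds and only show how the quantities are computed.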
DNA methylation is one of the most important epigenetic modifications involved in disease. Recently, many studies have reported that abnormal methylation may contribute to cancer.
As mentioned before, important genes at CpG islands are normally unmethylated, which is essential for their transcription. Once a CpG island becomes methylated, the associated gene becomes stably silenced. Cancer cells are characterized by abnormal DNA methylation patterns. There are two categories of DNA methylation aberrations: transcriptional silencing of tumor suppressor genes by CpG island promoter hypermethylation, and massive global genomic hypomethylation. The figure below shows an example of the first category.
There are different strategies for high-throughput detection of DNA methylation. They are either microarray-based or sequencing-based.
MeDIP stands for methylated DNA immunoprecipitation, and “chip” refers to a microarray. MeDIP-chip therefore means using a microarray after methylated DNA immunoprecipitation to map DNA methylation sites in the genome. This is a microarray-based method and is very similar to ChIP-chip. What is special about MeDIP-chip is that an antibody against 5-methylcytosine is used to specifically recognize methylated DNA. The figure below shows the basic principle of MeDIP-chip. DNA is first randomly sheared by sonication into 300–1000 bp fragments. The sonicated DNA is then immunoprecipitated with an antibody against 5-methylcytosine. Since CpG sites are not evenly distributed in the genome (recall the CpG islands), both the methylation status of the target sequence and the density of CpG sites can influence MeDIP enrichment. Unmethylated DNA should be used as a control. Input DNA and immunoprecipitated (methylated) DNA are differentially labeled with Cy3 and Cy5. Peak finding can then be done with the same methods used for ChIP-chip.
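The article does not spell out a peak-finding method, so the following is only a minimal sketch of one simple approach: treat each probe as a (position, log2 ratio) pair, where the ratio compares the immunoprecipitated channel to the input channel, and report runs of consecutive probes whose ratio stays above a threshold. The function name, the threshold of 1.0, and the minimum run length of two probes are assumptions chosen for illustration.

```python
def find_peaks(probes, threshold=1.0, min_probes=2):
    """Report runs of consecutive enriched probes as (start, end) intervals.

    probes: list of (position, log_ratio) sorted by position.
    A run must contain at least min_probes probes at or above threshold,
    which filters out isolated high probes that are likely noise.
    """
    peaks = []
    run = []
    for pos, ratio in probes:
        if ratio >= threshold:
            run.append(pos)
        else:
            if len(run) >= min_probes:
                peaks.append((run[0], run[-1]))
            run = []
    # Close a run that extends to the last probe.
    if len(run) >= min_probes:
        peaks.append((run[0], run[-1]))
    return peaks


if __name__ == "__main__":
    toy_probes = [
        (0, 0.1), (100, 1.5), (200, 2.0), (300, 0.2),
        (400, 1.2), (500, 0.0), (600, 1.1), (700, 1.3),
    ]
    print(find_peaks(toy_probes))
```

The single enriched probe at position 400 is discarded because it has no enriched neighbor. Published tiling-array methods are more elaborate (for example, smoothing over neighboring probes and estimating significance), and for MeDIP the local CpG density would also need to be taken into account, as noted above.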
Methyl-Seq is a sequencing-based method. The figure below shows an unpublished Methyl-Seq map from D. Johnson and R. Myers. Methylation-specific enzymes were used to digest the DNA, and the CpG island fragments were sequenced. Here “U” means unmethylated sequence and “M” means methylated sequence.
An understanding of the role of epigenetic modifications in cell regulation and cell-fate determination will continue to be enhanced by the use of genome-wide analysis technologies and biostatistics. These methods reveal global trends in nucleosome positioning, histone modifications, and DNA methylation that correlate with fundamental biological phenomena (e.g. regulation of gene expression). Such analysis is critical for understanding the significant influence of biological determinants that are not strictly encoded in the primary DNA sequence.
5) Bernstein, B.E., et al. The mammalian epigenome. Cell (2007) 128, 669-681.
6) Segal, E., et al. A genomic code for nucleosome positioning. Nature (2006) 442, 772-778.
8) Luger, K., et al. Nucleosome and chromatin fiber dynamics. Curr Opin Struct Biol (2005) 15, 188-196.
11) Singal, R., et al. DNA methylation. Blood (1999) 93(12), 4059-4070.
12) Liu, X.S. Getting started in tiling microarray analysis. PLoS Comput Biol (2007) 3(10), 1842-1844.