# Introduction

The fact that some traits tend to be coinherited across generations has long made it evident to geneticists that **Mendel's law of independent assortment** is not a true law. The discovery in the mid-twentieth century that DNA is the hereditary substrate provided a mechanistic basis for dependent assortment: because the different gametes segregating within a species are in fact unitary molecules that are broken apart and reassembled by relatively rare recombination events, adjacent bases are likely to remain together on the same gamete for a number of generations. Thus, knowledge of the particular nucleotide at a segregating site allows for prediction of the nucleotides at nearby segregating sites. This property of dependence among adjacent segregating bases is known as **linkage disequilibrium** (LD). Recent advances in sequencing technology have revealed that LD can extend over greater distances than previously thought, creating long blocks of sequence where **single nucleotide polymorphisms** (SNPs; sites in the genome where different individuals in a population may have different bases) are in strong LD with each other. These blocks are now referred to as **haplotypes**.

This article will review

(1) the utility of typing individuals by their haplotypes rather than just their genotypes at particular bases;

(2) the methods used to accomplish this typing;

(3) applications of the haplotype concept and haplotyping; and

(4) the methods used to identify the SNPs within a haplotype and important applications.

# The Utility of Inferring Haplotypes

Given the millions of SNPs in the human genome, detecting replicable associations between genotypes and phenotypes is a daunting challenge. Even though the cost of sequencing is falling dramatically, truly comprehensive genome-wide coverage in any given association study continues to impose a burden on researchers. Some of this burden can be alleviated by the fact that not all SNPs assort independently. Because any functional SNP of interest to a researcher is very likely to be in LD with other SNPs, the haplotypes that are inferred from a handful of SNPs in a particular region can be used as a proxy for whatever functional variants may lie within the block. An association between a particular haplotype and some disease state (or an enhanced value of a quantitative trait) then provides guidance to researchers who hope to pin down the association to a particular polymorphism within the block. Moreover, in the early phases of association research, haplotypes may provide greater reliability than the genotypes at particular base positions. This is because any association between a phenotype and a SNP is always ambiguous, especially if the SNP is noncoding; the association may be with the typed SNP itself or with another one in LD with the typed SNP. For this reason it is often desirable to use haplotypes in exploratory research.

# Methods for Inferring Haplotypes

Suppose that we have three loci with two segregating alleles each (*A*, *a*, and so on). The problem in inferring haplotypes is that current genotyping technology typically provides only an individual's genotypes at the loci and not his gamete structure. In other words, we know that the individual is *Aa BB Cc*, say, but not whether he inherited the *ABC* and *aBc* haplotypes or the *ABc* and *aBC* haplotypes. Given that 2^*k* haplotypes are possible for a block of *k* loci, what might we do to reduce our uncertainty as to which of these possible haplotypes is actually segregating at nontrivial frequency in the population? This is the essential problem of inference that must be tackled if haplotypes are to be used.
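To make the ambiguity concrete, the following minimal Python sketch enumerates every haplotype pair that is compatible with an unphased multilocus genotype. The function name `compatible_pairs` and the encoding of a genotype as a tuple of two-letter strings (one string per locus) are illustrative assumptions of ours, not part of any standard package.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return every unordered haplotype pair consistent with an unphased genotype.

    `genotype` holds one two-letter string per locus, e.g. ("Aa", "BB", "Cc").
    """
    pairs = set()
    # At each locus the first haplotype takes one allele and the second takes the other.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))  # unordered pair, so sort it
    return sorted(pairs)


print(compatible_pairs(("Aa", "BB", "Cc")))
# [('ABC', 'aBc'), ('ABc', 'aBC')]
```

The *Aa BB Cc* individual from the text yields exactly the two candidate pairs mentioned there. Each additional heterozygous locus doubles the number of candidate pairs, which is why the inference problem grows quickly with the size of the block.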

If the parental genotypes are known, then the haplotypes of the offspring can often be inferred. This procedure is analogous to that used in blood typing. For example, if the father of our hypothetical individual were homozygous for the "capital" allele at all loci, we would infer that the individual had inherited the *ABC* haplotype from his father and the *aBc* haplotype from his mother. The haplotypes carried by an individual can also often be inferred if the common haplotypes in the population are known. Methods have been developed, however, for situations where neither source of information is adequate.

## Clark's Algorithm

**Clark's algorithm**, a method developed by the population geneticist Andrew Clark, is actually quite elegant and intuitive. First, we construct haplotypes from unambiguous individuals. For example, it is certain that an *AA bb CC* individual carries two copies of a haplotype characterized by *AbC*. Second, we remove individuals from the sample that can be explained as combinations of haplotypes that have already been discovered. Third, for each remaining individual, list all possible combinations of haplotypes that might explain their multilocus genotypes and select the haplotype that shows up most often in these lists. Finally, iterate the second and third steps; that is, remove individuals that can be explained as combinations of the known and proposed haplotypes, and look again for possible haplotypes that can account for most of the remaining multilocus genotypes. The algorithm terminates when all individuals have been accounted for.
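The steps just described can be sketched in Python as follows. This is a simplified illustration of the procedure as outlined in this article, not a reproduction of Clark's published implementation; the function names, the genotype encoding (one two-letter string per locus), and the alphabetical tie-breaking rule are all assumptions of ours.

```python
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


def clark_sketch(genotypes):
    """Return (set of proposed haplotypes, {individual index: resolved pair})."""
    options = [compatible_pairs(g) for g in genotypes]
    # Step 1: individuals with a single compatible pair are unambiguous.
    known = {h for pairs in options if len(pairs) == 1 for h in pairs[0]}
    resolved = {}
    unresolved = set(range(len(genotypes)))
    while unresolved:
        # Step 2: remove individuals explained by two already-proposed haplotypes.
        for i in sorted(unresolved):
            for pair in options[i]:
                if pair[0] in known and pair[1] in known:
                    resolved[i] = pair
                    unresolved.discard(i)
                    break
        if not unresolved:
            break
        # Step 3: propose the new haplotype that appears in the most candidate lists.
        counts = Counter()
        for i in unresolved:
            candidates = {h for pair in options[i] for h in pair}
            counts.update(h for h in candidates if h not in known)
        best = min(counts, key=lambda h: (-counts[h], h))  # ties broken alphabetically
        known.add(best)
    return known, resolved


sample = [("AA", "bb", "CC"), ("Aa", "Bb", "CC"), ("Aa", "BB", "Cc")]
haplotypes, assignment = clark_sketch(sample)
print(sorted(haplotypes))
print(assignment)
```

Because the procedure is greedy, the result can depend on the tie-breaking rule and on the order in which individuals are examined, and it may propose a haplotype that no individual ends up using. The sketch is meant only to show the flow of the iteration.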

## Expectation Maximization

The following two methods, **expectation maximization** and the **Gibbs sampler**, rely on a statistical model of the haplotypes in a population. As both of these methods are very general and have applications in other areas of bioinformatics, readers interested in these methods should consult the articles specifically devoted to them.

What we want are estimates of the frequencies of each possible haplotype in the population. Obtaining these estimates by maximum likelihood is trivial *provided* that we know what haplotypes each individual is carrying. Thus, the actual haplotypes that are carried by each individual are treated as missing data to be imputed from the haplotype frequencies. In the initialization step, we pick an arbitrary vector of haplotype frequencies, subject to the conditions that they are compatible with the observed genotypes and sum to one. Using this current "estimate" of the unknown frequencies, the expected value of the missing data is computed. This is the **E step**. The result is a set of likelihood equations that are relatively easy to solve. This is the **M step**. The new estimates obtained from the M step are used to update the expected values of the missing data, and this whole process is repeated until convergence.

*Example of Expectation Maximization*

The figure above (Stat 115 SNP and Haplotype Analysis Lecture Slide 6) provides an example of expectation maximization. The initial estimates of the thetas are chosen more or less at random, subject to the condition that they are compatible with the observed genotypes and sum to one. The first individual is compatible with only one assignment of haplotypes (double homozygote for *ABC*), so he is assigned *ABC/ABC*. The remaining two individuals are each compatible with two assignments of haplotypes. From these data we obtain the updated frequency of *ABC* as follows. It is certain that the first individual is *ABC/ABC*, so the expected number of *ABC* haplotypes carried by this individual is 2. The second individual carries a single *ABC* haplotype in one of his two possible assignments. The expected number of *ABC* haplotypes carried by this individual is the probability that any individual carries this particular assignment (that is, the assignment compatible with *ABC* being one of the haplotypes) divided by the sum of the respective probabilities that any individual carries an assignment of haplotypes compatible with this individual's observed genotypes. A similar process is executed to determine the expected number of *ABC* haplotypes carried by the third individual. The expected number of *ABC* haplotypes is then summed over individuals, divided by the total number of haplotypes in the sample (twice the number of individuals), and set equal to the frequency of *ABC* in the next iteration. This procedure is carried out for each theta, yielding the maximum-likelihood updates that serve as the new estimates of theta in the next iteration.
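The E and M steps can be sketched in a few lines of Python. The three genotypes below are our own stand-ins chosen to resemble the situation described (one double homozygote and two individuals with two compatible assignments each); they are not taken from the lecture slide. The function names and the uniform starting frequencies are likewise assumptions. Note that each expected haplotype count is divided by the number of haplotypes in the sample, which is twice the number of individuals.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=20):
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in options for pair in pairs for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}  # initialization: uniform frequencies
    for _ in range(n_iter):
        # E step: expected number of copies of each haplotype, given current theta.
        expected = dict.fromkeys(haps, 0.0)
        for pairs in options:
            # Probability of each assignment (factor 2 for the two orderings of a
            # heterozygous pair), normalized over this individual's assignments.
            weights = [theta[a] * theta[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                expected[a] += w / total
                expected[b] += w / total
        # M step: new frequency = expected count / number of haplotypes (2 per person).
        theta = {h: expected[h] / (2 * len(genotypes)) for h in haps}
    return theta


sample = [("AA", "BB", "CC"), ("Aa", "BB", "Cc"), ("Aa", "Bb", "CC")]
for hap, freq in em_haplotype_frequencies(sample).items():
    print(hap, round(freq, 4))
```

In this toy sample the estimate for *ABC* climbs toward 2/3, because assignments that reuse the already common *ABC* haplotype gain weight at every iteration.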

## Gibbs Sampler

An important advantage of **Bayesian** data analysis is that it provides a complete description of the uncertainty of an estimate that itself relies on other parameters being estimated. This feature makes Bayesian approaches useful in haplotype inference, where an entire vector of unknown parameters is the quantity sought. We do not describe the theory or mechanics of Bayesian analysis in any detail here; for this the reader should consult more advanced references.

What is desired in Bayesian analysis is a *posterior distribution* of an unknown parameter that incorporates all features of the data, including the uncertainty in other parameters on which the parameter of interest depends. Posterior distributions are marginal probability densities over potentially many variables, and hence obtaining them often requires the integration of high-dimensional functions that lie beyond the grasp of analytical and numerical methods. Mathematical physicists faced with the problem of integrating functions of this kind developed a variety of techniques that revolve around the clever idea of *random sampling* from these functions. Perhaps the most powerful of these techniques is the **Gibbs sampler**, one of a family of **Markov Chain Monte Carlo** (MCMC) approaches. The Gibbs sampler is difficult to explain concisely, but at a low level relatively easy to understand. The key idea of Monte Carlo methods is that we can draw samples from the multidimensional distribution associated with the integral and use these samples to estimate the desired quantity with great accuracy.

Simulating random vectors drawn from some complex **target distribution** can be extremely difficult. MCMC approaches circumvent this problem by drawing successive samples from far simpler distributions in such a way that the distribution of the samples converges to the target distribution. We use the previous sample values to randomly generate the next sample value, so that the sequence of samples forms a **Markov chain**. The Gibbs sampler uses only the conditional distributions that result when all of the random variables but one are assigned fixed values. Such conditional distributions are far easier to simulate than complex joint distributions and often have simple forms.

Let us now go through a simple example. Consider a bivariate random variable *(x, y)*. Suppose that we wish to compute the marginals *p(x)* and *p(y)*. Suppose also that integrating the joint density over one variable is beyond our meager calculus skills. With the Gibbs sampler in hand, we can still obtain the marginal densities by considering a sequence of conditional distributions *p(x given y)* and *p(y given x)*. The sampler starts with some initial value *y_0* for *y* and obtains *x_0* by taking a random draw from the conditional distribution *p(x given y=y_0)*. The sampler then uses *x_0* to generate a new value *y_1*, drawing from the conditional distribution *p(y given x=x_0)*. This process is repeated *k* times to generate a **Gibbs sequence** of length *k*, that is, a sequence of *(x_j, y_j)* where *j* = 1, ..., *k*. To obtain the desired total of *m* sample points, we sample the chain after a sufficient **burn-in** (to remove the effects of the initial values) and at set time points (say every *n* samples) following the burn-in (to rid the retained sample points of local effects). Given enough iterations, the distribution of the Gibbs sequence converges to the desired target distribution regardless of the initial values.

The sampler is readily extended to more than two variables. The power of the approach is evident when we consider that the dimension of the joint density and the number of variables integrated over can both be very large.

There are two ways to compute the Monte Carlo estimate of the shape of the marginal density. First, we can use the Gibbs sequence of *x_i* values to approximate the marginal distribution of *x*. However, this approach can be somewhat inaccurate for the tails of the distribution. Second, we can average the conditional densities *p(x given y=y_i)*. This approach does estimate the tails accurately. Once we have the marginal density of the unknown parameter, we may of course compute other features of the parameter such as its mean and variance.
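A minimal Python sketch of the two-variable sampler is given below. As an assumption made purely for illustration, the target is a standard bivariate normal with correlation `rho`, for which both conditionals are known to be normal with mean `rho` times the conditioning value and variance 1 - `rho`^2. The function names, the burn-in length, and the thinning interval are our own choices. The second function estimates the marginal density of *x* by averaging conditional densities, the second of the two approaches just described.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_keep, burn_in=1000, thin=5, seed=0):
    """Draw n_keep thinned (x, y) samples after burn-in."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    y = 0.0                          # initial value y_0
    samples = []
    for step in range(burn_in + n_keep * thin):
        x = rng.gauss(rho * y, sd)   # draw x from p(x given y)
        y = rng.gauss(rho * x, sd)   # draw y from p(y given x)
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append((x, y))
    return samples


def normal_pdf(value, mean, sd):
    z = (value - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def marginal_density_x(value, samples, rho):
    """Estimate p(x = value) by averaging p(x given y = y_i) over the samples."""
    sd = math.sqrt(1.0 - rho * rho)
    return sum(normal_pdf(value, rho * y, sd) for _, y in samples) / len(samples)


draws = gibbs_bivariate_normal(rho=0.8, n_keep=5000)
mean_x = sum(x for x, _ in draws) / len(draws)
print("sample mean of x:", round(mean_x, 3))
print("estimated p(x=0):", round(marginal_density_x(0.0, draws, 0.8), 3))
```

For this target the true marginal of *x* is standard normal, so the sample mean should be close to 0 and the estimated density at 0 close to 0.399; the exact numbers depend on the seed.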

Inspection of the figure above suggests how the Gibbs sampler can be used to determine the posterior distribution of the vector of Z's (that is, the haplotypes carried by the individuals). To be clear, the distribution that we seek assigns to each individual a probability for every pair of haplotypes. We start with a pseudo-count of haplotypes very much as in the initialization step of expectation maximization. Conditioning on these "estimated" haplotype frequencies and the observed genotypes of the individuals, we draw samples of haplotype pairs and assign them to the individuals. Once each individual has a haplotype pair, we condition on the pairs and draw samples of haplotype frequencies; that is, we draw from the conditional distribution of our estimate of the true theta given the haplotypes carried by the individuals during this iteration. This process is repeated many, many times until the sequence converges to give the marginal distribution of the haplotype pairs.
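The alternation between haplotype pairs and haplotype frequencies can be sketched as follows. This is a toy illustration under stated assumptions: the prior on the frequencies is taken to be a symmetric Dirichlet whose parameter plays the role of the pseudo-count, the sample genotypes are invented, and all function names are ours. The Dirichlet draw is built from independent gamma variates using only the standard library.

```python
import random
from collections import Counter
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(locus[c] for locus, c in zip(genotype, choice))
        hap2 = "".join(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


def gibbs_haplotypes(genotypes, n_iter=2000, burn_in=500, pseudo=1.0, seed=0):
    """Posterior probability of each haplotype pair, one dict per individual."""
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in options for pair in pairs for h in pair})
    theta = {h: 1.0 / len(haps) for h in haps}
    tallies = [Counter() for _ in genotypes]
    for it in range(n_iter):
        # Draw each individual's pair given theta and the observed genotype.
        counts = dict.fromkeys(haps, pseudo)  # start from the pseudo-counts
        assignment = []
        for pairs in options:
            weights = [theta[a] * theta[b] * (1 if a == b else 2) for a, b in pairs]
            pair = rng.choices(pairs, weights=weights)[0]
            assignment.append(pair)
            counts[pair[0]] += 1
            counts[pair[1]] += 1
        # Draw theta given the pairs: Dirichlet(counts) via normalized gamma draws.
        gammas = {h: rng.gammavariate(counts[h], 1.0) for h in haps}
        total = sum(gammas.values())
        theta = {h: gammas[h] / total for h in haps}
        if it >= burn_in:
            for tally, pair in zip(tallies, assignment):
                tally[pair] += 1
    kept = n_iter - burn_in
    return [{pair: n / kept for pair, n in tally.items()} for tally in tallies]


sample = [("AA", "BB", "CC")] * 4 + [("Aa", "BB", "Cc")]
posterior = gibbs_haplotypes(sample)
print(posterior[-1])  # the ambiguous individual
```

With four *ABC/ABC* individuals in the sample, the ambiguous *Aa BB Cc* individual is assigned *ABC/aBc* in the large majority of retained iterations, since that assignment reuses a haplotype known to be common. Unlike expectation maximization, the output is a probability for every compatible pair rather than a single point estimate.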

## Partition/Ligation

As the number of SNPs in a block grows, a point is reached where haplotype inference becomes computationally unwieldy. This problem can be overcome by the following "divide and conquer" method. Because LD decays with distance, "sub-haplotypes" can arise within a block. That is, a cluster of adjacent SNPs may be in tight LD with each other, but sites farther away will inevitably be in weaker LD with the cluster; it will thus be possible to **partition** the sequence into groups where the LD among groups is relatively small compared to the LD within groups. If the recombination rate varies sharply within the region, then recombination hotspots may make for natural partition points.

Once the partitions are made by the analyst, haplotypes are inferred *within* each group using one of the methods discussed previously. The groups are constructed to contain sufficiently few SNPs that this task is computationally tractable. The key idea of the **ligation** step is that the sub-haplotypes can be treated as if they were segregating bases in order to infer haplotypes for the entire block. The end result is a hierarchical arrangement. At the top level are the long haplotypes that characterize the variation of the entire block. These haplotypes are distinguished from each other not by any single nucleotide but by groups of adjacent nucleotides that may be called haplotypes in their own right. For example, suppose that an entire locus consisting of 16 bases is segregating haplotypes I and II. Haplotype I is distinguished from II not by a single base but by being composed of the two 8-base haplotypes A and B, whereas II is composed of haplotypes A and C. It is possible that at the next level the sub-haplotypes are still not distinguished from each other at the level of single nucleotides but rather by even smaller haplotypes. For example, the haplotypes B and C segregating at the second half of the locus may differ in that B is composed of the 4-base haplotypes 1 and 2, whereas C is composed of the 4-base haplotypes 3 and 2. Finally, at the lowest level the sub-haplotypes are distinguished from each other by the SNPs of which they are composed.
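The ligation idea can be illustrated with a short Python sketch. We assume that sub-haplotypes have already been inferred within two adjacent groups, and we build candidate full-length haplotypes only from concatenations of those sub-haplotypes, keeping the ones that can pair up to explain at least one observed genotype. The function names, the four-locus example, and the genotype encoding are hypothetical; in practice the retained candidates would then be passed to one of the inference methods above.

```python
from itertools import product


def is_compatible(hap1, hap2, genotype):
    """True if the two haplotypes together reproduce the unphased genotype."""
    return all(sorted((a, b)) == sorted(locus)
               for a, b, locus in zip(hap1, hap2, genotype))


def ligate(left_haps, right_haps, genotypes):
    """Candidate full haplotypes built from sub-haplotypes of two adjacent groups."""
    candidates = [left + right for left, right in product(left_haps, right_haps)]
    kept = set()
    for genotype in genotypes:
        for hap1, hap2 in product(candidates, repeat=2):
            if is_compatible(hap1, hap2, genotype):
                kept.update((hap1, hap2))
    return sorted(kept)


left = ["AB", "ab"]    # sub-haplotypes inferred for loci 1-2
right = ["CD", "cd"]   # sub-haplotypes inferred for loci 3-4
observed = [("AA", "BB", "Cc", "Dd"), ("Aa", "Bb", "CC", "DD")]
print(ligate(left, right, observed))
# ['ABCD', 'ABcd', 'abCD']
```

Here only three candidates survive out of the sixteen haplotypes possible for four diallelic loci, which is the source of the computational saving.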

*Example of Partition/Ligation*

## Evolutionary Cladograms

Not only is the inference of haplotypes difficult if *k* is even moderately large, but the small resulting sample sizes for each haplotype can substantially reduce the power of tests for haplotype-phenotype associations. Partition-ligation is a computational method that addresses the first aspect of this problem. Here we briefly sketch a more explicitly substantive method developed by the population geneticist Alan Templeton and his colleagues that addresses how the haplotypes should be hierarchically organized to avoid loss of power.

The sequence data in haplotypes can be used to build a **cladogram** estimating how the different haplotypes are evolutionarily related to each other. A cladogram of haplotypes is constructed very much along the same lines as a cladogram of organismal taxa. For example, if two haplotypes are the only ones in a sample that share a particular derived base, then they are inferred to share a common ancestor allele with each other more recently than they do with any other haplotype in the sample. Armed with a cladogram displaying the inferred evolutionary relationships among all of the haplotypes in the sample, one can then consider how phenotypic state is a function of nested sets of cladogram members. The motivation behind this approach is that closely related haplotypes are more likely to share an allelic state with respect to the phenotype of interest. A suitable technique to exploit this logic is nested analysis of variance if the trait is quantitative; other approaches may be more useful if the trait is dichotomous, although some authorities have held that the robustness of ANOVA is sufficient to support dichotomous response variables. The results of ANOVA determine whether membership in a haplotype clade has any effect on the phenotype of interest, considering lower clades as factors that are nested within higher clades.

# The International HapMap Project

The observation that haplotypes are often longer than expected because of variation in recombination rates across the genome motivated the **International HapMap Project**, an effort by a consortium of researchers to catalogue the haplotype variation in different world populations. The figure below gives an indication of the vast scope of this effort. Because by definition alleles are correlated with each other within haplotypes, it is expected that the HapMap will assist researchers in association studies of disease and drug response by reducing the number of SNPs that need to be typed for comprehensive coverage of the genome. All data from the HapMap, including SNP frequencies, genotypes, and haplotypes, are freely available in the public domain.

*Groups in The International HapMap Project*

[**Note: The following material on detecting natural selection will not be covered in the final exam.**] One perhaps unexpected payoff from the HapMap has been its profound usefulness in studies of recent natural selection in the human species. Haplotypes can be a critical source of evidence in inferring whether selection has been acting on some phenotype influenced by a particular genetic locus. The reason for this is as follows. When an advantageous mutant first arises, it will of course be associated with the particular bases at the segregating sites around it. If the subsequent increase in frequency of this mutant occurs on a much shorter time scale than the breakup of this haplotype by recombination, then the end result is an unusually long high-frequency haplotype. Thus, genomic regions harboring high-frequency haplotypes of unusual length (or, equivalently, unusually little genetic variation over a long range) are quite likely to contain functional variants whose effects have been targeted by recent natural selection.

The variation in the HapMap has been studied by two research teams for likely selection signals of this kind. The first team used a rather clever method that gets around the necessity of haplotype inference, although the method depends crucially on the haplotype concept itself. Since the computational demands of haplotype inference can be incompatible with the constraints of a genome-wide survey, this method may be of general value in future studies of selection. For every SNP with a sufficiently high minor allele frequency, all individuals sampled for the HapMap and Perlegen datasets were binned according to genotype. Heterozygotes were then discarded. Within the two classes of homozygotes, a list of value pairs for nearby SNPs was created; the first element of each pair was the distance between the target SNP and the particular nearby SNP, and the second element was the fraction of the chromosomes in the datasets that had experienced a recombination event between the two SNPs since the base at the target SNP originally arose. This procedure is schematically illustrated in the figure below (Wang et al., 2006; open access). Notice that the fraction of recombinant chromosomes can be calculated without any knowledge of the haplotypes. Take the individuals who are homozygous for the red allele at the target SNP. It is easily seen that the haplotypes of these individuals do not need to be inferred in order to conclude that three of the eight chromosomes bearing the red allele have experienced a recombination event between the target SNP and d1. A similar number of recombination events have occurred between the target SNP and the other two SNPs. On the other hand, the yellow allele is in almost perfect LD with particular variants at the three nearby SNPs. What this means is that the target locus is highly unlikely to be evolving neutrally; the yellow allele has reached its current frequency so rapidly that the haplotype on which it originally arose has hitchhiked along with it.

A likelihood statistic was calculated for each list based on its deviation from a null model of neutral evolution. A surprisingly large number of SNPs produced values strongly in accord with positive selection--that is, a large number of SNPs were embedded in very long haplotypes showing decay with distance. (Other kinds of genomic events, such as suppression of recombination by inversion, can produce long-range LD. But these kinds of events typically do not produce LD that decays with distance.) Pathogen response, sexual reproduction, and neuronal function were among the Gene Ontology categories showing significant enrichment in the nearly 1,800 genes coincident with selected sites.

The second team, using a method that relies on inferred haplotypes, also documented an unexpectedly large number of selective events on the basis of the HapMap data. The method begins with a statistic known as **extended haplotype homozygosity** (EHH), a measure of decay with distance of haplotypes that carry a target allele. For each allele, haplotype homozygosity starts at 1 and decays to 0 with increasing distance from the core site. As shown in the figure below (Voight et al., 2006; open access), when an allele rises rapidly in frequency as a result of strong selection, it tends to have high levels of haplotype homozygosity extending much farther than expected under a neutral model. In other words, because the frequency of the target allele increases much faster than the rate at which recombination breaks up the haplotype on which it arose, the haplotypes in the sample carrying the selected variant show much longer stretches of uniformity. Using a non-parametric statistic derived from EHH called the **integrated haplotype score**, the authors discovered many sites likely to be under selection.
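As a rough illustration of the EHH statistic, the Python sketch below computes it as the probability that two randomly chosen chromosomes carrying the core allele are identical at every SNP from the core site out to a given position. This follows the usual definition of EHH as we understand it; the toy haplotypes (strings of 0/1 alleles), the function name, and the convention of returning 0 when fewer than two carriers exist are assumptions made for the example. The integrated haplotype score itself is not computed here.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, end_index):
    """Extended haplotype homozygosity from the core site out to end_index."""
    lo, hi = sorted((core_index, end_index))
    # Keep only chromosomes carrying the core allele, trimmed to the interval.
    carriers = [h[lo:hi + 1] for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return 0.0  # assumed convention: undefined with fewer than two carriers
    groups = Counter(carriers)
    # Fraction of carrier pairs that are identical over the whole interval.
    return sum(comb(c, 2) for c in groups.values()) / comb(n, 2)


phased = ["1000", "1000", "1000", "1001", "0110", "0011"]
for end in range(4):
    print(end, ehh(phased, 0, "1", end), ehh(phased, 0, "0", end))
# 0 1.0 1.0
# 1 1.0 0.0
# 2 1.0 0.0
# 3 0.5 0.0
```

In this toy sample the chromosomes carrying allele 1 at the core site remain identical much farther from the core than those carrying allele 0, which is the pattern expected when an allele has risen in frequency faster than recombination could break up its haplotype.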

The methods employed by the two research teams vary in power depending on the region of the parameter space, the dimension of which is determined by current allele frequencies, intensity of selection, age and duration of the selective event, and so on. However, there is a striking overlap in the genes identified as likely targets of selection in the two studies. For example, both studies identified *SNTG1*, a brain-expressed gene of unknown function, as evolving in all of the HapMap populations. Both studies also concurred in the finding that Gene Ontology categories relating to reproduction and immune response are enriched for selection signals. The conclusion to be drawn from these studies is this: since the out-of-Africa event 50,000 years ago that began the expansion of *Homo sapiens* around the globe, our species has been experiencing very strong natural selection at hundreds (or even thousands) of loci across the genome. The study of this recent and intense adaptive evolution in our species will surely prove to be one of the most intriguing research areas in all of science.

# SNP Chip

Before any of the procedures discussed above can be used, the bases carried by an individual at the various relevant SNPs must be determined. DNA microarrays can be adapted for this purpose. The basic working principles of a SNP microarray are the same as those governing microarrays more generally. Up to 40 labeled probes are created per locus per individual through restriction enzyme (RE) digestion, adapter ligation, and PCR. These probes are then placed on the array, which contains immobilized nucleotide sequences corresponding to the particular genotypes for each locus. A detection system then records the hybridization signal indicating which of the three genotypes (for a diallelic locus) is carried by the individual. The Affymetrix GeneChip, which has become the "industry standard," is capable of genotyping 500,000 SNPs; virtually the entire human genome is thus within 100 kb of a SNP typed by Affymetrix.

One application of SNP chips outside of genotype-phenotype association is the study of cancer. Cells responsible for cancer are often characterized by gross abnormalities in their DNA. Thus, cancer research focuses on the nature of these changes and their effects on the regulation of cell growth. It is often the case that these changes activate cancer-promoting genes (oncogenes) or inactivate tumor suppressors. Inactivation can be caused by either a null mutation or loss of an entire chromosomal region; such events lead to a condition known as **loss of heterozygosity** (LOH). SNP chips are useful in the detection of LOH because a cell that has become cancerous as a result of LOH will fail to register a full diploid genotype in comparison to a normal cell from the same individual. Unfortunately, errors of measurement and biological noise can lead to uninformative reads or even instances where it is the normal cell that appears to have suffered LOH. Despite these sources of ambiguity, it is usually the case that LOH is a reliable diagnostic tool, steadily increasing in regions progressively afflicted by cancerous growth. Standardized sliding-window summary scores of LOH have been proposed as a means of capturing the extent of LOH in a genomic region.
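The comparison of paired normal and tumor genotypes can be sketched in Python as follows. A SNP is treated as informative only when the normal sample is heterozygous, and LOH is called when the tumor sample is homozygous at such a SNP. The sliding-window summary used here is a plain fraction of informative SNPs showing LOH; it is a simplified stand-in of our own and not the standardized score proposed in the literature. The genotype calls and function names are hypothetical.

```python
def loh_calls(normal, tumor):
    """Per-SNP LOH call: True/False at informative SNPs, None where uninformative."""
    calls = []
    for n, t in zip(normal, tumor):
        if n[0] == n[1]:
            calls.append(None)          # normal is homozygous: no information
        else:
            calls.append(t[0] == t[1])  # heterozygosity lost in the tumor?
    return calls


def sliding_loh_fraction(calls, window):
    """Fraction of informative SNPs showing LOH in each window of SNPs."""
    fractions = []
    for start in range(len(calls) - window + 1):
        informative = [c for c in calls[start:start + window] if c is not None]
        fractions.append(sum(informative) / len(informative) if informative else None)
    return fractions


normal = ["AB", "AA", "AB", "AB", "BB", "AB"]
tumor = ["AB", "AA", "AA", "BB", "BB", "AB"]
calls = loh_calls(normal, tumor)
print(calls)                           # [False, None, True, True, None, False]
print(sliding_loh_fraction(calls, 3))  # [0.5, 1.0, 1.0, 0.5]
```

Averaging over a window dampens the effect of isolated genotyping errors, at the cost of blurring the boundaries of the affected region.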

Another genomic phenomenon of interest is change in copy number of a particular gene. This phenomenon is not limited to cancerous cells, as many studies (including an analysis of the HapMap) have shown that variation in copy number of genes is common in normal individuals and possibly contributes to the phenotypic variation within populations. In order to detect copy-number variation, many samples from presumably normal cells are collected and used as a benchmark for the diploid copy number (two); reads from cancerous cells are then compared to this benchmark for deviations. The raw copy number from a sample is twice the ratio of the sample SNP signal to the average signal of the normals; for example, if the sample SNP signal is twice the average of the normals, then the sample is expected to contain twice the typical number of copies, that is, four. This procedure can be refined by employing a **hidden Markov model**. The true copy number is taken to be a hidden state that influences the probability of observing a given read. The deviation of the raw read from the true copy number is modeled as a *t*-distributed random variable with 40 degrees of freedom (given 40 probes). If copy-number variation is induced by a pathological event, its occurrence in the genome should have spatial structure; the transition probabilities reflect this. The hidden Markov model is then run to infer the best path.
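The raw copy-number calculation and the search for the best path can be sketched in Python. The set of hidden states (0 to 4 copies), the probability of staying in the same state (0.95), the scale of the *t*-distributed deviation (0.4), and the uniform starting distribution are all assumptions chosen for illustration; only the 40 degrees of freedom come from the text. The best path is found with the standard Viterbi algorithm.

```python
import math


def raw_copy_number(signal, normal_signals):
    """Twice the sample signal divided by the mean signal of the normal samples."""
    return 2.0 * signal / (sum(normal_signals) / len(normal_signals))


def t_logpdf(z, df):
    """Log density of a standard Student t distribution with df degrees of freedom."""
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi)
            - (df + 1) / 2 * math.log1p(z * z / df))


def viterbi_copy_number(raw, states=(0, 1, 2, 3, 4), stay=0.95, scale=0.4, df=40):
    """Most probable sequence of hidden copy numbers for a series of raw reads."""
    k = len(states)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (k - 1))  # spread evenly over other states

    def emit(obs, state):
        # The constant -log(scale) is omitted; it is the same for every state.
        return t_logpdf((obs - state) / scale, df)

    score = [emit(raw[0], s) for s in states]    # uniform start (assumed)
    back = []
    for obs in raw[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(states):
            best = max(range(k),
                       key=lambda i: score[i] + (log_stay if i == j else log_move))
            pointers.append(best)
            step = log_stay if best == j else log_move
            new_score.append(score[best] + step + emit(obs, s))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    j = max(range(k), key=lambda i: score[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [states[i] for i in path]


print(raw_copy_number(200.0, [100.0, 100.0, 100.0]))  # 4.0
reads = [2.1, 1.9, 2.2, 4.1, 3.9, 2.9, 4.2, 3.8, 2.0, 2.1]
print(viterbi_copy_number(reads))  # [2, 2, 2, 4, 4, 4, 4, 4, 2, 2]
```

Rounding each read independently would call three copies at the read of 2.9. Because the transition probabilities favor staying in the same state, the inferred path instead keeps that SNP inside the surrounding four-copy segment, which is the spatial structure the model is meant to capture.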

# Conclusion

The genomicist George Church is predicting that whole-genome sequencing for only $1,000 per subject will be available very soon, perhaps within the next five years. Given the astonishing speed at which sequencing technology has developed, Church may very well be correct. If he is in fact correct, then the use of haplotypes as a proxy for the functional variants within them may soon lose some of its utility. However, haplotypes are likely to remain useful in certain aspects of data analysis, and will almost certainly remain so for the study of recent adaptive evolution in humans and other species. For these reasons it is hoped that developments in this field will continue.

# Additional Links and Reading

Gillespie JH (2002). *Population Genetics: A Concise Guide* (2nd ed). Baltimore, MD: Johns Hopkins University Press.

**Just about every genetics textbook or primer gives adequate coverage of linkage disequilibrium, recombination, haplotype structure, and so on. This book has been praised for its conciseness and clarity.**

Gelman A, Carlin JB, Stern HS & Rubin DB (2003). *Bayesian Data Analysis* (2nd ed). Boca Raton, FL: Chapman & Hall/CRC.

**A standard textbook on Bayesian methods.**

Wikipedia articles on single nucleotide polymorphism, haplotype, International HapMap Project, expectation-maximization algorithm, Bayesian inference, Gibbs sampling, and SNP array.

Templeton AR, Weiss KM, Nickerson DA, Boerwinkle E & Sing CF (2000). Cladistic structure within the human Lipoprotein lipase gene and its implications for phenotypic association studies. *Genetics*, 156, 1259-1275.

PMID 11063700

Clark AG (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. *Molecular Biology and Evolution*, 7, 111-122.

PMID 2108305

Qin ZS, Niu T & Liu JS (2002). Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. *American Journal of Human Genetics*, 71, 1242-1247.

PMID 12452179

Niu T, Qin ZS, Xu X & Liu JS (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. *American Journal of Human Genetics*, 70, 157-169.

PMID 11741196

The International HapMap Consortium (2005). A haplotype map of the human genome. *Nature*, 437, 1299-1320.

PMID 16255080

Wang ET, Kodama G, Baldi P & Moyzis RK (2006). Global landscape of recent inferred Darwinian selection for *Homo sapiens*. *Proceedings of the National Academy of Sciences U.S.A.*, 103, 135-140.

PMID 16371466

Voight BF, Kudaravalli S, Wen X & Pritchard JK (2006). A map of recent positive selection in the human genome. *PLoS Biology*, 4(3), e72.

PMID 16494531

Zhao X, Li C, Paez JG, Chin K, Janne PA, Chen T, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR & Meyerson M (2004). An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. *Cancer Research*, 64, 3060-3071.

PMID 15126342

Redon et al. (2006). Global variation in copy number in the human genome. *Nature*, 444, 444-454.

PMID 17122850

**A study of normal copy-number variation using the HapMap.**
