
Association Studies and Haplotype Inference


 

Edited by Stephen Yee and Kristi Fenstermacher (sfyee@fas. and kfenster@fas.)

 

 


 

 

SNP Association Studies

Association studies link genetic markers with phenotypic expression. The goal of SNP association studies is to find SNP or haplotype markers for "disease genes" that increase susceptibility to disease; these markers aid in disease prediction and diagnosis. There are two types of SNP association studies: population-based case-control association studies and family-based association studies.

 

Population-Based Case-Control Association Studies

Population-based case-control SNP association studies involve two large sample populations matched in age and gender. One group, the control, consists of healthy individuals; the other consists of diseased individuals. The individuals' alleles are determined by genotyping. If one allele is significantly more prevalent in the disease group than in the healthy group, it is possible that the allele is related to the disease, probably because the SNP lies in or near a gene that influences disease susceptibility. Since linkage disequilibrium organizes SNPs into haplotype blocks, only a representative subset of SNPs needs to be typed to achieve genome-wide coverage in an association study.

 

The lecture notes present an example of this type of association study as a contingency table of allele counts in cases and controls.

Note that the expected value for each cell is calculated as: (row total * column total) / (grand total).

A chi-square test comparing observed and expected counts can then be performed to determine whether the results are significant; the p-value in the lecture example is statistically significant.
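
As a concrete illustration, here is a minimal sketch of such a test in Python; the allele counts are invented for illustration and are not the lecture's numbers.

```python
# A minimal sketch of a case-control allele-count chi-square test.
# The counts below are illustrative and are NOT the lecture's numbers.
from scipy.stats import chi2_contingency

# Rows: cases, controls; columns: allele A, allele a.
observed = [[60, 40],
            [45, 55]]

# correction=False matches the plain (O - E)^2 / E formula done by hand.
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("Expected counts (row total * column total / grand total):")
print(expected)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
```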

 

Family-Based Association Studies

Family-based association studies can be conducted with unrelated families or within the same family. In studies with unrelated families, the genotypes of both parents and one affected child must be known. Only heterozygous parents are informative: the allele that a heterozygous parent passes to the affected child is counted as transmitted. For example, in the first row of the following table, Parent 2 is heterozygous, and of its two alleles, A is passed to the child; therefore, A is the transmitted allele.

By tallying allele transmissions, it can be determined whether there is a significant association between the allele and the disease. For example, in the results given above, there are 9 transmitted A alleles and 2 transmitted a alleles, giving a transmission frequency for A of 9/11, or about 0.82. The corresponding transmission disequilibrium test (TDT) statistic is (9 - 2)^2 / (9 + 2), or about 4.45,

which has a chi-squared distribution with 1 degree of freedom under the null hypothesis of no association.
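
A minimal sketch of this computation, assuming the standard McNemar-style TDT statistic (b - c)^2 / (b + c):

```python
# TDT statistic for the counts above: 9 transmissions of A, 2 of a.
from scipy.stats import chi2

b, c = 9, 2                      # transmitted A, transmitted a
tdt = (b - c) ** 2 / (b + c)     # (9-2)^2 / 11, approximately 4.45
p = chi2.sf(tdt, df=1)           # upper-tail p-value, 1 degree of freedom
print(f"TDT = {tdt:.2f}, p = {p:.4f}")
```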

 

In studies within the same family, the significance of the association can be assessed by comparing allele frequencies between affected children and unaffected children. Furthermore, genotyping affected and unaffected individuals within the family can reveal the loci of the associated disease gene. For example, the pedigree chart below shows affected individuals in the family as filled circles or squares, and the highlighted line across the chromosomes in the third generation indicates loci shared by all three members afflicted with the disease. This implies that there is a link between that part of the chromosome and the disease.

 

Figure: A pedigree chart of a family where filled shapes represent members affected by the disease. The affected members all seem to share a particular portion of the chromosome (green). Source: Stat 115 Lecture Notes (http://my.harvard.edu:80/icb/icb.do?keyword=k30161&pageid=icb.page131178).

 

Impact and Complications

SNP association studies have successfully identified a number of polymorphisms or haplotypes that are consistently associated with complex diseases, such as those listed in the table below. In April 2006, a collaborating group of scientists in Boston and Germany announced that they had identified a common genetic variant believed to be associated with adult and childhood obesity (Herbert et al. A common genetic variant is associated with adult and childhood obesity. Science. 312: 279-283. 2006). The finding was replicated in four separate samples composed of individuals of Western European ancestry, African Americans, and children. The researchers determined that this genotype is found in 10% of individuals and concluded that genetic polymorphisms are an important factor in obesity. The study also became an important lesson in false positives: despite initially strong evidence for a connection between this SNP and obesity, later studies were unable to reproduce the results.

 

Figure: Diseases found to be consistently associated with SNPs or haplotypes. (Hattersley, A.T. and McCarthy, M.I. What makes a good genetic association study? Lancet. 366: 1315-23. 2005.)

 

Still, establishing a true association between a genetic variant and a disease is difficult. For example, soon after the paper about the obesity-linked common genetic variant was published, four "Technical Comments" appeared in the January 2007 issue of Science, all with results indicating that there is actually no strong association between the genetic variant and obesity, i.e. the original results from April 2006 were false positives. The pitfalls of association studies result from the facts that:

 

1. Association doesn’t necessarily imply causation.

2. Association studies are difficult when several genes affect a quantitative trait.

3. Penetrance (the fraction of people with the marker who express the trait) and expressivity (the severity of the effect) are variable.

4. Population stratification may cause mismatching between sample groups because certain SNPs are unique to certain ethnic groups (in an association study, it is necessary to ensure that sample groups match, so that hidden population structure does not produce a spurious association).

5. Weak association effects may not be captured by the studies.

6. Association studies are not very reproducible.

 

 

These complications can lead to false positives (corrections for multiple hypothesis testing and for ethnic admixture/stratification should be applied) and/or false negatives (from lack of power to detect weak effects), resulting in inconsistencies and irreproducibility of association study outcomes. The inconsistencies can also result from variable linkage disequilibrium with the causal SNP, or from population-specific modifiers. For studies to be reliable, careful study design using appropriate controls is necessary. Genetic effects are generally modest, and due to the winner's curse (which says that the winning bid overestimates the value), the first positive report tends to overestimate the effect of the SNP on the disease. Thus, large study sample sizes – preferably upward of thousands of people – are also crucial to ensure that both rare and common variants with different levels of effect can be captured in the studies. Furthermore, for association studies to be believable, replication with low p-values is necessary. For genome-wide association studies, the use of the two-step test and of haplotypes instead of single SNPs can boost a study's power.

 

A Case Study using the Two-Step Test

 

As mentioned in the previous section, Herbert et al. (2006) performed a genome-wide association study, using a family-based design, to find SNPs associated with obesity. They used a two-step test. The first step (A) was a screening step, in which the researchers used parental genotypes to select SNPs and genetic models (used to predict the offspring's phenotype). This step estimates each SNP's power to show an association in the second step. The researchers ranked the SNPs by this power and carried the 10 highest-powered SNPs forward. In the second step (B), they tested the selected SNPs using a family-based association test and adjusted the resulting p-values for multiple hypothesis testing. If, after this two-step procedure, a SNP had a significant adjusted p-value, it was reported as significantly associated with obesity at the genome-wide level.

 

Figure: The two-step test. (Herbert et al. A common genetic variant is associated with adult and childhood obesity. Science. 312: 279-283. 2006.)
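
A minimal sketch of the two-step logic, with synthetic power scores and p-values standing in for the real screening and family-based test statistics (all numbers below are hypothetical):

```python
# A minimal sketch of a two-step genome-wide test. The power scores and
# p-values below are synthetic placeholders, not results of real tests.
import numpy as np

rng = np.random.default_rng(0)
n_snps = 100_000

power = rng.random(n_snps)          # step A: screening power per SNP
top10 = np.argsort(power)[-10:]     # carry forward the 10 most powerful SNPs

raw_p = rng.random(10)              # step B: family-based test p-values
raw_p[3] = 1e-4                     # plant one strong signal for illustration
adj_p = np.minimum(raw_p * len(raw_p), 1.0)   # Bonferroni over 10 tests only

for snp, p in zip(top10, adj_p):
    if p < 0.05:
        print(f"SNP {snp}: genome-wide significant (adjusted p = {p:.2g})")
```

The power gain of the design is visible in the correction: the p-values are adjusted over the 10 SNPs that survive screening, not over all 100,000 SNPs tested in the first step.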

 

 

 

An Introduction to Haplotypes

The fact that some traits tend to be coinherited across generations has long made it evident to geneticists that Mendel's law of independent assortment is not a true law. The discovery in the early 1950s that DNA is the hereditary substrate provided a mechanistic basis for dependent assortment: because the chromosomes segregating within a species are unitary molecules that are broken apart and reassembled by relatively rare recombination events, adjacent bases are likely to remain together on the same gamete for a number of generations. Thus, knowledge of the particular nucleotide at a segregating site allows for prediction of the nucleotides at nearby segregating sites. This property of dependence among adjacent segregating bases is known as linkage disequilibrium (LD). Recent advances in sequencing technology have revealed that LD can extend over greater distances than previously thought, creating long blocks of sequence where single nucleotide polymorphisms (sites in the genome where different individuals in a population may carry different bases) are in strong LD with each other. These blocks are now referred to as haplotypes.

 

The Utility of Inferring Haplotypes

 

Given the millions of SNPs in the human genome, detecting replicable associations between genotypes and phenotypes is a daunting challenge. Even though the cost of sequencing is falling dramatically, truly comprehensive genome-wide coverage in any given association study continues to impose a burden on researchers. Some of this burden can be alleviated by the fact that not all SNPs assort independently. Because any functional SNP of interest to a researcher is very likely to be in LD with other SNPs, the haplotypes (clusters of SNPs in linkage disequilibrium) that are inferred from a handful of SNPs in a particular region can be used as a proxy for whatever functional variants may lie within the block. An association between a particular haplotype and some disease state (or an enhanced value of a quantitative trait) then provides guidance to researchers who hope to pin the association down to a particular polymorphism within the block. Moreover, in the early phases of association research, haplotypes may provide greater reliability than the genotypes at particular base positions. This is because any association between a phenotype and a SNP is always ambiguous, especially if the SNP is noncoding; the association may be with the typed SNP itself or with another one in linkage disequilibrium with it. For this reason, it is often desirable to use haplotypes in exploratory research.

 

Methods for Inferring Haplotypes

 

Suppose that we have three loci with two segregating alleles each (A and a, and so on). The problem in inferring haplotypes is that current genotyping technology typically provides only an individual's genotypes at the loci and not the gamete structure. In other words, we know that the individual is Aa BB Cc, say, but not whether he inherited the ABC and aBc haplotypes or the ABc and aBC haplotypes. Given that 2^k haplotypes are possible for a block of k biallelic loci, what might we do to reduce our uncertainty as to which of these possible haplotypes are actually segregating at a nontrivial frequency in the population? This is the essential problem of inference that must be tackled if haplotypes are to be used.
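
To make the ambiguity concrete, the following minimal sketch enumerates the haplotype pairs compatible with an unphased genotype; the representation of genotypes as per-locus allele pairs is our own convention.

```python
# Enumerate the haplotype pairs compatible with an unphased genotype.
# Each locus is given as a pair of alleles, e.g. ("A", "a").
from itertools import product

def compatible_haplotype_pairs(genotype):
    """Return the set of unordered haplotype pairs explaining a genotype."""
    pairs = set()
    # At each heterozygous locus, either allele may go onto haplotype 1;
    # haplotype 2 then receives the other allele at that locus.
    for choice in product(*genotype):
        h1 = "".join(choice)
        h2 = "".join(a2 if c == a1 else a1
                     for c, (a1, a2) in zip(choice, genotype))
        pairs.add(frozenset([h1, h2]) if h1 != h2 else frozenset([h1]))
    return pairs

# Aa BB Cc: heterozygous at two loci, hence two possible phasings,
# ABC/aBc and ABc/aBC, exactly as in the text.
for pair in compatible_haplotype_pairs([("A", "a"), ("B", "B"), ("C", "c")]):
    print(" / ".join(sorted(pair)))
```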

 

If the parental genotypes are known, then the haplotypes of the offspring can often be inferred. This procedure is analogous to that used in blood typing. For example, if the father of our hypothetical individual were homozygous for the "capital" allele at all loci, we would infer that the individual had inherited the ABC haplotype from his father and the aBc haplotype from his mother. The haplotypes carried by an individual can also often be inferred if the common haplotypes in the population are known. Methods have been developed, however, for situations where neither source of information is adequate.

 

 

Clark's Algorithm

 

Clark's algorithm, a method developed by the population geneticist Andrew Clark, is quite elegant and intuitive. It proceeds in four steps:

1. Construct haplotypes from unambiguous individuals. For example, it is certain that an AA bb CC individual (let's call him Individual 1) carries two copies of the haplotype AbC. It is also certain that an AA BB Cc individual (Individual 2) carries one copy of ABC and one copy of ABc.

2. Remove from the sample any individuals whose genotypes can be explained as combinations of haplotypes that have already been discovered. So, if Individual 3 can be resolved as AbC plus ABc, both of which are already in the set of unambiguous haplotypes, Individual 3 is considered explained and is removed.

3. For each remaining individual, list all the pairs of haplotypes that might explain their multilocus genotype, and propose the new haplotype that shows up most often in pairs whose other member is already known. So, if the remaining resolutions are AbC + abc and ABc + aBc for Individual 4, and AbC + abc for Individual 5, the new haplotype you would propose is abc (abc appears in two of the three candidate pairs whose other copy is explained by the unambiguous haplotypes).

4. Iterate steps 2 and 3: remove individuals that can be explained as combinations of the known and proposed haplotypes, and look again for new haplotypes that account for the most remaining multilocus genotypes. The algorithm terminates when all individuals have been accounted for.
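
Below is a minimal sketch of this parsimony procedure, assuming a simplified per-locus coding (0 and 2 for the two homozygotes, 1 for a heterozygote); it illustrates the idea rather than reproducing Clark's original implementation.

```python
# A minimal sketch of Clark's parsimony phasing with 0/1/2 genotype codes.
from itertools import product

def resolutions(genotype):
    """All unordered haplotype pairs (as tuples) compatible with a genotype."""
    options = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
               for g in genotype]
    pairs = set()
    for combo in product(*options):
        h1 = tuple(c[0] for c in combo)
        h2 = tuple(c[1] for c in combo)
        pairs.add(tuple(sorted([h1, h2])))
    return pairs

def clark(genotypes):
    known = set()
    unresolved = list(genotypes)
    # Step 1: seed with unambiguous individuals (at most one het locus).
    for g in genotypes:
        if sum(x == 1 for x in g) <= 1:
            for pair in resolutions(g):
                known.update(pair)
    while unresolved:
        # Step 2: drop individuals fully explained by known haplotypes.
        unresolved = [g for g in unresolved
                      if not any(h1 in known and h2 in known
                                 for h1, h2 in resolutions(g))]
        if not unresolved:
            break
        # Step 3: propose the haplotype that, paired with a known one,
        # explains the most remaining individuals.
        votes = {}
        for g in unresolved:
            for h1, h2 in resolutions(g):
                if h1 in known and h2 not in known:
                    votes[h2] = votes.get(h2, 0) + 1
                if h2 in known and h1 not in known:
                    votes[h1] = votes.get(h1, 0) + 1
        if not votes:
            break  # stuck: no resolution shares a known haplotype
        known.add(max(votes, key=votes.get))  # step 4: iterate
    return known
```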

 

The disadvantages of Clark's algorithm are that it cannot get started when there are no unambiguous individuals (a likely situation when the sample size, n, is small), and that the result depends on the ambiguous subjects and the order in which they are resolved.

 

 

Expectation Maximization and Gibbs Sampler in Haplotype Inference

The following two methods, expectation maximization and the Gibbs sampler, rely on a statistical model of the haplotypes in a population. Each individual's haplotype pair is treated as a random draw from a pool of haplotypes with certain frequencies, theta, consistent with the observed genotypes. In both expectation maximization and the Gibbs sampler, we observe the genotype (Y) and estimate the haplotype pair (Z) and the haplotype frequencies (theta). Initial haplotype frequencies are assigned (they should sum to 1), and then Z and theta are iteratively estimated: the individual's haplotype pair (Z) is estimated given the observed genotype and the current haplotype frequencies, and the haplotype frequencies (theta) are estimated given the observed genotypes and the individuals' haplotype pairs.

 

As both Expectation Maximization and Gibbs sampler are very general and have applications in other areas of bioinformatics, readers interested in these methods should consult the articles specifically devoted to them.

 

Expectation Maximization

What we want are estimates of the frequencies of each possible haplotype in the population. Obtaining these estimates by maximum likelihood is trivial provided that we know which haplotypes each individual is carrying. Thus, the actual haplotypes carried by each individual are treated as missing data to be imputed from the haplotype frequencies. In the initialization step, we pick an arbitrary vector of haplotype frequencies, theta, subject to the conditions that they are compatible with the observed genotypes and sum to one. Using this current "estimate" of the unknown frequencies, the expected value of the missing data is computed; that is, the probability of each compatible haplotype pair is computed for each individual, given the observed genotype and the current haplotype frequencies. This is the E step. Then the haplotype frequencies are re-estimated by maximum likelihood from these expected haplotype counts. This is the M step. The new estimates obtained from the M step are used to update the expected values of the missing data, and this whole process is repeated until convergence.

 

Example of Expectation Maximization

 

The figure above (Stat 115 SNP and Haplotype Analysis Lecture Slide 6) provides an example of expectation maximization. The initial estimates of the thetas are chosen more or less at random, subject to the condition that they are compatible with the observed genotypes and sum to one. The first individual is compatible with only one assignment of haplotypes (double homozygote for ABC), so it is certain that he is ABC/ABC. The remaining two individuals are each compatible with two assignments of haplotypes. From these data we obtain the updated frequency of ABC as follows. It is certain that the first individual is ABC/ABC, so the expected number of ABC haplotypes carried by this individual is 2. The second individual carries a single ABC haplotype in one of his two possible assignments; the expected number of ABC haplotypes carried by this individual is the probability of that particular assignment (the one compatible with ABC) divided by the sum of the probabilities of all assignments compatible with his observed genotypes. A similar calculation gives the expected number of ABC haplotypes carried by the third individual. The expected number of ABC haplotypes is then summed over individuals, divided by the total number of haplotypes (twice the number of individuals), and set equal to the frequency of ABC in the next iteration. This procedure is carried out for each theta to produce the updated frequency estimates used in the next iteration.
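
Since the lecture figure is not reproduced here, the following minimal sketch implements the E and M steps just described, under the same simplified 0/1/2 genotype coding used in the Clark sketch above; all details beyond the updates described in the text are our own choices.

```python
# A minimal EM sketch for haplotype frequency estimation with 0/1/2
# genotype codes per locus. Illustrative only (cf. Qin et al. 2002).
from collections import defaultdict
from itertools import product

def resolutions(g):
    """All haplotype pairs (ordered) compatible with one genotype."""
    opts = [[(0, 0)] if x == 0 else [(1, 1)] if x == 2 else [(0, 1), (1, 0)]
            for x in g]
    return [(tuple(c[0] for c in combo), tuple(c[1] for c in combo))
            for combo in product(*opts)]

def em_haplotypes(genotypes, n_iter=50):
    haps = {h for g in genotypes for pair in resolutions(g) for h in pair}
    theta = {h: 1.0 / len(haps) for h in haps}    # arbitrary initialization

    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = resolutions(g)
            # E step: posterior weight of each phasing under current theta.
            w = [theta[h1] * theta[h2] for h1, h2 in pairs]
            total = sum(w)
            for (h1, h2), wi in zip(pairs, w):
                counts[h1] += wi / total
                counts[h2] += wi / total
        # M step: frequency = expected count / total haplotypes (2n).
        theta = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return theta
```

The E step weights each phasing by theta[h1] * theta[h2], which is exactly the expected-count calculation described in the text.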

 

Gibbs Sampler

 

An important advantage of Bayesian data analysis is that it provides a complete description of the uncertainty of an estimate that itself relies on other parameters being estimated. This feature makes Bayesian approaches useful in haplotype inference, where an entire vector of unknown parameters is the quantity sought. We do not describe the theory or mechanics of Bayesian analysis in any detail here; for this the reader should consult more advanced references.

 

What is desired in Bayesian analysis is a posterior distribution of an unknown parameter that incorporates all features of the data, including the uncertainty in other parameters on which the parameter of interest depends. Posterior distributions are marginal probability densities over potentially many variables, and hence obtaining them often requires the integration of high-dimensional functions that lie beyond the grasp of analytical and numerical methods. Mathematical physicists faced with the problem of integrating functions of this kind developed a variety of techniques that revolve around the clever idea of random sampling. Perhaps the most powerful of these techniques is the Gibbs sampler, one of a family of Markov Chain Monte Carlo (MCMC) approaches. The Gibbs sampler is difficult to explain concisely but relatively easy to understand at a low level. The key idea of Monte Carlo methods is that we can draw samples from the multidimensional distribution associated with the integral and use these samples to estimate the desired quantity with great accuracy.

 

Simulating random vectors drawn from some complex target distribution can be extremely difficult. MCMC approaches circumvent this problem by drawing successive samples from far simpler distributions in such a way that the distribution of the samples converges to the target distribution. We use the previous sample values to randomly generate the next sample value in such a way that the sequence forms a Markov chain. The Gibbs sampler uses only the conditional distributions that result when all of the random variables but one are assigned fixed values. Such conditional distributions are far easier to simulate than complex joint distributions and often have simple forms.

 

Let us now go through a simple example. Consider a bivariate random variable (x, y), and suppose that we wish to compute the marginals p(x) and p(y). Suppose also that integrating the joint density over one variable is beyond our meager calculus skills. With the Gibbs sampler in hand, we can still obtain the marginal densities by considering the conditional distributions p(x given y) and p(y given x). The sampler starts with some initial value y_0 for y and obtains x_0 by taking a random draw from the conditional distribution p(x given y = y_0). The sampler then uses x_0 to generate a new value y_1, drawing from the conditional distribution p(y given x = x_0). This process is repeated k times to generate a Gibbs sequence of length k, that is, a sequence of pairs (x_j, y_j) for j = 1, ..., k. To obtain the desired total of m sample points, we sample the chain after a sufficient burn-in (to remove the effects of the initial values) and then only at set intervals (say, every n-th iteration) to reduce the autocorrelation among the retained points. After sufficiently many iterations, the Gibbs sequence converges to the desired target distribution regardless of the initial values.
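
A minimal sketch of this for the classic case where both conditionals are known: a standard bivariate normal with correlation rho, whose conditionals are N(rho * other, 1 - rho^2). The parameter values below are arbitrary.

```python
# A minimal Gibbs sampler sketch for a standard bivariate normal with
# correlation rho; both conditionals are known normals.
import numpy as np

rng = np.random.default_rng(42)
rho, k, burn_in, thin = 0.8, 50_000, 1_000, 10
x, y = 0.0, 0.0
samples = []

for j in range(k):
    # p(x | y) and p(y | x) are both N(rho * other, 1 - rho^2).
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if j >= burn_in and j % thin == 0:   # burn-in, then thinning
        samples.append((x, y))

xs = np.array([s[0] for s in samples])
# For a standard bivariate normal the x-marginal is N(0, 1).
print(f"marginal mean of x ~ {xs.mean():.3f}, variance ~ {xs.var():.3f}")
```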

 

The sampler is readily extended to more than two variables. The power of the approach is evident when we consider that the dimension of the joint density and the number of variables integrated over can both be very large.

 

There are two ways to compute the Monte Carlo estimate of the shape of the marginal density. First, we can use the Gibbs sequence of x_i values directly to approximate the marginal distribution of x; however, this approach can be somewhat inaccurate in the tails of the distribution. Second, we can average the conditional densities p(x given y = y_i), which estimates the tails accurately. Once we have the marginal density of the unknown parameter, we may of course compute other features of the parameter, such as its mean and variance.

 

Inspection of the figure above suggests how the Gibbs sampler can be used to determine the posterior distribution of the vector of Z's (that is, the haplotype pairs carried by the individuals). To be clear, the distribution that we seek assigns to each individual a probability for every pair of haplotypes. We start with pseudo-counts of haplotypes, much as in the initialization step of expectation maximization. Conditioning on these "estimated" haplotype frequencies and the observed genotypes of the individuals, we draw samples of haplotype pairs and assign them to the individuals. Once each individual has a haplotype pair, we condition on the pairs and draw a sample of haplotype frequencies; that is, we draw from the conditional distribution of the true theta given the haplotypes carried by the individuals during this iteration. This process is repeated many times until the sequence converges to give the marginal distribution of the haplotype pairs.
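
A minimal sketch of this scheme, reusing resolutions() from the EM sketch above and assuming a flat Dirichlet prior on theta; published samplers (e.g. Niu et al. 2002, cited below) are considerably more refined.

```python
# A minimal Gibbs sketch for haplotype inference. Requires resolutions()
# from the EM sketch above; the flat Dirichlet prior is our own choice.
from collections import defaultdict
import numpy as np

def gibbs_haplotypes(genotypes, n_iter=2000, seed=7):
    rng = np.random.default_rng(seed)
    haps = sorted({h for g in genotypes for pair in resolutions(g) for h in pair})
    index = {h: i for i, h in enumerate(haps)}
    theta = np.full(len(haps), 1.0 / len(haps))
    tallies = [defaultdict(int) for _ in genotypes]   # posterior over Z_i

    for it in range(n_iter):
        counts = np.ones(len(haps))                   # Dirichlet(1,...,1) prior
        for i, g in enumerate(genotypes):
            pairs = resolutions(g)
            # Draw Z_i given theta and the observed genotype.
            w = np.array([theta[index[h1]] * theta[index[h2]] for h1, h2 in pairs])
            h1, h2 = pairs[rng.choice(len(pairs), p=w / w.sum())]
            counts[index[h1]] += 1
            counts[index[h2]] += 1
            if it >= n_iter // 2:                     # tally after burn-in
                tallies[i][tuple(sorted((h1, h2)))] += 1
        theta = rng.dirichlet(counts)                 # draw theta given the Z's
    return haps, tallies
```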

 

Partition/Ligation

 

As the number of SNPs in a block grows, a point is reached where haplotype inference becomes computationally unwieldy. This problem can be overcome by the following "divide and conquer" method. Since recombination breaks down LD, LD decays with distance. Thus, "sub-haplotypes" can arise within a block: a cluster of adjacent SNPs may be in tight LD with each other, while sites farther away will inevitably be in weaker LD with the cluster. It is thus possible to partition the sequence into groups where the LD among groups is relatively small compared to the LD within groups. If the recombination rate varies sharply within the region, then recombination hotspots may make for natural partition points.

 

Once the partitions are made by the analyst, haplotypes are inferred within each group using one of the previously discussed methods. The groups are constructed to contain sufficiently few SNPs that this task is computationally tractable. The key idea of the ligation step is that the sub-haplotypes can be treated as if they were segregating bases in order to infer haplotypes for the entire block. The end result is a hierarchical arrangement. At the top level are the long haplotypes that characterize the variation of the entire block. These haplotypes are distinguished from each other not by any single nucleotide but by groups of adjacent nucleotides that may be called haplotypes in their own right. For example, suppose that an entire locus consisting of 16 bases is segregating haplotypes I and II. Haplotype I is distinguished from II not by a single base but by being composed of the two 8-base haplotypes A and B, whereas II is composed of haplotypes A and C. It is possible that at the next level the sub-haplotypes are still not distinguished from each other at the level of single nucleotides but rather by even smaller haplotypes. For example, the haplotypes B and C segregating at the second half of the locus may differ in that B is composed of the 4-base haplotypes 1 and 2, whereas C is composed of the 4-base haplotypes 3 and 2. Finally, at the lowest level, the sub-haplotypes are distinguished from each other by the SNPs of which they are composed.

 

Example of Partition/Ligation
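
The original example figure is not reproduced here. As a stand-in, here is a minimal sketch of the partition and ligation steps, reusing em_haplotypes() from the EM sketch above; the frequency threshold and the simple enumeration in the ligation step are our own illustrative choices rather than the published algorithm's details.

```python
# A minimal sketch of partition/ligation. Requires em_haplotypes() from
# the EM sketch above; thresholds and layout are illustrative only.
def partition_ligation(genotypes, split):
    # Partition: phase the left and right halves of the block separately.
    left = em_haplotypes([g[:split] for g in genotypes])
    right = em_haplotypes([g[split:] for g in genotypes])
    # Keep only sub-haplotypes with non-negligible estimated frequency.
    left_haps = [h for h, f in left.items() if f > 0.01]
    right_haps = [h for h, f in right.items() if f > 0.01]
    # Ligation: treat each common sub-haplotype as a single "allele" and
    # enumerate candidate whole-block haplotypes from their combinations;
    # a full implementation would rerun the inference over these candidates.
    return [l + r for l in left_haps for r in right_haps]
```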

 

The International HapMap Project

The observation that haplotypes are often longer than expected because of variation in recombination rates across the genome motivated the International HapMap Project, an effort by a consortium of researchers to catalogue the haplotype variation in different world populations. The researchers wanted to discover what the common genetic variants are, locate them on human DNA, and understand how they are distributed within and between populations around the world. The goals of the HapMap Project are as follows: 1) to define haplotype "blocks" across the genome; 2) to identify a reference set of SNPs with which to tag each haplotype; and 3) to enable unbiased, genome-wide association studies. The figure below gives an indication of the vast scope of this effort. Because alleles are by definition correlated with each other within haplotypes, it is expected that the HapMap will assist researchers in association studies of disease and drug response by reducing the number of SNPs that need to be typed for comprehensive coverage of the genome. All data from the HapMap, including SNP frequencies, genotypes, and haplotypes, are freely available in the public domain.

Figure: Groups in the International HapMap Project.

 

Additional Links and Reading

 

Gillespie, JH (2002). Population Genetics: A Concise Guide (2nd ed). Baltimore, MD: Johns Hopkins University Press.

Just about every genetics textbook or primer gives adequate coverage of linkage disequilibrium, recombination, haplotype structure, and so on. This book has been praised for its conciseness and clarity.

 

Gelman A, Carlin JB, Stern HS & Rubin DB (2003). Bayesian Data Analysis (2nd ed). Boca Raton, FL: Chapman & Hall/CRC.

A standard textbook on Bayesian methods.

 

Wikipedia articles on single nucleotide polymorphism, haplotype, International HapMap Project, expectation-maximization algorithm, Bayesian inference, Gibbs sampling, and SNP array.

 

Templeton AR, Weiss KM, Nickerson DA, Boerwinkle E & Sing CF (2000). Cladistic structure within the human Lipoprotein lipase gene and its implications for phenotypic association studies. Genetics, 156, 1259-1275.

PMID 11063700

 

 

Clark AG (1990). Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7, 111–122.

PMID 2108305

 

Qin ZS, Niu T & Liu JS (2002). Partition-ligation–expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics, 71, 1242-1247.

PMID 12452179

 

Niu T, Qin ZS, Xu X & Liu JS (2002). Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70, 157-169.

PMID 11741196

 

The International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 437, 1299-1320.

PMID 16255080
