CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION

records the allelic partition of the sample, leads to the sampling theory of the infinitely-many-alleles model initiated by Ewens (1972). The Ewens sampling formula is then described, followed by a brief digression into the simulation structure of mutations in the coalescent, both in top-down and bottom-up form. Next, the infinitely-many-sites model is introduced as a simple description of the detailed structure of the segregating sites in the sample. Finally, we return to classical population genetics theory, albeit from a coalescent point of view, to discuss the structure of K-allele models. This in turn develops into the study of the finitely-many-sites models, which play a crucial role in the study of sequence variability when back substitutions are prevalent.

In the next section we digress to present a mathematical vignette in the area of random combinatorial structures. The Ewens sampling formula was derived as a means to analyze allozyme frequency data that became prevalent in the late 1960s. Current population genetic data is more sequence oriented and requires more detailed models for its analysis. Nonetheless, the combinatorial structure of the Ewens sampling formula has recently emerged as a useful approximation to the component counting process of a wide range of combinatorial objects, among them random permutations, random mapping functions, and factorization of polynomials over a finite field. We show how a result of central importance in the development of statistical inference for molecular data has a new lease on life in an area of discrete mathematics.

The final section briefly discusses some of the outstanding problems in the area, with particular emphasis on likelihood methods for coalescent processes.
Some aspects of the mathematical theory, for example, measure-valued diffusions, are also mentioned, together with applications to other, more complicated, genetic mechanisms.

THE COALESCENT AND MUTATION

The genealogy of a sample of $n$ genes (that is, stretches of DNA sequence) drawn at random from a large population of approximately constant size may be described in terms of independent exponential random variables $T_n, T_{n-1}, \ldots, T_2$ as follows. The time $T_n$ during which the sample has $n$ distinct ancestors has an exponential distribution with parameter $n(n - 1)/2$, at which time two of the lines are chosen at random to coalesce,
giving the sample $n - 1$ distinct ancestors. The time $T_{n-1}$ during which the sample has $n - 1$ such ancestors is exponentially distributed with parameter $(n - 1)(n - 2)/2$, at which point two more ancestors are chosen at random to coalesce. This process of coalescing continues until the sample has two distinct ancestors. From that point, it takes an exponential amount of time $T_2$, with parameter 1, to trace back to the sample's common ancestor. For our purposes, the time scale is measured in units of $N$ generations, where $N$ is the (effective) size of the population from which the sample was drawn. This structure, made explicit by Kingman (1982a,b), arises as an approximation for large $N$ to many models of reproduction, among them the Wright-Fisher and Moran models. A sample path of a coalescent with $n = 5$ is shown in Figure 5.1.

Figure 5.1 Sample path of the coalescent for $n = 5$. $T_j$ denotes the time during which the sample has $j$ distinct ancestors. $T_j$ has an exponential distribution with mean $2/[j(j - 1)]$.

From the description of the genealogy, it is clear that the time $\tau_n$ back to the common ancestor has mean
$$E(\tau_n) = \sum_{j=2}^{n} E(T_j) = \sum_{j=2}^{n} \frac{2}{j(j-1)} = 2\left(1 - \frac{1}{n}\right),$$

or approximately $2N$ generations for large sample sizes. Further aspects of the structure of the ancestral process may be found in Tavaré (1984).

Rather than focus further on such issues, we describe how the genealogy may be used to study the genetic composition of the sample. To this end, assume that in the population from which the sample was drawn there is a probability $u$ that any gene mutates in a given generation, mutation acting independently for different individuals. In looking back $r$ generations through the ancestry of a randomly chosen gene, the number of mutations along that line is a binomial random variable with parameters $r$ and $u$. If we measure time in units of $N$ generations, so that $r = [Nt]$ (that is, $r$ is $Nt$ rounded down to the next lower integer), and assume that $2Nu \to \theta$ as $N \to \infty$, then the Poisson approximation to the binomial distribution shows that the number of mutations in time $t$ has in the limit a Poisson distribution with mean $\theta t/2$.

This argument can be extended to show that the mutations that arise on different branches of the coalescent tree follow independent Poisson processes, each of rate $\theta/2$. For example, the total number of mutations $\mu_n$ that occur in the history of our sample back to its common ancestor has a mixed Poisson distribution: given $T_n, T_{n-1}, \ldots, T_2$, the tree has $j$ branches throughout the time $T_j$, so $\mu_n$ has a Poisson distribution with mean $\frac{\theta}{2} \sum_{j=2}^{n} j T_j$. The mean and variance of the number of mutations are given by Watterson (1975):

$$E(\mu_n) = \theta \sum_{j=1}^{n-1} \frac{1}{j} \tag{5.1}$$

and

$$\mathrm{Var}(\mu_n) = \theta \sum_{j=1}^{n-1} \frac{1}{j} + \theta^2 \sum_{j=1}^{n-1} \frac{1}{j^2}. \tag{5.2}$$

We are now in a position to describe the effect that mutation has on the individuals in the sample.
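The distributional claims of this section can be checked numerically. The following is a minimal simulation sketch, not part of the original chapter: it draws the coalescence times $T_n, \ldots, T_2$ as independent exponentials, places mutations on the tree at rate $\theta/2$ per branch, and compares Monte Carlo averages with the mean time to the common ancestor, $2(1 - 1/n)$, and with the mean and variance of the number of mutations in (5.1) and (5.2). All function names (`coalescence_times`, `poisson`, `simulate_sample`, and so on) are hypothetical and chosen for illustration, and the parameter values $n = 5$, $\theta = 2$ are arbitrary assumptions.

```python
import random
import statistics


def coalescence_times(n, rng):
    """Return {j: T_j} for j = n, ..., 2, in units of N generations.

    T_j is exponential with rate j(j - 1)/2, i.e. mean 2 / (j(j - 1)).
    """
    return {j: rng.expovariate(j * (j - 1) / 2) for j in range(n, 1, -1)}


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals in [0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_sample(n, theta, rng):
    """Simulate one genealogy; return (time to common ancestor, mutations)."""
    times = coalescence_times(n, rng)
    tmrca = sum(times.values())
    # The tree has j branches while the sample has j ancestors.
    total_length = sum(j * t for j, t in times.items())
    # Given the tree, mutations are Poisson with mean theta * length / 2.
    mutations = poisson(theta * total_length / 2, rng)
    return tmrca, mutations


def expected_tmrca(n):
    """Mean time to the common ancestor: 2(1 - 1/n)."""
    return 2 * (1 - 1 / n)


def watterson_mean(n, theta):
    """Equation (5.1)."""
    return theta * sum(1 / j for j in range(1, n))


def watterson_variance(n, theta):
    """Equation (5.2)."""
    harmonic = sum(1 / j for j in range(1, n))
    harmonic_sq = sum(1 / j**2 for j in range(1, n))
    return theta * harmonic + theta**2 * harmonic_sq


if __name__ == "__main__":
    rng = random.Random(2024)   # arbitrary seed, for repeatability only
    n, theta, reps = 5, 2.0, 20000
    draws = [simulate_sample(n, theta, rng) for _ in range(reps)]
    tmrcas = [t for t, _ in draws]
    counts = [m for _, m in draws]
    print("mean TMRCA:", statistics.fmean(tmrcas), "theory:", expected_tmrca(n))
    print("mean mutations:", statistics.fmean(counts),
          "theory:", watterson_mean(n, theta))
    print("var mutations:", statistics.variance(counts),
          "theory:", watterson_variance(n, theta))
```

The simulated averages should agree with the theoretical values up to Monte Carlo error, which shrinks as `reps` grows. The sketch uses only the total branch length, which is all that $\mu_n$ depends on; it does not record which lines coalesce, so it cannot be used as written to study the allelic partition of the sample.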