Markov Chain Monte Carlo Approaches to Analysis of Genetic and Environmental Components of Human Developmental Change and G × E Interaction

Behavior Genetics, Vol. 33, No. 3, May 2003 (© 2003)

Lindon Eaves (1,3) and Alaattin Erkanli (2)

Received 12 Sept. Final 10 Oct.

The linear structural model has provided the statistical backbone of the analysis of twin and family data for 25 years. A new generation of questions cannot easily be forced into the framework of current approaches to modeling and data analysis because they involve nonlinear processes. Maximizing the likelihood with respect to parameters of such nonlinear models is often cumbersome and does not yield easily to current numerical methods. The application of Markov Chain Monte Carlo (MCMC) methods to modeling the nonlinear effects of genes and environment in MZ and DZ twins is outlined. Nonlinear developmental change and genotype × environment interaction in the presence of genotype-environment correlation are explored in simulated twin data. The MCMC method recovers the simulated parameters and provides estimates of error and latent (missing) trait values. Possible limitations of MCMC methods are discussed. Further studies are necessary to explore the value of an approach that could extend the horizons of research in developmental genetic epidemiology.

KEY WORDS: Growth curves; Bayesian inference; Gibbs sampling; Markov Chain Monte Carlo methods; twins; longitudinal studies; G × E interaction; hierarchical mixed models.

INTRODUCTION

From relatively modest and controversial beginnings (Jinks and Fulker, 1970), the last quarter-century has seen the emergence of linear structural modeling as little short of an industry in behavioral genetics and genetic epidemiology, to the point where it has supplanted almost all other approaches to the analysis of family resemblance. The reasons for its success are fairly clear.
The approach is flexible, allowing for a very wide range of models to be specified for the effects of genes and environment. Models have been developed and tested for biological and cultural inheritance, various patterns of assortative mating, and developmental change in longitudinal data. One major factor accounting for its appeal has been the ready extension of models for family resemblance to incorporate multivariate measures (Martin and Eaves, 1977). On the assumption of underlying multivariate normality (i.e., probit regression), the approach can be further extended to encompass dichotomous (Fulker, 1973) and other categorical data (Eaves et al., 1978) in a rich variety of complex patterns (Neale and Kendler, 1995). The widespread adoption of structural modeling techniques has been further facilitated by the development and dissemination of remarkably flexible, well-supported, robust, efficient and user-friendly software such as the Mx package (Neale et al., 1999). Over the last decade, structural modeling has provided the unifying platform for teaching the analysis of family resemblance at a widely subscribed series of international workshops that are now credited with more than 500 alumni.

(1) Virginia Institute for Psychiatric and Behavioral Genetics, Department of Human Genetics, Virginia Commonwealth University.
(2) Department of Biostatistics and Bioinformatics, Duke University Medical Center.
(3) To whom correspondence should be addressed at Virginia Institute for Psychiatric and Behavioral Genetics, PO Box, Virginia Commonwealth University, Richmond, Virginia.

© 2003 Plenum Publishing Corporation

The structural modeling approach is typically based on maximizing the likelihood of data with respect to parameters of one or more theoretical models. Thus, the approach yields tests of goodness of fit and standard errors of parameter estimates, minimizing, though not removing, one element of subjectivity in deciding between alternative hypotheses. The transformation of many areas of statistical genetic research accomplished by these approaches over the last three decades has been remarkable. It is due in no small measure to the parallel development of computer hardware and software for numerical analysis, which has made routine analyses that were scarcely conceivable 30 years ago. In few areas has the impact of these developments been greater than in behavioral and psychiatric genetics.

When a method works well and is being productive, there is a danger that limitations may be ignored and significant scientific questions deferred because there are many others that can be answered with the existing methods. Nonlinear models incorporating random genetic and environmental effects comprise one significant arena in which there are serious scientific questions vying for the attention of behavioral geneticists. Examples of issues requiring such models are the analysis of nonlinear developmental change (e.g., genetic differences in growth patterns) and non-additive effects (e.g., epistasis and G × E interaction). Such processes cannot readily be forced into the Procrustean bed of the linear structural models that have, historically, proved so productive. Although it is a relatively easy matter to write the likelihood functions associated with such nonlinear models (see, e.g., Eaves et al., 1986), obtaining parameter estimates has been a laborious process because the likelihood involves integration over the unknown random effects in the model.
For example, in a (multivariate) nonlinear model, such as a survival model or growth curve model, the likelihood of a twin pair depends on the values of the random genetic and environmental effects of the individual twins for all the parameters of the growth curve model (e.g., initial value, asymptote, slope, inflection point). The likelihood of the pair requires that the likelihood of the phenotype, given particular (unknown) genetic and environmental effects, be integrated over all values of the latent genetic and environmental variables. This integration has as many dimensions as there are latent genetic and environmental effects in the model. Thus, a four-parameter growth curve model in which the growth parameters of twins are each influenced by separate genetic and environmental factors requires integration over [value missing] dimensions. Even though this is numerically tractable in simple cases through methods such as Gaussian quadrature, it rapidly becomes tedious in practically important contexts, such as the analysis of nonlinear growth curves in twins, because of the relatively large number of dimensions and the lack of robust, accurate and relatively general adaptive procedures for numerical integration.

The problem of multidimensional integration is encountered in the context of hierarchical mixed models in which there are random individual differences in parameters of a nonlinear model for a response variable (e.g., Lindstrom and Bates, 1990). Even in those cases where estimates can be obtained numerically, the additional computation required to obtain estimates of confidence intervals is prohibitive. In this paper, we introduce a fairly general Markov Chain Monte Carlo (MCMC) framework of the Bayesian methodology (see, e.g., Gilks et al., 1996) to the analysis of twin data that is free of many of the limitations inherent in some of the current widely used methods.
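The integration over latent effects described above can be illustrated in one dimension. The following is a minimal sketch, not taken from the paper: it assumes a toy model with a single random effect u ~ N(0, var_u) and a phenotype y that is normal around u with variance var_e, and it approximates the marginal likelihood by averaging the conditional likelihood over simulated values of u. The function names and the numerical values are hypothetical. The toy model is linear, so a closed form is available for comparison; in the nonlinear models discussed here no such closed form exists.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginal_likelihood_mc(y, var_u, var_e, n_draws, rng):
    """Average the conditional likelihood p(y | u) over draws u ~ N(0, var_u)."""
    total = 0.0
    for _ in range(n_draws):
        u = rng.gauss(0.0, math.sqrt(var_u))  # simulated latent random effect
        total += normal_pdf(y, u, var_e)      # likelihood of y given that effect
    return total / n_draws


rng = random.Random(1)                 # fixed seed so the sketch is reproducible
y, var_u, var_e = 0.7, 1.0, 0.5        # hypothetical values, for illustration only
approx = marginal_likelihood_mc(y, var_u, var_e, 50_000, rng)
exact = normal_pdf(y, 0.0, var_u + var_e)  # closed form, available only in the linear case
print(approx, exact)
```

With one latent variable a plain Monte Carlo average is adequate. The difficulty the text describes arises because each additional latent genetic or environmental effect adds a dimension to the integral.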
In addition to providing numerical estimates of the parameters of complex models for twin resemblance, the MCMC approach provides, at little extra computational cost, information that is more difficult to obtain through the classic likelihood-based approaches, such as the joint posterior probability distribution of a latent trait for a twin pair. The use of noninformative prior distributions provides inferences that are comparable with the classic MLE approaches. In addition, if prior information is available, it too can be incorporated into the existing models through the use of informative prior distributions; such information cannot be utilized with ML estimation.

At one level, we may think of MCMC as an approach to parameter estimation that uses a carefully constructed sequence of Monte Carlo simulations to construct the integrals that are so elusive in ML approaches. However, the way the sequence of simulations is constructed provides a wealth of information and possible insight that is not available through the usual approaches to numerical integration. Although MCMC has been used quite widely in the analysis of linkage and pedigree data (see, e.g., Thomas and Gauderman, 1996), it has not been applied much outside that arena. Part of the reason for this is almost certainly the convenience of linear structural modeling and the conceptual impact of its applications in a wide range of contexts over the last quarter century. Do et al.'s (2000) application of

MCMC to survival analysis of twin data is a notable exception that also reflects the relative intractability of nonlinear genetic models to the usual likelihood-based approaches.

MARKOV CHAIN MONTE CARLO METHODS

Since their inception in physics in the early 1950s (Metropolis et al., 1953; Hastings, 1970), MCMC methods have been used in a variety of areas requiring complex statistical modeling, such as image analysis (Besag and Green, 1993; Gelfand and Smith, 1990; Geman and Geman, 1984) and generalized linear mixed models (Clayton, 1996; Zeger and Karim, 1991). Although the approach has been implemented in some genetic contexts (see, e.g., Shoemaker et al., 1999, and citations in Thomas and Gauderman, 1996), there have been relatively few published examples of its application to twin data (e.g., Burton et al., 1999; Do et al., 2000). The latter application was developed appropriately in the context of survival analysis, which has hitherto proved numerically cumbersome (Meyer and Eaves, 1988; Meyer et al., 1991) for the same reasons that compelled us to explore the application of MCMC methods to the problems described here.

Apart from apparently solving some practical problems, MCMC methods commend themselves intellectually because they provide a unifying framework within which many complex problems can be analyzed using generic software (Gilks et al., 1996, p. 1). Indeed, the feasibility of applying MCMC methods to behavior-genetic applications is enhanced enormously by the free dissemination of a Windows version (WinBUGS 1.3) of the program BUGS ("Bayesian Inference Using Gibbs Sampling"; Spiegelhalter et al., 2000) that we used in all the applications described here. Superficially, we may regard MCMC methods as one further way of obtaining estimates of parameters of a genetic model when the usual approaches of maximum likelihood are too tedious.
Thus, MCMC methods use Monte Carlo methods to approximate the integrals that have proved tiresome in likelihood-based approaches, such as obtaining the normalizing constants in Bayesian analyses and the marginal likelihood functions in generalized linear mixed-effects models. However, from another perspective, MCMC's foundation in a Bayesian approach to modeling raises more profound questions about the theoretical framework within which the task of modeling is conceived (see, e.g., Gelman et al., 1995; Sivia, 1996). In the classic ML framework, we seek the values of model parameters that maximize the likelihood of the data, given certain assumptions about the distribution of the data and the underlying parameters. Within the Bayesian framework, we seek the joint posterior distribution of the unknown parameters given the data under, it is hoped, appropriate assumptions about the prior distribution of the parameters and data. Whether or not we choose explicitly to adopt a Bayesian paradigm for genetic modeling, it turns out that some of the algorithms developed within the Bayesian context have advantages for exploring a range of models that appear less tractable within the conventional context of maximum likelihood.

Briefly, the MCMC approach constructs a Markov Chain on the (parameter) space of unknown quantities such that, starting with a series of trial values (e.g., means, regressions, genetic variances, genetic and environmental effects, etc.), after an initial series of iterations (the "burn-in") successive iterations represent samples from the unknown joint distribution. This is the so-called stationary distribution of the Markov Chain, and in the Bayesian context, it is the joint posterior distribution of all the parameters. We note that in this context, the parameters do not just include the usual parameters of the structural model (means, genetic and environmental variances, etc.)
but also the latent genetic and environmental deviations of the individual twins and any missing values. The MCMC iterations are furnished by simulating values for the unknown parameters, conditional upon the given data, using specially constructed transition probability kernels (also called proposal distributions) that are not only easy to simulate from, but also guarantee the convergence (in distribution) of the simulated Markov Chain to the true joint posterior distribution. After a burn-in period, the probability distribution of the unknown quantity, and its moments such as the expected value of any function, are obtained to any desired degree of precision by taking an (ergodic) average of the successive values over a sufficiently large number of iterations.

The Gibbs sampler (Creutz, 1979; Gelfand and Smith, 1990; Geman and Geman, 1984; Ripley, 1979) is perhaps the most popular MCMC approach to constructing a Markov chain with the desired properties. In the Gibbs sampling approach, the Markov kernels consist of the conditional distributions of each variable of interest given all the other variables. As an example, consider random variables $X$ and $Y$ having an unknown joint distribution $[X, Y]$, and assume further that each of the conditional distributions $[X \mid Y]$ and $[Y \mid X]$ is available in analytically closed form. Here the Markov kernels are the conditionals $[X \mid Y]$ and $[Y \mid X]$. So, if $X_0$

and $Y_0$ are initial values, then the Markov Chain is constructed on the $XY$ space by simulating successively a sequence $\{X_r, Y_r\}$ from the known conditionals $[Y \mid X_{r-1}]$ and $[X \mid Y_{r-1}]$ for $r = 1, 2, \ldots, R$. It can be shown that the joint distribution of the sequence $\{X_r, Y_r\}$ converges to the joint distribution $[X, Y]$ as $R$ tends toward infinity, as long as these conditionals are bona fide probability distributions. Thus, for a sufficiently large $R$, the $\{X_r, Y_r\}$ resemble draws from the true joint distribution $[X, Y]$.

Within the Bayesian context, $X$ and $Y$ are usually unknown parameters (e.g., the mean and variance), $[X \mid Y]$ and $[Y \mid X]$ (suppressing the conditioning on the data) are the conditional posterior distributions, and $[X, Y]$ is the joint posterior distribution of $X$ and $Y$. For example, the marginal posterior distribution $[X]$ of $X$ can be approximated by the Monte Carlo integration, for each $X = x$,

$$[x] \approx \frac{1}{R} \sum_{r} [x \mid Y_r],$$

where the summation is over $r = 1, 2, \ldots, R$. Similarly, the expectation of a function $g(X)$ is approximated by the Monte Carlo average

$$E\,g(X) \approx \frac{1}{R} \sum_{r} g(X_r).$$

Note that a desired byproduct of the MCMC approach is that not only the expectations, but the entire posterior distribution of $g(X)$ is approximated by using the sequence $\{g(X_r)\}$. There are also several other ways, such as general Metropolis-Hastings algorithms, to construct an MCMC sampler; in fact Gibbs sampling is a special case of these general algorithms. A more thorough account of the approach, which is beyond the scope of this paper, may be found in Tierney (1994), Gilks et al. (1996), and Brooks (1998). A recent paper by Besag (2000) reviews several MCMC approaches and provides a comprehensive list of references.
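The two-variable Gibbs scheme can be sketched for a case in which both conditionals are known in closed form. The example below is an illustration under stated assumptions rather than anything used in this paper: the target is taken to be a standard bivariate normal with correlation rho, for which X given Y = y is N(rho y, 1 - rho^2) and likewise for Y given X. The sketch uses the common systematic-scan form in which Y is drawn given the X just updated. The function name, starting values, seed and run lengths are all hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, burn_in, x0, y0, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    x, y = x0, y0
    draws = []
    for r in range(burn_in + n_iter):
        x = rng.gauss(rho * y, sd)       # draw from [X | Y]
        y = rng.gauss(rho * x, sd)       # draw from [Y | X], using the new x
        if r >= burn_in:                 # discard the burn-in iterations
            draws.append((x, y))
    return draws


rng = random.Random(2003)
# Deliberately poor starting values, far from the bulk of the target.
draws = gibbs_bivariate_normal(0.6, 20_000, 500, 10.0, -10.0, rng)

# Ergodic averages approximate expectations under the joint distribution.
mean_x = sum(x for x, _ in draws) / len(draws)        # estimates E[X] = 0
mean_xy = sum(x * y for x, y in draws) / len(draws)   # estimates E[XY] = rho
print(mean_x, mean_xy)
```

Because successive draws are correlated, the precision of these averages is lower than that of the same number of independent draws, which is one reason long runs are used after the burn-in.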
Example 1: The Classic Multivariate Genetic Model for MZ and DZ Twins

It is convenient to introduce the approach with an example familiar to those with practice in linear structural models, that is, the estimation of the additive genetic and within-family environmental covariance matrices from data on a pair of variables measured on samples of 100 MZ and 100 DZ twins. The data were simulated using SAS on the assumption of multivariate normality. Table I summarizes the population parameter values used to generate the simulated data and the observed mean vectors and 4 × 4 covariance matrices for the MZ and DZ twins. The table also shows the maximum-likelihood estimates of the MZ and DZ means and the additive genetic and within-family environmental covariance matrices obtained by fitting the structural model to the raw data vectors in Mx (Neale et al., 1999).

Table I. Population Parameter Values Used in Simulation of Bivariate Twin Data and Values Realized Using Mx for ML Estimation (n = 100 MZ and 100 DZ Pairs)
Columns: Parameter; ML estimate; Population value.
Parameters listed: mu[1], mu[2], sigma2.g[1,1], sigma2.g[1,2], sigma2.g[2,2], sigma2.e[1,1], sigma2.e[1,2], sigma2.e[2,2]. [values missing]

The same data were analyzed by MCMC using WinBUGS to implement the Gibbs sampler. The first step is to construct a directed graph (Fig. 1) that expresses the logical and stochastic process by which the twin data are generated. WinBUGS provides a graphical user interface (GUI) that allows most of the elements of the model to be specified graphically as "doodles." Subsequently the graphical model may be automatically translated into a script with syntax similar to S-Plus, and compiled with object data prior to initialization with trial values and execution. The automated code may be modified manually to carry out side-computations that cannot be implemented directly in the GUI, such as the computation of standardized estimates.
The underlying model follows that of Jinks and Fulker (1970) in recognizing that each twin of a DZ pair is realized, in the absence of shared environmental effects, by sampling three separate (normal) variables: a between-families genetic component, g2; a within-families genetic effect, g1; and a within-families environmental effect, e1. The expected variances of the between- and within-family genetic effects may be parameterized in terms of additive and non-additive genetic effects. If genetic effects are additive and mating is random, the variances of g2 and g1 are equal. The process that generates MZ twins is similar, but the twins of a pair share the same within- and between-family genetic effects (so the genetic effects of MZs are identical). Each MZ twin still receives its own unique within-family environmental effect, e1. The logic of the genetic model follows closely the underlying process by which genetic and environmental effects arise. The between family genetic effects

reflect the average genetic differences between parents and the within-family effects those of genetic segregation. The environment within families is thus a further process that differentiates between individuals of known genetic constitution. This natural logic is mirrored in the way in which BUGS models are written for twin data (Fig. 1).

Fig. 1. Graphical model for random additive genetic and within-family environmental effects on multivariate MZ and DZ twin data.

Figure 1 represents many, but not all, of the features of a BUGS model. A fuller description and examples are provided in the manuals accompanying the downloaded WinBUGS software (Spiegelhalter et al., 2000). Constants are represented by rectangles and variables by ellipses. Stochastic dependence is represented by a single one-headed arrow. The distribution of stochastic nodes is not shown in the figure, but is selected from a menu of options available when building the doodle interactively. The rectangles ("plates") represent the domain over which subscripts vary in subscripted variables and translate into for loops in the code derived from the doodle.

The graphical model is in two parts. The left half of the diagram represents the process by which data on DZ pairs are generated. The element ydz[i,j,k] contains the observed value of the jth twin of the ith pair on the kth variable. The model allows generally for N variables. Similarly, the corresponding typical datum for an MZ individual is stored in ymz[i,j,k], represented on the right of the diagram. Elements such as ydz[i,j,k] are termed nodes and are represented in ellipses in the BUGS GUI. The observed values of the twins depend stochastically on prior nodes whose interrelationships are specified in the graph.
Starting at the top, we have the population mean vector, mu. It is assumed initially that these means have a relatively uninformative prior distribution and are sampled from a multivariate normal distribution with mean vector mean = (0,0) and precision matrix precis. The precision matrix is equal to the inverse of the prior 2 × 2 covariance matrix of the population means. These so-called meta-parameters are typically chosen, in our application, to reflect our relative lack of prior information about the population parameters. This amounts to making the diagonal elements of precis fairly small (we used [value missing]). These prior values are supplied as data to BUGS. Notice that with noninformative priors, Bayesian inference usually produces results that are comparable to those obtained using MLE, since the joint posterior distribution is dominated by the likelihood function relative to the prior distribution.

The next layer of the process in DZ twins, which corresponds to the way twin pairs are produced, comprises generation of the means of the ndz twin pairs. The node g2dz[i,k] denotes the expected value of the

mean of the ith pair on the kth variable. The vector of between-family effects is generated from the multivariate normal distribution with mean vector mu and precision matrix tau.g. The matrix tau.g is the inverse of half the additive genetic covariance matrix. Our initial model assumes that this matrix is of general positive definite form. It is possible to devise models that reflect other hypotheses about the genetic covariance structure. In the GUI notation of WinBUGS, a single-headed arrow represents stochastic dependence. This should not be confused with the path coefficient familiar in LISREL formulations. For simplicity we ignore the effects of the shared environment in this early example, but there is no difficulty adding an environmental component to the differences between families. Our choice of g2 to denote the between-pair genetic effects emphasizes the parallel between the development of the MCMC model for twin data and the early formulation of the variance components model for twin data by Jinks and Fulker (1970). The fact that each ith pair of the set of ndz DZ pairs has its own genetic deviation is represented by the large rectangle ("plate") encompassing the g2dz[i,1:n]. The plate corresponds to a for loop in many programming languages.

The expected values of the individual DZ twins, g1dz[i,j,k], are obtained by adding bivariate normal within-pair genetic deviations to the expected values of the corresponding pair. The genetic deviations are again sampled from a multivariate normal distribution with covariance matrix equal to half the additive genetic covariance matrix. Thus, in our GUI, the individual expected values of the DZ twins depend stochastically on the g2dz[i,k] and the genetic precision matrix, tau.g. The fact that the single-headed arrow ("edge") links both g2dz and g1dz to tau.g automatically imposes the constraint that the genetic variances are equal within and between DZ (sib) pairs when gene action is additive.
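The relation between the genetic precision matrix and the additive genetic covariance matrix can be made concrete with a small sketch. It is not the BUGS code used here; it simply applies the statement that tau.g is the inverse of half the additive genetic covariance matrix, so that the covariance matrix is twice the inverse of tau.g. The helper names and the numerical matrix are hypothetical.

```python
def invert_2x2(m):
    """Invert a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def additive_covariance_from_precision(tau_g):
    """Recover the additive genetic covariance matrix as 2 * inverse(tau_g)."""
    half_a = invert_2x2(tau_g)           # inverse of tau_g is half the covariance
    return [[2.0 * v for v in row] for row in half_a]


# Hypothetical additive genetic covariance matrix, for illustration only.
sigma2_g = [[4.0, 1.0], [1.0, 2.0]]

# Precision used for both the between- and within-pair genetic deviations.
tau_g = invert_2x2([[v / 2.0 for v in row] for row in sigma2_g])

recovered = additive_covariance_from_precision(tau_g)
print(recovered)
```

The round trip from covariance to precision and back mirrors the kind of side-computation that has to be added by hand when the quantity of interest is the covariance matrix rather than the precision the sampler works with.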
Finally, the individual twin observations (ydz[i,j,k]) are represented as samples from the multivariate normal distribution with expected values equal to the genetic effects, g1dz, and precision equal to the inverse, tau.e, of the within-family environmental covariance matrix. The fact that the individual twins are nested within pairs is represented in the graph by the second plate nested within the outer plate. The inner plate contains the expected individual values, g1dz, and the observations, ydz, but excludes the g2dz, which are only captured within the domain of the outer plate.

The graph is completed by representing, on the right-hand side of Figure 1, the process that generates pairs of MZ twins. The MZ part of the graph differs only in the fact that both the between-family (g2) and within-family (g1) genetic deviations contribute to expected values of the twin pair in MZ twins, because they have identical alleles segregating from their parental genotypes. The individual MZ twin observations are assumed to be multivariate normal with expected values equal to the pair means (g1mz[i,k]) and dispersion equal to the within-family environmental covariance matrix. The fact that both between- and within-family genetic deviations contribute to differences between MZ pairs is indicated by excluding the g1mz effects from the inner plate in the case of MZ pairs (Fig. 1). Note that the nodes on the MZ part of the diagram depend stochastically on (have arrows coming from) the same precision nodes as those for DZ twins. By putting MZ and DZ twins on the same diagram in this way, the implied equalities of the components of covariance in MZ and DZ twins are automatically imposed.

The BUGS code generated automatically from the graph was amended to invert the genetic and environmental precision matrices to yield the additive genetic and within-family environmental covariance matrices. The code is supplied in Appendix 1.
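The generative process encoded in the graph can be sketched as a forward simulation. The example below is a univariate simplification written for illustration, not the code of Appendix 1: a between-family genetic level g2 is drawn around the mean with variance equal to half the additive variance, within-family genetic values g1 are drawn around g2 with the same variance (one value shared by MZ twins, separate values for DZ twins), and each observation adds an independent environmental deviation. The function names, seed and parameter values are hypothetical.

```python
import math
import random


def simulate_pair(zygosity, mu, var_a, var_e, rng):
    """Simulate one twin pair under a univariate additive (AE) model."""
    sd_half_a = math.sqrt(var_a / 2.0)
    sd_e = math.sqrt(var_e)
    g2 = rng.gauss(mu, sd_half_a)                 # between-family genetic level
    if zygosity == "MZ":
        shared = rng.gauss(g2, sd_half_a)         # one genetic value for both twins
        g1 = (shared, shared)
    elif zygosity == "DZ":
        g1 = (rng.gauss(g2, sd_half_a), rng.gauss(g2, sd_half_a))
    else:
        raise ValueError("zygosity must be 'MZ' or 'DZ'")
    # Each twin receives an independent within-family environmental deviation.
    return tuple(rng.gauss(g, sd_e) for g in g1)


def covariance(pairs):
    """Sample covariance between first and second twins."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    return sum((p[0] - m1) * (p[1] - m2) for p in pairs) / (n - 1)


rng = random.Random(7)
mu, var_a, var_e = 15.0, 2.0, 1.0                 # hypothetical parameter values
mz = [simulate_pair("MZ", mu, var_a, var_e, rng) for _ in range(20_000)]
dz = [simulate_pair("DZ", mu, var_a, var_e, rng) for _ in range(20_000)]
cov_mz = covariance(mz)                           # expected to be near var_a
cov_dz = covariance(dz)                           # expected to be near var_a / 2
print(cov_mz, cov_dz)
```

Under these assumptions the MZ covariance approaches the additive variance and the DZ covariance approaches half of it, which is the equality constraint that the shared precision node imposes in the graph.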
Electronic copies of this and other doodles, code and simulated data used in this paper can be obtained from the first author.

The MCMC simulation was started with trial values of (15, 15) for the mean vector, the identity matrix, I, for the genetic precision matrix (tau.g) and 6I for the environmental precision matrix (tau.e). These values were selected to be of the right order but sufficiently far from the ML estimates as to pose a realistic challenge to the MCMC algorithm. The MCMC algorithm converged very rapidly, but a 2000-iteration burn-in preceded a further 5000 iterations that were sampled to characterize the stationary distribution. With the relatively small data set and simple model, the CPU time was very short on a laptop computer, though significantly slower than Mx applied to the raw observations. WinBUGS offers a rich range of tools for monitoring the progress of the algorithm, including active traces of the iterations of any desired nodes, histories of selected bands of iterations, kernel density plots and a variety of summary statistics.

Table II gives the statistical summaries of the first 5000 MCMC iterations after a 2000-iteration burn-in. The first part of the table corresponds to Table I and gives the MCMC values of the parameters that were also obtained by ML using the conventional structural

modeling approach. Note that the agreement between the ML estimates and those obtained by averaging the sequence of Monte Carlo simulations is very close. Table II, however, gives a number of other statistics that are not readily available from the ML algorithm. The MCMC algorithm gives the median parameter values and the upper and lower 2.5% confidence limits. Mx experienced difficulty in obtaining these confidence intervals for this small sample. The standard deviations of the parameters are also obtained directly from the MCMC algorithm. These may be time-consuming to calculate numerically within the ML algorithm, since they require numerical computation of the Hessian matrix, which usually requires a series of additional steps after the ML estimates have been obtained.

Table II. Summary Statistics for 5000 MCMC Iterations of Bivariate AE Model after 2000 Iteration Burn-in
Columns: Node; Mean; SD; MC error; 2.5%; Median; 97.5%.
Nodes listed: deviance; mu[1], mu[2]; sigma2.g[1,1], sigma2.g[1,2], sigma2.g[2,2]; sigma2.e[1,1], sigma2.e[1,2], sigma2.e[2,2]; g1mz[i,k] for i = 1 to 5 and k = 1, 2; g1dz[i,j,k] for i = 1 to 5, j = 1, 2 and k = 1, 2. [values missing]
Note: The nodes are defined as follows: deviance = minus twice the log-likelihood; mu[1], mu[2] = means of first and second variable; sigma2.g = elements of 2 × 2 genetic covariance matrix; sigma2.e = elements of 2 × 2 environmental covariance matrix; g1mz[i,k] = genetic effect on kth variable in ith MZ twin pair ("genetic score"); g1dz[i,j,k] = genetic effect on kth variable in jth twin of ith DZ pair.
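Most of the summary columns reported for each node can be computed directly from the retained draws. The sketch below is an illustration of that idea, not a description of how WinBUGS computes its output: it returns the mean, standard deviation, median and the 2.5% and 97.5% points of a list of draws, using linear interpolation between order statistics (an assumption; other quantile conventions exist). The MC error column is omitted because it depends on the autocorrelation of the chain. The function names and the toy input are hypothetical.

```python
import statistics


def quantile(sorted_draws, p):
    """Linear-interpolation quantile of already sorted draws, with 0 <= p <= 1."""
    pos = p * (len(sorted_draws) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    frac = pos - lo
    return sorted_draws[lo] * (1.0 - frac) + sorted_draws[hi] * frac


def summarize(draws):
    """Posterior summaries of one monitored node from its post-burn-in draws."""
    s = sorted(draws)
    return {
        "mean": statistics.mean(s),
        "sd": statistics.stdev(s),
        "2.5%": quantile(s, 0.025),
        "median": quantile(s, 0.5),
        "97.5%": quantile(s, 0.975),
    }


# Toy input standing in for the sampled values of a single node.
summary = summarize([float(v) for v in range(1, 102)])
print(summary)
```

The same function applies unchanged to structural parameters and to latent genetic scores, since the sampler treats both as unknown quantities.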

The flexibility of the MCMC method, however, is illustrated by the fact that it yields, almost as a byproduct, estimates of the individual genetic deviations of the MZ and DZ twins, together with a variety of statistics for evaluating their precision. The fact that MCMC is a carefully constructed simulation algorithm means that, at each iteration, the latent genetic and environmental effects of individual twins are simulated conditional on their phenotypes and the parameters of the genetic model. These latent scores and, indeed, any missing values are thus treated like every other parameter in the MCMC model and, after the algorithm has converged, can be sampled and summarized to provide mean values and estimates of error for the genetic and environmental deviations of individual twins. In multivariate genetic analyses, MCMC can be expected to evaluate genetic or environmental factor scores (see, e.g., Molenaar et al., 1990) and their confidence intervals at the same time as fitting the genetic factor model, at very little extra computational cost. We regard the availability of estimates of individual scores and missing values as a significant bonus of the Bayesian approach. These estimates are given for the first 10 MZ and DZ pairs in Table II.

We note in passing that, although we did not simulate missing data in this basic example, in principle the MCMC method can also routinely simulate missing values at no additional computational cost. However, in the current version of BUGS, the simulation of multivariate normal deviates cannot accommodate missing values. The method thus appears to accomplish in a relatively straightforward manner much of what can be achieved by other tailor-made algorithms.
Example 2: Nonlinear Growth Curve Models for Twin Data

The graph in Figure 1 is modified simply to take into account a nonlinear developmental model in which there is random genetic and environmental variation in the parameters of a nonlinear growth curve (Fig. 2). The model is extended by introducing a third plate, nested within individual twins, to reflect the repeated measures of the outcome variable. In the figure, the response of the jth twin of the ith DZ pair on the kth occasion is denoted by rdz[i,j,k]. The responses for MZ twins are denoted by rmz[i,j,k]. The model assumes that the responses on the kth occasion have expected values ydz[i,j,k] and ymz[i,j,k]. The residual variances are assumed, for simplicity, to be constant and independent over occasions (i.e., any correlation across time is explained by the underlying developmental model). The residual variance is the inverse of the precision tau.res, which is assumed to be the same for MZ and DZ twins. The expected values of the MZ and DZ responses are assumed to be logical functions of underlying random latent variables whose (multivariate) genetic and environmental structure can be captured by the model in Figure 1.

In this case we assume that the responses are measured at ten equally spaced intervals, indexed by k, and that the temporal change within individuals follows a four-parameter logistic function with random genetic and environmental variation between individuals in each of the four parameters. Thus we write (for DZ twins)

$$\mathrm{rdz}[i,j,k] = y[i,j,1] + \frac{y[i,j,2]}{1 + \exp\{-y[i,j,4]\,(k - y[i,j,3])\}}.$$

The individual parameters thus correspond to initial value, y[i,j,1]; range from initial value to asymptotic value, y[i,j,2]; time of maximum rate of change, y[i,j,3]; and rate of change, y[i,j,4]. Similar parameters are defined for MZ pairs. Other functional forms can be incorporated by relatively minor changes in the code given in Appendix 2.
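The four-parameter logistic curve can be written as a short function to make the role of each parameter visible. This is a sketch for illustration only and is not the code of Appendix 2: the argument names are hypothetical labels for the four parameters, the parameter values are the population means used in the simulation described in this example, and the occasions are assumed, for the purpose of the sketch, to be numbered 1 to 10.

```python
import math


def logistic_growth(k, initial, value_range, t_max_rate, rate):
    """Expected response at occasion k under the four-parameter logistic curve.

    initial      : initial value (lower asymptote)
    value_range  : range from initial value to asymptotic value
    t_max_rate   : time of maximum rate of change
    rate         : rate of change
    """
    return initial + value_range / (1.0 + math.exp(-rate * (k - t_max_rate)))


# Expected curve for one individual over ten assumed occasions, k = 1..10.
curve = [logistic_growth(k, 1.0, 10.0, 5.5, 1.0) for k in range(1, 11)]

# At the time of maximum change the curve is halfway through its range.
midpoint = logistic_growth(5.5, 1.0, 10.0, 5.5, 1.0)
print(curve, midpoint)
```

In the twin model each individual has its own four parameter values, so a function of this kind would be evaluated with that individual's latent parameters rather than with the population means.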
Data were simulated using SAS for 500 pairs each of MZ and DZ twins on the assumption that the four growth curve parameters each had independent additive genetic (A) and within-family environmental (E) components. The heritabilities of the four components were all assumed to be 0.8. Population parameter values were assumed to be as follows: initial values were N[1, 1], ranges were N[10, 1], times of maximum growth were N[5.5, 1], and rates of change were N[1, 0.1]. The residual within-occasion error variance was assumed to be 0.5. The genetic and environmental correlations between the random growth-curve parameters were all assumed to be zero in the original data simulation, but the genetic and environmental covariances were allowed to be free parameters in the Bayesian analysis. Noninformative multivariate normal priors were assumed for the mean response vectors, and Wishart priors (omega.g and omega.e, respectively) were assumed for the precisions of the genetic and environmental covariance matrices. The 10,000 MCMC iterations took approximately 45 minutes on a laptop computer. The statistics for the principal parameters of the model sampled over the last 2000 MCMC iterations are summarized in Table III. The means of the growth curve parameters and error variances are reproduced quite precisely. The heritabilities of the four random components (based on the median values of the genetic and environmental components) are: initial value 0.932; range 0.825; time of maximum change 0.762; and rate of change . The apparent upward bias in the estimated heritability of the initial value is not resolved, but cannot be explained simply by the correlations between twins observed in the simulated initial values since these were (MZ) and (DZ) respectively, close to their expected values. The MCMC estimates of the genetic and environmental covariances are all close to the zero values assumed in the simulation. As in the case of the basic multivariate model (Example 1), the MCMC analysis automatically estimates the genetic effects contributing to growth curve parameters of the individual twins, but illustrative values are not presented here because of space limitations. These individual effects can, for example, be further analyzed to identify interesting subgroups or populations of twins that are hidden in the observed data, which would be extremely difficult to do in a standard MLE approach.

Fig. 2. Graphical model for random additive genetic and environmental effects in four-parameter logistic growth curve model for MZ and DZ twin data.

Example 3: Genotype × Environment Interaction and Correlation

Our third example addresses a challenging problem that has been frustratingly insoluble within the framework of structural modeling or conventional regression analysis, namely that of the genetic control of sensitivity to the environment (G × E interaction). Available methods for the analysis of G × E have depended on stratification of a sample of relatives by values of an environmental covariate (e.g., twins discordant for an environmental factor or siblings stratified by a putative environmental covariate) and testing for heterogeneity among genetic parameters within strata. This approach assumes that the environmental measures are fixed rather than random and independent of genetic effects on the outcome of interest.
Environments are not fixed, and their independence can seldom be guaranteed in practice when, for example, aspects of the family environment that interact with genetic liability are themselves correlated with genetic liability. Our preliminary studies suggest that the MCMC approach may provide a framework for removing this restriction and enhancing our capability for a more rigorous and flexible analysis of G × E interaction. Clearly, there are many different ways in which nonadditive effects, including G × E, may feature in a family study. We consider one example likely to be relevant to future analyses of developmental twin data. We assume that we have measured an outcome, symptoms of depression, for example, that is influenced by genetic and environmental factors that may interact. We further assume that we have measured a covariate (endophenotype) that is itself partly genetically determined and indexes the genetic sensitivity of the individual to a specific environmental factor. Finally, we assume that we have measured a variable that is hypothesized to be a covariate (partly environmental) of the outcome. The regression of the outcome on the environmental covariate varies between subjects as a function of differential (possibly genetic) sensitivity of individuals to the measured environmental covariate. We could, for example, envision a system in which pre-pubertal anxiety was genetically correlated with post-pubertal depression, but was also an index of genetic sensitivity to the impact of life events. The problem may be further complicated by the fact that there may be genetic effects on exposure to life events and that these may correlate genetically with anxiety and/or depression.

Table III. Summary Statistics for Four-Parameter Logistic Model for 10-Occasion Longitudinal MZ and DZ Twin Data from 2000 MCMC Iterations after 8000 Iteration Burn-in

Columns: Node, Mean, SD, MC error, 2.5%, Median, 97.5%
Nodes: mu[1], mu[2], mu[3], mu[4]; sigma2.g[1,1], sigma2.g[1,2], sigma2.g[1,3], sigma2.g[1,4], sigma2.g[2,2], sigma2.g[2,3], sigma2.g[2,4], sigma2.g[3,3], sigma2.g[3,4], sigma2.g[4,4]; sigma2.e[1,1], sigma2.e[1,2], sigma2.e[1,3], sigma2.e[1,4], sigma2.e[2,2], sigma2.e[2,3], sigma2.e[2,4], sigma2.e[3,3], sigma2.e[3,4], sigma2.e[4,4]; sigma2.res

Notes: Subscripts on estimates of means (mu) and genetic and environmental covariances (sigma2.g and sigma2.e) refer to parameters of the four-parameter logistic model as follows: 1 = initial value (1.0); 2 = total growth (final asymptotic value minus initial value, 10.0); 3 = age of maximum rate of change (5.5); 4 = growth rate (1.0). True values are given in parentheses. Sigma2.res is the residual, within-occasion, variance (0.5), assumed to be constant across occasions. Residual effects are assumed to be uncorrelated across occasions.
Figure 3 represents the graph that specifies the model for this particular problem. As before, the model is based on the underlying multivariate linear model represented in Figure 1, with modifications to allow for the unique aspects of the G × E model. As before, the model comprises two similar components representing the relationships among the nodes for DZ (left side of the figure) and MZ twin individuals nested within pairs. In this case, we define three variables, each having its own independent errors. For DZ twins, we define rdz[i,j] as the response variable for the jth twin in the ith DZ pair; the covariates sdz[i,j] and edz[i,j] represent the corresponding index of sensitivity to the environment and the measured environment, respectively. The model for MZ manifest variables follows the same basic pattern, with modifications to allow for the fact that both within-family and between-family genetic differences contribute to differences between MZ pairs (see Fig. 3; cf. Jinks and Fulker, 1970).

Fig. 3. Graphical model for random additive genetic effects and within-family environmental effects on sensitivity to partly heritable environmental effects (G × E interaction in presence of genotype-environment correlation).

The DZ responses have expected values $\mathrm{erdz}[i,j] = \mathrm{g1dz}[i,j,3] + \mathrm{esdz}[i,j]\,\mathrm{edz}[i,j]$ and precision tau.r equal to the inverse of the within-family environmental variance in the responses, sigma2.r. The product term represents the interaction between genes and the measured environment, edz[i,j], as the product of the measured environment and a random coefficient, esdz[i,j], that depends on the genotype of the individual such that $\mathrm{esdz}[i,j] = \mathrm{g1dz}[i,j,2]$. The environmental measures, edz[i,j], have expected values $\mathrm{eedz}[i,j] = \mathrm{g1dz}[i,j,1]$ and precision tau.e equal to the inverse of the variance of the within-family environmental effect on the environmental measure. The sensitivities to the environment, esdz[i,j], are assumed to be indexed by the measured variable sdz[i,j], which has expected values $\mathrm{esindz}[i,j] = \mathrm{beta}[1] + \mathrm{beta}[2]\,\mathrm{esdz}[i,j]$ and precision tau.s equal to the inverse of the within-family environmental variance in the index of sensitivity to the environment, sigma2.s. The coefficients beta[1] and beta[2] are the intercept and slope, respectively, of the regression of the index of sensitivity on the latent genetic sensitivity to the environment. The structure of the three latent genetic variables, g1dz[i,j,k], k = 1, 2, 3, follows that already described above for the basic multivariate linear structural model for MZ and DZ twin data, requiring specification of their mean vector, mu, and the inverse, tau.g, of half the additive genetic covariance matrix, sigma2.g. In its current form, the model assumes that there is no within-family environmental correlation between the three manifest variables except for that introduced between the response variable and the environmental measure by virtue of the regression of the former on the latter. Other parameterizations may be devised that enable this assumption to be relaxed. Appendix 3 reproduces the code generated by WinBUGS, with some minor modifications to allow direct inspection of genetic covariances and environmental variances.
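To make the DZ part of this specification concrete, the following sketch generates data from the same set of relationships. It is an illustration under stated assumptions, not the WinBUGS code of Appendix 3 or the simulation program used here: the latent means are taken to be zero, each manifest variable is given unit total variance before G × E, the three latent genetic variables are drawn independently, and the values of `BETA` and `SIGMA2_S` are hypothetical. The heritabilities (0.4, 0.4, 0.8) are those quoted for the simulation.

```python
import math
import random

H2 = (0.4, 0.4, 0.8)   # heritabilities: environment, sensitivity, outcome before G x E
BETA = (0.0, 1.0)      # assumed intercept and slope for the index of sensitivity
SIGMA2_S = 0.6         # assumed residual variance of the index of sensitivity


def simulate_dz_pair(rng):
    """Simulate one DZ pair; returns two dicts with keys e, s, r."""
    # Family-level genetic component carries half the additive genetic variance.
    family = [rng.gauss(0.0, math.sqrt(h2 / 2.0)) for h2 in H2]
    pair = []
    for _ in range(2):
        # Individual genetic values vary around the family component
        # with the other half of the additive genetic variance.
        g = [f + rng.gauss(0.0, math.sqrt(h2 / 2.0)) for f, h2 in zip(family, H2)]
        e = g[0] + rng.gauss(0.0, math.sqrt(1.0 - H2[0]))   # measured environment
        es = g[1]                                           # latent genetic sensitivity
        s = BETA[0] + BETA[1] * es + rng.gauss(0.0, math.sqrt(SIGMA2_S))  # index
        r = g[2] + es * e + rng.gauss(0.0, math.sqrt(1.0 - H2[2]))  # response with G x E
        pair.append({"e": e, "s": s, "r": r})
    return pair


rng = random.Random(2003)
pairs = [simulate_dz_pair(rng) for _ in range(20000)]
```

Splitting the additive genetic variance equally between a family component and an individual deviation is what produces the DZ genetic correlation of one half; an MZ version would instead share the whole genetic value within a pair.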
The population parameters used to generate the simulated data are provided in Table V. The three latent variables are all assumed to be independent with unit total variance. The assumption of independence is made in the simulation to simplify inspection of the simulated data points but not in the data analysis. Thus, although the simulated data in this first example do not include G-E correlation, the analysis allows for it through the estimation of the positive-definite genetic covariance matrix. If the analysis is correct, the estimated covariances between the latent variables should all be close to zero. The simulations assumed that (additive) genetic effects contributed to random variation in sensitivity to the environment ($h^2 = 0.4$), the environmental index ($h^2 = 0.4$), and the outcome variable prior to G × E interaction ($h^2 = 0.8$). Although the mean sensitivity to the environment is zero, there is considerable variation, some of it genetic, among individuals in sensitivity to the environment. In this case, some individuals will show a positive regression of phenotype on the environment and others a negative response. The latent variables are assumed to be normal in the current formulation.

Table IV summarizes the statistics realized in a simulation of 1000 MZ and 1000 DZ pairs from these population parameters. We note that the simulated values for the measured environment, sensitivity to the environment, and the outcome prior to G × E yield statistics quite close to those expected. The response variable, which includes the effects of G × E interaction, is significantly less correlated in MZ and DZ twins than the raw outcome variable in the absence of G × E. However, we note that G × E interaction adds about 40% to the variance of the raw outcome in the absence of G × E. Table V summarizes the results of the MCMC analysis of the simulated data when allowance is made in the model for random effects on sensitivity to the environment. We note that BUGS correctly recovers the latent genetic covariance structure of the influences on the outcome, the environment, and sensitivity to the environment, and correctly characterizes the relationship between the index of sensitivity to the environment and the latent sensitivity to the environment. The estimated genetic contributions to the environment and to sensitivity are close to the simulated values, and the genetic covariances between the measures are close to zero, as assumed in the simulation. This analysis does not directly address the impact of removing G × E from the model for data in which there is known to be substantial G × E.
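The figure of about 40% can be checked with a short calculation, under assumptions that go slightly beyond what is stated explicitly: the latent sensitivity $b$ has mean zero and variance equal to the genetic variance of sensitivity (0.4), the measured environment $e$ has mean zero and unit variance, the raw outcome $y_0$ has unit variance, and $b$, $e$, and $y_0$ are mutually independent. The symbols $b$, $e$, and $y_0$ are introduced here for illustration only.

$$
\begin{aligned}
\operatorname{Var}(be) &= \mathrm{E}(b^2)\,\mathrm{E}(e^2) - [\mathrm{E}(b)\,\mathrm{E}(e)]^2 = 0.4 \times 1 - 0 = 0.4, \\
\operatorname{Var}(y_0 + be) &= \operatorname{Var}(y_0) + \operatorname{Var}(be) + 2\operatorname{Cov}(y_0, be) = 1 + 0.4 + 0 = 1.4 .
\end{aligned}
$$

Under these assumptions the product term adds 0.4 to a raw variance of 1, that is, 40%. A nonzero mean for the measured environment would add a further $\operatorname{Var}(b)\,[\mathrm{E}(e)]^2$ to the variance of the product.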
Each MCMC iteration in BUGS computes an approximate realization of the posterior distribution of the deviance (minus twice the logarithm of the likelihood) that can be traced and summarized over iterations in exactly the same way as any other node in the model. Subsequently, different models can be compared based on their deviance distributions. Alternatively, one can use the DIC (Deviance Information Criterion), which is a penalized version of the posterior mean of the deviance function (Spiegelhalter et al., 1998). For the full G × E model, the average deviance over 5000 iterations (Table V) was . The crucial element of our G × E model is the random component in sensitivity to the environment. We thus recast the model of Figure 3 to retain a fixed effect of the measured environment (Fig. 4) while removing the random effect. After a 2000-iteration burn-in we obtained the revised parameter estimates also given in Table V. The average deviance was , which is significantly higher than the value obtained under the full model, indicating strong support for G × E in the simulated example.

Table IV. Summary Statistics for Simulated Data on 1000 MZ and 1000 DZ Twins in Presence of G × E and Genetic Effects on Environmental Index (MZ Correlations Are in the Upper Triangle and DZ Correlations in the Lower Triangle)

Correlations among: V11, V21, V31, S1, Resp1, V12, V22, V32, S2, Resp2
Additional rows: MZ Mean, MZ S.D., DZ Mean, DZ S.D.

Note: Variables are labeled as follows: V11 = Measured environment on first twin; V21 = Index of environmental sensitivity of first twin; V31 = Phenotype of first twin in absence of G × E (not used in analysis); S1 = genetic sensitivity of first twin to the environment; Resp1 = Phenotype of first twin. V12, V22, V32, S2, and Resp2 are the corresponding variables for the second twin.
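The deviance-based comparison described above can be sketched on a toy problem. The example below is not the twin model: it assumes normal data with unknown mean and known unit variance under a flat prior, so that the posterior of the mean can be drawn directly, and it uses the usual definition of DIC as the posterior mean deviance plus a penalty pD, where pD is the mean deviance minus the deviance at the posterior mean. All function and variable names are hypothetical.

```python
import math
import random
import statistics


def deviance(data, mu):
    """Minus twice the log-likelihood for N(mu, 1) data."""
    n = len(data)
    return n * math.log(2.0 * math.pi) + sum((x - mu) ** 2 for x in data)


def dic(data, mu_draws):
    """Return (mean deviance, pD, DIC) from posterior draws of mu."""
    d_bar = statistics.fmean([deviance(data, mu) for mu in mu_draws])
    d_at_mean = deviance(data, statistics.fmean(mu_draws))
    p_d = d_bar - d_at_mean          # effective number of parameters
    return d_bar, p_d, d_bar + p_d   # DIC = mean deviance + penalty


rng = random.Random(7)
data = [rng.gauss(1.0, 1.0) for _ in range(50)]
# With a flat prior, the posterior of mu is N(sample mean, 1/n); draw from it directly.
x_bar = statistics.fmean(data)
mu_draws = [rng.gauss(x_bar, 1.0 / math.sqrt(len(data))) for _ in range(5000)]
d_bar, p_d, dic_value = dic(data, mu_draws)
print(d_bar, p_d, dic_value)
```

In an analysis like the one reported here, the deviance node would be traced for each fitted model and the model with the smaller mean deviance or DIC preferred; the penalty guards against favoring a model only because it has more free parameters.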
