Application of Multinomial-Dirichlet Conjugate in MCMC Estimation : A Breast Cancer Study

Similar documents
Bayesian Inference Bayes Laplace

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Advanced Bayesian Models for the Social Sciences

AB - Bayesian Analysis

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

In addition, a working knowledge of Matlab programming language is required to perform well in the course.

Bayesian and Frequentist Approaches

Ordinal Data Modeling

Bayesian Estimation of a Meta-analysis model using Gibbs sampler

Georgetown University ECON-616, Fall Macroeconometrics. URL: Office Hours: by appointment

A test of quantitative genetic theory using Drosophila effects of inbreeding and rate of inbreeding on heritabilities and variance components #

Estimation of contraceptive prevalence and unmet need for family planning in Africa and worldwide,

Introduction to Bayesian Analysis 1

Mediation Analysis With Principal Stratification

Bayesian Statistics Estimation of a Single Mean and Variance MCMC Diagnostics and Missing Data

Bayesian and Classical Approaches to Inference and Model Averaging

Bayesian Methods LABORATORY. Lesson 1: Jan Software: R. Bayesian Methods p.1/20

How few countries will do? Comparative survey analysis from a Bayesian perspective

Using Test Databases to Evaluate Record Linkage Models and Train Linkage Practitioners

An Exercise in Bayesian Econometric Analysis Probit and Linear Probability Models

Bayesian Variable Selection Tutorial

Bayesian Bi-Cluster Change-Point Model for Exploring Functional Brain Dynamics

WIOLETTA GRZENDA.

Bayesian Joint Modelling of Longitudinal and Survival Data of HIV/AIDS Patients: A Case Study at Bale Robe General Hospital, Ethiopia

Bivariate lifetime geometric distribution in presence of cure fractions

STATISTICAL INFERENCE 1 Richard A. Johnson Professor Emeritus Department of Statistics University of Wisconsin

Multilevel IRT for group-level diagnosis. Chanho Park Daniel M. Bolt. University of Wisconsin-Madison

A Hierarchical Linear Modeling Approach for Detecting Cheating and Aberrance. William Skorupski. University of Kansas. Karla Egan.

A formal analysis of cultural evolution by replacement

Introduction to Survival Analysis Procedures (Chapter)

Response to Comment on Cognitive Science in the field: Does exercising core mathematical concepts improve school readiness?

Kelvin Chan Feb 10, 2015

Introduction. Patrick Breheny. January 10. The meaning of probability The Bayesian approach Preview of MCMC methods

Econometrics II - Time Series Analysis

Bayesian Variable Selection Tutorial

SCHIATTINO ET AL. Biol Res 38, 2005, Disciplinary Program of Immunology, ICBM, Faculty of Medicine, University of Chile. 3

S Imputation of Categorical Missing Data: A comparison of Multivariate Normal and. Multinomial Methods. Holmes Finch.

Bayesian hierarchical modelling

Inference About Magnitudes of Effects

Applying Bayesian Ordinal Regression to ICAP Maladaptive Behavior Subscales

BayesRandomForest: An R

HIDDEN MARKOV MODELS FOR ALCOHOLISM TREATMENT TRIAL DATA

Bayesian Modelling on Incidence of Pregnancy among HIV/AIDS Patient Women at Adare Hospital, Hawassa, Ethiopia

A simple introduction to Markov Chain Monte Carlo sampling

Modeling Change in Recognition Bias with the Progression of Alzheimer s

Bayesian Hierarchical Models for Fitting Dose-Response Relationships

Bayesian Nonparametric Modeling of Individual Differences: A Case Study Using Decision-Making on Bandit Problems

Bayesian growth mixture models to distinguish hemoglobin value trajectories in blood donors

Bayesian Confidence Intervals for Means and Variances of Lognormal and Bivariate Lognormal Distributions

Non-parametric methods for linkage analysis

THE INDIRECT EFFECT IN MULTIPLE MEDIATORS MODEL BY STRUCTURAL EQUATION MODELING ABSTRACT

For general queries, contact

Hidden Markov Models for Alcoholism Treatment Trial Data

How many Cases Are Missed When Screening Human Populations for Disease?

Score Tests of Normality in Bivariate Probit Models

Bayesian Prediction Tree Models

Journal of Mathematical Psychology. Understanding memory impairment with memory models and hierarchical Bayesian analysis

Exploring Trends of Cancer Research Based on Topic Model

Individual Differences in Attention During Category Learning

Bayesian meta-analysis of Papanicolaou smear accuracy

Commentary: Practical Advantages of Bayesian Analysis of Epidemiologic Data

Explicit Bayes: Working Concrete Examples to Introduce the Bayesian Perspective.

Bayesian Analysis of Between-Group Differences in Variance Components in Hierarchical Generalized Linear Models

The assessment of genetic risk of breast cancer: a set of GP guidelines

Families and Segregation of a Common BRCA2 Haplotype

Exploring the Influence of Particle Filter Parameters on Order Effects in Causal Learning

STATISTICAL ANALYSIS FOR GENETIC EPIDEMIOLOGY (S.A.G.E.) INTRODUCTION

Bayesian mixture modeling for blood sugar levels of diabetes mellitus patients (case study in RSUD Saiful Anwar Malang Indonesia)

Revisiting higher education data analysis: A Bayesian perspective

Detection of Unknown Confounders. by Bayesian Confirmatory Factor Analysis

TRIPODS Workshop: Models & Machine Learning for Causal I. & Decision Making

Bayesian Mediation Analysis

BAYESIAN INFERENCE FOR HOSPITAL QUALITY IN A SELECTION MODEL. By John Geweke, Gautam Gowrisankaran, and Robert J. Town 1

A Brief Introduction to Bayesian Statistics

A COMPARISON OF BAYESIAN MCMC AND MARGINAL MAXIMUM LIKELIHOOD METHODS IN ESTIMATING THE ITEM PARAMETERS FOR THE 2PL IRT MODEL

Practical Bayesian Design and Analysis for Drug and Device Clinical Trials

AALBORG UNIVERSITY. Prediction of the Insulin Sensitivity Index using Bayesian Network. Susanne G. Bøttcher and Claus Dethlefsen

MISSING DATA AND PARAMETERS ESTIMATES IN MULTIDIMENSIONAL ITEM RESPONSE MODELS. Federico Andreis, Pier Alda Ferrari *

Probabilistic models for patient scheduling

Inferring relationships between health and fertility in Norwegian Red cows using recursive models

WinBUGS : part 1. Bruno Boulanger Jonathan Jaeger Astrid Jullion Philippe Lambert. Gabriele, living with rheumatoid arthritis

Running Head: BAYESIAN MEDIATION WITH MISSING DATA 1. A Bayesian Approach for Estimating Mediation Effects with Missing Data. Craig K.

Applications with Bayesian Approach

PUBLISHED VERSION. Authors. 3 August As per correspondence received 8/07/2015 PERMISSIONS

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

2014 Modern Modeling Methods (M 3 ) Conference, UCONN

Bayesian Model Averaging for Propensity Score Analysis

Bayesian Data Analysis Third Edition

Statistical Analysis Using Machine Learning Approach for Multiple Imputation of Missing Data

A Nonlinear Hierarchical Model for Estimating Prevalence Rates with Small Samples

Variation in Measurement Error in Asymmetry Studies: A New Model, Simulations and Application

CS229 Final Project Report. Predicting Epitopes for MHC Molecules

Modeling Multitrial Free Recall with Unknown Rehearsal Times

Statistical Tolerance Regions: Theory, Applications and Computation

Bayesian methods in health economics

Treatment effect estimates adjusted for small-study effects via a limit meta-analysis

Approximate Inference in Bayes Nets Sampling based methods. Mausam (Based on slides by Jack Breese and Daphne Koller)

Transcription:

Int. Journal of Math. Analysis, Vol. 4, 2010, no. 41, 2043-2049 Application of Multinomial-Dirichlet Conjugate in MCMC Estimation : A Breast Cancer Study Geetha Antony Pullen Mary Matha Arts & Science College Vemom P.O., Mananthavady - 670645, India geethapullen@gmail.com M. Kumaran Department of Statistical Sciences Kannur University, Kannur mkdss@gmail.com Abstract Studies have been made to investigate the familial risk of breast cancer based on a large case-control study and conclude that a small number of affected cases were due to the presence of a rare autosomal dominant allele, where as a larger number of cases reported were non genetic [7]. The proportion of affected in the population in different age groups mentioned in [7] is modeled by a Dirichlet prior in the present paper. MCMC simulation from a Multinomial-Dirichlet conjugate using R program, calculates the estimates of these proportions. These estimates support the conclusion in [7], that the general population has a high probability of getting affected at an age of 40 + and then at an age of 70 +. Keywords: Multinomial, Dirichlet distributions, MCMC- Gibbs sampling, Rare autosomal dominant allele, R program 1 Introduction Numerous studies have investigated the genetic transmission of breast cancer. In some studies, affected individuals have been defined solely by the presence of breast cancer, while in other studies, breast cancer cases have been divided into subgroups based on menopausal status [2,11], age at onset [5,6], bilaterality [12,13], time interval between first and second primary tumors for bilateral cases [13], and occurrence of other cancers [11]. The findings in Claus

2044 G. A. Pullen and M. Kumaran et. al (1991)[7] lends support to the hypothesis that the distribution of breast cancer cases in the general population includes a small number of genetic cases and a larger number of non genetic cases. The non genetic reasons may be attributed to the occurrence of other cancers like ovarian cancer, many other environmental factors like alcohol, cigarette consumption, food adulteration and many other socio demographic parameters. In the present study, we propose beta distributions to model the proportion θ i of a person in the age group i to be affected. The study, based on a Gibbs sampling from Multinomial-Dirichlet distributions agrees with the conclusion in Claus et. al (1991)[7], that, there is higher probabilities for the age groups 40 + and 70 + to be the ages at onset of disease, for those getting affected. 1.1 Multinomial-Dirichlet distributions in Bayesian analysis The variable X =(X 1,X 2,...,X p ) has the Multinomial distribution X Multinomial(N; θ 1,θ 2,...,θ p ), if θ i =1,ΣX i = N and N P (X N; θ 1,θ 2,...,θ p )= x 1 x 2...x p θ x 1 1 θ x 2 2...θp xp. The marginal distribution of each X i being X i Binomial(N,θ i ); 0 θ i 1, i =1, 2,...,p. The conjugate prior of the Multinomial distribution is the Dirichlet distribution, the multivariate generalization of beta distribution. Hence the parameter vector θ =(θ 1,θ 2,...,θ p ) has a prior distribution given by, θ Dirichlet(α 1,α 2,...,α p ) whose density function is given by g(θ α 1,α 2,...,α p )= Γ(Σα i) (Γαi ) θ(α 1 1) 1 θ (α 2 1) 2...θ p (αp 1),α i > 0 and 0 θ i 1, θi =1. Marginally, θ i Beta(α i, Σ k i α k ),i =1, 2,...,p. The posterior distribution of θ given X is, θ x Dirichlet(x 1 + α 1,x 2 + α 2,...,x p + α p ). 1.2 Gibbs sampling Since the advent of MCMC methods in the early 1990 s, the Bayesian computations have been extended to a large and growing applications. Many authors discuss posterior simulation in detail [3,4,8,9,10]. Gibbs sampling is a popular MCMC algorithm commonly used when the conditional posterior distributions are easy to work with. It permits the analysis of any statistical model possessing a complicated multivariate distribution to be reduced to the

Application of multinomial-dirichlet conjugate in MCMC estimation 2045 analysis of its much simpler and lower dimensional full conditional distributions. Hence, in the particular situation, θ x Dirichlet(x 1 + α 1,x 2 + α 2,...,x p + α p ), where θ =(θ 1,θ 2,...,θ p ). Gibbs sampling produces a sequence of samples, θ (r) i for r =1, 2,...,m and m r=1 i =1, 2,...,p, with the property that Φ(θ i (r) ) converges to E[Φ(θ m i ) x] as m goes to infinity (provided E[Φ(θ i ) x] exists); where Φ(θ i ) is a function of the model s parameters. Gibbs sampling produces this sequence by iteratively sampling from the posterior conditional distributions. Hence, at the (t +1) th iteration, θ (t+1) 1 g(θ 1 θ (t) 2,θ (t) 3,...,θ p (t)) θ (t+1) 2 g(θ 2 θ (t+1) 1,θ (t) 3,...,θ p (t))... θ (t+1) p g(θ p θ (t+1) 1,θ (t+1) 2,...,θ (t+1) p 1 ), and forms, θ (t+1) =(θ 1 (t+1),θ 2 (t+1),...,θ p (t+1) ). 2 Main Result Segregation analysis and goodness of fit tests of genetic models provided evidence for the existence of a rare dominant allele (A) leading to increased susceptibility to breast cancer [7]. The effect of genotype on the risk of breast cancer is shown to be a function of a woman s age. The life time risk of breast cancer for carriers of the abnormal allele (A) was estimated to be nearly 100% [7]. Thus, persons with both the genotypes AA and Aa will be affected by the disease at some time during their life period. They estimated the proportion in the population affected by breast cancer at different age groups as given in table 1: Age (years) Probability 20-29 0.0167 30-39 0.1277 40-49 0.2314 50-59 0.1719 60-69 0.1266 70-79 0.2709 80 + 0.0548 Total 1.0000 Table 1: P roportions

2046 G. A. Pullen and M. Kumaran Because of the extremely low occurrence of breast cancer in women before the age of 20 years, the probability of becoming affected with breast cancer before 20 years is assumed to be zero. In the present study, we considered the seven age groups as seven classes of a multinomial distribution (see table 1). { X (r) 1 if r i = th person has breast cancer at an age belonging to age group i 0 otherwise for i =1, 2,...,7;r =1, 2,...,N. θ i = P (X (r) i =1 X (r) i 1 = 0); i =1, 2,...,7, with X (r) 0 =0. Different age groups are the ages at onset of the disease and age group 80 + can be thought of as being not affected and hence, θ 7 =1 6 i=1 θ i. Also, each X (r) i Bernoulli(θ i ); i =1, 2,...,7, and, θ i = P (occurrence of breast cancer in the age group i ) = E(X (r) i ). Also, X i = r X (r) i Binomial(N,θ i ); i =1, 2,...,7. Hence, X =(X 1,X 2,...,X 7 ) Multinomial(N,θ 1,θ 2,...,θ 7 ). A lot of non genetic parameters may lead to the occurrence of the disease to a non carrier (a person with two normal allele of the gene - aa); for example, the occurrence of ovarian cancer, late marriage, infertility etc. It is well known that the beta (dirichlet) is the conjugate prior of the binomial (multinomial). One of the many applications of the beta distribution in Statistics is in the context of bioassay. The most common use of the beta in bioassay is in modeling dispersion of a binomial parameter. A typical application is in quantal bioassay, where success may constitute detection of a tumor of a certain type in a certain organ. In more general settings, such as multinomial response vector, the multivariate generalization of the beta distribution, the Dirichlet, is often used as a model for the response vector; in this case the beta will appear as a model for the marginal distributions. Hence, we let, θ i Beta(α i,β i ); i =1, 2,...,7, where β i = k i α k, and θ =(θ 1,θ 2,...,θ 7 ) Dirichlet(α 1,α 2,...,α 7 ). Hence the prior distribution of θ is g(θ) =Dirichlet(α 1,α 2,...,α 7 ). The posterior distribution of θ x is Dirichlet (α 1 + x 1,α 2 + x 2,...,α 7 + x 7 ). We begin Gibbs sampling by assigning to θ =(θ 1,θ 2,...,θ 7 ), the prior probabilities obtained from Table 1. A random sample is drawn from Multinomial (N; θ 1,θ 2,...,θ 7 ), with N = 100. These values are taken as the first parameter values of beta distributions for the initial values of the iterations. At the end of the iterations, we have a sample of size m from the posterior distribution,

Application of multinomial-dirichlet conjugate in MCMC estimation 2047 arrayed as θ (1) 1 θ (1) 2... θ (1) 7 θ (2) 1 θ (2) 2... θ (2) 7............. θ (m) 1 θ (m) 2... θ (m) 7 Each row is a sample from the joint posterior distribution and columns are samples from the marginal distributions. First column gives samples from g(θ 1 x) and second column gives samples from g(θ 2 x) and so on. Now, discarding the initial values and averaging over the sample size, we get the improved estimates of the parameters. The R program is being used for the estimation. 2.1 Output Analysis Table 2 depicts the output analysis of the study for the seven groups for four runs of the program. Age (years) θ Run1 Run2 Run3 Run4 20-29 θ 1 0.01007200 0.01033389 0.01965272 0.01007097 30-39 θ 2 0.14942026 0.13028229 0.09966381 0.09102453 40-49 θ 3 0.25005136 0.24093310 0.27035829 0.26979353 50-59 θ 4 0.15055422 0.18986951 0.19958647 0.22040184 60-69 θ 5 0.14980235 0.11028538 0.08959871 0.12051873 70-79 θ 6 0.25036396 0.24954914 0.26031505 0.23963357 80 + θ 7 0.03999377 0.07052424 0.06031410 0.04930833 Total - 1.00025792 1.001778 0.9994892 1.000751 Table 2: P roportions Invoking Table 2, the value of θ 3 is larger, compared to the values of all other parameters, leading to the conclusion that, the age group 40 + is the highest probable age at onset of breast cancer; the next highest value of proportions is θ 6 corresponding to age group 70 +. As is evident from Table 3, the extremely small variances of the iterated values illustrate the suitability of Gibbs sampling in the situation. Also, as we discard the initial values of iteration, we may use improper priors like Unif orm(0, 1) to get the initial values, and the output of various runs is given in Table 4.

2048 G. A. Pullen and M. Kumaran Variances Run1 Run2 Run3 Run4 θ 1 0.000106198 0.0001073495 0.0001842645 0.0001019053 θ 2 0.0012585498 0.001106535 0.0008381735 0.0008451778 θ 3 0.0019129253 0.0017651658 0.0019923381 0.0019049895 θ 4 0.0012695066 0.0015165971 0.0015674542 0.0017302377 θ 5 0.0012813251 0.0009974279 0.0008056178 0.001045555 θ 6 0.0018793944 0.0018086309 0.0019176001 0.0018403571 θ 7 0.0003776579 0.0006548418 0.0005499078 0.0004634782 Table 3: V ariances θ Run1 Run2 Run3 Run4 θ 1 0.01009671 0.01981130 0.01014169 0.0099534367 θ 2 0.12982770 0.14016432 0.07934393 0.159175893 θ 3 0.26999674 0.27926111 0.25974321 0.229998541 θ 4 0.12967862 0.18090535 0.22011336 0.188592595 θ 5 0.11975033 0.08004237 0.11018176 0.099838067 θ 6 0.26946581 0.2497542 0.27010255 0.250795979 θ 7 0.06970986 0.04947005 0.04989858 0.0060248363 Total 0.99862209 0.9994087 0.9995251 0.998603 Table 4: P roportionsusinginitialvaluesfromunif orm 3 Conclusion When we compare the pattern in which the proportion estimates appear, with prior proportion values, the Multinomial - Dirichlet conjugate seems a good model in modeling the relationship between the age at onset and proportion of women affected of breast cancer. References [1] Berger J O., Statistical Decision Theory, Springer Verlag,N. York (1980). [2] Bishop D.T., Cannon - Albright L., McLellan T., Gardner E. J., Skolnick M. H., Segregation and Linkage analysis of nine Utah breast cancer pedigrees, Genet. Epidemiol.,5(1988),151-169. [3] Carlin B P., Louis T A.,Bayes and Empirical Bayes methods for data analysis, Chapman and Hall, CRC Press, Boca Raton, (2000).

Application of multinomial-dirichlet conjugate in MCMC estimation 2049 [4] Chib S., Markov Chain Monte Carlo Methods : Computation and Inference, Handbook of Econometrics, 5 ( eds. J J Heckerman and E Leamer ), Elsevier, Amsterdam,(2001),3569-3649. [5] Claus E B., Risch N., Thompson W D., Age of onset as an indicator of familial risk of breast cancer, Am. J. Epidemiol.,131(1990a),961-972. [6] Claus E B., Risch N., Thompson W D., Using age of onset to distinguish between sub forms of breast cancer, Ann. Hum. Genet.,54(1990b),169-177. [7] Claus E B., Risch N., Thompson W D.,Genetic analysis of breast cancer in the cancer and steroid hormone study, Am. J. Hum. Genet.,48(1991),232-242. [8] Gelman A J., Carlin B., Stern H., Rubin D.,Bayesian Data Analysis, Chapman and Hall, London, 2 nd ed, (2004). [9] Geweke J., Using simulation methods for Bayesian econometric models : Inference, development, and communication, Econometric Reviews,18,(1999),1-126. [10] Geweke J.,Contemporary Bayesian Econometrics and Statistics, Wiley, NJ,(2005). [11] Go RCP, King M C., Bailey - Wilson J, Elston R C., Lynch H T., Genetic epidemiology of breast cancer and associated cancers in high - risk families. I. Segregation analysis, J. Natl. Cancer Institute., 71(1983), 455-461. [12] Golstein A M., Haile RWC., Hodge S E., Paganini - Hill A., Spence M A.,Possible heterogeneity in the segregation pattern of breast cancer in families with bilateral breast cancer, Genet. Epidemiol., 5(1988), 121-133. [13] Golstein A M., Haile RWC., Marazita M L., Paganini - Hill A., A genetic epidemiologic investigation of breast cancer in families with bilateral breast cancer. I. Segregation analysis, J. Natl. Cancer Institute.,78(1987),911-918. [14] Greenberg E.,Introduction to Bayesian Econometrics,Cambridge University Press,(2003). [15] Heckerman D., Geiger D., Chickering D M., Learning Bayesian Networks : The combination of knowledge and statistical data, Machine Learning(1995). Received: April, 2010