A MODEL FOR MIXED CONTINUOUS AND DISCRETE RESPONSES WITH POSSIBILITY OF MISSING RESPONSES

Size: px

Start display at page:

Download "A MODEL FOR MIXED CONTINUOUS AND DISCRETE RESPONSES WITH POSSIBILITY OF MISSING RESPONSES"

Donna Harrison
5 years ago
Views:

1 Journal o Sciences Islamic epublic o Iran : 5-6 National Center For Scientiic esearch ISSN 6- A MODE FO MIED CONTINUOUS AND DISCETE ESPONSES WITH POSSIBIIT OF MISSING ESPONSES M. Ganjali Department o Statistics Facult o Mathematical Sciences Shahid Beheshti Universit Evin Tehran 988 Islamic epublic o Iran Abstract A model or missing data in mixed binar and continuous responses which can be used on cross-sectional data is presented. In this model response indicator or the binar response can be dependent on the continuous response. A closed orm or the likelihood is ound. For data with a complicated pattern o missing responses some new residuals are also proposed. The model o multiplicative heteroscedasticit is used to consider the problem o heteroscedasticit o the continuous response. The model is illustrated on the data o an observational stud where the eect o pschological disorder o parents on both the verbal comprehension score and the presence o adverse smptoms in their children are modeled in the presence o missing responses. Kewords: Mixed data Missing responses Multiplicative heteroscedasticit Pearson residuals Pschological disorder. Introduction Some biomedical pschological and health sciences data include both discrete and continuous outcomes. One example is the analsis o development toxicit endpoints when the relationship between etal weight and malormation in live etus is an important statistical issue []. Another example is in the stud o the maternal smoking eect on respirator illness in children where we have a continuous measure o pulmonar unction and a binar measure o chronic smptoms in children. For the irst example separate analses o the categorical or the continuous responses cannot properl assess the eect o dose on etal weight and malormations simultaneousl. For the second example separate analses cannot assess the eect o maternal smoking on all the responses simultaneousl. Furthermore separate analses give biased estimates or the parameters and misleading inerence [9]. Consequentl we need to consider a method in which these variables can be modeled jointl. For joint modeling o responses one method is to use the general location model o Olkin and Tate [] where the joint distribution o the continuous and categorical variables is decomposed into a marginal multinomial distribution or the categorical variables and a conditional Multivariate normal distribution or the m-ganjali@cc.sbu.ac.ir 5

2 Vol. No. Winter Ganjali J. Sci. I.. Iran continuous variables given the categorical variables. Another method is to consider the conditional distribution o discrete variables given the continuous variables and a marginal distribution or continuous variables. Cox and Wermuth [] empiricall examined the choice between these two methods. The third method which is developed here is to use simultaneous modeling o categorical and continuous variables to take into account the association and dependence between the responses b the correlation between the errors in the models or responses. For more details o this approach see or example Heckman [6] in which a general model or simultaneousl analing two mixed correlated responses is introduced and Catalano and an [] who extend and used the model or a cluster o discrete and continuous outcomes. ittle and ubin [] made an important distinction between the various tpes o missing mechanism. The deined the missing mechanism as missing completel at random MCA i missingness is dependent neither on the observed responses nor on the missing responses and as random MA i it is dependent on the observed responses but not on the missing responses. Missingness is deined as non-random i it depends on the unobserved responses. From a likelihood point o view MCA and MA are ignorable but missing not at random MNA is non-ignorable. For mixed data with missing outcomes ittle and Schluchter [] and Fitmaurice and aird [] used the general location model o Olkin and Tate [] with the assumption o missingness at random MA to justi ignoring the missing data mechanism []. This means that the used all available responses without a model or missing mechanism to obtain parameter estimates using the EM Expectation Maximiation algorithm. For a good discussion o mixed normal and non-normal data with missing responses under the assumption o MA see [] and [6]. In this paper a general latent variable model or simultaneousl handling response and non-response in mixed discrete and continuous data with potentiall nonrandom missing values in both tpes o responses is presented. With this model the dependence between responses can be taken into account b the correlation between errors o the response models. The aim is to use the general model o Heckman [6] or the joint modeling o the discrete and continuous responses. It is shown how we can incorporate a model or a complicated pattern o missing responses. In Section the model is described and some new residuals are introduced. In Section this model is used on a subset o data available in ittle and Schluchter [] about the eects o parental pschological disorders on various aspects o the development o their children. The results o itting this model are also presented. In Section the paper concludes with some remarks.. Model and esiduals or Mixed Data with Missing esponses.. The Model Suppose is a binar discrete variable and is a continuous variable. Variables and are correlated and must be modeled simultaneousl. et and denote the underling latent variables o the binar response the non-response mechanism or the binar variable and the non-response or the continuous variable respectivel and deine i as the response indicator or i as the response indicator or and i as the binar response. The model takes the orm: ε a ε b ε c ε d where j or j are vectors o explanator variables. It is assumed that E ε j or j... The unstructured us variance covariance matrix o the vector o errors us ε ε ε ε The vector parameters and the correlation parameters or j < j j and j j is 5

3 J. Sci. I.. Iran Ganjali Vol. No. Winter j... should be estimated. In this model an multivariate distribution can be assumed or the errors in the model. Here a multivariate normal distribution is assumed. To consider the problem o heteroscedasticit o the continuous response we can use multiplicative model o heteroscedasticit [5] or which is exp γ γ e where γ and the vector o parameter γ should be estimated and is the vector o explanator variables. The non-response o in this model depends on both responses the non-response or also depends on both responses and depends on. I one o the correlation parameters j j or j and j is ound to be signiicant then we have a nonrandom missing process. Parameter can tell us whether or not non-response in the continuous variable binar variable is dependent on response in the discrete variable continuous variable. Parameter can tell us whether or not non-response in the continuous variable is dependent on the non-response in the discrete variable. The likelihood or this model which has been given in appendix shows the simpliication obtained b using the assumption o normalit in the models or nonresponse. The simpliication arises rom the act that instead o resorting to a numerical approximation o the integral needed to integrate out the missing values o the continuous response we just need to invoke a point on the cumulative three-variate normal distribution which although approximate can be computed in man standard statistical packages with suicientl high precision. The NAG [] routine EOUCF is used in this paper to minimie the mines logarithm o the likelihood given in the appendix. EOUCF is a Fortran routine to minimie a smooth unction subject to constraints or example simple bounds on the variables using a sequential quadratic programming SQP [] method. In this routine all unspeciied derivatives are approximated b inite dierences... esiduals The missing values o responses create problems or the usual residual diagnostics see e.g. Hirano et al. [8]. For example usual Pearson residuals o the variable are o the orm r E var Z i.e. the assume MCA. These unconditional residuals will be misleading i missingness is MA or MNA as in these cases the 's are rom a conditional or truncated conditional distribution. We can examine the residuals o the responses conditional on being observed. et start with using theoretical orm involving E Z E and var Z rather than their predicted values. The Pearson residuals or continuous response can take the orm E Z r a var Z and or the discrete response can take the orm E E r b E where EZ Z M [ M M ] varz Z where M is the Mill's ratio [7] and E P r where and are respectivel the cumulative univariate and bivariate normal distributions. The estimated Pearson residuals r ˆ r ˆ can be ound b using the maximum likelihood estimates o the parameters in Sstem. As residuals in a and b are deined b conditioning on observing the responses the dier rom those o Ten Have et al. [5]. Ten Have et al. [5] ound the expectation and variances o responses and consequentl the residuals unconditional on the act that responses should be observed. Then to calculate the residuals the assumed no link between 55

4 Vol. No. Winter Ganjali J. Sci. I.. Iran response and nonresponse. This gives biased estimates o the means and variances o responses i missingness is not at random [].. Application: The Eect o Pschological Disorders on the Verbal Comprehension Score.. Pschological Disorders Data The data set extracted rom ittle and Schluchter [] has been used also b Fitmaurice and aird []. These authors assumed a mechanism b which observations were MA and then used the general model o Olkin and Tate [] to decompose the joint distribution o the continuous and discrete variables. In these data schoolage children are classiied according to the risk o their parents to pschological disorders normal moderate risk and high risk. ittle and Schluchter [] and Fitmaurice and aird [] used two continuous responses a standardied reading score and a standardied verbal comprehension score and a single binar response indicating the presence o adverse smptoms which was obtained or each child. Here onl the standardied verbal comprehension score and the binar response or the irst child in amil are used. The data include 69 amilies with a general pattern o missing responses. Table gives the dierent patterns o missing response or each child. For example 5 irst-born children responded or both variables and the observed value o their binar response is. Since a preliminar analsis indicated that there was no signiicant dierence between moderate and high risk groups o pschological disorders [] in this paper the moderate and high risk groups are combined into a single group. Table. Dierent patterns o missing responses or pschological disorders data No Table. Descriptive statistics or pschological disorders observed data ow isk Score Ad. Sm No mean S.D. Total Total Calculated ignoring low risk Count% Table presents summar statistics or presence o the adverse smptoms Ad. Sm. in Table and the standardied verbal comprehension score using onl observed data. In this table the value o or ow isk means that we have an individual rom moderate or high risk group and the value o means that individual is rom low risk group. Table shows that the mean standard deviation score o children rom a amil with a low risk o pschological disorders ow isk in Table is more lower than amilies with moderate or high risk o pschological disorders. It also shows that the occurrence o adverse smptoms is less in amil with a low risk o pschological disorders. As the variance o scores is less or a amil with a low risk o pschological disorders the problem o heteroscedasticit should be taken into account. A Kolmogorov Smirnov test o the assumption o normalit or scores o the children o a low amil o pschological disorders is not rejected p-value.. The same test does not reject the assumption o the normalit or scores o the children rom a moderate or high risk o pschological disorders p-value.. The Pearson correlation between two responses is r.9. This emphasies that two responses should be modeled simultaneousl... Models or Pschological Disorders Data For comparative purposes three models are considered. The irst model model I uses onl complete cases and does not consider the correlation between two responses. This model is ε 5a 5b ε exp γ γ 5c where there is no correlation between ε and ε. The second model model II uses model I and takes into account the correlation between two responses. The third model model III considers the ull model model with missing mechanism with the ollowing orm ε 6a 6b ε 56

5 J. Sci. I.. Iran Ganjali Vol. No. Winter and M ε 6c M ε 6d exp γ γ. 6e In these models M i child is rom a low risk group i child is rom a Moderate risk group In model III non-response models and let have one more explanator variable [6]... esults or Pschological Disorders Data For the standardied verbal comprehension score and the binar response or the irst born child in amil model III inds no connection between missing mechanism and responses with a change o deviance o.6 on d.. p-value.66. Consequentl using onl the complete data without considering the missing mechanism can give unbiased estimates o the parameters. esults o using three models are given in table. For model III as the correlation parameters or j and j are not signiicant as j j mentioned above results with removing these parameters are given. These results show that there is a negative correlation a change in deviance o.7 with d between two responses the more the score the less likel the presence o adverse smptoms and separate analsis o the responses gives biased estimates o the standard errors o the parameters. For example under model I the standard errors o the estimates o and in comparison with the results o the model II are overestimated and the standard errors o the estimates o and are underestimated. esults also show that children rom a low risk amil can obtain a better standardied verbal comprehension score and the variance o their scores is less than that o the amil with a high or moderate risk γ.786 or model II. The estimate o the parameter b model III shows a positive correlation between two underling variables or missing mechanisms. This means that children who do not respond their score are more likel to not to respond their discrete response. Table shows values o residuals calculated using b and model II or the discrete response. esiduals or continuous response calculated using a do not include an value larger than in absolute value and with the residual values in Table model II can be considered as a good it or the pschological disorders data. Table. esults using three models or the pschological disorders data Parameter ModelIII ModelII ModelI Estimate S.E. Estimate S.E. Estimate S.E γ γ loglik Table. esiduals or discrete response using model II or pschological disorders data Discussion In this paper a latent variable model is presented or simultaneousl modelling o binar and continuous correlated responses. The model is ormulated in such a wa that can handle the possibilit o missing responses. The latent variable model in sstem can also be used or a mixture o ordinal and continuous data with usual changes or ordinalit o the discrete response. It can also be used or more than two variables with missing responses. In this case the more variables with missing values we have the more 57

6 Vol. No. Winter Ganjali J. Sci. I.. Iran response indicators should be deined and the more parameters coming to the estimation procedure. The selection models in sstems can be sensitive to the assumption o normalit or error distributions []. However the residuals introduced in this paper can be used to practicall examine this assumption. Acknowledgments The author is grateul to Shahid Beheshti Universit or the awarded research grant. eerences. Catalano P. and an.m. Bivariate latent variable models or clustered discrete and continuous outcomes. Journal o the American Statistical Association 879: Cox D.. and Wermuth N. esponse models or mixed binar and quantitative variables. Biometrika 79: Fitmaurice G.M. and aird N.M. egression models or mixed discrete and continuous responses with potentiall missing values. Biometrics 5: Fletcher. Practical Methods o Optimiation. John Wile. 5. Harve A. Estimating regression models with multiplicative heteroscedasticit. Econometrica : Heckman J.J. Dumm endogenous variables in a simultaneous equation sstem. Econometrica 66: Heckman J.J. Sample selection bias as a speciication error. Ibid. 7: Hirano K. Imbens G. idder G. and ubin D. Combining panel data sets with attrition and rereshment samples. Technical working paper National Bureau o Economic esearch. Cambridge Massachusetts eon A.. and Carriere K.C. On the one sample location hpothesis or mixed bivariate data. Commun. statist. Theor and Meth. 9: ittle.j and ubin D. Statistical Analsis with Missing Data. New ork Wile ittle.j. and Schluchter M. Maximum likelihood estimation or mixed continuous and categorical data with missing values. Biometrika 7: Maddala G.S. imited dependent and qualitative variables in econometrics. Cambridge 98.. NAG. Numerical Algorithms Group Manual. Mark 6. Oxord U.K Olkin I. and Tate.F. Multivariate correlation models with mixed discrete and continuous variables. Annals o Mathematical Statistics : Ten Have T..T. Kunselman A.. Pulkstenis E.P. and andis J.. Mixed eects logistic regression models or longitudinal binar response data with inormative dropout. Biometrics 5: Schaer J.. Analsis o Incomplete Multivariate Data Chapman & Hall 997. Appendix: ikelihood or Mixed Data with Missing Mechanism For individuals who observe neither nor the likelihood is Pr 7 where is the cumulative bivariate normal distribution. For individuals who observe onl and the value o is the likelihood is where Pr Pr Pr 8 and is cumulative three-variate normal distribution. For individuals who observe onl and the value o is the likelihood is Pr Pr Pr Pr Pr. 9 Σ For individuals who observe onl the likelihood is Pr [Pr Pr ] 58

7 J. Sci. I.. Iran Ganjali Vol. No. Winter [ ] For individuals with both and observed and the value o equal to the likelihood is Pr Pr [Pr Pr ] Pr [ ] where For individuals with both and observed and the value o equal to the likelihood is Pr Pr Pr [ Pr Pr Pr Pr ] Pr [ 59

8 Vol. No. Winter Ganjali J. Sci. I.. Iran ]. 6

Complier Average Causal Effect (CACE)

Complier Average Causal Effect (CACE) Booil Jo Stanford University Methodological Advancement Meeting Innovative Directions in Estimating Impact Office of Planning, Research & Evaluation Administration