Modelling Spatially Correlated Survival Data for Individuals with Multiple Cancers Dipak K. Dey, Ulysses Diva and Sudipto Banerjee Department of Statistics University of Connecticut, Storrs. March 16, 2007 1 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. 2 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. Accounting for spatial correlation provides additional insights Failure to account for such association could potentially lead to spurious and sometimes misleading results 2 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. Accounting for spatial correlation provides additional insights Failure to account for such association could potentially lead to spurious and sometimes misleading results Spatial variation in survival patterns often reveal underlying lurking factors, which help in better understanding of the disease patterns. 2 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. Accounting for spatial correlation provides additional insights Failure to account for such association could potentially lead to spurious and sometimes misleading results Spatial variation in survival patterns often reveal underlying lurking factors, which help in better understanding of the disease patterns. Spatial-survival models attempt to map spatial frailties 2 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. Accounting for spatial correlation provides additional insights Failure to account for such association could potentially lead to spurious and sometimes misleading results Spatial variation in survival patterns often reveal underlying lurking factors, which help in better understanding of the disease patterns. Spatial-survival models attempt to map spatial frailties Data available on patients with several cancers, where each cancer can potentially have its own spatial distribution. 2 Modeling Multivariate Survival Data
Introduction Motivation and Research Objectives Spatial data is used in a variety of disciplines current focus is on spatially referenced survival data. Accounting for spatial correlation provides additional insights Failure to account for such association could potentially lead to spurious and sometimes misleading results Spatial variation in survival patterns often reveal underlying lurking factors, which help in better understanding of the disease patterns. Spatial-survival models attempt to map spatial frailties Data available on patients with several cancers, where each cancer can potentially have its own spatial distribution. How should we model these multiple spatial frailties? 2 Modeling Multivariate Survival Data
The SEER database Introduction SEER (Surveillance Epidemiology and End Results) database of the National Cancer Institute records primary cancer types First primary is followed up with subsequent cancers. GIS-link: Each patient s county of residence is available 3 Modeling Multivariate Survival Data
The SEER database Introduction SEER (Surveillance Epidemiology and End Results) database of the National Cancer Institute records primary cancer types First primary is followed up with subsequent cancers. GIS-link: Each patient s county of residence is available Survival models for multiple cancers are sought for practical reasons: To assess survival from multiple primary cancers simultaneously To adjust survival rates from a specific primary cancer in the presence of other primary cancers 3 Modeling Multivariate Survival Data
The SEER database Multiple response data Date of Date of Survival Patient ID diagnosis Cancer type end-point Status Time (months) 1234 March, 1991 Colorectal May, 1994 Dead 38 1234 August, 1992 Stomach May, 1994 Dead 21 1234 February, 1994 Pancreas May, 1994 Dead 3 The SEER (Surveillance Epidemiology and End Results) database lists cases of multiple cancers by cancer types diagnosed. Together with medicare data one can extract an individual patient s complete medical history. Simple example: The patient in the above table Patient specific information available in the registry includes gender, race, marital status, and their county of residence (provided the state belongs to the SEER registry). 4 Modeling Multivariate Survival Data
The SEER database Multiple response data Assume there are I counties. In the i th county we observe n i patients. Total no. of patients: N = I i=1 n i Ordered pair (i, j): j th individual from the i th county. Each patient has been diagnosed with at least one of K pre-fixed cancer types. t ijk is the survival time for the (i, j) th individual from his diagnosis with the k th type of cancer. 5 Modeling Multivariate Survival Data
The SEER database Multiple response data Assume there are I counties. In the i th county we observe n i patients. Total no. of patients: N = I i=1 n i Ordered pair (i, j): j th individual from the i th county. Each patient has been diagnosed with at least one of K pre-fixed cancer types. t ijk is the survival time for the (i, j) th individual from his diagnosis with the k th type of cancer. Some special care is needed: Not all patients develop all the K types of cancer, List possible cancers {1, 2,..., K} and form patient-specific index set C (i,j) {1, 2,..., K}. Example: if patient 1234 in Section is the 10th individual from the 8th county and if colorectal, stomach and pancreas are cancer types 1, 3 and 5, in {1, 2,..., K} then C (8,10) = {1, 3, 5}. 5 Modeling Multivariate Survival Data
Survival modelling framework Multiple responses Survival times will be correlated for similar county-patient-cancer combinations. We capture these correlations by introducing appropriate frailties. 6 Modeling Multivariate Survival Data
Survival modelling framework Multiple responses Survival times will be correlated for similar county-patient-cancer combinations. We capture these correlations by introducing appropriate frailties. Multiple response frailty modelling h (t ijk ) = h 0 (t ijk ) exp ( x T ijk β + u (i,j) + v k + φ ik ) 6 Modeling Multivariate Survival Data
Survival modelling framework Multiple responses Survival times will be correlated for similar county-patient-cancer combinations. We capture these correlations by introducing appropriate frailties. Multiple response frailty modelling h (t ijk ) = h 0 (t ijk ) exp ( x T ijk β + u (i,j) + v k + φ ik ) u (i,j) : frailty for the (i, j) th patient (these may be looked upon as patient effects nested within counties). v k : frailty for the k th cancer type. φ ik be the frailty for the k th cancer type nested within the i th county. x ijk denotes the patient-cancer specific covariates. 6 Modeling Multivariate Survival Data
Bayesian hierarchical modelling The data likelihood Hierarchical models combine information from the data likelihood and priors on parameters. The data likelihood I n i i=1 j=1 k C (i,j) [S (t ijk x ijk, β, η)] [ h 0 (t ijk ; η) exp ( x T ijk β + u (i,j) + v k + φ ik )] δij S (t ijk x ijk, β, η) is the survival function evaluated at t ijk conditional on the parameters and observed covariates. 7 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Modelling the frailties Patient effects nested within counties: Priors on u ij indep u (i,j) N ( 0, σi 2 ) Cancer effects: Priors on v k s v = (v 1,..., v K ) T N (0, Λ), where Λ is a K K (unknown) covariance matrix, which is assigned an Inverse-Wishart prior. 8 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial Frailties Spatial Effects: Form Φ = ( φ T 1,...φ T ) T I, where φ i = (φ i1,..., φ ik ) T is the collection of the spatial frailties for the K cancers within the i th county. Priors on φ ik s Φ = ( φ T 1,...φ T ) T I N (0, ΣW (α) Λ). Here Σ W (α) = (Diag (m i ) αw ) 1. m i is the number of neighbors of county i W is the adjacency matrix of the graph representing our region. Only α is random in Σ W (α). Note: α (0, 1) ensures a proper M CAR distribution. 9 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Remarks on likelihood modelling h 0 (t) denotes baseline hazard function modelled semiparametrically as a mixture of Beta functions (see, e.g., Gelfand and Mallick, 1995 and Ibrahim et al., 2001) as cancer-specific (in which case we write h 0k (t)) or county-specific (in which case we write h 0i (t)). In principle we can also include subject-cancer interaction terms of the form w (ij)k, which may reveal cancer-specific variation in patient frailties. However, in practice reliably estimating these higher order interaction terms becomes difficult, especially with data such as ours, where many subjects die before contracting fewer than three cancers. 10 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions We are using the same Λ to model the MCAR and the cancer main effects: we are modelling the dispersion matrix of the space-cancer interactions, Φ, as a Kronecker product of spatial dispersion and cancer dispersion. 11 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions We are using the same Λ to model the MCAR and the cancer main effects: we are modelling the dispersion matrix of the space-cancer interactions, Φ, as a Kronecker product of spatial dispersion and cancer dispersion. We may generalize the MCAR (α, Λ) to MCAR ((α 1,..., α K ), Λ), enriching our model by assigning a different spatial smoothness parameter to each cancer type. (Carlin and Banerjee, 2003) 11 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions We are using the same Λ to model the MCAR and the cancer main effects: we are modelling the dispersion matrix of the space-cancer interactions, Φ, as a Kronecker product of spatial dispersion and cancer dispersion. We may generalize the MCAR (α, Λ) to MCAR ((α 1,..., α K ), Λ), enriching our model by assigning a different spatial smoothness parameter to each cancer type. (Carlin and Banerjee, 2003) Yet another modification considers v i N (0, Λ i ) (not identically distributed), whereupon each county has its own cancer-covariance pattern. 11 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions We are using the same Λ to model the MCAR and the cancer main effects: we are modelling the dispersion matrix of the space-cancer interactions, Φ, as a Kronecker product of spatial dispersion and cancer dispersion. We may generalize the MCAR (α, Λ) to MCAR ((α 1,..., α K ), Λ), enriching our model by assigning a different spatial smoothness parameter to each cancer type. (Carlin and Banerjee, 2003) Yet another modification considers v i N (0, Λ i ) (not identically distributed), whereupon each county has its own cancer-covariance pattern. In addition, one may compare MCAR (α, Λ) to MCAR (α, (Λ 1,..., Λ I )), retaining a common smoothness parameter for the different cancers (even put α = 1) but allow different covariance functions for different counties. 11 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions However, the above generalizations of the MCAR (α, Λ) will often require substantial information on multiple cancer patients. Unfortunately, such information is rare in practice. For instance, in the SEER database less than one percent of the patients have moderately large monitoring times with more than two different types of cancers. 12 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions However, the above generalizations of the MCAR (α, Λ) will often require substantial information on multiple cancer patients. Unfortunately, such information is rare in practice. For instance, in the SEER database less than one percent of the patients have moderately large monitoring times with more than two different types of cancers. Hence general models are identifiable only via priors. 12 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions However, the above generalizations of the MCAR (α, Λ) will often require substantial information on multiple cancer patients. Unfortunately, such information is rare in practice. For instance, in the SEER database less than one percent of the patients have moderately large monitoring times with more than two different types of cancers. Hence general models are identifiable only via priors. Therefore, we restrict our subsequent attention to the M CAR(α, Λ) specification and compare a spatially independent model with basically the same structure by replacing the Σ W (α) with a diagonal matrix which does not account for the spatial structure of the region. 12 Modeling Multivariate Survival Data
Bayesian hierarchical modelling Spatial modelling extensions However, the above generalizations of the MCAR (α, Λ) will often require substantial information on multiple cancer patients. Unfortunately, such information is rare in practice. For instance, in the SEER database less than one percent of the patients have moderately large monitoring times with more than two different types of cancers. Hence general models are identifiable only via priors. Therefore, we restrict our subsequent attention to the M CAR(α, Λ) specification and compare a spatially independent model with basically the same structure by replacing the Σ W (α) with a diagonal matrix which does not account for the spatial structure of the region. Now the independent frailties for the K cancers within the i th county, denoted by Φ, has a N ( 0, τ 2 I I I Λ ) distribution where τ 2 is a scale parameter. 12 Modeling Multivariate Survival Data
Numerical computations Models Model Description Model Model Description Model I Full MCAR (α, Λ) Model II MCAR (α, Λ) Without Patient Frailties {u (i,j) } Model III Full Spatial Independence Model Model IV Spatial Independence Model Without Patient Frailties {u (i,j) } 13 Modeling Multivariate Survival Data
Numerical computations MCAR model: MCAR (α, Λ) We design an appropriate Gibbs sampler for the above hierarchical model. β have full conditional distributions that are log-concave (with either flat or vague Gaussian priors) and can be updated using the Adaptive Rejection Sampling (ARS) algorithm (Gilks and Wild, 1992) or a Metropolis step. The same is true for the patient frailties { u (i,j) }, the cancer frailties {v k } and the spatial frailties {φ ik }. The mixture weights, η, in the baseline hazard modelling needs to be updated using a Metropolis step. The full conditionals for each of the σi 2 inverted-gamma distributions are conjugate Λ has a conjugate inverted-wishart distribution, which follows from the following Lemma: 14 Modeling Multivariate Survival Data
Numerical computations The Lemma Lemma: Lemma: Let A and B be I I and K K matrices and let x be any vector of length IK. Then, there exists a K K matrix C and an I I matrix D such that x T (A B) x = tr (CB) = tr (DA). 15 Modeling Multivariate Survival Data
Illustrations SEER data description Source: 1973-2001 SEER Public Use Incidence Database on patients from the state of Iowa diagnosed with colorectal cancer. In the dataset, 28% of the patients survival times were censored, 44% of patients were females, 69% of patients had at most two primary cancers, 47% had colon cancer before rectal cancer, majority of cases (78% for colon and 75% for rectal cancer) were in the local or regional stage, and most of the cases were diagnosed when patients were 65 years or older (76% for colon and 74% for rectal cancer). 16 Modeling Multivariate Survival Data
Illustrations SEER data description Colorectal cancer involves two gastrointestinal related organs. Good motivation for using this cancer type to illustrate the methodology. Diagnosis: two cancer types (colon and rectum) with survival times recorded in months from diagnosis up to study termination (censored) or death (failure). 1083 patients; represent 96 of the 99 counties in Iowa. Age at diagnosis for each type of cancer categorized: < 55, 55 65, 65 75, and > 75. Stage of each type of cancer: in situ or unstaged, local, regional, or distant. Others: total number of primary cancers, dichotomized as 0 (nprimes 2) and 1 (nprimes > 2), and gender. An indicator variable, colon first, is used to capture the effect of the order of occurrence. 17 Modeling Multivariate Survival Data
Illustrations Parameter Estimates: Regressors Parameter Model I Model II Model III Model IV Patient-level Gender=Female -0.36 (-0.36, -0.36) -0.36 (-0.36, -0.36) -0.36 (-0.36, -0.36) -0.36 (-0.36, -0.36) Number of primaries -0.30 (-0.30, -0.30) -0.30 (-0.30, -0.30) -0.30 (-0.30, -0.30) -0.30 (-0.30, -0.30) Colon first 0.04 (0.04, 0.04) 0.04 (0.04, 0.04) 0.04 (0.04, 0.04) 0.04 (0.04, 0.04) Colon cancer Frailty, ν 1-2.40 (-2.90, -1.98) -2.46 (-2.92, -2.01) -1.70 (-1.86, -1.56) -1.80 (-2.05, -1.57) Stage In situ or Unstaged Local -0.25 (-0.33, -0.18) -0.25 (-0.33, -0.18) -0.25 (-0.33, -0.18) -0.25 (-0.32, -0.18) Regional 0.03 (-0.05, 0.10) 0.03 (-0.04, 0.10) 0.03 (-0.05, 0.10) 0.03 (-0.04, 0.11) Distant 1.12 (1.02, 1.21) 1.12 (1.02, 1.21) 1.12 (1.02, 1.21) 1.12 (1.02, 1.20) Age < 55 55 < 65 0.47 (0.36, 0.58) 0.47 (0.36, 0.58) 0.47 (0.36, 0.57) 0.47 (0.35, 0.57) 65 < 75 0.74 (0.64, 0.84) 0.74 (0.64, 0.84) 0.74 (0.64, 0.84) 0.74 (0.64, 0.84) 75 1.16 (1.07, 1.27) 1.16 (1.06, 1.26) 1.16 (1.06, 1.26) 1.16 (1.06, 1.26) Rectal cancer Frailty, ν 2-2.17 (-2.63, -1.72) -2.25 (-2.73, -1.80) -1.88 (-2.07, -1.69) -2.07 (-2.37, -1.75) Stage In situ or Unstaged Local -0.40 (-0.81, 0.02) -0.40 (-0.81, 0.04) -0.40 (-0.80, 0.02) -0.40 (-0.82, 0.02) Regional 0.07 (-0.36, 0.49) 0.07 (-0.35, 0.47) 0.07 (-0.35, 0.49) 0.07 (-0.35, 0.48) Distant 1.44 (1.16, 1.72) 1.44 (1.16, 1.72) 1.44 (1.16, 1.71) 1.44 (1.17, 1.73) Age at diagnosis < 55 55 < 65 0.36 (0.05, 0.66) 0.35 (0.05, 0.65) 0.36 (0.05, 0.66) 0.35 (0.04, 0.64) 65 < 75 0.76 (0.36, 1.17) 0.76 (0.35, 1.16) 0.76 (0.36, 1.18) 0.76 (0.36, 1.17) 75 1.28 (0.86, 1.70) 1.28 (0.87, 1.71) 1.28 (0.87, 1.72) 1.28 (0.86, 1.70) Reference categories. 18 Modeling Multivariate Survival Data
Illustrations Parameter Estimates: Other parameters Parameter Model I Model II Model III Model IV α 0.95 (0.81, 1.00) 0.95 (0.81, 1.00) η 1 0.90 (0.88, 0.92) 0.90 (0.88, 0.92) 0.90 (0.88, 0.92) 0.90 (0.88, 0.92) η 2 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) 0.02 (0.01, 0.04) 0.02 (0.01, 0.04) η 3 0.02 (0.01, 0.04) 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) η 4 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) 0.02 (0.01, 0.04) 0.02 (0.01, 0.04) η 5 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) 0.03 (0.01, 0.04) 0.02 (0.01, 0.04) Λ 11 1.29 (0.59, 2.11) 1.30 (0.59, 2.12) 0.41 (0.20, 0.65) 0.41 (0.19, 0.65) Λ 22 0.54 (0.20, 0.91) 0.57 (0.23, 0.96) 0.27 (0.13, 0.42) 0.29 (0.13, 0.46) Λ 12 1.78 (0.90, 2.74) 1.84 (0.95, 2.83) 0.87 (0.48, 1.30) 0.98 (0.54, 1.47) ρ 0.36 (0.20, 0.53) 0.37 (0.21, 0.54) 0.45 (0.31, 0.60) 0.46 (0.31, 0.62) Precision is 1 10 3. 19 Modeling Multivariate Survival Data
Model Comparisons CPO and ALMPL score The conditional predictive ordinate (CPO) and the average log-marginal pseudo-likelihood (ALMPL) were used. (Gelfand and Dey, 1994, and Chen et al., 2000) CPO statistic CP O ij = f ( D ij x ij, D ( ij)) ( = ) 1 1 π (Θ D) Θ f (D ij x ij, Θ) D ij = (i, j) th observation, D is the complete data and Θ is the collection of parameters. ALMPL ALMP L = 1 N log CP O ij. ij Larger values of ALMPL indicate better model fit. 20 Modeling Multivariate Survival Data
Model Comparisons DIC score The modified Deviance Information Criterion (DIC) proposed by Celeux et al. (2006) was also used for model comparison. The DIC for a particular model with data y, random effects Z and parameters Ω and likelihood f ( y Ω ) is given by DIC score DIC = 4E Ω,Z [ log f ( y, Z Ω ) y ] +2E Z [ log p ( y, Z EΩ [ Ω y, Z ])] This essentially integrates out the random effects. Lower values of DIC indicate better models. Measure of complexity [ ( p D = DIC + 2E Ω,Z log f y, Z Ω ) ] y Smaller values pointing to models with lesser complexity. 21 Modeling Multivariate Survival Data
Model Comparisons Model Comparison Table Model ALMPL DIC p D Model I -5.92 90.40 Model II -6.03-247.81 2.71 Model III -6.96 5761.58 276.75 Model IV -8.48 3588.42 149.13 DIC difference from Model I (smaller indicates better model). 22 Modeling Multivariate Survival Data
Figures Index plots of CPO s Index plots of the log CP Oij A log CP OB ij for each model pair. Positive values indicate support for model A over model B while negative values indicate the opposite. From left to right, top to bottom: I-II, I-III, I-IV, II-III, II-IV, III-IV. Y-axes range is from -1.0 to 7.0 (comparisons involving Model IV were truncated at 7.0 to facilitate presentation: 23 Modeling Multivariate Survival Data
Figures Cancer frailties Distribution of cancer-specific frailties (y-axes) under the different models. Positive frailties indicate higher relative risks while negative values indicate the opposite: 24 Modeling Multivariate Survival Data
Figures Patient-specific frailties Distribution of average patient-specific frailties (y-axes) from different counties (x-axes) in Iowa under Model I (top) and Model III (bottom). Positive frailties indicate higher relative risks while negative values indicate the opposite: 25 Modeling Multivariate Survival Data
Figures Spatial frailties Distribution of spatial frailties (y-axes) for colon and rectal cancers from different counties (x-axes) in Iowa under best model. Positive frailties indicate higher relative risks while negative values indicate the opposite: 26 Modeling Multivariate Survival Data
Figures Spatial frailties: choropleth map Spatial frailties for colon and rectal cancers in Iowa under the different models. Color change from blue to red indicates an increase from negative to positive spatial frailties. Cutoff values are one standard deviation units from the mean of each cancer type in each model. Locations of urban areas are also indicated to potentially account for clustering: 27 Modeling Multivariate Survival Data
Concluding Remarks Remarks We presented a methodology for modelling spatially correlated multivariate cancer survival data, illustrated using colon and rectal cancer data for the state of Iowa obtained from the SEER database. 28 Modeling Multivariate Survival Data
Concluding Remarks Remarks We presented a methodology for modelling spatially correlated multivariate cancer survival data, illustrated using colon and rectal cancer data for the state of Iowa obtained from the SEER database. The results suggest that there is positive correlation between the hazard rates of colon and rectal cancers. This information may not be available or interpretable when modelling the cancers independently. 28 Modeling Multivariate Survival Data
Concluding Remarks Remarks We presented a methodology for modelling spatially correlated multivariate cancer survival data, illustrated using colon and rectal cancer data for the state of Iowa obtained from the SEER database. The results suggest that there is positive correlation between the hazard rates of colon and rectal cancers. This information may not be available or interpretable when modelling the cancers independently. Other mechanisms for handling spatial covariation, such as the shared component approach presented by Held and Best (2001) applied to disease mapping of oral and esophageal cancers, are also possible. The basic modelling framework may be extended to different hazard structures if the need arises. 28 Modeling Multivariate Survival Data
Concluding Remarks Remarks We presented a methodology for modelling spatially correlated multivariate cancer survival data, illustrated using colon and rectal cancer data for the state of Iowa obtained from the SEER database. The results suggest that there is positive correlation between the hazard rates of colon and rectal cancers. This information may not be available or interpretable when modelling the cancers independently. Other mechanisms for handling spatial covariation, such as the shared component approach presented by Held and Best (2001) applied to disease mapping of oral and esophageal cancers, are also possible. The basic modelling framework may be extended to different hazard structures if the need arises. 28 Modeling Multivariate Survival Data
Concluding Remarks Thank you! 29 Modeling Multivariate Survival Data