Ridge regression for risk prediction

1 Ridge regression for risk prediction with applications to genetic data
Erika Cule and Maria De Iorio
Imperial College London, Department of Epidemiology and Biostatistics, School of Public Health
May 2012

2 Outline
1 Risk Prediction using Genetic Data
2 Methods and challenges
3 Ridge Regression: Shrinkage parameter; Significance testing
4 Conclusions

5 Risk Prediction using Genetic Data
In the decade following the publication of the first draft of the Human Genome Sequence, genome-wide association studies have identified thousands of genetic variants associated with hundreds of diseases and traits.

6 Risk Prediction using Genetic Data However, clinicians are getting impatient about the utility of these identified variants for risk prediction in complex diseases:

11 Risk prediction using genetic data
Recently, questions have been raised about the potential utility of genetic risk prediction for complex diseases (Clayton, 2009).
The aim here is to make ridge regression possible for genetic data in a semi-automatic way.
The framework that we propose allows for the simultaneous inclusion of all predictors genome-wide in a regression model.
Our approach is appropriate where there are many predictors of small effect size, which is thought to be the case in genetic data.

13 Univariate tests of association
[Figure: genome-wide scan for seven diseases (WTCCC, 2007). For each disease, the trend test P value for quality-control-positive SNPs is plotted as -log10(P value) against position on each chromosome, with chromosomes shown in alternating colours for clarity. Panel titles include type 1 diabetes, type 2 diabetes, bipolar disorder, coronary artery disease and Crohn's disease. All panels are truncated at -log10(P value) = 15, although some markers (for example, in T1D and RA) exceed this significance threshold. Nature Publishing Group.]

20 Multivariate methods
Consider all SNPs jointly.
Standard multivariate methods cannot be used with modern genetic data sets, which have p >> n. Typically, additional (non-genetic) covariates are included in the analysis, further increasing the dimensionality of the data.
Penalized regression methods constrain the size of the maximum likelihood estimates of regression coefficients. They are also known as shrinkage methods because they shrink regression coefficients towards zero.
A number of penalized regression approaches have been proposed in the literature: Lasso regression, HyperLasso, Elastic Net and Ridge Regression.

21 Prior distributions in Lasso and Ridge Regression
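
The figure for this slide did not survive extraction. As a reminder of the standard correspondence it presumably illustrated, the penalized estimates can be read as posterior modes under independent priors on the coefficients. The symbols $\tau^2$ (Gaussian prior variance) and $\lambda$ (Laplace rate) are notation assumed here, not taken from the slides.

```latex
% Ridge: Gaussian prior, \beta_j \sim N(0, \tau^2) independently
\hat{\beta}^{\mathrm{ridge}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + k \sum_{j=1}^{p} \beta_j^2,
  \qquad k = \sigma^2 / \tau^2

% Lasso: Laplace (double exponential) prior,
% p(\beta_j) = (\lambda / 2) \exp(-\lambda \lvert \beta_j \rvert)
\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + 2\sigma^2 \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The Laplace prior has a sharp peak at zero, which is why the lasso can set coefficients exactly to zero, whereas the Gaussian prior only shrinks them.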

23 Ridge regression
Ridge regression (Hoerl & Kennard, 1970) is a penalized regression approach proposed to overcome the problems associated with multicollinearity among predictors in multiple regression.
Among penalized regression approaches, ridge regression has been shown to offer very good predictive performance (Frank & Friedman, 1993).
We applied ridge regression to the problem of risk prediction using genetic data obtained from genome-wide association studies.
Ridge regression shrinks the squared length of the regression coefficient vector, which corresponds to a quadratic penalty on the coefficients.
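
To make the quadratic penalty concrete, here is a minimal NumPy sketch of the closed-form ridge estimate on simulated data with two nearly collinear predictors. The function name `ridge_coefficients`, the simulated data and the grid of shrinkage values are illustrative assumptions, not part of the talk.

```python
import numpy as np

def ridge_coefficients(X, y, k):
    """Closed-form ridge estimate: solve (X'X + kI) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Simulated data (assumption): predictors 0 and 1 are nearly collinear.
rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
y = X @ np.array([1.0, 1.0, 0.5, 0.0, -0.5]) + rng.normal(size=n)

# Squared length of the coefficient vector for increasing shrinkage.
shrinkage_values = [0.0, 1.0, 10.0, 100.0]
squared_lengths = [
    float(b @ b)
    for b in (ridge_coefficients(X, y, k) for k in shrinkage_values)
]
```

The entries of `squared_lengths` decrease as the shrinkage parameter grows, which is the sense in which ridge regression shrinks the squared length of the coefficient vector.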

26 Shrinkage parameter
Controls the degree of shrinkage of the regression coefficients. A larger shrinkage parameter shrinks the coefficients further towards zero.
Data-driven methods proposed in the literature cannot be applied when p >> n, because they depend on the ordinary least squares estimates.
Ridge trace (graphical method)

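30 Our starting point
Linear model: $Y = X\beta + \varepsilon$, $\varepsilon \sim \text{iid } N(0, \sigma^2)$
Ridge regression: $\hat{\beta}_k = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + k \sum_{j=1}^{p} \beta_j^2 \right\}$
Proposed by Hoerl, Kennard & Baldwin (1975): $k_{HKB} = \dfrac{p\hat{\sigma}^2}{\hat{\beta}^\top \hat{\beta}}$
$\hat{\sigma}^2$ and $\hat{\beta}$ are estimated from ordinary least squares (OLS).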
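
The HKB formula can be sketched in a few lines of NumPy. This is a minimal illustration assuming centred data with n > p (so that OLS exists) and the usual unbiased residual variance with n - p in the denominator; the function name `k_hkb` and the simulated data are hypothetical.

```python
import numpy as np

def k_hkb(X, y):
    """HKB shrinkage parameter: p * sigma2_hat / (beta_hat' beta_hat)."""
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS, needs n > p
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - p)     # assumed variance estimator
    return p * sigma2_hat / (beta_hat @ beta_hat)

# Simulated, centred data (assumption).
rng = np.random.default_rng(2)
n, p = 100, 8
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()

k = k_hkb(X, y)
```

Because the function relies on the OLS fit, it cannot be used as written once p exceeds n, which is the limitation the following slides address.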

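35 We observed
Linear model: $Y = X\beta + \varepsilon = Z\alpha + \varepsilon$, $\varepsilon \sim \text{iid } N(0, \sigma^2)$
Proposed by Hoerl, Kennard & Baldwin (1975): $k_{HKB} = \dfrac{p\hat{\sigma}^2}{\hat{\beta}^\top \hat{\beta}} = \dfrac{p\hat{\sigma}^2}{\hat{\alpha}^\top \hat{\alpha}}$
$\hat{\alpha}$ are principal components regression (PCR) coefficients.
PCR coefficients are available when p >> n.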
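
The equality of the two denominators follows because the principal component scores are a rotation of the predictors. A small numerical check, assuming Z = XV with V the right singular vectors of a centred X (the simulated data and variable names are illustrative):

```python
import numpy as np

# Simulated, centred data with n > p so that OLS exists (assumption).
rng = np.random.default_rng(1)
n, p = 40, 6
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ rng.normal(size=p) + rng.normal(size=n)
y = y - y.mean()

# SVD X = U D V'; the principal component scores are Z = XV = UD.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt.T

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS on the predictors
alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0]  # OLS on the PC scores
```

Since V is orthogonal, `beta_hat` equals `Vt.T @ alpha_hat` and the two squared lengths agree. The PCR coefficients also have the direct form `(U.T @ y) / d`, which remains computable for the leading components when p exceeds n.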

38 We propose
$k_{HKB} = \dfrac{p\hat{\sigma}^2}{\hat{\alpha}^\top \hat{\alpha}}$
$k_r = \dfrac{r\hat{\sigma}_r^2}{\hat{\alpha}_r^\top \hat{\alpha}_r}$
Harmonic mean of the ideal shrinkage parameters of the PCR coefficients, with coefficients replaced by their ordinary least squares estimates.
How many components?
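
A hedged sketch of how $k_r$ might be computed when p > n. It assumes that $\hat{\alpha}_r$ are the OLS coefficients on the first r principal components and that $\hat{\sigma}_r^2$ is the residual variance of that r-component fit with n - r in the denominator; the slide does not spell out the variance estimator, so that choice is an assumption, as are the function name `k_r` and the simulated data.

```python
import numpy as np

def k_r(X, y, r):
    """Shrinkage parameter from the first r principal components."""
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    alpha_r = (U[:, :r].T @ y) / d[:r]        # PCR coefficients
    fitted = U[:, :r] @ (d[:r] * alpha_r)     # Z_r alpha_r, with Z_r = U_r D_r
    residuals = y - fitted
    sigma2_r = residuals @ residuals / (len(y) - r)  # assumed estimator
    return r * sigma2_r / (alpha_r @ alpha_r)

# Simulated, centred data with p > n (assumption).
rng = np.random.default_rng(5)
n, p = 30, 200
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X[:, :10] @ rng.normal(size=10) + rng.normal(size=n)
y = y - y.mean()

k = k_r(X, y, r=5)
```

Only the leading r singular vectors are used, so nothing here requires the OLS fit on all p predictors.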

39 How many components?
[Figure: % of replicates with larger MSE using k_HKB than using k_r, by signal-to-noise ratio and number of PCs (r).]

40 How many components? Most of the variance in genetic data can be explained by the first few principal components.

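41 How many components?
$\mathrm{PSE} = \left\{ 1 + \dfrac{\mathrm{tr}(HH^\top)}{n} \right\} \sigma^2 + \dfrac{b^\top b}{n} = \text{variance} + \dfrac{\text{bias}^2}{n}$
$H$ is the hat matrix: $\hat{Y} = HY$
Degrees of freedom for variance $= \mathrm{tr}(HH^\top)$ (Hastie & Tibshirani, 1990).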

45 How many components?
For given r, RR estimates have less bias than PCR estimates.
PCR using r components has r degrees of freedom for variance.
We fixed r such that degrees of freedom of the ridge model using r components equals r:
$\mathrm{tr}(HH^\top) = r$
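
The degrees of freedom for variance can be computed from the singular values of X without forming the hat matrix. The sketch below checks that shortcut against the explicit trace for a ridge fit, and confirms that a PCR fit on r components has exactly r degrees of freedom. The function name `df_variance`, the simulated data and the value of k are illustrative assumptions; the search over r itself is not shown.

```python
import numpy as np

def df_variance(d, k):
    """tr(HH') for ridge, using H = U diag(d^2 / (d^2 + k)) U'."""
    shrink = d**2 / (d**2 + k)
    return float(np.sum(shrink**2))

# Simulated, centred predictors with p > n (assumption).
rng = np.random.default_rng(3)
n, p, k = 30, 100, 5.0
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
U, d, _ = np.linalg.svd(X, full_matrices=False)

# Ridge: explicit hat matrix versus the singular-value shortcut.
H_ridge = X @ np.linalg.solve(X.T @ X + k * np.eye(p), X.T)
df_ridge_explicit = float(np.trace(H_ridge @ H_ridge.T))
df_ridge = df_variance(d, k)

# PCR on r components: H = U_r U_r' is a projection, so tr(HH') = r.
r = 5
H_pcr = U[:, :r] @ U[:, :r].T
df_pcr = float(np.trace(H_pcr @ H_pcr.T))
```

With the shortcut, evaluating the ridge degrees of freedom for a candidate shrinkage parameter costs one pass over the singular values, so comparing it with r across many candidate values of r is cheap.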

47 Simulation Study Mean prediction squared error: p-value trace:

48 Simulation study: performance comparison
SNP ranking followed by multivariate regression
HyperLasso
Continuous and binary outcomes
[Table: columns Univariate (% of SNPs ranked by univariate p-value: 0.1%, 0.5%, 1%, 3%, 4%), HLasso and RR; rows Continuous outcomes (mean PSE) and Binary outcomes (mean CE).]

50 Bipolar Disorder Data
Two GWAS of Bipolar Disorder: WTCCC and GAIN.
Case-control studies: model extended to logistic ridge regression.
SNPs typed on different platforms. Impute2 used to obtain common SNPs.
When determining the shrinkage parameter, training data were thinned (1 SNP every 100 kb).
Univariate model: which significance threshold?
HyperLasso: cross-validation to choose the parameters is computationally intensive.
[Table: mean classification error for Univariate (by p-value threshold), HyperLasso and Ridge Regression.]

54 Significance testing in ridge regression Ridge regression is not a variable selection method - the shrinkage penalty does not shrink any coefficient estimates to zero. A test of significance of ridge regression coefficients had been proposed (Halawa & El Bassiouni, 2000) and applied (Malo et al, 2008) but not evaluated. We extended the test to be applicable when p >> n and to be applied in logistic ridge regression, and evaluated its performance on simulated and real data sets.

55 Significance test
Based on a Wald test: $T_k = \dfrac{\hat{\beta}_k}{\mathrm{se}(\hat{\beta}_k)}$, with $T_k \sim N(0, 1)$ under $H_0$
$\mathrm{se}(\hat{\beta}_k)$ from covariance matrix $\mathrm{Var}(\hat{\beta}_k) = \hat{\sigma}^2 (X^\top X + kI)^{-1} X^\top X (X^\top X + kI)^{-1}$
taking into account both correlation in predictors and amount of shrinkage.
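
A minimal sketch of the Wald-type test for a linear ridge fit, following the covariance formula above. The slide does not say how $\hat{\sigma}^2$ is estimated; dividing the residual sum of squares by n - tr(H) is one common choice and is an assumption here, as are the function name `ridge_wald_test` and the simulated data.

```python
import numpy as np
from math import erfc, sqrt

def ridge_wald_test(X, y, k):
    """Ridge estimates, standard errors, test statistics and p-values."""
    n, p = X.shape
    A = np.linalg.inv(X.T @ X + k * np.eye(p))
    beta_k = A @ X.T @ y
    residuals = y - X @ beta_k
    H = X @ A @ X.T
    # Assumption: effective residual degrees of freedom n - tr(H).
    sigma2_hat = residuals @ residuals / (n - np.trace(H))
    cov = sigma2_hat * (A @ X.T @ X @ A)
    se = np.sqrt(np.diag(cov))
    T = beta_k / se
    # Two-sided p-value under N(0, 1): 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    p_values = np.array([erfc(abs(t) / sqrt(2.0)) for t in T])
    return beta_k, se, T, p_values

# Simulated data (assumption): only the first predictor is causal.
rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
beta_true = np.zeros(p)
beta_true[0] = 1.0
y = X @ beta_true + rng.normal(size=n)
y = y - y.mean()

beta_k, se, T, p_values = ridge_wald_test(X, y, k=1.0)
```

The explicit p x p inverse is only practical for small p. The extension to p >> n and to logistic ridge regression described in the talk is not reproduced here.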

56 Simulation study: causal SNP
[Figure: frequency of coefficient estimates and probability distribution of the test statistic T, with p-values, for a causal SNP.]

57 Simulation study: non-causal SNP
[Figure: frequency of coefficient estimates and probability distribution of the test statistic T, with p-values, for a non-causal SNP.]

58 Simulation study: true-positive and false-positive rates
[Table: TPR and FPR for the approximate test and the permutation test, by number of individuals, number of SNPs and shrinkage parameter.]

59 Lung Cancer Data
[Figure: panels (a) approximate test and (b) permutation test, showing -log p-value against shrinkage parameter for selected SNPs (rs identifiers) and other SNPs.]

65 Summary
Prediction is a challenging problem!
Ridge regression is a popular penalized regression approach that has been shown to perform well for prediction.
We propose a semi-automatic method for choosing the shrinkage parameter in ridge regression, which can be applied when p >> n.
We introduced a method for testing the significance of regression coefficients estimated using ridge regression.
We have made ridge regression a feasible tool for genetic risk prediction on a genome-wide scale.

73 R package ridge
Fitting ridge regression models to data comprising hundreds of thousands of predictors presents computational challenges.
We have written an R package, ridge, for fitting such models.
For large data sets, C code is used (with a user-friendly R interface).
Where available, multi-core or GPU computation speeds up matrix operations.
Flexibility to include non-genetic covariates, penalized or not.
Significance test is implemented.
Graphical outputs: ridge and p-value traces.
Option for user-specified shrinkage parameter, with our semi-automatic method as the default.

74 Acknowledgements
Maria De Iorio
Colleagues in the Department of Epidemiology and Biostatistics, Imperial College London
Colleagues in the Department of Statistical Science, University College London
ILCO study nested within EPIC
WTCCC and GAIN studies

75 References
[1] D. G. Clayton. Prediction and interaction in complex disease genetics: experience in type 1 diabetes. PLoS Genetics, 2009.
[2] Erika Cule and Maria De Iorio. A semi-automatic method to guide the choice of ridge parameter in ridge regression. arXiv, stat.AP, May.
[3] Erika Cule, Paolo Vineis, and Maria De Iorio. Significance testing in ridge regression for genetic data. BMC Bioinformatics, 12(1):372.
[4] Ildiko Frank and Jerome Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2), May 1993.
[5] A. M. Halawa and M. Y. El Bassiouni. Tests of regression coefficients under ridge regression models. Journal of Statistical Computation and Simulation, 65(1), 2000.
[6] T. Hastie and R. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
[7] Arthur E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
[8] Clive J. Hoggart, J. C. Whittaker, M. De Iorio, and David J. Balding. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet, 4(7).
[9] Nathalie Malo, Ondrej Libiger, and Nicholas J. Schork. Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. Am J Hum Genet, 82(2), Feb 2008.
[10] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58.
[11] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67(2), Jan 2005.
