Supplementary Materials

Supplementary Methods

K-fold cross-validation procedure

We randomly divided the original dataset $S$ into $K$ subsets. In the $k$-th cross-validation, the $k$-th subset was left out as the validation set, denoted by $S^{k}_{Va}$, while the other $K-1$ subsets were used as the training set, denoted by $S^{k}_{Tr}$. We implemented the forward selection algorithm, using $S^{k}_{Tr}$ to build a series of prediction models, $M^{k}_{1}, M^{k}_{2}, \ldots$, and applied these models to the validation dataset, $S^{k}_{Va}$, to estimate the predicted AUC values. The predicted AUC values from the K-fold cross-validation were averaged at each level of model complexity, yielding a series of averaged predicted AUC values, denoted by $(AUC_{1}, AUC_{2}, \ldots)$. The level at which the averaged AUC value stopped increasing was used as the appropriate complexity level for the final prediction model built on the original dataset $S$.

Cross-validation procedure with bootstrap aggregating (bagging)

As an alternative to the K-fold cross-validation procedure, we also propose a cross-validation procedure with bootstrap aggregating (bagging) to further improve the method's robustness and power. Bagging was first introduced [1] to reduce an estimator's variance with little cost in bias. Its basic principle is to obtain an aggregated predictor by generating multiple predictors from bootstrap replicate samples of the data. The bagging method works especially well when the prediction method is unstable (i.e., small changes in the learning set result in large changes in prediction). Recently, Petersen et al. [2] introduced a cross-validated bagging procedure by integrating bagging into the cross-validation procedure. They showed that an appropriate bias-variance trade-off for the parameter
of interest can be achieved by conducting the cross-validation at the level of the bagging estimator itself (i.e., the cross-validated bagging estimator) [1]. Based on this concept, we can use cross-validation with bagging instead of K-fold cross-validation in the forward ROC method.

To do this, we randomly divide the original dataset $S$ into $K$ subsets. In the $k$-th cross-validation, we draw $B$ bootstrap samples $(B^{k}_{1}, B^{k}_{2}, \ldots, B^{k}_{B})$ from the training set $S^{k}_{Tr}$. For the $b$-th bootstrap sample, we implement the forward selection algorithm to build a series of prediction models, $M^{b}_{1}, M^{b}_{2}, \ldots$, and then apply these models to the validation dataset, $S^{k}_{Va}$, to estimate the predicted AUC values. The predicted AUC values from the $B$ bootstrap samples are then averaged at each level of model complexity, yielding a series of bagging estimators of the predicted AUC values for the $k$-th cross-validation, denoted by $(AUC^{k}_{1}, AUC^{k}_{2}, \ldots)$. The $K$ results are then averaged to provide overall cross-validated bagging AUC estimators, denoted by $(AUC_{1}, AUC_{2}, \ldots)$. The level with the highest average AUC value is used as the complexity level for the final prediction model built on the original dataset $S$.

Cross-validation with bagging is computationally intensive because a number of bootstrap samples are needed for each fold of cross-validation. If K-fold cross-validation requires time $t$, cross-validation with bagging needs time $B \times t$, where $B$ is the number of bootstrap samples in each cross-validation. Despite its heavy computational requirement, cross-validation with bagging has the potential to improve the method's performance and lead to a more stable and more accurate predictive genetic test.
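The cross-validation with bagging described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the forward ROC selection step is not specified in this supplement, so `forward_path` is a hypothetical stand-in that greedily adds the predictor giving the largest training AUC of an unweighted sum score, and the names `auc`, `risk_score`, `forward_path` and `cv_bagging_auc` as well as the toy data are all introduced here for illustration only.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; tied case/control pairs count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def risk_score(row, features):
    """Hypothetical model: unweighted sum of the selected predictors."""
    return sum(row[j] for j in features)

def forward_path(X, y, max_level):
    """Stand-in forward selection (assumption): at each level, add the
    predictor giving the largest training AUC. Returns models M_1, M_2, ..."""
    chosen, path = [], []
    for _ in range(max_level):
        rest = [j for j in range(len(X[0])) if j not in chosen]
        best = max(rest, key=lambda j: auc([risk_score(r, chosen + [j]) for r in X], y))
        chosen = chosen + [best]
        path.append(chosen)
    return path

def cv_bagging_auc(X, y, K=5, B=5, max_level=3, seed=0):
    """Average validation AUC per complexity level over B bootstrap
    samples within each fold, then over the K folds."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    fold_means = []
    for k in range(K):
        val = folds[k]
        train = [i for f in folds[:k] + folds[k + 1:] for i in f]
        Xv, yv = [X[i] for i in val], [y[i] for i in val]
        sums = [0.0] * max_level
        for _ in range(B):
            boot = [rng.choice(train) for _ in train]  # resample with replacement
            Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
            for level, feats in enumerate(forward_path(Xb, yb, max_level)):
                sums[level] += auc([risk_score(r, feats) for r in Xv], yv)
        fold_means.append([s / B for s in sums])
    return [sum(f[lvl] for f in fold_means) / K for lvl in range(max_level)]

# Toy data (assumption): 6 loci coded 0/1/2, only the first two affect disease.
rng = random.Random(1)
X = [[rng.choice([0, 1, 2]) for _ in range(6)] for _ in range(120)]
y = [int(row[0] + row[1] + rng.gauss(0, 0.5) > 2) for row in X]
avg_auc = cv_bagging_auc(X, y)
best_level = max(range(len(avg_auc)), key=avg_auc.__getitem__) + 1
print(avg_auc, best_level)
```

The inner loop over `B` bootstrap samples is what multiplies the running time by roughly $B$ relative to plain K-fold cross-validation; replacing `boot` with `train` and setting `B=1` recovers the K-fold procedure.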
Supplementary table 1. Summary of the simulation settings

| Simulation | Setting | Risk variants^a | Inheritance mode^b | Odds ratios | Interaction^c | Number of noise loci^d |
|---|---|---|---|---|---|---|
| Simulation scenarios I and II | Setting 1 | G^e: a, b, c; E^e: e | G: A, R, D | G: 1.2, 1.3, 1.5; E: 1.4 | a×e: 1.7; b×c: 1.4 | 10 |
| | Setting 2 | same as Setting 1 | same as Setting 1 | same as Setting 1 | same as Setting 1 | 15 |
| | Setting 3 | same as Setting 1 | same as Setting 1 | same as Setting 1 | same as Setting 1 | 20 |
| Supplementary Simulation 2 | Setting 1 | a, b, c | A, R, D | 2, 2.5, 2.6 | No interaction | 20 |
| | Setting 2 | a, b, c, e, f, g | A, R, D, A, R, D | 1.5, 1.7, 1.8, 1.6, 1.6, 1.9 | No interaction | 20 |
| | Setting 3 | a, b, c, e, f, g, h, i, j, k | A, R, D, A, R, D, A, R, D, R | 1.6, 1.9, 1.8, 1.6, 1.7, 1.9, 1.6, 1.5, 1.6, 1.9 | No interaction | 20 |
| Supplementary Simulation 3 | | a, b, c, e, f | A, A, A, A, R | 1.5, 4.0, 1.2, 1.2, 1.3 | a×f: 2.5; b×e: 2.4 | 0 |

a The minor allele frequencies of risk loci were generated randomly from a uniform distribution that ranged from 0.1 to 0.5.
b A, R and D represent additive, recessive and dominant modes of inheritance, respectively.
c Numbers listed in the Interaction column measure the risk to individuals who carry risk alleles of interacting loci and/or environmental factors vs. all other individuals. For the a×e interaction in simulation scenario I, we assumed that individuals who were exposed to the environmental risk level and carried the risk allele a (i.e., high risk individuals) had a 1.7 times higher risk of disease than all other individuals. Similarly, for the b×c interaction, we assumed that individuals who carried two risk alleles at locus b and one or more risk alleles at locus c had a 1.4 times higher risk of disease than all other individuals.
d The allele frequencies of the non-causal loci were generated randomly from a uniform distribution that ranged from 0.1 to 0.9.
e G represents genetic risk factors and E represents environmental risk factors.
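To make the table's conventions concrete, the following Python sketch codes genotypes under the additive (A), recessive (R) and dominant (D) modes and draws allele frequencies from the uniform ranges given in footnotes a and d. Several parts are assumptions that the supplement does not state: genotypes are drawn under Hardy-Weinberg equilibrium, main effects enter a logistic model on the log-odds scale, interactions and the environmental factor are omitted, a cohort is sampled rather than fixed numbers of cases and controls, and `baseline_log_odds` is an arbitrary illustrative value. The function names `encode` and `simulate` are hypothetical.

```python
import math
import random

def encode(g, mode):
    """Code a genotype g (0, 1 or 2 risk alleles) under an inheritance mode."""
    if mode == "A":   # additive: number of risk alleles
        return g
    if mode == "R":   # recessive: two copies required
        return int(g == 2)
    if mode == "D":   # dominant: at least one copy
        return int(g >= 1)
    raise ValueError(f"unknown mode {mode!r}")

def simulate(n, modes, odds_ratios, n_noise, baseline_log_odds=-1.0, seed=0):
    """Draw n individuals; the first len(modes) loci are risk loci,
    the remaining n_noise loci have no effect on disease status."""
    rng = random.Random(seed)
    risk_freqs = [rng.uniform(0.1, 0.5) for _ in modes]             # footnote a
    noise_freqs = [rng.uniform(0.1, 0.9) for _ in range(n_noise)]   # footnote d
    X, y = [], []
    for _ in range(n):
        # Two independent allele draws per locus (Hardy-Weinberg assumption).
        row = [(rng.random() < f) + (rng.random() < f)
               for f in risk_freqs + noise_freqs]
        # zip stops after the risk loci, so noise loci contribute nothing.
        log_odds = baseline_log_odds + sum(
            math.log(o) * encode(g, m)
            for g, m, o in zip(row, modes, odds_ratios))
        p = 1.0 / (1.0 + math.exp(-log_odds))
        X.append(row)
        y.append(int(rng.random() < p))
    return X, y

# Main effects of loci a, b, c from Setting 1 of scenarios I and II.
X, y = simulate(n=500, modes=["A", "R", "D"],
                odds_ratios=[1.2, 1.3, 1.5], n_noise=10)
print(len(X), len(X[0]), sum(y))
```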
Supplementary Simulations

Supplementary simulation 1

We evaluated the effect of cross-validation with bagging on the forward ROC method using simulation scenarios I and II. The details of the simulation settings are listed in supplementary table 1. For each replicate, 50 bootstrap samples were generated for the bagging cross-validation procedure. The results are summarized in supplementary table 2.

In simulation I, we found that the proposed forward ROC method performed similarly overall, whether using 10-fold cross-validation or cross-validation with bagging. When there were a small number of noise loci, 10-fold cross-validation performed slightly better than cross-validation with bagging. However, with an increasing number of noise loci, cross-validation with bagging attained higher AUC means and smaller mean square errors (MSEs) than 10-fold cross-validation. For instance, when there were 20 noise loci, cross-validation with bagging led to an AUC mean of 0.6000, a 0.25% increase from 10-fold cross-validation. In terms of computational efficiency, 10-fold cross-validation and cross-validation with bagging required 1.5 minutes and 50 minutes, respectively, to complete one simulation replicate on a computer equipped with 4 GB of memory.

In simulation II, we found that cross-validation with bagging performed better than 10-fold cross-validation as the missing rate increased. For instance, with 10% missing data, the forward ROC method using cross-validation with bagging had an AUC mean of 0.6102, a 0.39% increase from using 10-fold cross-validation.
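The summary statistics used to compare methods (mean, bias, standard deviation and MSE of the predicted AUC across replicates) are linked by the identity MSE = bias^2 + SD^2 when the SD is the population standard deviation. The short Python sketch below shows this on made-up replicate values; the function name `summarize` and the toy numbers are illustrative assumptions, and the supplement does not say whether a population or sample SD was used.

```python
import statistics

def summarize(auc_estimates, true_auc):
    """Mean, bias, SD and MSE of predicted AUCs over simulation replicates."""
    mean = statistics.fmean(auc_estimates)
    bias = mean - true_auc
    sd = statistics.pstdev(auc_estimates)  # population SD (assumption)
    mse = statistics.fmean([(a - true_auc) ** 2 for a in auc_estimates])
    return mean, bias, sd, mse

# Made-up replicate AUCs, for illustration only.
mean, bias, sd, mse = summarize([0.60, 0.62, 0.64], true_auc=0.65)
print(mean, bias, sd, mse)

# The identity also reproduces a reported row of supplementary table 2:
# BIAS = -0.0312 and SD = 0.0305 give the tabulated MSE after rounding.
reported_mse = round((-0.0312) ** 2 + 0.0305 ** 2, 4)
print(reported_mse)
```

Because the MSE combines both terms, a method can have a slightly larger bias and still a smaller MSE if its replicate-to-replicate variability is lower.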
Supplementary table 2. A comparison of bagging cross-validation and 10-fold cross-validation.

Scenario I

| Number of noise loci | 10-fold CV MEAN^a | 10-fold CV BIAS | 10-fold CV SD^b | 10-fold CV MSE^c | Bagging CV MEAN | Bagging CV BIAS | Bagging CV SD | Bagging CV MSE |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.6198 | -0.0312 | 0.0305 | 0.0019 | 0.6178 | -0.0333 | 0.0297 | 0.0020 |
| 15 | 0.6018 | -0.0492 | 0.0343 | 0.0036 | 0.6036 | -0.0474 | 0.0329 | 0.0033 |
| 20 | 0.5985 | -0.0525 | 0.0398 | 0.0043 | 0.6000 | -0.0510 | 0.0379 | 0.0040 |

Scenario II

| Missing % | 10-fold CV MEAN | 10-fold CV BIAS | 10-fold CV SD | 10-fold CV MSE | Bagging CV MEAN | Bagging CV BIAS | Bagging CV SD | Bagging CV MSE |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.6198 | -0.0312 | 0.0305 | 0.0019 | 0.6178 | -0.0333 | 0.0297 | 0.0020 |
| 10 | 0.6078 | -0.0432 | 0.0317 | 0.0029 | 0.6102 | -0.0408 | 0.0311 | 0.0026 |
| 15 | 0.6007 | -0.0504 | 0.0327 | 0.0036 | 0.6044 | -0.0467 | 0.0323 | 0.0032 |

a AUC estimator. b standard deviation. c mean square error. Both methods are the forward ROC method, using either 10-fold cross-validation (CV) or bagging CV.

Supplementary simulation 2

The random forest (RF) method is a powerful tool for high-dimensional risk prediction [3]. RFs have several useful features, such as the ability to uncover interactions among genes and/or environmental factors that have low marginal effects [4]. We conducted a simulation to compare the proposed forward ROC method with the RF method. In the simulation, we set the proportion of disease-susceptibility loci to 1/8, 1/4 and 1/3. The detailed settings of the simulation are described in supplementary table 1. The true AUC in all three settings was set to be approximately 0.69. The RF method was performed using the R package randomForest version 4.5 with a forest size of 500 trees. We simulated 1000 replicates, each consisting of 1000 cases and 1000 controls.

The simulation results are summarized in supplementary table 3. We found that, when the proportion of disease-susceptibility loci is small (i.e., 1/8) and the effect size is strong, the forward ROC method attained a better performance than the RF method, with an increase of 2.17% in the AUC mean.
When the proportion of disease-susceptibility loci increased to 1/4, the two methods had a similar performance, with AUC means of 0.6538 for forward ROC and 0.6561 for RF. With a
further increase in the proportion of disease-susceptibility loci and a decrease in the effect size, RFs tended to capture more disease-susceptibility loci and attained higher classification accuracy than the forward ROC method. For instance, when the proportion of disease-susceptibility loci increased to 1/3, the RF method performed better than the forward ROC method, with an increase of 5.82% in the AUC mean.

Supplementary table 3. Summary of results from supplementary simulation 2

| Proportion of disease loci | 1/8 Mean | 1/8 MSE | 1/4 Mean | 1/4 MSE | 1/3 Mean | 1/3 MSE |
|---|---|---|---|---|---|---|
| Forward ROC | 0.6679 | 0.0007 | 0.6538 | 0.0021 | 0.6365 | 0.0060 |
| Random Forest (RF) | 0.6537 | 0.0014 | 0.6561 | 0.0018 | 0.6736 | 0.0017 |

Supplementary simulation 3

Some diseases may be caused by a small set of loci with large effect sizes. For instance, five SNPs have been found to be associated with age-related macular degeneration (AMD), and together they explain approximately half of the classical sibling risk of AMD [5]. We conducted a simulation under such a disease scenario to assess the performance of the forward ROC method, classification and regression trees (CART) and the allele counting method. In this simulation, we introduced five disease-susceptibility loci and two two-way interactions, without including any noise loci (for details, see supplementary table 1). The true AUC was set at 0.8. We compared the forward ROC method with the CART and allele counting methods, based on 1000 simulation replicates. The predicted AUC was calculated from the evaluation sets and the results are summarized in supplementary table 4.

We found that the proposed forward ROC method performed better than the other two methods, with a higher AUC mean and a lower MSE. The forward ROC method attained a predicted AUC of 0.7892, which was close to the true AUC of 0.8. The CART and allele counting methods attained AUC values of 0.7633 and 0.7366, respectively. The standard deviations of the AUCs for all three methods were relatively small: 0.0114, 0.0132 and
0.0118 for the forward ROC method, CART and the allele counting method, respectively.

Supplementary table 4. Summary of results from supplementary simulation 3

| Method | MEAN^a | BIAS | SD^b | MSE^c |
|---|---|---|---|---|
| Forward ROC | 0.7892 | -0.0122 | 0.0114 | 0.0003 |
| CART | 0.7633 | -0.0381 | 0.0132 | 0.0016 |
| Allele counting | 0.7366 | -0.0647 | 0.0118 | 0.0043 |

a AUC estimator. b standard deviation. c mean square error.

Wellcome Trust RA GWAS Dataset

RA cases from the Wellcome Trust study were recruited by the Arthritis Research Campaign Epidemiology Unit and met the standard clinical criteria for RA. One-half of the controls were chosen from the 1958 British Birth Cohort and the other half came from the UK Blood Service controls. All individuals were genotyped using the Affymetrix 500K chip. We excluded samples of low quality (e.g., low DNA quality). The final samples for analysis comprised 2938 RA cases and 1860 controls. After removing low quality single nucleotide polymorphisms (SNPs) (e.g., those with low allele frequencies), 460,547 SNPs remained for analysis.

Supplementary table 5. Summary of the 35 RA associated loci reported from different studies

| SNP | Gene | Chromosome | Source | Reference |
|---|---|---|---|---|
| rs11162922 | IFI44 | 1 | 500K | [6] |
| rs6684865 | MMEL1 | 1 | 500K | [6] |
| rs3890745 | MMEL1-TNFRSF14 | 1 | 500K | [7] |
| rs2240340 | PADI4 | 1 | Imputed | [8] |
| rs2476601 | PTPN22 | 1 | Imputed | [9] |
| rs6679677 | RSBN1 | 1 | 500K | [6] |
| rs1061622 | TNFRSF1B | 1 | Imputed | [8] |
| rs3087243 | CTLA4 | 2 | 500K | [8] |
| rs3738919 | ITGAV | 2 | Imputed | [10] |
| rs7574865 | STAT4 | 2 | Imputed | [11] |
| rs3816587 | ANAPC4 | 4 | 500K | [6] |
| rs6822844 | IL2-IL21 | 4 | Imputed | [12] |
| rs3817964 | HLA-DRB1 | 6 | Imputed | [13] |
| rs660895 | HLA-DRB1 | 6 | Imputed | [13] |
| rs6910071 | HLA-DRB1 | 6 | Imputed | [13] |
| rs6457617 | MHC | 6 | 500K | [6] |
| rs2442728 | MHC | 6 | Imputed | [14] |
| rs4678 | MHC:VARS2L | 6 | Imputed | [14] |
| rs6920220 | OLIG3-TNFAIP3 | 6 | 500K | [6] |
| rs10499194 | OLIG3-TNFAIP3 | 6 | Imputed | [15] |
| rs42041 | CDK6 | 7 | 500K | [7] |
| rs2280714 | IRF5 | 7 | Imputed | [16] |
| rs11761231 | N/A | 7 | 500K | [6] |
| rs2812378 | CCL21 | 9 | 500K | [7] |
| rs1953126 | PHF19 | 9 | Imputed | [17] |
| rs10818488 | TRAF1 | 9 | Imputed | [18] |
| rs3761847 | TRAF1-C5 | 9 | Imputed | [19] |
| rs2104286 | IL2RA | 10 | 500K | [6] |
| rs4750316 | PRKCQ | 10 | 500K | [7] |
| rs1678542 | KIF5A-PIP4K2C | 12 | 500K | [7] |
| rs1324913 | KLF12 | 13 | Imputed | [20] |
| rs9550642 | N/A | 13 | 500K | [6] |
| rs4810485 | CD40 | 20 | 500K | [7] |
| rs2837960 | N/A | 21 | 500K | [6] |
| rs743777 | C1QTNF6 | 22 | 500K | [6] |

References

1 Breiman L: Bagging predictors. Machine Learning 1996;24:123-140.
2 Petersen ML, Molinaro AM, Sinisi SE, Van der Laan MJ: Cross-validated bagged learning. Journal of Multivariate Analysis 2007;98:1693-1704.
3 Breiman L: Random Forests. Machine Learning 2001;45:5-23.
4 Bureau A, Dupuis J, Falls K, Lunetta KL, Hayward B, Keith TP, Van EP: Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol 2005;28:171-182.
5 Maller J, George S, Purcell S, Fagerness J, Altshuler D, Daly MJ, Seddon JM: Common
variation in three genes, including a noncoding variant in CFH, strongly influences risk of age-related macular degeneration. Nat Genet 2006;38:1055-1059.
6 WTCCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007;447:661-678.
7 Raychaudhuri S, Remmers EF, Lee AT, Hackett R, Guiducci C, Burtt NP, Gianniny L, Korman BD, Padyukov L, Kurreeman FA, Chang M, Catanese JJ, Ding B, Wong S, van der Helm-van Mil AH, Neale BM, Coblyn J, Cui J, Tak PP, Wolbink GJ, Crusius JB, van der Horst-Bruinsma IE, Criswell LA, Amos CI, Seldin MF, Kastner DL, Ardlie KG, Alfredsson L, Costenbader KH, Altshuler D, Huizinga TW, Shadick NA, Weinblatt ME, de VN, Worthington J, Seielstad M, Toes RE, Karlson EW, Begovich AB, Klareskog L, Gregersen PK, Daly MJ, Plenge RM: Common variants at CD40 and other loci confer risk of rheumatoid arthritis. Nat Genet 2008;40:1216-1223.
8 Plenge RM, Padyukov L, Remmers EF, Purcell S, Lee AT, Karlson EW, Wolfe F, Kastner DL, Alfredsson L, Altshuler D, Gregersen PK, Klareskog L, Rioux JD: Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4. Am J Hum Genet 2005;77:1044-1060.
9 Lee AT, Li W, Liew A, Bombardier C, Weisman M, Massarotti EM, Kent J, Wolfe F, Begovich AB, Gregersen PK: The PTPN22 R620W polymorphism associates with RF positive rheumatoid arthritis in a dose-dependent manner but not with HLA-SE status. Genes Immun 2005;6:129-133.
10 Jacq L, Garnier S, Dieude P, Michou L, Pierlot C, Migliorini P, Balsa A, Westhovens R, Barrera P, Alves H, Vaz C, Fernandes M, Pascual-Salcedo D, Bombardieri S, Dequeker J, Radstake TR, Van RP, van de Putte L, Lopes-Vaz A, Glikmans E, Barbet S, Lasbleiz S, Lemaire I, Quillet P, Hilliquin P, Teixeira VH, Petit-Teixeira E, Mbarek H, Prum B, Bardin T, Cornelis F: The ITGAV rs3738919-C allele is associated with rheumatoid arthritis in the European Caucasian population: a family-based study. Arthritis Res Ther 2007;9:R63.
11 Orozco G, Alizadeh BZ, Delgado-Vega AM, Gonzalez-Gay MA, Balsa A, Pascual-Salcedo D, Fernandez-Gutierrez B, Gonzalez-Escribano MF, Petersson IF, van Riel
PL, Barrera P, Coenen MJ, Radstake TR, van Leeuwen MA, Wijmenga C, Koeleman BP, Alarcon-Riquelme M, Martin J: Association of STAT4 with rheumatoid arthritis: a replication study in three European populations. Arthritis Rheum 2008;58:1974-1980.
12 Zhernakova A, Alizadeh BZ, Bevova M, van Leeuwen MA, Coenen MJ, Franke B, Franke L, Posthumus MD, van Heel DA, van der Steege G, Radstake TR, Barrera P, Roep BO, Koeleman BP, Wijmenga C: Novel association in chromosome 4q27 region with rheumatoid arthritis and confirmation of type 1 diabetes point to a general risk locus for autoimmune diseases. Am J Hum Genet 2007;81:1284-1288.
13 Gorman JD, David-Vaudey E, Pai M, Lum RF, Criswell LA: Particular HLA-DRB1 shared epitope genotypes are strongly associated with rheumatoid vasculitis. Arthritis Rheum 2004;50:3476-3484.
14 Vignal C, Bansal AT, Balding DJ, Binks MH, Dickson MC, Montgomery DS, Wilson AG: Genetic association of the major histocompatibility complex with rheumatoid arthritis implicates two non-DRB1 loci. Arthritis Rheum 2009;60:53-62.
15 Plenge RM, Cotsapas C, Davies L, Price AL, de Bakker PI, Maller J, Pe'er I, Burtt NP, Blumenstiel B, DeFelice M, Parkin M, Barry R, Winslow W, Healy C, Graham RR, Neale BM, Izmailova E, Roubenoff R, Parker AN, Glass R, Karlson EW, Maher N, Hafler DA, Lee DM, Seldin MF, Remmers EF, Lee AT, Padyukov L, Alfredsson L, Coblyn J, Weinblatt ME, Gabriel SB, Purcell S, Klareskog L, Gregersen PK, Shadick NA, Daly MJ, Altshuler D: Two independent alleles at 6q23 associated with risk of rheumatoid arthritis. Nat Genet 2007;39:1477-1482.
16 Han SW, Lee WK, Kwon KT, Lee BK, Nam EJ, Kim GW: Association of polymorphisms in interferon regulatory factor 5 gene with rheumatoid arthritis: a metaanalysis. J Rheumatol 2009;36:693-697.
17 Chang M, Rowland CM, Garcia VE, Schrodi SJ, Catanese JJ, van der Helm-van Mil AH, Ardlie KG, Amos CI, Criswell LA, Kastner DL, Gregersen PK, Kurreeman FA, Toes RE, Huizinga TW, Seldin MF, Begovich AB: A large-scale rheumatoid arthritis genetic study identifies association at chromosome 9q33.2. PLoS Genet 2008;4:e1000107.
18 Kurreeman FA, Padyukov L, Marques RB, Schrodi SJ, Seddighzadeh M, Stoeken-Rijsbergen G, van der Helm-van Mil AH, Allaart CF, Verduyn W, Houwing-Duistermaat J, Alfredsson L, Begovich AB, Klareskog L, Huizinga TW, Toes RE: A candidate gene approach identifies the TRAF1/C5 region as a risk factor for rheumatoid arthritis. PLoS Med 2007;4:e278.
19 Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LR, Li W, Tan AK, Bonnard C, Ong RT, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study. N Engl J Med 2007;357:1199-1209.
20 Julia A, Ballina J, Canete JD, Balsa A, Tornero-Molina J, Naranjo A, Alperi-Lopez M, Erra A, Pascual-Salcedo D, Barcelo P, Camps J, Marsal S: Genome-wide association study of rheumatoid arthritis in the Spanish population: KLF12 as a risk locus for rheumatoid arthritis susceptibility. Arthritis Rheum 2008;58:2275-2286.