Roepman P, et al. An immune response enriched 72-gene prognostic profile for early stage Non-Small-

Supplementary Data 1. Patient characteristics of training and validation set. Patient selection and inclusion overview can be found in Supp Data 9.

                                              Training set (103)     Validation set (69)
                                              n        (%)           n        (%)
Gender
  male                                        77       74.8          51       73.9
  female                                      26       25.2          18       26.1
Age at diagnosis
  median (range)                              62 (41-77)             67 (22-79)
Hospital
  NKI                                         30       29.1          6        8.7
  Heidelberg                                  18       17.5          14       20.3
  Bialystok                                   12       11.7          1        1.4
  Gdansk                                      32       31.1          27       39.1
  VUmc                                        11       10.7          21       30.4
Smoking
  current smoker                              45       43.7          30       43.5
  former smoker                               44       42.7          28       40.6
  non-smoker                                  3        2.9           3        4.3
  unknown                                     11       10.7          8        11.6
Histology
  large cell carcinoma                        8        7.8           2        2.9
  squamous cell carcinoma                     57       55.3          35       50.7
  adenocarcinoma                              33       32.0          23       33.3
  other                                       5        4.9           9        13.0
Stage
  I                                           72       69.9          45       65.2
  II                                          31       30.1          24       34.8
Follow-up period (months)
  median (range)                              46 (4-156)             24 (0.5-111)
Status
  alive / censored                            59       57.3          33       47.8
  death lung cancer                           35       34.0          16       23.2
  death other                                 9        8.7           20       29.0
Relapse-free survival time (months)
  median (range)                              43 (2-156)             22 (0.5-111)
Overall survival time (months)
  median (range)                              46 (4.3-156)           24 (0.5-111)
Treatment before surgery
  yes                                         5        4.9           2        2.9
  no                                          96       93.2          58       84.1
  unknown                                     2        1.9           9        13.0

Supplementary Data 2 Hierarchical clustering (Euclidean distance, complete-linkage clustering) of NSCLC samples based on expression of all genes.
[Figure legend. Hospital: NKI-AvL, Heidelberg, Bialystok, Gdansk, VUmc. Histology: squamous cell carcinoma, adenocarcinoma, other non-small cell. Stage: stage I, stage II. Survival: alive or censored, death by lung cancer.]
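The clustering named in this caption (Euclidean distance between samples, with clusters merged by their largest pairwise distance) can be illustrated with a minimal pure-Python sketch. The function names and the toy input are assumptions made for illustration; the software actually used for the figure is not stated in the source.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(samples):
    """Agglomerative clustering; returns the merge history.

    Each merge is (members_a, members_b, distance), where distance is the
    largest pairwise distance between the two clusters (complete linkage).
    """
    clusters = [[i] for i in range(len(samples))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest complete-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(samples[p], samples[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical toy data: four samples, two "genes" each.
merges = complete_linkage([[0, 0], [0, 1], [5, 5], [5, 6]])
```

This brute-force version recomputes all distances at every step, which is only practical for small inputs; it is meant to show the merge rule, not to reproduce the figure.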

Supplementary Data 3 Schematic overview of the multiple sampling procedure that was used for development of a robust nearest mean classifier. A 10-fold cross validation loop was used to identify genes whose expression ratios correlate with overall and recurrence-free survival time. The initial 103 training samples were randomly split into a training set (n=93) and a test set (n=10). The training set was used to identify which genes correlate best with overall (OS) and recurrence-free (RFS) survival, based on three statistics to reduce the selection of genes based on noise as much as possible: Welch t-test, logrank test and Cox proportional hazard ratio. Subsequently the top 100 genes were used for building a nearest mean classifier, and its performance was tested on the test set. Repeating this procedure for different training and test splits (multiple sampling) resulted in multiple sets of most prognostic genes. The multiple gene rankings for OS and RFS were combined, and the set of most prognostic genes (the genes most often selected during the multiple sampling procedure) was selected via a top-down approach. The performance of the optimal set of genes (72 genes) was evaluated using a leave-one-out approach on all training samples to define the classifier performance using the optimal threshold. The NSCLC 72-gene classifier was finally validated on the independent set of validation samples (n=69).
[Figure: flowchart. Input: ±42,000 gene probes, 103 samples (>2y follow-up). 10-fold cross validation loop, repeated >500x: randomly split samples (93 train, 10 test); gene scoring (Welch t-test, logrank test, Cox proportional hazard ratio); score gene ranking; select top 100 genes; nearest mean classifier; test on test samples; score sample outcome. Combine multiple gene rankings and combine multiple predictions (sample prediction robustness & accuracy); select set of best prognostic genes (72). LOO cross-validation on 103 training samples: classifier performance using optimal threshold; determine low-risk and high-risk profile. Independent validation: 69 samples.]
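The multiple sampling loop of Supplementary Data 3 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: genes are scored with the Welch t-statistic only (the source combines it with a logrank test and a Cox hazard ratio), outcome is reduced to a binary label, and the nearest mean classifier compares a sample to the class means by Euclidean distance, a similarity measure the source does not specify. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def welch_t(a, b):
    """Welch t-statistic for two groups with unequal variances."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    # Degenerate case (no variance) is treated as uninformative in this sketch.
    return (ma - mb) / se if se > 0 else 0.0

def rank_genes(X, y, top_n):
    """Indices of the top_n genes by absolute Welch t between classes 0 and 1."""
    scores = []
    for g in range(len(X[0])):
        lo = [row[g] for row, label in zip(X, y) if label == 0]
        hi = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(welch_t(lo, hi)), g))
    return [g for _, g in sorted(scores, reverse=True)[:top_n]]

def class_means(X, y, genes):
    """Mean expression profile per class over the selected genes."""
    means = {}
    for label in (0, 1):
        rows = [row for row, l in zip(X, y) if l == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def predict(row, means, genes):
    """Assign the class whose mean profile is closest (Euclidean: an assumption)."""
    def dist(label):
        return sum((row[g] - m) ** 2 for g, m in zip(genes, means[label]))
    return min(means, key=dist)

def multiple_sampling(X, y, n_splits, n_test, top_n, seed=0):
    """Repeat random train/test splits; count gene selections and test accuracy."""
    rng = random.Random(seed)
    selected = Counter()
    correct = total = 0
    idx = list(range(len(X)))
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        genes = rank_genes(Xtr, ytr, top_n)   # genes are chosen on training data only
        selected.update(genes)
        means = class_means(Xtr, ytr, genes)
        for i in test:
            correct += predict(X[i], means, genes) == y[i]
            total += 1
    return selected, correct / total

# Hypothetical toy data: 20 samples, 3 genes; only gene 0 tracks the label.
y = [0] * 10 + [1] * 10
X = [[5.0 * label + 0.1 * (i % 3), float(i % 4), float((i * 7) % 5)]
     for i, label in enumerate(y)]
selected, accuracy = multiple_sampling(X, y, n_splits=50, n_test=2, top_n=1)
```

The counter of selected genes corresponds to the "most often selected genes" used to build the final gene set. With the figures quoted in the text, the call would use roughly `n_splits=500`, `n_test=10` and `top_n=100`. Because gene selection happens inside each split, the test samples never influence the ranking they are evaluated against.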

Supplementary Data 4. Kaplan-Meier survival estimates of overall survival (OS) (A) and recurrence-free survival (RFS) (B) based on the multiple sampling outcomes of the test samples using the 10-fold cross validation procedure described in Supp. Data 3. These results indicated that the multiple sampling approach is suitable for development of a nearest mean classifier that is unbiased towards the training samples.
[Figure: panel A, overall survival, low-risk profile versus high-risk profile, P = 0.001; panel B, recurrence-free survival, low-risk profile versus high-risk profile, P = 0.011. x-axis: months (0-140); y-axis: survival (0.0-1.0).]
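For readers unfamiliar with the curves in this figure, the Kaplan-Meier (product-limit) estimate can be sketched in a few lines. The function name and the toy follow-up data are assumptions for illustration; the P-values in the figure come from a separate group comparison that this sketch does not reproduce.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time per patient (e.g. months)
    events: 1 if the event (death or recurrence) was observed, 0 if censored
    Returns a list of (time, survival_probability), one entry per event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without lowering the curve.
        at_risk -= removed
    return curve

# Hypothetical toy cohort: five patients, two of them censored.
curve = kaplan_meier([5, 10, 10, 20, 30], [1, 1, 0, 1, 0])
```

Computing one such curve per predicted risk group (low-risk and high-risk) gives the two step functions plotted in each panel.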

Supplementary Data 5 NSCLC survival associated genes. The set of 72 prognostic classifier genes is ranked according to the gene expression association with recurrence-free survival.

Supplementary Data 6 Kaplan-Meier survival estimates of overall survival based on the 72-gene classifier within tumour stage (A) and based on the 72-gene classifier within tumour histology (B).
[Figure: panel A, overall survival for stage I & low-risk, stage II & low-risk, stage I & high-risk and stage II & high-risk, P<0.001; panel B, overall survival for scc & low-risk, scc & high-risk, adeno & low-risk and adeno & high-risk, P<0.001. x-axis: months (0-150); y-axis: overall survival (0.0-1.0).]

Supplementary Data 7 Functional category analysis of the classifier signature. Functional categories within the set of 72 prognostic genes were identified using gene ontology (GO) analysis. Significantly (P<0.01) overrepresented GO categories are colored red. Blue shaded areas indicate categories associated with immune response, green with antigen binding, yellow with protein modification and degradation.
[Figure: GO graph. Node labels: antimicrobial humoral response; humoral defense mechanism; lymphocyte activation; humoral immune response; hemopoiesis; immune cell activation; protein modification; response to pest, pathogen or parasite; response to stress; immune response; response to other organism; lymphocyte differentiation; defense response; response to biotic stimulus; hemopoietic or lymphoid organ development; development; protein catabolism; ubiquitin cycle; protein metabolism; proteolysis during cellular protein catabolism; cellular protein catabolism; macromolecule metabolism; cellular macromolecule catabolism; dent macromolecule catabolism; lytic vacuole; response to stimulus; physiological process; mrna binding; nucleic acid binding; lysosome; vacuole; cytoplasm; organelle; integral to plasma membrane; membrane; biological process; gene ontology; cellular component; binding; molecular function; catalytic activity; antigen binding; translation factor activity, nucleic acid binding; translation regulator activity; carboxylic ester hydrolase activity; intrinsic to plasma membrane; hydrolase activity; thiolester hydrolase activity.]
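The source does not state which test produced the GO P-values. A common choice for overrepresentation analysis is a one-sided hypergeometric test, and the sketch below assumes it purely for illustration. The function name and every count in the example are hypothetical, not taken from the study.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_category, n_selected, n_overlap):
    """One-sided hypergeometric P-value for overrepresentation.

    Probability of seeing at least n_overlap category genes when n_selected
    genes are drawn without replacement from n_universe genes, of which
    n_category belong to the category.
    """
    upper = min(n_category, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(comb(n_category, k) * comb(n_universe - n_category, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical counts: 10 genes in total, 3 annotated to a category,
# 3 genes in the signature, all 3 of them in the category.
p = hypergeom_enrichment_p(10, 3, 3, 3)
```

Under this assumed test, a category would be marked as overrepresented when its P-value falls below the 0.01 threshold quoted in the caption; whether any multiple-testing correction was applied is not stated in the source.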

Supplementary Data 8 Comparison of different early stage NSCLC prognostic gene classifiers

Supplementary Data 9 Patient selection and inclusion overview.