ORIGINAL RESEARCH

Assessing the Statistical Significance of the Achieved Classification Error of Classifiers Constructed using Serum Peptide Profiles, and a Prescription for Random Sampling Repeated Studies for Massive High-Throughput Genomic and Proteomic Studies

James Lyons-Weiler a,f, Richard Pelikan b, Herbert J Zeh III c,f, David C Whitcomb d,f, David E Malehorn e,f, William L Bigbee e,f, Milos Hauskrecht b,f

a Department of Pathology, Cancer Biomarkers Laboratory, Center for Pathology Informatics, Benedum Oncology Informatics Center
b Department of Computer Science
c Department of Surgery
d Departments of Medicine, Cell Biology & Physiology, and Human Genetics
e Clinical Proteomics Facility
f University of Pittsburgh Cancer Institute
University of Pittsburgh

Abstract: Peptide profiles generated using SELDI/MALDI time of flight mass spectrometry provide a promising source of patient-specific information with high potential impact on the early detection and classification of cancer and other diseases. The new profiling technology comes, however, with numerous challenges and concerns. Particularly important are concerns of reproducibility of classification results and their significance. In this work we describe a computational validation framework, called PACE (Permutation-Achieved Classification Error), that lets us assess, for a given classification model, the significance of the Achieved Classification Error (ACE) on the profile data. The framework compares the performance statistic of the classifier on true data samples and checks if these are consistent with the behavior of the classifier on the same data with randomly reassigned class labels. A statistically significant ACE increases our belief that a discriminative signal was found in the data. The advantage of PACE analysis is that it can be easily combined with any classification model and is relatively easy to interpret. PACE analysis does not protect researchers against confounding in the experimental design, or other sources of systematic or random error. We use PACE analysis to assess significance of classification results we have achieved on a number of published data sets.
The results show that many of these datasets indeed possess a signal that leads to a statistically significant ACE.

Keywords: ovarian cancer, pancreatic cancer, prostate cancer, biomarkers, bioinformatics, proteomics, disease prediction models, early detection

Introduction

High-throughput, low resolution time-of-flight mass spectrometry systems such as surface-enhanced laser desorption ionization time of flight (SELDI-TOF) mass spectrometry (SELDI; Merchant and Weinberger, 2000; Issaq et al., 2002) and matrix-assisted laser desorption/ionization time of flight mass spectrometry (MALDI) are just beginning to emerge as widely recognized high-throughput data sources for potential markers for the early detection of cancer (Wright et al., 1999; Adam et al., 2001; Petricoin et al., 2002). Spectra, or peptide profiles, are readily generated from easily collected samples such as serum, urine, lymph, and cell lysates. Comparisons have been made for a large number of cancers (Table 1) in search of diagnostic markers, with astonishingly good initial results for the classification of cancer and control profiles collected within respective studies.

With these very promising results, questions about the significance and reproducibility of such classification results become pressing. Reproducibility and significance are essential with these types of data since the identity of the peptides located at clinically significant m/z positions that translate to the classification accuracy are unknown and their correctness cannot be verified through independent experimental studies. The process of peptide profile generation is subject to many sources of systematic errors. If these are not properly understood they can potentially jeopardize the validity of the results. Such concerns have led to the analysis of possible biases present in published data sets and questions on the reproducibility of some of the obtained classification results under the proper experimental setup (Baggerly et al., 2004).
Such studies highlight the need for randomization of sample order acquisition and processing, maintaining constant protocols over the course of a study (including sample handling and storage conditions), identification of potential confounding factors and the use of a balanced study design whenever possible to allow proper characterization of variation in the non-diseased population. Certainly, a design matrix should be created for each study and inspected for patterns that reflect complete or severe partial incidental confounding. In addition, multi-site validation studies, which are currently ongoing in the EDRN (Early Detection Research Network), can help to identify possible problems.

Correspondence: James Lyons-Weiler, lyonsweilerj@upmc.edu
Cancer Informatics 2005:1(1) 53-77

Table 1: Published sensitivities (SN) and specificities (SP) of SELDI-TOF-MS profiling for various types of cancers

Cancer Type           SN, SP          Reference
Ovarian Cancer        100%, 95%       Petricoin et al., 2002
Prostate Cancer       100%, 100%      Qu et al., 2002
Breast Cancer         90%, 93%        Vlahou et al., 2003
Breast Cancer         91%, 93%        Li et al., 2002
Head & Neck Cancer    83%, 90%        Wadsworth et al., 2004
Lung Cancer           93.3%, 96.7%    Xiao et al., 2003
Pancreatic Cancer     78%, 97%        Koopmann et al., 2004

The peptide profile data are not perfect and include many random components. The presence of large amounts of randomness is a threat for interpretive data analysis; the randomness increases the possibility of identifying a structure and patterns in a completely uninformative signal. In such a case we want to have an additional assurance that the data and results of interpretive (classification) analysis obtained for these data are not due to chance. Permutation tests (Kendall, 1945; Good, 1994), used commonly in statistics, offer one solution to this problem and allow us to determine the significance of the result under random permutation of target labels. In this work, building upon the permutation test theory, we propose a permutation-based framework called PACE (Permutation-Achieved Classification Error) that can assess the significance of the classification error for a given classification model with respect to the null hypothesis under which the error result is generated in response to random permutation of the class labels.
The main advantage of the PACE analysis is that it is independent of the model design. This allows the problems of choosing the best disease prediction model and achieving a significant result to become decoupled. Many of the methods of high-throughput data analysis are very advanced, and thus may be poorly understood by the majority of researchers who would like to adopt a reliable analysis strategy. Understanding PACE analysis involves only visual examination of an intuitive graph (e.g., Figure 1), which makes it easy to apply and explain to the novice.

In the following we first describe the classification problem and evaluation of the classification performance. Next we introduce the PACE framework that offers additional assessment of the significance of the results. We compare PACE conceptually to existing confidence assessment methods; it is found to be potentially complementary to confidence interval-based bootstrap methods, which seek to determine whether a confidence interval around a statistic of interest includes a single point (or a series of single points; i.e., the ROC curve). Finally, we apply PACE analysis to a number of published and new SELDI-TOF-MS data sets. We demonstrate with positive and negative results the utility of reporting not only the ACE but also whether a given ACE is statistically significant. PACE thus provides a beginning reference point for researchers to determine objectively whether they have constructed a significant classifier in the discovery phase.

Evaluation of classifiers

Classification

Classification is the task of assigning class labels to sample data which come from more than one category. In our case, the classification task is to determine whether a particular proteomic profile comes from a case (cancerous) or control (non-cancerous) population. A classification model which assigns labels (either case or control) to profiles can be learned from training examples, that is, profiles with known case and control labels. The goal is to achieve a classifier that performs as well as possible on future data.

Practical concerns related to classifier learning include the possibility of model overfit. Overfit occurs when the classification model is biased strongly towards training examples and generalizes poorly to new (unseen) examples. Typically, model overfit occurs due to the inclusion of too many parameters in the model in conjunction with a small number of examples. To assess the ability of the classification model to generalize to future data we can split the data from the study into training and test sets; the training set is used in the learning stage to build the classifier, while the test set is withheld from the learning stage and is used for evaluation purposes only.

Evaluation

Training set: a collection of samples used to identify features and classification rules based on discriminatory information derived from the comparisons of features between or among groups.

Test set: a collection of samples to which the classification rules learned from the training set are applied to produce an estimate of the external generalizability of the estimated classification error. The classification error rate observed when the classifier is applied to them is called the test error rate. (Similarly, the sensitivity is called test set sensitivity, etc.) The classifier rules learned include parameters optimized using the training set that are then included in the prediction phase (for predictions on the test set). Test errors are usually higher than the training errors; Feng et al. refer to the difference as "optimism" (Z. Feng, personal communication).
Test errors are less biased than training errors, and therefore are more (but not completely) reflective of the expected classification error should the classifier be applied to new cases from the same population. The use of the test data set errors as the estimate is appropriate because it is low-biased compared to the classification errors achieved using only the training data set. The test set may be a held-out set of samples, or, more commonly, a number of held-out sets to avoid inaccuracy of ACE.

Validation set: a set of samples collected and/or processed and/or analyzed in a laboratory or at a site different from the laboratory or site where the original training sets were produced. Validation sets are never included in the learning step. All validation sets are test sets but not all test sets are validation sets. The more independence there is among sample sets, laboratory protocols, and implementation of a particular method of predicting class membership, the more robust the biomarkers.

Cross-validation

Methods for estimating the test error include leave-one-out cross-validation, k-fold validation, and random subsampling validation. The selection of each of these depends in part on the number of samples available; these methods and their suitability for application to the analysis of high-throughput genomic and proteomic data sets have recently been explored (Braga-Neto & Dougherty, 2004). Use of the test error rates and performance measures derived from those rates allows one to assess the expected sensitivity (SN) and specificity (SP) of a given test or classifier; these performance measures are usually summarized in a confusion matrix. Even with these estimated performance measures, however, a more general question remains: for a broad range of potential outcomes and focus, from biomarker evaluation, discovery, validation and translation, what level of sensitivity is to be deemed significant, or sufficient, at a specified level of specificity?
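As a rough illustration of random subsampling validation and the confusion-matrix summaries discussed above, the following Python sketch pools test-set predictions over repeated random training/test splits and reports SN, SP and the test error. The nearest-mean classifier, the function names, the 30% test fraction and the default of 10 splits are all assumptions made for this example; they are not the models or settings used in the study.

```python
import random


def nearest_mean_fit(X, y):
    """Hypothetical stand-in classifier: store the mean profile of each class."""
    means = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def nearest_mean_predict(means, x):
    """Assign x to the class whose mean profile is closest (squared distance)."""
    def dist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(means, key=lambda label: dist(means[label]))


def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with `positive` treated as the case label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def random_subsampling(X, y, n_splits=10, test_fraction=0.3, seed=0):
    """Pool test-set confusion counts over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    n_test = max(1, int(round(test_fraction * len(y))))
    tp = fp = tn = fn = 0
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # The classifier only ever sees the training portion of the split.
        model = nearest_mean_fit([X[i] for i in train], [y[i] for i in train])
        preds = [nearest_mean_predict(model, X[i]) for i in test]
        a, b, c, d = confusion_counts([y[i] for i in test], preds)
        tp, fp, tn, fn = tp + a, fp + b, tn + c, fn + d
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tn / (tn + fp) if tn + fp else float("nan")
    return {"SN": sn, "SP": sp, "error": (fp + fn) / (tp + fp + tn + fn)}
```

Pooling the counts across splits is one design choice among several; averaging per-split error rates would serve equally well for balanced splits.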
The clear overall objective of maximizing both SN and SP is built into the receiver-operator-characteristic (ROC) evaluation of a test, and the search for the most informative test usually seeks to maximize the area under the curve (AUC). Estimates of SN, SP, the ROC curve, and its area can all be determined using random subsampling validation. These approaches are well-studied, and their estimates of expected classification error are generally understood to be less biased than those estimated using training data sets.

Permutation based validation

The individual performance statistics, by themselves, do not always allow us to judge the importance of the result. In particular, one should always be concerned by the possibility that the observed statistic is the result of chance. Careful elimination of this possibility gives more credibility to the result and establishes its potential importance. Permutation test methods offer a class of techniques that make this assessment possible under a wide variety of assumptions. Expected performance under the null model varies with the specifics of a design, and the distribution of the performance statistics varies with the distribution of information among markers and the type of disease prediction model used.

Permutation test methods work by comparing the statistic of interest with the distribution of the statistic obtained under the null (random) condition. Our priority in predictive models is to critically evaluate the observed discriminatory performance. In terms of hypothesis testing, the null hypothesis we want to reject is:

The performance statistic of the disease prediction model on the true data is consistent with the performance of the model on the data with randomly assigned class labels.

The objective of optimizing a classification score itself is largely uncontrolled in most genomic and proteomic high-throughput analysis studies. Researchers do not, for example, typically attempt to determine, and therefore do not report, the statistical significance of the sensitivity of a test, in spite of the existence of a number of approaches for performing such assessments. Here we introduce a permutation method for assessing significance on the achieved classification error (ACE) of a constructed prediction model.

Theory

A permutation test is a non-parametric approach to hypothesis testing, which is useful when the distribution for the statistic of interest T is unknown. By evaluating a classifier's statistic of interest when presented with data having randomly permuted labels, an empirical distribution over T can be estimated. By calculating the p-value of the statistic's value when the classifier is presented with the true data, we can determine if the classifier's behavior is statistically significant with respect to the level of confidence α.
Let $\Pi_d$ be a set of all permutations of labels of the dataset with $d$ examples. The permutation test (Mukherjee et al., 2003) is then defined as:

  Repeat $N$ times (where $n$ is an index from $1, \ldots, N$):
    Choose a permutation $\pi^n$ from a uniform distribution over $\Pi_d$.
    Compute the statistic of interest for this permutation of labels,

    $$t_n = T\left(x_1, y_{\pi^n_1}, \ldots, x_d, y_{\pi^n_d}\right)$$

    where $(x_i, y_{\pi^n_i})$ denotes a profile-label pair, where the profile $x_i$ is assigned the label according to the permutation $\pi^n$.

  Construct an empirical cumulative distribution over the statistic of interest:

    $$\hat{P}(T \le t) = \frac{1}{N} \sum_{n=1}^{N} H(t - t_n)$$

    where $H$ denotes the Heaviside function.

  Compute the statistic of interest for the actual labels, $t_0 = T(x_1, y_1, \ldots, x_d, y_d)$, and its corresponding p-value $p_0 = \hat{P}(T \le t_0)$ under the empirical distribution $\hat{P}$.

  If $p_0 \le \alpha$, reject the null hypothesis.

For our purposes, the statistic of interest T is the achieved classification error (ACE).

Table 2: Steps in the Analysis of High-Throughput Peptide Profiling Spectra. These steps were elucidated in part in discussion with the EDRN Bioinformatics Working Group. We gratefully acknowledge their input.

Experimental Design
  Selection of type and numbers of samples to compare
Measurement
  Determination of sample rate
  Mass calibration
Preprocessing
  Profile QA/QC filtering
  Variance correction/regularization
  Smoothing
  Baseline correction
  Normalization (internal or external)
  Profile alignment
Data Representation
  Determination of profile attributes: peak selection, whole-profile, partial-profile, binning
  May also include peak-finding algorithms and peak-matching routines
Feature Selection
  Identification of profile features which are likely to be clinically significant: univariate statistical analysis, multivariate feature selection
Classification
  Rendering sample class inferences
Computational Validation / Study Design
  Calculation of an estimated classification error rate which is hopefully unbiased and accurate. May involve: random subsampling, bootstrapping, k-fold validation, leave-one-out validation
Significance Testing of ACE
  PACE (this paper)
  Bootstrap confidence interval estimation (Efron and Tibshirani, 1997)

Application of permutation-based validation to peptide profiling (PACE)

We define a classification method f as all steps applied by a researcher to the data prior to some biological interpretation. These include the steps summarized in Table 2. In the case of SELDI/MALDI-TOF-MS, this may include mass calibration, baseline correction, filtering, normalization, peak selection, and a variety of feature selection and classification approaches.
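The permutation test defined in the Theory section can be sketched in a few lines of Python. This is only an illustrative sketch: `permutation_p_value` and the toy statistic `midpoint_error` are hypothetical names introduced here, the toy statistic is a resubstitution error on a single feature rather than a full method f, and the Heaviside function is taken with H(0) = 1 so that ties count towards the p-value.

```python
import random


def midpoint_error(X, y):
    """Hypothetical statistic T: error of a one-feature midpoint-threshold rule.

    The threshold is fitted and evaluated on the same samples, purely to keep
    the example short; a real method f would use held-out test data.
    """
    pos = [x for x, lab in zip(X, y) if lab == 1]
    neg = [x for x, lab in zip(X, y) if lab == 0]
    m_pos, m_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    mid = (m_pos + m_neg) / 2
    preds = [1 if (x > mid) == (m_pos > m_neg) else 0 for x in X]
    return sum(p != lab for p, lab in zip(preds, y)) / len(y)


def permutation_p_value(statistic, X, y, n_permutations=1000, seed=0):
    """Return (t_0, p_0), with p_0 the empirical P(T <= t_0) under relabeling."""
    rng = random.Random(seed)
    t_0 = statistic(X, y)          # statistic on the actual labels
    labels = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)        # uniform random permutation of the labels
        if statistic(X, labels) <= t_0:   # H(t_0 - t_n) with H(0) = 1
            hits += 1
    return t_0, hits / n_permutations
```

With a small error as the statistic, a small p_0 means that few random relabelings did as well as the true labels, which is the condition for rejecting the null hypothesis at level α.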
We take the position that every researcher that has decided to approach the problem of analysis of a high-throughput proteomic data set has embarked on a journey of method development; i.e., the series of decisions made by the researcher is itself method f. We assume that the researcher has adopted a study design that employs one or more training/test set splits. For our purposes, we use 4 random training/test splits to achieve a reasonably accurate estimate of ACE. A third validation sample can be set aside to verify the statistic on the pristine data. The validation set can be produced at the same time, under the same conditions as the training/test data set. A more general estimate of the external validity of the estimate of the generalization error and its robustness to different laboratory conditions (and thus an assessment of the potential for practical (clinical) application) is obtained when the validation set is obtained at a different time or, better yet, in a different laboratory (as in multi-site validation studies).

[Figure 1 plot: lines for MACE, ACE, and the 95th and 99th percentile bands.]

Figure 1: Example of PACE analysis. The permutation-achieved classification error (PACE) distribution is estimated by computing a statistic (in this case, testing error) over repeated relabeling of the sample data. The top solid line indicates the mean achieved classification error (MACE) of this distribution. The lower 95th and 99th percentiles of the PACE distribution are given by the dashed and dotted lines, respectively. If the achieved classification error (ACE, bottom marked line) falls below a percentile band, it is a statistically significant result at that confidence level. In this example, ACE for a Naïve Bayes classifier using a weighted separability without peak selection or decorrelation (see below for details) falls consistently below the 99th percentile band of the PACE distribution. It can be said that this classifier produced a statistically significant result at the 99% level.

Permutation-Achieved Classification Error (PACE) Analysis

Given the achieved classification error (ACE) estimated via method f, generate an arbitrarily large number of new data sets with random sample relabeling. Method f is applied to each of the permuted data sets, resulting in a null distribution of ACE (called PACE). Lower 95th and 99th percentiles are located in PACE; ACE is then compared to these percentiles to assess the statistical significance of the classifier method f.

Alternatives to PACE

The permutation based approach compares the error achieved on the true data to errors on randomly labeled data. It tries to show that the result for the true data is different from results on the random data, and thus it is unlikely the consequence of a random process.
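The PACE procedure described above (apply method f to relabeled data, locate the lower percentile bands of the resulting null distribution, and compare ACE to them) might be sketched as follows. Everything named here is an assumption for illustration: `holdout_error` is a deliberately simple stand-in for method f using one fixed even/odd split, `lower_percentile` uses a nearest-rank rule, and the default of 100 permutations is arbitrary.

```python
import math
import random
import statistics


def holdout_error(X, y):
    """Hypothetical method f: fit a midpoint threshold on even-indexed samples
    and return the classification error on the odd-indexed (held-out) ones."""
    train = range(0, len(y), 2)
    test = range(1, len(y), 2)
    pos = [X[i] for i in train if y[i] == 1]
    neg = [X[i] for i in train if y[i] == 0]
    if not pos or not neg:
        # Degenerate training set: predict the only class that was seen.
        only = 1 if pos else 0
        return sum(y[i] != only for i in test) / len(test)
    m_pos, m_neg = statistics.mean(pos), statistics.mean(neg)
    mid = (m_pos + m_neg) / 2
    errors = 0
    for i in test:
        pred = 1 if (X[i] > mid) == (m_pos > m_neg) else 0
        errors += pred != y[i]
    return errors / len(test)


def lower_percentile(values, fraction):
    """Nearest-rank value below which roughly `fraction` of the values fall."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def pace(method_f, X, y, n_permutations=100, seed=0):
    """Compare ACE with the null (PACE) distribution under random relabeling."""
    rng = random.Random(seed)
    ace = method_f(X, y)
    labels = list(y)
    null_errors = []
    for _ in range(n_permutations):
        rng.shuffle(labels)
        null_errors.append(method_f(X, list(labels)))
    band95 = lower_percentile(null_errors, 0.05)  # lower 95th percentile band
    band99 = lower_percentile(null_errors, 0.01)  # lower 99th percentile band
    return {
        "ACE": ace,
        "MACE": statistics.mean(null_errors),
        "band95": band95,
        "band99": band99,
        "significant95": ace < band95,
        "significant99": ace < band99,
    }
```

In a graph such as Figure 1, this computation would be repeated once per feature-set size, giving one ACE, MACE and pair of bands per point on the horizontal axis.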
We note that the permutation-based method is different from, and thus complementary to, standard hypothesis testing methods that try to determine confidence intervals on estimates of the target statistics. We also note that one may apply standard hypothesis testing methods to check if the target statistic for our classification model is statistically significantly different from either the fully random, trivial or any other classification model. However, the permutation framework always looks at the combination of the data label generation and classification processes and thus establishes the difference between the performance on the true and random data.

Classification error is a composite evaluation metric. Other types of performance measures for which confidence intervals have been studied so far include significance of SN at a fixed SP (Linnet, 1987), AUC (as implemented, for example, in AccuROC; Vida, 1993), and the ROC curve itself (Macskassy et al., 2003). Here we briefly explain these options.

Which performance measure to assess may vary according to strategy. Bootstrap-estimated or analytically determined confidence intervals around SN at a specified SP (Linnet, 1987) require that a desired SP be known, and this depends on the intent of the test; for example, a screening test should have very high SP to avoid producing too many false positives when applied to a population. Even here, however, "very high" and "too many" are rather context-dependent, and should not be considered in a silo by ignoring existing or other proposed diagnostic tests. Acceptable FP values depend to a degree on the SP of existing practices, and to an extent on the prevalence of the disease. Any screen can be considered to change the prevalence of disease in the potential patient population, and therefore follow-up with panels of minimally invasive markers, or multivariate studies of numerous risk factors (demographic, familial, vaccination, smoking history), and long-term monitoring, might make such screening worthwhile. High-throughput proteomics highlights the need for dynamic clinical diagnostics. The various approaches suggested by Linnet were extended and revised with a suggestion by Platt et al. (2000) to adopt the bootstrap confidence interval method (Efron and Tibshirani, 1993). A working paper by Zhou and Qin (2003) explores related approaches.

One strategy is to perform bootstrapping (Efron and Tibshirani, 1993) and calculate a 1-α confidence interval around a measure of interest. Bootstrapping is a subsampling scheme in which N data sets are created by subsampling the features of the original data set, with replacement. Each of the N data sets is analyzed. Confidence intervals around some measure of interest (T) can be calculated or consensus information can be gathered; in either case, variability in an estimate T is used as a measure of robustness of T. Various implementations of the bootstrap are available; the least biased appears to be the bias-corrected accelerated version (Efron and Tibshirani, 1993).

A second strategy is to calculate confidence intervals around the AUC measure. Bootstrapping (Efron and Tibshirani, 1997) is sometimes used to estimate AUC confidence intervals.
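A bootstrap confidence interval around the AUC could be sketched as below. Several assumptions are made and should be read as such: the sketch resamples samples (score-label pairs) with replacement, which is the common formulation and differs from the feature resampling described above; it uses the simple percentile interval rather than the bias-corrected accelerated version; and the function names, the 2000 replicates and the toy scores in the test are invented for the example.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_interval(scores, labels, alpha=0.05, n_boot=2000, seed=0):
    """Percentile bootstrap (1 - alpha) confidence interval for the AUC."""
    if not 0 < sum(labels) < len(labels):
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    replicates = []
    while len(replicates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:  # skip replicates that lost one of the classes
            replicates.append(auc([scores[i] for i in idx], lab))
    replicates.sort()
    low = replicates[int((alpha / 2) * n_boot)]
    high = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return low, high
```

An interval that excludes 0.5 suggests the scores rank cases above controls more often than chance would, under the assumptions of the percentile method.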
Relying on confidence in the AUC can be problematic because it reports on the entire ROC, and, in practice, only part of the ROC is considered relevant for a particular application (e.g., high SP required by screening tests). A literature on assessing the significance of partial ROC curves has been developed (Dodd and Pepe, 2003; Gefen et al., 2003); a recent study (Stephan, Wesseling et al., 2003) compared the features and performance of eight programs for ROC analysis.

A third strategy is to calculate bootstrap confidence bands around the ROC curve itself (Macskassy et al., 2003). Under this approach, bootstrapping is explored and bands are created using any of a variety of sweeping methods that explore the ROC curve in one (SN) or two (SN and 1-SP) dimensions.

Experimental results of PACE analysis on clinical data

We applied PACE analysis to the following published data sets, and one new data set from the UPCI, using a number of methods of analysis:

  UPCI Pancreatic Cancer Data
  Ovarian Cancer Data (D1; Petricoin et al., 2002)
  Ovarian Cancer Data (D2; Petricoin et al., 2002)
  Prostate Cancer Data (Qu et al., 2002)

The UPCI's pancreatic cancer data are only in the preliminary stages of analysis and we report only initial results. An ongoing study with an independent validation set is underway. Preoperative serum samples were taken from 32 pancreatic cancer cases (17 female, 15 male). Twenty-three non-cancer age, gender, and smoking history-matched controls were analyzed; ages ranged from 34 to 87, pancreatic cancer cases had a mean age of 64, controls had a mean age of 67 (p = 0.19). Of the cancer samples, 16 were resected; 6 patients had locally advanced unresectable disease, and 10 had metastatic disease.

The ovarian cancer datasets D1 and D2 (Petricoin et al., 2002) were obtained through the clinical proteomics program databank (http://ncifdaproteomics.com/). Both datasets were created from the same samples, but D2 was processed using a different chip surface (WCX2) as opposed to the hydrophobic H4 chip used to generate the data in D1.
The samples consist of 100 controls: 61 samples without ovarian cysts, 30 samples with benign cysts smaller than 2.5 cm, 8 samples with benign cysts larger than 2.5 cm, and 1 sample with benign gynecological disease. The samples include 100 cases: 24 samples with stage I ovarian cancer, and 76 samples with stage II, III and IV ovarian cancer.

The prostate cancer dataset (Qu et al., 2002) was also acquired from the clinical proteomics program databank. It consists of 253 controls: 75 samples with a prostate-specific antigen (PSA) level less than 4 ng/ml, 137 samples with a PSA level between 4 and 10 ng/ml, 16 samples with a PSA level greater than 10 ng/ml, and 25 samples with no evidence of disease and PSA level less than 1 ng/ml. 69 cases exist in this dataset: 7 samples with stage I prostate cancer, 31 samples with stage II and III prostate cancer, and 31 samples with biopsy-proven prostate cancer and PSA level greater than 4 ng/ml.

Table 3: List of methods applied to datasets. Each dataset was evaluated using PACE analysis with every possible combination of these methods. MAC = maximum allowed correlation.

Method                        Options (choice of one)
Peak Detection                On (select only peaks)
                              Off (use the whole profile)
Feature Selection             Area under ROC curve (AUC)
                              Fisher score
                              J5 test
                              Simple separability criterion
                              t-test score
                              Weighted separability criterion
De-correlation Enhancement    On (MAC < 1)
                              Off (MAC = 1)
Classification Model          Naïve Bayesian Classifier
                              Support Vector Machine (SVM)

Methods Applied and Evaluated

Table 3 gives a summary of methods applied in the analysis. A brief description of some of these methods is provided below. A thorough description of these methods can be found in Hauskrecht et al. (2005, in press).

Peak detection

In some circles it is a strong belief that only peaks in a profile represent informative features of a profile. Peak detection can take place before performing further feature selection in order to limit the initial amount of the profile to be considered. There are various ways in which peak detection can be performed; for the purposes of our experiments, we utilize a peak detection method that examines the mean profile generated for all training samples, and then determines its local maxima.
The local maxima positions become the only features considered for feature selection later in the pipeline displayed in Table 3. Alternatively, we can ignore the peak detection phase completely and consider the entire profile for feature selection.

Feature selection methods

Fisher Score: The Fisher score is intended to be a measure of the difference between distributions of a single variable. A particular feature's Fisher score is computed by the following formula:

$$F(i) = \frac{\left(\mu_i^{+} - \mu_i^{-}\right)^2}{\left(\sigma_i^{+}\right)^2 + \left(\sigma_i^{-}\right)^2}$$

where $\mu_i^{\pm}$ is the mean value for the $i$-th feature in the positive or negative profiles, and $\sigma_i^{\pm}$ is the standard deviation. We utilize a variant of this criterion (Furey, 2000), computed with the following formula:

$$F(i) = \frac{\left|\mu_i^{+} - \mu_i^{-}\right|}{\sigma_i^{+} + \sigma_i^{-}}$$

To avoid confusion, we refer to the second formula above as our Fisher-like score. Features with high Fisher scores possess the desirable quality of having a large difference between means of case versus control groups, while maintaining low overall variability. These features are more likely to be consistently expressed differently between case and control samples, and therefore indicate good candidates for feature selection.

AUC Score (for feature selection): Receiver operating characteristic curves are commonly used to measure the performance of diagnostic systems in terms of their hit-or-miss behavior. By computing the ROC curve for each feature individually, one can determine the ability of that feature to separate samples into the correct groups. Measuring the area under the ROC curve (Hanley et al., 1982) then gives an indication of the feature's probability of being a successful biomarker. The AUC score for a given feature is then obtained by integrating over the ROC curve for that feature. As with the Fisher score, higher AUC scores signify better feature candidates.

Univariate t-test: The t-test (Baldi et al., 2001) can be used to determine if the case versus control distributions of a feature differ substantially within the training set population. The t statistic, representing a normalized distance measurement between populations, is given as

$$t = \frac{\mu_i^{-} - \mu_i^{+}}{\sqrt{\dfrac{\left(\sigma_i^{-}\right)^2}{n^{-}} + \dfrac{\left(\sigma_i^{+}\right)^2}{n^{+}}}}$$

where $\mu_i^{-}, \sigma_i^{-}$ are the empirical mean and standard deviation for the $i$-th feature in the $n^{-}$ control samples, and $\mu_i^{+}, \sigma_i^{+}$ are likewise the empirical mean and standard deviation for the $i$-th feature in the $n^{+}$ case samples. The t statistic follows a Student distribution with

$$f = \frac{\left[\left(\sigma_i^{-}\right)^2 / n^{-} + \left(\sigma_i^{+}\right)^2 / n^{+}\right]^2}{\dfrac{\left[\left(\sigma_i^{-}\right)^2 / n^{-}\right]^2}{n^{-} - 1} + \dfrac{\left[\left(\sigma_i^{+}\right)^2 / n^{+}\right]^2}{n^{+} - 1}}$$

degrees of freedom. For each feature, one can then calculate the t statistic and associated f, and determine the associated p-value with a predetermined confidence level from a standard table of significance. Smaller p-values indicate it is unlikely the observed case and control populations of the i-th feature are similar by chance. Thus, it is likely that the i-th feature is represented in a way that is statistically significant between case and control examples, making it a good candidate for feature selection.

We also evaluated feature selection using simple separability, weighted separability, and the J5 test (Patel and Lyons-Weiler, 2004).

De-correlation enhancement: After differential feature selection, we can perform further feature evaluation to avoid highly correlated features.
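As a brief aside, the univariate scores defined in this section (the Fisher score, the Fisher-like score, and the t statistic with its degrees of freedom) can be sketched for a single feature as follows. The function names are hypothetical, labels are assumed to be coded 1 for cases and 0 for controls, and sample (n - 1) standard deviations are assumed; a feature with zero variance in both groups would cause a division by zero and is not handled here.

```python
import math
import statistics


def split_by_class(values, labels):
    """Separate one feature's values into case (1) and control (0) groups."""
    cases = [v for v, lab in zip(values, labels) if lab == 1]
    controls = [v for v, lab in zip(values, labels) if lab == 0]
    return cases, controls


def fisher_score(values, labels):
    """Squared difference of class means over the sum of class variances."""
    cases, controls = split_by_class(values, labels)
    diff = statistics.mean(cases) - statistics.mean(controls)
    return diff ** 2 / (statistics.variance(cases) + statistics.variance(controls))


def fisher_like_score(values, labels):
    """Absolute difference of class means over the sum of class std. deviations."""
    cases, controls = split_by_class(values, labels)
    diff = statistics.mean(cases) - statistics.mean(controls)
    return abs(diff) / (statistics.stdev(cases) + statistics.stdev(controls))


def t_statistic(values, labels):
    """Unequal-variance t statistic (controls minus cases) and its degrees of freedom f."""
    cases, controls = split_by_class(values, labels)
    a = statistics.variance(controls) / len(controls)
    b = statistics.variance(cases) / len(cases)
    t = (statistics.mean(controls) - statistics.mean(cases)) / math.sqrt(a + b)
    f = (a + b) ** 2 / (a ** 2 / (len(controls) - 1) + b ** 2 / (len(cases) - 1))
    return t, f
```

Ranking features would then amount to computing one of these scores for every m/z position and keeping the top-scoring positions.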
Highly correlated features may be of interest for interpreting the biological sources of variation among peptides (such as carrier proteins; Mehta et al., 2003). For the purpose of constructing independent classifiers, however, it may be better to avoid using non-independent features, if only to increase the number of independent features included after feature selection, but also to avoid overtraining on a large number of highly correlated features. One way to avoid these correlated features is de-correlation (removal of features which are inter-correlated beyond some pre-determined threshold). All of the methods described so far can be evaluated with and without de-correlation.

Principal component analysis: Principal component analysis, a type of feature construction, incorporates aspects of de-correlation by grouping correlated features into aggregate features (components), which are presumed to be orthogonal (i.e., uncorrelated).

Classification models

Naïve Bayes: The Naïve Bayes classifier makes the assumption that the state of a feature (indicating membership in the case or control group) is independent of the states of other features when the sample's class (case or control) is known. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a sample consisting of $n$ features, and $C = \{c_1, c_2, \ldots, c_m\}$ be a set of $m$ target classes to which $X$ might belong. One can compute the probability of a sample belonging to a particular class using Bayes' rule:

$$P(c_i \mid X) = \frac{P(X \mid c_i)\, P(c_i)}{\sum_{j=1}^{m} P(X \mid c_j)\, P(c_j)}$$

The likelihood of sample $X$ belonging to a particular target class $c_j$ is given as the product of each probability density function for each feature in the population of $c_j$:

$$P(X \mid c_j) = \prod_{k=1}^{n} P(x_k \mid c_j)$$

For our purposes, we assume each feature $x_k$ follows a Gaussian distribution, although other distributions are possible. Thus, the probability density function for feature $x_k$ is

$$P(x_k \mid c_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{kj}} \exp\left[-\frac{1}{2}\left(\frac{x_k - \mu_{kj}}{\sigma_{kj}}\right)^2\right]$$

where $\mu_{kj}, \sigma_{kj}$ are the mean and standard deviation of the $k$-th feature within the population of samples belonging to class $c_j$. These two values, and their corresponding pair for the control population, must be estimated using the empirical information seen in the training set for each feature. The estimates are then used in the computation above during the predictive process on the testing set.

Support Vector Machine (SVM): One might imagine a sample with n features as a point in an n-dimensional space. Ideally, we would like to separate the n-dimensional space into partitions that contain all samples from either case or control populations exclusively. The linear support vector machine or SVM (Vapnik 1995, Burges 1995) accomplishes this goal by separating the n-dimensional space into 2 partitions with a hyperplane with the equation

$$w^T x + w_0 = 0$$

where $w$ is the normal to the hyperplane and $w_0$ is its offset; the norm of $w$ determines the distance between the support vectors of the two classes. These support vectors are the representative samples from each class which are most helpful for defining the decision boundary. The parameters of the model, $w$ and $w_0$, can be learned from data in the training set through quadratic optimization using a set of Lagrange parameters $\hat{\alpha}$ (Scholkopf 2002). These parameters allow us to redefine the decision boundary as

$$\hat{w}^T x + w_0 = \sum_{i \in SV} \hat{\alpha}_i\, y_i \left(x_i^T x\right) + w_0$$

where only samples in the support vector set contribute to the computation of the decision boundary.
Finally, the support vector machine determines the classification of a sample $x$ as

\[ \hat{y} = \operatorname{sgn}\left[ \sum_{i \in SV} \hat{\alpha}_i\, y_i\, (x_i^{T} x) + w_0 \right] \]

where negative $\hat{y}$'s will occur below the hyperplane, and positive $\hat{y}$'s will occur above it. Ideally, all samples from one group will have negative $\hat{y}$ while all others will have positive $\hat{y}$.

PACE Results

All four cancer datasets were analyzed using classifiers defined by differing configurations of feature selection criteria, peak selection, de-correlation, and classification models. De-correlation MAC thresholds range from 1 (no de-correlation) to 0.4 (strict de-correlation) in increments of 0.2. To assess the statistical significance of the classifiers generated through these configurations, PACE analysis was performed using 1 random permutations of the data over 4 splits into training and testing sets. Classifiers were evaluated over the range of 5 to 25 features, in increments of 5 features. For illustrative purposes, examples of PACE graphs are presented in the appendices of this work. These graphs represent only a portion of the classifiers evaluated for this work. In particular, the appendices present PACE graphs for SVM classifiers enforcing a 0.6 MAC threshold, both before and after peak selection, for each of the univariate feature selection methods.

UPCI Pancreatic Cancer Data

Each possible configuration of classification models produced a statistically significant classifier at the 99% level. This trend was observed for all feature sizes in each classifier. See figures A.1 through A.6 for examples of PACE analysis on this dataset using different feature selection criteria.

Ovarian Cancer Data (D1; Petricoin et al., 2002)

Each possible configuration of classification models produced a statistically significant classifier at the 99% level. This trend was observed for all feature sizes in each classifier. See figures B.1 through B.6 for examples of PACE analysis on this dataset using different feature selection criteria.

Ovarian Cancer Data (D2; Petricoin et al., 2002)

Each possible configuration of classification models produced a statistically significant classifier at the 99% level. This trend was observed for all feature sizes in each classifier. See figures C.1 through C.6 for examples of PACE analysis on this dataset using different feature selection criteria.

Prostate Cancer Data (Qu et al., 2002)

Under random feature selection, several classifiers were produced which were not statistically significant at the 99% or 95% level. Using the Naïve Bayes classification model, the generated classifiers were not significant at the 95% level for small numbers of features (5-15). As de-correlation became stricter, the classifiers lost statistical significance at high numbers of features where they had been significant with a more lenient MAC.
When coupling this technique with peak selection, no statistically significant classifiers were produced. With an SVM-based classifier using random feature selection, the produced classifiers were significant at the 99% level except when using the initial 5 features. Changes in MAC and peak selection did not change this behavior. In general, Naïve Bayesian classifiers using univariate feature selection criteria are significant at the 99% level as long as peak selection is not performed beforehand. The one exception was the J5 test, which was unable to produce a significant classifier at the 95% level without the aid of de-correlation. Applying de-correlation allowed these classifiers to achieve significance at the 99% level. When performing peak selection, only the classifiers produced using the strictest MAC thresholds (0.6, 0.4) were able to achieve some form of significance, and even then, only at high numbers of features (15-25). The weighted separability score was unable to produce a significant Naïve Bayes classifier using peak selection. SVM classifiers using univariate feature selection criteria were nearly always significant at the 99% level, either with or without peak selection. The few instances where there was no significance at the 95% level occurred using the J5 and simple separability scores without de-correlation. In the case of the J5 score, lowering the MAC to 0.8 remedied the situation, while the simple separability score improved simply through incorporating additional features. See figures D.1 through D.6 for examples of PACE analysis on this dataset using different feature selection criteria.

Discussion

We have before us a daunting challenge of creating conduits of clear and meaningful communication and understanding between consumers (statisticians, computational machine learning experts, bioinformaticians) and the producers of high-throughput data sets. The objective is to maximize the rate at which clinically significant patterns can be discovered and validated. Disciplines can be bridged in part by a straightforward reference point on performance provided by decision-theoretic performance measures. Nevertheless, performance characteristics that are typically reported (SN, SP, PPV, NPV) only provide partial information on performance (the method's performance in the alternative case). Researchers may be reluctant to publish results that have relatively low SN and SP (e.g., 0.75, 0.8), and yet this level of performance may in fact be highly surprising given the sample numbers and degree of variability (due to noise variance). Stellar results such as high 90's sensitivity and specificity predominate in the published cancer literature (Table 1), posing the question of whether the early reports of high performance may have set the standard too high. Some biological signal and powers of prognosis can be expected to be lower. Our work focuses on the question: what represents a remarkable SN? SP? AUC? ACE? We study this from the perspective that proteomic profiling represents only one of many different sources of potential clinically significant information, and that combined use of panels of biomarkers and other molecular and classical diagnostic information is likely to be required if proteomic profiling becomes widely adopted.

Minimize ACE: Conjecture or Tautology?

In microarray analysis, most papers describe a new algorithm or test for finding differentially expressed genes. This makes it difficult to assess the validity of a given analytical strategy (method of analysis). We recommend that a standard be considered for the assessment of the impact of particular decisions in the construction of an analytical strategy, including decisions made during pre-processing (Figs. 2 and 3). Specifically: any method that results in a significant ACE is to be preferred over methods that do not achieve significance. All significant methods (at a specified degree of significance) are equally justified for the time being.
It is possible that different methods that achieve significant ACE will identify distinct feature sets, in which case each feature set is potentially interesting. Note that we are not suggesting that reproducibility is not important; i.e., ideally, the same methods on similarly-sized different data sets should achieve similar levels of significance. Indeed, reproducibility is key; therefore, the methods that yield similar levels of significance in repeated experiments are also validated. Note also that we are not recommending that one should adopt the somewhat opposing position that "the method that minimizes ACE will tend to be most significant, and therefore will likely be best justified." In contrast, we consider it likely that clinically significant information may exist at a variety of scales within these large data sets. The search for a method (any method) with the most significant ACE from a single data set seems likely to lead to overestimates of the expected clinical utility of a set of biomarkers. Comparisons of ACE across cancer types and with independent data sets would be informative.

Nonsignificant Results

Reasons for negative results might include no biological signal, poor study design or laboratory SOPs, poor technology, or low biological signal (requiring larger numbers of samples). It is our position that researchers are better informed whether the result is significant or not. For example, a non-significant ACE may inform the researcher that they should refine or redirect their research question; an example might include early detection of a given disease providing a negative result in the pre-disease state, suggesting that one might move the focus to early stage disease instead of pre-disease. While the clinical prediction of a potential outcome during the course of disease may not be possible from the preconditioned state, the research might shift focus toward "how early can this condition be predicted?" While we report few non-significant results, we have seen non-significant results from unpublished, proprietary studies of which we cannot report the details.
The results are unpublished in part due to the negative results, and in part due to the changes in the experimental design that resulted from obtaining a negative result.

Relation of PACE to Similar Methods

PACE creates a distribution of the expected ACE under the null condition. The fixed measure ACE is the average classification error over all random subsamplings. This generates a distribution around ACE, and the determination of significance could involve a comparison of the degree of overlap between the ACE and PACE distributions. As we have seen, PACE is similar in focus to a number of alternative methods with slightly distinct implementations and foci. These include the ROC bootstrap confidence interval on AUC, confidence interval estimation around SN at a fixed SP, and bootstrap bands around the ROC curve itself. The bootstrap ROC is used to determine a confidence interval around an estimated area under the ROC curve (AUC); we are most interested in the specific part of feature space where a classifier works best, not in the overall performance of a classifier over a range of stringency, and thus PACE focuses on comparing a point estimate of a statistic theta to its null distribution. A traditional limitation of permutation tests is an assumption of symmetry; in our case, we are only interested in the lower tail of the PACE distribution. In the case of individual performance measures (SN, SP) or the composite AUC, one would be interested only in the upper tail of the corresponding null distribution. Symmetry is also known to be an especially important assumption when estimating the confidence interval around the AUC (Efron and Tibshirani, 1993). The relative suitability of these alternatives should be assessed empirically to determine whether any practical differences exist in this particular application. So the question is posed: which statistical assessment of confidence is of most practical (applied) interest: the specific measurement of classification error achieved by x in the learning stage of the actual study, or the distribution of the classification error in imagined alternative cases?
We prefer to make our inferences on the data set at hand, for the time being, using imagined alternatives that involve a (hopefully) well-posed null condition. The bias-corrected accelerated bootstrap confidence interval (Efron and Tibshirani, 1993), which is range-respecting and range-preserving (and unbiased, as the name suggests), corrects for differences between the median AUC of some of the pseudosamples and that of the original sample, making the imagined alternative samples more like the actual sample. This method should also be explored in this context. Some of these disparate methods could also potentially be combined (e.g., PACE as the null distribution and ROC bootstrapping to assess confidence intervals around ACE). This would use the degree of overlap of distributions instead of a specific instance outside of a generated distribution. A more formal exploration of these possibilities seems warranted.

Robustness of PACE and Permutation Approaches to the Stark Realities of High-Throughput Science

PACE provides a reference point that is robust to many of the vagaries in study design common to peptide profiling studies, such as different numbers of technical replicates per sample that result from the application of QA/QC. Compared to distribution-dependent criteria that would otherwise require adjustments to degrees of freedom, both PACE and the bootstrap are relevant for the data set at hand.

Caveats

PACE and the other methods cited here do not protect against incidental partial or complete confounding. True validation of the results of any high-throughput analysis should involve more than one site, ideally with the application of a specific classifier rule learned at site A to data generated at site B. Further, to protect against amplification of local biases by data preprocessing steps, the preprocessing must be wrapped within the permutation loop.
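The permutation scheme discussed in this paper can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the nearest-centroid classifier stands in for the Naïve Bayes and SVM models, and the function names (`fit_nearest_centroid`, `achieved_error`, `pace`), the 30% test fraction, and the add-one p-value convention are all hypothetical choices. Everything fitted from the data happens inside `achieved_error`, so it is repeated for every permutation, in the spirit of the caveat about wrapping preprocessing within the permutation loop.

```python
import random
from statistics import mean


def fit_nearest_centroid(X_train, y_train):
    """Stand-in classifier (assumption): assign a sample to the closest class centroid."""
    n_feat = len(X_train[0])
    centroids = {}
    for label in set(y_train):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [mean(r[k] for r in rows) for k in range(n_feat)]

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return predict


def achieved_error(X, y, rng, n_splits=20, test_frac=0.3):
    """ACE: mean test error over random train/test splits."""
    n = len(X)
    n_test = max(1, int(round(test_frac * n)))
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Any label-dependent preprocessing or feature selection would be
        # fitted here, on the training portion only.
        predict = fit_nearest_centroid([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(X[i]) != y[i] for i in test)
        errors.append(wrong / n_test)
    return mean(errors)


def pace(X, y, n_perm=100, seed=0, **split_kwargs):
    """Compare the observed ACE with its distribution under permuted class labels."""
    rng = random.Random(seed)
    observed = achieved_error(X, y, rng, **split_kwargs)
    null = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any association between profiles and labels
        null.append(achieved_error(X, y_perm, rng, **split_kwargs))
    # Lower-tail p-value: how often is a permuted-label ACE at least as low?
    p_value = (1 + sum(e <= observed for e in null)) / (1 + n_perm)
    return observed, null, p_value
```

A call such as `pace(X, y, n_perm=100)` returns the observed ACE, the list of null ACE values, and the lower-tail p-value; only the lower tail is used because a small classification error is the outcome of interest.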
A Word on Coverage

It is important to consider in the development and evaluation of a biomarker-based classification rule whether a sample is classifiable; i.e., do the rules developed and data at hand provide sufficiently precise information on a given sample. The proportion of samples that are predictable in a data set is defined as coverage. If a strategy is adopted whereby a number of samples are not classified, the evaluation scheme (whether it be a bootstrap, random subsampling-derived confidence boundaries, or permutation significance test) should also be forced to not classify the same number of samples. These enforced passes on a sample must be checked and enforced after the prediction stage to conserve the numerical and statistical aspects of the study design and data set (e.g., number of samples; variability within m/z class). Research is needed to determine the importance of asymptotic properties, dependencies of the bootstrap ROC on the monotonicity or jaggedness of the ROC curve, and the use of combined distributions (i.e., a measure of the degree of overlap between the PACE distribution, as the null distribution, and the bootstrap ROC curve as variability in the estimated classifier performance measure of interest in separate instances of the study).

Towards a More Complete Characterization of the Problem

In the consideration of further development and improvements in analytic methods for the analysis of peptide profiles, we assume that detailed descriptions of fundamental characteristics of low-resolution peptide profiles can be used to help set priorities in the construction of particular strategies. These descriptions/observations include an acknowledgement of somewhat high mass accuracy (0.2-0.4%); a comprehension that individual m/z values are not specific (i.e., they are not unique to individual peptides), and therefore intensity measures at a given m/z value reflect the sum intensity of peptide m/z classes, which may or may not be functionally associated; an understanding that peptides do not map to single individual peaks; i.e., they exist two or more times in the profile at different m/z values as variously protonated forms.
Each peptide may have a roughly unique signature, and pattern matching forms the basis of peptide fingerprint data mining, but a peptide need not occur as a single peak; an understanding that m/z variance will contain biological sources (mass shifts due to amino acid sequence variation and varying degrees of ubiquitination and cleavage, binding of peptides with others), chemical, and physical components (mass drift), and thus models that allow the statistical accounting of each of these variance components are needed; an understanding that high intensity measurements in SELDI-TOF-MS profiles tend to exhibit higher variance, which suggests that reliance on peaks for any inference (analyzing peaks only, aligning peaks, or normalizing profiles to peaks) may add large, unwanted components of variance or restrict findings to peptides with intensities that are most inaccurately measured; the acknowledgment that the m/z vector is an arbitrary vector along which intensity values of similarly massed and charged peptides are arranged, and, as an arbitrary index in and of itself, may require (or deserve) no profound biological explanation and may or may not offer a profound biological insight related to the clinical questions at hand beyond a guide to the identity of a peptide by pattern matching; observations that features determined to be significant tend to be locally correlated and that long-range correlations also exist, and that both artifactual and biologically important correlations and anti-correlations may exist at both distances; an expectation that correlations may exist that reflect protonated forms of peptides and that some correlation/anti-correlation pairs may reflect real peptide biology, such as enzymatic cleavage cascades; similarly, the observation that at least part of the local autocorrelation observed in the profile is likely due to poor resolution (mass drift), and reflects a physical property of the profiles (instrument measurement error and resolving power).
It may also reflect smoothing due to natural biological variation in the population from which the samples were drawn, or the effects of summing intensities of distinct peptides that share similar but not identical m/z values. One might consider whether the local correlations all reflect real biological properties of single peptides at particular m/z positions, and, if not, they may offer no biological insight and may require no biological explanation (i.e., local autocorrelation may be a simple artifact of the degree of resolution of the instrument and the lack of specificity of m/z values). These descriptions may help motivate research on variance corrections, de-correlation, the use of PCA, profile alignment strategies, and attempts at transformation.

Other Open Questions

As high-throughput genomic and proteomic data become less expensive, and the laboratory equipment spreads into an increasing number of facilities, it seems likely that different laboratories will study the same problem with completely independent effort. Published data sets, therefore, represent a profoundly useful potential source of corroboration, or validation, of biomarker sets that might be expected to exhibit reproducible differences in large portions of the patient population. A careful characterization and validation of those differences, as a step that is independent of the question of potential clinical utility, is essential in these studies. True validation by planned repeated experiments may seem daunting, or unwarranted at this early stage, and the tendency will be to attempt to validate markers deemed to be significant in a small study using other technology (immunohistochemistry, for example). In this case, absence of validation of specific proteins with other technology is not complete refutation due to the potential for idiosyncrasies in this new application of mass spec technology. Computational validation applied at the step of feature selection alone could prove invaluable (i.e., which features are reproducibly different between cases and controls, responders and nonresponders, in independently analyzed subsets or splits of the data samples?).
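One way to ask which features are reproducibly different across independently analyzed subsets can be sketched as below. This is an illustrative sketch only: the score in `separation_score` is a simple hypothetical stand-in (not the J5, Fisher, or separability scores used in this work), class labels are assumed to be coded 0 and 1, and each subset is assumed to contain samples from both classes. The data are dealt into non-overlapping random subsets, the top `k` features are selected in each, and the fraction of subsets selecting each feature is reported.

```python
import random
from statistics import mean, pstdev


def separation_score(values0, values1):
    """Hypothetical univariate score: |difference of class means| / sum of class SDs."""
    spread = pstdev(values0) + pstdev(values1)
    return abs(mean(values0) - mean(values1)) / (spread + 1e-12)


def top_k_features(X, y, k):
    """Indices of the k highest-scoring features (labels assumed to be 0/1)."""
    scores = []
    for f in range(len(X[0])):
        v0 = [x[f] for x, lab in zip(X, y) if lab == 0]
        v1 = [x[f] for x, lab in zip(X, y) if lab == 1]
        scores.append((separation_score(v0, v1), f))
    return {f for _, f in sorted(scores, reverse=True)[:k]}


def selection_frequency(X, y, k=5, n_subsets=4, seed=0):
    """Fraction of non-overlapping random subsets in which each feature is selected."""
    rng = random.Random(seed)
    subsets = [[] for _ in range(n_subsets)]
    # Shuffle within each class and deal round-robin, so that the subsets are
    # disjoint and every subset receives samples from both classes.
    for label in (0, 1):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            subsets[pos % n_subsets].append(i)
    counts = [0] * len(X[0])
    for sub in subsets:
        chosen = top_k_features([X[i] for i in sub], [y[i] for i in sub], k)
        for f in chosen:
            counts[f] += 1
    return [c / n_subsets for c in counts]
```

Features with a frequency near 1 are selected in nearly every independent subset, whereas features selected in only one subset are candidates for sample-specific noise.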
Large multi-year and multi-site studies

As unlikely as large-scale repeated studies may seem, it seems imminent that studies of peptide profiles from thousands of patients and normal donors will be forthcoming. What are the practical problems in such a setting? We would advocate avoiding the temptation to view one large data set (say, 5,000 patients, 5,000 normal) as a single study, and would recommend analysis of multiple, random independent (non-overlapping) subsets, which would provide true validation of feature selection methods and classification inferences. Such large studies will occur over long time periods. Laboratory conditions change, and manufacturers change kits and protocols; thus, to maximize the generalizability of the performance characteristics of a trained classifier, training and test sets should be randomly selected and blinded. We must remember that learning is asymptotic. Therefore, researchers should avoid evaluating a classifier built on training data set 1 produced at time 1 with a testing set produced at time 2; instead, they should randomize the data over the entire time period, even if this means re-learning a classifier after publishing an initially internally valid classifier using data set 1. This approach still involves training, but protects against a biased (overly pessimistic) result due to shifts in laboratory conditions.

Future Directions in Peptide Profiling

Given that the distribution of pure noise variance over the m/z range is not uniform under the null condition, univariate feature selection methods such as t-tests, Fisher's score, area under the curve (AUC) and their nonparametric alternatives are perhaps best applied as permutation tests to attempt to equalize the Type 1 error rate over the m/z range included in an analysis. When combined with PACE, this greatly increases the computational burden of analyzing even a small set of profiles, but the pay-off should be immense. Features that are not significant under the parametric, distribution-dependent tests can become significant under the permutation test for significance, and the reverse shifts are also possible.
This becomes especially important when using significance levels to select n-ranked features. When permutation feature selection methods are then combined with classification algorithms such as PCA, SVM, or nearest neighbor algorithms, and then are evaluated by PACE or bootstrap methods, this clearly will require a large network dedicated to