IMPROVING THE EFFICIENCY OF BIOMARKER IDENTIFICATION USING BIOLOGICAL KNOWLEDGE

JOHN H. PHAN
The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA 30332, USA

QIQIN YIN-GOEN, ANDREW N. YOUNG
Department of Pathology and Laboratory Medicine, Emory University, Atlanta, GA 30322, USA

MAY D. WANG
The Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, 313 Ferst Drive, Atlanta, GA 30332, USA

Identifying and validating biomarkers from high-throughput gene expression data is important for understanding and treating cancer. Typically, we identify candidate biomarkers as features that are differentially expressed between two or more classes of samples. Many feature selection metrics rely on ranking by some measure of differential expression. However, interpreting these results is difficult due to the large variety of existing algorithms and metrics, each of which may produce different results. Consequently, a feature ranking metric may work well on some datasets but perform considerably worse on others. We propose a method to choose an optimal feature ranking metric on an individual dataset basis. A metric is optimal if, for a particular dataset, it favorably ranks features that are known to be relevant biomarkers. Extensive knowledge of biomarker candidates is available in public databases and literature. Using this knowledge, we can choose a ranking metric that produces the most biologically meaningful results. In this paper, we first describe a framework for assessing the ability of a ranking metric to detect known relevant biomarkers. We then apply this method to clinical renal cancer microarray data to choose an optimal metric and identify several candidate biomarkers.

1. Introduction

The subjective nature of traditional medical techniques limits the accuracy of cancer subtype classification and, subsequently, the effectiveness of therapy. Clinicians visually examine cancer specimens to determine their subtypes before proposing treatment regimens.
However, cancers with similar characteristics may behave very differently despite similar treatment conditions [1]. Because cancer is the result of genetic anomalies, emerging diagnostic research has primarily focused on genetic and proteomic expression. This research generally involves the use of high throughput technology (e.g. microarrays and mass spectrometry) to generate large amounts of genetic and proteomic expression data. We typically reduce this data using one of many analysis algorithms with the goal of identifying a subset of features (corresponding to genes or proteins) with high predictive accuracy [2-4]. We hope that these feature subsets will both enhance our understanding of the biological mechanisms and provide us with an accurate diagnostic system. When validated, we call these differentially expressed features biomarkers.

Unfortunately, even the selection of a ranking metric is subjective, as different metrics may identify different subsets of features [5]. Feature ranking affects both the efficiency of identifying relevant genes and the accuracy of subsequent predictive models. We address this issue by presenting a method that uses existing biological knowledge to identify the best feature ranking metric for a particular gene expression dataset. The optimal metric maximizes the probability of correctly ranking differentially expressed and previously validated genes.

Despite numerous feature selection studies, there is still a lack of clinically validated and proven biomarkers for most cancers. Thus, the use of correct genes as knowledge for algorithm selection is subjective and we should choose these genes carefully. Sources of biological knowledge are abundant, but vary in terms of reliability. We consider a knowledge source to be reliable if genes (or the corresponding expressed proteins) from that source have been clinically validated as differentially expressed. The majority of knowledge is contained in the literature and roughly falls into four levels of reliability, adapted from a review of post-analysis validation methods by Chuaqui et al. [6]:

1. No biological validation. As the lowest level of reliability, this includes studies that develop feature selection algorithms and present the selected list of genes without a stringent interpretation of the biological results.
2. In silico validation. Also known as computational validation, these studies compare their feature selection results to the results of other studies. They may also identify Gene Ontology (GO) categories that are statistically overrepresented as a result of feature selection.
3. Same-sample validation. These studies validate their microarray experiments by performing additional assays on the same samples from which their microarrays were derived. These assays typically include quantitative real-time PCR (qRT-PCR) or northern analysis and serve to validate the technical reliability of the microarrays.
4. Independent or clinical validation. As the highest level of reliability, these studies validate the results of their microarray experiments using independent biological samples, usually from a clinical source. Independent validation ensures that the selected features are not a result of over-fitting. These validations often take the form of qRT-PCR and in situ hybridization (ISH) for RNA products, or immunohistochemistry (IHC) and western analysis for protein products.

Despite frequent disagreement between qRT-PCR and microarray results, qRT-PCR is the most common method for validation of differentially expressed genes. Genes with large fold-change in microarray data are consistently correlated with qRT-PCR while those with smaller fold change are more susceptible to technical variability [7]. The detection of differentially expressed genes is generally reproducible across several microarray platforms [8]. However, in light of a recent study illustrating the pervasiveness of technical artifacts in microarray data [9], we only consider a knowledge source reliable if it falls into category three or four.

Investigators have attempted to improve feature selection by using biological knowledge. Their knowledge sources often fall into category two of reliability, in silico validation, and include Gene Ontology and pathway databases, published literature, microarray repositories, and sequence information. Generally, these studies identify genes that cluster or correlate with genes from the knowledge sources [10-12]. Another study developed a theoretical framework to compare feature ranking metrics in the presence of control features [13]. However, this study also neglected to focus on the reliability of the control features. Indeed, the wealth of available information in the form of gene and protein interactions, functional annotation, and genetic pathways can improve the results of data analysis [14]. Furthermore, microarray data analysis has shifted from purely data driven methods to methods that use additional knowledge, even in the feature selection process [14].

We develop a method to quantify the efficiency of detecting biomarkers by feature ranking. This method maximizes the biological relevance of feature ranking by choosing the best metric from a population of metrics. The chosen ranking metric is optimal with respect to knowledge obtained from reliable sources. We test the effectiveness of our method using clinical gene expression data. Results indicate that the choice of ranking metric significantly affects feature ranking, which, in turn, affects the efficiency of discovering and validating novel biomarkers.

2. Methods

2.1. Modeling Knowledge in Feature Selection

Throughout this paper, the term feature set denotes a group of one or more features or genes that act in concert. A sample refers to measurements of a feature set from a single microarray or molecular profile. The entire microarray sample contains $l$ features while a feature set may contain $p$ features (where $p \ll l$). We represent samples for feature set $i$ as jointly distributed random vectors, $\vec{X} \in \mathbb{R}^p$, and labels, $Y \in \{0,1\}$. The class label, $Y$, indicates the clinical source of the microarray sample. In most cancer problems, $Y = 1$ indicates, for example, samples measured from patients with cancer and $Y = 0$ indicates samples from patients with no cancer. For a microarray dataset with $N$ samples, feature set $i$ for a particular dataset is the vector

$$\vec{d}_i = ((y_1, \vec{x}_1), (y_2, \vec{x}_2), \ldots, (y_N, \vec{x}_N))$$

from the random variable $\vec{D}$, which represents all feature sets in a dataset. Each feature set is associated with a relevance variable, $r_i$, from the random variable $R \in \{0,1\}$. $r_i$ represents the biological relevance of the feature set and the reliability of the knowledge source. $\vec{D}$ and $R$ are jointly distributed. For each feature set, we assign a score that represents the predictive ability of that feature set:

$$A = h(\vec{D}, \theta) \tag{1}$$

where $A \in \mathbb{R}$ is a random variable and $\theta$ is a meta-parameter that characterizes the scoring function, or ranking metric.

Although $\theta$ may represent the space of all ranking methods, we use a reduced set of wrapper-based methods in our simulations. Specifically, we use a support vector machine (SVM) classifier with the linear and radial basis kernels and estimate the classification accuracy of biomarkers using the 0.632 bootstrap [5, 15]. The SVM classifier depends on a cost parameter, $C$, which determines the penalty of misclassification. The radial basis kernel depends on $\gamma$, which is proportional to the complexity of the classifier. For the radial basis kernel, the pair of parameters, $(C, \gamma)$, represents $\theta$. We discretely vary $C$ and $\gamma$ over the log scale range of 0.1 to $10^3$ and 0.01 to $10^5$, respectively. For the linear kernel, only the single parameter, $C$, represents $\theta$. We vary this parameter over the log scale range of 0.01 to $10^2$.
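As an illustration only, the metric space $\theta$ and the 0.632 bootstrap estimate described above could be sketched as follows. The grid resolution (one point per decade), the function names, and the generic `fit_predict` callable that stands in for an SVM are assumptions made for this sketch; the text does not specify them. The estimator shown is the simple form that weights resubstitution error by 0.368 and the mean out-of-bag error by 0.632.

```python
import random


def log_grid(lo_exp, hi_exp):
    """Powers of ten from 10**lo_exp to 10**hi_exp (one point per decade is an assumption)."""
    return [10.0 ** e for e in range(lo_exp, hi_exp + 1)]


def theta_grid():
    """Enumerate theta as ('linear', C) and ('rbf', C, gamma) over the ranges given in the text."""
    linear = [("linear", c) for c in log_grid(-2, 2)]       # C from 0.01 to 10^2
    rbf = [("rbf", c, g)
           for c in log_grid(-1, 3)                          # C from 0.1 to 10^3
           for g in log_grid(-2, 5)]                         # gamma from 0.01 to 10^5
    return linear + rbf


def error_rate(y_true, y_pred):
    """Fraction of misclassified samples."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


def bootstrap_632(X, y, fit_predict, n_boot=100, seed=0):
    """0.632 bootstrap error: 0.368 * resubstitution + 0.632 * mean out-of-bag error.

    fit_predict(X_train, y_train, X_test) -> predicted labels. This is a
    hypothetical interface; any classifier (an SVM in the paper) can be wrapped.
    """
    rng = random.Random(seed)
    n = len(y)
    resub = error_rate(y, fit_predict(X, y, X))
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]           # resample with replacement
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]       # samples left out of this resample
        if not oob:
            continue
        pred = fit_predict([X[i] for i in idx], [y[i] for i in idx],
                           [X[i] for i in oob])
        oob_errors.append(error_rate([y[i] for i in oob], pred))
    if not oob_errors:                                       # every resample covered all samples
        return resub
    return 0.368 * resub + 0.632 * sum(oob_errors) / len(oob_errors)
```

Each entry of `theta_grid()` would then be scored on every feature set, giving one vector of scores $\alpha$ per candidate metric.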

In practice, a gene expression dataset will have $N$ samples, each with $l$ features. We separately examine $m$ feature sets ($m$ can be different from $l$ and include, for example, all pairs, triplets, or a subset of feature combinations), corresponding to $\{\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_m\}$ and $\{r_1, r_2, \ldots, r_m\}$. From the mapping defined in eq. 1, we compute the set of values $\{\alpha_1, \alpha_2, \ldots, \alpha_m\}$ where each $\alpha_i$ is an observation from $A$. Using a simple selection method, we can then conclude that the best feature sets and potential biomarkers are in the set

$$G = \{ i : \alpha_i < \tau \} \tag{2}$$

where $\tau$ is a threshold.

We want to choose a $\theta$ that produces the most biologically relevant ranking of the $m$ feature sets, $\{\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_m\}$, with respect to a given set of knowledge. Assuming that lower scores are better, the best $\theta$ assigns scores such that $\alpha_i < \alpha_j$ for $r_i = 1$ and $r_j = 0$, i.e., feature set $i$ is known to be more relevant than feature set $j$ for this particular dataset. Although we may never know the relevance of all features in a dataset, we may infer from literature that the $k$ feature sets, $G_k = \{g_1, g_2, \ldots, g_k\}$, are relevant, where $k \ll m$. This implies that the elements of the set $\{\alpha_i : i \in G_k\}$ should generally be smaller than those of $\{\alpha_j : j \notin G_k\}$. If the knowledge is reliable, we want to choose a $\theta$ that maximizes the probability that the score of a feature set from $G_k$ is less than that of a feature set that is not from $G_k$. Explicitly, this probability is

$$P(\alpha_i < \alpha_j \mid \theta) \tag{3}$$

for $i \in G_k$ and $j \notin G_k$. The estimated optimal ranking method is

$$\hat{\theta} = \arg\max_{\theta} P(\alpha_i < \alpha_j \mid \theta), \tag{4}$$

keeping in mind that $\hat{\theta}$ is only optimal, or maximizes the probability, with respect to the given knowledge set. For $m$ feature sets, $k$ of which are in our knowledge set, $G_k$, we can empirically approximate the probability of eq. 3 with

$$\hat{P}(\alpha_i < \alpha_j \mid \theta) = \frac{1}{k(m-k)} \sum_{i \in G_k} \sum_{j \notin G_k} I(\alpha_i < \alpha_j) \tag{5}$$

where $I(x)$ evaluates to one when $x$ is true and zero when $x$ is false. Eq. 5 is equivalent to computing the area under an ROC curve (AUC) for classifying feature sets as either relevant or irrelevant [13].

2.2. Iteratively Updating Knowledge

It may be difficult to compile a comprehensive list of knowledge from literature and independent validation. Consequently, we can expect that some feature sets that are not in our knowledge set, $j \notin G_k$, are, in fact, relevant biomarkers. If $V$ is the set of all relevant biomarkers, regardless of whether their relevance is known, we define the knowledge update function, $S_{\hat{\theta}}$, as

$$G_{k+1} = S_{\hat{\theta}}(G_k) = \{\{G_k, \arg\min_{i} \alpha_i\} : i \in V,\ i \notin G_k\}. \tag{6}$$

This function adds to $G_k$ a relevant biomarker with the best rank according to the estimated optimal metric, $\hat{\theta}$. Of course, a feature set is known to be in the set $V$ only after performing a validation procedure such as qRT-PCR.

If we know all feature sets in $V$, we can quantify any improvement in efficiency due to optimization of the ranking metric. Using bootstrap resampling, we randomly and repeatedly partition the feature sets in $V$ into a group of known relevant feature sets (training) and a group of unknown relevant feature sets (testing). If there are $K$ elements in $V$, we randomly select $K$ elements with replacement, resulting in $K^*$ ($K^* < K$) unique elements for the testing set. We use the group of $K - K^*$ known relevant feature sets to optimize the ranking metric, then iteratively detect feature sets from the unknown set of $K^*$ features and update our knowledge using eq. 6. Every validation test requires a finite amount of time and resources. Plotting the fraction of correctly validated biomarkers (y-axis) vs. total validation time (x-axis) reveals that higher detection efficiency corresponds to a larger area under this curve. This curve is similar to a ROC curve, so we also call the area under this curve the AUC.
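The empirical probability of eq. 5, the metric choice of eq. 4, and the iterative update of eq. 6 can be sketched together as below. This is a minimal illustration under stated assumptions, not the implementation used in the study: scores are held in plain dictionaries, ties count as misses (the strict inequality of eq. 5), each validation test costs one unit of time, a candidate that fails validation is not retested, and the metric is re-selected before every test. All function and variable names are hypothetical.

```python
def knowledge_auc(scores, known):
    """Eq. 5: fraction of (known, not-known) pairs ranked correctly; lower score is better."""
    inside = [s for i, s in scores.items() if i in known]
    outside = [s for i, s in scores.items() if i not in known]
    if not inside or not outside:
        raise ValueError("need feature sets both inside and outside the knowledge set")
    wins = sum(1 for a_i in inside for a_j in outside if a_i < a_j)  # strict, so ties are misses
    return wins / (len(inside) * len(outside))


def select_optimal_metric(scores_by_theta, known):
    """Eq. 4: the theta with the largest eq. 5 value for the current knowledge set."""
    return max(scores_by_theta,
               key=lambda t: knowledge_auc(scores_by_theta[t], known))


def detection_curve(scores_by_theta, known, relevant):
    """Simulate eq. 6: validate the best-ranked untested feature set, one test per step.

    Returns the fraction of hidden relevant feature sets found after each test.
    """
    ids = set(next(iter(scores_by_theta.values())))
    G = set(known) & ids                        # current knowledge set G_k
    hidden = (set(relevant) & ids) - G          # relevant, but not yet known
    if not G or not hidden:
        raise ValueError("need initial knowledge and at least one hidden relevant feature set")
    untested = ids - G
    curve, found = [], 0
    while untested:
        theta = select_optimal_metric(scores_by_theta, G)
        best = min(sorted(untested), key=lambda i: scores_by_theta[theta][i])
        untested.remove(best)
        if best in hidden:                      # validation succeeds: update knowledge
            G.add(best)
            found += 1
        curve.append(found / len(hidden))
    return curve


# Two hypothetical metrics scoring five feature sets (lower is better).
scores_by_theta = {
    "metric_a": {"g1": 0.01, "g2": 0.02, "g3": 0.03, "g4": 0.50, "g5": 0.60},
    "metric_b": {"g1": 0.60, "g2": 0.50, "g3": 0.40, "g4": 0.02, "g5": 0.01},
}
curve = detection_curve(scores_by_theta, known={"g1"}, relevant={"g1", "g2", "g3"})
area = sum(curve) / len(curve)                  # area under the detection curve
```

A metric that ranks the hidden relevant feature sets early yields a curve that rises quickly and therefore a larger area.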
We repeat this bootstrap sampling of feature sets 100 times in order to compute the significance of the differences among three conditions: optimal metric selection, sub-optimal metric selection, and sub-optimal initial knowledge. For the sub-optimal metric selection condition, we use correct initial knowledge selected from $V$ via bootstrap, but use a modified equation to choose $\hat{\theta}$ with median AUC:

$$\hat{\theta} = \operatorname*{arg\,median}_{\theta} P(\alpha_i < \alpha_j \mid \theta). \tag{7}$$
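For illustration, the median selection of eq. 7 can be written in a few lines once an AUC value is available for every candidate metric. The dictionary layout and the choice of the lower median when the number of metrics is even are assumptions of this sketch; the text does not say how an even count is handled.

```python
def select_median_metric(auc_by_theta):
    """Eq. 7: return the theta whose AUC is the median over all candidate metrics.

    auc_by_theta maps a metric identifier to its eq. 5 value (hypothetical layout).
    With an even number of metrics the lower median is returned (an assumption).
    """
    ranked = sorted(auc_by_theta, key=auc_by_theta.get)   # ascending AUC
    return ranked[(len(ranked) - 1) // 2]


def select_best_metric(auc_by_theta):
    """Eq. 4, shown for contrast: the theta with the largest AUC."""
    return max(auc_by_theta, key=auc_by_theta.get)


aucs = {"metric_a": 0.9, "metric_b": 0.7, "metric_c": 0.5}   # made-up values
median_theta = select_median_metric(aucs)
best_theta = select_best_metric(aucs)
```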

Selection of a ranking metric with median AUC represents the common practice of arbitrarily selecting a metric with no regard for biological relevance and efficiency. This median AUC algorithm also serves as a reference point for assessing the potential improvement of efficiency when using the optimal algorithm. For the sub-optimal initial knowledge condition, we begin the simulation with incorrect knowledge selected via bootstrap and use eq. 4 to optimize the ranking algorithm before updating the current knowledge set. We expect the average AUC of the optimal selection condition to be higher than that of both of the sub-optimal conditions. Figure 1 illustrates this process.

To determine whether the optimization procedure is over-fitting to the knowledge set, we conduct additional tests using randomly selected knowledge sets. If over-fitting is occurring, results of the optimal, suboptimal, and suboptimal knowledge tests for randomly selected knowledge should be similar to those of the true knowledge set.

Figure 1. Quantifying the efficiency of detecting relevant feature sets. For clinical data, we define $V$ as the set of $K$ known differentially expressed feature sets. Using bootstrap cross validation, we partition $V$ into $K^*$ and $K - K^*$ samples. $K^*$ is the number of unique samples after sampling from $V$ $K$ times with replacement. We optimize the ranking algorithm using $K - K^*$ feature sets and assess the algorithm's efficiency in detecting the remaining $K^*$ feature sets. For each of the three conditions (optimal metric selection, sub-optimal metric selection, and sub-optimal initial knowledge) we perform this bootstrap sampling 100 times in order to compute the significance of any differences between mean AUC values.
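The bootstrap partition of $V$ described in the caption can be sketched as follows. The function name, the redraw when every element happens to be sampled (which would leave no known feature sets to optimize with), and the placeholder identifiers are assumptions of this sketch.

```python
import random


def bootstrap_partition(V, rng, max_tries=1000):
    """Split V into (training, testing) by drawing K elements with replacement.

    testing  = the K* unique elements that were drawn (treated as unknown),
    training = the remaining K - K* elements (treated as known).
    Redraws if all K elements are drawn, since training would be empty.
    """
    items = sorted(set(V))
    K = len(items)
    if K < 2:
        raise ValueError("need at least two relevant feature sets")
    for _ in range(max_tries):
        drawn = {rng.choice(items) for _ in range(K)}     # K draws with replacement
        if len(drawn) < K:                                # K* < K, so training is non-empty
            testing = sorted(drawn)
            training = [v for v in items if v not in drawn]
            return training, testing
    raise RuntimeError("could not draw a partition with a non-empty training group")


rng = random.Random(1)
V = ["fs%d" % i for i in range(21)]                       # placeholder feature-set ids
training, testing = bootstrap_partition(V, rng)
```

Repeating this call with the same `rng` 100 times would give the 100 partitions over which the AUC values are compared.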

2.3. Microarray Data Analysis and qRT-PCR Validation

We examine two clinical case studies using renal tumor microarray datasets. The first dataset, from a study by Schuetz et al., uses Affymetrix microarrays (HG-Focus, 8793 probesets) to profile samples from three subtypes of renal tumors: 13 clear cell (CC) renal cell carcinoma (RCC), 4 chromophobe (CHR) RCC, and 3 oncocytoma (ONC, benign) [2]. The second dataset, from a study by Jones et al., uses a different model of Affymetrix microarrays (HG-U133A, 22283 probesets reduced to 8793 that are common to HG-Focus) to examine similar renal tumor subtypes with 32 CC, 6 CHR, and 12 ONC samples [16]. We are interested in biomarkers that differentiate the CC class from the combined group of ONC and CHR.

Using literature, we identify genes that have been validated (via qRT-PCR or IHC) as differentially expressed between the CC and ONC/CHR subtypes. We then validate an additional 94 genes using qRT-PCR (using RNA from 34 CC and 18 CHR tissue samples). These 94 genes were selected by a renal cancer pathologist based on his knowledge and previous research. Only some of the 94 genes assayed with qRT-PCR are differentially expressed as assessed by a linear SVM with classification error estimated using 0.632 bootstrap. Genes measured with qRT-PCR are categorized as differentially expressed if the estimated classification error is less than 10%. Using the set of knowledge from both literature and qRT-PCR validation, we examine the efficiency of detecting these biomarkers by optimizing the ranking metric under various conditions, as illustrated in figure 1.

3. Results and Discussion

As described in the methods, we identify five genes from literature that are differentially expressed between the CC and ONC/CHR renal tumor subtypes (table 1). Each of these genes had been validated using either qRT-PCR or IHC. Additionally, we validate several other potential biomarkers using qRT-PCR and select genes with estimated classification errors of less than 10% (table 2).
Combining all knowledge from both literature and qRT-PCR validation, we examine the effect of optimizing the feature ranking metric using the method illustrated in figure 1. Box plots of the 100 iterations for each of the three tests indicate that optimal selection outperforms sub-optimal selection (figure 2, left column). The comparison of optimal to suboptimal metrics may seem to always favor the optimal metric. However, the optimal metric is not always a simple linear classifier. In fact, during the iterative gene detection process, $\theta$ changes frequently as $V$ is updated. Moreover, suboptimal selection may represent the common practice of arbitrarily selecting ranking metrics with no regard to their potential disadvantages for particular datasets.

The box plots represent the median and quartiles of the AUC values for each of the 100 iterations. Correspondingly, the ROC curves also indicate that the optimal selection method improves the efficiency of biomarker detection (figure 2, right column). For the Schuetz data (figure 2, top row), the performance difference between the optimal and suboptimal ranking metrics seems small according to the box plots. However, the ROC curve of the optimal metric initially rises much more quickly than that of the suboptimal metric. The region of low specificity boosts the performance of the suboptimal metric. However, this region should be neglected when assessing performance since the number of false positives at this point is very high. Validation procedures would likely consider only the biomarkers detected in the high specificity region. Results are similar for the Jones data (figure 2, bottom row).

The high variance of the suboptimal initial knowledge condition indicates that optimization of the ranking metric is sensitive to the initial conditions. Some of the randomly selected initial knowledge may, in fact, be differentially expressed, resulting in good performance. However, these random initial knowledge sets are more likely to be irrelevant. Thus, box plots for this condition illustrate this mixture of knowledge quality. These results stress the importance of the quality of biomarker knowledge.

The control tests using random knowledge sets for $V$ show that our method does not over-fit to the knowledge (figure 2, box plots CO, CSO, and CSK). None of the algorithms considered in our space of $\theta$ are able to favorably rank these randomly selected genes. AUCs of these control tests are close to 0.5, as expected for random classification.

Using all knowledge from literature and the first round of qRT-PCR, we optimize the ranking metric and select the top genes that have not been previously validated and that have estimated classification errors of less than 5% (table 3). We can link a few of these genes directly to previous literature pertaining to renal cancer. For example, CXCR4 has been linked to kidney cancer. Using qRT-PCR, Schrader et al. show that this gene is over-expressed in kidney cancer tissue compared to normal kidney tissue [17]. IGFBP3 and KLF10 have also been linked to renal cell carcinoma [18, 19]. Validation of these genes using qRT-PCR may yield additional knowledge to iteratively refine the biomarker selection process. However, since we want to primarily focus on the methodology here, we reserve the actual validation of these results for a future study.
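The candidate selection step described above (keep genes that are not yet validated and whose estimated error under the chosen metric falls below a threshold, then order them by error) could look like the following sketch. The function name, the dictionary layout, and the gene identifiers and error values in the example are placeholders, not data from the study.

```python
def propose_candidates(errors, validated, max_error=0.05):
    """Return (gene, error) pairs for unvalidated genes with error below max_error, best first.

    errors:    dict mapping gene id -> estimated classification error under the chosen metric.
    validated: set of gene ids that are already part of the knowledge set.
    """
    picks = [(g, e) for g, e in errors.items()
             if g not in validated and e < max_error]
    return sorted(picks, key=lambda pair: pair[1])


# Placeholder inputs for illustration only.
errors = {"gene_a": 0.01, "gene_b": 0.20, "gene_c": 0.03, "gene_d": 0.00}
validated = {"gene_d"}
candidates = propose_candidates(errors, validated)
```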

Table 1. Genes validated as differentially expressed between CC and ONC/CHR renal tumor subtypes from various knowledge sources.

| Gene Symbol | Knowledge Source | Validation Method |
|---|---|---|
| CA9 | Chen et al., Clin Cancer Res, 2005 | qRT-PCR |
| CLCNKB | Chen et al., Clin Cancer Res, 2005 | qRT-PCR |
| DEFB1 | Schuetz et al., J Mol Diagn, 2004 | qRT-PCR, IHC |
| LRP2 | Schuetz et al., J Mol Diagn, 2004 | qRT-PCR, IHC |
| PVALB | Chen et al., Clin Cancer Res, 2005 | qRT-PCR |

Table 2. Genes that we validated with qRT-PCR. These genes have estimated classification errors of less than 10% as assessed by a linear SVM classifier using 0.632 bootstrap estimation.

| Gene Symbol | Error | Gene Symbol | Error |
|---|---|---|---|
| STC1 | 2.43E-05 | COX5A | 0.0394058 |
| SLC25A4 | 0.00186696 | BAG1 | 0.0548365 |
| CFTR | 0.00279081 | LY6E | 0.0596081 |
| PDHA1 | 0.0133316 | CD99 | 0.0600892 |
| PFKM | 0.0279739 | AKAP12 | 0.0624445 |
| NNMT | 0.0289622 | ACAT1 | 0.0687972 |
| CP | 0.0300157 | SPTBN2 | 0.077287 |
| CFB | 0.0387219 | GOT1 | 0.0784855 |

Figure 2. Box plots of AUCs over 100 iterations for each test (left). AUCs for the optimal test (O) are higher than both the sub-optimal (SO) and sub-optimal knowledge (SK) tests (differences are statistically significant with p-values very close to 0). The control tests, using randomly selected knowledge, indicate that optimizing the ranking metric does not over-fit (CO = control optimal, CSO = control suboptimal, CSK = control suboptimal knowledge). Average ROC curves for each test illustrate the differences in biomarker detection efficiency (right). The ROC for the optimal metric test (solid line) indicates more accurate biomarker detection for both the Schuetz (top row) and Jones (bottom row) renal cancer datasets.

Table 3. Proposed list of genes for further qRT-PCR validation.

| Gene Symbol | Error | Gene Symbol | Error |
|---|---|---|---|
| ACLY | 0 | PCCB | 0.03274 |
| CXCR4 | 0.013907 | TMSB10 | 0.034201 |
| C4A /// C4B | 0.0187 | HCLS1 | 0.034415 |
| FLNA | 0.019903 | ACTA2 | 0.039398 |
| PMP22 | 0.023798 | IGFBP3 | 0.040989 |
| PFKFB3 | 0.026506 | NFKBIA | 0.042332 |
| KLF10 | 0.027801 | CD44 | 0.049095 |
| PRG1 | 0.03003 | IER3 | 0.049571 |
| LGALS1 | 0.030617 | | |

4. Conclusion

We have shown that biomarker identification by feature ranking benefits from knowledge integration at key points. Using this knowledge, whether from clinical observations, laboratory experiments, or existing literature, we can intelligently choose an optimal ranking metric for a specific gene expression dataset. The use of an optimal metric for ranking and identifying novel biomarkers reduces the number of false discoveries, increases the number of true discoveries, reduces the required time for validation, and increases the overall efficiency of the process. The results of our simulations indicate that knowledge integration improves biomarker selection for clinical microarray data. Although this study assumes independent gene expression, the method is general and we can use it to rank combinatorial gene expression data as well. Furthermore, we test this method using only a limited set of wrapper-based feature ranking metrics. However, it is easily expandable to encompass a variety of metrics, including the commonly used filter methods such as t-tests and fold change. We hope that the proposed method will impact biomarker identification practices and improve the effectiveness of resulting clinical applications.

Acknowledgments

This research has been supported by grants from National Institutes of Health (R01CA108468, P20GM072069, U54CA119338), Microsoft Research Funding, and Georgia Cancer Coalition (Distinguished Cancer Scholar Award to MDW).

References

1. Golub, T., et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 1999. 286: p. 531-537.
2. Schuetz, A., et al., Molecular classification of renal tumors by gene expression profiling. J Mol Diagn, 2004.

3. Singh, D., et al., Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002. 1: p. 203-209.
4. van't Veer, L., et al., Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002. 415: p. 530-536.
5. Braga-Neto, U. and E. Dougherty, Is cross-validation valid for small-sample microarray classification? Bioinformatics, 2004. 20: p. 374-380.
6. Chuaqui, R., et al., Post-analysis follow-up and validation of microarray experiments. Nature Genetics, 2002. 32: p. 509-514.
7. Morey, J., J. Ryan, and F. Van Dolah, Microarray validation: factors influencing correlation between oligonucleotide microarrays and real-time PCR. Biol. Proced. Online, 2006. 8(1): p. 175-193.
8. Shi, L., et al., The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol, 2006. 24(9): p. 1151-61.
9. Stokes, T., et al., chip artifact CORRECTion (caCORRECT): A Bioinformatics System for Quality Assurance of Genomics and Proteomics Array Data. Annals of Biomedical Engineering, 2007. 35: p. 1068-1080.
10. Aerts, S., et al., Gene prioritization through genomic data fusion. Nature Biotechnology, 2006. 24(5): p. 537-544.
11. Kuffner, R., K. Fundel, and R. Zimmer, Expert knowledge without the expert: integrated analysis of gene expression and literature to derive active functional contexts. Bioinformatics, 2005. 21: p. 259-267.
12. Kong, S., W. Pu, and P. Park, A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics, 2006. 22(19): p. 2373-2380.
13. Mukherjee, S. and S. Roberts, A theoretical analysis of the selection of differentially expressed genes. J Bioinformatics Comput Biol, 2005. 3: p. 627-643.
14. Bellazzi, R. and B. Zupan, Towards knowledge-based gene expression data mining. Journal of Biomedical Informatics, 2007. 40: p. 787-802.
15. Efron, B. and R. Tibshirani, Improvements on Cross-Validation: The .632+ Bootstrap Method. Journal of the American Statistical Association, 1997. 92(438): p. 548-560.
16. Jones, J., et al., Gene signatures of progression and metastasis in renal cell cancer. Clin Cancer Res, 2005. 11(16): p. 5730-9.
17. Schrader, A., et al., CXCR4/CXCL12 expression and signalling in kidney cancer. British Journal of Cancer, 2002. 86: p. 1250-1256.
18. Rosendahl, A. and G. Forseberg, IGF-I and IGFBP-3 augment transforming growth factor-beta actions in human renal carcinoma cells. Kidney International, 2006. 70: p. 1584-1590.
19. Ivanov, S., et al., Two novel VHL targets, TGFBI (BIGH3) and its transactivator KLF10, are up-regulated in renal clear cell carcinoma and other tumors. Biochem Biophys Res Commun, 2008.