Sequence Analysis using Logic Regression

Genetic Epidemiology (Suppl ): S66 S6 (00)

Charles Kooperberg, Ingo Ruczinski, Michael L. LeBlanc, and Li Hsu
Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington

Logic Regression is a new adaptive regression methodology that attempts to construct predictors as Boolean combinations of (binary) covariates. In this paper we modify this algorithm to deal with single-nucleotide polymorphism (SNP) data. The predictors that are found are interpretable as risk factors of the disease. Significance of these risk factors is assessed using techniques like cross-validation, permutation tests, and independent test sets. These model selection techniques remain valid when data is dependent, as is the case for the family data used here. In our analysis of the Genetic Analysis Workshop data we identify the exact locations of mutations on gene and gene 6 and a number of mutations on gene that are associated with the affected status, without selecting any false positives.

Key words: adaptive estimation, Boolean combinations, simulated annealing, SNP

LOGIC REGRESSION

Finding associations between many genes/environmental factors and disease outcomes leads to statistical problems with a high-dimensional predictor space. In this paper we first discuss a new adaptive regression methodology, Logic Regression, which we apply to sequence data for the general population of the Genetic Analysis Workshop (GAW) data. Logic Regression [Ruczinski, 2000; Ruczinski et al., 00] is intended for situations where most predictors are binary (0/1), and the goal is to find Boolean combinations of these predictors that are associated with an outcome variable. First assume that all predictors $X_i$, $i = 1, \ldots, p$ are binary and write $X_i$ instead of $\mathrm{Ind}(X_i = 1)$ and $X_i^c$ instead of $\mathrm{Ind}(X_i = 0)$, where $\mathrm{Ind}(\cdot)$ is the usual indicator function.
The type of regression problem is irrelevant: all we need is a score function, such as RSS in linear regression, log-likelihood in generalized regression, partial log-likelihood in Cox regression, or misclassification, that relates fitted values with the response. For simplicity, we assume here that $Y$ is a binomial random variable.

Address reprint requests to Dr. Charles Kooperberg, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 00 Fairview Avenue N, MP-00, Seattle, WA 9809-04

The simplest Logic Regression model is now

$$\hat{Y} = \mathrm{Ind}(L = 1), \tag{1}$$

where $L$ is any logic (Boolean) expression that involves the predictors $X_i$, such as L = X or L = X (X (X X4)). Misclassification, $\sum (Y - \hat{Y})^2$, would be the score for equation 1. If we want a regression equation of this form, the main problem is to find good candidates for $L$, as the collection of all possible logic terms is enormous. It turns out to be very convenient to write logic expressions in tree form. For example, we can draw X (X (X X4)) as the tree in the first panel of Figure 1. Using this logic tree representation it is possible to obtain any other logic tree by a finite number of operations such as growing of branches, pruning of branches and changing of leaves (borrowing from CART [Breiman et al., 1984] terminology). In the remaining panels of Figure 1 we show three of the logic trees that can be obtained by applying one operation to the original tree.

Using this representation and these operations on logic trees we can adaptively select $L$ using a (stochastic) simulated annealing algorithm. We start with $L = 1$. Then, at each stage a new tree is selected at random among those that can be obtained by simple operations on the current tree. This new tree always replaces the current tree if it has a better score than the old tree; otherwise, it is accepted with a probability that depends on the difference between the scores of the old and the new tree and on the stage of the algorithm. This simulated annealing algorithm has similarities with the Bayesian CART algorithm [Chipman et al., 1998], in which a CART tree is optimized stochastically. Both of these algorithms are distinct from the greedy algorithm employed by CART, in that at any stage they do not necessarily pick the move that improves the fit most. Diagnostics, and a scheme that adjusts the above-mentioned probabilities slowly enough during this algorithm, guarantee that we will find (close to) the optimal model.
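As a minimal sketch of the ideas above (not the authors' implementation), the following Python code represents a logic tree as nested tuples, scores it by misclassification, and applies a Metropolis-style acceptance step. The tuple encoding, the function names, and the specific acceptance probability exp(-(new - old) / temperature) are assumptions made for illustration; the text only states that the probability depends on the score difference and on the stage of the algorithm.

```python
import math
import random

# Hypothetical encoding of a logic tree as nested tuples:
#   ("var", i)                 -> predictor X_i (0-based index)
#   ("not", t)                 -> complement of subtree t
#   ("and", a, b), ("or", a, b) -> Boolean combination of two subtrees

def evaluate(tree, x):
    """Return the Boolean value of `tree` for one observation x (sequence of 0/1)."""
    op = tree[0]
    if op == "var":
        return bool(x[tree[1]])
    if op == "not":
        return not evaluate(tree[1], x)
    if op == "and":
        return evaluate(tree[1], x) and evaluate(tree[2], x)
    if op == "or":
        return evaluate(tree[1], x) or evaluate(tree[2], x)
    raise ValueError(f"unknown node type: {op!r}")

def misclassification(tree, X, y):
    """Count observations where Ind(L = 1) differs from the binary response."""
    return sum(int(evaluate(tree, x)) != yi for x, yi in zip(X, y))

def accept(old_score, new_score, temperature, rng=random):
    """Assumed Metropolis-style rule; a lower score is better.

    A candidate that is no worse is always accepted. A worse candidate is
    accepted with probability exp(-(new - old) / temperature), so lowering
    the temperature over the stages makes worse moves increasingly rare.
    """
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / temperature)

# Toy data: four observations of three binary predictors.
X = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (1, 1, 1)]
y = [1, 0, 0, 1]
tree = ("and", ("var", 0), ("or", ("not", ("var", 1)), ("var", 2)))
print(misclassification(tree, X, y))
```

A full search would repeatedly propose a neighbouring tree (grow a branch, prune a branch, or change a leaf), score it, and call `accept` with a temperature that decreases slowly.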
An advantage of simulated annealing is that we are much less likely to end up in a local maximum of the scoring function. Properties of the simulated annealing algorithm depend on Markov chain theory, and thus on the set of operations that can be applied to logic trees [Aarts and Korst, 1989].

[Figure 1: Logic tree representation of X (X (X X4)) (first panel) and three logic trees that can be obtained by simple operations on this tree. Panels: original expression, growing a branch, changing a predictor, pruning a branch.]

We should point out that the rules obtained by the Logic Regression algorithm are distinctly different from those found by tree-based algorithms, such as CART [Breiman et al., 1984]. For those types of methods, the eventual decision rules are like

$$\text{if } (X_1 \wedge X_2 \wedge X_3) \vee (X_1 \wedge X_2 \wedge X_4) \text{ predict } \hat{Y} = 1. \tag{2}$$

In general, rules are of the form $\bigvee_i E_i$, where $E_i = \bigwedge_j S_{ij}$ and the $S_{ij}$ are simple relations, such as X k X. Rules of this form are said to be in Disjunctive Normal Form (DNF) [Fleisher et al., 98]. Though any logic expression can be written in DNF, the complexity of such an expression can be reduced considerably if logic expressions of other forms are allowed. Note, for example, that the condition in equation 2 can be reduced to $(X_1 \wedge X_2 \wedge (X_3 \vee X_4))$. This latter expression is not in DNF, however.

For complicated problems, we may want to consider more than one logic tree at the same time. Thus, we can extend the classification model (equation 1) (using a binomial likelihood) as

$$\mathrm{logit}(Y = 1 \mid X) = \beta_0 + \sum_{j=1}^{m} \beta_j L_j, \tag{3}$$

where each of the $L_j$ is a separate logic tree.

In practice not all predictors may be binary. Continuous predictors can still be included in Logic Regression models by allowing terms like $X_i \le a$ to enter the model [Ruczinski, 2000]. Alternatively, we can include continuous predictors in a regression model, in addition to logic terms, as we did for the GAW data (see Application to the GAW Data).

Using model selection, in addition to a stochastic model building strategy, is of critical importance, as the logic tree with the best score typically overfits the data. A variety of methods of model selection using cross-validation and randomization tests exist [Ruczinski, 2000]. For the GAW data we have replicate data; thus we decided to fit our models on one replicate (training set), and validate them on another replicate (test set).

LOGIC REGRESSION AND GENETIC DATA

In this paper we use the Logic Regression methodology to explore the relationship between single-nucleotide polymorphisms (SNPs) in sequence data and response variables related to a disease outcome. For sequence data we create two predictors for each site with zero, one or two variant alleles.
Let $X_R = 1$ if the site has two variant alleles, and $X_R = 0$ if it has zero or one variant allele, and let $X_D = 1$ if the site has one or two variant alleles, and $X_D = 0$ otherwise. As the selection of a logic tree takes place in an adaptive manner, whether $X_D$ or $X_R$ ends up in the logic tree indicates whether a dominant model or recessive model, respectively, for this site best fits the data. As the adaptive methodology removes unnecessary details from the tree, $X_R$ and $X_D$ will not end up in the same logic tree, since $X_R \wedge X_D \equiv X_R$ and $X_R \vee X_D \equiv X_D$, so that the search algorithm automatically reduces such a branch to $X_R$ or $X_D$. When more than one logic tree is fit, $X_R$ can appear in one logic tree and $X_D$ can appear in another logic tree, effectively fitting an additive or multiplicative model.

In the application to the GAW data we ignore the family structure of the data, and opt for a direct application of the Logic Regression algorithm to a binary and a continuous response. Application of the Logic Regression algorithm to a model that incorporates family data requires identification of a score function (likelihood) for a logic model; ignoring such structure was primarily a matter of convenience. Since we carried out model selection using a test set that satisfies the same family structure as the training set, ignoring the family structure only affects the efficiency of our approach.

APPLICATION TO THE GAW DATA

In analyzing the GAW data we decided up-front that we would use (part of) the first 5 replicates as training data and the second 5 replicates as an independent test set. Using an independent test set simplifies the model selection. We applied the Logic Regression algorithm to the 5th replicate data set for the general population. We used the 42nd replicate as the test data set and ignored the family structure. We have sequence data for,000 persons. We repeated part of our experiments on a few other replicates and found very similar results.

We processed the sequence data for all of the first 5 replicates, keeping only those sites for which, among the people that have sequence information, fewer than 98% of the persons had zero variant alleles and fewer than 98% had two variant alleles. This left us with 694 sites on the 7 genes combined, which were recoded into 2 × 694 = 1,388 predictors using the scheme detailed in the previous section. In the remainder we identify sites and coding of variables as follows: Gi.D.Sj refers to site j on gene i, using dominant coding, i.e. Gi.D.Sj = 1 if at least one variant allele exists. Similarly, Gi.R.Sj refers to site j on gene i, using recessive coding, i.e. Gi.R.Sj = 1 if two variant alleles exist. We identify complements by the superscript $c$, e.g. Gi.D.Sj$^c$.

Affected status. As our primary response variable we used the affected status. We fitted a logistic regression model of the form

$$\mathrm{logit}(\text{affected}) = \beta_0 + \beta_1\,\text{environ}_1 + \beta_2\,\text{environ}_2 + \beta_3\,\text{gender} + \sum_{i=1}^{K} \beta_{i+3} L_i. \tag{4}$$

Here gender was coded as 1 for female and 0 for male, $\text{environ}_j$, $j = 1, 2$, are the two environmental factors that were provided, and the $L_i$, $i = 1, \ldots, K$ are logic expressions based on the 1,388 predictors that were created from the sequence data.
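The recoding of each site into a dominant and a recessive indicator, and the 98% filter on sites described above, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `code_site` and `keep_site` are hypothetical, and the input is assumed to be a per-person count of variant alleles (0, 1 or 2) at one site.

```python
def code_site(variant_count):
    """Map a variant-allele count (0, 1 or 2) to (dominant, recessive) indicators.

    dominant  = 1 if at least one variant allele exists (the Gi.D.Sj coding)
    recessive = 1 if two variant alleles exist          (the Gi.R.Sj coding)
    """
    if variant_count not in (0, 1, 2):
        raise ValueError("variant_count must be 0, 1 or 2")
    dominant = int(variant_count >= 1)
    recessive = int(variant_count == 2)
    return dominant, recessive

def keep_site(counts, threshold=0.98):
    """Keep a site only if fewer than `threshold` of the persons have zero
    variant alleles and fewer than `threshold` have two variant alleles."""
    n = len(counts)
    frac_zero = sum(c == 0 for c in counts) / n
    frac_two = sum(c == 2 for c in counts) / n
    return frac_zero < threshold and frac_two < threshold

# The recessive indicator implies the dominant one, so combining both
# codings of the same site in one branch adds nothing:
for v in (0, 1, 2):
    d, r = code_site(v)
    assert (d and r) == r   # "dominant AND recessive" reduces to recessive
    assert (d or r) == d    # "dominant OR recessive" reduces to dominant
```

Because each retained site contributes exactly two indicators, the number of predictors is twice the number of retained sites.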
Initially we fit models with $K = 1, 2, 3$, allowing logic expressions of at most size 8 on the training data. In Figure 2 we show the deviance of the various fitted Logic Regression models.

[Figure 2: Training (solid) and test (open) set deviances for Logic Regression models for the affected state, plotted as score against number of terms. The numbers in the boxes indicate the number of logic trees.]

As very likely the larger models overfit the data, we validated the models by computing the fitted deviance for an independent test set, keeping the models fixed at those selected. These results are also shown in Figure 2. From this figure we see that the models with three logic trees with a total of three and six leaves have the lowest test-set deviance. As the goal of the current investigation is to identify sites that are possibly linked to the outcome, we prefer the larger of these two models. In addition, when we repeated the experiment on a training set of five replicates and a test set of 5 replicates, the model with six leaves had a slightly lower test-set deviance than the model with three leaves. We carried out a randomization test, conditioning on the model with three leaves, to determine how much better a model with six leaves fits the data if a model with three leaves is the true model. As the improvement that we observed is much larger than what would be expected by chance, this confirmed that the model with six leaves fits the data better than a model with three leaves.

The model with six leaves that was fitted on the single replicate is presented in Figure 3. The logistic regression model corresponding to this Logic Regression model is

logit(affected) = 0.44 + 0.005 environ_1 − 0.7 environ_2 + .98 gender − .09 L_1 + .00 L_2 − .8 L_3.

All but the second environment variable in this model are statistically significant. Note that for all 000 persons with sequence data in replicate 5, site 76 on gene is exactly the opposite of site 557, which was indicated as the correct site on the solutions (for example, a person with $v$ variant alleles on site 76 always has $2 - v$ variant alleles on site 557). Similarly, the Logic Regression algorithm identified site 5007 on gene 6, which is identical for all,000 persons to site 578, the site which was indicated on the solutions. We note that the correct site on gene appears twice in the logic tree model.
Once as a recessive coding (G.R.S557) and once, effectively, as a dominant coding (G.R.S76$^c$ ≡ G.D.S557 on this replicate) for site 557, suggesting that the true model may have been additive. When two sites are in (almost) perfect disequilibrium, as is the case for these sites, the Logic Regression algorithm may identify either of these sites, as the algorithm cannot distinguish between them. The three remaining leaves in the model are all part of gene : two sites close to the ends of the gene and one site in the center.

[Figure 3: Fitted Logic Regression model for the affected state data with three trees and six leaves. Variables that are printed white on a black background are the complement of those indicated. Leaves shown: G6.D S5007, G.R S557, G.R S76, G.D S86, G.D S47, G.D S049.]

Quantitative trait 5. In the solutions to the GAW data it was described that Quantitative trait 5 (Q5) depended on the sequence data of gene, but the exact pattern of the mutations that were influencing this trait was not given. To investigate this further we decided to fit another Logic Regression model of the form of equation 4, with Q5 as the response variable, using linear regression. We would expect that this way we would be better able to find a more precise dependence on gene than for the logistic model using affected status as the response. We carried out model selection identical to that for the logistic regression described above. While the solutions indicated that the (sequence) dependence of Q5 was only on gene, we allowed all genes in the model. The model that was selected had three logic trees and a total of seven leaves. The three trees were $L_1$ = (G.D.S85 G.D.S89 ) G.R.S8657, $L_2$ = G.D.S400 G.D.S4977, and $L_3$ = G.D.S4 G.D.S009. Thus, a model that depends on a large number of sites on gene is fit. The solutions indicate that Quantitative trait 5 indeed depends on gene and not on the other genes. (Models with different numbers of trees or leaves for Q5 always exclusively depended on sites on gene.) Except for site 8657, all sites occurred in the fitted model as dominant genes. While this could be related to the way the data was generated, it is also possible that the data was generated using an additive model, as for six of the seven sites that were selected few people had two variants, and the power of selecting recessive codings of a site is thus smaller than that of selecting the dominant codings of the same site.

DISCUSSION

Our analysis of the GAW data shows the potential usefulness of Logic Regression. While our algorithm was provided with data on hundreds of predictors (sites), it correctly picked out those few that were the responsible sites in the underlying model. No tweaking of the algorithm was needed to achieve these results. In applying Logic Regression, it is advantageous to use raw sequence data, rather than data that has been aggregated as haplotypes, as such predictors would yield few categorical variables with many levels, while sequence data yields many variables with few levels. In combining these variables the Logic Regression algorithm, effectively, determines which levels of the haplotype are associated with disease.
ACKNOWLEDGEMENTS

We thank Sue Li for helpful discussions. Research supported in part by NIH grant CA 7484.

REFERENCES

Aarts EHL, Korst JHM. 1989. Simulated annealing and Boltzmann machines. New York: Wiley.

Breiman L, Friedman JH, Olshen RA, et al. 1984. Classification and Regression Trees. Belmont, California: Wadsworth.

Chipman H, George E, McCulloch R. 1998. Bayesian CART model search (with discussion). J. Am. Statist. Assoc. 9:95-60.

Fleisher H, Tavel M, Yeager Y. 98. Exclusive-or representation of boolean functions. IBM J. Res. Develop. 7:4-6.

Ruczinski I. 2000. Logic Regression and statistical issues related to the protein folding problem. Ph.D. thesis. Seattle: University of Washington, Dept. of Statistics.

Ruczinski I, Kooperberg C, LeBlanc ML. 00. Logic Regression technical report. Seattle: Fred Hutchinson Cancer Research Center.