Sequence Analysis using Logic Regression

Genetic Epidemiology 21 (Suppl 1): S626-S631 (2001)

Charles Kooperberg, Ingo Ruczinski, Michael L. LeBlanc, and Li Hsu

Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington

Logic Regression is a new adaptive regression methodology that attempts to construct predictors as Boolean combinations of (binary) covariates. In this paper we modify this algorithm to deal with single-nucleotide polymorphism (SNP) data. The predictors that are found are interpretable as risk factors of the disease. Significance of these risk factors is assessed using techniques like cross-validation, permutation tests, and independent test sets. These model selection techniques remain valid when the data are dependent, as is the case for the family data used here. In our analysis of the Genetic Analysis Workshop 12 data we identify the exact locations of mutations on gene 1 and gene 6 and a number of mutations on gene 2 that are associated with the affected status, without selecting any false positives.

Key words: adaptive estimation, Boolean combinations, simulated annealing, SNP

LOGIC REGRESSION

Finding associations between many genes/environmental factors and disease outcomes leads to statistical problems with a high-dimensional predictor space. In this paper we first discuss a new adaptive regression methodology, Logic Regression, which we apply to sequence data for the general population of the Genetic Analysis Workshop 12 (GAW 12) data. Logic Regression [Ruczinski, 2000; Ruczinski et al., 2001] is intended for situations where most predictors are binary (0/1), and the goal is to find Boolean combinations of these predictors that are associated with an outcome variable. First assume that all predictors $X_i$, $i = 1, \ldots, p$, are binary, and write $X_i$ instead of $\mathrm{Ind}(X_i = 1)$ and $X_i^c$ instead of $\mathrm{Ind}(X_i = 0)$, where $\mathrm{Ind}(\cdot)$ is the usual indicator function.
The type of regression problem is irrelevant: all we need is a score function, such as the residual sum of squares (RSS) in linear regression, the log-likelihood in generalized regression, the partial log-likelihood in Cox regression, or misclassification, that relates fitted values to the response. For simplicity, we assume here that $Y$ is a binomial random variable. The simplest Logic Regression model is now

$$\hat{Y} = \mathrm{Ind}(L = 1), \qquad (1)$$

where $L$ is any logic (Boolean) expression that involves the predictors $X_i$, such as $L = X_1$ or a nested expression of the form $X_1 \circ (X_2 \circ (X_3 \circ X_4))$, in which each $\circ$ is one of the Boolean operators $\wedge$ (and) or $\vee$ (or) and any predictor may appear as its complement. Misclassification, $\sum (Y - \hat{Y})^2$, would be the score for equation 1.

Address reprint requests to Dr. Charles Kooperberg, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue N, MP-1002, Seattle, WA 98109-1024.

If we want a regression equation of this form, the main problem is to find good candidates for $L$, as the collection of all possible logic terms is enormous. It turns out to be very convenient to write logic expressions in tree form. For example, the nested four-predictor expression above is drawn as a tree in the first panel of Figure 1. Using this logic tree representation it is possible to obtain any other logic tree by a finite number of operations, such as growing of branches, pruning of branches, and changing of leaves (borrowing from CART [Breiman et al., 1984] terminology). In the remaining panels of Figure 1 we show three of the logic trees that can be obtained by applying one operation to the original tree.

Using this representation and these operations on logic trees we can adaptively select $L$ using a (stochastic) simulated annealing algorithm. We start with $L = 1$. Then, at each stage a new tree is selected at random among those that can be obtained by simple operations on the current tree. This new tree always replaces the current tree if it has a better score than the old tree; otherwise it is accepted with a probability that depends on the difference between the scores of the old and the new tree and on the stage of the algorithm. This simulated annealing algorithm has similarities with the Bayesian CART algorithm [Chipman et al., 1998], in which a CART tree is optimized stochastically. Both of these algorithms are distinct from the greedy algorithm employed by CART, in that at any stage they do not necessarily pick the move that improves the fit most. Diagnostics, and a scheme that adjusts the above-mentioned probabilities slowly enough during the algorithm, guarantee that we will find (close to) the optimal model.
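
The search described above can be illustrated with a small sketch. It is not the authors' implementation: the tuple representation of trees, all function names, the geometric cooling schedule with its start and end temperatures, the cap on the number of leaves, and the choice to start from a random single leaf rather than from a constant tree are assumptions made only for this illustration. The score is the misclassification count, and a proposed tree is accepted if it is no worse, or otherwise with a probability that shrinks with the score difference and with the stage of the run.

```python
import math
import random

# Hypothetical representation: a logic tree is ("leaf", j, negate)
# or (op, left, right) with op in {"and", "or"}.

def evaluate(tree, x):
    """Evaluate a logic tree on one observation x (sequence of 0/1)."""
    if tree[0] == "leaf":
        _, j, negate = tree
        return (not x[j]) if negate else bool(x[j])
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    return (a and b) if op == "and" else (a or b)

def misclassification(tree, X, y):
    """Number of observations for which Ind(L = 1) differs from y."""
    return sum(int(evaluate(tree, x)) != yi for x, yi in zip(X, y))

def n_leaves(tree):
    return 1 if tree[0] == "leaf" else n_leaves(tree[1]) + n_leaves(tree[2])

def random_leaf(p, rng):
    return ("leaf", rng.randrange(p), rng.random() < 0.5)

def paths(tree, prefix=()):
    """Positions of all nodes, as tuples of child indices (1 = left, 2 = right)."""
    out = [prefix]
    if tree[0] != "leaf":
        out += paths(tree[1], prefix + (1,)) + paths(tree[2], prefix + (2,))
    return out

def get(tree, path):
    for k in path:
        tree = tree[k]
    return tree

def put(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = put(tree[path[0]], path[1:], new)
    return tuple(parts)

def propose(tree, p, rng):
    """Apply one random move: change a leaf, grow a branch, or prune a branch."""
    path = rng.choice(paths(tree))
    node = get(tree, path)
    if node[0] == "leaf":
        if rng.random() < 0.5:
            return put(tree, path, random_leaf(p, rng))          # change a leaf
        op = rng.choice(["and", "or"])
        return put(tree, path, (op, node, random_leaf(p, rng)))  # grow a branch
    return put(tree, path, node[rng.choice([1, 2])])             # prune a branch

def anneal(X, y, n_steps=3000, t_start=2.0, t_end=0.01, max_leaves=8, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    current = random_leaf(p, rng)          # assumption: start from a random leaf
    cur_score = misclassification(current, X, y)
    best, best_score = current, cur_score
    for step in range(n_steps):
        # Assumed geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        cand = propose(current, p, rng)
        if n_leaves(cand) > max_leaves:
            continue
        delta = misclassification(cand, X, y) - cur_score
        # Accept if no worse; otherwise accept with probability exp(-delta/temp).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_score = cand, cur_score + delta
            if cur_score < best_score:
                best, best_score = current, cur_score
    return best, best_score

# Usage on a toy data set: all combinations of three binary predictors.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
target = ("and", ("leaf", 0, False), ("or", ("leaf", 1, False), ("leaf", 2, True)))
y = [int(evaluate(target, x)) for x in X]
print(anneal(X, y))
```

Because the moves are drawn at random rather than chosen greedily, different seeds can return different trees with the same score; in practice the returned tree would still need to be assessed by model selection.
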
An advantage of simulated annealing is that we are much less likely to end up in a local optimum of the scoring function. Properties of the simulated annealing algorithm depend on Markov chain theory, and thus on the set of operations that can be applied to logic trees [Aarts and Korst, 1989].

Figure 1. Logic tree representation of a nested expression in $X_1, \ldots, X_4$ (first panel, original expression) and three logic trees that can be obtained by simple operations on this tree (growing a branch, changing a predictor, pruning a branch).

We should point out that the rules obtained by the Logic Regression algorithm are distinctly different from those found by tree-based algorithms, such as CART [Breiman et al., 1984]. For those types of methods, the eventual decision rules are like

$$\text{if } (X_1 \wedge X_2 \wedge X_3) \vee (X_1 \wedge X_2 \wedge X_4) \text{ predict } \hat{Y} = 1. \qquad (2)$$

In general, rules are of the form $\bigvee_i E_i$, where $E_i = \bigwedge_j S_{ij}$ and the $S_{ij}$ are simple relations involving individual predictors, such as a condition on $X_k$. Rules of this form are said to be in Disjunctive Normal Form (DNF) [Fleisher et al., 1983]. Though any logic expression can be written in DNF, the complexity of such an expression can be reduced considerably if logic expressions of other forms are allowed. Note, for example, that the condition in equation 2 can be reduced to $X_1 \wedge X_2 \wedge (X_3 \vee X_4)$. This latter expression is not in DNF, however.

For complicated problems, we may want to consider more than one logic tree at the same time. Thus, we can extend the classification model (equation 1), using a binomial likelihood, as

$$\mathrm{logit}(Y = 1 \mid X) = \beta_0 + \sum_{j=1}^{m} \beta_j L_j, \qquad (3)$$

where each of the $L_j$ is a separate logic tree.

In practice not all predictors may be binary. Continuous predictors can still be included in Logic Regression models by allowing terms that compare $X_i$ with a threshold $a$ to enter the model [Ruczinski, 2000]. Alternatively, we can include continuous predictors in a regression model, in addition to logic terms, as we did for the GAW data (see Application to the GAW Data).

Using model selection, in addition to a stochastic model building strategy, is of critical importance, as the logic tree with the best score typically overfits the data. A variety of methods of model selection using cross-validation and randomization tests exist [Ruczinski, 2000]. For the GAW data we have replicate data; thus we decided to fit our models on one replicate (training set) and validate them on another replicate (test set).

LOGIC REGRESSION AND GENETIC DATA

In this paper we use the Logic Regression methodology to explore the relationship between single-nucleotide polymorphisms (SNPs) in sequence data and response variables related to a disease outcome. For sequence data we create two predictors for each site with zero, one or two variant alleles.
Let $X_R = 1$ if the site has two variant alleles and $X_R = 0$ if it has zero or one variant allele, and let $X_D = 1$ if the site has one or two variant alleles and $X_D = 0$ otherwise. As the selection of a logic tree takes place in an adaptive manner, whether $X_D$ or $X_R$ ends up in the logic tree indicates whether a dominant model or a recessive model, respectively, best fits the data for this site. As the adaptive methodology removes unnecessary details from the tree, $X_D$ and $X_R$ will not end up in the same logic tree, since $X_D \wedge X_R \equiv X_R$ and $X_D \vee X_R \equiv X_D$, so that the search algorithm automatically reduces such a branch to $X_R$ or $X_D$. When more than one logic tree is fit, $X_D$ can appear in one logic tree and $X_R$ in another, effectively fitting an additive or multiplicative model.

In the application to the GAW data we ignore the family structure of the data, and opt for a direct application of the Logic Regression algorithm to a binary and a continuous response. Application of the Logic Regression algorithm to a model that incorporates family data requires identification of a score function (likelihood) for a logic model with such structure; ignoring the family structure was primarily a matter of convenience. Since we carried out model selection using a test set that satisfies the same family structure as the training set, ignoring the family structure only affects the efficiency of our approach.

APPLICATION TO THE GAW DATA

In analyzing the GAW data we decided up front that we would use (part of) the first 25 replicates as training data and the second 25 replicates as an independent test set. Using an independent test set simplifies the model selection. We applied the Logic Regression algorithm to the 25th replicate data set for the general population. We used the 42nd replicate as the test data set and ignored the family structure. We have sequence data for 1,000 persons. We repeated part of our experiments on a few other replicates and found very similar results.

We processed the sequence data for all of the first 25 replicates, keeping only those sites for which, among the people that have sequence information, fewer than 98% of the persons had zero variant alleles and fewer than 98% had two variant alleles. This left us with 694 sites on the 7 genes combined, which were recoded into $2 \times 694 = 1{,}388$ predictors using the scheme detailed in the previous section. In the remainder we identify sites and coding of variables as follows: Gi.D.Sj refers to site j on gene i using dominant coding, i.e., Gi.D.Sj = 1 if at least one variant allele exists. Similarly, Gi.R.Sj refers to site j on gene i using recessive coding, i.e., Gi.R.Sj = 1 if two variant alleles exist. We identify complements by the superscript $c$, e.g., Gi.D.Sj$^c$.

Affected status. As our primary response variable we used the affected status. We fitted a logistic regression model of the form

$$\mathrm{logit}(\text{affected}) = \beta_0 + \beta_1\,\text{environ}_1 + \beta_2\,\text{environ}_2 + \beta_3\,\text{gender} + \sum_{i=1}^{K} \beta_{i+3} L_i. \qquad (4)$$

Here gender was coded as 1 for female and 0 for male, $\text{environ}_j$, $j = 1, 2$, are the two environmental factors that were provided, and the $L_i$, $i = 1, \ldots, K$, are logic expressions based on the 1,388 predictors that were created from the sequence data.
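
The site filtering and the dominant/recessive recoding described in this section can be sketched as follows. This is only an illustration under stated assumptions: the function name `filter_and_recode`, its arguments, and the input layout (one row per person, one column per site, entries 0, 1 or 2 variant alleles, with every listed person having sequence information) are hypothetical and are not taken from the paper.

```python
def filter_and_recode(genotypes, site_labels, max_frac=0.98):
    """Drop near-monomorphic sites and build two 0/1 predictors per kept site.

    genotypes   : list of rows, one per person; each entry is the number of
                  variant alleles (0, 1 or 2) at a site (assumed layout).
    site_labels : list of (gene, site) pairs, one per column (assumed layout).
    Returns (predictors, names), where predictors is a list of 0/1 columns.
    """
    n = len(genotypes)
    predictors, names = [], []
    for j, (gene, site) in enumerate(site_labels):
        column = [row[j] for row in genotypes]
        frac_zero = sum(v == 0 for v in column) / n
        frac_two = sum(v == 2 for v in column) / n
        # Keep a site only if fewer than 98% have zero variant alleles
        # and fewer than 98% have two variant alleles.
        if frac_zero >= max_frac or frac_two >= max_frac:
            continue
        # Dominant coding: 1 if at least one variant allele.
        predictors.append([int(v >= 1) for v in column])
        names.append(f"G{gene}.D.S{site}")
        # Recessive coding: 1 if two variant alleles.
        predictors.append([int(v == 2) for v in column])
        names.append(f"G{gene}.R.S{site}")
    return predictors, names

# Toy usage: three sites, of which only the first is polymorphic enough.
genotypes = [[0, 0, 2], [1, 0, 2], [2, 0, 2], [0, 0, 2]]
labels = [(1, 10), (1, 20), (2, 30)]
predictors, names = filter_and_recode(genotypes, labels)
print(names)
```

Each kept site contributes exactly two predictors, which is why 694 sites yield 1,388 predictors; the recessive indicator can only equal 1 where the dominant indicator does.
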
Initially we fit models with $K = 1, 2, 3$, allowing logic expressions of at most size 8 on the training data. In Figure 2 we show the deviance of the various fitted Logic Regression models. As very likely the larger models overfit the data, we validated the models by computing the fitted deviance for an independent test set, keeping the models fixed at those selected. These results are also shown in Figure 2. From this figure we see that the models with three logic trees with a total of three and six leaves have the lowest test-set deviance. As the goal of the current investigation is to identify sites that are possibly linked to the outcome, we prefer the larger of these two models. In addition, when we repeated the experiment on a training set of five replicates and a test set of 25 replicates, the model with six leaves had a slightly lower test-set deviance than the model with three leaves. We carried out a randomization test, conditioning on the model with three leaves, to determine how much better a model with six leaves fits the data if a model with three leaves is the true model. As the improvement that we observed is much larger than what would be expected by chance, this confirmed that the model with six leaves fits the data better than a model with three leaves.

Figure 2. Training (solid) and test (open) set deviances (score) against the number of terms for Logic Regression models for the affected state. The numbers in the boxes indicate the number of logic trees.

The model with six leaves that was fitted on the single replicate is presented in Figure 3. The logistic regression model corresponding to this Logic Regression model is

$$\mathrm{logit}(\text{affected}) = 0.44 + 0.005\,\text{environ}_1 - 0.27\,\text{environ}_2 + 1.98\,\text{gender} - 2.09\,L_1 + 1.00\,L_2 - 2.82\,L_3.$$

All but the second environment variable in this model are statistically significant. Note that for all 1,000 persons with sequence data in replicate 25, site 76 on gene 1 is exactly the opposite of site 557, which was indicated as the correct site on the solutions (for example, a person with $v$ variant alleles on site 76 always has $2 - v$ variant alleles on site 557). Similarly, the Logic Regression algorithm identified site 5007 on gene 6, which is identical for all 1,000 persons to site 5782, the site which was indicated on the solutions. We note that the correct site on gene 1 appears twice in the logic tree model.
It appears once as a recessive coding (G1.R.S557) and once, effectively, as a dominant coding (G1.R.S76$^c$ $\equiv$ G1.D.S557 on this replicate) for site 557, suggesting that the true model may have been additive. When two sites are in (almost) perfect disequilibrium, as is the case for these sites, the Logic Regression algorithm may identify either one of these sites, as the algorithm cannot distinguish between them. The three remaining leaves in the model are all part of gene 2: two sites close to the ends of the gene and one site in the center.

Figure 3. Fitted Logic Regression model for the affected state data with three trees and six leaves. Variables that are printed white on a black background are the complement of those indicated.

Quantitative trait 5. In the solutions to the GAW data it was described that quantitative trait 5 (Q5) depended on the sequence data of gene 2, but the exact pattern of the mutations that were influencing this trait was not given. To investigate this further we decided to fit another Logic Regression model of the form of equation 4, with Q5 as the response variable, using linear regression. We would expect that this way we would be better able to find a more precise dependence on gene 2 than for the logistic model using affected status as the response. We carried out model selection identical to that for the logistic regression described above. While the solutions indicated that the (sequence) dependence of Q5 was only on gene 2, we allowed all genes in the model. The model that was selected had three logic trees and a total of seven leaves. The three trees were $L_1$ = (G.D.S85 G.D.S89 ) G.R.S8657, $L_2$ = G.D.S400 G.D.S4977, and $L_3$ = G.D.S4 G.D.S009. Thus, a model that depends on a large number of sites on gene 2 is fit. The solutions indicate that quantitative trait 5 indeed depends on gene 2 and not on the other genes. (Models with different numbers of trees or leaves for Q5 always exclusively depended on sites on gene 2.) Except for site 8657, all sites occurred in the fitted model in dominant coding. While this could be related to the way the data were generated, it is also possible that the data were generated using an additive model, as for six of the seven sites that were selected few people had two variants, and the power of selecting the recessive coding of a site is thus smaller than that of selecting the dominant coding of the same site.

DISCUSSION

Our analysis of the GAW data shows the potential usefulness of Logic Regression. While our algorithm was provided with data on hundreds of predictors (sites), it correctly picked out those few that were the responsible sites in the underlying model. No tweaking of the algorithm was needed to achieve these results. In applying Logic Regression, it is advantageous to use raw sequence data, rather than data that have been aggregated as haplotypes, as such predictors would yield few categorical variables with many levels, while sequence data yield many variables with few levels. In combining these variables the Logic Regression algorithm, effectively, determines which levels of the haplotype are associated with disease.
ACKNOWLEDGEMENTS

We thank Sue Li for helpful discussions. Research supported in part by NIH grant CA 74841.

REFERENCES

Aarts EHL, Korst JHM. 1989. Simulated annealing and Boltzmann machines. New York: Wiley.

Breiman L, Friedman JH, Olshen RA, et al. 1984. Classification and regression trees. Belmont, California: Wadsworth.

Chipman H, George E, McCulloch R. 1998. Bayesian CART model search (with discussion). J. Am. Statist. Assoc. 93:935-60.

Fleisher H, Tavel M, Yeager Y. 1983. Exclusive-or representation of Boolean functions. IBM J. Res. Develop. 27:412-6.

Ruczinski I. 2000. Logic Regression and statistical issues related to the protein folding problem. Ph.D. thesis. Seattle: University of Washington, Dept. of Statistics.

Ruczinski I, Kooperberg C, LeBlanc ML. 2001. Logic Regression. Technical report. Seattle: Fred Hutchinson Cancer Research Center.