BOSTON UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES. Dissertation NEURAL NETWORK AND BIOINFORMATIC DESIGNS


BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES

Dissertation

NEURAL NETWORK AND BIOINFORMATIC DESIGNS FOR PREDICTING HIV PROTEASE INHIBITOR RESISTANCE

by

MATTHEW WOODS
B.S., University of Michigan, 1996

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2007

Acknowledgements

I would like to thank Dr. Gail Carpenter for her guidance over the years, and for providing me with the opportunity to pursue the research that led to this document. My thanks also go out to Dr. Stephen Grossberg for creating the environment in which this kind of interdisciplinary work is possible. I thank Dr. Alexander Macalalad for the enlightening discussions, and for coming to us with the data, and adding yet another twist to my career path. I give my thanks to the many influential teachers I have had in the past, including Linda Goodwin, Jim Power, Jens Zorn, and Dawei Dong. Finally, I thank my parents, Bon, Vin, and Elaine, for their love and support.

NEURAL NETWORK AND BIOINFORMATIC DESIGNS FOR PREDICTING HIV PROTEASE INHIBITOR RESISTANCE

(Order No. )

MATTHEW WOODS

Boston University Graduate School of Arts and Sciences, 2007

Major Professor: Gail A. Carpenter, Professor of Cognitive and Neural Systems and Mathematics

ABSTRACT

A variety of treatment options is now available for patients infected with Human Immunodeficiency Virus (HIV). Often, antiviral treatments do not lead to complete suppression of the virus, due to the rapid development of drug-resistant mutations in the viral genome. For some patterns of mutations, the degree of resistance to some or all of the available antiviral drugs, including protease inhibitors, has been measured in vitro. These measurements can aid in the choice of antiviral treatment options against a viral subtype containing one of these patterns of mutations. However, experimental testing to determine resistance values for all possible variants is combinatorially prohibitive.

The primary goal of this thesis is to produce a computational system that learns from a collection of genetic sequences with known drug resistance values, and estimates resistance values for sequences that have not been tested. The neural network Analog ARTMAP, which learns nonlinear multi-dimensional continuous-valued maps, is introduced and applied to estimate protease inhibitor resistance from viral genotypes. A feature selection method is also introduced and applied to these trained networks, producing insights into the mutation locations that are most predictive of resistance. The nonlinearity of the maps learned by the networks allows the feature selection method to detect genetic positions that contribute to resistance both alone and through interactions with other positions. This method has identified positions 35, 37, 62 and 77, for which traditional linear feature selection methods have not detected a contribution to resistance.

At several positions in the protease gene, mutations confer differing degrees of resistance, depending on the specific amino acid to which the sequence has mutated. To test for these positions, an Amino Acid Space is introduced to represent genes in a vector space that faithfully captures the functional similarity between amino acid pairs. Feature selection identifies several new positions, including 36, 37, and 43, with amino acid-specific contributions to resistance. Accordingly, Analog ARTMAP networks applied to inputs that represent specific amino acids at these positions perform better than networks that use only mutation locations. The coefficients of correlation between predictions and ground truth increase by 1-9%.

Table of Contents

Chapter Summary
1. Background and Literature Review
    HIV Structure and Replication Cycle
    The HIV Protease Enzyme
    Protease Inhibitors
    Viral Resistance to Protease Inhibitors
    Testing for Viral Resistance to Protease Inhibitors
    Predicting HIV Resistance to Protease Inhibitors from Viral Genotype: Previous Computational Approaches
    Representing the Genetic Sequences in Machine Learning Applications
2. A Computational System for Predicting HIV-1 Protease Inhibitor Resistance from Viral Genotype
    Preparation of the HIV Protease Inhibitor Resistance Data
    Encoding HIV Protease Genes as Vectors
    At Which Positions in the Protease Gene Do Mutations Affect Drug Resistance?
    Mutual Information Profiles
    Optimal Percent Correct Scores
    Analog ARTMAP Feature Selection Results
    At Which Positions is Resistance Differentially Influenced by the Specific Amino Acids
    2.4.1 Optimal Percent Correct Difference Scores
    Feature Selection Results Relative to Protease Gene Position Sets Described in the Literature
    Using the Results of Feature Selection to Enhance the Prediction of Protease Inhibitor Resistance
    Novel Mutations Associated with Protease Inhibitor Resistance and Hypersusceptibility
3. Analog ARTMAP
    Analog ARTMAP: An ART Neural Network Architecture for Regression
    The Performance of Analog ARTMAP on Benchmark Problems
4. Feature Selection
    Feature Selection for Analog ARTMAP
    The Performance of Analog ART Feature Selection on Sample Problems
5. Amino Acid Spaces
    Amino Acid Spaces
    Permutation Distance
    Modified Permutation Distance
    An Example of the Modified Permutation Distance Using Two Protein Scoring Matrices
    Use of Modified Permutation Distance for the Evaluation of Amino Acid Spaces
    Constructing an Amino Acid Space that Preserves the Topology of a Protein Scoring Matrix
Appendix A: Protease Genes Containing Mutations K43N, M36L, N37Q, or N37K
Appendix B: Analog ARTMAP Notation and Parameters
Appendix C: Proof of Equation
Appendix D: The Analog ARTMAP Algorithm
References
Curriculum Vitae

List of Tables

2.1 Protease inhibitor resistance data: Number of sequences by laboratory of origin and division into training, validation, and test sets
Vigilance baseline parameter values used for feature selection
Optimal Percent Correct Difference analysis of position 71 relative to Amprenavir
The 10 positions with highest Optimal Percent Correct Difference scores for each drug
Positions previously associated with resistance
Minimal supersets of positions previously associated with resistance
Positions previously associated with resistance and MSS positions ranked by Analog ARTMAP feature selection D values
Position sets given an Amino Acid Space representation in trials
Representation and vigilance baseline leading to the highest correlation coefficient on the validation sets for each of the seven protease inhibitors
Test set results
Saquinavir resistance to sequences containing mutations at position
Indinavir resistance to sequences containing mutations at position
Indinavir resistance to sequences containing mutations at position
2.14 Indinavir resistance to sequences containing mutations at position
Amprenavir resistance to sequences containing mutations at position
Lopinavir resistance to sequences containing mutations at position
The BLOSUM62 protein scoring matrix
Blomap five-dimensional Amino Acid Space
Sample entries from PAM250 and BLOSUM62 protein scoring matrices
An eight-dimensional Amino Acid Space that preserves the ordinal relations of the BLOSUM62 protein scoring matrix
A Protease genes containing mutations K43N, M36L, N37Q, or N37K
B1 Analog ARTMAP notation
B2 Analog ARTMAP parameters

List of Figures

1.1 Mature HIV-1 virion
1.2 HIV infected CD4+T lymphocyte
1.3 HIV replication cycle
1.4 HIV-1 protease secondary structure and functional regions
1.5 Protease inhibitor chemical structure and FDA approval dates
1.6 HIV-1 protease in complex with the protease inhibitor IDV (Indinavir) shown with the locations of several resistance-conferring mutations
1.7 Typical drug response curves: wild type vs. mutated
Histograms of log10 Resistance Factors shown with cutoff values
Protease gene profiles
Positions in the protease gene ranked according to Analog ARTMAP feature selection D values
Comparison of network performance on validation sets for 15 representational schemes per drug
Network predictions on the test sets
Analog ARTMAP network architecture
Target surface and training set for benchmarking Analog ARTMAP
Network predictions and category boxes and scatter plot of network predictions vs. ground truth for benchmarking Analog ARTMAP
3.4 Ground truth and test set results and scatter plot of network predictions vs. ground truth for benchmarking Analog ARTMAP
Plane: Feature selection test set
Cosine: Feature selection test set
XOR: Feature selection test set
Feature selection methods comparison
Cardinal and ordinal reconstructive error curves for the production of Amino Acid Spaces

List of Abbreviations

A             Alanine
AA            Amino Acid
AA Sp.        Amino Acid Space
AAPAR         The set of positions in the protease gene at which the specific amino acids may contribute differentially to resistance according to the results summarized in the literature (i.e., 33, 46, 47, 50, 54, 63, 71, 82, 88, and 93)
AIDS          Acquired Immunodeficiency Syndrome
APV           Amprenavir
ATV           Atazanavir
BLOSUM        Blocks Substitution Matrix
C             Cysteine
C.R.          Compression Ratio
CAM           Content Addressable Memory
D             Aspartic Acid
DNA           Deoxyribonucleic Acid
E             Glutamic Acid
F             Phenylalanine
FR            Fold-Resistance
G             Glycine
H             Histidine
HAART         Highly Active Antiretroviral Therapy
HIV           Human Immunodeficiency Virus
I             Isoleucine
IC50          Inhibitory Concentration 50
IDV           Indinavir
IG-CAM        Increase Gradient Content Addressable Memory
K             Lysine
L             Leucine
LPV           Lopinavir
M             Methionine
MI            Mutual Information
MSE           Mean Squared Error
MSS           Minimal Superset
N             Asparagine
NFV           Nelfinavir
NNRTI         Non-nucleoside Reverse Transcriptase Inhibitors
NRTI          Nucleoside Reverse Transcriptase Inhibitors
O.P.C.        Optimal Percent Correct
O.P.C.D.      Optimal Percent Correct Difference
O.P.C.D.>1%   Positions in the protease gene with O.P.C.D. scores greater than one percent relative to a given protease inhibitor.
O.P.C.D.3     The three positions in the protease gene with greatest O.P.C.D. scores relative to a given protease inhibitor.
P             Proline
PAM           Percent Accepted Mutations
PAR           Previously Associated with Resistance
Q             Glutamine
R             Arginine
RF            Resistance Factor
RNA           Ribonucleic Acid
RT            Reverse Transcriptase
RTV           Ritonavir
S             Serine
SOFM          Self-Organizing Feature Map
SQV           Saquinavir
T             Threonine
V             Valine
Val.          Validation Set
W             Tryptophan
Y             Tyrosine

Chapter Summary

Chapter One provides the background for the problem of predicting HIV-1 protease inhibitor resistance from viral genotype. A brief overview of HIV biology and the role of protease in the replication cycle of the virus is presented. Protease inhibitors and the development of protease inhibitor resistance are also discussed, and previous computational approaches to the prediction of resistance from viral genotype are reviewed.

In Chapter Two the multivariate regression and feature selection capabilities of Analog ARTMAP (developed in Chapters Three and Four) are brought to bear on the problem of learned estimation of protease inhibitor resistance from a data set of HIV protease genes encoded with the Amino Acid Space representation (developed in Chapter Five). A system capable of estimating protease inhibitor resistance from viral genotype is constructed and analyzed. This analysis produces several new predictions of positions in the protease gene that may contribute to resistance, and of specific mutations that may confer beneficial hypersusceptibility to some of the protease inhibitors.

In Chapter Three the novel neural network architecture Analog ARTMAP is presented and developed. This network is a member of the Adaptive Resonance Theory (ART) neural network family, and is most closely related to default ARTMAP (Carpenter, 2003). New design features of Analog ARTMAP enable the network to learn mappings between input vectors and continuous-valued, multidimensional vector outputs. As such, the Analog ARTMAP architecture extends the classification capabilities of default ARTMAP into the domain of multivariate regression problems.

In Chapter Four a novel method for feature selection and dimensionality reduction is presented and developed. This method provides a means for evaluating the predictive utility of individual features of a data set presented to an Analog ARTMAP neural network.

In Chapter Five a novel means of encoding gene sequences as vectors is proposed and developed. This encoding represents gene sequences in such a way that the functional similarity between the amino acids that the sequences encode is faithfully captured by the metric of the vector space. An eight-dimensional vector space that preserves the topology of the pair-wise functional similarity between amino acids, as measured by the BLOSUM62 protein scoring matrix, is assigned to each position in a gene at which consideration of the specific amino acid is desirable. This Amino Acid Space representation of genetic sequences has potential applications to a variety of supervised and unsupervised learning problems within genetics, as well as to the prediction of HIV protease inhibitor resistance from viral genotype.

Figure 1.1. Mature HIV-1 virion. The structure of the virus is depicted with several viral proteins and enzymes labeled. (Henderson et al., 2006)

1. Background and Literature Review

This chapter provides a brief introduction to, and reviews the context of, the problem of predicting protease inhibitor resistance from HIV genotype. In the final two sections of the chapter, previous applications of machine learning systems to this problem and methods of representing genes are discussed. An understanding of the motivation for these applications requires some familiarity with the mechanism of protease inhibitor resistance, which in turn requires a basic familiarity with HIV biology, the role of protease in the HIV replication cycle, and the effects of protease inhibitors on the function and evolutionary trajectories of the virus.

1.1. HIV Structure and Replication Cycle


Figure 1.2. HIV infected CD4+T lymphocyte. An electron microscopic image of an infected CD4+T lymphocyte is shown with attached viral particles in blue. (Huang et al., 2002)

The Human Immunodeficiency Virus (HIV) (Figure 1.1) is a retrovirus associated with the Acquired Immunodeficiency Syndrome (AIDS). The mature virus is approximately spherical with a diameter of 120 nm (Gentile et al., 1994). Like all retroviruses, HIV contains a genome in the form of two identical copies of single-stranded Ribonucleic Acid (RNA). The genome is enclosed within the conically shaped capsid, which is itself contained within the plasma membrane of the virus. Two transmembrane proteins on the surface of the virus, gp120 and gp41, allow the virus to fuse to host cells such as CD4+T lymphocytes (Figure 1.2) during the course of HIV infection (Wyatt et al., 1998).

Once the virus has attached to the host cell, the viral RNA is introduced to the intracellular fluid where the viral enzyme reverse transcriptase (RT) helps copy it to Deoxyribonucleic Acid (DNA). The pro-viral DNA is then transported to the nucleus of the host cell and joined with the cell's genetic material through the action of the viral enzyme integrase. Expression of the genetic material of the infected cell then produces the material essential for replication of the virus. The polyprotein gp160 is assembled in this fashion before being transported to the Golgi apparatus where it is cleaved into the proteins gp120 and gp41 by the protease enzyme. As new viral particles bud off from the infected cell, the gp120 and gp41 proteins are transported to the cell membrane where gp41 acts to anchor gp120 to the membrane of the forming virion. As the virus matures, the polyproteins p55 and p160 are also cleaved into functional units by protease before their encapsulation in the viral membrane (Figure 1.3). In all, HIV-1 protease recognizes and cleaves nine different peptide sequences (Shafer, 2002).

Figure 1.3. HIV replication cycle. The principal stages of the viral replication cycle are depicted. Binding of the virus with the host cell is shown on the left. This is followed by the reverse transcription of viral RNA into DNA through the action of reverse transcriptase. The pro-viral DNA is then integrated into the host DNA through the action of integrase. The expressed products are cleaved into functional units by protease and transported to the site of the budding viral particle on the right. (Huang et al., 2002)

1.2. The HIV Protease Enzyme

Figure 1.4. HIV-1 protease secondary structure and functional regions. The structure of the protease enzyme is shown with labels indicating the three functional domains: the flaps, the active site, and the dimerization domain. Yellow arrows indicate β-strands and red cylinders indicate α-helices. (Prasanna et al., 2005; Prasanna et al., 2006)

HIV-1 protease is a homodimeric enzyme, meaning that it is composed of two identical subunits (monomers) consisting of 99 amino acids each. As depicted in Figure 1.4, there are three regions of primary importance to the function of the molecule: the active site, the flap, and the dimer interface. The active site, a hydrophobic region of the protein responsible for recognizing and cleaving polyproteins during the replication cycle of the virus, consists of the amino acid residues Asp25, Thr26, and Gly27, and is highly conserved (i.e., highly unlikely to contain mutations) (Wlodawer et al., 1998). Residues form a flexible flap which closes over the active site and holds the polyproteins and their products in place during binding (Shao et al., 1996). The two monomers making up the functional protease are bound together at the highly conserved dimer interface, consisting of residues 1-4 and (Pettit et al., 2002).

1.3. Protease Inhibitors

The critical role played by protease in the replication cycle of HIV has made it one of the major targets for antiretroviral drugs. Eight protease inhibitors have now gained FDA approval for the treatment of HIV infection (Figure 1.5). These inhibitors act by binding to the active site of protease, and preventing it from accomplishing its normal function. Crystallographic and nuclear magnetic resonance techniques that have revealed the structure of protease both in its native conformation and in complex with inhibitors have made unprecedented contributions to the development of new antiviral drugs (Mahalingam et al., 2004). Indeed, in the history of drug development, HIV protease inhibitors are seen as the forefront of "rational drug design" (Wlodawer et al., 1998).

Figure 1.5. Protease inhibitor chemical structure and FDA approval dates. The chemical structures of the seven protease inhibitors included in this study are shown: NFV (Nelfinavir), SQV (Saquinavir), IDV (Indinavir), RTV (Ritonavir), APV (Amprenavir), LPV (Lopinavir), and ATV (Atazanavir). The table indicates the FDA approval dates of these protease inhibitors and the most recently approved protease inhibitor Tipranavir.

Protease Inhibitor    FDA Approval Date
Saquinavir            12/7/95
Ritonavir             3/1/96
Indinavir             3/14/96
Nelfinavir            3/14/97
Amprenavir            4/15/99
Lopinavir             9/15/00
Atazanavir            6/20/03
Tipranavir            6/22/05

Figure 1.6. HIV-1 protease in complex with the protease inhibitor IDV (Indinavir), shown with the locations of several resistance-conferring mutations. The structure of the HIV-1 protease dimer is depicted as a ribbon with the active site shown as a purple ball-and-stick model. The IDV molecule is shown as a space-filling model in yellow. The locations of several mutations known to confer resistance to IDV treatment are labeled on the right monomer. (Shafer, 2002)

1.4. Viral Resistance to Protease Inhibitors

The error-prone nature of the process by which the viral RNA is copied into the DNA of the host cell results from the inability of reverse transcriptase to proofread its genetic products, and leads to a much higher rate of mutation in the viral population (between $10^{-4}$ and $10^{-5}$ mutations per base pair per replication cycle) than in multicellular organisms (Mansky, 1998). Genetic variability also results from recombination in instances when a single host cell is simultaneously infected by multiple genetically distinct viral particles (Zhuang et al., 2002). It is estimated that every possible point mutation (a mutation involving the insertion, deletion, or substitution of a single amino acid anywhere in the viral genome) occurs between $10^{4}$ and $10^{5}$ times per day in an untreated person infected with HIV (Coffin, 1995).

Introduction of antiviral drugs to the blood plasma of an infected individual changes the selective pressures at work on the viral population. Incomplete suppression of the virus in the presence of antiviral drugs, whether stemming from patient nonadherence to therapy or from viral subpopulations protected from the action of the drugs in sanctuary sites such as the central nervous system and the testes (Pomerantz, 2002; Crommentuyn et al., 2005), can lead to a temporary reduction in the viral load at the onset of treatment followed by a rebound to pre-treatment levels as drug-resistant mutations develop and become prevalent in the viral population (Figure 1.6). The importance of sustained reduction of the viral load for preventing the development of drug-resistant mutations has contributed to the American Medical Association's adoption of Highly Active Antiretroviral Therapy (HAART), in which a cocktail of antiviral drugs from multiple categories (including protease inhibitors, nucleoside or nucleotide reverse transcriptase inhibitors (NRTIs), and/or non-nucleoside reverse transcriptase inhibitors (NNRTIs)) is prescribed (Yeni et al., 2002).
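To give a rough sense of scale for the per-base mutation rates quoted above, the following minimal sketch estimates how often a single copy of the protease coding region would be expected to acquire a mutation in one replication cycle. It assumes that the 99-residue monomer is encoded by 99 x 3 = 297 nucleotides and that errors at different bases are independent; the function names are illustrative only and do not come from the dissertation.

```python
# Back-of-envelope estimate, assuming independent errors at each base.
PROTEASE_CODONS = 99                 # residues per protease monomer (from the text)
BASES_PER_CODON = 3                  # standard genetic code
PROTEASE_BASES = PROTEASE_CODONS * BASES_PER_CODON  # 297 nucleotides


def expected_mutations(rate_per_base: float, n_bases: int = PROTEASE_BASES) -> float:
    """Expected number of mutations in the region per replication cycle."""
    return rate_per_base * n_bases


def prob_at_least_one(rate_per_base: float, n_bases: int = PROTEASE_BASES) -> float:
    """Probability that at least one base in the region is miscopied."""
    return 1.0 - (1.0 - rate_per_base) ** n_bases


for rate in (1e-5, 1e-4):            # the range quoted from Mansky (1998)
    print(f"rate={rate:.0e}  expected={expected_mutations(rate):.4f}  "
          f"P(>=1)={prob_at_least_one(rate):.4f}")
```

Under these assumptions, only a small fraction of protease gene copies changes in any one cycle, so the rapid emergence of resistance described above depends on the very large number of replication events per day rather than on a high per-copy error probability.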

Figure 1.7. Typical drug response curves: wild-type vs. mutated. The in vitro response of two strains of HIV-1 to antiviral drugs is shown. Viral replication decreases as a function of drug concentration for both the wild-type (reference, or consensus, sequence), and for a drug-resistant (mutated) strain. The Inhibitory Concentration 50 (IC50) values (the amount of drug needed to reduce viral replication by 50%) are indicated for both strains. In this example, the IC50 values for the wild-type and the resistant strain are 1.3 and 28.6, respectively. The Resistance Factor for the mutated strain, defined as the ratio of the IC50 of the resistant strain to the IC50 of the reference sequence, is thus ~22. Image adapted from Beerenwinkel et al. (2001).

1.5. Testing for Viral Resistance to Protease Inhibitors

The efficacy of a HAART regimen can be increased when the choice to include an antiviral drug is informed by knowledge of the degree of resistance the patient's viral population has to the available drugs. Testing for resistance before administering antiviral drugs therefore increases the likelihood of treatment success. Computational systems, such as the one presented in this research, that estimate resistance on the basis of viral genotype can increase the accuracy of these resistance tests.

Two types of tests that measure the resistance of an HIV strain to an antiviral drug are now in use: genotypic and phenotypic. In genotypic testing, a blood sample is taken from a patient and the viral genome is sequenced. The genome is then examined directly for known resistance-conferring mutations, and appropriate treatment is decided accordingly. In phenotypic testing, the impact of antiviral drugs is tested in vitro on viral particles extracted from either peripheral blood mononuclear cells or blood plasma from the patient (Hanna et al., 2001). Resistance is measured with the Inhibitory Concentration 50 (IC50), defined as the amount of drug necessary to decrease viral replication by 50% (Figure 1.7). The Resistance Factor (RF) or Fold-Resistance (FR) of a viral strain is the ratio of the IC50 of the strain to the IC50 of the wild-type (also known as the consensus sequence or reference sequence).

The clinical utility of resistance testing has been examined in a number of studies (for discussions see Perrin et al., 1998; Saag, 2001; Haubrich et al., 2001). The VIRADAPT (Durant et al., 1999) and GART (Baxter et al., 2000) trials showed an improvement in the response to salvage therapies informed by genotypic testing for patients with a history of failed antiviral treatment. The HAVANA trials (Tural et al., 2002) compared genotyping, and genotyping combined with expert opinion, against standard of care treatment. They found a significant short-term benefit from genotyping, with the greatest benefit when genotyping was combined with expert advice. The ARGENTA trials (Cingolani et al., 2002) also found short-term benefit for patients with salvage therapies informed by genotyping, but the benefit did not persist beyond six months. An analysis carried out in Weinstein et al. (2001) demonstrated the cost effectiveness of resistance testing. Despite the qualified nature of these results, the benefits of resistance testing for patient prognosis are now widely recognized, and resistance testing for recently infected patients (to detect the transmission of drug-resistant strains of the virus), pregnant women, or patients with a history of failed antiviral therapy has now become the recommended standard of care (Hirsch et al., 2003).

Phenotypic resistance tests are more costly and time consuming than genotypic tests, requiring 8-10 days, while genotypic tests can be completed in as few as 2 days (Hanna et al., 2001). Genotypic tests, however, require interpretation, for which there exist a variety of algorithms, rule-based systems, and machine learning applications to augment expert opinion.

1.6. Predicting HIV Resistance to Protease Inhibitors from Viral Genotype: Previous Computational Approaches

Several computational approaches to using the viral genotype to predict the resistance of an HIV strain to a given drug have been published. Methods can be classified as rule-based or machine learning. Rule-based interpretation systems, such as the HIVDB algorithm implemented on the Stanford HIV resistance database web site (Shafer et al., 2002) and work at U.C. Irvine (Lathrop et al., 1999), predict resistance and provide treatment recommendations on the basis of genotypic information by incorporating rules extracted from the literature. In the approach of Lathrop et al., treatments are prescribed on the basis of the predicted resistance both of the dominant viral strains with which the patient is infected, and of mutants close to these (in Hamming distance) that are likely to emerge in response to treatment. While these approaches have led to considerable clinical success, a rule-based program has limited ability to generalize from known patterns of resistance to novel sequences in a way that could capture possible nonlinear interactions between mutations at different loci.

Machine learning problems, and the techniques used to solve them, belong to three categories. In unsupervised learning, or clustering problems, the goal is to divide a data set into a small number of groups or clusters, where the data points assigned to a given cluster are, in some sense, similar to one another. In supervised learning problems, each data point is associated with a class label, and the object is to generalize from learned exemplars and assign class labels to data points not contained in the training set. In regression problems, data points are associated with a continuous-valued, possibly multi-dimensional, output. The object is to generalize from a training set, and estimate the values associated with data points not contained in the training set.

Supervised learning systems, including support vector machines (Beerenwinkel et al., 2001, 2002, 2003), decision trees (Sevin et al., 2000; Beerenwinkel et al., 2001, 2002), and linear discriminant analysis (Sevin et al., 2000), have demonstrated considerable success at predicting resistance from genotype. Neural networks, including unsupervised Self-Organizing Feature Maps (SOFM) (Drăghici and Potter, 2003) and supervised Backpropagation Multi-layer Perceptrons (Wang and Larder, 2003), have also been applied. Drăghici and Potter use structure-based and sequence-based data mining to predict resistance to the protease inhibitors Indinavir and Saquinavir. The structure-based approach predicts tertiary protein structure and binding sites, while the sequence-based method applies a SOFM to a vector space representation of the protease genes.

The research presented in this dissertation contributes a new machine learning system for the prediction of protease inhibitor resistance, with overall accuracy comparable to other published systems. Because different machine learning systems will tend to make different predictive errors, the application of as many systems as possible to the prediction of protease inhibitor resistance will lead to a more accurate understanding of the problem, and put more effective computational tools at the disposal of clinicians. The use of Analog ARTMAP to generate and test hypotheses about the contribution to protease inhibitor resistance made by mutations at specific positions goes beyond machine learning into the realm of data mining. By addressing the question "What features do the networks rely on to make their predictions?", this research uses Analog ARTMAP not only as a computational means of estimating the resistance of novel HIV mutants to protease inhibitors, but as an exploratory tool in the broader context of research aimed at furthering the understanding of the relationship between viral mutation and its effect on antiviral treatment.
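The distinction drawn in this section between classification and regression can be made concrete with the Resistance Factor defined in the discussion of phenotypic testing. The sketch below is illustrative only: it computes the RF from the two IC50 values quoted in the Figure 1.7 caption, takes its base-10 logarithm as a continuous regression target, and thresholds it into a class label. The cutoff of 10 and all function names are hypothetical placeholders, not values or methods taken from the dissertation.

```python
import math


def resistance_factor(ic50_strain: float, ic50_wild_type: float) -> float:
    """RF (fold-resistance): ratio of the strain's IC50 to the wild-type IC50."""
    if ic50_strain <= 0 or ic50_wild_type <= 0:
        raise ValueError("IC50 values must be positive")
    return ic50_strain / ic50_wild_type


def regression_target(rf: float) -> float:
    """Continuous target for a regression system: log10 of the RF."""
    return math.log10(rf)


def class_label(rf: float, cutoff: float) -> str:
    """Discrete target for a classifier: threshold the RF at a cutoff."""
    return "resistant" if rf >= cutoff else "susceptible"


HYPOTHETICAL_CUTOFF = 10.0           # placeholder; real cutoffs are drug-specific

rf = resistance_factor(ic50_strain=28.6, ic50_wild_type=1.3)
print(f"RF = {rf:.1f}")
print(f"log10(RF) = {regression_target(rf):.2f}")
print(f"label = {class_label(rf, HYPOTHETICAL_CUTOFF)}")
```

A classifier sees only the thresholded label, whereas a regression system is trained on the continuous value and so retains how far a strain lies from the cutoff.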

30 Representing Genetic Sequences in Machine Learning Applications The application of machine learning techniques (clustering, classification, or regression) to databases of genetic samples requires a decision as to how the genes will be represented as vectors. The most appropriate choice of representational scheme depends on the specifics of the problem. For example, genes can be represented as vectors with a set of entries corresponding to the physical or chemical properties of each of the amino acids encoded by the gene (Wu and McLarty, 2000). This may be useful in situations where there is prior knowledge indicating that the investigation of specific kinds of interactions is likely to produce results. On the other hand, genes can be given a very compact representation as binary strings with each digit indicating the presence of absence or deviation from a pre-specified reference sequence. The relatively low dimension of this representation may be useful in machine learning applications to sparse data sets due to the so-called "curse of dimensionality" (Hastie et al., 2001). In the applications to predicting protease inhibitor resistance described above, two representational schemes are used. Drăghici and Potter (2003) assign a single dimension to each amino acid in the protease enzyme. In this approach, a position not containing a mutation is represented with a zero. The N mutations found at a given position are ranked by frequency of occurrence, and assigned values 1/N, 2/N,,1, with the most common mutation assigned a value of 1/N, and the least common a value of 1. This representational scheme has the advantage of taking into consideration the specific amino

acids in the mutated sequences, but it does not allow for a natural generalization to novel strings that contain a residue at a given position not found in the training set.

In all other machine learning applications described above, each position in the gene is represented with an indicator vector, in which each of the twenty amino acids is encoded as a 20-dimensional vector of 19 zeros and a single one. The location of the one in the vector indicates the identity of the amino acid found at that position in the gene. Some gene sequences have loci at which mixtures of amino acids have been reported, resulting either from ambiguous measurement or the presence of significant quantities of genetically distinct viral subtypes in the clinical sample. Representing genes with indicator vectors has the advantage that loci at which mixtures are reported have a natural representation as the sum of the indicator vectors that represent the amino acids in the mixture. However, indicator vectors have the disadvantage that by representing the set of amino acids with mutually orthogonal vectors, a machine learning system cannot meaningfully extrapolate to genes containing an amino acid not found at the corresponding position in any of the training sequences. Furthermore, the computational requirements are significantly increased by the use of 20 dimensions for each position included in the representation.

In this dissertation a novel means of representing gene sequences is presented that combines low dimensionality (eight dimensions per position) with a meaningful geometry. This Amino Acid Space is constructed in such a way that the pair-wise functional similarity of the amino acids is represented by the pair-wise distances of the vectors corresponding to the amino acids. The Amino Acid Space representation is used

to test hypotheses about individual positions in the protease gene and the contribution to protease inhibitor resistance of specific mutations found at these positions.
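The binary and indicator-vector representations discussed in this section can be illustrated with a minimal sketch. The function names, the five-residue reference fragment, and the sample are hypothetical and chosen only for illustration; the code does not reproduce the encoding software used in this work, and it does not implement the Amino Acid Space.

```python
# Illustrative sketch only: names and toy sequences are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the twenty amino acids


def binary_encoding(sequence, reference):
    """One entry per position: 1 if the residue deviates from the reference, else 0."""
    return [int(s != r) for s, r in zip(sequence, reference)]


def indicator_encoding(sequence):
    """Twenty entries per position.

    Each position is a string of one or more amino acids; a mixture is encoded
    as the sum of the indicator vectors of the amino acids it contains.
    """
    vector = []
    for residues in sequence:
        block = [0.0] * len(AMINO_ACIDS)
        for aa in residues:
            block[AMINO_ACIDS.index(aa)] += 1.0
        vector.extend(block)
    return vector


reference = ["P", "Q", "I", "T", "L"]  # hypothetical reference fragment
sample = ["P", "Q", "V", "T", "IV"]    # hypothetical sample; last locus is an I/V mixture

binary = binary_encoding(sample, reference)
indicator = indicator_encoding(sample)
print(binary)          # five entries, one per position
print(len(indicator))  # twenty entries per position
```

The sketch makes the dimensionality trade-off concrete: the binary code uses one entry per position, while the indicator code uses twenty, and two different amino acids at the same position always map to orthogonal indicator vectors.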

2. A Computational System for Predicting HIV-1 Protease Inhibitor Resistance from Viral Genotype

In this chapter, the neural network Analog ARTMAP (a novel network architecture described in detail in Chapter Three), the Analog ARTMAP feature selection method (introduced and developed in Chapter Four), and the eight-dimensional Amino Acid Space (introduced and developed in Chapter Five) are applied to a database of HIV-1 protease gene sequences. The method estimates Resistance Factors of a given HIV strain to the protease inhibitors Nelfinavir (NFV), Saquinavir (SQV), Indinavir (IDV), Ritonavir (RTV), Amprenavir (APV), Lopinavir (LPV), and Atazanavir (ATV). Analog ARTMAP neural networks are trained on a subset of the data to produce a system capable of estimating the Resistance Factors of an arbitrary protease gene sequence to these seven protease inhibitors.

Feature selection methods are used to address two related questions: 1) At which positions do mutations in the protease gene contribute to resistance? 2) At which positions do the specific amino acids produced by mutations contribute differentially to resistance? Each of these questions is addressed in two stages: hypothesis generation and hypothesis testing.

With regard to the first question, hypotheses are generated via feature selection from Analog ARTMAP networks trained on a subset of the data in which all 99 positions in the protease gene are represented. For each of the seven protease inhibitors, three sets of positions that may contribute to resistance are considered. The first set

includes all 99 positions in the gene. The second set of positions is extracted from the literature, and includes positions that have been previously identified as possibly or definitely contributing to protease inhibitor resistance. The third set is a superset of the positions extracted from the literature. In addition to the positions in the second set, the third set includes positions identified by Analog ARTMAP feature selection as being of equal or greater value to the prediction of resistance compared to any position included in the second set.

Table 2.1. Protease inhibitor resistance data: Number of sequences by laboratory of origin and division into training, validation, and test sets. Columns: Drug; Laboratory of Origin (Virologic, Virco, Averaged); Data Set Division (Train/Val., 580 unique; Test, 145 unique); Total Number of Sequences (725 unique). Rows: NFV, SQV, IDV, RTV, APV, LPV, ATV. For each protease inhibitor, the table shows the number of gene sequences in the data set from each laboratory of origin, Virco or Virologic. Where both labs have reported Resistance Factors for the same sequence, the logarithmic average of the two is used. The number of sequences included in the reserved test set is also indicated. The sequences in the Train/Validation set were further subdivided into training and validation sets for model parameter and representation selection independent of the test set.

With respect to the second question, hypotheses are generated by a measure of the gain in predictive utility of each position when the specific amino acid is considered vs. when it is not. This measure, called the Optimal Percent Correct Difference (O.P.C.D.), is used to generate two sets of positions for each drug: the set of the three positions with the greatest O.P.C.D. scores, and the set of positions with O.P.C.D. scores above one percent. A third set is composed of positions that have previously been described in the literature as possibly or definitely contributing differentially to resistance in dependence on the specific amino acid present in the mutation. For comparison, fourth and fifth sets are created in which the specific amino acids are considered at every included position, and at no positions, respectively.

Hypothesis testing for both questions is accomplished by comparing the performance of Analog ARTMAP networks trained and tested on 15 representations of the data for each drug. These 15 representations are created by choosing one of the three sets of positions to be included in the analysis, and one of the five sets of positions at which the specific amino acid is considered. The Amino Acid Space derived from the BLOSUM62 protein scoring matrix is used to represent positions at which the specific amino acids are considered.

An analysis of the results of this comparison highlights several positions in the protease gene not previously identified in the literature at which different mutations contribute differential degrees of protease inhibitor resistance. In particular, several mutations that may produce or contribute to hypersusceptibility to the protease inhibitors Saquinavir and Indinavir are identified.
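The 15 representations per drug arise as the Cartesian product of the three position sets and the five sets of positions at which the specific amino acid is considered. A minimal sketch of this enumeration follows; the label strings are shorthand introduced here for illustration and are not identifiers from the original software.

```python
from itertools import product

# Shorthand labels (assumptions for illustration) for the three sets of
# positions that may be included in the analysis.
POSITION_SETS = [
    "all 99 positions",
    "positions from the literature",
    "superset from Analog ARTMAP feature selection",
]

# Shorthand labels for the five sets of positions at which the specific
# amino acid is considered.
AMINO_ACID_SETS = [
    "three highest O.P.C.D. scores",
    "O.P.C.D. scores above one percent",
    "positions from the literature",
    "every included position",
    "no positions",
]

representations = list(product(POSITION_SETS, AMINO_ACID_SETS))
for position_set, amino_acid_set in representations:
    print(f"{position_set} | amino acids considered at: {amino_acid_set}")
```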

Figure 2.1. Histograms of log10 Resistance Factors shown with cutoff values. For each of the seven protease inhibitors, dashed lines indicate the RF cutoff values above which each drug is considered ineffective.

Preparation of the HIV Protease Inhibitor Resistance Data

The data set consists of a collection of protease gene sequences, and the available data on the resistance to each of the seven protease inhibitors for viral particles carrying each gene. During preprocessing of the data, the Resistance Factors (i.e., the ratio of the IC50 values of the mutated sequences to the IC50 value of the reference sequence) were replaced with their base 10 logarithms. In cases in which different RFs were reported for the same sequences by different labs, the multiple entries in the database were replaced by a single entry with the mean of the logarithms of the reported Resistance Factors. Gene sequences with missing or ambiguous amino acids were discarded. The resultant data set consists of 725 unique protease genes. Of these, 145 (20%) were chosen at

random and kept in reserve to be used as the final test set, leaving 580 sequences (80%) for training and validation. For many exemplars in the data set, the RF of a given protease gene has been reported relative to some but not all of the seven protease inhibitors. For each drug, Table 2.1 shows the number of sequences for which RFs are known in the training/validation set, the testing set, and the total. Figure 2.1 shows the logarithmic histogram of Resistance Factors for each drug. The vertical lines indicate the (log) cutoff values, defined as the clinically determined Resistance Factors above which treatment is not expected to succeed. The cutoff values used here are NFV: 3.25, SQV: 2.5, IDV: 2.75, RTV: 3, APV: 2.5, LPV: 3, and ATV:

Encoding HIV Protease Genes as Vectors

The generalization from known protease gene Resistance Factors to unknown ones is accomplished by training an ensemble of Analog ARTMAP networks for each of the protease inhibitors. Because each drug has its own learning system, the format in which the protease genes are encoded can be drug-specific. For each drug, the encoding rests on two questions: 1) At which positions do mutations influence drug resistance? 2) For which positions is resistance influenced by the specific amino acid vs. simply the presence or absence of a deviation from the reference sequence? Experimental evidence regarding these questions, as summarized by Shafer (2002), provides a reference point for the computational data analysis approach introduced here.
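The data preparation steps described earlier in this chapter (base 10 logarithms of the Resistance Factors, averaging the logarithms when several values are reported for one sequence, and a random 80/20 division into train/validation and test sets) can be sketched as follows. The record layout, function names, seed, and toy values are assumptions made for illustration; the discarding of sequences with missing or ambiguous amino acids is omitted.

```python
import math
import random
from collections import defaultdict


def log_average(records):
    """records: iterable of (sequence, resistance_factor) pairs for one drug.

    Returns one log10 Resistance Factor per unique sequence, taking the mean
    of the logarithms when a sequence has several reported values.
    """
    logs = defaultdict(list)
    for sequence, rf in records:
        logs[sequence].append(math.log10(rf))
    return {seq: sum(values) / len(values) for seq, values in logs.items()}


def reserve_test_set(sequences, test_fraction=0.2, seed=0):
    """Randomly reserve a fraction of the unique sequences as the final test set."""
    shuffled = sorted(sequences)           # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)  # the seed is an arbitrary assumption
    n_test = round(test_fraction * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy records: "seqA" was measured by two laboratories.
records = [("seqA", 10.0), ("seqA", 1000.0), ("seqB", 1.0)]
log_rf = log_average(records)

# With 725 unique sequences, a 20% reserve gives 145 test and 580 train/validation.
train_val, test = reserve_test_set([f"seq{i}" for i in range(725)])
print(log_rf, len(train_val), len(test))
```

Averaging in log space corresponds to taking the geometric mean of the reported Resistance Factors, so a single very large measurement does not dominate the combined value.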

At Which Positions in the Protease Gene Do Mutations Affect Drug Resistance?

The clinically important question "At which positions in the protease gene are mutations indicative of protease inhibitor resistance?" is addressed through feature selection from trained Analog ARTMAP networks. The results of this feature selection method are compared with the mutual information profiles (Section 2.3.1) of the gene, the marginal utility or optimal percent correct (Section 2.3.2) for each position, and the experimental findings as summarized in the clinical literature. Position 10 is used as an example to illustrate these methods.

Figure 2.2 shows resistance profiles for each of the seven protease inhibitors calculated with four methods (Rows a - d). The resistance profiles produced by these methods each convey different information about the contributions made by mutations at specific positions in the protease gene to protease inhibitor resistance.

Mutual information profiles select features that contribute to protease inhibitor resistance directly, rather than through interactions with mutations at other positions. When dichotomized resistance factors are used (Figure 2.2, Row a), the profiles provide a measure of the degree to which mutations at each position contribute to the development of sufficient protease inhibitor resistance to cause treatment failure. When the resistance factors themselves are used to generate the mutual information profiles (Figure 2.2, Row b), the profiles indicate the overall

contribution to resistance of mutations at each position, irrespective of whether or not that contribution will be at or near the resistance factor cutoff values that indicate the threshold of expected treatment success.

The Optimal Percent Correct and Optimal Percent Correct Difference score profiles (Figure 2.2, Row d) described below are useful primarily as a means of determining which positions contribute differentially to resistance in dependence on the specific amino acids to which they mutate. As an example, the calculation of these scores is illustrated for position 71.

The Analog ARTMAP feature selection D values (Figure 2.2, Row c) select features that contribute to resistance both directly, and through interactions with mutations at other positions. The combined use of mutual information profiles and Analog ARTMAP feature selection D values not only identifies mutations that contribute to protease inhibitor resistance, but provides a means for distinguishing between primary mutations, which contribute to resistance directly (e.g., at positions 48 and 90), and secondary mutations, which contribute to resistance through interactions with other positions (e.g., at positions 35 and 77).

2.3.1 Mutual Information Profiles

Mutual information (MI) provides a means of evaluating the importance of mutations at individual positions in the protease gene as predictors of the degree of protease inhibitor resistance. In contrast with the method of selecting features from a trained Analog ARTMAP neural network (Section 2.3.3), mutual information as applied

here evaluates each position in the gene independently. As a result, this method will tend to give higher values to primary than to secondary mutations.

Intuitively, the mutual information between two variables gives a measure of how much the uncertainty in one is reduced by knowledge of the other. Formally, the mutual information between the variables X and Y is defined as

\[
I(X,Y) = E_{p(X,Y)}\left[ \log_2 \frac{p(X,Y)}{p(X)\,p(Y)} \right] \tag{2.3.1}
\]

where p(X) and p(Y) denote the probability distributions of X and Y respectively, p(X,Y) is the joint distribution of the variables, and E_{p(X,Y)} is the expectation operator taken over all of the possible values of X and Y.

The following example illustrates the calculation of the mutual information between the presence or absence of a mutation at position 10 in the protease gene, and Nelfinavir (NFV) resistance (RF > 3.25) or susceptibility (RF <= 3.25).

Let X = {no mutation at position 10, mutation at position 10}
Let Y = {susceptible to NFV, resistant to NFV}

The joint probability distribution p(X,Y) is estimated from the sample population:

Estimate of p(X,Y)
p(X = no mutation at position 10, Y = susceptible to NFV) = 0.278
p(X = no mutation at position 10, Y = resistant to NFV) = 0.222
p(X = mutation at position 10, Y = susceptible to NFV) = 0.085
p(X = mutation at position 10, Y = resistant to NFV) = 0.414

Similarly, the distributions p(X) and p(Y) are estimated from the sample population:

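Estimate of p(X)
p(X = no mutation at position 10) = 0.501
p(X = mutation at position 10) = 0.499

Estimate of p(Y)
p(Y = susceptible to NFV) = 0.363
p(Y = resistant to NFV) = 0.637

The mutual information between the variables X and Y is then given by:

\[
\begin{aligned}
I(X,Y) &= E_{p(X,Y)}\left[ \log_2 \frac{p(X,Y)}{p(X)\,p(Y)} \right] \\
&= \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \\
&= p(X = \text{no mut.}, Y = \text{suscep.}) \log_2 \frac{p(X = \text{no mut.}, Y = \text{suscep.})}{p(X = \text{no mut.})\, p(Y = \text{suscep.})} \\
&\quad + p(X = \text{no mut.}, Y = \text{resis.}) \log_2 \frac{p(X = \text{no mut.}, Y = \text{resis.})}{p(X = \text{no mut.})\, p(Y = \text{resis.})} \\
&\quad + p(X = \text{mut.}, Y = \text{suscep.}) \log_2 \frac{p(X = \text{mut.}, Y = \text{suscep.})}{p(X = \text{mut.})\, p(Y = \text{suscep.})} \\
&\quad + p(X = \text{mut.}, Y = \text{resis.}) \log_2 \frac{p(X = \text{mut.}, Y = \text{resis.})}{p(X = \text{mut.})\, p(Y = \text{resis.})} \\
&= (0.278) \log_2 \frac{0.278}{(0.501)(0.363)} + (0.222) \log_2 \frac{0.222}{(0.501)(0.637)} \\
&\quad + (0.085) \log_2 \frac{0.085}{(0.499)(0.363)} + (0.414) \log_2 \frac{0.414}{(0.499)(0.637)} \\
&\approx 0.170 - 0.116 - 0.093 + 0.158 \\
&= 0.119
\end{aligned}
\]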
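The worked example for position 10 and Nelfinavir can be checked numerically with a short sketch. The function name and the dictionary layout are assumptions made for illustration. The marginal distributions are passed in explicitly, using the rounded values quoted in the text, rather than being recomputed from the rounded joint probabilities.

```python
import math


def mutual_information(joint, p_x, p_y):
    """Sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) ) over all (x, y) pairs.

    joint maps (x, y) pairs to probabilities; p_x and p_y map outcomes to
    their marginal probabilities. Zero-probability cells contribute nothing.
    """
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:
            total += p * math.log2(p / (p_x[x] * p_y[y]))
    return total


# Estimated probabilities for position 10 and NFV, as quoted in the text.
joint = {
    ("no mutation", "susceptible"): 0.278,
    ("no mutation", "resistant"): 0.222,
    ("mutation", "susceptible"): 0.085,
    ("mutation", "resistant"): 0.414,
}
p_x = {"no mutation": 0.501, "mutation": 0.499}
p_y = {"susceptible": 0.363, "resistant": 0.637}

mi = mutual_information(joint, p_x, p_y)
print(f"I(X, Y) = {mi:.3f} bits")
```

Two of the four terms are negative (the cells that occur less often than independence would predict), but the weighted sum is positive, as mutual information always is when the marginals are consistent with the joint distribution.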

Row a of Figure 2.2 shows the mutual information between the dichotomized Resistance Factors (binned according to the cutoff values listed above), and the individual positions in the protease gene represented as binary vectors with ones and zeros indicating the presence or absence of a mutation respectively (the binary representation). Row b of Figure 2.2 shows the mutual information between this binary representation of the gene positions and the raw Resistance Factors. The mutual information between the binary representation and the dichotomized Resistance Factors identifies the mutations that are likely to produce Resistance Factors greater than the cutoff value, and are therefore crucial for determining the likelihood of treatment success. The mutual information between the binary representation and the raw Resistance Factors, on the other hand, identifies the positions that are indicative of the greatest changes in the magnitude of resistance.

These two kinds of mutual information profiles highlight the differences between selecting the features that are most relevant to the problem of classifying the genes as resistant or susceptible vs. finding the features that are most useful for predicting the Resistance Factors themselves. For many patients, finding a drug that will lead to successful management of the viral population is the essential issue, and so determining whether a particular viral strain is susceptible to a given drug motivates the investigation of the individual positions in the gene. For others, however, especially those with a history of failed antiviral treatment who have developed multi-drug-resistant strains of the virus, finding the drug to which a given strain is least resistant becomes more important, and so the emphasis of the investigation shifts to the relative magnitude of conferred resistance.
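A mutual information profile of the kind shown in Row a of Figure 2.2 can be sketched by estimating the mutual information at each position from sample counts. The function names and the toy data below are hypothetical. Only the dichotomized case is shown, since the text does not specify how the raw Resistance Factors were discretized for Row b.

```python
import math
from collections import Counter


def mutual_information_from_samples(xs, ys):
    """Estimate I(X, Y) in bits from paired samples using relative frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    # c/n is the estimate of p(x,y); c*n/(count_x*count_y) equals p(x,y)/(p(x)p(y)).
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


def mi_profile(binary_sequences, labels):
    """One mutual information value per position, each position treated independently."""
    n_positions = len(binary_sequences[0])
    return [
        mutual_information_from_samples([seq[i] for seq in binary_sequences], labels)
        for i in range(n_positions)
    ]


# Hypothetical toy data: four sequences, two positions, 1 = mutation present.
sequences = [[1, 0], [1, 1], [0, 0], [0, 1]]
resistant = [1, 1, 0, 0]  # dichotomized Resistance Factors (1 = above cutoff)

profile = mi_profile(sequences, resistant)
print(profile)
```

In the toy data the first position determines the label and the second is independent of it, which illustrates the two extremes a profile can take for a binary label.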

2.3.2 Optimal Percent Correct Scores

A second means of evaluating the positions in the protease gene individually is the Optimal Percent Correct (O.P.C.) score. When the fold resistances are dichotomized into resistant and susceptible classes according to a suitable cutoff, the relative utility of an individual position can be evaluated by looking at the percent correct of the best prediction that could be made on the basis of that feature alone. This measure is an estimate of the (complement of the) Bayes Error Rate (Duda et al., 2001) from sample data. The explicit calculation of a Bayes Error Rate requires knowledge of the underlying probability distributions for each class. The O.P.C. score becomes equivalent to one minus the Bayes Error Rate if one assumes that the true distribution is equal to the sample distribution. This assumption leads to a reasonable approximation in the context of the contribution to protease inhibitor resistance made by mutations at specific positions in the protease gene because the set of possible values over which the distributions are defined is discrete (i.e., either zero or one for the binary representation, or one of the twenty amino acids when the nature of the mutation is considered).

For an example of the Optimal Percent Correct score, consider a machine learning system designed to predict resistance to Nelfinavir on the basis of position 10 alone. If more than 50% of sequences that differ from the reference sequence (Leucine at position 10) were resistant, then the optimal prediction would be to classify all sequences with mutations at position 10 as resistant (in fact, 83% of the sequences in the data set with mutations at position 10 are resistant to NFV). Similarly, if the majority of the sequences

without a mutation at position 10 were susceptible, then the optimal prediction would have all the sequences without a mutation there labeled susceptible (only 44% of the sequences without a mutation at position 10 are resistant). With this strategy, position 10 can be used to correctly predict the resistance or susceptibility of 70.1% of the data. For Nelfinavir, 63.7% of the sequences are resistant, so prediction on the basis of the presence or absence of mutations at position 10 leads to an improvement of 6.4% over the default strategy of predicting resistance for all sequences.

Row d of Figure 2.2 shows, in blue, the O.P.C. scores for each of the positions in the protease gene relative to each of the seven protease inhibitors in the data set. This simple measure gives some indication of which positions are predictive of resistance, but fails to take interactions between mutations at multiple positions into account. The section on Optimal Percent Correct Difference scores details the use of O.P.C. scores to measure the predictive gain when the specific amino acid found at a given position is taken into account.

Table 2.2. Vigilance baseline (ρ_a) parameter values used for feature selection. Columns: NFV, SQV, IDV, RTV, APV, LPV, ATV; row: ρ_a. Validation was used to select the values of ρ_a that led to the best performance on the validation sets for a binary representation of all 99 positions in the protease gene for each drug. These values were used to train networks for feature selection.
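The Optimal Percent Correct calculation can be sketched as a majority vote within each value of a single feature. The function name and the ten-sequence toy data set are hypothetical and are not drawn from the resistance database, so the resulting percentages differ from those quoted for position 10.

```python
from collections import Counter, defaultdict


def optimal_percent_correct(feature_values, labels):
    """Percent correct of the best prediction based on one feature alone.

    For every distinct feature value, the optimal rule predicts the majority
    label among the samples sharing that value.
    """
    counts_by_value = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        counts_by_value[value][label] += 1
    n_correct = sum(max(counts.values()) for counts in counts_by_value.values())
    return 100.0 * n_correct / len(labels)


# Hypothetical toy data: 1 = mutation at the position, "R"/"S" = resistant/susceptible.
mutated = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
labels = ["R", "R", "R", "R", "S", "R", "R", "S", "S", "S"]

opc = optimal_percent_correct(mutated, labels)
# A constant feature reproduces the default strategy of predicting the majority class.
baseline = optimal_percent_correct([0] * len(labels), labels)
print(opc, baseline, opc - baseline)
```

Because the rule is fitted and scored on the same sample, the score corresponds to one minus the Bayes Error Rate only under the assumption stated above, namely that the sample distribution equals the true distribution.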

2.3.3 Analog ARTMAP Feature Selection Results

Feature selection from trained Analog ARTMAP networks is applied to the problem of evaluating the relative utility of the positions in the protease gene for predicting protease inhibitor resistance. Row c of Figure 2.2 shows the feature selection results for networks trained on a binary representation of the protease inhibitor data with all 99 positions included. Validation was used to select the values of the parameter ρ_a that lead to the best performance for each drug. For each drug, twenty-five networks were trained with two-thirds of the training data chosen randomly, using the values ρ_b = 0.2 and Q = 13, together with a value of α, determined from pilot studies. The ρ_a values used were determined through validation, and are shown in Table 2.2. Features were selected from the trained networks (Analog ARTMAP feature selection is described in detail in Chapter Four), and the averaged results are presented.

The Analog ARTMAP feature selection profiles differ significantly from the mutual information and optimal percent correct profiles presented in Rows a, b, and d of Figure 2.2. This reflects the difference between the global nature of the Analog ARTMAP feature selection process, which acts on the data set as a whole, taking possible feature interactions into account, and the local approach to feature selection employed by the mutual information and optimal percent correct score methods, which consider individual features in isolation (see Section 4.2).

Figure 2.2, Row a. Profiles of mutual information between binary representations of the data and Resistance Factors dichotomized at the cutoff values. Resistance profiles are estimated by taking the mutual information between binary representations of the gene sequences and their corresponding RFs dichotomized into resistant and susceptible bins. Mutations at positions with high MI scores are likely to be primary mutations.

Figure 2.2, Row b. Profiles of mutual information between binary representations of the data and analog Resistance Factors. Resistance profiles are estimated by taking the mutual information between binary representations of the gene sequences and their corresponding RFs.

Figure 2.2, Row c. Profiles of Analog ARTMAP feature selection results. Resistance profiles are estimated by applying feature selection to Analog ARTMAP networks trained on binary representations of the data. Mutations at positions with high values may be either primary or secondary mutations.

Figure 2.2, Row d. Profiles of O.P.C. scores. Resistance profiles are estimated by calculation of the Optimal Percent Correct (O.P.C.) scores for each position. O.P.C. scores are shown in blue for the binary representation of the data. Empty bars show the O.P.C. score for each position when the specific amino acids are taken into account. Optimal Percent Correct Difference (O.P.C.D.) scores are defined by subtracting the binary O.P.C. score from the O.P.C. score found by taking into account the specific amino acids for each position. The O.P.C.D. scores are strictly non-negative.
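The O.P.C.D. definition given in the caption for Row d can be sketched as the difference between two Optimal Percent Correct scores, one computed from the specific residues and one from the presence or absence of a mutation. The function names and the sixteen-sequence toy data set are hypothetical; the counts are not those of position 71 in the resistance database.

```python
from collections import Counter, defaultdict


def optimal_percent_correct(feature_values, labels):
    """Percent correct when each feature value predicts its majority label."""
    counts_by_value = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        counts_by_value[value][label] += 1
    n_correct = sum(max(counts.values()) for counts in counts_by_value.values())
    return 100.0 * n_correct / len(labels)


def opcd(residues, labels, wild_type):
    """O.P.C. using the specific residue minus O.P.C. using presence/absence only."""
    mutated = [int(r != wild_type) for r in residues]
    return optimal_percent_correct(residues, labels) - optimal_percent_correct(mutated, labels)


# Hypothetical toy counts: (residue, number susceptible, number resistant).
counts = [("A", 6, 2), ("V", 1, 3), ("T", 3, 1)]
residues, labels = [], []
for residue, n_susceptible, n_resistant in counts:
    residues += [residue] * (n_susceptible + n_resistant)
    labels += ["S"] * n_susceptible + ["R"] * n_resistant

score = opcd(residues, labels, wild_type="A")
print(score)
```

The score cannot be negative because the residue-specific prediction can always reproduce the binary prediction by assigning every mutant residue the label chosen for the pooled mutant class. In the toy data the mutants are evenly split when pooled, so the binary feature gives no gain over the default strategy, while separating V from T does.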


Table 2.3. Optimal Percent Correct Difference (O.P.C.D.) analysis of position 71 relative to Amprenavir (APV). Upper panel columns: Amino Acid at Position 71; Number of Sequences; Number of Sequences Resistant to APV; Number of Sequences Susceptible to APV; % Resistant to APV. Upper panel rows: A (wild-type), T, V, L, I. Lower panel columns: Presence/Absence of a Mutation at Position 71; Number of Sequences; Number of Sequences Resistant to APV; Number of Sequences Susceptible to APV; % Resistant to APV. Lower panel rows: Wild-Type, Mutation. This table shows the optimal percent correct difference analysis of the relative merit of considering the specific amino acid found at position 71 (upper panel) vs. a binary representation (lower panel) for predicting resistance to Amprenavir. Predicting the resistance or susceptibility of sequences in the training set to APV on the basis of the amino acid found at position 71 leads to an improvement of nearly 5% over prediction based solely on the presence or absence of a mutation.

At Which Positions is Resistance Differentially Influenced by the Specific Amino Acids?

In addition to the question of which positions in the protease gene contribute to resistance, it is also useful to know which positions contribute to resistance differentially in dependence on the specific amino acid present in the mutation. Feature selection methods are brought to bear on this problem by comparing feature selection values for each position when the specific amino acid present in the mutation is considered vs. when

it is not. The positions with the greatest increase in feature selection value are the most likely candidates for a differential effect on resistance in dependence on the specific amino acids.

Table 2.4. The 10 positions with highest Optimal Percent Correct Difference (O.P.C.D.) scores for each drug

Drug  1st        2nd     3rd     4th     5th     6th     7th     8th     9th     10th
NFV   63 (2.2%)  (1.3%)  (0.9%)  (0.6%)  (0.4%)  (0.3%)  (0.3%)  (0.3%)  (0.3%)  (0.3%)
SQV   (1.8%)     (1.5%)  (1.2%)  (1.1%)  (0.8%)  (0.8%)  (0.6%)  (0.6%)  (0.6%)  (0.5%)
IDV   (4.2%)     (4.1%)  (2.5%)  (1.3%)  (1.2%)  (1.0%)  (0.9%)  (0.7%)  (0.6%)  (0.4%)
RTV   (3.9%)     (2.8%)  (1.2%)  (1.2%)  (0.9%)  (0.7%)  (0.7%)  (0.6%)  (0.6%)  (0.4%)
APV   (5.0%)     (1.9%)  (1.0%)  (0.8%)  (0.8%)  (0.8%)  (0.6%)  (0.6%)  (0.6%)  (0.6%)
LPV   (4.5%)     (1.9%)  (1.6%)  (1.1%)  (0.8%)  (0.8%)  (0.8%)  (0.8%)  (0.8%)  (0.8%)
ATV   (3.5%)     (2.7%)  (1.8%)  (0.9%)  (0.9%)  (0.9%)  (0.9%)  (0.9%)  (0.9%)  (0.9%)

Positions in the protease gene are ranked according to their O.P.C.D. scores for each drug. Each box lists the position and the corresponding O.P.C.D. score in brackets. Mutations at positions with high O.P.C.D. scores are likely to contribute differentially to resistance depending on the specific amino acids found there.

Optimal Percent Correct Difference Scores

The O.P.C. scores described in Section 2.3.2 can be used as a measure of the relative utility of taking into account the specific amino acid to which a given position has mutated vs. considering only whether or not the position contains a mutation. The O.P.C. score is calculated for each position by looking only at the presence or absence of mutations, and plotted in blue in Row d of Figure 2.2. O.P.C. scores are then calculated by looking at the percent correct of the optimal prediction that could

Table 2.5. Positions previously associated with resistance

Protease Inhibitor: Positions Previously Associated with Resistance
NFV (18 positions): 10,20,30,36,46,48,50,53,54,63,71,73,77,82,84,88,90,93
SQV (15 positions): 10,20,36,46,48,50,53,54,63,71,73,82,84,90,93
IDV (20 positions): 10,20,24,32,33,36,46,47,48,50,53,54,63,71,73,82,84,88,90,93
RTV (19 positions): 10,20,32,33,36,46,47,48,50,53,54,63,71,73,82,84,88,90,93
APV (19 positions): 10,20,32,33,36,46,47,48,50,53,54,63,71,73,82,84,88,90,93
LPV (20 positions): 10,20,24,32,33,36,46,47,48,50,53,54,63,73,82,84,88,90,93
ATV (20 positions): 10,20,24,32,33,36,46,47,48,50,53,54,63,73,82,84,88,90,93

This table lists positions Previously Associated with Resistance (PAR). Mutations at these locations in the protease gene are known either to confer resistance directly (primary mutations), or to act as secondary, fitness restoring mutations.

Table 2.6. Minimal supersets of positions previously associated with resistance

Protease Inhibitor: Minimal Supersets of Positions Previously Associated with Resistance
NFV (42 positions): 10,12,13,14,15,19,20,24,30,32,33,35,36,37,41,45,46,47,48,50,53,54,57,58,60,61,62,63,64,69,70,71,72,73,74,77,82,84,85,88,90,93
SQV (48 positions): 10,12,13,14,15,16,19,20,23,24,30,32,33,35,36,37,41,43,45,46,47,48,50,53,54,55,57,58,60,61,62,63,64,67,69,70,71,72,73,74,77,82,84,85,88,89,90,93
IDV (46 positions): 10,12,13,14,15,19,20,24,30,32,33,35,36,37,41,43,45,46,47,48,50,53,54,55,57,58,60,61,62,63,64,67,69,70,71,72,73,74,77,82,84,85,88,89,90,93
RTV (44 positions): 10,12,13,14,15,19,20,24,30,32,33,35,36,37,41,43,45,46,47,48,50,53,54,57,58,60,61,62,63,64,69,70,71,72,73,74,77,82,84,85,88,89,90,93
APV (45 positions): 10,12,13,14,15,16,19,20,24,30,32,33,35,36,37,41,45,46,47,48,50,53,54,55,57,58,60,61,62,63,64,69,70,71,72,73,74,76,77,82,84,88,89,90,93
LPV (50 positions): 10,12,13,14,15,16,18,19,20,24,30,32,33,35,36,37,39,41,43,45,46,47,48,50,53,54,55,57,58,60,61,62,63,64,67,69,70,71,72,73,74,76,77,82,84,85,88,89,90,93
ATV (53 positions): 2,4,8,10,11,12,13,14,15,16,19,20,23,24,30,32,33,34,35,36,37,41,43,45,46,47,48,50,53,54,57,58,60,61,62,63,64,65,66,69,70,71,72,73,77,79,82,84,85,88,89,90,93

This table presents the Minimal SuperSets (MSS) of positions previously associated with resistance. Included in each MSS are all positions with Analog ARTMAP feature selection D values greater than or equal to the lowest D value of a position previously associated with resistance. Positions not previously associated with resistance are shown in bold face.

Figure 2.3. Positions in the protease gene ranked according to Analog ARTMAP feature selection D values. The 99 positions in the gene are ranked according to their average D values. D values are plotted above the abscissa. Bars beneath the abscissa indicate the positions previously associated with resistance (PAR), many of which are clustered near the high end of the rankings. The method corroborates existing knowledge and identifies several candidate positions not previously associated with resistance.

be made when distinctions between the particular amino acids found in the mutations are allowed. These values are plotted with empty bars in Row d of Figure 2.2. The Optimal Percent Correct Difference (O.P.C.D.) scores are the difference between the O.P.C. scores for these two prediction schemes. This provides a simple measure of the utility of distinguishing between amino acids at each location in the protease gene.

The contribution of position 71 to Amprenavir resistance provides an instructive example of the O.P.C.D. score. Of the 524 protease sequences with reported Resistance Factors to APV, 333 (63.5%) have Resistance Factors below the cutoff of 2.5. Classifying all of the sequences as susceptible would therefore yield a 63.5% correct prediction rate. If, on the other hand, a classifier distinguished between sequences with and without mutations at position 71, 25.9% of the population without mutations are resistant, while 49.8% of the population with mutations are resistant (Table 2.3). Thus, the highest classification rate would be achieved by labeling both populations susceptible, so that there would be no gain by considering the presence or absence of mutations at this position. However, if the specific amino acids comprising the mutations are taken into account, the subpopulations with mutations A71V and A71L would be labeled resistant, while all others would be labeled susceptible (Table 2.3). This is not to say that, in isolation, point mutations A71V and A71L produce APV resistance while others do not, but simply that the majority of sequences in the database containing these mutations are resistant. This strategy would result in an overall predictive accuracy of 68.5%. The O.P.C.D. score for this position is therefore 4.9%, suggesting that classification will benefit when the machine learning system distinguishes between the amino acids found at

position 71. Ranking the positions by the amount of predictive accuracy gained by considering the nature of the mutation gives an indication of which positions should be given an Amino Acid Space representation in the classification of complete sequences. Table 2.4 lists the 10 positions with the greatest gain for each drug.

Feature Selection Results Relative to Protease Gene Position Sets Described in the Literature

Analysis of the individual positions in the protease gene has been presented in the literature from both experimental (Johnson et al., 2003; Shafer, 2002) and numerical (Beerenwinkel, 2003) perspectives. These descriptions provide a basis for comparison and discussion of the feature selection results presented here. Table 2.5 lists a set of positions known to contribute to resistance for each of the protease inhibitors under consideration. For each drug, the results of feature selection from trained Analog ARTMAP networks are used to produce a Minimal Superset (MSS) of this list by including all of the positions given a D score greater than or equal to the lowest scoring position in this list. The resultant position sets are given in Table 2.6.

Figure 2.3 shows the 99 positions in the protease gene ranked by the D values assigned to them in Analog ARTMAP feature selection. Bars beneath the abscissa indicate a position known to contribute to resistance. Table 2.7 lists the positions known to contribute to resistance for each of the protease inhibitors ranked according to the D values found by Analog ARTMAP feature selection in the second column, and the minimal superset of these

Protease Inhibitor | Rank order of positions previously associated with resistance | Rank order of Minimal SuperSet (MSS) of positions previously associated with resistance
NFV | 10,71,90,46,93,36,77,82,54,20,88,84,63,73,30,53,48,50 | 10,71,90,46,93,36,35,77,82,54,62,37,20,64,41,13,88,84,63,15,57,73,30,72,19,12,14,69,60,33,24,70,61,74,45,32,58,53,48,50
SQV | 10,90,71,46,54,93,82,36,84,20,73,63,48,53,50 | 10,90,71,46,54,93,82,36,77,35,62,84,37,20,64,41,13,73,88,57,72,15,63,30,12,19,48,33,14,60,24,69,32,53,61,47,45,74,70,43,89,67,55,85,16,58,23,50
IDV | 10,71,90,46,82,93,54,36,20,84,73,63,88,33,24,48,32,53,50,47 | 10,71,90,46,82,93,54,77,36,62,35,37,64,41,20,84,13,73,63,57,72,15,88,30,14,12,60,19,33,24,69,70,48,61,74,32,45,53,43,58,85,67,89,50,55,47
RTV | 10,71,82,90,54,46,93,36,84,20,63,88,73,33,32,48,53,47,50 | 10,71,82,90,54,46,93,36,77,62,35,37,84,64,20,41,13,57,63,15,88,73,72,30,12,14,19,33,60,24,69,70,32,61,48,45,53,74,47,43,58,85,89,50
APV | 10,46,71,90,54,93,82,36,84,88,20,63,73,33,32,47,50,53,48 | 10,46,71,90,54,93,82,77,36,62,84,35,37,88,20,64,13,63,41,57,73,15,72,30,33,19,12,14,60,24,69,32,74,47,50,70,53,89,61,45,55,16,76,58,48
LPV | 10,71,54,82,46,90,93,36,84,20,73,63,88,24,33,53,32,48,47,50 | 10,71,54,82,46,90,93,36,77,62,35,37,84,64,20,41,13,15,72,73,63,57,88,14,24,12,19,30,33,60,61,69,53,70,74,32,48,76,47,43,16,55,85,18,89,45,58,67,39,50
ATV | 71,10,36,93,90,82,20,46,54,88,63,33,50,73,32,84,47,53,24,48 | 71,10,36,35,13,93,90,77,62,82,37,30,20,41,46,64,54,88,63,69,33,14,12,70,50,15,73,57,89,32,84,45,19,60,47,16,66,53,58,72,24,85,61,4,43,23,79,65,34,11,2,8,48

Table 2.7. Positions previously associated with resistance and MSS positions ranked by Analog ARTMAP feature selection D values. This table shows the position rankings according to Analog ARTMAP feature selection D values. The central column shows the ranking attributed to the positions previously associated with resistance. The column on the right shows the ranking of the MSS positions, with positions not previously associated with resistance in bold face. Note that for all seven drugs, the majority of the highly ranked MSS positions are positions previously associated with resistance, and were correctly identified by the feature selection procedure without the incorporation of prior knowledge.
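The Minimal Superset construction behind the right-hand column can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `minimal_superset`, the dictionary of D scores, and all example values are hypothetical and are not taken from the data set or from the original implementation.

```python
def minimal_superset(d_scores, known_positions):
    """Return every position whose D score is at least as high as the
    lowest D score among the previously known positions, ranked by D."""
    # Threshold: the D score of the lowest scoring known position.
    threshold = min(d_scores[pos] for pos in known_positions)
    selected = [pos for pos, d in d_scores.items() if d >= threshold]
    # Rank by descending D score; ties are broken by position number.
    return sorted(selected, key=lambda pos: (-d_scores[pos], pos))


# Hypothetical D scores for a handful of positions (illustrative only).
d_scores = {10: 0.9, 71: 0.8, 35: 0.7, 90: 0.6, 62: 0.5, 50: 0.3, 3: 0.1}
known = [10, 71, 90, 50]  # hypothetical "previously associated" set

mss = minimal_superset(d_scores, known)
print(mss)
```

By construction the result always contains every known position and ends with the lowest scoring known position, which is the pattern visible in each row of Table 2.7.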

Protease Inhibitor | O.P.C.D. Best Three | O.P.C.D. Greater Than 1%
NFV | 12,37,63 | 12,63
SQV | 43,71,82 | 33,43,71,82
IDV | 37,63,88 | 12,36,37,63,82,88
RTV | 33,37,63 | 33,36,37,63
APV | 50,71,72 | 50,71
LPV | 54,71,82 | 54,71,72,82
ATV | 37,45,69 | 37,45,69

Table 2.8. Position sets given an Amino Acid Space representation in trials. This table shows two of the position sets given an Amino Acid Space representation in the trials. The central column lists the three positions with the highest Optimal Percent Correct Difference (O.P.C.D.) scores for each drug. The right column lists all the positions with O.P.C.D. scores greater than 1%.

positions similarly ranked in the third column.

Using the Results of Feature Selection to Enhance the Prediction of Protease Inhibitor Resistance

Combining the results of the methods for selecting a subset of protease gene positions for inclusion and for selecting a subset to be given an Amino Acid Space representation produces a set of testable hypotheses. For each drug, fifteen representations are compared. Three sets of included positions are tested: A) the complete set of positions 1-99, B) the set of positions previously associated with resistance (PAR, Table 2.5), and C) the drug-specific MSS positions. For each of these sets of included positions, five sets of positions are given an Amino Acid Space representation: 1) all of the included positions, 2) no positions, 3) the set of positions at which the specific amino acids may contribute differentially to resistance (AAPAR) according to the results summarized in the literature (i.e., 33, 46, 47, 50, 54, 63, 71, 82, 88, and 93) (Shafer,


2002), 4) the three positions with the highest Optimal Percent Correct Difference (O.P.C.D.) scores, and 5) the set of positions with O.P.C.D. scores greater than 1% (Table 2.8).

Figure 2.4. Comparison of network performance on validation sets for 15 representational schemes per drug. For each drug, 15 representational schemes were created by first choosing one of three sets of included positions: All (1-99), PAR (positions previously associated with resistance, Table 2.5), or MSS (minimal superset, Table 2.6), then choosing one of five sets of positions to be given an Amino Acid Space representation (all other included positions were given a binary representation): All AA (all of the included positions are given an Amino Acid Space representation), All Binary (all of the included positions are given a binary representation), positions with O.P.C.D. (optimal percent correct difference) scores >1% (Table 2.8), the three positions with the highest O.P.C.D. scores (Table 2.8), or AAPAR (the set of positions known to contribute differentially to resistance in dependence on the specific amino acids (see text)). For each drug, and for each of the 15 representations, 5 voter ensembles of 5 networks each were trained with 2/3 of the Train/Validation data and tested on the remaining 1/3 for every value of $\rho_a$ in the set {0.81, 0.83, ..., 0.99}. For each drug and representation, the value of $\rho_a$ that led to the highest average correlation coefficient between prediction and ground truth on the validation set was chosen, and the results are presented.

Figure 2.4 shows the performance of networks trained on these fifteen representations for each of the protease inhibitors. For each drug and representation, the mean correlation coefficient between the predicted and the target values of the log Resistance Factors is calculated for five divisions of the data into training and validation sets. For each division, an ensemble of voters was constructed by training five networks with randomly chosen orderings of the training set, using the parameter values $\rho = 0.2$, $Q = 13$, $p = 2$, $\varepsilon = -10^{-3}$, and $\alpha$ = determined from pilot studies. These simulations were carried out for each value of the vigilance baseline parameter $\rho_a$ in the set {0.81, 0.83, ..., 0.99}, and the vigilance baseline that resulted in the best performance on the validation set was chosen for each drug and representation. In all, the simulations

Protease Inhibitor | Positions Given a Binary Representation | Positions Given an Amino Acid Space Representation | Vigilance Baseline ($\rho_a$) | Correlation Coefficient between Predicted and Actual RFs (Validation Sets)
NFV | PAR (Table 2.5) | AAPAR (listed in the text) | |
SQV | PAR (Table 2.5) | Positions with O.P.C.D. Scores Greater than 1% (Table 2.8) | |
IDV | PAR (Table 2.5) | Positions with O.P.C.D. Scores Greater than 1% (Table 2.8) | |
RTV | PAR (Table 2.5) | None | |
APV | All (1-99) | The Three Positions with Highest O.P.C.D. Scores (Table 2.8) | |
LPV | PAR (Table 2.5) | Positions with O.P.C.D. Scores Greater than 1% (Table 2.8) | |
ATV | PAR (Table 2.5) | All | |

Table 2.9. Representation and vigilance baseline ($\rho_a$) leading to the highest correlation coefficient on the validation sets for each of the seven protease inhibitors. For each drug, the representation and vigilance baseline value that led to the best network performance on the validation sets (Figure 2.4) are shown. Each representation involves a choice of positions in the protease gene to be given a binary representation (Column Two), and a choice of positions to be given an Amino Acid Space representation (Column Three). Positions contained in the intersection of these sets are given an Amino Acid Space representation.

involved the training and testing of 7x15x5x5x10 = 26,250 Analog ARTMAP neural networks. The representations and corresponding vigilance baseline values that led to the best correlation coefficients on the validation sets can be seen in Table 2.9. Figure 2.5 and Table 2.10 show the results of applying these representation schemes and parameter values to the test set. By binning the analog predictions into resistant and susceptible according to the cutoff values above, the performance of the networks can be evaluated as a two-class classifier, yielding the classification error rates shown in Table 2.10.

Novel Mutations Associated with Protease Inhibitor Resistance and Hypersusceptibility

In addition to the goal of predicting resistance for novel protease sequences, the application of Analog ARTMAP to protease inhibitor resistance data is used to gain some insight into locations in the protease gene where mutations may influence resistance. The feature selection results shown in Figure 2.3 and Table 2.7 demonstrate the ability of these techniques to identify the importance of positions in the gene that are known to contribute to resistance in a fully automated way, without the incorporation of previous knowledge into the system. At the same time, these methods also identify several positions not previously associated with resistance. Position 35, for example, is strongly implicated for all seven protease inhibitors. Positions 62, 37 and 77 (77 has been associated with resistance only to Nelfinavir) are also implicated for all of the protease

Figure 2.5. Network predictions on the test sets. Scatter plots of the predicted Resistance Factors against actual Resistance Factors. For each drug, the representation and $\rho_a$ value that led to the best performance on the validation sets were chosen, and used to train an ensemble of networks for prediction on the test sets.

inhibitors. These four positions have the highest Analog ARTMAP feature selection D values out of all the positions not previously associated with resistance for all of the protease inhibitors except ATV.

Protease Inhibitor | Two-Class Classification Error on the Test Set | Correlation Coefficient on the Test Set
NFV | 14.4% | 0.81
SQV | 16.5% | 0.83
IDV | 14.3% | 0.82
RTV | 10.4% | 0.85
APV | 27.5% | 0.76
LPV | 8.3% | 0.87
ATV | 36.4% | 0.75

Table 2.10. Test set results. The representations and vigilance baseline values found through validation (Table 2.9) are used to train networks on the full training/validation sets. This table presents the results of testing these networks on the test sets. Two-class classification error rates are found by dichotomizing the predicted and actual Resistance Factors according to the cutoff values listed in the text. Column Three shows the correlation coefficients between the predicted and actual Resistance Factors of the protease genes in the test sets. Scatter plots of these network predictions are shown in Figure 2.5.

These methods also produce several insights into positions whose contribution to resistance depends strongly on the specific amino acid to which the sequence mutates. Table 2.9 lists the representations that led to the best performance of the networks on the validation sets. For Nelfinavir (NFV) the best network performance occurs when an Amino Acid Space representation is given to the positions 33, 46, 47, 50, 54, 63, 71, 82, 88, and 93. These are the positions that have previously been observed to contribute differentially to resistance in dependence on the specific amino acids to which they

SQV Resistance to Sequences Containing Mutations at Position 43
Mutation | K43T | K43N | K43E | K43R
# Sequences | | | |
Mean RF | | | |

Table 2.11. Saquinavir (SQV) resistance to sequences containing mutations at position 43. For each of the four mutations occurring at position 43 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to SQV are shown in this table. The mutation K43N is of particular interest because the mean RF to SQV of sequences containing this mutation is below one, suggesting that K43N may confer some degree of hypersusceptibility to SQV.

IDV Resistance to Sequences Containing Mutations at Position 12
Mutation | T12A | T12S | T12P | T12D | T12N
# Sequences | | | | |
Mean RF | | | | |
Mutation | T12E | T12I | T12M | T12K | T12R
# Sequences | | | | |
Mean RF | | | | |

Table 2.12. Indinavir (IDV) resistance to sequences containing mutations at position 12. For each of the 10 mutations occurring at position 12 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to IDV are shown in this table.

IDV Resistance to Sequences Containing Mutations at Position 36
Mutation | M36V | M36L | M36I
# Sequences | | |
Mean RF | | |

Table 2.13. Indinavir (IDV) resistance to sequences containing mutations at position 36. For each of the three mutations occurring at position 36 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to IDV are shown in this table. The mutation M36L is of particular interest because the mean RF to IDV of sequences containing this mutation is below one, suggesting that M36L may confer some degree of hypersusceptibility to IDV.

IDV Resistance to Sequences Containing Mutations at Position 37
Mutation | N37A | N37S | N37C | N37P | N37D | N37T
# Sequences | | | | | |
Mean RF | | | | | |
Mutation | N37E | N37Q | N37H | N37K | N37Y | N37R
# Sequences | | | | | |
Mean RF | | | | | |

Table 2.14. Indinavir (IDV) resistance to sequences containing mutations at position 37. For each of the 12 mutations occurring at position 37 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to IDV are shown in this table. The mutations N37Q and N37K are of particular interest because the mean RFs to IDV of the sequences containing these mutations are below one, suggesting that these mutations may confer some degree of hypersusceptibility to IDV.

APV Resistance to Sequences Containing Mutations at Position 72
Mutation | I72T | I72V | I72E | I72L | I72M | I72K | I72F | I72R
# Sequences | | | | | | | |
Mean RF | | | | | | | |

Table 2.15. Amprenavir (APV) resistance to sequences containing mutations at position 72. For each of the eight mutations occurring at position 72 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to APV are shown in this table.

LPV Resistance to Sequences Containing Mutations at Position 72
Mutation | I72T | I72V | I72E | I72L | I72M | I72F | I72R
# Sequences | | | | | | |
Mean RF | | | | | | |

Table 2.16. Lopinavir (LPV) resistance to sequences containing mutations at position 72. For each of the seven mutations occurring at position 72 among sequences in the data set, the number of sequences containing the mutation and the mean Resistance Factor of these sequences to LPV are shown in this table.

mutate (Shafer, 2002). Thus for NFV, the methods corroborate pre-existing observations.

For Saquinavir (SQV), the best network performance occurs when the positions with O.P.C.D. scores greater than one percent, i.e., 33, 43, 71, and 82 (Table 2.8), are given an Amino Acid Space representation. Of these, position 43 has not previously been observed to contribute differentially to resistance in dependence on the specific amino acids to which the sequences mutate. Table 2.11 catalogs the 23 sequences with mutations at position 43 for which Resistance Factors to SQV are present in the data set. For each of the four observed mutations, K43T (a mutation at position 43 in which the amino acid K (Lysine) present in the wild-type is replaced with amino acid T (Threonine)), K43N, K43E and K43R, the number of sequences containing the mutation is listed in the second row, and the mean Resistance Factor to SQV of those sequences is listed in the third row. Of particular interest is the fact that the two sequences containing K43N have a mean Resistance Factor below one. The fact that these sequences have Resistance Factors less than or equal to the wild-type despite the presence of mutations known to confer resistance to SQV (e.g., L90M) may indicate that K43N contributes some degree of hypersusceptibility (the condition of being less resistant to a drug than the wild-type) to SQV. Appendix A lists the gene sequences containing the mutation K43N and their Resistance Factors to SQV.

For Indinavir (IDV), the best network performance occurs when the positions with O.P.C.D. scores greater than one percent, i.e., 12, 36, 37, 63, 82, and 88 (Table 2.8), are given an Amino Acid Space representation. Of these, positions 12, 36, and 37 have not previously been observed to contribute differentially to resistance in dependence on

the specific amino acids to which the sequences mutate. Tables 2.12, 2.13, and 2.14 catalog the sequences with mutations at positions 12, 36, and 37, respectively, for which Resistance Factors to IDV are present in the data set. Here there are three mutations that may be linked to hypersusceptibility to IDV: M36L, N37Q, and N37K. Appendix A lists the gene sequences containing these mutations and their Resistance Factors to IDV.

For Ritonavir (RTV), the best network performance occurs when none of the positions is given an Amino Acid Space representation.

For Amprenavir (APV), the best network performance occurs when the three positions with the highest O.P.C.D. scores, i.e., 50, 71, and 72 (Table 2.8), are given an Amino Acid Space representation. Of these, position 72 has not previously been observed to contribute differentially to resistance in dependence on the specific amino acids to which the sequences mutate. Table 2.15 catalogs the 69 sequences with mutations at position 72 for which Resistance Factors to APV are present in the data set.

For Lopinavir (LPV), the best network performance occurs when the positions with O.P.C.D. scores greater than one percent, i.e., 54, 71, 72, and 82 (Table 2.8), are given an Amino Acid Space representation. Of these, position 72 has not previously been observed to contribute differentially to resistance in dependence on the specific amino acids to which the sequences mutate. Table 2.16 catalogs the 61 sequences with mutations at position 72 for which Resistance Factors to LPV are present in the data set.

For Atazanavir (ATV), the best network performance occurs when all of the positions are given an Amino Acid Space representation.
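The O.P.C.D. scores used above to choose positions for an Amino Acid Space representation can be illustrated with a small sketch. It assumes that the Optimal Percent Correct of a scheme is the accuracy obtained by giving every subpopulation its majority label, as in the position 71 example. The function names, the single-letter residue coding, and the toy data are hypothetical and are not drawn from the data set.

```python
from collections import Counter, defaultdict


def optimal_percent_correct(groups, labels):
    """Best accuracy (in percent) when each group gets its majority label."""
    counts = defaultdict(Counter)
    for group, label in zip(groups, labels):
        counts[group][label] += 1
    correct = sum(max(c.values()) for c in counts.values())
    return 100.0 * correct / len(labels)


def opcd(residues, labels, wild_type):
    """O.P.C. with amino acids distinguished minus O.P.C. with only
    mutated/unmutated distinguished, for a single gene position."""
    mutated = [r != wild_type for r in residues]
    return (optimal_percent_correct(residues, labels)
            - optimal_percent_correct(mutated, labels))


# Toy data (illustrative only): residue observed at one position and
# whether the sequence is resistant (True) or susceptible (False).
residues = ["A"] * 6 + ["V"] * 3 + ["T"] * 3
resistant = [False] * 5 + [True] + [True] * 3 + [False] * 3

score = opcd(residues, resistant, wild_type="A")
print(round(score, 1))
```

In the toy data the mutated subpopulation is evenly split, so knowing only that a mutation is present gains nothing, while separating V from T classifies all mutated sequences correctly. This mirrors the situation described for position 71 and APV.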

The hypersusceptibility of a viral strain to a specific drug is a particularly important situation because it implies that for a patient infected with that viral strain, there exists a treatment option that could reasonably be expected to reduce viral load more effectively for that patient than it would for a patient infected with the wild-type virus. Ultimately, a numerical analysis of protease inhibitor resistance data cannot supplant experimental tests to determine the causative relationships between protease mutations and protease inhibitor resistance. However, this analysis does suggest that further experimental tests to determine the relationship between the mutation K43N and Saquinavir hypersusceptibility, and between the mutations M36L, N37Q, and N37K and Indinavir hypersusceptibility, may be of value.

3. Analog ARTMAP

In this chapter the neural network Analog ARTMAP, the learning system used to predict protease inhibitor resistance from HIV genotypes, is introduced. Previous applications of ART neural networks to genetic databases (LeBlanc et al., 1996; Tomida et al., 2002) have demonstrated the efficacy of this family of neural networks on unsupervised learning (or clustering) problems in genetics. However, the prediction of protease inhibitor resistance is a supervised learning problem, so the specific ART network architectures used in these applications cannot be applied.

Prediction of resistance can be approached as a classification problem where the two class labels, drug-resistant and drug-susceptible, are assigned to each HIV protease gene by dichotomizing the Resistance Factors according to clinically determined cutoff values. An accurate prediction of resistance or susceptibility to a particular protease inhibitor would indicate whether or not treatment with that protease inhibitor would be likely to produce an effective suppression of viral load. However, this prediction scheme would be insufficient to determine which protease inhibitor is likely to be the most effective for a given patient. To accomplish this it is necessary to know the degree of resistance a particular viral subtype would have against each of the protease inhibitors, so that the drug to which the viral subtype is least resistant can be identified. A computational system capable of learning associations between inputs (i.e., gene sequences) and continuous-valued, analog outputs (i.e., Resistance Factors) is therefore needed. This is the motivation for the design of Analog ARTMAP, a neural network that

extends the classification capabilities of default ARTMAP into the realm of continuous-valued machine learning problems.

Analog ARTMAP: An ART Neural Network Architecture for Regression

A novel ART network architecture is proposed for fast, on-line supervised learning of multidimensional associative maps between input patterns and continuous-valued, analog outputs. The system builds on the default ARTMAP architecture, which uses winner-take-all training and distributed prediction during testing. Learning in the context of default ARTMAP uses class labels, i.e., elements drawn from a finite discrete set, as the supervising signal. Analog ARTMAP networks allow the supervising signal to take on any real value, or even a multidimensional vector value. In the event that the supervising signals presented to the network are drawn from a finite set of class labels, the network performance will be identical to default ARTMAP. The network architectures therefore form the nested sequence:

Fuzzy ARTMAP $\subset$ Default ARTMAP $\subset$ Analog ARTMAP

Figure 3.1. Analog ARTMAP network architecture. A schematic representation of the Analog ARTMAP architecture for learning maps from multidimensional vector valued inputs to analog, multidimensional outputs. The diagram shows the complement coded input layer $F_1$ (nodes $x_i$), the coding layer $F_2$ (nodes $y_j$, net signal $T_j$, bottom-up weights $w_{ij}$), and the analog output layer $F_3$ (nodes $z_k$, net signal $\sigma_k$, weights $W_{jk}$), together with the two code reset conditions, $|A \wedge w_j| < \rho_a M$ and $\max_k |\sigma_k - O_k| > \rho_b$.

The Analog ARTMAP network is able to learn associations with continuous output values by using a modified version of the distributed outstar equation (Carpenter, 1994) to model learning between nodes in the $F_2$ and $F_3$ layers (Figure 3.1). The equation includes a category-specific learning rate $\beta_j$ that decays as the synapses projecting from the $j$th $F_2$ node encode more associations (a complete list of Analog ARTMAP notation and parameters is given in Appendix B). This is a form of category-specific fast-to-slow learning. The instance counting weight $c_j$ of the $j$th $F_2$ node equals the number of input-output associations that have been learned. The learning rate for the synapses between that $F_2$ node and the $F_3$ nodes to which it is connected is:

$$\beta_j = \frac{1}{c_j} \tag{3.1.1}$$

With this varying learning rate, the equilibrium solution of the learning law has the form:

$$W_{jk}^{new} = W_{jk}^{old} + \beta_j \left( O_k - W_{jk}^{old} \right) y_j \tag{3.1.2}$$

where $O_k$ is the $k$th component of the analog output vector, $y_j$ is the activation of the $j$th node in the $F_2$ layer, and $W_{jk}$ is the weight assigned to the synapse from $F_2$ node $j$ to $F_3$ node $k$. Because learning takes place in winner-take-all mode, only one $F_2$ node (node $J$) will be active, so

$$y_j = \begin{cases} 1 & \text{if } j = J \\ 0 & \text{if } j \neq J \end{cases} \tag{3.1.3}$$

This implies that during training, only the $F_2$-to-$F_3$ weights projecting from the winning node $J$ will be updated for each training point. Furthermore, Equations 3.1.1 and 3.1.2 guarantee that the weight from $F_2$ node $j$ to $F_3$ node $k$ will be exactly equal to the mean of the ($k$th component of the) output values corresponding to the input patterns that have been associated with committed $F_2$ node $j$. In other words, if $F_2$ node $j$ encodes the input set $\{a_1, a_2, \ldots, a_N\}$, and the $k$th components of the corresponding output vectors are $\{O_{1,k}, O_{2,k}, \ldots, O_{N,k}\}$, then at the end of learning the weight $W_{jk}$ will be equal to the mean of the output set $\{O_{1,k}, O_{2,k}, \ldots, O_{N,k}\}$, i.e., after training

$$W_{jk} = \frac{1}{N} \sum_{n=1}^{N} O_{n,k} \tag{3.1.4}$$

The proof of Equation 3.1.4 is contained in Appendix C. The fact that the weights from the $F_2$ to $F_3$ layers track the mean of the analog output values associated with each committed $F_2$ node gives the system several desirable properties. Since the mean of any finite set is unique, the value of the $F_2$ to $F_3$ weights will be independent of the order in which the inputs assigned to a particular $F_2$ node are presented during training. In addition, association of the mean output value with each cluster guarantees that the within-cluster Mean-Squared Error (MSE) will be minimized, and that the individual maps from clusters to analog output values are unbiased estimators.

Figure 3.1 depicts the two ways in which a reset at the $F_2$ layer can be triggered during network training. The first, common to Analog, Default and Fuzzy ARTMAP architectures, is triggered by a mismatch between the bottom-up input and top-down expectations from the $F_1$ and $F_2$ layers respectively. In this case, reset will be triggered if

$$|A \wedge w_j| < \rho_a M \tag{3.1.5}$$

where $A$ is the current training point, $w_j$ is the vector of weights from the $F_1$ nodes to $F_2$ node $j$, $\rho_a$ is the current value of vigilance, and $M$ is the dimension of the input space.
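The running-mean property stated in Equations 3.1.1 to 3.1.4 can be checked numerically with a short sketch for a single $F_2$ node and a single output component. It assumes that the instance count is incremented before each update, so that the first association is learned with a rate of one. The function name `train_outstar` and the sample output values are illustrative only.

```python
import random


def train_outstar(outputs):
    """Apply W <- W + (1/c) * (O - W) once per output value assigned to
    a single winning F2 node, and return the final weight."""
    weight, count = 0.0, 0
    for target in outputs:
        count += 1              # assumption: count updated before learning
        beta = 1.0 / count      # category-specific learning rate
        weight += beta * (target - weight)
    return weight


outputs = [0.2, 1.5, -0.7, 3.1, 0.9]   # illustrative analog outputs
mean = sum(outputs) / len(outputs)

w_forward = train_outstar(outputs)

shuffled = outputs[:]
random.Random(0).shuffle(shuffled)      # a different presentation order
w_shuffled = train_outstar(shuffled)

print(w_forward, w_shuffled, mean)
```

Both presentation orders end at the arithmetic mean of the outputs, up to floating-point rounding, which is the order independence described above.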

The second cause of code reset in the Analog ARTMAP architecture is a mismatch between the bottom-up and analog output signals from $F_2$ and $F_3$ respectively. Here reset will be triggered if

$$\max_{k \in F_3} |\sigma_k - O_k| > \rho_b \tag{3.1.6}$$

where $\sigma_k$ is the activation of $F_3$ node $k$ (Equation 3.1.10), $O_k$ is the $k$th component of the analog output vector associated with the current training point, and $\rho_b$ is a free parameter of the system. In both cases, the reaction to a code reset is the same. Vigilance ($\rho_a$) will be raised and the network will continue to search through $F_2$ nodes until a suitable node is found or a previously uncommitted node is recruited. The parameter $\rho_a$ allows a user to adjust the degree of acceptable heterogeneity among the input values of the set of training samples assigned to a cluster, while the parameter $\rho_b$ sets the permitted level of within-cluster heterogeneity of the output values. While $\rho_a$ is incrementally increased in response to a code reset, $\rho_b$ holds a constant value throughout training (Appendix B, Table B.2).

Like default ARTMAP, Analog ARTMAP uses distributed coding at the $F_2$ layer during testing. The effect of competition among $F_2$ nodes is modeled by a fusion of the Increased Gradient Content Addressable Memory (IG-CAM) rule (Carpenter et al., 1998) and the Q-Max CAM rule (Carpenter and Markuzon, 1998). The signal to the active coding nodes ($j = 1, \ldots, C$) in the $F_2$ layer is given by:

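$$T_j = |A \wedge w_j| + (1 - \alpha)(M - |w_j|) \tag{3.1.7}$$

These signals induce activation of the $F_2$ nodes as given by one of the following two expressions:

A. Point Box Case: If $\Lambda' = \{\lambda = 1 \ldots C : T_\lambda = M\} \neq \emptyset$:

$$y_j = \begin{cases} \dfrac{1}{|\Lambda'|} & \text{if } j \in \Lambda' \\ 0 & \text{if } j \notin \Lambda' \end{cases} \tag{3.1.8}$$

B. Otherwise:

$$y_j = \begin{cases} \dfrac{\left[ \dfrac{1}{M - T_j} \right]^p}{\sum_{\lambda \in \Lambda} \left[ \dfrac{1}{M - T_\lambda} \right]^p} & \text{if } j \in \Lambda \\ 0 & \text{if } j \notin \Lambda \end{cases} \tag{3.1.9}$$

where $C$ is the number of committed $F_2$ nodes and $\Lambda$ is the set of the $Q$ most active, suprathreshold $F_2$ nodes, i.e., the $Q$ most active nodes from the set $\{\lambda = 1 \ldots C : T_\lambda > \alpha M\}$. The two free parameters that control the form of the distributed coding at the $F_2$ layer are $p$ and $Q$. The predictions of the network during testing are given by:

$$\sigma_k = \sum_{j=1}^{C} W_{jk} y_j \tag{3.1.10}$$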
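The test-time computation in Equations 3.1.7 to 3.1.10 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used for the simulations. It assumes complement coding of an $M$-dimensional feature vector (so that the coded input has $2M$ components and city-block norm $M$), and the function names, tolerance-based equality test, and example weights are all hypothetical.

```python
import math


def complement_code(a):
    """Assumed input coding: A = (a, 1 - a), so |A| = M."""
    return list(a) + [1.0 - x for x in a]


def choice_signals(A, weights, alpha):
    """T_j = |A ^ w_j| + (1 - alpha)(M - |w_j|) for each committed node."""
    M = len(A) // 2
    return [sum(min(x, w) for x, w in zip(A, wj))
            + (1.0 - alpha) * (M - sum(wj)) for wj in weights]


def distributed_activation(T, M, alpha, p, Q):
    """F2 activations: point box case first, otherwise Q-max IG-CAM."""
    y = [0.0] * len(T)
    point_boxes = [j for j, t in enumerate(T) if math.isclose(t, M)]
    if point_boxes:                      # input coincides with a point box
        for j in point_boxes:
            y[j] = 1.0 / len(point_boxes)
        return y
    supra = [j for j, t in enumerate(T) if t > alpha * M]
    top = sorted(supra, key=lambda j: T[j], reverse=True)[:Q]
    if not top:
        raise ValueError("no suprathreshold coding node")
    raw = {j: (1.0 / (M - T[j])) ** p for j in top}
    total = sum(raw.values())
    for j, value in raw.items():
        y[j] = value / total
    return y


def predict(y, W):
    """sigma_k = sum_j W_jk * y_j."""
    return [sum(y[j] * W[j][k] for j in range(len(y)))
            for k in range(len(W[0]))]


# Two committed nodes with hypothetical point-box weights and outputs.
alpha, p, Q = 0.01, 2, 5
weights = [complement_code([0.0, 0.5]), complement_code([0.5, 0.5])]
W = [[0.0], [1.0]]

A = complement_code([0.25, 0.5])         # test point midway between boxes
T = choice_signals(A, weights, alpha)
y = distributed_activation(T, len(A) // 2, alpha, p, Q)
sigma = predict(y, W)
print(T, y, sigma)
```

Because the test point is equidistant from the two boxes, both nodes receive the same signal and the prediction is the simple average of their output weights. A test point that coincides with one of the point boxes falls under Case A and returns that node's weight unchanged.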

Since Equations 3.1.8 and 3.1.9 imply that the $y_j$ values sum to one, the value of $\sigma_k$ given by Equation 3.1.10 can be seen as an average of the $W_{jk}$'s of the active $F_2$ nodes weighted by the values of their activations. The complete Analog ARTMAP algorithm is presented in Appendix C.

The Performance of Analog ARTMAP on Benchmark Problems

Figure 3.2.a. Target surface. This plot shows the surface from which points were drawn to train and test an Analog ART network. The problem has a two-dimensional input space, shown on the x and y axes, and a single, analog valued output variable depicted along the z axis.

Figure 3.2.b. Training set. 300 training points were drawn at random from a flat distribution over the unit square, and plotted along with their target analog output values.

The results of training and testing a network with the Analog ARTMAP architecture described above demonstrate the ability of the network to learn associative maps between vector spaces in a supervised learning setting. Two sample problems have

been constructed to demonstrate the network dynamics. In the first, a training set is constructed by randomly choosing 300 two-dimensional points in the unit square, and assigning to them analog output values via the equation:

1 1 z = ( x ) + ( x2 ) (3.2.1)

Figure 3.2.a depicts the surface from which these points were drawn with the two input dimensions in the horizontal plane and the analog output values in the vertical dimension. Figure 3.2.b shows the actual training set used.

Figure 3.3.a. Network predictions and category boxes. The network predictions for the test set are shown in grey, along with red rectangles corresponding to the weights learned by the $F_1$-$F_2$ and $F_2$-$F_3$ connections.

Figure 3.3.b. Scatter plot of network predictions vs. ground truth. Each point in the plot represents a single test point. Network predictions are along the vertical axis and ground truth, or desired predictions, are along the horizontal axis. The black diagonal line indicates the locations of perfect predictions.

A network was trained and tested with the parameter values $\alpha = 0.01$, $\rho_a = 0.9$, $\rho_b = 0.2$, $p = 2$, and $Q = 5$. The test
