

Predicting Human Immunodeficiency Virus Type 1 Drug Resistance From Genotype Using Machine Learning

Robert James Murray

Master of Science
School of Informatics
University of Edinburgh
2004

ABSTRACT: Drug resistance testing has been increasingly incorporated into the clinical management of human immunodeficiency virus type 1 (HIV-1) infection. At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs, namely phenotyping and genotyping. Although phenotyping is recognised as providing a quantified measurement of drug resistance, it involves a complex procedure and is time consuming. On the other hand, genotyping involves a relatively simple procedure and can be done quickly. However, the interpretation of drug resistance from genotype information alone is challenging. A number of machine-learning methods have now been used to automatically relate HIV-1 genotype with phenotype, but the predictive quality of these models has been mixed. This study investigates the nature of these computational models and their predictive merit. Using the complete Stanford dataset of matched phenotype-genotype pairs, two contrasting machine-learning approaches were implemented to analyse the significance of sequence mutations in the protease and reverse transcriptase genes of HIV-1 for 14 antiretroviral drugs. Both decision tree and nearest-neighbour classifiers were generated and compared with previously published classification models. I found prediction errors between % for decision tree models and between % for nearest-neighbour classifiers. This was compared with prediction errors of between % for previously published decision tree models and a correlation coefficient of 0.88 for a neural network lopinavir classification model.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Robert James Murray)

Table of Contents

1 Introduction
  1.1 Overview
  1.2 HIV Infection
  1.3 Resistance Testing
  1.4 Phenotype & Genotype Resistance Tests
  1.5 Literature Review
2 Machine Learning
  2.1 Overview & The Phenotype Prediction Problem
  2.2 The Training Experience, E
  2.3 Learning Mechanisms
    2.3.1 Decision Trees
    2.3.2 Artificial Neural Networks
    2.3.3 Instance Based
3 Materials & Methods
  3.1 Data Set
  3.2 Decision Trees
  3.3 Nearest-Neighbour
4 Results
  4.1 Classification Models
    4.1.1 Reverse Transcriptase Inhibitors
    4.1.2 Protease Inhibitors
  4.2 Prediction Quality
5 Conclusion
  5.1 Concluding Remarks & Observations
    5.1.1 Decision Tree Models
    5.1.2 Neural Network Models
    5.1.3 Nearest-Neighbour Models
  5.2 Suggestions For Further Work
    5.2.1 Handling Ambiguity Codes
    5.2.2 Using A Different Attribute Selection Measure
    5.2.3 Using A Different Distance Metric
    5.2.4 Receiver Operating Characteristic Curve
    5.2.6 Other Machine Learning Approaches
A Pre-Processing Software
B Cultivated Phenotype
C The Complete Datasets
D Original Decision Trees
Bibliography

Chapter 1
Introduction

1.1 Overview

Drug resistance testing has been increasingly incorporated into the clinical management of human immunodeficiency virus type 1 (HIV-1) infection. Resistance tests can show whether a particular HIV-1 strain is likely to be suppressed by a drug. At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs, namely phenotyping and genotyping. Whereas phenotypic assays provide a direct quantification of drug resistance, genotypic assays only provide clues towards drug resistance. In particular, genotyping attempts to establish the presence or absence of genetic mutations in the protease and reverse transcriptase genes of HIV-1 that have been previously associated with drug resistance. Although phenotyping is recognised as providing a quantified measurement of drug resistance, it involves a complex procedure and is time consuming. On the other hand, genotyping involves a relatively simple procedure and can be done quickly. However, the interpretation of drug resistance from genotype information alone is still challenging, and often requires expert analysis. A number of machine-learning methods have now been used to automatically relate HIV-1 genotype with phenotype [1,2,10,9]. Using datasets of matched phenotype-genotype pairs, machine-learning methods can be used to derive computational models that predict phenotypic resistance from genotypic information. However, the predictive quality of these models has been mixed. For some drugs, these models offer reasonable prediction rates, but for others the results are less useful for managing HIV-1 infection. This study investigates the nature of these computational models and their predictive merit. Specifically, I had the following initial goals:

- related to previous work [1], generate decision tree classifiers to recognise genotypic patterns attributed to drug resistance;
- evaluate the predictive quality and nature of these models in retrospect;
- consider the application of other machine learning approaches.

Using the complete Stanford dataset of matched phenotype-genotype pairs, two contrasting machine-learning approaches were implemented, in Java, to analyse the significance of sequence mutations in the protease and reverse transcriptase genes of HIV-1 for 14 antiretroviral drugs. Specifically, decision tree classifiers were generated to predict drug susceptibilities from genotypic information, and nearest-neighbour classifiers were built to identify genotypes with similar mutational patterns. Also evaluated in the study was the possible use of artificial neural networks as proposed in [2]. The predictive quality of the classification models was analysed against an independent testing set. For decision trees, I found prediction errors between % for all drugs. This was compared with the performance of the decision trees presented in [1]; specifically, those models achieved prediction errors in the range of % over the same testing set. Nearest-neighbour classifiers exhibited poorer performance, with prediction errors between %.

1.2 HIV Infection

Human immunodeficiency virus (HIV) is a superbly adapted human pathogen. As of the end of 2003, an estimated 37.8 million people worldwide (35.7 million adults and 2.1 million children younger than 15 years) were living with HIV/AIDS, and an estimated 4.8 million of these were new HIV infections. Once HIV enters the human bloodstream - through sexual contact, exchange of blood or breast milk - it seeks a host cell in order to reproduce. Although HIV can infect a number of cells in the body, the main target is an immune cell called a lymphocyte - a type of T-cell. Once HIV comes in contact with a T-cell, it attaches itself to it and hijacks its cellular machinery to reproduce thousands of new copies of HIV, see figure 1.1. T-cells are an important part of the immune system because they help facilitate the body's response to many common but potentially fatal infections. Without enough T-cells, the body's immune system is unable to defend itself against infections.

Fig. 1.1 The HIV life-cycle. In (1) HIV encounters a T-cell and gp120 (on the surface of HIV) binds to the T-cell's CD4 molecule. The membranes of the HIV particle and the T-cell fuse, and the contents of the HIV particle are released into the T-cell. In (2) reverse transcription creates a DNA copy of the virus's RNA. In (3) the HIV DNA is transported to the T-cell's nucleus, where another viral enzyme, called integrase, inserts the proviral DNA into the cell's DNA. In (4) HIV's genetic material directs the T-cell to produce new HIV. In (5) and (6) a new HIV particle is assembled.

Being HIV-positive, or being infected with HIV, is not the same as having acquired immune deficiency syndrome (AIDS). Someone with HIV can live for many years with few ill effects, provided that their body replaces the T-cells destroyed by the virus. However, once the number of T-cells diminishes below a certain threshold, infected individuals will start to display symptoms of AIDS. Such individuals have a lowered immune response and are highly susceptible to a wide range of infections that are harmless to healthy people but may prove fatal to them. Indeed, in 2003, HIV/AIDS-associated illnesses caused the deaths of approximately 2.9 million people worldwide, including an estimated 490,000 children younger than 15 years, and since the first AIDS cases were identified in 1981, over 20 million people with HIV/AIDS have died.

1.3 Resistance Testing

There are now a number of antiretroviral drugs approved for treating HIV-1 infection. Treatment with combinations of these drugs can offer individuals prolonged virus suppression and a chance for immunologic reconstruction. However, unless therapy suppresses virus replication completely, the selective pressure of antiretroviral treatment enhances the emergence of drug-resistant variants, see figure 1.2.

Fig. 1.2 The development of drug resistance.

The emergence of these variants depends on the development of genetic variations in the virus that allow it to escape from the inhibitory effects of a drug. We say that a drug-resistant variant is identifiable from a reference virus by the manifestation of mutations that contribute to reduced susceptibility. This occurs through natural mechanisms. In particular, HIV-1 genetic variability results from the inability of the HIV-1 reverse transcriptase to proofread nucleotide sequences during replication [3]. This is compounded by the high rate of HIV-1 replication (approximately 10 billion particles/day), a spontaneous mutation rate of approximately 1 mutation/copy, and genetic recombination when viruses of different sequence infect the same cell. Once a drug-resistant variant has emerged, greater levels of the same antiretroviral drug are required to suppress virus replication. However, with greater levels of drug we increase the risk of adverse side effects and harm to an individual. Therefore, when resistance occurs, patients often need to change to a new drug regimen. To help, we can use resistance tests to show whether a particular HIV-1 strain is likely to be suppressed by a drug or not. Nevertheless, until recently, resistance testing was used solely as a research tool, to investigate the mechanisms of drug failure. In July 1998, the idea of extending the methodology of resistance testing to routine clinical management, although logical, could not be recommended due to a lack of validation, standardisation and a concrete definition of the role of the testing [4]. Since then, however, a number of studies have emerged indicating the worth of resistance testing for clinical management, and the lack of standardisation was addressed with the development of commercial resistance tests. Both of these factors contributed to a second statement, published in early 2000, which explicitly recognised the value of HIV-1 drug resistance testing [5]. Finally, in May 2000, the International AIDS Society recommended the incorporation of drug-resistance testing in the clinical management of patients with HIV-1 infection [6]. Furthermore, considerable data supporting the use of drug-resistance testing has now been published or presented at international conferences [7].

1.4 Phenotype And Genotype Resistance Tests

At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs in vitro, namely phenotyping and genotyping. Phenotyping directly measures the susceptibility of HIV-1 strains to particular drugs, whereas genotyping establishes the absence or presence of specific mutations in HIV-1 that have been previously associated with drug resistance. Phenotype resistance tests involve direct quantification of drug sensitivity: viral replication is measured in cell cultures under the selective pressure of increasing concentrations of antiretroviral drugs. Specifically, a sample of blood is taken from a patient and the HIV is isolated. At this stage the reverse transcriptase and protease genes are recognised and amplified. The amplified genes are then inserted into a laboratory strain of HIV, which the scientists are able to grow (a recombinant virus). The ability of the virus to grow in the presence of each antiretroviral drug is evaluated. In particular, the virus is grown in the presence of varying concentrations of antiretroviral drugs, and its ability to grow is compared to that of a reference strain. The outcome of a phenotypic test may be expressed as an IC50, IC90 or IC95 value, where the IC value expresses the concentration of a particular drug required to inhibit the growth of the virus by 50%, 90% or 95%, respectively, see figure 1.3. The level of drug resistance is typically reported by comparing the IC50 value of the patient's HIV with that of a reference strain. In particular, a degree of fold-change is reported, where fold-change indicates the increase in the amount of drug needed to stop viral replication compared to the reference strain.

In other words, a phenotypic resistance test gives a clear indication of the capability of an HIV-1 strain to grow in the presence of a particular concentration of an antiretroviral drug. These results are easily interpreted, but the tests often prove time consuming, expensive and labour intensive.

Fig. 1.3 Comparing the IC50 values of a reference and a resistant strain (antiviral effect, in %, against drug concentration).

Genotypic resistance tests are based on the analysis of mutations associated with resistance. In a genotypic test the nucleotide sequence of the virus genome is determined by direct sequencing of PCR products. This sequence is then aligned with a reference strain and the differences are recorded. In particular, the results of a genotypic test are given as a list of point mutations with respect to a reference strain. Such mutations are expressed by the position they have in a certain gene, preceded by the letter corresponding to the amino acid seen in the reference virus, and followed by the mutated amino acid. For example, M184V corresponds to the substitution of Methionine by Valine at codon 184. In contrast to phenotypic tests, genotypic tests usually provide results in a few days and are less expensive. However, genotyping is a method that can be viewed as primarily measuring the likelihood of reduced susceptibility, with the major challenge being the correct interpretation of the results in order to associate a realistic level of drug-resistance. The specific type and placement of the mutations determines which drugs the virus may be resistant to. For example, if the M184V mutation is discovered in a patient's HIV, the virus is probably resistant to the reverse transcriptase inhibitor lamivudine.

In this respect, clinicians often utilise tables of known mutations attributed to drug-resistance, see figure 1.4. However, this is not a simple task, because the clinician cannot consider each mutation independently of the others: the influence of a mutation on drug resistance must be considered as part of a global interaction [8]. In addition, it is not viable to continue to use such tables of mutations, because their complexity must grow as the number of drugs, and especially drug combinations, increases.

Fig. 1.4 Mutations in the protease associated with reduced susceptibility to protease inhibitors.

1.5 Literature Review

A number of statistical methods have been used to investigate the relationship between the results of HIV-1 genotypic and phenotypic assays. Cluster analysis, linear discriminant analysis, heuristic scoring metrics, nearest neighbour, neural networks and recursive partitioning have all been used to correlate drug-susceptibility with genotype. However, the problem of relating the results of genotypic and phenotypic assays presents several statistical challenges, and the success of these methods has been mixed. Firstly, phenotype must be considered as a consequence of a large number of possible mutations. As mentioned previously, this is compounded by the fact that the effect of mutations at any given position is influenced by the presence of mutations at other positions, and it is therefore necessary to detect global interactions. The use of cluster analysis and linear discriminant analysis is described in [9]. Investigating drug-resistance mutations of two protease inhibitors, saquinavir (SQV) and indinavir (IDV), the results of these analyses were comparable. In particular, both analyses were able to identify the association of mutations at amino acid positions 10, 63, 71 and 90 with in vitro resistance to SQV and IDV. In elaboration, cluster analysis requires a notion of distance between any two amino acid sequences. Typically a set of indicator variables is created for each amino acid position, and a vector of indicator variables is then used to represent a sequence. The distance between two sequences can then be defined as the standard Euclidean distance between the two corresponding vectors. Such a measure can then be used to create groups of amino acid sequences with similar genotypes. Furthermore, the distance between any two groups can be defined as the average of the individual distances between all pairs of members of the two groups. Creating hierarchies of such groups or clusters then facilitates the investigation of the degree to which similar genotypes have similar phenotypes. On the other hand, linear discriminant analysis can be used to determine which mutations best predict sensitivity to drugs, as defined by the phenotypic IC50 value. A dataset consisting of matched phenotype-genotype pairs is split into two groups, labelled as either resistant or susceptible on the basis of an IC50 cut-off. A linear discriminant function is then used to predict which group an unknown genotype belongs to, where the linear discriminant function is simply a linear combination of predictors or indicator variables (the choice here varies).

In contrast, a simple scoring metric is used in [10] to predict HIV-1 protease inhibitor resistance. Here a database of genotype-phenotype pairs was analysed. It was found that samples with one or two mutations in the protease gene were phenotypically susceptible, and samples with five or more mutations were resistant against all protease inhibitors. A list of all the mutations present in the database was compiled and split into two groups: one in which mutations were frequent in both susceptible and resistant samples, and one in which mutations were predominantly present in resistant samples. A scoring system using the presence or absence of any single mutation compiled by Schinazi et al [11] was then used as a criterion for predicting phenotypic resistance from genotype. This was enhanced by the incorporation of a secondary score that simply takes into account the total number of resistance-associated mutations in the protease. This achieved high sensitivity ( %) but lower specificity (57.9% %) on unseen cases. Commercially, the Virco-Tibotec company has, over time, accumulated a database of around 100,000 genotypes and phenotypes. Using this dataset a virtualphenotype™ is generated from genotype by means of a nearest-neighbour strategy. In particular, their system begins by identifying all the mutations in the genotype that can affect resistance to each drug. It uses this profile to interrogate the dataset for previously seen genotypes that have similar profiles. When all the possible matches are identified, the phenotypes for these samples are retrieved and, for each drug, the data is averaged. This generates a report of virtual IC50 values for each drug. In contrast, in [2] artificial neural networks were used to predict lopinavir resistance from genotype. In brief, a neural network is a two-stage regression or classification model that can represent real-valued, discrete-valued and vector-valued functions. Using a set of examples, algorithms such as backpropagation tune the parameters of a fixed network of units and interconnections. For example, backpropagation employs gradient descent to attempt to minimise the error between the network outputs and target values. In [2] two neural network models were developed. The first was based on changes at only 11 amino acid positions in the protease, as described in the literature, and the second was based on 28 amino acid positions resulting from category prevalence analysis. A set of 1322 clinical samples was utilised to train, validate and test the models. Results were expressed in terms of the correlation coefficient R².

In simple terms, the correlation coefficient indicates the extent to which the predicted and true values lie on a straight line. It was found that the 28-mutation model proved to be more accurate than the 11-mutation model at predicting lopinavir drug-resistance (R² = 0.88 against R² = 0.84). Alternatively, decision tree classifiers were generated by means of recursive partitioning in [1]. These models were then used to identify genotypic patterns characteristic of resistance to 14 antiretroviral drugs. In brief, recursive partitioning describes the iterative technique used to construct a decision tree. Recursive partitioning algorithms begin by splitting a population of data into subpopulations by determining an attribute that best splits the initial population. They continue by repeated identification of attributes that best split each resulting subpopulation. Once a subpopulation contains individuals of the same type, no more splits are made and a classification is assigned to that population. An unknown case is then given a classification by sorting it down the tree from root to a leaf, using the attributes as specific tests. Here the initial population was a dataset of matched phenotype-genotype pairs, consisting of 471 clinical samples. For each drug in the study a separate decision tree classifier was constructed to predict phenotypic resistance from genotype using an implementation of recursive partitioning. These models were then assessed using leave-one-out experiments, and prediction errors were found in the range of %.

Chapter 2
Machine Learning

2.1 Overview & The Phenotype Prediction Problem

In general, the field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. Typically, we have a classification, either quantitative or categorical, that we wish to predict based on a set of input characteristics. We have a training set consisting of classifications matched with characteristics. Using this dataset, we strive to build a classification model that will enable us to predict the appropriate outcome for unseen cases. Consider the following definition of a well-posed learning problem, according to Mitchell [12]:

Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

The first step in constructing such a program is to consider the type of training experience from which the system will learn. This is important because the type and availability of training material has a significant impact on the success or failure of the learner. An important consideration (especially applicable to clinical data) is how well the training material represents the entire population. In general, machine learning is most effective when the training material is distributed similarly to the remaining population. However, in a clinical setting, this is almost never entirely the case. When developing systems that learn from clinical data, it is often necessary to learn from a distribution of examples that may be fairly different from those in the remaining population.

Such situations are challenging, because success over one distribution will not necessarily lead to a strong performance over some other distribution, since most machine learning theory rests on the assumption that the distribution of training material is exactly representative of the entire population. However, although not ideal, this situation is often the case when developing practical machine learning systems. The next step is to determine the type of knowledge to be learned and how this will be used by the performance measure. In particular, we seek to define a target function F: C → O, where F accepts as input a set of characteristics C and produces as output some classification O that is true for C. The problem of improving performance P at tasks T is then reducible to finding a function F2 that performs better than a function F1 at tasks T, as measured by P. The final step in building a learning system is to choose an appropriate learning mechanism to derive the target function F. This includes determining the representation of the function F, whether it should be a decision tree, neural network, instance-based, concept-based, linear function etc. Once a representation is decided upon, an appropriate algorithm must then be employed to generate such representations from the training experience E. Many machine learning problems can be specified using this framework; in particular, we can define a machine learning problem by specifying the task T, performance measure P, training experience E and target function F. In this way, the generalised phenotype prediction problem can be defined as follows:

Task T: the correct interpretation of genotype sequence information with respect to drug resistance.
Performance measure P: the percentage of sequences correctly classified as resistant or susceptible to a particular drug.
Training experience E: a dataset of matched phenotype-genotype pairs.
Target function: Resistant: Genotype → Drug_Susceptibility_Classification, where Resistant accepts as input the result of a genotype resistance test, Genotype, and outputs a drug susceptibility classification, Drug_Susceptibility_Classification, indicating the susceptibility of a genotype to a particular antiretroviral drug.
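To make this specification concrete, it can be written down as a minimal Java interface. This is an illustrative sketch only; the thesis does not prescribe these type or member names.

public interface ResistanceClassifier {
    enum Susceptibility { RESISTANT, SUSCEPTIBLE }

    // The target function Resistant: Genotype -> Drug_Susceptibility_Classification.
    // genotype[i] holds the single-letter amino acid code at sequence position i + 1.
    Susceptibility classify(char[] genotype);
}

Each learning mechanism discussed below (decision trees, neural networks, nearest neighbour) can then be viewed as a different way of implementing this one method from the training experience E.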

2.2 The Training Experience, E

The first step towards addressing the phenotype prediction problem is to acquire a suitable dataset of matched phenotype-genotype pairs. In [1] 471 clinical samples from 397 patients were analysed, first hand, to provide both phenotypic and genotypic information for six nucleoside inhibitors of the reverse transcriptase, zidovudine (ZDV), zalcitabine (ddC), didanosine (ddI), stavudine (d4T), lamivudine (3TC) and abacavir (ABC); three non-nucleoside inhibitors of the reverse transcriptase, nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV); and five protease inhibitors, saquinavir (SQV), indinavir (IDV), ritonavir (RTV), nelfinavir (NFV) and amprenavir (APV). This resulted in phenotype-genotype pairs for each drug, except for APV. In detail, genotype results were obtained through direct sequencing of the patients' HIV, covering the complete protease and the first nucleotides of the reverse transcriptase. These sequences were then aligned to the reference strain HXB2 to identify differences. Each genotype was then modelled in a computationally useable manner using one attribute for each amino acid position, allowing as a value a single-letter amino acid code, or unknown for positions in which ambiguous or no sequence information was available, see figure 2.1.

Fig. 2.1 Modelling genotype. Each sequence position is represented by either the amino acid present at that position or unknown if no sequence information is available.

Phenotyping was performed using a recombinant virus assay. Recombinant viruses were cultivated in the presence of increasing amounts of antiretroviral drugs, and fold-change values were calculated by dividing the IC50 value of the relevant recombinant virus by the IC50 value of a reference strain (NL4-3). Fold-change values were distributed as in figure 2.2.

Fig. 2.2 Frequency distribution of fold-change values for a subset of 271 samples for which data was available for all 14 drugs.

Using the fold-change values, the dataset of genotypes was grouped into two classes for each antiretroviral drug. In particular, a genotype was labelled as either resistant or susceptible, using a drug-specific fold-change threshold, see figure 2.3. These thresholds were obtained from previously published work: 8.5 for ZDV, 3TC, NVP, DLV, EFV; 2.5 for ddC, ddI, d4T and ABC; and 3.5 for SQV, IDV, RTV, NFV, APV.

Fig. 2.3 Characteristics of the dataset: the number of phenotype-genotype pairs and the percentage of examples classified resistant for each of the 14 drugs (ZDV, ddC, ddI, d4T, 3TC, ABC, NVP, DLV, EFV, SQV, IDV, RTV, NFV, APV).
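A minimal Java sketch of this labelling step, using the thresholds quoted above. The class and method names are illustrative, and treating a fold-change exactly equal to the threshold as resistant is an assumption; the text does not specify how ties are handled.

import java.util.Map;

public class FoldChangeLabeller {
    // Drug-specific fold-change thresholds from previously published work,
    // as listed in the text above.
    private static final Map<String, Double> THRESHOLDS = Map.ofEntries(
            Map.entry("ZDV", 8.5), Map.entry("3TC", 8.5), Map.entry("NVP", 8.5),
            Map.entry("DLV", 8.5), Map.entry("EFV", 8.5),
            Map.entry("ddC", 2.5), Map.entry("ddI", 2.5), Map.entry("d4T", 2.5),
            Map.entry("ABC", 2.5),
            Map.entry("SQV", 3.5), Map.entry("IDV", 3.5), Map.entry("RTV", 3.5),
            Map.entry("NFV", 3.5), Map.entry("APV", 3.5));

    // A sample is labelled resistant when its fold-change reaches the drug's threshold.
    public static String label(String drug, double foldChange) {
        return foldChange >= THRESHOLDS.get(drug) ? "resistant" : "susceptible";
    }
}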

In contrast to the more direct approach presented above, one may obtain both phenotypic and genotypic information via online databases. In recent years, databases consisting of thousands of matched phenotype-genotype pairs have emerged in response to a concern about the lack of publicly available HIV reverse transcriptase and protease sequences. One such database, the Stanford HIV-1 drug resistance database, strives towards linking HIV-1 reverse transcriptase and protease sequence data, drug treatment histories and drug susceptibilities, to allow researchers to analyse the extent of clinical cross-resistance among current and experimental antiretroviral drugs. Other online databases also provide access to HIV reverse transcriptase and protease sequences with drug-susceptibility data.

2.3 Learning Mechanisms

The final step in addressing the phenotype prediction problem is to choose an appropriate learning mechanism. This includes choosing an appropriate representation of the function Resistant: Genotype → Drug_Susceptibility_Classification, and an algorithm to automatically derive such a representation from the training experience. There are many choices available; decision trees, artificial neural networks and instance-based learning are just a few.

2.3.1 Decision Trees

In [1] decision trees were used to represent the target function Resistant: Genotype → Drug_Susceptibility_Classification for 14 antiretroviral drugs. Specifically, decision trees were generated from a set of phenotype-genotype pairs (described previously in section 2.2) using the C4.5 software package and a statistical measure indicating the amount of information that a particular sequence position provides about differentiating resistant from susceptible samples.

A decision tree is an acyclic graph with interior vertices symbolising tests to be carried out on a characteristic or attribute, and leaves indicating classifications. Decision trees classify unseen cases by sorting them through the tree from the root to some leaf, which provides a classification. In other words, each node in the graph symbolises a test of some attribute, and each edge departing from that node corresponds to one of the possible values for the attribute. An unseen case is then classified by starting at the root node of the graph, testing the attribute specified by this node and traversing the edge corresponding to the value of the attribute in the given case. This process is repeated for the sub-graph rooted at the new node, until a leaf node is encountered, at which point a classification is assigned to the case. Specifically, in [1] the classification of a genotype was achieved by traversing a decision tree, associated with an antiretroviral drug, from the root to a leaf according to the values of amino acids at specific sequence positions, see figure 2.4.

Fig. 2.4 Decision tree for the protease inhibitor SQV as presented in [1]. Nodes represent tests at specific amino acid positions (the root tests position 90), edges represent amino acid values and leaves represent drug-susceptibility classifications.
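A tree of this kind maps directly onto a recursive data structure. The following Java sketch is illustrative only; field and method names are my own, and the actual implementation is described in chapter 3 and Appendix B.

import java.util.Map;

public class DecisionTreeNode {
    final Integer position;                          // amino acid position tested; null at a leaf
    final String classification;                     // "resistant" or "susceptible" at a leaf; null otherwise
    final Map<Character, DecisionTreeNode> children; // one outgoing edge per amino acid value

    DecisionTreeNode(Integer position, String classification,
                     Map<Character, DecisionTreeNode> children) {
        this.position = position;
        this.classification = classification;
        this.children = children;
    }

    // Sorts a genotype from the root to a leaf, as in figure 2.4. Returns
    // "unknown" when an amino acid value has no matching edge (see section 3.2).
    String classify(char[] genotype) {
        if (classification != null) return classification;           // reached a leaf
        DecisionTreeNode next = children.get(genotype[position - 1]); // positions are 1-based
        return next == null ? "unknown" : next.classify(genotype);
    }
}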

In more general terms, a decision tree represents a disjunction of conjunctions of constraints on the attribute values. Each path from the root of the graph to a leaf translates to a conjunction of attribute tests, and the graph itself translates to a disjunction of these conjunctions. For example, the decision tree in figure 2.4 can be translated into the expression

(90 = F) \/ (90 = M) \/ (90 = L /\ 48 = V) \/ (90 = L /\ 48 = G /\ 54 = L) \/ (90 = L /\ 48 = G /\ 54 = I /\ 84 = V) \/ (90 = L /\ 48 = G /\ 54 = V /\ 72 = I)

representing the knowledge that the decision tree uses to determine resistance. Most algorithms that have been developed for generating decision trees offer slight variations on a core methodology that uses a greedy search through a space of possible decision trees. In this way we can think of decision tree learning as involving the search of a very large space of possible decision tree classifiers to determine the one that best fits the training data. Both the ID3 and C4.5 algorithms typify this approach. The basic decision tree-learning algorithm, ID3, performs a greedy search for a decision tree that fits the training data. In summary, it begins by asking: which attribute should be tested at the root of the tree? Each possible attribute is evaluated using a statistical test to determine how well it alone classifies the complete set of training data. A descendant of the root is created for each possible value of the attribute that is chosen, and the training data is sorted to the appropriate descendant nodes. The procedure is then repeated for each of the descendant nodes until all the training data is correctly classified. The ID3 algorithm is given in table 2.1. The central choice in the ID3 algorithm is which attribute to test at each node. In order to select an appropriate attribute we can employ a statistical measure to quantify how well an attribute separates a set of data according to a finite alphabet of possible outcomes. A popular measure is called information gain. The information gain of an attribute A, relative to a dataset S, is defined as

ig(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|S_v| / |S|) · entropy(S_v)

where values(A) is the set of all possible values for attribute A, S_v is the subset of S for which attribute A has value v, and entropy(S) measures the impurity of the set S, such that the entropy is 0 if all the members of S belong to the same class and, conversely, the entropy is 1 if the members of S are equally distributed between the classes, see figure 2.5.

ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples.
- Create a Root node for the tree.
- If all Examples are positive, return the single-node tree Root, with label = +
- If all Examples are negative, return the single-node tree Root, with label = -
- If Attributes is empty, return the single-node tree Root, with label = most common value of Target_attribute in Examples.
- Otherwise begin:
  o A ← the attribute from Attributes that best classifies Examples
  o The decision attribute for Root ← A
  o For each possible value, v_i, of A:
    - Add a new tree branch below Root, corresponding to the test A = v_i
    - Let Examples_vi be the subset of Examples that have value v_i for A
    - If Examples_vi is empty, then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples
    - Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A})
- Return Root
Algorithm taken from Mitchell's book [12]

Table 2.1 The ID3 algorithm

Specifically, if a set S contains a number of examples belonging to either a positive or a negative class, the entropy of S relative to these two classes is defined as:

entropy(S) = −p(+) log2 p(+) − p(−) log2 p(−)

where p(+) is the proportion of examples in S belonging to the positive class and p(−) is the proportion of examples in S belonging to the negative class. In this respect, ig(S, A) measures the expected reduction in the impurity of a set of examples S according to the attribute A. In other words, we wish to select attributes with high values for information gain.
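The two measures are straightforward to compute. A minimal Java sketch for the two-class (resistant/susceptible) case follows; the class and method names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationGain {
    // entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-); labels holds one boolean per example.
    static double entropy(List<Boolean> labels) {
        double p = labels.stream().filter(b -> b).count() / (double) labels.size();
        if (p == 0.0 || p == 1.0) return 0.0;               // a pure set has zero entropy
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // ig(S, A) = entropy(S) - sum over v in values(A) of |S_v|/|S| * entropy(S_v),
    // where attributeValues holds the value of attribute A for each example in S.
    static double informationGain(List<Boolean> labels, List<Character> attributeValues) {
        Map<Character, List<Boolean>> partitions = new HashMap<>();
        for (int i = 0; i < labels.size(); i++)
            partitions.computeIfAbsent(attributeValues.get(i), v -> new ArrayList<>())
                      .add(labels.get(i));
        double expected = partitions.values().stream()
                .mapToDouble(sv -> sv.size() / (double) labels.size() * entropy(sv))
                .sum();
        return entropy(labels) - expected;
    }

    private static double log2(double x) { return Math.log(x) / Math.log(2); }
}

ID3 then simply evaluates informationGain once per candidate amino acid position and splits on the position with the largest value.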

Fig. 2.5 The entropy function relative to a boolean classification.

A major drawback of the ID3 algorithm is that it continues to grow a decision tree until all the training examples are perfectly classified. This approach, although reasonable, can lead to difficulties when either there is erroneous data in the training set or the distribution of the training examples is not representative of the entire population. In these cases, ID3 produces decision trees that overfit the training examples. A decision tree is said to overfit the training examples if there exists some other decision tree that classifies the training examples less accurately but nevertheless performs better over the entire population. Figure 2.6 illustrates the impact of overfitting.

Fig. 2.6 Overfitting in decision tree learning (accuracy on the training data and on the validation data, against the size of the tree in number of nodes). When ID3 creates a new node, the accuracy of the tree measured on the training examples increases. However, when measured on the testing examples, the accuracy of the tree decreases as its size increases.

Overfitting is a significant problem for decision tree learning, and indeed machine learning as a whole. Specifically, overfitting has been found to decrease the accuracy of learned decision trees by 10-25% [13]. However, there are a number of techniques available that minimise the effects of overfitting in decision tree learning. Typically, we begin by randomly partitioning the training data into two subsets, one for training and one for validation. Using the training set we grow an overfitted decision tree and then post-prune it using the validation set. Post-pruning, in this respect, has the effect that any nodes added due to coincidental regularities in the training set are likely to be removed, because the same regularities are unlikely to occur in the validation set. Reduced error pruning is one post-pruning strategy that uses a validation set to minimise the effects of overfitting. Here each node in the decision tree is considered as a candidate for pruning. The removal of a node is determined by how well the reduced tree classifies the examples in the validation set. In particular, if the reduced tree performs no worse than the original over the validation set, then the node is removed, making it a leaf and assigning it the most common classification of the training examples associated with that node. Figure 2.7 illustrates the impact of reduced-error pruning.

Fig. 2.7 The impact of reduced-error pruning (accuracy on the training data, the validation data, and the validation data during pruning, against the size of the tree in number of nodes).

The C4.5 algorithm is a successor of the basic ID3 algorithm. C4.5 behaves in the same way as ID3 but addresses a number of additional issues. In particular, ID3 was criticised for not offering support for continuous attributes and for not handling training data with missing attribute values.

Where the basic ID3 algorithm is restricted to attributes that take on a discrete set of values, C4.5 converts continuous values into a set of discrete values by separating the data into a number of bins and attaching a label to each bin. Considering the dataset presented in section 2.2, the ability to handle continuous values isn't required: each genotype is modelled using a number of attributes (amino acid positions) that have discrete values (single-letter amino acid codes). C4.5 handles attributes with missing values by assigning a probability to each of the possible values of an attribute, based on the observed frequencies of the various values. These fractional proportions are then used both to grow the tree and to classify unseen cases with missing attribute values. Again, considering the dataset presented in section 2.2, the ability to handle attributes with missing values isn't a problem, because that dataset explicitly models missing attribute values using a value of unknown. Once continuous values and missing attribute values are handled, the C4.5 algorithm uses a greedy search (similar to ID3) to find a decision tree that exactly conforms to the training data, and uses a post-pruning strategy, called rule post-pruning, to minimise the effects of overfitting. In particular, rule post-pruning involves: inferring an overfitted decision tree from the training data, converting the decision tree into a set of classification rules (conjunctions of constraints on attribute values) and removing any constraints on attribute values whose removal improves a rule's estimated accuracy.

2.3.2 Artificial Neural Networks

In [2] two artificial neural networks were independently used to represent the target function Resistant: Genotype → Drug_Susceptibility_Classification for the protease inhibitor lopinavir. Specifically, two artificial neural networks were generated from a set of 1322 phenotype-genotype pairs using the backpropagation algorithm to learn the parameters of a single-hidden-layer network predicting the susceptibility of lopinavir from genotype. The first network was based on mutations at 11 amino acid positions that were previously recognised as being attributed to drug-resistance, and the second was based on mutations at 28 amino acid positions identified through statistical analysis. The study of artificial neural networks was initially inspired by the observation that biological learning systems are built from very complex webs of simple computational units called neurons.

Analogously, artificial neural networks are built from densely interconnected sets of units called sigmoid perceptrons, see figure 2.8. A sigmoid perceptron is a simple computational unit that takes as input a vector of numerical values and outputs a continuous function of its inputs, called the sigmoid function. Specifically, given a vector of inputs {v_1, v_2, …, v_n}, a linear combination of these inputs is calculated as inputs = w_0 + w_1 v_1 + w_2 v_2 + … + w_n v_n, where each w_i is a numerical constant that weights the contribution of an input v_i. The output of the sigmoid perceptron is then obtained using the sigmoid function:

σ(inputs) = 1 / (1 + e^(−inputs))

Fig. 2.8 A sigmoid perceptron.

The backpropagation algorithm, given in table 2.2, learns appropriate values for each weight w_i in a multilayer network with a fixed number of sigmoid perceptrons and interconnections, in order to give a correct output (classification) for a set of inputs (characteristics). Figure 2.9 illustrates a single-hidden-layer network. Backpropagation uses a training set of matched inputs and outputs and employs gradient descent to minimise the error between the network outputs and the actual outputs of the training set.
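A single unit of this kind amounts to only a few lines of Java. The sketch below is illustrative; the bias is handled as the weight w[0].

public class SigmoidPerceptron {
    private final double[] w;   // w[0] is the bias weight w_0; w[i + 1] weights input v_i

    public SigmoidPerceptron(double[] weights) { this.w = weights; }

    // Computes sigma(w_0 + w_1 v_1 + ... + w_n v_n) with sigma(x) = 1 / (1 + e^-x).
    public double output(double[] v) {
        double net = w[0];
        for (int i = 0; i < v.length; i++) net += w[i + 1] * v[i];
        return 1.0 / (1.0 + Math.exp(-net));   // the sigmoid function
    }
}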

Fig. 2.9 A single-hidden-layer network.

To learn the appropriate weights of a network consisting of a single sigmoid perceptron, gradient descent uses a training set of matched input-output pairs of the form ({v_1, v_2, …, v_n}, t), where {v_1, v_2, …, v_n} is a vector of input values and t is the target output. Gradient descent begins by choosing small arbitrary values for the weights. Weights are then updated for each training example that is misclassified, until all the training examples are correctly classified. A learning rate η is used to determine the extent to which a weight is updated. Specifically, each training example is classified by the perceptron to obtain an output o, and each weight is updated by the rule w_i ← w_i + η(t − o)v_i. This process is then repeated until the perceptron makes no classification errors on the training data. In considering networks with multiple sigmoid units and multiple outputs, we again begin by choosing small arbitrary values for the weights in the network (typically between −0.05 and 0.05) and update them according to the weight update rule w_ij ← w_ij + η δ_j x_ij, where w_ij denotes the weight from unit i to unit j, x_ij denotes the input from unit i into unit j, and δ_j is a term representing the misclassification error for each unit in the network. For an output unit k the term δ_k is computed as o_kv(1 − o_kv)(t_kv − o_kv), where o_kv is the output value associated with the kth output unit and training example v. For a hidden unit h the term δ_h is computed as o_hv(1 − o_hv) Σ_{k ∈ outputs} w_kh δ_k, where o_hv is the output value associated with the hidden unit h and training example v.

BACKPROPAGATION(training_examples, η, n_in, n_out, n_hidden)
- Create a feed-forward network with n_in inputs, n_hidden hidden units and n_out output units.
- Initialise all network weights to small random numbers.
- Until the termination condition is met, Do
  o For each <x, t> in training_examples, Do
    Propagate the input forward through the network:
      Input the instance x to the network and compute the output o_u of every unit u in the network.
    Propagate the errors backward through the network:
      For each network output unit k, calculate its error term δ_k ← o_k(1 − o_k)(t_k − o_k)
      For each hidden unit h, calculate its error term δ_h ← o_h(1 − o_h) Σ_{k ∈ outputs} w_kh δ_k
      Update each network weight w_ij ← w_ij + η δ_j x_ij
Algorithm taken from Mitchell's book [12]

Table 2.2 The backpropagation algorithm

Backpropagation continues to update the weights of a multilayer network in this fashion until all the training examples are correctly classified, or until the error on the training examples falls below some threshold. In the context of the phenotype prediction problem, the target function Resistant: Genotype → Drug_Susceptibility_Classification can be represented using an artificial neural network derived, using the backpropagation algorithm, from a set of training patterns ({v_1, v_2, …, v_n}, t). Here {v_1, v_2, …, v_n} represents the result of a genotype resistance test and t is the corresponding target phenotype. An unseen genotype {v_1, v_2, …, v_n}_q is then classified using the network by propagating the values v_1, v_2, …, v_n through it; in other words, given the input values v_1, v_2, …, v_n, compute the output of every unit in the network. The final output of the network is then an estimate of phenotype.
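The following Java sketch shows one stochastic update of table 2.2 for a single-hidden-layer network. It is illustrative only: bias weights are omitted for brevity, and the array layout (wIn[i][h] from input i to hidden unit h, wOut[h][k] from hidden unit h to output unit k) is my own choice.

public class Backprop {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // One gradient-descent step on a single training example (x, t) with learning rate eta.
    static void update(double[][] wIn, double[][] wOut,
                       double[] x, double[] t, double eta) {
        int nHidden = wIn[0].length, nOut = wOut[0].length;
        double[] h = new double[nHidden], o = new double[nOut];

        // Propagate the input forward through the network.
        for (int j = 0; j < nHidden; j++) {
            double net = 0;
            for (int i = 0; i < x.length; i++) net += wIn[i][j] * x[i];
            h[j] = sigmoid(net);
        }
        for (int k = 0; k < nOut; k++) {
            double net = 0;
            for (int j = 0; j < nHidden; j++) net += wOut[j][k] * h[j];
            o[k] = sigmoid(net);
        }

        // Error terms: delta_k = o_k (1 - o_k)(t_k - o_k) for output units,
        // delta_h = o_h (1 - o_h) sum_k w_kh delta_k for hidden units.
        double[] dOut = new double[nOut], dHid = new double[nHidden];
        for (int k = 0; k < nOut; k++) dOut[k] = o[k] * (1 - o[k]) * (t[k] - o[k]);
        for (int j = 0; j < nHidden; j++) {
            double sum = 0;
            for (int k = 0; k < nOut; k++) sum += wOut[j][k] * dOut[k];
            dHid[j] = h[j] * (1 - h[j]) * sum;
        }

        // Weight updates: w_ij <- w_ij + eta * delta_j * x_ij.
        for (int j = 0; j < nHidden; j++)
            for (int k = 0; k < nOut; k++) wOut[j][k] += eta * dOut[k] * h[j];
        for (int i = 0; i < x.length; i++)
            for (int j = 0; j < nHidden; j++) wIn[i][j] += eta * dHid[j] * x[i];
    }
}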

2.3.3 Instance Based

Nearest-neighbour learning contrasts with learning decision trees and artificial neural networks in that it simply stores the training examples and doesn't attempt to extract an explicit description of the target function. In this way, generalisation beyond the training examples is postponed until a new case must be classified. Specifically, when a new case is presented to a nearest-neighbour classifier, a set of similar cases is retrieved from the training set (using a similarity measure) and used to classify the new case. A nearest-neighbour classifier thus constructs a different approximation to the target function for each query, based on a collection of local approximations. In regards to the phenotype prediction problem, this is a significant advantage, as the target function Resistant: Genotype → Drug_Susceptibility_Classification may be very complex, due to the fact that each mutation may be part of a global interaction exhibiting large interdependence. However, nearest-neighbour classifiers typically consider all the attributes of each case when making predictions, and if the target function actually depends on only a small subset of these attributes, then cases that are truly most similar may well be deemed to be unrelated. In the context of the phenotype prediction problem this is problematic, and extra care should be given to the design of a similarity measure. Another limitation of nearest-neighbour methods is that nearly all computation is deferred until a new classification is required, so the total computational cost of classifying a new case can be high. The k-nearest-neighbour methodology is the most basic nearest-neighbour algorithm available. It assumes that all n-attribute cases correspond to locations in an n-dimensional space, called the feature space. The nearest neighbours of an unseen case are then retrieved using the standard Euclidean distance. Specifically, each case is described using a vector of features, and the distance between two cases c_i and c_j is defined to be

d(c_i, c_j) = √( Σ_{r=1}^{n} (a_r(c_i) − a_r(c_j))² )

where a_r(c) denotes the value of the rth feature of case c. Using such a measure, k-nearest neighbour approximates the classification of an unseen case, c_q, by retrieving the k cases c_1, c_2, …, c_k that are closest to c_q and assigning it the most common classification associated with the cases c_1, c_2, …, c_k.

For example, if k = 1, then 1-nearest neighbour assigns to c_q the classification associated with c_i, where c_i is the training case closest to c_q. Figure 2.10 illustrates the operation of the k-nearest-neighbour algorithm.

Fig. 2.10 K-nearest neighbour. Here each case is represented using a 2-dimensional feature vector, and cases are classified as either positive or negative. 1-nearest neighbour classifies c_q as positive, whereas 5-nearest neighbour classifies c_q as negative.

In terms of the phenotype prediction problem, k-nearest neighbour would approximate the phenotype of an unseen genotype g_q by retrieving the k genotypes g_1, g_2, …, g_k that are closest to g_q and assigning it a classification of drug-susceptibility using the phenotypes associated with the genotypes g_1, g_2, …, g_k, with the similarity or distance between two genotypes being determined using an appropriate distance measure. One possibility for this distance measure is simply to apply the Euclidean distance as described above. Here we model each genotype as a vector of single-letter amino acid codes. In this way, every sequence position is considered when determining the distance between two genotypes. As mentioned, this may be problematic because it treats each sequence position as being equally important, neglecting to highlight the importance of amino acid changes at specific sequence positions. Commercially, the Virco-Tibotec company employs a k-nearest-neighbour approach to predict drug-susceptibility from genotype, see figure 2.11. Here the distance between two genotypes is based on the comparison of their profiles, where a profile can be thought of as a feature vector containing all the mutations present in a genotype previously associated with drug-resistance. Here we do not consider every sequence position in the distance measure, as above, but rather only a subset of sequence positions. However, this method relies on previous knowledge of mutations associated with drug-resistance and fails to consider amino acid changes beyond this set.

Fig. 2.11 How Virco generates a virtualphenotype.
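A minimal Java sketch of the k-nearest-neighbour procedure over numeric feature vectors (illustrative names; a real genotype classifier would substitute one of the sequence distance measures discussed in this section):

import java.util.Arrays;
import java.util.Comparator;

public class KNearestNeighbour {
    // Standard Euclidean distance between two feature vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int r = 0; r < a.length; r++) sum += (a[r] - b[r]) * (a[r] - b[r]);
        return Math.sqrt(sum);
    }

    // Returns the majority label among the k training cases closest to the query.
    static boolean classify(double[][] cases, boolean[] labels, double[] query, int k) {
        Integer[] idx = new Integer[cases.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(cases[i], query)));
        int positives = 0;
        for (int i = 0; i < k; i++) if (labels[idx[i]]) positives++;
        return positives * 2 > k;   // majority vote over the k nearest cases
    }
}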

An alternative measure that doesn't presume any previous knowledge of drug-resistant mutations constructs a feature vector using a reference strain. Specifically, a feature vector for a genotype is constructed by comparing its complete sequence with a reference sequence. For positions in which there is no change in amino acid, the feature vector is augmented to contain a dummy value, β, and for other positions the feature vector is augmented to contain the amino acid present in the genotype sequence. In other words, the feature vector of a genotype represents a pattern of deviations from a reference sequence. We then compute a similarity score based on how well two feature vectors conform. For positions in which both vectors contain non-dummy values, we compute the percentage of these that are different. Also, an additional factor is included that represents the percentage of the remaining positions that are different. Figure 2.12 illustrates this approach. Another possible similarity measure that again doesn't presume any previous knowledge of drug-resistance mutations is derived through the comparison of two dot-plots. A dot plot is a visual representation of the similarities between two sequences. Each axis of a rectangular array represents one of the two sequences to be compared, and at every point in the array where the two sequences are identical a dot is placed (i.e. at the intersection of every row and column that have the same amino acid in both sequences).

A diagonal stretch of dots indicates regions where the two sequences are similar. Using such a representation of sequence similarity, we can construct a dot plot for each genotype in the training set in relation to a reference sequence. Similarly, a dot-plot can be constructed for a query genotype using the same reference sequence. The distance between two genotypes could then be estimated by comparing the similarity of the two dot-plots. Figure 2.13 illustrates this approach.

distance(g1, g2) = mutagreement(g1, g2) + diffs(g1, g2), where
mutagreement(g1, g2) = (no. of shared mutations that are different) / (total no. of shared mutations)
diffs(g1, g2) = (no. of unshared mutations) / (length of sequence)

Fig. 2.12 Comparing two feature vectors. Each genotype is represented as a pattern of deviations from a reference strain, with the dummy value β marking positions that match the reference.

Fig. 2.13 Comparing dot-plots. (i) A dot plot obtained by comparing the reference sequence, A, with itself. (ii) A dot plot obtained by comparing the reference sequence with a sequence, B, with a small subset of mutations. (iii) A dot plot obtained by comparing the reference sequence with a sequence, C, with a large number of mutations. The similarity of B and C is determined by how well (ii) and (iii) conform.
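A minimal Java sketch of the feature-vector distance from figure 2.12. The dummy character chosen for β and the handling of the case with no shared mutations are assumptions; the thesis implementation is described in chapter 3.

public class FeatureVectorDistance {
    static final char BETA = '.';   // dummy value marking positions that match the reference

    static double distance(char[] g1, char[] g2) {
        int shared = 0, sharedDifferent = 0, unshared = 0;
        for (int p = 0; p < g1.length; p++) {
            boolean m1 = g1[p] != BETA, m2 = g2[p] != BETA;
            if (m1 && m2) {                        // both genotypes mutated at this position
                shared++;
                if (g1[p] != g2[p]) sharedDifferent++;
            } else if (m1 || m2) {                 // a mutation present in only one genotype
                unshared++;
            }
        }
        double mutAgreement = shared == 0 ? 0 : sharedDifferent / (double) shared;
        double diffs = unshared / (double) g1.length;
        return mutAgreement + diffs;               // distance(g1, g2) as in figure 2.12
    }
}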

Chapter 3
Materials & Methods

3.1 Data Set

The complete HIV-1 reverse transcriptase and protease drug susceptibility data sets from the Stanford HIV-1 drug resistance database were utilised to determine viral genotype and drug susceptibility to five nucleoside inhibitors of the reverse transcriptase, zalcitabine (ddC), didanosine (ddI), stavudine (d4T), lamivudine (3TC) and abacavir (ABC); three non-nucleoside reverse transcriptase inhibitors, nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV); and six protease inhibitors, saquinavir (SQV), lopinavir (LPV), indinavir (IDV), ritonavir (RTV), nelfinavir (NFV) and amprenavir (APV). Using a simple text parser (implementation described in Appendix A), I obtained phenotype-genotype pairs for each of these drugs, see table 3.2. In addition, I would have liked to reuse the dataset of phenotype-genotype pairs used in [1], but although the results of genotyping were deposited in GenBank (accession numbers AF to AF347605), no drug susceptibility information was attached to them. Furthermore, the Stanford dataset contained no drug-susceptibility information for the reverse transcriptase inhibitor zidovudine, which was also included in [1]. The Stanford HIV-1 reverse transcriptase and protease drug susceptibility data sets can be downloaded from the database in a plain text format, see figure 3.1.

SeqID  SubType  Method     SQVFold  SQVFoldMatch  MutList
7439   B        Virologic  47.4     =             ...I, 24FL, 37D, 46I, 53L, 60E, 63P, 71IV, 73S, 77I, 90M, 93L
7443   B        Virologic  ...      =             10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A
7459   B        Virco      15.0     =             10I, 19Q, 35D, 48V, 63P, 69Y, 71T, 90M, 93L
...    B        Virologic  na       na            ...I, 13V, 41K
...    B        Virologic  na       na            ...L
...    B        Virologic  ...      =             ...I, 15V, 20M, 35D, 36I, 54V, 57K, 62V, 63P, 71V, 73S

Fig. 3.1 The HIV-1 protease drug-susceptibility dataset obtained from the Stanford HIV resistance database.

For each sample presented in the Stanford dataset we are given a fold-change value (phenotype) for each drug, and a list of mutations identified in comparison to a reference strain (genotype). See Appendix C for a list of the sequences included in the protease and reverse transcriptase datasets. Using this information, I was able to model each sample in a similar way to that presented in section 2.2. Specifically, I modelled each protease sample using one attribute for each of its 99 amino acids, allowing as a value one of the 20 naturally occurring amino acids. Similarly, each reverse transcriptase sample was modelled using 440 attributes representing the first 440 amino acids of the reverse transcriptase. To acquire the values for each attribute, the list of mutations was used in conjunction with the reference strain HXB2. In particular, for each sample, the original sequence was reconstructed by substituting each mutation into the protease or reverse transcriptase genes of HXB2 (GenBank accession number K03455). In this way, each sequence that was constructed contained no gaps or areas with loss of sequence information, and so an attribute value of unknown was not required. Figure 3.2 illustrates this process.

Fig. 3.2 Modelling the data. A list of mutations in the protease gene (e.g. 10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A, from sample 7443) is used in conjunction with the reference sequence HXB2 to obtain the value of each attribute (e.g. position 17 = E); the fold-change value for saquinavir (SQV) supplies the phenotype.

Using the drug-specific fold-change values associated with each sample, the dataset of genotypes was grouped into two classes; fold-change values were distributed as in table 3.1. In particular, a genotype was labelled as either resistant or susceptible according to a drug-specific fold-change threshold. An important consideration here is how the choice of thresholds affects the distribution of resistant and susceptible samples: by varying the thresholds we include or exclude certain samples from either the resistant or the susceptible class. However, at this time, the same thresholds were used as described in section 2.2. Grouping the dataset of genotypes using these thresholds produced the datasets described in table 3.2.
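The labelling step itself is a one-line comparison per drug. A sketch follows, in which the threshold values are placeholders rather than the drug-specific cut-offs actually taken from [1].

    import java.util.Map;

    // Sketch of the class-labelling step. The cut-offs below are placeholder
    // values for illustration; the study reused the thresholds defined in [1].
    public class Labeller {

        static final Map<String, Double> THRESHOLD =
                Map.of("SQV", 3.0, "NFV", 3.0, "3TC", 8.5); // placeholders

        static String label(String drug, double foldChange) {
            return foldChange >= THRESHOLD.get(drug) ? "resistant" : "susceptible";
        }

        public static void main(String[] args) {
            System.out.println(label("SQV", 47.4)); // sample 7439 of figure 3.1
        }
    }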

Drug  Percentage of samples per fold-change bin (lowest to highest)
ddC   22%  57%  16%  2%   2%   0.8%  0.2%  0%   0%    0%    0%    0%
ddI   22%  61%  10%  4%   3%   0%    0%    0%   0%    0%    0%    0%
d4T   32%  49%  13%  3%   1%   0.2%  0%    0%   0%    0%    0%    0%
3TC   11%  18%  8%   5%   6%   16%   18%   16%  0%    0.1%  0.1%  0%
ABC   21%  30%  31%  13%  4%   1%    0%    0%   0%    0%    0%    0%
NVP   32%  27%  11%  3%   2%   6%    6%    4%   5%    4%    0%    0%
DLV   33%  26%  12%  5%   3%   5%    5%    4%   3%    0.5%  0%    0%
EFV   41%  23%  7%   4%   4%   5%    5%    4%   4%    2%    0.2%  0.4%
SQV   32%  26%  11%  7%   7%   8%    4%    2%   2%    1%    0%    0%
LPV   24%  21%  13%  13%  10%  10%   8%    1%   0%    0%    0%    0%
IDV   20%  28%  14%  14%  13%  7%    2%    0.7% 0.5%  0%    0%    0.1%
RTV   24%  24%  12%  9%   9%   9%    9%    4%   0%    0%    0%    0%
NFV   11%  22%  11%  12%  15%  15%   7%    3%   0.8%  2%    0.1%  0%
APV   34%  29%  19%  9%   5%   2%    0.7%  0.3% 0.3%  0%    0%    0%

Table 3.1 Frequency distribution of fold-change values. Resistance factors are grouped into equidistant bins, from the lowest fold-change bin at the left up to >2047 at the right.

Drug  No. of phenotype-genotype pairs  Percentage of examples classified resistant
ZDV   -                                -
ddC   …                                …%
ddI   …                                …%
d4T   …                                …%
3TC   …                                …%
ABC   …                                …%
NVP   …                                …%
DLV   …                                …%
EFV   …                                …%
SQV   …                                …%
LPV   …                                …%
IDV   …                                …%
RTV   …                                …%
NFV   …                                …%
APV   …                                …%

Table 3.2 Percentage of resistant examples (no susceptibility data were available for zidovudine).

3.2 Decision Trees

For each drug I derived a decision tree classifier to identify genotypic patterns characteristic of resistance (see Appendix B for a description of the implementation). Each dataset of phenotype-genotype pairs was partitioned into a training, validation and test set: 56% of the complete dataset was randomly selected for training, 14% for validation, and the remaining 30% formed the test set. Genotypes were labelled as either resistant or susceptible according to a drug-specific threshold, as defined in [1].

Decision trees were generated using the ID3 algorithm, as described in Chapter 2. A training set was recursively split according to attribute values (the amino acids at each sequence position) until all the examples in the training set were perfectly classified. Attributes were selected based on maximal information gain, representing the amount of information that an amino acid position provides about differentiating resistant from susceptible genotypes. Trees were pruned using reduced error pruning to minimise the effects of overfitting; this gave rise to confidence factors associated with certain leaves, estimated by the fraction of training examples incorrectly classified by the pruned tree.

Classification of an unseen genotype (a list of mutations) was achieved by sorting it through a drug-specific decision tree from the root to a leaf, according to the values (single-letter amino acid codes) of its attributes (amino acid positions). If a genotype could not be completely sorted through the tree in this manner (i.e. it contained an attribute value not recognised by the tree), it was classified as unknown. In addition, the classification of a nucleotide sequence was achieved through a two-step process: each codon in the sequence was first translated into a single-letter amino acid code using the standard genetic code (see figure 3.3), thus constructing a set of attribute values, which were then classified as above.

For each classification an explanation was generated by recording the path followed through the decision tree; this was translated into natural language for readability, see figure 3.4 for an example.

Fig. 3.3 The genetic code.

List of mutations:
Found I at position 10,
Found A at position 71,
Found V at position 48,
** Reached decision: example is resistant. (8[0%]) **

Fig. 3.4 Generating an explanation.
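The attribute-selection step of section 3.2 can be sketched as follows: the entropy of the class labels is computed before and after splitting on an amino acid position, and the position with the largest reduction (the information gain) is chosen. Names and the example data are illustrative.

    import java.util.*;

    // Sketch of ID3's attribute selection by maximal information gain.
    public class InfoGain {

        record Example(char[] residues, boolean resistant) {}

        // Entropy of the resistant/susceptible labels in S.
        static double entropy(List<Example> s) {
            if (s.isEmpty()) return 0.0;
            double p = s.stream().filter(Example::resistant).count() / (double) s.size();
            double h = 0.0;
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
            if (p < 1) h -= (1 - p) * Math.log(1 - p) / Math.log(2);
            return h;
        }

        // Information gain of splitting S on the amino acid at `position`.
        static double gain(List<Example> s, int position) {
            Map<Character, List<Example>> parts = new HashMap<>();
            for (Example e : s)
                parts.computeIfAbsent(e.residues()[position], k -> new ArrayList<>()).add(e);
            double remainder = 0.0;
            for (List<Example> part : parts.values())
                remainder += (part.size() / (double) s.size()) * entropy(part);
            return entropy(s) - remainder;
        }

        static int bestPosition(List<Example> s, int length) {
            int best = 0;
            double bestGain = -1.0;
            for (int pos = 0; pos < length; pos++) {
                double g = gain(s, pos);
                if (g > bestGain) { bestGain = g; best = pos; }
            }
            return best; // 0-based index of the most informative position
        }

        public static void main(String[] args) {
            List<Example> s = List.of(
                    new Example("MV".toCharArray(), true),
                    new Example("MM".toCharArray(), false),
                    new Example("LV".toCharArray(), true),
                    new Example("LM".toCharArray(), false));
            System.out.println(bestPosition(s, 2)); // 1: the second position separates perfectly
        }
    }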

3.3 Nearest-Neighbour

Using the same training, validation and test sets as for the decision trees, I created a k-nearest-neighbour classifier that approximates the drug susceptibility of a genotype from the k most similar genotypes in the combined training and validation datasets of each antiretroviral drug. A feature vector for each genotype in these datasets was constructed by comparing each attribute value with a reference sequence: for positions at which the amino acid was unchanged, the feature vector holds a dummy value, β; for all other positions it holds the amino acid present in the genotype sequence.

To classify an unseen genotype, the 3 most similar genotypes (an odd number, guaranteeing a majority) were retrieved from the training and validation datasets using a similarity measure, and the drug susceptibility of the unseen genotype was approximated as the majority classification of the retrieved genotypes. For each classification, an explanation consisting of the three closest genotypes was generated.

The similarity of two genotypes was defined by how well their feature vectors conform. In particular, for positions at which both feature vectors contain non-dummy values, I used the percentage of values that differ as an indicator of similarity, and added to this score the percentage of the remaining positions that differ.
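A sketch of the retrieval-and-vote step follows; the distance measure is passed in as a parameter so that, for example, the feature-vector measure sketched in Chapter 2 can be plugged in. All names are illustrative.

    import java.util.*;

    // Sketch of k-nearest-neighbour classification: retrieve the k closest
    // training genotypes and return the majority drug-susceptibility label.
    public class KnnClassifier {

        record Neighbour(double dist, boolean resistant) {}

        interface Metric {
            double distance(Map<Integer, Character> a, Map<Integer, Character> b);
        }

        static boolean classify(Map<Integer, Character> query,
                                List<Map<Integer, Character>> train,
                                List<Boolean> labels, Metric metric, int k) {
            // Keep the k nearest seen so far; the head of the queue is the farthest.
            PriorityQueue<Neighbour> best = new PriorityQueue<>(
                    Comparator.comparingDouble(Neighbour::dist).reversed());
            for (int i = 0; i < train.size(); i++) {
                best.add(new Neighbour(metric.distance(query, train.get(i)), labels.get(i)));
                if (best.size() > k) best.poll();
            }
            long resistantVotes = best.stream().filter(Neighbour::resistant).count();
            return resistantVotes > k / 2; // odd k guarantees a majority
        }

        public static void main(String[] args) {
            // Placeholder metric: count positions at which the genotypes differ.
            Metric m = (a, b) -> {
                Set<Integer> keys = new HashSet<>(a.keySet());
                keys.addAll(b.keySet());
                return keys.stream().filter(p -> !Objects.equals(a.get(p), b.get(p))).count();
            };
            List<Map<Integer, Character>> train =
                    List.of(Map.of(10, 'I', 48, 'V'), Map.of(10, 'I'), Map.of());
            List<Boolean> labels = List.of(true, true, false);
            System.out.println(classify(Map.of(10, 'I', 48, 'V'), train, labels, m, 3));
        }
    }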

Chapter 4

Results

4.1 Classification Models

I obtained both decision tree and nearest-neighbour classifiers that predict drug susceptibility from genotype for 8 reverse transcriptase and 6 protease inhibitors. Decision tree learning generated classification models of varying complexity, ranging from only 5-9 interior attribute tests up to 10, 11, 12, 16 and 19 interior attribute tests for zalcitabine, indinavir, abacavir, amprenavir and saquinavir, respectively. In contrast, the nearest-neighbour classifiers generated no explicit models but simply stored the training and validation datasets used to generate the decision trees.

4.1.1 Reverse Transcriptase Inhibitors

I obtained decision tree classifiers for each of the reverse transcriptase inhibitors: zalcitabine, didanosine, stavudine, lamivudine, nevirapine, abacavir, delavirdine and efavirenz, see figures 4.9-4.16, respectively. The decision trees for these drugs varied in complexity. In particular, I found rather simple models for didanosine, stavudine, nevirapine and efavirenz, with these trees having only 5-6 interior attribute tests. On the other hand, I found more complex models for zalcitabine, abacavir and delavirdine, with these trees having 9-12 interior attribute tests.

Training and validation datasets were randomly created from the entire dataset of applicable phenotype-genotype pairs and were used to derive each decision tree and nearest-neighbour classifier, see Appendix C for details. Within these datasets, fold-change values were distributed as in figures 4.1-4.8.

Fig. 4.1 Frequency distribution of fold-change values in the training and validation datasets for zalcitabine.

Fig. 4.2 Frequency distribution of fold-change values in the training and validation datasets for didanosine.

Fig. 4.3 Frequency distribution of fold-change values in the training and validation datasets for stavudine.

Fig. 4.4 Frequency distribution of fold-change values in the training and validation datasets for lamivudine.

Fig. 4.5 Frequency distribution of fold-change values in the training and validation datasets for abacavir.

Fig. 4.6 Frequency distribution of fold-change values in the training and validation datasets for nevirapine.

Fig. 4.7 Frequency distribution of fold-change values in the training and validation datasets for delavirdine.

Fig. 4.8 Frequency distribution of fold-change values in the training and validation datasets for efavirenz.

[P75] = <M>, <T> then: resistant (15[8%])
[P75] = <A> then: susceptible (3[0%])
[P75] = <V> and:
  [P184] = <M> then: susceptible (182[17%])
  [P184] = <V> and:
    [P210] = <L> then: susceptible (121[33%])
    [P210] = <W> and:
      [P177] = <E>, <G> then: resistant (9[0%])
      [P177] = <D> and:
        [P35] = <I> then: resistant (3[0%])
        [P35] = <T> then: susceptible (2[0%])
        [P35] = <V> and:
          [P74] = <V>, <I> then: resistant (4[0%])
          [P74] = <L> and:
            [P118] = <V> then: susceptible (15[0%])
            [P118] = <I> then: resistant (3[0%])
        [P35] = <M> and:
          [P41] = <M> then: resistant (1[0%])
          [P41] = <L> then: susceptible (3[0%])
[P75] = <I> and:
  [P41] = <M> then: resistant (10[0%])
  [P41] = <L> then: susceptible (1[0%])
[P75] = <L> and:
  [P32] = <K> then: susceptible (1[0%])
  [P32] = <H> then: resistant (1[0%])

Fig. 4.9 Decision tree classifier for zalcitabine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P151] = <M> then: resistant (15[0%])
[P151] = <Q> and:
  [P74] = <L>, <S> then: susceptible (333[12%])
  [P74] = <V> and:
    [P211] = <R>, <A> then: resistant (18[0%])
    [P211] = <T> then: susceptible (1[0%])
    [P211] = <K> and:
      [P297] = <E>, <K> then: susceptible (7[0%])
      [P297] = <A>, <R> then: resistant (2[0%])
  [P74] = <I> and:
    [P35] = <V> then: resistant (3[0%])
    [P35] = <L> then: susceptible (1[0%])

Fig. 4.10 Decision tree classifier for didanosine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P210] = <L> and:
  [P75] = <V> then: susceptible (257[8%])
  [P75] = <A>, <T>, <I>, <S> then: resistant (17[0%])
  [P75] = <M> and:
    [P3] = <S> then: resistant (1[0%])
    [P3] = <C> then: susceptible (1[0%])
[P210] = <W> and:
  [P69] = <N>, <D>, <S> then: resistant (15[0%])
  [P69] = <T> and:
    [P67] = <D>, <G>, <E> then: susceptible (22[0%])
    [P67] = <N> then: resistant (38[50%])

Fig. 4.11 Decision tree classifier for stavudine. (n[e]) at the leaves denotes the number of examples n and the estimated error e. Leaves with an error of 50% represent an inability to make a definite classification.

[P184] = <V>, <I> then: resistant (236[0%])
[P184] = <M> and:
  [P69] = <T>, <N>, <A> then: susceptible (165[6%])
  [P69] = <D> and:
    [P245] = <T> then: resistant (1[0%])
    [P245] = <E>, <K>, <A> then: susceptible (3[0%])
    [P245] = <V> and:
      [P118] = <V> then: susceptible (5[0%])
      [P118] = <I> then: resistant (3[0%])
  [P69] = <S> and:
    [P41] = <L> then: resistant (2[0%])
    [P41] = <M> then: susceptible (1[0%])
  [P69] = <I> and:
    [P21] = <V> then: resistant (1[0%])
    [P21] = <I> then: susceptible (1[0%])

Fig. 4.12 Decision tree classifier for lamivudine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <R> then: susceptible (5[0%])
[P103] = <N>, <T> then: resistant (60[0%])
[P103] = <K> and:
  [P181] = <C>, <I> then: resistant (28[15%])
  [P181] = <Y> and:
    [P190] = <A>, <S>, <T>, <Q>, <C>, <E>, <V> then: resistant (20[0%])
    [P190] = <G> and:
      [P106] = <I>, <L> then: susceptible (6[0%])
      [P106] = <A>, <M> then: resistant (9[0%])
      [P106] = <V> and:
        [P188] = <Y>, <H> then: susceptible (274[3%])
        [P188] = <L>, <C> then: resistant (5[0%])
[P103] = <S> and:
  [P35] = <V>, <M> then: resistant (2[0%])
  [P35] = <I> then: susceptible (1[0%])

Fig. 4.13 Decision tree classifier for nevirapine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P184] = <V>, <I> then: resistant (149[24%])
[P184] = <M> and:
  [P41] = <M> and:
    [P151] = <M> then: resistant (11[0%])
    [P151] = <Q> and:
      [P65] = <K> and:
        [P211] = <K>, <A>, <S>, <T>, <G> then: susceptible (54[0%])
        [P211] = <R> and:
          [P178] = <M>, <V> then: susceptible (7[0%])
          [P178] = <L> then: resistant (2[0%])
          [P178] = <I> and:
            [P69] = <T>, <N> then: susceptible (50[0%])
            [P69] = <S> then: resistant (1[0%])
      [P65] = <R> and:
        [P35] = <I> then: susceptible (1[0%])
        [P35] = <M> then: resistant (1[0%])
        [P35] = <V> and:
          [P162] = <C> then: resistant (1[0%])
          [P162] = <A> then: susceptible (1[0%])
          [P162] = <S> and:
            [P62] = <A> then: resistant (4[0%])
            [P62] = <V> then: susceptible (1[0%])
  [P41] = <L> and:
    [P69] = <N>, <D>, <S>, <E> then: resistant (11[0%])
    [P69] = <A> then: susceptible (1[0%])
    [P69] = <T> and:
      [P44] = <E> then: susceptible (26[22%])
      [P44] = <D>, <E> then: resistant (7[0%])

Fig. 4.14 Decision tree classifier for abacavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <N>, <S>, <T> then: resistant (66[8%])
[P103] = <R>, <Q> then: susceptible (6[0%])
[P103] = <K> and:
  [P181] = <C>, <I> then: resistant (17[0%])
  [P181] = <Y> and:
    [P245] = <Q>, <T>, <E>, <M>, <K>, <I>, <S>, <L>, <A> then: susceptible (81[0%])
    [P245] = <R> then: resistant (1[0%])
    [P245] = <V> and:
      [P102] = <K>, <R> then: susceptible (155[3%])
      [P102] = <E> then: resistant (1[0%])
      [P102] = <Q> and:
        [P100] = <I> then: resistant (2[0%])
        [P100] = <L> and:
          [P20] = <R> then: resistant (1[0%])
          [P20] = <K> and:
            [P230] = <L> then: resistant (1[0%])
            [P230] = <M> and:
              [P188] = <Y>, <C> then: susceptible (17[0%])
              [P188] = <L> and:
                [P35] = <V> then: resistant (1[0%])
                [P35] = <T> then: susceptible (1[0%])

Fig. 4.15 Decision tree classifier for delavirdine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <R>, <T>, <Q> then: susceptible (5[0%])
[P103] = <N> then: resistant (66[17%])
[P103] = <K> and:
  [P190] = <S>, <T>, <Q>, <C>, <E> then: resistant (10[0%])
  [P190] = <G> and:
    [P188] = <Y>, <C>, <H> then: susceptible (211[2%])
    [P188] = <L> then: resistant (7[0%])
  [P190] = <A> and:
    [P101] = <E>, <D> then: resistant (6[0%])
    [P101] = <K> and:
      [P135] = <T>, <M> then: resistant (3[0%])
      [P135] = <I> then: susceptible (5[0%])
[P103] = <S> and:
  [P35] = <V> then: susceptible (2[0%])
  [P35] = <M> then: resistant (1[0%])

Fig. 4.16 Decision tree classifier for efavirenz. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

4.1.2 Protease Inhibitors

I obtained decision tree classifiers for each of the protease inhibitors: lopinavir, saquinavir, indinavir, ritonavir, amprenavir and nelfinavir, see figures 4.23-4.28, respectively. Here I found rather more complex models than for the reverse transcriptase inhibitors. In particular, I found decision trees with 6-9 interior attribute tests for ritonavir, lopinavir and nelfinavir; the decision trees for indinavir, amprenavir and saquinavir were more complex still, with 11, 16 and 19 interior attribute tests, respectively. This complexity suggests that the genetic basis of drug resistance is more complicated for protease inhibitors than for reverse transcriptase inhibitors. However, this conjecture is by no means conclusive, as complex decision trees can also stem from noisy training data: it may simply be the case that the training data for these drugs contains errors, and that these errors are distributed evenly between the training and validation datasets.

Training and validation datasets were randomly created from the entire dataset of applicable phenotype-genotype pairs and were used to derive each decision tree and nearest-neighbour classifier, see Appendix C for details. Within these datasets, fold-change values were distributed as in figures 4.17-4.22.

Fig. 4.17 Frequency distribution of fold-change values in the training and validation datasets for saquinavir.

Fig. 4.18 Frequency distribution of fold-change values in the training and validation datasets for lopinavir.

Fig. 4.19 Frequency distribution of fold-change values in the training and validation datasets for indinavir.

Fig. 4.20 Frequency distribution of fold-change values in the training and validation datasets for ritonavir.

Fig. 4.21 Frequency distribution of fold-change values in the training and validation datasets for nelfinavir.

Fig. 4.22 Frequency distribution of fold-change values in the training and validation datasets for amprenavir.

[P10] = <L>, <H>, <R>, <M> then: susceptible (92[10%])
[P10] = <V>, <F>, <Z> then: resistant (55[12%])
[P10] = <I> and:
  [P82] = <A>, <T>, <I>, <F>, <S> then: resistant (37[0%])
  [P82] = <V> and:
    [P71] = <V>, <L> then: resistant (16[19%])
    [P71] = <I> then: susceptible (2[0%])
    [P72] = <I>, <V> then: resistant (10[0%])
    [P72] = <M>, <E> then: susceptible (2[0%])
    [P71] = <A> and:
      [P46] = <M>, <L> then: susceptible (9[0%])
      [P46] = <I> and:
        [P72] = <T> then: resistant (2[0%])
        [P72] = <L> then: susceptible (1[0%])
        [P72] = <I> and:
          [P93] = <L> then: susceptible (3[0%])
          [P93] = <I> then: resistant (3[0%])

Fig. 4.23 Decision tree classifier for lopinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P10] = <R>, <Z> then: resistant (26[0%])
[P10] = <I> and:
  [P71] = <I>, <T>, <V>, <Z> then: resistant (105[18%])
  [P71] = <A> and:
    [P48] = <V>, <S> then: resistant (8[0%])
    [P48] = <G> and:
      [P37] = <D>, <Z> then: resistant (3[0%])
      [P37] = <E>, <T>, <N> then: susceptible (4[0%])
      [P37] = <S> and:
        [P84] = <V>, <C> then: resistant (6[0%])
        [P84] = <I> and:
          [P73] = <S>, <G>, <C> then: susceptible (32[0%])
          [P73] = <T> then: resistant (2[0%])
  [P71] = <L> and:
    [P13] = <I> then: resistant (2[0%])
    [P13] = <V> then: susceptible (2[0%])
[P10] = <L> and:
  [P90] = <L> then: susceptible (173[0%])
  [P90] = <M> and:
    [P88] = <S>, <D> then: resistant (14[0%])
    [P88] = <N> and:
      [P48] = <V> then: resistant (4[0%])
      [P48] = <G> and:
        [P84] = <V> then: resistant (4[0%])
        [P84] = <I> and:
          [P60] = <E> and:
            [P64] = <I> then: resistant (4[0%])
            [P64] = <V> then: susceptible (4[0%])
          [P60] = <D> and:
            [P14] = <K> then: susceptible (14[0%])
            [P14] = <R> then: resistant (1[0%])
[P10] = <V> and:
  [P71] = <V> then: resistant (10[0%])
  [P71] = <A> then: susceptible (8[0%])
  [P71] = <T> and:
    [P12] = <Z> then: resistant (1[0%])
    [P12] = <T> and:
      [P30] = <D> then: susceptible (1[0%])
      [P30] = <N> then: resistant (1[0%])
[P10] = <F> and:
  [P84] = <I> then: susceptible (20[33%])
  [P84] = <L>, <C>, <A> then: resistant (3[0%])
  [P84] = <V> and:
    [P63] = <P>, <A> then: resistant (7[0%])

Fig. 4.24 Decision tree classifier for saquinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P54] = <V>, <T>, <A>, <Z> then: resistant (125[0%])
[P54] = <M> then: susceptible (2[0%])
[P54] = <I> and:
  [P46] = <Z> then: resistant (7[0%])
  [P46] = <I> and:
    [P63] = <P>, <A> then: resistant (46[7%])
    [P63] = <V>, <R>, <Q> then: susceptible (5[0%])
    [P63] = <L> and:
      [P77] = <I> then: susceptible (3[0%])
      [P77] = <V> and:
        [P69] = <Y>, <Q> then: resistant (2[0%])
        [P69] = <K> then: susceptible (1[0%])
        [P69] = <H> and:
          [P10] = <I>, <L>, <V> then: resistant (4[0%])
          [P10] = <F> then: susceptible (1[0%])
  [P46] = <M> and:
    [P90] = <M> then: resistant (40[27%])
    [P90] = <L> then: susceptible (200[6%])
  [P46] = <L> and:
    [P10] = <I>, <R> then: resistant (5[0%])
    [P10] = <L> then: susceptible (10[0%])
    [P10] = <F> and:
      [P62] = <I> then: resistant (2[0%])
      [P62] = <V> then: susceptible (1[0%])
[P54] = <L> and:
  [P10] = <I> then: resistant (2[0%])
  [P10] = <L> then: susceptible (2[0%])
  [P20] = <K> then: susceptible (2[0%])
  [P20] = <T> then: resistant (1[0%])

Fig. 4.25 Decision tree classifier for indinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P82] = <A>, <T>, <F>, <L>, <S> then: resistant (109[4%])
[P82] = <V> and:
  [P90] = <M> then: resistant (79[28%])
  [P90] = <L> and:
    [P84] = <I> then: susceptible (205[6%])
    [P84] = <V>, <L>, <A> then: resistant (26[0%])
    [P84] = <C> and:
      [P10] = <I>, <F> then: resistant (3[0%])
      [P10] = <L> then: susceptible (1[0%])
[P82] = <I> and:
  [P46] = <I> then: resistant (3[0%])
  [P46] = <L> then: susceptible (1[0%])
  [P46] = <M> and:
    [P84] = <I> then: susceptible (4[0%])
    [P84] = <V>, <C> then: resistant (2[0%])

Fig. 4.26 Decision tree classifier for ritonavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P10] = <R> then: susceptible (3[0%])
[P10] = <Z> then: resistant (28[0%])
[P10] = <I> and:
  [P84] = <I> then: susceptible (90[22%])
  [P84] = <V>, <C>, <A> then: resistant (41[37%])
[P10] = <L> and:
  [P84] = <I> and:
    [P50] = <V> then: resistant (3[0%])
    [P50] = <L> then: susceptible (5[0%])
    [P50] = <I> and:
      [P47] = <I> then: susceptible (156[2%])
      [P47] = <A> then: resistant (2[0%])
      [P47] = <V> and:
        [P12] = <T> then: susceptible (1[0%])
        [P12] = <S> then: resistant (1[0%])
  [P84] = <V> and:
    [P73] = <S> then: susceptible (1[0%])
    [P73] = <T> then: resistant (2[0%])
    [P73] = <G> and:
      [P46] = <I> then: susceptible (1[0%])
      [P46] = <L> then: resistant (2[0%])
      [P46] = <M> and:
        [P12] = <P> then: resistant (1[0%])
        [P12] = <T> and:
          [P36] = <M> then: susceptible (1[0%])
          [P36] = <Z> then: resistant (1[0%])
  [P84] = <C> and:
    [P20] = <K> then: susceptible (1[0%])
    [P20] = <I> then: resistant (1[0%])
[P10] = <V> and:
  [P46] = <I> then: resistant (6[0%])
  [P46] = <M> then: susceptible (1[0%])
  [P46] = <L> and:
    [P61] = <Q> then: resistant (3[0%])
    [P61] = <E> then: susceptible (1[0%])
[P10] = <F> and:
  [P84] = <V>, <A> then: resistant (8[0%])
  [P84] = <I> and:
    [P46] = <I> then: resistant (3[0%])
    [P46] = <M> then: susceptible (13[50%])
    [P46] = <L> and:
      [P15] = <I> then: susceptible (3[0%])
      [P15] = <V> then: resistant (3[0%])

Fig. 4.27 Decision tree classifier for amprenavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P54] = <V>, <L>, <T>, <Z> then: resistant (128[0%])
[P54] = <M> then: susceptible (1[0%])
[P54] = <I> and:
  [P90] = <M> then: resistant (94[8%])
  [P90] = <L> and:
    [P30] = <N> then: resistant (37[0%])
    [P30] = <Y> then: susceptible (1[0%])
    [P30] = <D> and:
      [P46] = <I>, <Z> then: resistant (31[28%])
      [P46] = <M> and:
        [P88] = <S> then: resistant (8[0%])
        [P88] = <D> then: susceptible (1[0%])
        [P88] = <N> and:
          [P82] = <V>, <S> then: susceptible (155[4%])
          [P82] = <F> then: resistant (3[0%])
          [P82] = <I> and:
            [P20] = <K> then: susceptible (1[0%])
            [P20] = <I> then: resistant (1[0%])
      [P46] = <L> and:
        [P10] = <I> then: resistant (5[0%])
        [P10] = <F> then: susceptible (1[0%])
        [P10] = <L> and:
          [P82] = <A>, <L> then: susceptible (2[0%])
          [P82] = <V> then: resistant (5[0%])

Fig. 4.28 Decision tree classifier for nelfinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

4.2 Prediction Quality

To assess the predictive quality of the classifiers on unseen genotypes, 30% of the entire dataset (applicable to each drug) was randomly selected for testing. These cases were then queried using the appropriate decision tree or nearest-neighbour classifier to obtain a predicted drug susceptibility. For each genotype in the testing set, the predicted classification was compared with its true classification to obtain both an indication of predictive error and an estimate of how well a classifier generalises beyond the training population. In particular, I determined for each classifier its prediction error across the testing set (the percentage of misclassified cases), its sensitivity and its specificity. The sensitivity of a classifier is the probability that it predicts drug resistance given that a case is truly resistant; the specificity is the probability that it predicts drug susceptibility given that a case is truly susceptible. To assess how these newly generated classifiers fared against the decision tree classifiers originally presented in [1], I hand-implemented each published classifier and tested it against the same testing sets, see Appendix D for details of these trees.
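These three measures follow directly from the confusion matrix of a classifier's predictions; the sketch below (with resistant as the positive class) shows the computation.

    // Sketch of the evaluation measures, computed from a confusion matrix in
    // which "positive" means resistant: tp/fn are correctly/incorrectly
    // classified resistant cases, tn/fp correctly/incorrectly classified
    // susceptible cases.
    public class Evaluation {

        static double predictionError(int tp, int tn, int fp, int fn) {
            return (double) (fp + fn) / (tp + tn + fp + fn);
        }

        static double sensitivity(int tp, int fn) { // P(predict resistant | truly resistant)
            return (double) tp / (tp + fn);
        }

        static double specificity(int tn, int fp) { // P(predict susceptible | truly susceptible)
            return (double) tn / (tn + fp);
        }

        public static void main(String[] args) {
            System.out.printf("error=%.3f sensitivity=%.2f specificity=%.2f%n",
                    predictionError(40, 50, 5, 5), sensitivity(40, 5), specificity(50, 5));
        }
    }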

An important consideration when testing such classification models is how well the distribution of cases in the testing set concurs with the distribution of cases in the training and validation sets. This is important because, when distributions differ greatly, we can expect a loss of predictive quality: learning is most reliable when the cases in the training set follow a distribution similar to that of the entire population. Table 4.1 gives the distribution of fold-change values in each testing set.

Drug  Percentage of test cases per fold-change bin (lowest to highest)
ddC   18%  55%  21%  2%   3%
ddI   16%  66%  12%  3%   3%
d4T   33%  47%  15%  4%   1%
3TC   12%  18%  9%   7%   7%   15%  16%  14%
ABC   21%  28%  34%  11%  4%   2%
NVP   33%  27%  11%  3%   3%   6%   5%   2%   4%  4%  1%  0%
DLV   31%  22%  12%  6%   2%   6%   5%   8%   5%
EFV   43%  21%  7%   5%   2%   9%   4%   3%   5%
SQV   33%  29%  12%  5%   4%   8%   4%   2%   0%  2%  0%  0%
LPV   28%  21%  4%   13%  12%  16%  5%   1%
IDV   20%  26%  13%  16%  12%  9%   3%   1%
RTV   24%  29%  11%  7%   10%  10%  7%   3%
NFV   9%   18%  13%  13%  18%  16%  8%   3%   1%
APV   37%  27%  17%  9%   6%   2%

Table 4.1 Frequency distribution of fold-change values within the testing datasets. Resistance factors are grouped into equidistant bins. Differences between the testing and training sets greater than 5% are highlighted.

Here the distribution of test cases is relatively similar to that of the training experience; any models with low predictive quality therefore cannot be explained away by differences between the training and test set distributions.

Testing the newly constructed decision trees resulted in prediction errors in the range of …% for the protease inhibitors, except for amprenavir, which had a prediction error of 24.2%; 6.0%-7.8% for the nonnucleoside reverse transcriptase inhibitors; and 6.8%, 14.6%, 14.8% and 16.6% for 3TC, d4T, ddI and ABC, respectively. The error rate for ddC was the poorest at 24.7%. Using the same test cases, the previously published decision trees resulted in prediction errors in the range 8.2%-20.0% for the protease inhibitors, except for amprenavir, which again had a prediction error of 24.2%; 4.8%-9.5% for the nonnucleoside reverse transcriptase inhibitors; and 8.1%, 20.0%, 51.0% and 19.0% for 3TC, d4T, ddI and ABC,

respectively. The error rate for ddC was 29.7%. Table 4.2 details the prediction errors for each drug.

Using the nearest-neighbour classifiers I found relatively poor predictive quality, with prediction errors in the range 18.0%-19.4% for the protease inhibitors, except for nelfinavir, which had a prediction error of 26.4%; 22.4%-36.3% for the nonnucleoside reverse transcriptase inhibitors; and 32.7%, 18.9%, 21.9% and 28.7% for 3TC, d4T, ddI and ABC, respectively. The error rate for ddC was 46.2%.

From these results it is clear that the newly constructed decision trees outperform the previously published decision trees over this dataset. The only drug for which the original decision tree outperforms the newly constructed one is efavirenz; however, this score is not a true indication of the original tree's predictive quality, because that tree fails to return a classification for 72% of the cases in the testing set. In addition, although the nearest-neighbour classifiers fare poorly compared to the newly constructed decision trees, they outperform the original decision trees for some drugs.

Drug  No. of test cases  New D-tree error  Original D-tree error  Nearest-neighbour error
ddC   …     24.7% [0%]   29.7% [0%]          46.2% [0%]
ddI   …     14.8% [0%]   51.0% [0%]          21.9% [0%]
d4T   …     14.6% [0%]   20.0% [0%]          18.9% [0%]
3TC   …     6.8% [0%]    8.1% [0%]           32.7% [0%]
ABC   …     16.6% [0%]   19.0% [1%]          28.7% [0%]
NVP   …     … [0%]       7.3% [0.9%]         22.4% [0%]
DLV   …     … [0%]       9.5% [0%]           36.3% [0%]
EFV   …     … [0%]       4.8% [72%]          24.4% [0%]
SQV   251   14.3% [0%]   20.0% [0%]          18.3% [0%]
LPV   95    10.5% [0%]   No tree available.  18.9% [0%]
IDV   …     … [0%]       15.0% [3%]          19.4% [0%]
RTV   …     … [0%]       11.0% [0.8%]        18.0% [0%]
NFV   …     … [0%]       8.2% [4%]           26.4% [0%]
APV   …     24.2% [0%]   24.2% [8.7%]        19.1% [0%]

Table 4.2 Prediction errors. Bracketed figures give the percentage of test cases for which no classification was returned.

Considering the sensitivity and specificity of the classification models (Table 4.3), the newly constructed decision trees achieved sensitivities in the range of … for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.67; … for the nonnucleoside reverse transcriptase inhibitors; and … for 3TC and ABC. The sensitivities for ddC, ddI and d4T were poorest, with values of 0.39, 0.32 and 0.67, respectively. Specificities were in the range of … for the protease inhibitors, … for the nonnucleoside reverse transcriptase inhibitors, and … for ddC, ddI, d4T and 3TC. The specificity for ABC was the poorest, with a value of ….

For the previously published decision trees, sensitivities were in the range of … for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.76; … for the nonnucleoside reverse transcriptase inhibitors; and 0.93 and 0.88 for 3TC and ABC. The sensitivities for ddC, ddI and d4T were poorest, with values of 0.59, 0.64 and 0.72, respectively. Specificities were in the range of … for the protease inhibitors, … for the nonnucleoside reverse transcriptase inhibitors, and … for ABC, ddC, ddI, d4T and 3TC. The specificity for EFV was the poorest, with a value of …. It is clear from these results that the newly constructed decision trees fare better at predicting susceptibility than resistance, and vice versa for the previously published decision trees.

For the nearest-neighbour classifiers, sensitivities were in the range of … for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.48; … for the nonnucleoside reverse transcriptase inhibitors; and 0.69 for 3TC and ABC. The sensitivities for ddC, ddI and d4T were 0.51, 0.36 and 0.47, respectively. Specificities were in the range of … for the protease inhibitors, … for the nonnucleoside reverse transcriptase inhibitors, and … for ABC, ddC, ddI, d4T and 3TC.

      New D-tree                Original D-tree           Nearest-neighbour
Drug  sensitivity  specificity  sensitivity  specificity  sensitivity  specificity
ddC   0.39         …            0.59         …            0.51         …
ddI   0.32         …            0.64         …            0.36         …
d4T   0.67         …            0.72         …            0.47         …
3TC   …            …            0.93         …            0.69         …
ABC   …            …            0.88         …            0.69         …
NVP   …            …            …            …            …            …
DLV   …            …            …            …            …            …
EFV   …            …            …            …            …            …
SQV   …            …            …            …            …            …
LPV   …            …            -            -            …            …
IDV   …            …            …            …            …            …
RTV   …            …            …            …            …            …
NFV   …            …            …            …            …            …
APV   0.67         …            0.76         …            0.48         …

Table 4.3 Sensitivities and specificities.

Chapter 5

Conclusion

5.1 Concluding Remarks and Observations

Using a dataset of HIV-1 reverse transcriptase and protease genotypes with matched drug susceptibilities, I was able to construct decision tree classifiers that recognise genotypic patterns characteristic of drug resistance for 14 antiretroviral drugs. No prior knowledge about resistance-associated mutations was used, and mutations at every sequence position were treated equally. Using an independent testing set, I was able to judge each decision tree's predictive quality against a number of similarly derived decision trees presented in the literature.

I also constructed a novel nearest-neighbour classifier to predict drug susceptibility from genotype for each drug. Each nearest-neighbour classifier used a database of matched phenotype-genotype pairs but, in contrast to decision tree learning, nearest-neighbour learning did not attempt to extract a generalised classification function. In order to investigate the possible advantage of neural network learning over decision tree learning, I also derived a decision tree classifier for the protease inhibitor lopinavir and compared it with the neural network classifier for lopinavir presented in [2].

The predictive quality of the decision tree classifiers was mixed. For the decision trees, I found prediction errors of up to 24.7% across all drugs. These results offered an improvement over the performance of the previously published decision trees, which had prediction errors between 4.8% and 51.0% over the same testing sets. Nearest-neighbour classifiers exhibited

poorer performance, with prediction errors between 18.0% and 46.2%, but still outperformed the previously published decision trees on some drugs.

5.1.1 Decision Tree Models

The decision trees generated in the scope of this study varied in complexity. In particular, I found rather compact classification models (5-7 interior attribute tests) for the drugs didanosine, stavudine, lamivudine, nevirapine, efavirenz, ritonavir and lopinavir, and more complex models (9-19 interior attribute tests) for the drugs delavirdine, nelfinavir, zalcitabine, indinavir, abacavir, amprenavir and saquinavir. This is in contrast to the decision tree classifiers presented in [1], which had between 4 and 12 interior attribute tests across all drugs. The increase in complexity may stem from several causes: the training data exhibits large mutational interdependence; I used a larger training set; the training data is distributed differently; pruning was ineffective; or the training data contains noise and consequently produces overly specific classification functions.

In the most extreme case I can compare the complexity of the decision trees for the protease inhibitor saquinavir. In [1] the decision tree for saquinavir had only 5 interior attribute tests and achieved a prediction error of 12.5% in leave-one-out experiments. (A leave-one-out experiment predicts a classification for a case in the training data by constructing a decision tree on the remaining data and then using that tree to classify the held-out case.) This rather compact tree is in contrast to the decision tree for saquinavir presented in this study, which has 19 interior attribute tests and achieved a prediction error of 14.3% over an independent testing set consisting of 251 cases. In this case the tree appears to be overgrown and the effects of reduced error pruning are minimal: only two leaf nodes contain a classification error (an indication of where subtrees have been removed), and 54% of leaves have only 1-4 training cases associated with them. In this respect the tree appears to be overly specific, and pruning was not executed aggressively enough to force generalisation.

This is true for a number of the other decision tree classifiers that I generated. In particular, the decision trees for amprenavir, abacavir and indinavir appear to be overly specific, and the effects of reduced error pruning again appear to be minimal. The remaining classification models are more similar in structure and complexity to the ones previously published. For models in which overfitting appears to have occurred, a basic reduced error pruning strategy does not introduce enough bias to create shorter, more general trees. This is problematic

because shorter trees are more desirable than larger ones. The preference for shorter trees stems from Occam's razor, which states that shorter trees form a sounder basis for generalisation beyond a set of training data than larger ones. This is particularly important if we wish to use decision tree classifiers as a future tool to help select drug regimens for treating HIV-1 infection: we seek a classification model that generalises well to the entire population of HIV-1 cases.

Looking again at the performance of the compact saquinavir classification model over the independent testing set (the one used to test the complex model), we see a dramatic decrease in performance compared with the previously published result (a prediction error of 20.0%). Similarly, the performance of the other previously published models also decreased, except for d4T, 3TC, NVP and NFV. Here I may argue that leave-one-out experiments are not sufficient to judge the performance of these models, because single cases taken from the training data are less likely to include information that the remaining data omits. Indeed, some of the models fail to return a classification at all because they do not recognise the presence of certain amino acids at certain sequence positions; for example, the previously published decision tree for efavirenz fails to return a classification for 72% of the testing examples.

The relative similarity of performance also makes a strong case for the ID3 algorithm. In particular, ID3 is able to generate decision trees with a performance similar to that of decision trees generated using the C4.5 algorithm; in this setting, many of the C4.5 extensions and features are not required.

Furthermore, the performance of the compact saquinavir classification model over this testing set implies that the compact model is overly general: it is less able to differentiate between resistant and susceptible cases than the more complex model. In this respect the complex saquinavir classification model may indeed satisfy Occam's razor for this particular dataset, and may even be the smallest decision tree that can be generated from the training data. The compact and complex saquinavir classification models therefore appeal to Occam's razor in contradictory ways, presenting us with a difficult choice of which tree would best generalise to the entire population of HIV-1 cases. Statistically, we can compare the sensitivities and specificities of the models as an indication of how well each decision tree is able to generalise beyond the training population.

Considering the sensitivities (the ability of a model to predict drug resistance when a case is truly resistant) of the newly constructed decision tree models against the previously published models, the results were comparable for most drugs, except for zalcitabine, didanosine, amprenavir and saquinavir. By using a larger dataset for learning, then, I have found classification models with relatively similar sensitivities; the ability of a decision tree to predict resistance seems not to depend greatly on the size of the training dataset. Considering the specificities (the ability of a model to predict drug susceptibility when a case is truly susceptible), the new models offer an improvement across the board: by using a larger training dataset I have found classification models with an improved ability to predict susceptibility.

Returning to the decision tree for saquinavir, by inspecting the attribute tests present in the compact and complex models we can get a picture of the genetic basis of saquinavir drug resistance. The compact model represents saquinavir resistance as being determined by mutational changes at sequence positions 90, 48, 54, 84 and 72. These sequence positions have all been previously described as being associated with either high-level or intermediate-level resistance [14], except position 72. Here, positions associated with high-level resistance are placed closer to the root of the tree. In contrast, the complex model represents saquinavir drug resistance as being determined by mutational changes at sequence positions 10, 71, 48, 37, 84, 73, 13, 90, 88, 60, 64, 14, 12, 30 and 63. Only positions 10, 71, 48, 84, 73, 90 and 63 have been previously described as being associated with saquinavir resistance. Here, positions 10, 71, 90, 48 and 84 are placed closer to the root of the tree and are associated with high-level resistance, except for positions 10 and 71, which are regarded as only accessory resistance mutations. The other positions are not listed in [14] as being associated with saquinavir resistance. This may imply either that the decision tree has been able to identify as yet unknown resistance-associated mutations from the training data, or that the decision tree learning algorithm has been fooled by noise in the training data.

The situation is similar for the reverse transcriptase inhibitor classification models: these models contain a mixture of known high-level, intermediate-level and accessory resistance-associated mutations, together with a number of mutations not listed in [15]. However, this situation does not extend to the other protease inhibitor classification models. For ritonavir, amprenavir and nelfinavir, the models tended to identify only previously known high-level, intermediate-level and accessory resistance-associated mutations.

Furthermore, mutations associated with higher-level resistance tended to be tested closer to the roots of their respective trees. Only the classification models for lopinavir and indinavir followed a pattern similar to that of saquinavir, tending to identify intermediate-level, accessory and previously unknown mutations.

These observations raise three important considerations when applying decision tree learning to the phenotype-prediction problem: choosing an appropriate attribute selection measure, the importance of the training data, and how deeply to grow a tree. In this study, amino acid positions were selected based on maximal information gain. However, judging from the types of mutations that are placed closer to the roots of the trees (in some cases accessory rather than high-level resistance-associated mutations), I may have obtained better classification models had I used a different attribute selection measure. Specifically, information gain has a natural bias towards attributes with many values over those with fewer values. In the context of the phenotype-prediction problem this could prove problematic, because a certain amino acid position could exhibit a variety of mutations but nevertheless play no role in drug resistance; such a position may then be selected, accidentally, as a good indicator of resistance due to coincidences in the training data.

The size and quality of the training data heavily dictate the quality of the eventual decision tree classifier. Of course, to obtain a decision tree classifier that generalises well to the entire population of HIV-1 cases, the training data should be distributed in such a way that it is representative of the entire population. This is extremely difficult in practice. However, with larger and larger datasets of matched phenotype-genotype pairs becoming available, it may become possible to probabilistically model the distribution of cases within the entire population, and training sets constructed to follow such distributions would generate decision trees with stronger predictive merit. In this way the size of the training set is less important once a minimal number of examples has been reached; what matters is the examples themselves. A better decision tree will be grown if the data it is derived from is of good quality and varied: by good quality I mean that the data is reliable (minimises error), and by varied I mean that for each phenotype we have a wide selection of genotypes.

In this study I simply used the complete HIV-1 protease and reverse transcriptase drug-susceptibility datasets to generate decision tree classifiers with relatively low prediction

errors. However, since no attention was paid to determining the quality and variety of examples in the training, validation and test sets, these same decision trees may falter when applied to genotypes drawn randomly from the entire population of HIV-1 cases, much as the performance of the previously published decision trees faltered when applied to a different testing set.

The C4.5 and ID3 algorithms grow decision trees just deeply enough to perfectly classify the training examples, in accordance with Occam's razor. Whilst this is sometimes a reasonable strategy, it can lead to difficulties. In particular, for the phenotype-prediction problem we can only ever obtain a relatively small subset of training examples compared with the entire population, and this may lead to a classification model that does not generalise well to the true target function. There comes a point during decision tree learning when a tree starts to become specific to the target function represented by the training examples and fails to generalise to the remaining population; in other words, the tree has become overfitted. As previously mentioned, the effects of overfitting can be minimised by a number of post-pruning strategies. However, as has been exhibited, the effects of basic reduced error pruning were minimal for some trees. We should therefore introduce a suitable weighting to force pruning to consider even smaller decision trees that may not perform as well against the training set but nevertheless fare better over the remaining population.

5.1.2 Neural Network Models

The use of neural networks may be particularly suited to the phenotype-prediction problem for drugs such as lopinavir, where resistance is dictated by complex combinations of multiple mutations [2]. In particular, by representing the classification function Resistant: Genotype → Drug_Susceptibility_Classification using a neural network, we are able to represent nonlinear functions of many variables, such as multiple mutations exhibiting large interdependence. Indeed, with neural network learning we do not have to make any prior assumptions about the form of the target function: feedforward networks containing three layers of sigmoid perceptrons are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer [16]. This is in contrast to decision tree learning, where we must make some judgement as to what size of tree should be preferred, i.e. what bias to introduce.

Furthermore, representing the target function using a neural network has the advantage over decision tree learning that the predicted drug susceptibility is quantitative: such a network can return a predicted fold-change value rather than a discrete classification such as resistant or susceptible.

In [2] a neural network classification model was constructed from a dataset of phenotype-genotype pairs to predict resistance to the protease inhibitor lopinavir. The performance of the model was determined using a testing set of 177 examples, and the results were expressed using the linear correlation coefficient. The linear correlation coefficient is a number between -1 and 1 that measures how closely a set of points lies along a straight line, a magnitude of 1 indicating that all the points fall exactly on a line; in this case it was used to determine how well the predicted and actual fold-change values agree. For their best neural network, a correlation coefficient of 0.88 was obtained.

In order to compare the performance of this network against decision tree learning, I derived an equivalent decision tree classifier to predict lopinavir resistance using a dataset of phenotype-genotype pairs obtained from the Stanford database. I obtained a relatively simple decision tree classification model that determines lopinavir resistance according to mutations at positions 10, 82, 71, 72, 46 and 93. With the exception of position 72, all these positions have previously been recognised as significant for lopinavir resistance [2]. Using an independent testing set consisting of 95 cases, the decision tree had a prediction error of 10.5% and a sensitivity and specificity of 0.92 and 0.86, respectively; in other words, for 89.5% of the testing cases the predicted classification agreed with the true classification. This result is comparable to the correlation coefficient of 0.88 obtained for the neural network, although some allowance should be made for differences in the sizes of the testing sets: the neural network model was tested against 177 examples, nearly double the number used for the decision tree.

However, the decision tree model has the advantage over the neural network model that it is easily interpreted: experts can readily understand and examine the knowledge portrayed by the tree. Using decision trees it is also easy to derive a set of corresponding rules. Tracing out a path from the root of the tree to a leaf yields a single rule, where the internal nodes along the path provide the premises of the rule and the leaf determines its conclusion. Such rules can be presented as evidence for a classification.
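The following sketch shows this rule extraction for a toy tree; the node representation is a minimal stand-in, not the thesis's data structure.

    import java.util.*;

    // Sketch of rule extraction: every root-to-leaf path becomes one rule whose
    // premises are the attribute tests along the path.
    public class RuleExtractor {

        record Node(Integer position, Map<Character, Node> children, String leafClass) {
            static Node leaf(String cls) { return new Node(null, Map.of(), cls); }
            static Node test(int pos, Map<Character, Node> kids) { return new Node(pos, kids, null); }
        }

        static void rules(Node n, Deque<String> premises, List<String> out) {
            if (n.leafClass() != null) {
                out.add("IF " + String.join(" AND ", premises) + " THEN " + n.leafClass());
                return;
            }
            for (Map.Entry<Character, Node> e : n.children().entrySet()) {
                premises.addLast("P" + n.position() + " = " + e.getKey());
                rules(e.getValue(), premises, out);
                premises.removeLast();
            }
        }

        public static void main(String[] args) {
            // Toy tree loosely shaped like the saquinavir models discussed above.
            Node tree = Node.test(90, Map.of(
                    'M', Node.leaf("resistant"),
                    'L', Node.test(48, Map.of(
                            'V', Node.leaf("resistant"),
                            'G', Node.leaf("susceptible")))));
            List<String> out = new ArrayList<>();
            rules(tree, new ArrayDeque<>(), out);
            out.forEach(System.out::println);
        }
    }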

This transparency is in contrast to neural network models, which act as a black box: a genotype is given as input and a fold-change value is returned as output, but no clue as to how the prediction was made is available. One may wish to look inside the black box, but all that will be found is a number of connected units with no real meaningful interpretation. An analogy can be made with looking inside the workings of the human brain, which is itself made up of vast numbers of similar interconnected units: it is not enough simply to understand how each unit processes signals; rather, we wish to know the knowledge that they portray together.

In addition, like decision tree learning, neural network learning is prone to overfitting. This occurs as training proceeds because some weights will continue to be updated in order to reduce the error over the training data. There are a number of methods available to help minimise the effects of overfitting, most of which introduce a bias towards neural networks with small weight values, i.e. towards avoiding complex decision surfaces; but compared to decision tree pruning these solutions are less elegant. Also, in practical terms, neural network training typically requires longer training times than decision tree learning: training times range from a few seconds to many hours, depending on the number of weights to learn in the network and the number of training examples.

5.1.3 Nearest-Neighbour Models

Nearest-neighbour models have an advantage over decision tree and neural network models in that no explicit representation of the target function needs to be made. A nearest-neighbour method simply stores the set of training examples and postpones generalising beyond them until a new case must be classified. Here we avoid the problems of overfitting and of estimating a one-time target function that embodies the entire population. In other words, a nearest-neighbour classifier represents the target function by a combination of many local approximations, whereas decision tree and neural network learning must commit at training time to a single global approximation; in this respect, a nearest-neighbour classifier effectively uses a richer hypothesis space than both. In addition, because all the training data is stored and reused in its entirety, the information that it contains is never lost. The main difficulty in

defining a nearest-neighbour classifier to address the phenotype-prediction problem lies in determining an appropriate distance metric for retrieving similar genotypes.

In the scope of this study I derived a nearest-neighbour classifier, based on a novel distance metric, to predict drug susceptibility from genotype. The performance of the nearest-neighbour classifiers was assessed, in the same way as the decision tree classification models, using a randomly selected independent test set of genotype cases. In comparison to the decision tree models created in this study, these classifiers fared poorly: I found prediction errors in the range of 18.0%-46.2%, compared to prediction errors of at most 24.7% for the decision trees.

However, given these results we cannot conjecture that all nearest-neighbour methods will fare poorly in the context of the phenotype-prediction problem. This is clear when we consider the commercial success of Virco's VirtualPhenotype, which employs a nearest-neighbour classification scheme; indeed, some studies have shown that Virco's prediction scheme is useful as an independent predictor of the clinical outcome of antiretroviral therapy [17].

It is clear from these results that the distance metric used in this study is too naive and is not able to entirely capture the genetic basis of drug resistance when comparing genotypes. In particular, it does not take into consideration any details of the mutations that are present; rather, it only considers the percentage of differences between two sequences. Nevertheless, for the protease inhibitors this novel distance metric produced reasonable results: I found prediction errors in the range of 18.0%-19.4%, except for nelfinavir, which had a prediction error of 26.4%. This suggests that for these drugs, drug resistance can be characterised to some extent by the number of shared genetic mutations.

A practical disadvantage of nearest-neighbour classifiers is that they are inefficient when classifying a new case, because all the processing is performed at query time rather than in advance, as it is for decision tree and neural network learners.

5.2 Suggestions For Further Work

5.2.1 Handling Ambiguity Codes

Within this study, ambiguity codes are not handled in an effective manner. An ambiguity code may occur during genotyping when a sample containing a population of HIV-1 sequences is found to contain a number of possible amino acids at a specific sequence position. When this occurs, some mutations may be represented by multiple amino acid codes, representing the detection of more than one amino acid at that position, and it is ambiguous which amino acid should be used when modelling the genotype sequence. Within this study, the first amino acid code encountered is used for modelling. This is wholly inadequate and would have been improved upon had more time permitted.

5.2.2 Using A Different Attribute Selection Measure

As mentioned earlier, decision attributes were selected based on maximal information gain. I may have obtained better classification models had I used an attribute selection measure that does not favour attributes with large numbers of values. One way to avoid this bias is to select decision attributes based on a measure called gain ratio. The gain ratio penalises attributes with a large number of values by incorporating a term called split information, which is sensitive to how uniformly a decision attribute splits the training data:

split_information(S, A) = - sum over i of (|Si| / |S|) log2(|Si| / |S|)

where S1, ..., Sc are the subsets of S produced by partitioning S on the c values of the attribute A. Using this measure, the gain ratio is then calculated as:

gain_ratio(S, A) = ig(S, A) / split_information(S, A)

Had more time been permitted, I would have experimented with this and other possible selection measures. Would this have made a big impact on the complexity and structure of the decision tree models?
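A sketch of the gain-ratio computation follows; partitionSizes holds the sizes of the subsets produced by splitting the training set on a candidate position, and the information gain is assumed to be computed as in Chapter 3.

    import java.util.*;

    // Sketch of attribute selection by gain ratio: split information is the
    // entropy of the partition sizes themselves, so positions that fragment
    // the data into many subsets are penalised.
    public class GainRatio {

        static double splitInformation(Collection<Integer> partitionSizes, int total) {
            double si = 0.0;
            for (int size : partitionSizes) {
                if (size == 0) continue;
                double p = size / (double) total;
                si -= p * Math.log(p) / Math.log(2);
            }
            return si;
        }

        static double gainRatio(double informationGain,
                                Collection<Integer> partitionSizes, int total) {
            double si = splitInformation(partitionSizes, total);
            return si == 0.0 ? 0.0 : informationGain / si;
        }

        public static void main(String[] args) {
            // Twenty equally likely amino acids: split information log2(20) ~ 4.32
            // sharply damps the score...
            System.out.println(gainRatio(0.9, Collections.nCopies(20, 5), 100));
            // ...while a binary split (split information 1.0) is left intact.
            System.out.println(gainRatio(0.9, List.of(50, 50), 100));
        }
    }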

5.2.3 Handling Missing Information

The ability to handle missing information effectively was treated poorly in this study. A training set may only exhibit a subset of the possible mutations that can occur at a particular sequence position; when growing the decision tree from this dataset, only this subset of amino acids will be considered valid at that position, and the tree will fail to recognise the presence of other amino acids there. Within this study, if this situation arises at a decision node n, the classification process is halted and a classification of unknown is returned. This is not ideal: we would at least like to take into account the decision knowledge already portrayed by the tree up to that point, and returning a classification of unknown is worthless.

One possible strategy for dealing with this problem is to assign a classification based on the most common classification of the training examples associated with the decision node n. A second, more complex procedure begins by assigning a probability to each of the values of an attribute, based on the observed frequencies of the various values among the training examples at node n. When we encounter node n, instead of stopping, we continue down the most probable path and proceed as normal until we reach a classification. Had more time been permitted, I would have implemented the second of these strategies in order to maximise the information that a decision tree uses when classifying examples taken from the entire population of HIV-1 cases.

5.2.4 Using A Different Distance Metric

As has already been highlighted, the distance metric used for nearest-neighbour classification was not strong enough to truly judge the similarity of two genotype sequences. Had more time been permitted, I would have experimented with different distance measures. In particular, I would have investigated statistical measures based on the comparison of two dot-plots.
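As a starting point for such an investigation, the sketch below builds the dot-plot of a sequence against the reference as a boolean matrix and scores two sequences by how much their dot-plots agree. The overlap statistic used here is an assumed, illustrative one, not a measure evaluated in the thesis.

    // Sketch of a dot-plot-based comparison: dotPlot(ref, s) marks every
    // (i, j) at which ref and s share an amino acid; two sequences are then
    // scored by the overlap of their dot-plots against the same reference.
    public class DotPlotDistance {

        static boolean[][] dotPlot(String ref, String s) {
            boolean[][] plot = new boolean[ref.length()][s.length()];
            for (int i = 0; i < ref.length(); i++)
                for (int j = 0; j < s.length(); j++)
                    plot[i][j] = ref.charAt(i) == s.charAt(j);
            return plot;
        }

        // Fraction of cells at which the two dot-plots disagree (cf. figure 2.13).
        static double distance(String ref, String b, String c) {
            boolean[][] pb = dotPlot(ref, b), pc = dotPlot(ref, c);
            int cells = 0, mismatches = 0;
            for (int i = 0; i < pb.length; i++)
                for (int j = 0; j < Math.min(pb[i].length, pc[i].length); j++) {
                    cells++;
                    if (pb[i][j] != pc[i][j]) mismatches++;
                }
            return cells == 0 ? 0.0 : (double) mismatches / cells;
        }

        public static void main(String[] args) {
            String ref = "PQVTLW"; // illustrative fragment only
            System.out.println(distance(ref, "PQVTLW", "PQITLW"));
        }
    }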

5.2.5 Receiver Operating Characteristic Curve

A good way to analyse the performance of a classification model is to plot a receiver operating characteristic (ROC) curve. An ROC curve is commonly used within clinical research to investigate the trade-off between the sensitivity and specificity of a classification model. The x-axis of an ROC graph corresponds to the specificity of a model, i.e. the ability of the model to identify true negatives. Conversely, the y-axis corresponds to the sensitivity of a model, i.e. how well the model is able to predict true positives. In this way we are able to look more closely at the ability of a classification model to discriminate between drug-resistant and drug-susceptible genotypes. The greater the sensitivity at high specificity values, the better the model. Also, by doing such an analysis we better facilitate the comparison of two or more classification models. In order to plot an ROC curve for each of the decision tree models that I created, I would have to introduce a weighting parameter, α, to determine whether an example should be classified as resistant or not. In more detail, for each leaf node in which there is a prediction error (i.e. a leaf created as a result of pruning) we have a probability of an example being classified as resistant. We introduce α at these leaves and we say that an example is resistant if and only if the probability of the example being resistant is greater than α. Now we can create the ROC curve by varying α from 1 down to 0 and computing the sensitivity and specificity of the model for each value of α; this sweep is sketched at the end of this section. During the scope of this project I had hoped to compare the performance of the classification models by comparing the ROC curves that they produce. However, I was unable to complete this work through lack of time.

5.2.6 Other Machine Learning Approaches

There are many other machine-learning strategies available. For example, by using genetic algorithms we can generate a set of classification rules similar to decision tree learning. Investigating the possible use of these methods for the phenotype prediction problem presents a wide scope for future research. In particular, can we obtain better classification models by using different machine-learning approaches, or even better results through the hybridisation of different approaches?
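The α sweep described in Section 5.2.5 could be realised along the following lines. This is a sketch only: the classifier interface is hypothetical and assumes each pruned leaf can report the proportion of its training examples that were resistant.

import java.util.*;

/** Sketch of the proposed ROC sweep over the weighting parameter alpha.
 *  The classifier interface is hypothetical, not the thesis code. */
public class RocSweepSketch {

    interface ProbabilisticClassifier {
        /** Leaf-level estimate of the probability that a sequence is resistant. */
        double probabilityResistant(char[] sequence);
    }

    /** Returns (sensitivity, specificity) pairs as alpha falls from 1 to 0. */
    static List<double[]> rocPoints(ProbabilisticClassifier model,
                                    List<char[]> sequences,
                                    List<Boolean> trulyResistant) {
        List<double[]> points = new ArrayList<>();
        for (int step = 20; step >= 0; step--) {
            double alpha = step / 20.0;
            int tp = 0, fn = 0, tn = 0, fp = 0;
            for (int i = 0; i < sequences.size(); i++) {
                boolean predictedResistant =
                        model.probabilityResistant(sequences.get(i)) > alpha;
                if (trulyResistant.get(i)) { if (predictedResistant) tp++; else fn++; }
                else                       { if (predictedResistant) fp++; else tn++; }
            }
            // Assumes the test set contains at least one example of each class.
            points.add(new double[] { tp / (double) (tp + fn),     // sensitivity
                                      tn / (double) (tn + fp) });  // specificity
        }
        return points;
    }
}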

Appendix A

Pre-Processing Software

I developed a simple text parser, using Java, that takes as input a data file from the Stanford HIV Resistance Database and outputs a new data file containing the modelled samples, as previously described. This can be run from the command line using the command

> java DataParser -i inputfile.txt -o outputfile.txt -g gene

where inputfile.txt is a data file obtained from the Stanford Database and gene is one of reverse transcriptase (rt) or protease (p). The parser reads each line in the original data file separately and interprets the information present in specific columns. This is a reasonable strategy considering the format of the file (tab delimited). Furthermore, future updates to this information will only ever alter the amount of data, not the way it is presented. On interpretation of each line in the original file, a new instance is created that has a unique identification, the fold-change values for each drug and a set of attributes as described previously. Individual instances are then written to the program's output, see below.

[Figure: a data file from the Stanford HIV Resistance Database is fed through the simple text parser to produce instances. Example instance columns: SeqId, APV_Fold, ATV_Fold, NFV_Fold, P1, P2, P3, ...; each row holds a sequence identifier, per-drug fold-change values and the amino acid at each position (e.g. P, Q, V, F, ...).]
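The core of the parser can be pictured as follows. This is a simplified sketch rather than the DataParser source: the column layout shown is illustrative, and the real program also maps the -g option to the appropriate gene-specific columns.

import java.io.*;

/** Simplified sketch of the parser's core loop over a tab-delimited Stanford
 *  data file. Column indices are illustrative, not the real file layout. */
public class DataParserSketch {
    public static void main(String[] args) throws IOException {
        String inputFile = args[1];   // value following the -i flag
        String outputFile = args[3];  // value following the -o flag
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile));
             PrintWriter out = new PrintWriter(new FileWriter(outputFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] columns = line.split("\t");
                String seqId = columns[0];      // unique identification
                String foldChange = columns[1]; // one of the per-drug fold-change columns
                // Remaining columns: the amino acid observed at each sequence position.
                StringBuilder instance = new StringBuilder(seqId).append('\t').append(foldChange);
                for (int i = 2; i < columns.length; i++)
                    instance.append('\t').append(columns[i]);
                out.println(instance);          // one instance per input line
            }
        }
    }
}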

Appendix B

Cultivated Phenotype

I developed a generic machine-learning program, called cultivated phenotype, that (1) represents and manipulates a dataset of phenotype-genotype pairs as obtained from the Stanford HIV-1 reverse transcriptase and protease drug-susceptibility datasets; (2) characterises a dataset; (3) constructs a decision tree according to a training dataset; (4) prunes a decision tree according to a validation dataset; (5) predicts drug-susceptibility from genotype using either a decision tree or 3-nearest-neighbour classifier; (6) displays performance statistics of a single decision tree according to a testing dataset; and (7) compares the performance of newly constructed decision trees, hand-coded decision trees and 3-nearest-neighbour classifiers according to a testing dataset. The learning component was developed using Java because it was readily available and did not require any licences. Furthermore, developing the program in Java allows for easy conversion into a form that could be made accessible from the World Wide Web. This is advantageous because it promotes the distribution of the knowledge that the program prescribes, and this knowledge could ultimately be harnessed by clinicians to help manage HIV-1 infection. In detail, the learning component was developed using an object-oriented methodology and the following classes were defined: CultivatedPhenotype, Example, Attribute, Reference, DrugWindow, AlterExperience, DataCharacteristics, Tree, BeerenwinkelModels, NearestNeighbour, QueryWindow, PerformanceWindow, Data, and Graph. These are summarised as follows:

CultivatedPhenotype: This is the main program thread and all subsequent functionality stems from it. It defines four tables: the first containing a complete dataset of phenotype-genotype examples; the second containing a subset of examples to be used for training; the third containing a subset of examples to be used for validation; and the fourth containing a subset of examples to be used for testing. Functionality realised by the procedures: void loadFile(String filename).

DrugWindow: Defines a number of antiretroviral drugs and associated fold-change thresholds. Provides functionality to filter a dataset according to a particular drug, retaining only information related to that drug. Functionality realised by the procedures: void filterDrug(String antiretroviral), void setThreshold(double foldValue).

Example: Defines a single phenotype-genotype pair. In particular, an Example contains a fold-change value, a sequence identification and a set of Attributes. In addition, each Example contains a feature vector constructed from its set of Attributes and a Reference. Includes functionality to compute the similarity of this example (feature vector) to another. Functionality realised by the procedures: boolean isResistant(), String getAttribute(int index), void computeComparisionArray() and int distanceFromThisSequence(char[] otherComparisionArray).

AlterExperience: Defines the training, validation and test sets. Provides functionality to randomly set aside a percentage of the entire dataset for testing. Functionality realised by the procedures: void setAndDisplayTrainingTestSets(int percentage).

DataCharacteristics: Describes a number of properties of the training, validation and testing datasets. For example, determines the total number of examples, the number of examples classified as resistant and the distribution of fold-change values. Functionality realised by the procedures: String getDataCharacteristics().
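The feature vector mentioned under Example can be pictured with the following minimal sketch: positions identical to the reference sequence receive a dummy marker, while mutated positions keep their amino acid. The use of '-' as the dummy value is an assumption; the marker actually used by the software is not recorded here.

/** Sketch of feature-vector construction in the spirit of
 *  computeComparisionArray(). The '-' dummy marker is an assumption. */
public class FeatureVectorSketch {
    static final char DUMMY = '-';

    static char[] toFeatureVector(char[] sample, char[] reference) {
        char[] features = new char[sample.length];
        for (int i = 0; i < sample.length; i++) {
            // Unchanged positions get the dummy value; mutations are kept.
            features[i] = (sample[i] == reference[i]) ? DUMMY : sample[i];
        }
        return features;
    }

    public static void main(String[] args) {
        char[] ref    = "PQVTLWQRPL".toCharArray();
        char[] sample = "PQITLWQRPV".toCharArray();
        System.out.println(new String(toFeatureVector(sample, ref))); // prints --I------V
    }
}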

Tree: Defines the gross structure of a decision tree. Includes functionality to grow a new decision tree using the ID3 algorithm, a set of examples and a set of attributes. Includes functionality to self-prune using reduced-error pruning and a set of validation examples (this pruning strategy is sketched at the end of this appendix). Includes functionality to return a drug-susceptibility classification given a genotype sequence. Functionality realised by the procedures: void addBranch(Tree subtree, String label), Tree id3(Object[] examples, Vector attributes), Attribute getBestAttributeGain(), double getEntropy(Object[] examples), void prune(Object[] examples), String queryTree(char[] sequence).

BeerenwinkelModels: Defines a number of Tree objects corresponding to the decision tree classifiers presented in [1].

NearestNeighbour: Defines a 3-nearest-neighbour classifier that imposes an ordering on the training and validation datasets according to a similarity measure defined on Example. Includes functionality to return a drug-susceptibility classification given a genotype sequence. Functionality realised by the procedures: String queryNearestNeighbour(Example queryExample), String getClassification().

QueryWindow: Provides the ability to query a decision tree or nearest-neighbour classifier using either a mutation list or a nucleotide sequence. Includes functionality to obtain a classification, output a drug-susceptibility classification and give an explanation of a classification. Functionality realised by the procedures: char[] getQuerySequence(), char translateCodon(String codon).

Data: Defines a number of properties regarding the performance of a classifier with respect to a testing dataset. For example, a Data object stores the total number of examples in a testing set; the number of examples correctly classified as resistant; the number of examples incorrectly classified as resistant; the number of examples correctly classified as susceptible; and the number of examples incorrectly classified as susceptible. Includes functionality to compute the sensitivity, specificity, positive prediction value, negative prediction value, positive likelihood ratio and the negative likelihood ratio.

Functionality realised by the procedures: double getSensitivity(), double getSpecificity(), double getPositiveLikelihoodRatio(), double getNegativeLikelihoodRatio(), double getPositivePredictionValue(), double getNegativePredictionValue(), double getPercentageCorrectlyClassified().

PerformanceWindow: Displays the information stored in a Data object.

Graph: Plots the results of a number of experimental runs. Includes functionality to create an independent testing dataset. The remaining examples are then sampled to create a variety of learning experiences and, for each learning experience, a new decision tree and nearest-neighbour classifier is constructed. The performance of each new decision tree and nearest-neighbour classifier on the testing dataset is recorded. In addition, the performance of a BeerenwinkelModel is computed on the same testing dataset. The performance of each model is plotted for each training experience. Functionality realised by the procedure: Vector testDTree().

Above, each class encompasses a number of important procedures. The role of these procedures is described below.

void loadFile(String filename)
    Reads a data file as constructed from the text parser described in Appendix A. Creates a table of phenotype-genotype pairs.

void filterDrug(String antiretroviral)
    Removes from the entire dataset of phenotype-genotype pairs the fold-change values associated with each drug except for antiretroviral.

void setThreshold(double foldValue)
    Initialises the fold-change threshold to be used for discriminating resistant from susceptible samples.

boolean isResistant()
    Compares the fold-change value of the Example with the fold-change threshold. Returns true (resistant) if the fold-change value exceeds the threshold and false (susceptible) otherwise.

String getAttribute(int i)
    Given an index i, returns the value of the ith attribute; in other words, returns the amino acid present at position i.

void computeComparisionArray()
    Creates a feature vector for the Example. In particular, retrieves a reference sequence and compares it to the attribute values of the Example. For positions in which there was no change in amino acid, the feature vector was augmented to include a dummy value; for other positions, the feature vector was augmented to include the attribute value.

int distanceFromThisSequence(char[] ot)
    Given a feature vector, ot, returns a score representing the distance between the feature vector of this Example and ot. The distance is computed as the sum of two factors: for positions in which the two feature vectors contain non-dummy values, the percentage of these that are different; plus the percentage of the remaining positions that are different.

void setAndDisplayTrainingTestSets(int i)
    Given a percentage of the entire dataset to allocate for testing, i, randomly samples the entire dataset to construct a training and testing dataset. Furthermore, 20% of the training set is then randomly sampled to create a validation dataset. Examples are selected using a random number generator (without replacement) such that each Example is equally likely to be picked.

String getDataCharacteristics()
    Given training, validation and testing datasets, outputs a description of their characteristics. In particular, it computes the total number of Examples in each dataset; the number of Examples that are currently classified as resistant; the sequences in each dataset with their phenotypes; and the distribution of phenotype values within each dataset.

void addBranch(Tree subtree, String label)
    Creates a new branch from a tree node (amino-acid position) with a label (amino acid) and a subtree.

Tree id3(Object[] examples, Vector ats)
    Given a set of phenotype-genotype pairs, examples, and a set of attributes, ats, constructs a decision tree according to the ID3 algorithm.

Attribute getBestAttributeGain()
    Using a set of phenotype-genotype pairs and a set of attributes, returns the attribute with maximal information gain.

double getEntropy(Object[] examples)
    Given a set of phenotype-genotype pairs, examples, computes the entropy of the dataset, a measure of the (im)purity of the dataset.

void prune(Object[] examples)
    Given a dataset of phenotype-genotype pairs, examples, considers each node in the tree for pruning. In particular, implements the reduced-error pruning strategy.

String queryTree(char[] sequence)
    Given an unseen genotype sequence, sequence, obtains a drug-susceptibility classification by sorting the example down through the tree. In particular, each node in the tree tests a specific amino acid position for certain amino acids.

String queryNearestNeighbour(Example q)
    Given an unseen genotype Example, imposes an ordering on the Examples in the training and validation datasets as dictated by the distance measure computed by int distanceFromThisSequence(char[] ot).

String getClassification()
    Provided that the Examples in the training and validation datasets are ordered, selects the three closest Examples to an unseen genotype. Returns the majority drug-susceptibility classification.

char[] getQuerySequence()
    Constructs a genotype sequence from a set of mutations.

char translateCodon(String codon)
    Given a sequence of three nucleotides, codon, uses the genetic code to translate this codon into an amino acid.

double getSensitivity()
    Given the number of true positives, tp, and false negatives, fn, returns tp / (tp + fn).

double getSpecificity()
    Given the number of false positives, fp, and true negatives, tn, returns tn / (fp + tn).

double getPositiveLikelihoodRatio()
    Returns sensitivity / (1 - specificity).

double getNegativeLikelihoodRatio()
    Returns (1 - sensitivity) / specificity.

double getPositivePredictionValue()
    Given the number of true positives, tp, and false positives, fp, returns tp / (tp + fp).

double getNegativePredictionValue()
    Given the number of true negatives, tn, and false negatives, fn, returns tn / (tn + fn).

double getPercentageCorrectlyClassified()
    Given the total number of examples tested, tot, the number of true positives, tp, and the number of true negatives, tn, returns ((tp + tn) / tot) * 100.

Vector testDTree()
    Creates a testing set of phenotype-genotype pairs using 20% of the entire dataset. Constructs 20 different training experiences by randomly sampling the remaining data. Creates a new decision tree classifier for each learning experience. For each training experience, the performance over the entire testing set of each newly constructed decision tree, a Beerenwinkel model and the k-nearest-neighbour classifier is plotted.
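Two of the procedures above are perhaps easiest to understand from small sketches. First, reduced-error pruning in the spirit of prune(Object[] examples): each node is collapsed to a majority-class leaf whenever doing so does not increase the error over the validation examples that reach it. The node fields below are hypothetical and simplified relative to the Tree class.

import java.util.*;

/** Sketch of bottom-up reduced-error pruning. Node fields are hypothetical. */
public class PruneSketch {
    static class TreeNode {
        Map<Character, TreeNode> children = new HashMap<>();
        int position;         // amino-acid position tested at this node
        String leafLabel;     // non-null for leaves
        String majorityLabel; // majority class of training examples at this node
        boolean isLeaf() { return leafLabel != null; }
    }

    /** valSeqs/valLabels are the validation examples that reach `node`. */
    static void prune(TreeNode node, List<char[]> valSeqs, List<String> valLabels) {
        if (node.isLeaf()) return;
        // First prune the subtrees, passing each the examples routed to it.
        for (Map.Entry<Character, TreeNode> e : node.children.entrySet()) {
            List<char[]> seqs = new ArrayList<>();
            List<String> labs = new ArrayList<>();
            for (int i = 0; i < valSeqs.size(); i++) {
                if (valSeqs.get(i)[node.position] == e.getKey()) {
                    seqs.add(valSeqs.get(i));
                    labs.add(valLabels.get(i));
                }
            }
            prune(e.getValue(), seqs, labs);
        }
        // Then compare the subtree against a single majority-class leaf.
        int asSubtree = 0, asLeaf = 0;
        for (int i = 0; i < valSeqs.size(); i++) {
            if (!classify(node, valSeqs.get(i)).equals(valLabels.get(i))) asSubtree++;
            if (!node.majorityLabel.equals(valLabels.get(i))) asLeaf++;
        }
        if (asLeaf <= asSubtree) node.leafLabel = node.majorityLabel; // collapse the node
    }

    static String classify(TreeNode node, char[] seq) {
        while (!node.isLeaf()) {
            TreeNode next = node.children.get(seq[node.position]);
            if (next == null) return node.majorityLabel; // unseen amino acid
            node = next;
        }
        return node.leafLabel;
    }
}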
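Second, the nearest-neighbour classification realised by queryNearestNeighbour and getClassification, using the two-component distance described for distanceFromThisSequence. As before, '-' standing in for the dummy value is an assumption.

import java.util.*;

/** Sketch of 3-nearest-neighbour classification over feature vectors, using
 *  the two-component distance described above. '-' marks a dummy value. */
public class NearestNeighbourSketch {
    static final char DUMMY = '-';

    /** Sum of (i) the fraction of mutually non-dummy positions that differ
     *  and (ii) the fraction of the remaining positions that differ. */
    static double distance(char[] a, char[] b) {
        int bothMutated = 0, bothMutatedDiff = 0, rest = 0, restDiff = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != DUMMY && b[i] != DUMMY) {
                bothMutated++;
                if (a[i] != b[i]) bothMutatedDiff++;
            } else {
                rest++;
                if (a[i] != b[i]) restDiff++;
            }
        }
        double d = 0;
        if (bothMutated > 0) d += bothMutatedDiff / (double) bothMutated;
        if (rest > 0) d += restDiff / (double) rest;
        return d;
    }

    /** Majority vote over the three nearest training examples.
     *  Assumes the training set holds at least three examples. */
    static boolean classify(char[] query, List<char[]> train, List<Boolean> resistant) {
        Integer[] idx = new Integer[train.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(query, train.get(i))));
        int votes = 0;
        for (int k = 0; k < 3; k++) if (resistant.get(idx[k])) votes++;
        return votes >= 2;
    }
}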

Appendix C

The Complete Datasets

Given below is a list of the sequences (given as sequence IDs used in the Stanford HIV Resistance Database) along with the fold-change values that were associated with each drug.

I used the following sequences from the Stanford HIV-1 protease drug susceptibility dataset:

Instance_Id, Seq_Id, APV_Fold, ATV_Fold, NFV_Fold, RTV_Fold, SQV_Fold, LPV_Fold, IDV_Fold
[Table of instance identifiers with per-drug fold-change values.]

I used the following sequences from the Stanford HIV-1 reverse transcriptase drug susceptibility dataset:

Id, Seq_Id, 3TC_Fold, ABC_Fold, D4T_Fold, DDC_Fold, DDI_Fold, DLV_Fold, EFV_Fold, NVP_Fold
[Table of instance identifiers with per-drug fold-change values.]

Appendix D

Original Decision Trees

Given below are the decision trees for ZDV (a), ddC (b), ddI (c), d4T (d), 3TC (e), ABC (f), NVP (g), DLV (h), EFV (i), SQV (j), IDV (k), NFV (m) and APV (n), as presented in [1]. Note that (a) is unable to offer a classification because no labels are attached to the leaves, apart from one.
