Predicting Human Immunodeficiency Virus Type 1 Drug Resistance From Genotype Using Machine Learning. Robert James Murray


Predicting Human Immunodeficiency Virus Type 1 Drug Resistance From Genotype Using Machine Learning. Robert James Murray Master of Science School of Informatics University Of Edinburgh 2004

ABSTRACT: Drug resistance testing has been increasingly incorporated into the clinical management of human immunodeficiency virus type 1 (HIV-1) infection. At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs: phenotyping and genotyping. Although phenotyping provides a quantified measurement of drug resistance, it involves a complex and time-consuming procedure. Genotyping, on the other hand, involves a relatively simple procedure and can be done quickly; however, interpreting drug resistance from genotype information alone is challenging. A number of machine-learning methods have now been used to automatically relate HIV-1 genotype to phenotype, but the predictive quality of these models has been mixed. This study investigates the nature of these computational models and their predictive merit. Using the complete Stanford dataset of matched phenotype-genotype pairs, two contrasting machine-learning approaches were implemented to analyse the significance of sequence mutations in the protease and reverse transcriptase genes of HIV-1 for 14 antiretroviral drugs. Both decision tree and nearest-neighbour classifiers were generated and compared with previously published classification models. I found prediction errors of 6.3-24.7% for decision tree models and 18.0-46.2% for nearest-neighbour classifiers. This was compared with prediction errors of 8.1-51.0% for previously published decision tree models and a correlation coefficient of 0.88 for a neural network lopinavir classification model.

Declaration I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. (Robert James Murray)

Table of Contents

1 Introduction
  1.1 Overview
  1.2 HIV Infection
  1.3 Resistance Testing
  1.4 Phenotype & Genotype Resistance Tests
  1.5 Literature Review
2 Machine Learning
  2.1 Overview & The Phenotype Prediction Problem
  2.2 The Training Experience, E
  2.3 Learning Mechanisms
    2.3.1 Decision Trees
    2.3.2 Artificial Neural Networks
    2.3.3 Instance Based
3 Materials & Methods
  3.1 Data Set
  3.2 Decision Trees
  3.3 Nearest-Neighbour
4 Results
  4.1 Classification Models
    4.1.1 Reverse Transcriptase Inhibitors
    4.1.2 Protease Inhibitors
  4.2 Prediction Quality
5 Conclusion
  5.1 Concluding Remarks & Observations
    5.1.1 Decision Tree Models
    5.1.2 Neural Network Models
    5.1.3 Nearest-Neighbour Models
  5.2 Suggestions For Further Work
    5.2.1 Handling Ambiguity Codes
    5.2.2 Using A Different Attribute Selection Measure
    5.2.4 Using A Different Distance Metric
    5.2.5 Receiver Operating Characteristic Curve
    5.2.6 Other Machine Learning Approaches
A Pre-Processing Software
B Cultivated Phenotype
C The Complete Datasets
D Original Decision Trees
Bibliography

Chapter 1 Introduction

1.1 Overview

Drug resistance testing has been increasingly incorporated into the clinical management of human immunodeficiency virus type 1 (HIV-1) infection. Resistance tests can show whether a particular HIV-1 strain is likely to be suppressed by a drug. At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs: phenotyping and genotyping. Whereas phenotypic assays provide a direct quantification of drug resistance, genotypic assays only provide clues towards drug resistance. In particular, genotyping attempts to establish the presence or absence of genetic mutations in the protease and reverse transcriptase genes of HIV-1 that have previously been associated with drug resistance. Although phenotyping is recognised as providing a quantified measurement of drug resistance, it involves a complex and time-consuming procedure. Genotyping, on the other hand, involves a relatively simple procedure and can be done quickly. However, the interpretation of drug resistance from genotype information alone is still challenging, and often requires expert analysis. A number of machine-learning methods have now been used to automatically relate HIV-1 genotype with phenotype [1,2,10,9]. Using datasets of matched phenotype-genotype pairs, machine-learning methods can derive computational models that predict phenotypic resistance from genotypic information. However, the predictive quality of these models has been mixed. For some drugs, these models offer reasonable prediction rates, but for others the results are less useful for managing HIV-1 infection. This study investigates the nature of these computational models and their predictive merit. Specifically, related to previous work [1], I had the following initial goals:

generate decision tree classifiers to recognise genotypic patterns attributed to drug resistance; evaluate the predictive quality and nature of these models in retrospect; and consider the application of other machine learning approaches.

Using the complete Stanford dataset of matched phenotype-genotype pairs, two contrasting machine-learning approaches were implemented, in Java, to analyse the significance of sequence mutations in the protease and reverse transcriptase genes of HIV-1 for 14 antiretroviral drugs. Specifically, decision tree classifiers were generated to predict drug susceptibilities from genotypic information, and nearest-neighbour classifiers were built to identify genotypes with similar mutational patterns. Also evaluated in the study was the possible use of artificial neural networks, as proposed in [2]. The predictive quality of the classification models was analysed against an independent testing set. For decision trees, I found prediction errors between 6.3-24.7% for all drugs. This was compared with the performance of the decision trees presented in [1], which achieved prediction errors in the range of 8.1-51.0% over the same testing set. Nearest-neighbour classifiers exhibited poorer performance, with prediction errors between 18.0-46.2%.

1.2 HIV Infection

Human immunodeficiency virus (HIV) is a superbly adapted human pathogen. As of the end of 2003, an estimated 37.8 million people worldwide (35.7 million adults and 2.1 million children younger than 15 years) were living with HIV/AIDS, and an estimated 4.8 million of these were new HIV infections. Once HIV enters the human bloodstream - through sexual contact, exchange of blood or breast milk - it seeks a host cell in order to reproduce. Although HIV can infect a number of cells in the body, the main target is an immune cell called a lymphocyte - a type of T-cell.
Once HIV comes in contact with a T-cell, it attaches itself to it and hijacks its cellular machinery to reproduce thousands of new copies of HIV, see figure 1.1. T-cells are an important part of the immune system because they help facilitate the body's response to many common but potentially fatal infections. Without enough T-cells, the body's immune system is unable to defend itself against infections.

image from http://www.aidsmeds.com/lessons/lifecycle1.htm

Fig. 1.1 The HIV life-cycle. In (1) HIV encounters a T-cell and gp120 (on the surface of HIV) binds to the T-cell's CD4 molecule. The membranes of the HIV particle and T-cell fuse, and the contents of the HIV particle are released into the T-cell. In (2) reverse transcription creates a DNA copy of the virus's RNA. In (3) the HIV DNA is transported to the T-cell's nucleus, where another viral enzyme, called integrase, inserts the proviral DNA into the cell's DNA. In (4) HIV's genetic material directs the T-cell to produce new HIV. In (5) and (6) a new HIV particle is assembled.

Being HIV-positive, or being infected with HIV, is not the same as having acquired immune deficiency syndrome (AIDS). Someone with HIV can live for many years with few ill effects, provided that their body replaces the T-cells destroyed by the virus. However, once the number of T-cells diminishes below a certain threshold, infected individuals will start to display symptoms of AIDS. Such individuals have a lowered immune response and are highly susceptible to a wide range of infections that are harmless to healthy people but may prove fatal to them. Indeed, in 2003, HIV/AIDS-associated illnesses caused the deaths of approximately 2.9 million people worldwide, including an estimated 490,000 children younger than 15 years, and since the first AIDS cases were identified in 1981, over 20 million people with HIV/AIDS have died.

1.3 Resistance Testing

There are now a number of antiretroviral drugs approved for treating HIV-1 infection. Treatment with combinations of these drugs can offer individuals prolonged virus suppression and a chance for immunologic reconstitution. However, unless therapy suppresses virus replication completely, the selective pressure of antiretroviral treatment enhances the emergence of drug-resistant variants, see figure 1.2.

image from http://www.vircolab.com/bgdisplay.jhtml?itemname=understandinghiv

Fig. 1.2 The development of drug resistance.

The emergence of these variants depends on the development of genetic variations in the virus that allow it to escape from the inhibitory effects of a drug. A drug-resistant variant is identifiable from a reference virus by the manifestation of mutations that contribute to reduced susceptibility. This occurs through natural mechanisms. In particular, HIV-1 genetic variability results from the inability of the HIV-1 reverse transcriptase to proofread nucleotide sequences during replication [3]. This is compounded by the high rate of HIV-1 replication (approximately 10 billion particles/day), a spontaneous mutation rate of approximately 1 mutation/copy, and genetic recombination when viruses of different sequence infect the same cell. Once a drug-resistant variant has emerged, greater levels of the same antiretroviral drug are required to suppress virus replication. However, greater levels of drug increase the risk of adverse side effects and harm to an individual. Therefore, when resistance occurs, patients often need to change to a new drug regimen. Resistance tests can help by showing whether a particular HIV-1 strain is likely to be suppressed by a drug or not. Nevertheless, until recently, resistance testing was used solely as a research tool to investigate the mechanisms of drug failure. In July 1998, the idea of extending the methodology of resistance testing to routine clinical management, although logical, could not be recommended due to a lack of validation, standardisation and a concrete definition of the role of the testing [4]. Since then, a number of studies have emerged indicating the worth of resistance testing for clinical management, and the lack of standardisation has been addressed with the development of commercial resistance tests. Both of these factors contributed to a second statement, published in early 2000, which explicitly recognised the value of HIV-1 drug resistance testing [5]. Finally, in May 2000, the International AIDS Society recommended the incorporation of drug-resistance testing in the clinical management of patients with HIV-1 infection [6]. Furthermore, considerable data supporting the use of drug-resistance testing has now been published or presented at international conferences [7].

1.4 Phenotype And Genotype Resistance Tests

At present, there are two ways to assess the susceptibility of an HIV-1 isolate to antiretroviral drugs in vitro: phenotyping and genotyping. Phenotyping directly measures the susceptibility of HIV-1 strains to particular drugs, whereas genotyping establishes the absence or presence of specific mutations in HIV-1 that have previously been associated with drug resistance. Phenotype resistance tests involve direct quantification of drug sensitivity: viral replication is measured in cell cultures under the selective pressure of increasing concentrations of antiretroviral drugs. Specifically, a sample of blood is taken from a patient and the HIV is isolated. The reverse transcriptase and protease genes are then identified and amplified, and the amplified genes are inserted into a laboratory strain of HIV which can be grown in the laboratory (a recombinant virus). The ability of the virus to grow in the presence of each antiretroviral drug is then evaluated.
In particular, the virus is grown in the presence of varying concentrations of antiretroviral drugs, and its ability to grow in the presence of these drugs is compared to that of a reference strain. The outcome of a phenotypic test may be expressed as an IC50, IC90 or IC95 value, where the IC value expresses the concentration of a particular drug required to inhibit the growth of the virus by 50%, 90% or 95%, respectively, see figure 1.3. The level of drug resistance is typically reported by comparing the IC50 value of the patient's HIV with that of a reference strain. In particular, a degree of fold-change is reported, where fold-change indicates the increase in the amount of drug needed to stop viral replication compared to the reference strain. In other words, a phenotypic resistance test gives a clear indication of the capability of an HIV-1 strain to grow in the presence of a particular concentration of an antiretroviral drug. These results are easily interpreted, but the tests often prove time consuming, expensive and labour intensive.

Fig. 1.3. Comparing the IC50 values of a reference and resistant strain (antiviral effect, in %, against drug concentration).

Genotypic resistance tests are based on the analysis of mutations associated with resistance. In a genotypic test, the nucleotide sequence of the virus genome is determined by direct sequencing of PCR products. This sequence is then aligned with a reference strain and the differences are recorded. In particular, the results of a genotypic test are given as a list of point mutations with respect to a reference strain. Such mutations are expressed by their position in a certain gene, preceded by the letter corresponding to the amino acid seen in the reference virus and followed by the mutated amino acid. For example, M184V corresponds to the substitution of Methionine by Valine at codon 184. In contrast to phenotypic tests, genotypic tests usually provide results in a few days and are less expensive. However, genotyping can be viewed as primarily measuring the likelihood of reduced susceptibility, with the major challenge being the correct interpretation of the results in order to associate a realistic level of drug resistance. The specific type and placement of the mutations determines which drugs the virus may be resistant to. For example, if the M184V mutation is discovered in a patient's HIV, the virus is probably resistant to the reverse transcriptase inhibitor lamivudine. In this respect, clinicians often utilise tables of known mutations attributed to drug resistance, see figure 1.4. However, this is not a simple task, because the clinician cannot consider each mutation independently of the others. Specifically, the influence of a mutation on drug resistance must be considered as part of a global interaction [8]. In addition, it is not viable to continue to use such tables of mutations, because their complexity must grow as the number of drugs, and especially drug combinations, increases.

image from http://hivinsite.ucsf.edu/insite?page=kb-authors&doc=kb-03-02-07

Fig. 1.4. Mutations in the protease associated with reduced susceptibility to protease inhibitors.
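The point-mutation notation and the table-lookup style of interpretation described above can be sketched in Python. The mutation table below is an illustrative fragment only: the M184V/lamivudine association is the one mentioned in the text, while the protease entries are placeholders, and a per-mutation lookup of this kind ignores the global interactions discussed above.

```python
import re

MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_mutation(code):
    """Split a code such as 'M184V' into (reference amino acid,
    codon position, mutant amino acid)."""
    match = MUTATION_RE.match(code)
    if match is None:
        raise ValueError(f"not a valid mutation code: {code!r}")
    ref, pos, mut = match.groups()
    return ref, int(pos), mut

# Illustrative fragment of a drug-resistance mutation table.
RESISTANCE_MUTATIONS = {
    "lamivudine": {"M184V"},
    "saquinavir": {"G48V", "L90M"},  # placeholder protease entries
}

def flagged_drugs(genotype):
    """Drugs for which the genotype carries a known resistance mutation."""
    observed = set(genotype)
    return sorted(drug for drug, known in RESISTANCE_MUTATIONS.items()
                  if known & observed)

print(parse_mutation("M184V"))           # ('M', 184, 'V')
print(flagged_drugs(["M184V", "K20R"]))  # ['lamivudine']
```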

1.5 Literature Review

A number of statistical methods have been used to investigate the relationship between the results of HIV-1 genotypic and phenotypic assays. Cluster analysis, linear discriminant analysis, heuristic scoring metrics, nearest neighbour, neural networks and recursive partitioning have all been used to correlate drug susceptibility with genotype. However, the problem of relating the results of genotypic and phenotypic assays poses several statistical challenges, and the success of these methods has been mixed. Firstly, phenotype must be considered as a consequence of a large number of possible mutations. As mentioned previously, this is compounded by the fact that the effect of mutations at any given position is influenced by the presence of mutations at other positions, and it is therefore necessary to detect global interactions. The use of cluster analysis and linear discriminant analysis is described in [9]. Investigating drug-resistance mutations of two protease inhibitors, saquinavir (SQV) and indinavir (IDV), the results of these analyses were comparable. In particular, both analyses were able to identify the association of mutations at amino acid positions 10, 63, 71 and 90 with in vitro resistance to SQV and IDV. In elaboration, cluster analysis requires a notion of distance between any two amino acid sequences. Typically, a set of indicator variables is created for each amino acid position, and a vector of indicator variables is then used to represent a sequence. The distance between two sequences can then be defined as the standard Euclidean distance between the two corresponding vectors. Such a measure can then be used to create groups of amino acid sequences with similar genotypes. Furthermore, the distance between any two groups can be defined as the average of the individual distances between all pairs of members of the two groups.
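The indicator-variable encoding and Euclidean distance just described can be sketched as follows (the reference and sample sequences are toy five-residue strings, not real HIV-1 genes):

```python
import math

def indicator_vector(sequence, reference):
    """One 0/1 indicator per position: 1 where the amino acid
    differs from the reference sequence."""
    return [int(a != r) for a, r in zip(sequence, reference)]

def distance(seq_a, seq_b, reference):
    """Standard Euclidean distance between the two indicator vectors."""
    va = indicator_vector(seq_a, reference)
    vb = indicator_vector(seq_b, reference)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

reference = "MKLVA"
print(distance("MKMVA", "TKMVC", reference))  # sqrt(2): the mutation profiles differ at 2 positions
```

The group-to-group distance used for clustering is then just the mean of such pairwise distances over all cross-group pairs.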
Creating hierarchies of such groups, or clusters, then facilitates the investigation of the degree to which similar genotypes have similar phenotypes. On the other hand, linear discriminant analysis can be used to determine which mutations best predict sensitivity to drugs, as defined by the phenotypic IC50 value. A dataset consisting of matched phenotype-genotype pairs is split into two groups, labelled as either resistant or susceptible on the basis of an IC50 cut-off. A linear discriminant function is then used to predict which group an unknown genotype belongs to, where the linear discriminant function is simply a linear combination of predictors or indicator variables (the choice here varies). In contrast, a simple scoring metric is used in [10] to predict HIV-1 protease inhibitor resistance. Here, a database of genotype-phenotype pairs was analysed. It was found that samples with one or two mutations in the protease gene were phenotypically susceptible, and samples with five or more mutations were resistant against all protease inhibitors. A list of all the mutations present in the database was compiled and split into two groups: one in which mutations were frequent in both susceptible and resistant samples, and one in which mutations were predominantly present in resistant samples. A scoring system using the presence or absence of any single mutation compiled by Schinazi et al [11] was then used as a criterion for predicting phenotypic resistance from genotype. This was enhanced by the incorporation of a secondary score that takes into account the total number of resistance-associated mutations in the protease. This approach achieved high sensitivity (92.5-98.5%) but lower specificity (57.9-77.3%) on unseen cases. Commercially, the Virco-Tibotec company has over time accumulated a database of around 100,000 genotypes and phenotypes. Using this dataset, a VirtualPhenotype(TM) is generated from genotype by means of a nearest-neighbour strategy. In particular, their system begins by identifying all the mutations in the genotype that can affect resistance to each drug. It uses this profile to interrogate the dataset for previously seen genotypes that have similar profiles. When all the possible matches are identified, the phenotypes for these samples are retrieved, and for each drug the data is averaged. This generates a report of virtual IC50 values for each drug. In contrast, in [2] artificial neural networks were used to predict lopinavir resistance from genotype.
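A scoring rule of the kind used in [10] can be sketched as a mutation count against a cut-off. The mutation set below is a placeholder for illustration, not Schinazi et al's actual list; the cut-off of five echoes the observation above that five or more mutations implied resistance to all protease inhibitors.

```python
# Placeholder set of resistance-associated protease mutations.
RESISTANCE_ASSOCIATED = {"L10I", "M46I", "G48V", "V82A", "I84V", "L90M"}

def predict_phenotype(genotype, cutoff=5):
    """Classify as resistant when the genotype carries at least
    `cutoff` resistance-associated mutations."""
    score = len(RESISTANCE_ASSOCIATED & set(genotype))
    return "resistant" if score >= cutoff else "susceptible"

print(predict_phenotype(["L10I", "M46I"]))  # susceptible (only 2 mutations)
print(predict_phenotype(["L10I", "M46I", "G48V", "V82A", "I84V", "L90M"]))  # resistant
```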
In brief, a neural network is a two-stage regression or classification model that can represent real-valued, discrete-valued and vector-valued functions. Using a set of examples, algorithms such as backpropagation tune the parameters of a fixed network of units and interconnections. For example, backpropagation employs gradient descent to attempt to minimise the error between the network outputs and the target values. In [2], two neural network models were developed. The first was based on changes at only 11 amino acid positions in the protease, as described in the literature, and the second was based on 28 amino acid positions resulting from category prevalence analysis. A set of 1322 clinical samples was utilised to train, validate and test the models. Results were expressed in terms of the correlation coefficient R^2. In simple terms, the correlation coefficient indicates the extent to which the predicted and true values lie on a straight line. It was found that the 28-mutation model was more accurate than the 11-mutation model at predicting lopinavir drug resistance (R^2 = 0.88 against R^2 = 0.84). Alternatively, decision tree classifiers were generated by means of recursive partitioning in [1], and these models were used to identify genotypic patterns characteristic of resistance to 14 antiretroviral drugs. In brief, recursive partitioning describes the iterative technique used to construct a decision tree. Recursive partitioning algorithms begin by splitting a population of data into subpopulations by determining the attribute that best splits the initial population, and continue by repeatedly identifying the attributes that best split each resulting subpopulation. Once a subpopulation contains individuals of the same type, no more splits are made and a classification is assigned to that population. An unknown case is then classified by sorting it down the tree from root to a leaf, using the attributes as specific tests. Here, the initial population was a dataset of matched phenotype-genotype pairs consisting of 471 clinical samples. For each drug in the study, a separate decision tree classifier was constructed to predict phenotypic resistance from genotype using an implementation of recursive partitioning. These models were then assessed using leave-one-out experiments, and prediction errors were found in the range of 9.6-32%.
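The "best split" selection at the core of recursive partitioning can be illustrated with an entropy-based information gain over binary mutation indicators. The four-sample dataset is a toy, and the actual splitting criterion used in [1] may differ.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy from splitting on a 0/1 attribute."""
    split = {0: [], 1: []}
    for row, label in zip(rows, labels):
        split[row[attribute]].append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in split.values() if part)
    return entropy(labels) - remainder

# Toy dataset: columns are mutation indicators, labels are phenotypes.
rows = [
    {"M184V": 1, "K20R": 0},
    {"M184V": 1, "K20R": 1},
    {"M184V": 0, "K20R": 1},
    {"M184V": 0, "K20R": 0},
]
labels = ["resistant", "resistant", "susceptible", "susceptible"]

best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best)  # M184V: it perfectly separates the toy labels
```

The tree-building algorithm applies this selection recursively to each resulting subpopulation until every subpopulation is pure.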

Chapter 2 Machine Learning 2.1 Overview & The Phenotype Prediction Problem In general, the field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. Typically, we have a classification, either quantitative or categorical, that we wish to predict based on a set of input characteristics. We have a training set consisting of matched classifications with characteristics. Then using this dataset we strive to build a classification model that will enable us to predict the appropriate outcome for unseen cases. Consider the following definition of a well-posed learning problem, according to Mitchell [12]: Definition: A computer program is said to learn form experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. The first step in constructing such a program is to consider the type of training experience from which the system will learn. This is important because the type and availability of training material has a significant impact on the success or failure of the learner. An important consideration (especially applicable to clinical data) is how well the training material represents the entire population. In general, machine learning is most effective when the training material is distributed similarly to that of the remaining population. However, in a clinical setting, this is almost never entirely the case. When developing systems that learn 11
from clinical data it is often necessary to learn from a distribution of examples that may be fairly different from those in the remaining population. Such situations are challenging because success over one distribution will not necessarily lead to strong performance over some other distribution, since most machine learning theory rests on the assumption that the distribution of training material is exactly representative of the entire population. However, although not ideal, this situation is often the case when developing practical machine learning systems. The next step is to determine the type of knowledge to be learned and how this will be used by the performance measure. In particular, we seek to define a target function F: C → O, where F accepts as input a set of characteristics C and produces as output some classification O that is true for C. The problem of improving performance P at tasks T is then reducible to finding a function F2 that performs better than a function F1 at tasks T, as measured by P. The final step in building a learning system is to choose an appropriate learning mechanism to derive the target function F. This includes determining the representation of the function F: whether it should be a decision tree, neural network, instance-based, concept-based, linear function, etc. Once a representation is decided upon, an appropriate algorithm must then be employed to generate such representations from the training experience E. Many machine learning problems can be specified using this framework. In particular, we can define a machine learning problem by specifying the task T, performance measure P, training experience E and target function F. In this way, the generalised phenotype prediction problem can be defined as follows:

Task T: The correct interpretation of genotype sequence information with respect to drug resistance.
Performance Measure P: The percentage of sequences correctly classified as resistant or susceptible to a particular drug.

Training Experience E: A dataset of matched phenotype-genotype pairs.

Target Function: Resistant: Genotype → Drug_Susceptibility_Classification, where Resistant accepts as input the result of a genotype resistance test, Genotype, and outputs a drug susceptibility classification, Drug_Susceptibility_Classification, indicating the susceptibility of a genotype to a particular antiretroviral drug.

2.2 The Training Experience, E.

The first step towards addressing the phenotype prediction problem is to acquire a suitable dataset of matched phenotype-genotype pairs. In [1] 471 clinical samples from 397 patients were analysed, first hand, to provide both phenotypic and genotypic information for six nucleoside inhibitors of the reverse transcriptase, zidovudine (ZDV), zalcitabine (ddC), didanosine (ddI), stavudine (d4T), lamivudine (3TC) and abacavir (ABC); three non-nucleoside inhibitors of the reverse transcriptase, nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV); and five protease inhibitors, saquinavir (SQV), indinavir (IDV), ritonavir (RTV), nelfinavir (NFV) and amprenavir (APV). This resulted in 443-469 phenotype-genotype pairs for each drug, except for APV. In detail, genotype results were obtained through direct sequencing of the patients' HIV, covering the complete protease and the first 650-750 nucleotides of the reverse transcriptase. These sequences were then aligned to the reference strain HXB2 to identify differences. Each genotype was then modelled in a computationally useable manner using one attribute for each amino acid position, allowing as a value a single-letter amino acid code or unknown for positions in which ambiguous or no sequence information was available, see figure 2.1.

Fig. 2.1 Modelling genotype. Each sequence position is represented by either the amino acid present at that position or unknown if no sequence information is available.
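The per-position encoding in figure 2.1 can be sketched as follows. This assumes the aligned sequencing result marks ambiguous or missing positions with a placeholder character such as 'X' or '-' (an assumed convention, not specified in [1]):

```python
def genotype_attributes(aligned_seq, missing=('X', '-')):
    """One attribute per sequence position: the single-letter amino-acid code,
    or 'unknown' where the alignment gave ambiguous or no information."""
    return ['unknown' if aa in missing else aa for aa in aligned_seq]

print(genotype_attributes('PQIT-LW'))
# ['P', 'Q', 'I', 'T', 'unknown', 'L', 'W']
```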

Phenotyping was performed using a recombinant virus assay. Recombinant viruses were cultivated in the presence of increasing amounts of antiretroviral drugs and fold-change values were calculated by dividing the IC50 value of the relevant recombinant virus by the IC50 value of a reference strain (NL4-3). Fold-change values were distributed as in figure 2.2.

Fig. 2.2 Frequency distribution of fold-change values for a subset of 271 samples for which data was available for all 14 drugs.

Using the fold-change values, the dataset of genotypes was grouped into two classes for each antiretroviral drug. In particular, a genotype was labelled as either resistant or susceptible, using a drug-specific fold-change threshold, see figure 2.3. These thresholds were obtained from previously published work: 8.5 for ZDV, 3TC, NVP, DLV, EFV; 2.5 for ddC, ddI, d4T and ABC; and 3.5 for SQV, IDV, RTV, NFV, APV.

Drug   No. of pheno-geno pairs   Percentage classified resistant
ZDV    456                       58.1
ddC    456                       43.0
ddI    456                       49.1
d4T    456                       38.6
3TC    452                       54.4
ABC    445                       66.3
NVP    457                       45.1
DLV    455                       36.5
EFV    443                       35.9
SQV    465                       46.7
IDV    469                       48.8
RTV    469                       50.1
NFV    468                       53.6
APV    277                       32.9

Fig. 2.3 Characteristics of the dataset.
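The labelling rule can be sketched as below, using the drug-specific thresholds quoted above. Whether a fold-change exactly equal to the threshold counts as resistant is not specified, so the inclusive cut-off here is an assumption:

```python
# Drug-specific fold-change thresholds from previously published work
THRESHOLDS = {
    **dict.fromkeys(['ZDV', '3TC', 'NVP', 'DLV', 'EFV'], 8.5),
    **dict.fromkeys(['ddC', 'ddI', 'd4T', 'ABC'], 2.5),
    **dict.fromkeys(['SQV', 'IDV', 'RTV', 'NFV', 'APV'], 3.5),
}

def label(drug, ic50_sample, ic50_reference):
    """Classify a sample: fold-change is IC50(sample) / IC50(reference strain)."""
    fold_change = ic50_sample / ic50_reference
    return 'resistant' if fold_change >= THRESHOLDS[drug] else 'susceptible'

print(label('SQV', 100.0, 10.0))  # fold-change 10.0 -> 'resistant'
print(label('ABC', 2.0, 1.0))     # fold-change 2.0 -> 'susceptible'
```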

In contrast to the more direct approach presented above, one may obtain both phenotypic and genotypic information via online databases. In recent years, databases consisting of thousands of matched phenotype-genotype pairs have emerged in response to a concern about the lack of publicly available HIV reverse transcriptase and protease sequences. One such database, the Stanford HIV-1 drug resistance database, available at http://hivdb.stanford.edu/, strives towards linking HIV-1 reverse transcriptase and protease sequence data, drug treatment histories and drug susceptibilities to allow researchers to analyse the extent of clinical cross-resistance among current and experimental antiretroviral drugs. Other online databases that provide access to HIV reverse transcriptase and protease sequences with drug-susceptibility data include:

http://hivinsite.ucsf.edu/
http://www.resistanceweb.com/
http://jama.ama-assn.org
http://home.ncifcrf.gov/hivdrp

2.3 Learning Mechanisms.

The final step in addressing the phenotype prediction problem is to choose an appropriate learning mechanism. This includes choosing an appropriate representation of the function Resistant: Genotype → Drug_Susceptibility_Classification, and an algorithm to automatically derive such a representation from the training experience. There are many choices available; decision trees, artificial neural networks and instance-based learning are just a few.

2.3.1 Decision Trees.

In [1] decision trees were used to represent the target function Resistant: Genotype → Drug_Susceptibility_Classification for 14 antiretroviral drugs. Specifically, decision trees were generated from a set of phenotype-genotype pairs (described previously in section 2.2) using the C4.5 software package and a statistical measure
indicating the amount of information that a particular sequence position provides about differentiating resistant from susceptible samples. A decision tree is an acyclic graph with interior vertices symbolizing tests to be carried out on a characteristic or attribute and leaves indicating classifications. Decision trees classify unseen cases by sorting them through the tree from the root to some leaf, which provides a classification. In other words, each node in the graph symbolizes a test of some attribute, and each edge departing from that node corresponds to one of the possible values for the attribute. An unseen case is then classified by starting at the root node of the graph, testing the attribute specified by this node and traversing the edge corresponding to the value of the attribute in the given case. This process is repeated for the sub-graph rooted at the new node, until a leaf node is encountered, at which point a classification is assigned to the case. Specifically, in [1] the classification of a genotype was achieved by traversing a decision tree, associated with an antiretroviral drug, from the root to a leaf according to the values of amino acids at specific sequence positions, see figure 2.4.

Fig. 2.4 Decision tree for the protease inhibitor SQV as presented in [1]. Nodes represent tests at specific amino acid positions (here 90, 48, 54, 84 and 72), edges represent amino acid values and leaves represent drug-susceptibility classifications.

In more general terms, a decision tree represents a disjunction of conjunctions of constraints on the attribute values. Each path from the root of the graph to a leaf translates to a conjunction of attribute tests, and the graph itself translates to a disjunction of these conjunctions. For example, the decision tree in figure 2.4 can be translated into the expression

(90 = F) \/ (90 = M) \/ (90 = L /\ 48 = V) \/ (90 = L /\ 48 = G /\ 54 = L) \/ (90 = L /\ 48 = G /\ 54 = I /\ 84 = V) \/ (90 = L /\ 48 = G /\ 54 = V /\ 72 = I)

representing the knowledge that the decision tree uses to determine resistance. Most algorithms that have been developed for generating decision trees offer slight variations on a core methodology that uses a greedy search through a space of possible decision trees. In this way we can think of decision tree learning as involving the search of a very large space of possible decision tree classifiers to determine the one that best fits the training data. Both the ID3 and C4.5 algorithms typify this approach. The basic decision tree-learning algorithm, ID3, performs a greedy search for a decision tree that fits the training data. In summary, it begins by asking: which attribute should be tested at the root of the tree? Each possible attribute is evaluated using a statistical test to determine how well it alone classifies the complete set of training data. A descendant of the root is created for each possible value of the chosen attribute, and the training data is sorted to the appropriate descendant nodes. The procedure is then repeated for each of the descendant nodes until all the training data is correctly classified. The ID3 algorithm is given in table 2.1. The central choice in the ID3 algorithm is which attribute to test at each node.
In order to select an appropriate attribute we can employ a statistical measure to quantify how well an attribute separates a set of data according to a finite alphabet of possible outcomes. A popular measure is called information gain. The information gain of an attribute A, relative to a dataset S, is defined as

ig(S, A) = entropy(S) − Σ_{v ∈ values(A)} (|Sv| / |S|) entropy(Sv)
ID3(Examples, Target_attribute, Attributes)
Examples are the training examples. Target_attribute is the attribute whose value is to be predicted by the tree. Attributes is a list of other attributes that may be tested by the learned decision tree. Returns a decision tree that correctly classifies the given Examples.
- Create a Root node for the tree.
- If all Examples are positive, return the single-node tree Root, with label = +.
- If all Examples are negative, return the single-node tree Root, with label = −.
- If Attributes is empty, return the single-node tree Root, with label = the most common value of Target_attribute in Examples.
- Otherwise begin:
  o A ← the attribute from Attributes that best classifies Examples.
  o The decision attribute for Root ← A.
  o For each possible value, vi, of A:
    - Add a new tree branch below Root, corresponding to the test A = vi.
    - Let Examples_vi be the subset of Examples that have value vi for A.
    - If Examples_vi is empty, then below this new branch add a leaf node with label = the most common value of Target_attribute in Examples.
    - Else below this new branch add the subtree ID3(Examples_vi, Target_attribute, Attributes − {A}).
- Return Root.

Table 2.1 The ID3 algorithm, taken from Mitchell's book [12].

where values(A) is the set of all possible values for attribute A, Sv is the subset of S for which attribute A has value v and entropy(S) measures the impurity of the set S, such that the entropy is 0 if all the members of S belong to the same class and, conversely, the entropy is 1 if the members of S are equally distributed between two classes, see figure 2.5. Specifically, if a set S contains a number of examples belonging to either a positive or negative class, the entropy of S relative to these two classes is defined as:

entropy(S) = −p(+) log2 p(+) − p(−) log2 p(−)

where p(+) is the proportion of examples in S belonging to the positive class and p(−) is the proportion of examples in S belonging to the negative class.
In this respect, ig(S, A) measures the expected reduction in the impurity of a set of examples S when partitioned according to the attribute A. In other words, we wish to select attributes with high values for information gain.
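The entropy and information-gain measures above translate directly into code. A minimal sketch, with each example represented as an attribute-to-value dictionary (the attribute name pos90 is illustrative):

```python
from math import log2

def entropy(labels):
    """Impurity of a set of class labels: 0 when pure, 1 for a 50/50 binary split."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(examples, labels, attribute):
    """Expected reduction in entropy from splitting the examples on `attribute`."""
    gain = entropy(labels)
    for v in set(e[attribute] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attribute] == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

examples = [{'pos90': 'M'}, {'pos90': 'M'}, {'pos90': 'L'}, {'pos90': 'L'}]
labels = ['R', 'R', 'S', 'S']
print(information_gain(examples, labels, 'pos90'))  # 1.0: a perfect split
```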

Fig. 2.5 The entropy function relative to a boolean classification, plotted against p(+).

A major drawback of the ID3 algorithm is that it continues to grow a decision tree until all the training examples are perfectly classified. This approach, although reasonable, can lead to difficulties when either there is erroneous data in the training set or the distribution of the training examples is not representative of the entire population. In these cases, ID3 produces decision trees that overfit the training examples. A decision tree is said to overfit the training examples if there exists some other decision tree that classifies the training examples less accurately but nevertheless performs better over the entire population. Figure 2.6 illustrates the impact of overfitting.

Fig. 2.6 Overfitting in decision tree learning. When ID3 creates a new node, the accuracy of the tree measured on the training examples increases. However, when measured on the validation examples, the accuracy of the tree decreases as its size increases.

Overfitting is a significant problem for decision tree learning and indeed machine learning as a whole. Specifically, overfitting has been found to decrease the accuracy of learned decision trees by 10-25% [13]. However, there are a number of techniques available that minimise the effects of overfitting in decision tree learning. Typically, we begin by randomly partitioning the training data into two subsets, one for training and one for validation. Using the training set we grow an overfitted decision tree and then post-prune it using the validation set. Post-pruning, in this respect, has the effect that any nodes added due to coincidental regularities in the training set are likely to be removed, because the same regularities are unlikely to occur in the validation set. Reduced error pruning is one post-pruning strategy that uses a validation set to minimise the effects of overfitting. Here each node in the decision tree is considered as a candidate for pruning. The removal of a node is determined by how well the reduced tree classifies the examples in the validation set. In particular, if the reduced tree performs no worse than the original over the validation set then the node is removed, making it a leaf and assigning it the most common classification of the training examples associated with that node. Figure 2.7 illustrates the impact of reduced error pruning.

Fig. 2.7 The impact of reduced-error pruning.

The C4.5 algorithm is a successor of the basic ID3 algorithm. C4.5 behaves in the same way as ID3 but addresses a number of additional issues. In particular, ID3 was criticised for not offering support for continuous attributes and for not handling training data with missing attribute values.

Where the basic ID3 algorithm is restricted to attributes that take on a discrete set of values, C4.5 converts continuous values into a set of discrete values by separating the data into a number of bins and attaching a label to each bin. Considering the dataset presented in section 2.2, the ability to handle continuous values isn't required. In particular, each genotype is modelled using a number of attributes (amino acid positions) that have discrete values (single-letter amino acid codes). C4.5 handles attributes with missing values by assigning a probability to each of the possible values of an attribute, based on the observed frequencies of the various values. These fractional proportions are then used both to grow the tree and to classify unseen cases with missing attribute values. Again, considering the dataset presented in section 2.2, the ability to handle attributes with missing values isn't a problem. In particular, the dataset presented in section 2.2 explicitly models missing attribute values using a value of unknown. Once continuous values and missing attribute values are handled, the C4.5 algorithm uses a greedy search (similar to ID3) to find a decision tree that exactly conforms to the training data and uses a post-pruning strategy, called rule post-pruning, to minimise the effects of overfitting. In particular, rule post-pruning involves: inferring an overfitted decision tree from the training data, converting the decision tree into a set of classification rules (conjunctions of constraints on attribute values) and removing any constraints on attribute values whose removal improves a rule's estimated accuracy.

2.3.2 Artificial Neural Networks

In [2] two artificial neural networks were independently used to represent the target function Resistant: Genotype → Drug_Susceptibility_Classification for the protease inhibitor lopinavir.
Specifically, two artificial neural networks were generated from a set of 1322 phenotype-genotype pairs using the backpropagation algorithm to learn the parameters of a single-hidden-layer network predicting the susceptibility of lopinavir from genotype. The first network was based on mutations at 11 amino acid positions previously recognised as being associated with drug resistance, and the second was based on mutations at 28 amino acid positions identified through statistical analysis. The study of artificial neural networks was initially inspired by the observation that biological learning systems are built from very complex webs of simple computational units
called neurons. Analogously, artificial neural networks are built from densely interconnected sets of units called sigmoid perceptrons, see figure 2.8.

Fig. 2.8 A sigmoid perceptron. Inputs v1, v2, ..., vn are weighted by w1, w2, ..., wn, combined with the bias weight w0 and passed through the function 1 / (1 + e^-inputs).

A sigmoid perceptron is a simple computational unit that takes as input a vector of numerical values and outputs a continuous function of its inputs, called the sigmoid function. Specifically, given a vector of inputs {v1, v2, ..., vn} a linear combination of these inputs is calculated as w0 + w1v1 + w2v2 + ... + wnvn. Each wi is a numerical constant that weights the contribution of an input vi. The output of the sigmoid perceptron is then obtained using the sigmoid function:

1 / (1 + e^-inputs)

where inputs is the result of w0 + w1v1 + w2v2 + ... + wnvn. The backpropagation algorithm, given in table 2.2, learns appropriate values for each weight wi in a multilayer network with a fixed number of sigmoid perceptrons and interconnections, in order to give a correct output (classification) for a set of inputs (characteristics). Figure 2.9 illustrates a single-hidden-layer network. Backpropagation uses a training set of matched inputs and outputs and employs gradient descent to minimise the error between the network outputs and the actual outputs of the training set.
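The unit in figure 2.8 is only a few lines of code. A minimal sketch, with weights[0] playing the role of the bias weight w0:

```python
from math import exp

def sigmoid_perceptron(values, weights):
    """Weighted sum w0 + w1*v1 + ... + wn*vn squashed through 1 / (1 + e^-x)."""
    s = weights[0] + sum(w * v for w, v in zip(weights[1:], values))
    return 1.0 / (1.0 + exp(-s))

# With all-zero weights the weighted sum is 0, so the output is exactly 0.5
print(sigmoid_perceptron([1.0, 2.0], [0.0, 0.0, 0.0]))  # 0.5
```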

Fig. 2.9 A single-hidden-layer network.

To learn the appropriate weights of a network consisting of a single sigmoid perceptron, gradient descent uses a training set of matched input-output pairs of the form ({v1, v2, ..., vn}, t), where {v1, v2, ..., vn} is a vector of input values and t is the target output. Gradient descent begins by choosing small arbitrary values for the weights. Weights are then updated for each training example that is misclassified until all the training examples are correctly classified. A learning rate η is used to determine the extent to which a weight is updated. Specifically, each training example is classified by the perceptron to obtain an output o and each weight is updated by the rule wi ← wi + η(t − o)vi. This process is then repeated until the perceptron makes no classification errors on the training data. In considering networks with multiple sigmoid units and multiple outputs, we again begin by choosing small arbitrary values for the weights in the network (typically between −0.05 and 0.05) and update them according to the weight update rule wij ← wij + η δj xij, where wij denotes the weight from unit i to unit j, xij denotes the input from unit i into unit j and δ is a term representing the misclassification error for each unit in the network. For an output unit k the term δk is computed as okv(1 − okv)(tkv − okv), where okv is the output value associated with the kth output unit and training example v. For a hidden unit h the term δh is computed as ohv(1 − ohv) Σ_{k ∈ outputs} wkh δk, where ohv is the output value associated with the hidden unit h and training example v.

BACKPROPAGATION(training_examples, η, nin, nout, nhidden)
- Create a feed-forward network with nin inputs, nhidden hidden units and nout output units.
- Initialise all network weights to small random numbers.
- Until the termination condition is met, do:
  o For each <x, t> in training_examples, do:
    - Propagate the input forward through the network: input the instance x to the network and compute the output ou of every unit u in the network.
    - Propagate the errors backward through the network:
      - For each network output unit k, calculate its error term δk ← ok(1 − ok)(tk − ok).
      - For each hidden unit h, calculate its error term δh ← oh(1 − oh) Σ_{k ∈ outputs} wkh δk.
      - Update each network weight: wij ← wij + η δj xij.

Table 2.2 The backpropagation algorithm, taken from Mitchell's book [12].

Backpropagation continues to update the weights of a multilayer network, in this fashion, until all the training examples are correctly classified or the error on the training examples falls below some threshold. In the context of the phenotype prediction problem, the target function Resistant: Genotype → Drug_Susceptibility_Classification can be represented using an artificial neural network derived from a set of training patterns ({v1, v2, ..., vn}, t) and the backpropagation algorithm. Here {v1, v2, ..., vn} represents the result of a genotype resistance test and t is the corresponding target phenotype. An unseen genotype {v1, v2, ..., vn} is then classified using the network by propagating the values v1, v2, ..., vn through it. In other words, given the input values v1, v2, ..., vn, compute the output of every unit in the network. The final output of the network is then an estimate of phenotype.
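A single training-example update for a one-hidden-layer, single-output network can be sketched as below, following the forward pass, error terms and weight-update rule of table 2.2 (the list-based weight layout is illustrative, not from [2]):

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def backprop_step(x, t, w_hidden, w_out, eta=0.1):
    """One stochastic-gradient update. w_hidden[j] = [bias, weights...] for
    hidden unit j; w_out = [bias, weights...] for the single sigmoid output."""
    # Forward pass: hidden activations, then the network output
    h = [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    o = sigmoid(w_out[0] + sum(wi * hi for wi, hi in zip(w_out[1:], h)))
    # Backward pass: error terms as in the text
    delta_o = o * (1 - o) * (t - o)
    delta_h = [hj * (1 - hj) * w_out[j + 1] * delta_o for j, hj in enumerate(h)]
    # Weight updates: w <- w + eta * delta * input
    w_out[0] += eta * delta_o
    for j, hj in enumerate(h):
        w_out[j + 1] += eta * delta_o * hj
    for j, w in enumerate(w_hidden):
        w[0] += eta * delta_h[j]
        for i, xi in enumerate(x):
            w[i + 1] += eta * delta_h[j] * xi
    return o

# Repeated updates on one example drive the output towards its target
w_hidden, w_out = [[0.1, 0.1]], [0.1, 0.1]
first = backprop_step([1.0], 1.0, w_hidden, w_out)
for _ in range(500):
    out = backprop_step([1.0], 1.0, w_hidden, w_out)
print(first < out < 1.0)  # True
```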

2.3.3 Instance-Based Learning

Nearest neighbour learning contrasts with learning decision trees and artificial neural networks in that nearest neighbour learning simply stores the training examples and doesn't attempt to extract an explicit description of the target function. In this way, generalisation beyond the training examples is postponed until a new case must be classified. Specifically, when a new case is presented to a nearest neighbour classifier, a set of similar cases is retrieved from the training set (using a similarity measure) and used to classify the new case. In this way, a nearest neighbour classifier constructs a different approximation to the target function for each query, based on a collection of local approximations. With regard to the phenotype prediction problem, this is a significant advantage, as the target function Resistant: Genotype → Drug_Susceptibility_Classification may be very complex, due to the fact that each mutation may be part of a global interaction exhibiting large interdependence. However, nearest neighbour classifiers typically consider all the attributes of each case when making predictions, and if the target function actually depends only on a small subset of these attributes, then cases that are truly most similar may well be deemed to be unrelated. In the context of the phenotype prediction problem this is problematic and extra care should be given to the design of a similarity measure. Another limitation of nearest neighbour methods is that nearly all computation is deferred until a new classification is required, so the total computational cost of classifying a new case can be high. The k-nearest neighbour methodology is the most basic nearest neighbour algorithm available. It assumes that all n-attribute cases correspond to locations in an n-dimensional space, called the feature space. The nearest neighbours of an unseen case are then retrieved using the standard Euclidean distance.
Specifically, each case is described using a vector of features and the distance between two cases ci and cj is defined to be

d(ci, cj) = √( Σ_{r=1}^{n} (ar(ci) − ar(cj))² )

where ar(c) denotes the value of the rth feature of case c. Using such a measure, k-nearest neighbour approximates the classification of an unseen case, cq, by retrieving the k cases c1, c2, ..., ck that are closest to cq and assigning it the most common classification associated with
the cases c1, c2, ..., ck. For example, if k = 1 then 1-nearest neighbour assigns to cq the classification associated with ci, where ci is the training case closest to cq. Figure 2.10 illustrates the operation of the k-nearest neighbour algorithm.

Fig. 2.10 K-nearest neighbour. Here each case is represented using a 2-dimensional feature vector and cases are classified as either positive or negative. 1-nearest neighbour classifies cq as positive, whereas 5-nearest neighbour classifies cq as negative.

In terms of the phenotype prediction problem, k-nearest neighbour would approximate the phenotype of an unseen genotype gq by retrieving the k genotypes g1, g2, ..., gk that are closest to gq and assigning it a classification of drug susceptibility using the phenotypes associated with the genotypes g1, g2, ..., gk, with the similarity or distance between two genotypes being determined using an appropriate distance measure. One possibility for this distance measure is to simply apply the Euclidean distance as described above. Here we model each genotype as a vector of single-letter amino acid codes. In this way, every sequence position is considered when determining the distance between two genotypes. As mentioned, this may be problematic because it treats each sequence position as being equally important and neglects to highlight the importance of amino-acid changes at specific sequence positions. Commercially, the Virco-Tibotec company employs a k-nearest neighbour approach to predict drug susceptibility from genotype, see figure 2.11. Here the distance between two genotypes is based on the comparison of their profiles, where a profile can be thought of as a feature vector containing all the mutations present in a genotype previously associated with drug resistance. Here we do not consider every sequence position in the distance measure, as above, but rather only a subset of sequence positions.
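The k-nearest-neighbour scheme over genotypes can be sketched as follows. Since amino-acid attributes are categorical, each squared Euclidean term reduces to a 0/1 mismatch, so the position-mismatch count below orders neighbours identically to the Euclidean distance; the genotypes and labels are illustrative, and this is a simplification, not the measure used by Virco:

```python
from collections import Counter

def mismatches(g1, g2):
    """Number of aligned positions at which two genotypes differ."""
    return sum(a != b for a, b in zip(g1, g2))

def knn_classify(query, training, k=3, dist=mismatches):
    """training: list of (genotype, phenotype_label) pairs. Returns the most
    common label among the k genotypes closest to the query."""
    nearest = sorted(training, key=lambda pair: dist(query, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [('LIV', 'R'), ('LIA', 'R'), ('MTA', 'S'), ('MTV', 'S')]
print(knn_classify('LTA', training, k=3))  # 'R'
```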
However, this method relies on
previous knowledge of mutations associated with drug resistance and fails to consider amino acid changes beyond this set.

Fig. 2.11 How Virco generates a VirtualPhenotype.

An alternative measure that doesn't presume any previous knowledge of drug-resistant mutations constructs a feature vector using a reference strain. Specifically, a feature vector for a genotype is constructed by comparing its complete sequence with a reference sequence. For positions in which there is no change in amino acid the feature vector is augmented to contain a dummy value, β, and for other positions the feature vector is augmented to contain the amino acid present in the genotype sequence. In other words, the feature vector of a genotype represents a pattern of deviations from a reference sequence. We then compute a similarity score based on how well two feature vectors conform. For positions in which both vectors contain non-dummy values we compute the percentage of these that are different. An additional factor is also included that represents the percentage of the remaining positions that are different. Figure 2.12 illustrates this approach. Another possible similarity measure that again doesn't presume any previous knowledge of drug-resistance mutations is derived through the comparison of two dot-plots. A dot plot is a visual representation of the similarities between two sequences. Each axis of a rectangular array represents one of the two sequences to be compared and at every point in the array where the two sequences are identical a dot is placed (i.e. at the intersection of every row and
column that have the same amino acid in both sequences). A diagonal stretch of dots indicates regions where the two sequences are similar. Using such a representation of sequence similarity, we can construct a dot plot for each genotype in the training set in relation to a reference sequence. Similarly, a dot-plot can be constructed for a query genotype using the same reference sequence. The distance between two genotypes could then be estimated by comparing the similarity of two dot-plots. Figure 2.13 illustrates this approach.

distance(g1, g2) = mutagreement(g1, g2) + diffs(g1, g2), where
mutagreement(g1, g2) = no. of shared mutated positions at which the amino acids differ / total no. of shared mutated positions, and
diffs(g1, g2) = no. of unshared mutations / length of sequence.

Fig. 2.12 Comparing two feature vectors. Each genotype is represented as a pattern of deviations from a reference strain, with the dummy value β at positions that match the reference.

Fig. 2.13 Comparing dot-plots. (i) A dot plot obtained by comparing the reference sequence, A, with itself. (ii) A dot plot obtained by comparing the reference sequence with a sequence, B, with a small subset of mutations. (iii) A dot plot obtained by comparing the reference sequence with a sequence, C, with a large number of mutations. The similarity of B and C is determined by how well (ii) and (iii) conform.
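The feature-vector comparison of figure 2.12 can be sketched as below, with the dummy value β represented by None; the handling of the edge case where two genotypes share no mutated positions is an assumed convention:

```python
BETA = None  # dummy value for positions that agree with the reference

def feature_vector(genotype, reference):
    """Pattern of deviations from the reference: the genotype's amino acid
    where it differs from the reference, BETA where it matches."""
    return [g if g != r else BETA for g, r in zip(genotype, reference)]

def distance(f1, f2):
    """distance(g1, g2) = mutagreement(g1, g2) + diffs(g1, g2), as in figure 2.12."""
    shared = [(a, b) for a, b in zip(f1, f2) if a is not BETA and b is not BETA]
    unshared = sum((a is BETA) != (b is BETA) for a, b in zip(f1, f2))
    mutagreement = sum(a != b for a, b in shared) / len(shared) if shared else 0.0
    return mutagreement + unshared / len(f1)

ref = 'MKLVNT'
f1 = feature_vector('MALVNT', ref)   # mutation at position 2
f2 = feature_vector('MALVNI', ref)   # mutations at positions 2 and 6
print(distance(f1, f2))  # 0.0 shared disagreement + 1/6 unshared mutations
```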

Chapter 3 Materials & Methods

3.1 Data Set

The complete HIV-1 reverse transcriptase and protease drug susceptibility data sets from the Stanford HIV-1 drug resistance database were utilised to determine viral genotype and drug susceptibility to five nucleoside inhibitors of the reverse transcriptase, zalcitabine (ddC), didanosine (ddI), stavudine (d4T), lamivudine (3TC) and abacavir (ABC); three non-nucleoside reverse transcriptase inhibitors, nevirapine (NVP), delavirdine (DLV) and efavirenz (EFV); and six protease inhibitors, saquinavir (SQV), lopinavir (LPV), indinavir (IDV), ritonavir (RTV), nelfinavir (NFV) and amprenavir (APV). Using a simple text parser (implementation described in Appendix A), I obtained 381-855 phenotype-genotype pairs for each of these drugs, see table 3.2. In addition, I would have liked to reuse the dataset of phenotype-genotype pairs used in [1], but although the results of genotyping were deposited in GenBank (accession numbers AF347117 to AF347605), no drug susceptibility information was attached to them. Furthermore, the Stanford dataset contained no drug-susceptibility information for the reverse transcriptase inhibitor zidovudine, which was also included in [1]. The Stanford HIV-1 reverse transcriptase and protease drug susceptibility data sets are available at http://hivdb.stanford.edu/cgi-bin/genophenods.cgi and can be downloaded in a plain text format, see figure 3.1.

SeqID  SubType  Method     SQVFold  SQVFoldMatch  P1  P2  P3  MutList
7439   B        Virologic  47.4     =             -   -   -   10I, 24FL, 37D, 46I, 53L, 60E, 63P, 71IV, 73S, 77I, 90M, 93L
7443   B        Virologic  574.2    =             -   -   -   10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A
7459   B        Virco      15.0     =             -   -   -   10I, 19Q, 35D, 48V, 63P, 69Y, 71T, 90M, 93L
...
45121  B        Virologic  na       na            -   -   -   10I, 13V, 41K
45122  B        Virologic  na       na            -   -   -   50L
...
7430   B        Virologic  121.7    =             -   -   -   10I, 15V, 20M, 35D, 36I, 54V, 57K, 62V, 63P, 71V, 73S

Fig. 3.1 The HIV-1 protease drug-susceptibility dataset obtained from the Stanford HIV resistance database.

For each sample presented in the Stanford dataset we are given a fold-change value (phenotype), for each drug, and a list of mutations identified in comparison to a reference strain (genotype). See Appendix C for a list of the sequences included in the protease and reverse transcriptase datasets. Using this information, I was able to model each sample in a similar way to that presented in section 2.2. Specifically, I modelled each protease sample using one attribute for each of its 99 amino acids, allowing as a value one of the 20 naturally occurring amino acids. Similarly, each reverse transcriptase sample was modelled using 440 attributes representing the first 440 amino acids of the reverse transcriptase. To acquire the values for each attribute, the list of mutations was used in conjunction with the reference strain HXB2. In particular, for each sample, the original sequence was reconstructed by substituting each mutation into the protease or reverse transcriptase genes of HXB2 (GenBank accession number K03455). In this way, each sequence that was constructed contained no gaps or areas with loss of sequence information, and so an attribute value of unknown was not required. Figure 3.2 illustrates this process.
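The reconstruction step can be sketched as follows. The helper name and the ten-residue reference fragment in the usage note are illustrative only, and mixtures (e.g. 24FL, two amino acids observed at one position) are resolved here by keeping the first listed amino acid, which is a simplifying assumption rather than the method described in Appendix A:

```python
def reconstruct_genotype(reference, mutations):
    """Substitute each mutation from a Stanford-style list (e.g. '48V',
    or '24FL' for a mixture) into a copy of the reference amino acid
    sequence.  Positions are 1-based, as in the mutation lists."""
    sequence = list(reference)
    for mutation in mutations:
        position = int("".join(ch for ch in mutation if ch.isdigit()))
        aminos = "".join(ch for ch in mutation if ch.isalpha())
        sequence[position - 1] = aminos[0]  # keep first amino acid of a mixture
    return "".join(sequence)
```

For example, with a reference fragment `"PQITLWQRPL"`, the mutation list `["2A", "10I"]` yields `"PAITLWQRPI"`; the reconstructed string then supplies one attribute value per position.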

SeqID  SubType  Method     SQVFold  SQVFoldMatch  P1  P2  P3  MutList
7443   B        Virologic  574.2    =             -   -   -   10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A
7459   B        Virco      15.0     =             -   -   -   10I, 19Q, 35D, 48V, 63P, 69Y, 71T, 90M, 93L

Fig. 3.2 Modelling the data. A list of mutations in the protease gene (e.g. 10I, 17E, 37D, 43T, 48V, 54V, 63P, 67S, 71V, 77I, 82A, with a fold-change value of 574.2 for saquinavir) is used in conjunction with the reference sequence HXB2 to obtain the values of each attribute (e.g. position 17 = E in the reconstructed genotype sequence).

Using the drug-specific fold-change values associated with each sample, the dataset of genotypes was grouped into two classes. Here, fold-change values were distributed as in table 3.1. In particular, a genotype was labelled as either resistant or susceptible in accordance with a drug-specific fold-change threshold. An important consideration here is how the choice of thresholds affects the distribution of resistant and susceptible samples: by varying the thresholds we include or exclude certain samples from either the resistant or susceptible class. However, at this time, the same thresholds were used as described in section 2.2. By grouping the dataset of genotypes using these thresholds I obtained the datasets described in table 3.2.
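The grouping step amounts to a comparison against a drug-specific cut-off. A minimal sketch follows; the function name is illustrative, and whether a fold-change exactly at the threshold counts as resistant (>= versus >) is an assumption, not something the text specifies:

```python
def label_genotypes(pairs, threshold):
    """Binary-label (mutations, fold_change) pairs: fold-change values
    at or above the drug-specific threshold become 'resistant', the
    rest 'susceptible'."""
    return [
        (mutations, "resistant" if fold >= threshold else "susceptible")
        for mutations, fold in pairs
    ]
```

Varying `threshold` per drug directly shifts samples between the two classes, which is the sensitivity discussed above.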

      0-1   2-3   4-7   8-15  16-31 32-63 64-127 128-255 256-511 512-1023 1024-2047 >2047
ddc   22%   57%   16%   2%    2%    0.8%  0.2%   0%      0%      0%       0%        0%
ddi   22%   61%   10%   4%    3%    0%    0%     0%      0%      0%       0%        0%
d4t   32%   49%   13%   3%    1%    0.2%  0%     0%      0%      0%       0%        0%
3TC   11%   18%   8%    5%    6%    16%   18%    16%     0%      0.1%     0.1%      0%
ABC   21%   30%   31%   13%   4%    1%    0%     0%      0%      0%       0%        0%
NVP   32%   27%   11%   3%    2%    6%    6%     4%      5%      4%       0%        0%
DLV   33%   26%   12%   5%    3%    5%    5%     4%      3%      0.5%     0%        0%
EFV   41%   23%   7%    4%    4%    5%    5%     4%      4%      2%       0.2%      0.4%
SQV   32%   26%   11%   7%    7%    8%    4%     2%      2%      1%       0%        0%
LPV   24%   21%   13%   13%   10%   10%   8%     1%      0%      0%       0%        0%
IDV   20%   28%   14%   14%   13%   7%    2%     0.7%    0.5%    0%       0%        0.1%
RTV   24%   24%   12%   9%    9%    9%    9%     4%      0%      0%       0%        0%
NFV   11%   22%   11%   12%   15%   15%   7%     3%      0.8%    2%       0.1%      0%
APV   34%   29%   19%   9%    5%    2%    0.7%   0.3%    0.3%    0%       0%        0%

Table 3.1 Frequency distribution of fold-change values. Resistance factors are grouped into equidistant bins.

Drug  No. of phenotype-genotype pairs  Percentage of examples classified resistant
ZDV   -                                -
ddc   646                              29.2%
ddi   673                              20.9%
d4t   631                              22.5%
3TC   749                              59.4%
ABC   582                              53.9%
NVP   706                              29.0%
DLV   601                              28.3%
EFV   557                              27.1%
SQV   830                              33.9%
LPV   381                              58.5%
IDV   822                              48.6%
RTV   773                              49.1%
NFV   855                              64.3%
APV   701                              33.3%

Table 3.2 Percentage of resistant examples.

3.2 Decision Trees

For each drug I derived a decision tree classifier to identify genotypic patterns characteristic of resistance; see Appendix B for a description of the implementation. Each dataset of phenotype-genotype pairs was partitioned into a training, validation and test set: 56% of the complete dataset was randomly selected for training, 14% was randomly selected for validation, and what remained formed the testing set. Genotypes were labelled as either resistant or susceptible according to a drug-specific threshold, as defined in [1]. Decision trees were generated using the ID3 algorithm, as described in section 2.3.1. A training set was recursively split according to attribute values (the amino acids observed at each position) until all the examples in the training set were perfectly classified. Attributes were selected based on maximal information gain, representing the amount of information that an amino acid position provides about differentiating resistant from susceptible genotypes. Trees were pruned using reduced-error pruning to minimise the effects of overfitting. This gave rise to confidence factors associated with certain leaves, estimated by the fraction of training examples incorrectly classified by the pruned tree. Classification of an unseen genotype (a list of mutations) was achieved by sorting it through a drug-specific decision tree from the root to a leaf according to the values (single-letter amino acid codes) of attributes (amino acid positions). If a genotype could not be completely sorted through the tree in this manner (i.e. it contained an attribute value not recognised by the tree), it was classified as unknown. In addition, the classification of a nucleotide sequence was achieved through a two-step process. In order to classify a nucleotide sequence, each codon in the sequence was translated into a single-letter amino acid code using the standard genetic code, see figure 3.3.
This produced a set of attribute values that could then be classified as described above. For each classification an explanation was generated: the path followed through the decision tree during classification was recorded and translated into natural language for easy readability, see figure 3.4 for an example.
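The information-gain criterion used for attribute selection can be made concrete with a short sketch. Here `examples` pairs each reconstructed sequence with its resistant/susceptible label; the function names are illustrative, not those of the Appendix B implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(examples, position):
    """Information gain from splitting on one amino acid position.
    `examples` is a list of (sequence, label) pairs."""
    labels = [label for _, label in examples]
    partitions = {}
    for sequence, label in examples:
        partitions.setdefault(sequence[position], []).append(label)
    remainder = sum(len(subset) / len(examples) * entropy(subset)
                    for subset in partitions.values())
    return entropy(labels) - remainder
```

A position whose amino acids perfectly separate resistant from susceptible examples scores the full entropy of the label set (1 bit for a balanced binary split), while an uninformative position scores 0; ID3 places the highest-scoring position at each interior node.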

Fig. 3.3 The genetic code.

List of mutations:
Found I at position 10,
Found A at position 71,
Found V at position 48,
** Reached decision: example is resistant. (8[0%]) **

Fig. 3.4 Generating an explanation.
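The codon-translation step shown in figure 3.3 can be sketched as follows, using a compact encoding of the standard genetic code; the names are illustrative:

```python
# Standard genetic code: with codon bases ordered T, C, A, G, the i-th
# character below is the amino acid for the i-th codon ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(nucleotides):
    """Translate a nucleotide sequence, codon by codon, into
    single-letter amino acid codes."""
    nucleotides = nucleotides.upper().replace("U", "T")
    return "".join(CODON_TABLE[nucleotides[i:i + 3]]
                   for i in range(0, len(nucleotides) - 2, 3))
```

For instance, `translate("CCTCAAATC")` gives `"PQI"`, after which the resulting amino acid string is sorted through the decision tree exactly as a reconstructed genotype would be.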

3.3 Nearest-Neighbour

Using the same training, validation and test sets used to create the decision trees, I created a k-nearest-neighbour classifier to approximate drug susceptibility from the k most similar genotypes in the combined training and validation datasets of each antiretroviral drug. A feature vector for each genotype in these datasets was constructed by comparing each attribute value with a reference sequence: for positions in which there was no change in amino acid the feature vector was given a dummy value, β, and for all other positions it was given the amino acid present in the genotype sequence. To classify the drug susceptibility of an unseen genotype, the 3 most similar genotypes (an odd number guarantees a majority) were retrieved from the training and validation datasets using a similarity measure. The drug susceptibility of the unseen genotype was then approximated as the majority drug-susceptibility classification of the retrieved genotypes. For each classification, an explanation consisting of the three closest genotypes was generated. The similarity of two genotypes was defined by how well their two feature vectors conform. In particular, for positions in which both feature vectors contained non-dummy values, I used the percentage of these that differ; to this score I added the percentage of the remaining positions that differ.
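The similarity measure and majority vote can be sketched as follows. The second term follows the diffs formula of figure 2.12 (unshared differences divided by sequence length); the text is ambiguous between that and dividing by the number of remaining positions, so this choice is an assumption. Here `"-"` stands in for the dummy value β, and the names are illustrative:

```python
from collections import Counter

DUMMY = "-"  # stands in for the dummy value at unmutated positions

def distance(g1, g2):
    """Distance between two feature vectors: the fraction of shared
    mutated (non-dummy) positions that disagree, plus the fraction of
    the sequence at which mutations are unshared."""
    shared = [(a, b) for a, b in zip(g1, g2) if a != DUMMY and b != DUMMY]
    rest = [(a, b) for a, b in zip(g1, g2) if a == DUMMY or b == DUMMY]
    score = 0.0
    if shared:
        score += sum(a != b for a, b in shared) / len(shared)
    score += sum(a != b for a, b in rest) / len(g1)
    return score

def classify(query, training, k=3):
    """Majority vote over the k nearest training genotypes.
    `training` is a list of (feature_vector, label) pairs; returns the
    prediction plus the retrieved neighbours as an explanation."""
    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest
```

With k = 3 and two classes, the vote is always decisive, and returning the retrieved neighbours provides the explanation described above.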

Chapter 4
Results

4.1 Classification Models

I obtained both decision tree and nearest-neighbour classifiers that predict drug susceptibility from genotype for 8 reverse transcriptase and 6 protease inhibitors. Decision tree learning generated classification models of varying complexity, ranging from only 5-9 interior attribute tests up to 10, 11, 12, 16 and 19 interior attribute tests for zalcitabine, indinavir, abacavir, amprenavir and saquinavir, respectively. In contrast, the nearest-neighbour classifiers generated no explicit models, but rather simply stored the training and validation datasets used to generate the decision trees.

4.1.1 Reverse Transcriptase Inhibitors

I obtained decision tree classifiers for each of the reverse transcriptase inhibitors: zalcitabine, didanosine, stavudine, lamivudine, nevirapine, abacavir, delavirdine and efavirenz; see figures 4.9-4.16, respectively. The decision trees for these drugs varied in complexity. In particular, I found rather simple models for didanosine, stavudine, nevirapine and efavirenz, with these trees having only 5-6 interior attribute tests. On the other hand, I found more complex models for zalcitabine, abacavir and delavirdine, with these trees having 9-12 interior attribute tests.

Training and validation datasets were randomly created from the entire dataset of applicable phenotype-genotype pairs and were used to derive each decision tree and nearest-neighbour classifier; see Appendix C for details. Within these datasets, fold-change values were distributed as in figures 4.1-4.8.

Fig. 4.1 Frequency distribution of fold-change values in the training and validation datasets for zalcitabine.

Fig. 4.2 Frequency distribution of fold-change values in the training and validation datasets for didanosine.

Fig. 4.3 Frequency distribution of fold-change values in the training and validation datasets for stavudine.

Fig. 4.4 Frequency distribution of fold-change values in the training and validation datasets for lamivudine.

Fig. 4.5 Frequency distribution of fold-change values in the training and validation datasets for abacavir.

Fig. 4.6 Frequency distribution of fold-change values in the training and validation datasets for nevirapine.

Fig. 4.7 Frequency distribution of fold-change values in the training and validation datasets for delavirdine.

Fig. 4.8 Frequency distribution of fold-change values in the training and validation datasets for efavirenz.

[P75] = <M>, <T> then: resistant (15[8%])
[P75] = <A> then: susceptible (3[0%])
[P75] = <V> and:
  [P184] = <M> then: susceptible (182[17%])
  [P184] = <V> and:
    [P210] = <L> then: susceptible (121[33%])
    [P210] = <W> and:
      [P177] = <E>, <G> then: resistant (9[0%])
      [P177] = <D> and:
        [P35] = <I> then: resistant (3[0%])
        [P35] = <T> then: susceptible (2[0%])
        [P35] = <V> and:
          [P74] = <V>, <I> then: resistant (4[0%])
          [P74] = <L> and:
            [P118] = <V> then: susceptible (15[0%])
            [P118] = <I> then: resistant (3[0%])
        [P35] = <M> and:
          [P41] = <M> then: resistant (1[0%])
          [P41] = <L> then: susceptible (3[0%])
[P75] = <I> and:
  [P41] = <M> then: resistant (10[0%])
  [P41] = <L> then: susceptible (1[0%])
[P75] = <L> and:
  [P32] = <K> then: susceptible (1[0%])
  [P32] = <H> then: resistant (1[0%])

Fig. 4.9 Decision tree classifier for zalcitabine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P151] = <M> then: resistant (15[0%])
[P151] = <Q> and:
  [P74] = <L>, <S> then: susceptible (333[12%])
  [P74] = <V> and:
    [P211] = <R>, <A> then: resistant (18[0%])
    [P211] = <T> then: susceptible (1[0%])
    [P211] = <K> and:
      [P297] = <E>, <K> then: susceptible (7[0%])
      [P297] = <A>, <R> then: resistant (2[0%])
  [P74] = <I> and:
    [P35] = <V> then: resistant (3[0%])
    [P35] = <L> then: susceptible (1[0%])

Fig. 4.10 Decision tree classifier for didanosine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P210] = <L> and:
  [P75] = <V> then: susceptible (257[8%])
  [P75] = <A>, <T>, <I>, <S> then: resistant (17[0%])
  [P75] = <M> and:
    [P3] = <S> then: resistant (1[0%])
    [P3] = <C> then: susceptible (1[0%])
[P210] = <W> and:
  [P69] = <N>, <D>, <S> then: resistant (15[0%])
  [P69] = <T> and:
    [P67] = <D>, <G>, <E> then: susceptible (22[0%])
    [P67] = <N> then: resistant (38[50%])

Fig. 4.11 Decision tree classifier for stavudine. (n[e]) at the leaves denotes the number of examples n and the estimated error e. Leaves with an error of 50% represent an inability to make a definite classification.

[P184] = <V>, <I> then: resistant (236[0%])
[P184] = <M> and:
  [P69] = <T>, <N>, <A> then: susceptible (165[6%])
  [P69] = <D> and:
    [P245] = <T> then: resistant (1[0%])
    [P245] = <E>, <K>, <A> then: susceptible (3[0%])
    [P245] = <V> and:
      [P118] = <V> then: susceptible (5[0%])
      [P118] = <I> then: resistant (3[0%])
  [P69] = <S> and:
    [P41] = <L> then: resistant (2[0%])
    [P41] = <M> then: susceptible (1[0%])
  [P69] = <I> and:
    [P21] = <V> then: resistant (1[0%])
    [P21] = <I> then: susceptible (1[0%])

Fig. 4.12 Decision tree classifier for lamivudine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <R> then: susceptible (5[0%])
[P103] = <N>, <T> then: resistant (60[0%])
[P103] = <K> and:
  [P181] = <C>, <I> then: resistant (28[15%])
  [P181] = <Y> and:
    [P190] = <A>, <S>, <T>, <Q>, <C>, <E>, <V> then: resistant (20[0%])
    [P190] = <G> and:
      [P106] = <I>, <L> then: susceptible (6[0%])
      [P106] = <A>, <M> then: resistant (9[0%])
      [P106] = <V> and:
        [P188] = <Y>, <H> then: susceptible (274[3%])
        [P188] = <L>, <C> then: resistant (5[0%])
[P103] = <S> and:
  [P35] = <V>, <M> then: resistant (2[0%])
  [P35] = <I> then: susceptible (1[0%])

Fig. 4.13 Decision tree classifier for nevirapine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P184] = <V>, <I> then: resistant (149[24%])
[P184] = <M> and:
  [P41] = <M> and:
    [P151] = <M> then: resistant (11[0%])
    [P151] = <Q> and:
      [P65] = <K> and:
        [P211] = <K>, <A>, <S>, <T>, <G> then: susceptible (54[0%])
        [P211] = <R> and:
          [P178] = <M>, <V> then: susceptible (7[0%])
          [P178] = <L> then: resistant (2[0%])
          [P178] = <I> and:
            [P69] = <T>, <N> then: susceptible (50[0%])
            [P69] = <S> then: resistant (1[0%])
      [P65] = <R> and:
        [P35] = <I> then: susceptible (1[0%])
        [P35] = <M> then: resistant (1[0%])
        [P35] = <V> and:
          [P162] = <C> then: resistant (1[0%])
          [P162] = <A> then: susceptible (1[0%])
          [P162] = <S> and:
            [P62] = <A> then: resistant (4[0%])
            [P62] = <V> then: susceptible (1[0%])
  [P41] = <L> and:
    [P69] = <N>, <D>, <S>, <E> then: resistant (11[0%])
    [P69] = <A> then: susceptible (1[0%])
    [P69] = <T> and:
      [P44] = <E> then: susceptible (26[22%])
      [P44] = <D>, <E> then: resistant (7[0%])

Fig. 4.14 Decision tree classifier for abacavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <N>, <S>, <T> then: resistant (66[8%])
[P103] = <R>, <Q> then: susceptible (6[0%])
[P103] = <K> and:
  [P181] = <C>, <I> then: resistant (17[0%])
  [P181] = <Y> and:
    [P245] = <Q>, <T>, <E>, <M>, <K>, <I>, <S>, <L>, <A> then: susceptible (81[0%])
    [P245] = <R> then: resistant (1[0%])
    [P245] = <V> and:
      [P102] = <K>, <R> then: susceptible (155[3%])
      [P102] = <E> then: resistant (1[0%])
      [P102] = <Q> and:
        [P100] = <I> then: resistant (2[0%])
        [P100] = <L> and:
          [P20] = <R> then: resistant (1[0%])
          [P20] = <K> and:
            [P230] = <L> then: resistant (1[0%])
            [P230] = <M> and:
              [P188] = <Y>, <C> then: susceptible (17[0%])
              [P188] = <L> and:
                [P35] = <V> then: resistant (1[0%])
                [P35] = <T> then: susceptible (1[0%])

Fig. 4.15 Decision tree classifier for delavirdine. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P103] = <R>, <T>, <Q> then: susceptible (5[0%])
[P103] = <N> then: resistant (66[17%])
[P103] = <K> and:
  [P190] = <S>, <T>, <Q>, <C>, <E> then: resistant (10[0%])
  [P190] = <G> and:
    [P188] = <Y>, <C>, <H> then: susceptible (211[2%])
    [P188] = <L> then: resistant (7[0%])
  [P190] = <A> and:
    [P101] = <E>, <D> then: resistant (6[0%])
    [P101] = <K> and:
      [P135] = <T>, <M> then: resistant (3[0%])
      [P135] = <I> then: susceptible (5[0%])
[P103] = <S> and:
  [P35] = <V> then: susceptible (2[0%])
  [P35] = <M> then: resistant (1[0%])

Fig. 4.16 Decision tree classifier for efavirenz. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

4.1.2 Protease Inhibitors

I obtained decision tree classifiers for each of the protease inhibitors: lopinavir, saquinavir, indinavir, ritonavir, amprenavir and nelfinavir; see figures 4.23-4.28, respectively. Here, I found rather more complex models than were found for the reverse transcriptase inhibitors. In particular, I found decision trees with 6-9 interior attribute tests for the drugs ritonavir, lopinavir and nelfinavir. On the other hand, the decision trees for indinavir, amprenavir and saquinavir were even more complex, with 11, 16 and 19 interior attribute tests, respectively. This complexity suggests that the genetic basis of drug resistance is more complicated for protease inhibitors than for reverse transcriptase inhibitors. However, this conjecture is by no means conclusive, as complex decision trees can also stem from noisy training data. It may simply be the case that the training data for these drugs contains errors, and that these errors are distributed evenly between the training and validation datasets. Training and validation datasets were randomly created from the entire dataset of applicable phenotype-genotype pairs and were used to derive each decision tree and nearest-neighbour classifier; see Appendix C for details.
Within these datasets, fold-change values were distributed as in figures 4.17-4.22.

Fig. 4.17 Frequency distribution of fold-change values in the training and validation datasets for saquinavir.

Fig. 4.18 Frequency distribution of fold-change values in the training and validation datasets for lopinavir.

Fig. 4.19 Frequency distribution of fold-change values in the training and validation datasets for indinavir.

Fig. 4.20 Frequency distribution of fold-change values in the training and validation datasets for ritonavir.

Fig. 4.21 Frequency distribution of fold-change values in the training and validation datasets for nelfinavir.

Fig. 4.22 Frequency distribution of fold-change values in the training and validation datasets for amprenavir.

[P10] = <L>, <H>, <R>, <M> then: susceptible (92[10%])
[P10] = <V>, <F>, <Z> then: resistant (55[12%])
[P10] = <I> and:
  [P82] = <A>, <T>, <I>, <F>, <S> then: resistant (37[0%])
  [P82] = <V> and:
    [P71] = <V>, <L> then: resistant (16[19%])
    [P71] = <I> then: susceptible (2[0%])
    [P72] = <I>, <V> then: resistant (10[0%])
    [P72] = <M>, <E> then: susceptible (2[0%])
    [P71] = <A> and:
      [P46] = <M>, <L> then: susceptible (9[0%])
      [P46] = <I> and:
        [P72] = <T> then: resistant (2[0%])
        [P72] = <L> then: susceptible (1[0%])
        [P72] = <I> and:
          [P93] = <L> then: susceptible (3[0%])
          [P93] = <I> then: resistant (3[0%])

Fig. 4.23 Decision tree classifier for lopinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P10] = <R>, <Z> then: resistant (26[0%])
[P10] = <I> and:
  [P71] = <I>, <T>, <V>, <Z> then: resistant (105[18%])
  [P71] = <A> and:
    [P48] = <V>, <S> then: resistant (8[0%])
    [P48] = <G> and:
      [P37] = <D>, <Z> then: resistant (3[0%])
      [P37] = <E>, <T>, <N> then: susceptible (4[0%])
      [P37] = <S> and:
        [P84] = <V>, <C> then: resistant (6[0%])
        [P84] = <I> and:
          [P73] = <S>, <G>, <C> then: susceptible (32[0%])
          [P73] = <T> then: resistant (2[0%])
  [P71] = <L> and:
    [P13] = <I> then: resistant (2[0%])
    [P13] = <V> then: susceptible (2[0%])
[P10] = <L> and:
  [P90] = <L> then: susceptible (173[0%])
  [P90] = <M> and:
    [P88] = <S>, <D> then: resistant (14[0%])
    [P88] = <N> and:
      [P48] = <V> then: resistant (4[0%])
      [P48] = <G> and:
        [P84] = <V> then: resistant (4[0%])
        [P84] = <I> and:
          [P60] = <E> and:
            [P64] = <I> then: resistant (4[0%])
            [P64] = <V> then: susceptible (4[0%])
          [P60] = <D> and:
            [P14] = <K> then: susceptible (14[0%])
            [P14] = <R> then: resistant (1[0%])
[P10] = <V> and:
  [P71] = <V> then: resistant (10[0%])
  [P71] = <A> then: susceptible (8[0%])
  [P71] = <T> and:
    [P12] = <Z> then: resistant (1[0%])
    [P12] = <T> and:
      [P30] = <D> then: susceptible (1[0%])
      [P30] = <N> then: resistant (1[0%])
[P10] = <F> and:
  [P84] = <I> then: susceptible (20[33%])
  [P84] = <L>, <C>, <A> then: resistant (3[0%])
  [P84] = <V> and:
    [P63] = <P>, <A> then: resistant (7[0%])

Fig. 4.24 Decision tree classifier for saquinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P54] = <V>, <T>, <A>, <Z> then: resistant (125[0%])
[P54] = <M> then: susceptible (2[0%])
[P54] = <I> and:
  [P46] = <Z> then: resistant (7[0%])
  [P46] = <I> and:
    [P63] = <P>, <A> then: resistant (46[7%])
    [P63] = <V>, <R>, <Q> then: susceptible (5[0%])
    [P63] = <L> and:
      [P77] = <I> then: susceptible (3[0%])
      [P77] = <V> and:
        [P69] = <Y>, <Q> then: resistant (2[0%])
        [P69] = <K> then: susceptible (1[0%])
        [P69] = <H> and:
          [P10] = <I>, <L>, <V> then: resistant (4[0%])
          [P10] = <F> then: susceptible (1[0%])
  [P46] = <M> and:
    [P90] = <M> then: resistant (40[27%])
    [P90] = <L> then: susceptible (200[6%])
  [P46] = <L> and:
    [P10] = <I>, <R> then: resistant (5[0%])
    [P10] = <L> then: susceptible (10[0%])
    [P10] = <F> and:
      [P62] = <I> then: resistant (2[0%])
      [P62] = <V> then: susceptible (1[0%])
[P54] = <L> and:
  [P10] = <I> then: resistant (2[0%])
  [P10] = <L> then: susceptible (2[0%])
  [P20] = <K> then: susceptible (2[0%])
  [P20] = <T> then: resistant (1[0%])

Fig. 4.25 Decision tree classifier for indinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P82] = <A>, <T>, <F>, <L>, <S> then: resistant (109[4%])
[P82] = <V> and:
  [P90] = <M> then: resistant (79[28%])
  [P90] = <L> and:
    [P84] = <I> then: susceptible (205[6%])
    [P84] = <V>, <L>, <A> then: resistant (26[0%])
    [P84] = <C> and:
      [P10] = <I>, <F> then: resistant (3[0%])
      [P10] = <L> then: susceptible (1[0%])
[P82] = <I> and:
  [P46] = <I> then: resistant (3[0%])
  [P46] = <L> then: susceptible (1[0%])
  [P46] = <M> and:
    [P84] = <I> then: susceptible (4[0%])
    [P84] = <V>, <C> then: resistant (2[0%])

Fig. 4.26 Decision tree classifier for ritonavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P10] = <R> then: susceptible (3[0%])
[P10] = <Z> then: resistant (28[0%])
[P10] = <I> and:
  [P84] = <I> then: susceptible (90[22%])
  [P84] = <V>, <C>, <A> then: resistant (41[37%])
[P10] = <L> and:
  [P84] = <I> and:
    [P50] = <V> then: resistant (3[0%])
    [P50] = <L> then: susceptible (5[0%])
    [P50] = <I> and:
      [P47] = <I> then: susceptible (156[2%])
      [P47] = <A> then: resistant (2[0%])
      [P47] = <V> and:
        [P12] = <T> then: susceptible (1[0%])
        [P12] = <S> then: resistant (1[0%])
  [P84] = <V> and:
    [P73] = <S> then: susceptible (1[0%])
    [P73] = <T> then: resistant (2[0%])
    [P73] = <G> and:
      [P46] = <I> then: susceptible (1[0%])
      [P46] = <L> then: resistant (2[0%])
      [P46] = <M> and:
        [P12] = <P> then: resistant (1[0%])
        [P12] = <T> and:
          [P36] = <M> then: susceptible (1[0%])
          [P36] = <Z> then: resistant (1[0%])
  [P84] = <C> and:
    [P20] = <K> then: susceptible (1[0%])
    [P20] = <I> then: resistant (1[0%])
[P10] = <V> and:
  [P46] = <I> then: resistant (6[0%])
  [P46] = <M> then: susceptible (1[0%])
  [P46] = <L> and:
    [P61] = <Q> then: resistant (3[0%])
    [P61] = <E> then: susceptible (1[0%])
[P10] = <F> and:
  [P84] = <V>, <A> then: resistant (8[0%])
  [P84] = <I> and:
    [P46] = <I> then: resistant (3[0%])
    [P46] = <M> then: susceptible (13[50%])
    [P46] = <L> and:
      [P15] = <I> then: susceptible (3[0%])
      [P15] = <V> then: resistant (3[0%])

Fig. 4.27 Decision tree classifier for amprenavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

[P54] = <V>, <L>, <T>, <Z> then: resistant (128[0%])
[P54] = <M> then: susceptible (1[0%])
[P54] = <I> and:
  [P90] = <M> then: resistant (94[8%])
  [P90] = <L> and:
    [P30] = <N> then: resistant (37[0%])
    [P30] = <Y> then: susceptible (1[0%])
    [P30] = <D> and:
      [P46] = <I>, <Z> then: resistant (31[28%])
      [P46] = <M> and:
        [P88] = <S> then: resistant (8[0%])
        [P88] = <D> then: susceptible (1[0%])
        [P88] = <N> and:
          [P82] = <V>, <S> then: susceptible (155[4%])
          [P82] = <F> then: resistant (3[0%])
          [P82] = <I> and:
            [P20] = <K> then: susceptible (1[0%])
            [P20] = <I> then: resistant (1[0%])
      [P46] = <L> and:
        [P10] = <I> then: resistant (5[0%])
        [P10] = <F> then: susceptible (1[0%])
        [P10] = <L> and:
          [P82] = <A>, <L> then: susceptible (2[0%])
          [P82] = <V> then: resistant (5[0%])

Fig. 4.28 Decision tree classifier for nelfinavir. (n[e]) at the leaves denotes the number of examples n and the estimated error e.

4.2 Prediction Quality

To assess the predictive quality of the classifiers on unseen genotypes, 30% of the entire dataset (applicable to each drug) was randomly selected for testing. These cases were then queried using either the appropriate decision tree or nearest-neighbour classifier to obtain a predicted drug susceptibility. For each genotype in the testing set, the predicted classification was compared with its true classification to obtain both an indication of predictive error and an estimate of how well a classifier is able to generalise beyond the training population. In particular, I determined for each classifier its prediction error across a testing set (the percentage of misclassified cases), its sensitivity and its specificity. The sensitivity of a classifier is the probability that it predicts drug resistance given that a case is truly resistant. The specificity of a classifier is the probability that it predicts drug susceptibility given that a case is truly susceptible.
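These three quality measures can be computed directly from the predicted and true labels; a minimal sketch (the function name is illustrative):

```python
def evaluate(predictions, truths):
    """Prediction error, sensitivity and specificity for binary
    resistant/susceptible labels."""
    paired = list(zip(predictions, truths))
    tp = sum(p == t == "resistant" for p, t in paired)
    tn = sum(p == t == "susceptible" for p, t in paired)
    fp = sum(p == "resistant" and t == "susceptible" for p, t in paired)
    fn = sum(p == "susceptible" and t == "resistant" for p, t in paired)
    error = (fp + fn) / len(paired)            # fraction misclassified
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return error, sensitivity, specificity
```

Sensitivity is estimated as the fraction of truly resistant test cases predicted resistant, and specificity as the fraction of truly susceptible cases predicted susceptible, matching the definitions above.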
To assess how these newly generated classifiers fared against the decision tree classifiers originally presented in [1], I hand-implemented each classifier and tested it against the same testing set as used above; see appendix D for details of these trees.

An important consideration when testing such classification models is how well the distribution of cases in the testing set concurs with the distribution of cases in the training/validation sets. This is important because when distributions differ greatly we can expect a loss of predictive quality. This stems from the fact that learning is most reliable when the cases in the training set follow a distribution similar to that of the entire population. Table 4.1 gives the distribution of fold-change values in each testing set.

      0-1   2-3   4-7   8-15  16-31 32-63 64-127 128-255 256-511 512-1023 1024-2047 >2047
ddc   18%   55%   21%   2%    3%    0     0      0       0       0        0         0
ddi   16%   66%   12%   3%    3%    0     0      0       0       0        0         0
d4t   33%   47%   15%   4%    1%    0     0      0       0       0        0         0
3TC   12%   18%   9%    7%    7%    15%   16%    14%     0       0        0         0
ABC   21%   28%   34%   11%   4%    2%    0      0       0       0        0         0
NVP   33%   27%   11%   3%    3%    6%    5%     2%      4%      4%       1%        0
DLV   31%   22%   12%   6%    2%    6%    5%     8%      5%      0        0         0
EFV   43%   21%   7%    5%    2%    9%    4%     3%      5%      0        0         0
SQV   33%   29%   12%   5%    4%    8%    4%     2%      0       2%       0         0
LPV   28%   21%   4%    13%   12%   16%   5%     1%      0       0        0         0
IDV   20%   26%   13%   16%   12%   9%    3%     1%      0       0        0         0
RTV   24%   29%   11%   7%    10%   10%   7%     3%      0       0        0         0
NFV   9%    18%   13%   13%   18%   16%   8%     3%      1%      0        0         0
APV   37%   27%   17%   9%    6%    2%    0      0       0       0        0         0

Table 4.1 Frequency distribution of fold-change values within the testing datasets. Resistance factors are grouped into equidistant bins. Differences between the testing and training set greater than 5% are highlighted.

Here the distribution of test cases is relatively similar to that of the training experience. Therefore, any models with low predictive quality cannot be explained away by differences between the training and test set distributions. Testing the newly constructed decision trees resulted in prediction errors in the range 6.3% - 14.3% for the protease inhibitors, except for amprenavir, which had a prediction error of 24.2%; 6.0% - 7.8% for the nonnucleoside reverse transcriptase inhibitors; and 6.8%, 14.6%, 14.8% and 16.6% for 3TC, d4t, ddi and ABC, respectively.
The error rate for ddc was the poorest, at 24.7%. Using the same test cases, the previously published decision trees resulted in prediction errors in the range 8.2% - 20.0% for the protease inhibitors, except for amprenavir, which again had a prediction error of 24.2%; 4.8% - 9.5% for the nonnucleoside reverse transcriptase inhibitors; and 8.1%, 20.0%, 51.0% and 19.0% for 3TC, d4t, ddi and ABC,

respectively. The error rate for ddc was 29.7%. Table 4.2 gives the details of the prediction errors for each drug. Using the nearest-neighbour classifiers I found relatively poor predictive quality, with prediction errors in the range 18.0% - 19.1% for the protease inhibitors, except for nelfinavir, which had a prediction error of 26.4%; 22.4% - 36.3% for the nonnucleoside reverse transcriptase inhibitors; and 32.7%, 18.9%, 21.9% and 28.7% for 3TC, d4t, ddi and ABC, respectively. The error rate for ddc was 46.2%. From these results it is clear that the newly constructed decision trees outperform the previously published decision trees over this dataset. The only drug for which the original decision tree outperforms the newly constructed decision tree is efavirenz. However, this score is not a true indication of the predictive quality of the original tree, because it fails to return a classification for 72% of the cases in the testing set. In addition, although the nearest-neighbour classifiers fare poorly compared to the newly constructed decision trees, they outperform the original decision trees for some drugs.

Drug  No. of test cases  New D-tree prediction error  Original D-tree prediction error  Nearest-neighbour prediction error
ddc   182                24.7% [0%]                   29.7% [0%]                        46.2% [0%]
ddi   196                14.8% [0%]                   51.0% [0%]                        21.9% [0%]
d4t   185                14.6% [0%]                   20.0% [0%]                        18.9% [0%]
3TC   220                6.8% [0%]                    8.1% [0%]                         32.7% [0%]
ABC   174                16.6% [0%]                   19.0% [1%]                        28.7% [0%]
NVP   205                7.3% [0%]                    7.3% [0.9%]                       22.4% [0%]
DLV   179                7.8% [0%]                    9.5% [0%]                         36.3% [0%]
EFV   168                6.0% [0%]                    4.8% [72%]                        24.4% [0%]
SQV   251                14.3% [0%]                   20.0% [0%]                        18.3% [0%]
LPV   95                 10.5% [0%]                   No tree available.                18.9% [0%]
IDV   247                14.5% [0%]                   15.0% [3%]                        19.4% [0%]
RTV   228                10.5% [0%]                   11.0% [0.8%]                      18.0% [0%]
NFV   254                6.3% [0%]                    8.2% [4%]                         26.4% [0%]
APV   219                24.2% [0%]                   24.2% [8.7%]                      19.1% [0%]

Table 4.2 Prediction errors. Figures in square brackets give the percentage of cases for which the classifier returned no classification.

Considering the sensitivity and specificity of the classification models (Table 4.3), the newly constructed decision trees achieved sensitivities in the range 0.79-0.96 for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.67; 0.82-0.89 for the nonnucleoside reverse transcriptase inhibitors; and 0.88-0.9 for 3TC and ABC. The sensitivities for ddc, ddi and d4t were poorest, with values of 0.39, 0.32 and 0.67, respectively. Specificities were in the range 0.81-0.89 for the protease inhibitors, 0.94-0.97 for the nonnucleoside reverse transcriptase inhibitors, and 0.91-0.98 for ddc, ddi, d4t and 3TC. The specificity for ABC was poorest, with a value of 0.77.

For the previously published decision trees, sensitivities were in the range 0.92-0.96 for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.76; 0.78-1 for the nonnucleoside reverse transcriptase inhibitors; and 0.93 and 0.88 for 3TC and ABC. The sensitivities for ddc, ddi and d4t were poorest, with values of 0.59, 0.64 and 0.72, respectively. Specificities were in the range 0.71-0.87 for the protease inhibitors, 0.94-0.97 for the nonnucleoside reverse transcriptase inhibitors, and 0.64-0.97 for ABC, ddc, ddi, d4t and 3TC. The specificity for EFV was poorest, with a value of 0.27. It is clear from these results that the newly constructed decision trees fare better at predicting susceptibility than resistance, and vice versa for the previously published decision trees.

For the nearest-neighbour classifiers, sensitivities were in the range 0.64-0.69 for the protease inhibitors, except for amprenavir, which had a sensitivity of 0.48; 0.42-0.7 for the nonnucleoside reverse transcriptase inhibitors; and 0.69 for 3TC and ABC. The sensitivities for ddc, ddi and d4t were 0.51, 0.36 and 0.47, respectively. Specificities were in the range 0.91-0.98 for the protease inhibitors, 0.61-0.89 for the nonnucleoside reverse transcriptase inhibitors, and 0.55-0.92 for ABC, ddc, ddi, d4t and 3TC.

       New D-Tree                 Original D-Tree            Nearest Neighbour
Drug   sensitivity  specificity   sensitivity  specificity   sensitivity  specificity
ddc    0.39         0.93          0.59         0.75          0.51         0.55
ddi    0.32         0.98          0.64         0.45          0.36         0.88
d4t    0.67         0.91          0.72         0.82          0.47         0.92
3TC    0.90         0.98          0.88         0.97          0.69         0.65
ABC    0.88         0.77          0.93         0.64          0.69         0.73
NVP    0.89         0.94          0.89         0.94          0.47         0.89
DLV    0.82         0.97          0.78         0.97          0.70         0.61
EFV    0.85         0.97          1            0.27          0.42         0.87
SQV    0.79         0.89          0.96         0.71          0.63         0.91
LPV    0.92         0.86          -            -             0.73         0.91
IDV    0.90         0.81          0.94         0.75          0.64         0.98
RTV    0.93         0.87          0.92         0.86          0.69         0.93
NFV    0.96         0.88          0.93         0.87          0.65         0.96
APV    0.67         0.93          0.76         0.72          0.48         0.95

Table 4.3 Sensitivities and specificities.
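Sensitivity and specificity as tabulated in Table 4.3 reduce to simple ratios over the confusion counts, treating resistant as the positive class. A hypothetical helper (my illustration, not the thesis code):

```java
public class SensSpec {
    // Sensitivity: fraction of truly resistant cases predicted resistant.
    public static double sensitivity(int truePos, int falseNeg) {
        return (double) truePos / (truePos + falseNeg);
    }

    // Specificity: fraction of truly susceptible cases predicted susceptible.
    public static double specificity(int trueNeg, int falsePos) {
        return (double) trueNeg / (trueNeg + falsePos);
    }
}
```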

Chapter 5
Conclusion

5.1 Concluding Remarks and Observations

Using a dataset of HIV-1 reverse transcriptase and protease genotypes with matched drug-susceptibilities I was able to construct decision tree classifiers to recognise genotypic patterns characteristic of drug resistance for 14 antiretroviral drugs. No prior knowledge about drug-resistance-associated mutations was used, and mutations at every sequence position were treated equally. Using an independent testing set I was able to judge each decision tree's predictive quality against a number of similarly derived decision trees presented in the literature. I also constructed a novel nearest-neighbour classifier to predict drug susceptibility from genotype for each drug. Each nearest-neighbour classifier used a database of matched phenotype-genotype pairs but, in contrast to decision tree learning, nearest-neighbour learning did not attempt to extract a generalised classification function. In order to investigate the possible advantage of neural network learning over decision tree learning I derived a decision tree classifier for the protease inhibitor lopinavir. This is compared to the neural network classifier for lopinavir presented in [2].

The predictive quality of the decision tree classifiers was mixed. For the decision trees, I found prediction errors between 6.0% and 24.7% for all drugs. These results offered an improvement over the performance of previously published decision trees, which had prediction errors between 8.1% and 51.0% over the same testing set. Nearest-neighbour classifiers exhibited

poorer performance, with prediction errors between 18.0% and 46.2%, but still outperformed the previously published decision trees on some drugs.

5.1.1 Decision Tree Models

The decision trees generated in the scope of this study varied in complexity. In particular, I found relatively simple classification models (5-7 interior attribute tests) for the drugs didanosine, stavudine, lamivudine, nevirapine, efavirenz, ritonavir and lopinavir, and more complex models (9-19 interior attribute tests) for the drugs delavirdine, nelfinavir, zalcitabine, indinavir, abacavir, amprenavir and saquinavir. This is in contrast to the decision tree classifiers presented in [1], which had between 4 and 12 interior attribute tests for all drugs. This increase in complexity may stem from the fact that the training data of genotypes exhibits large mutational interdependence, that I used a larger training set, that the training data is distributed differently, that pruning is ineffective, or that the training data contains noise and subsequently causes overly specific classification functions.

In the most extreme case I can compare the complexity of the decision trees for the protease inhibitor saquinavir. In [1] the decision tree for saquinavir had only 5 interior attribute tests and achieved a prediction error of 12.5% in leave-one-out experiments. A leave-one-out experiment predicts a classification for a case in the training data by constructing a decision tree on the remaining data and then using that tree to give a classification. This rather simple tree is in contrast to the decision tree for saquinavir presented in this study, which has 19 interior attribute tests and achieved a prediction error of 14.3% over an independent testing set consisting of 251 cases. In this case the tree appears to be overgrown and the effects of reduced error pruning are minimal.
In particular, only two leaf nodes contain a classification error (an indication of where subtrees have been removed) and 54% of leaves have only 1-4 training cases associated with them. In this respect the tree appears to be overly specific, and pruning was not applied aggressively enough to force generalisation. This is true for a number of the other decision tree classifiers that I generated. In particular, the decision trees for amprenavir, abacavir and indinavir appear to be overly specific, and the effects of reduced error pruning, again, appear to be minimal. The remaining classification models are more similar in structure and complexity to the ones previously published. For models in which overfitting appears to have occurred, a basic reduced error pruning strategy appears not to introduce enough bias to create shorter, more general trees. This is problematic

because shorter trees are more desirable than larger ones. The preference for shorter trees stems from Occam's razor, which states that shorter trees form a sounder basis for generalisation beyond a set of training data than larger ones. This is particularly important if we wish to use decision tree classifiers as a future tool to help select drug regimens for treating HIV-1 infection: we seek a classification model that generalises well to the entire population of HIV-1 cases.

Looking again at the performance of the simple saquinavir classification model over an independent testing set (the one used to test the complex model), we see a dramatic decrease in performance compared with what was previously published (a prediction error of 20.0%). Similarly, the performance of the other models also decreased, except for d4t, 3TC, NVP and NFV. Here I may argue that leave-one-out experiments are not sufficient to judge the performance of these models, because single cases taken from the training data are less likely to include information that is otherwise omitted. This is evident in the fact that some of the models fail to return a classification because they fail to recognise the presence of amino acids at certain sequence positions. For example, the previously published decision tree for efavirenz fails to return a classification for 72% of the testing examples. The relative similarity of performance also makes a strong case for the ID3 algorithm: ID3 is able to generate decision trees with a similar performance to the decision trees generated using the C4.5 algorithm, so in this setting many of the C4.5 extensions and features are not required. Furthermore, the performance of the simple saquinavir classification model over this testing set implies that the simple model is overly general.
In other words, the simple classification model is less likely to differentiate between resistant and susceptible cases than the more complex model. In this respect the complex saquinavir classification model may indeed satisfy Occam's razor for this particular dataset, and it may even be the smallest possible decision tree that can be generated from the training data. The simple and complex saquinavir classification models therefore satisfy Occam's razor in contradictory ways. This presents us with a difficult choice of which tree would best generalise to the entire population of HIV-1 cases. Statistically, we can compare the sensitivities and specificities of the models as an indication of how well each decision tree is able to generalise beyond the training population.
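The leave-one-out procedure used to evaluate the trees in [1], as described above, generalises to any learner: each case is classified by a model trained on all remaining cases. A sketch follows, with an illustrative Example record and the learner passed in as a function; this is not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class LeaveOneOut {
    // A minimal example record: a feature vector and its true classification.
    public record Example(double[] features, String label) {}

    // trainAndClassify trains on the given examples and classifies the held-out one.
    public static double errorRate(List<Example> data,
            BiFunction<List<Example>, Example, String> trainAndClassify) {
        int wrong = 0;
        for (int i = 0; i < data.size(); i++) {
            List<Example> rest = new ArrayList<>(data);
            Example heldOut = rest.remove(i);   // leave one case out
            String predicted = trainAndClassify.apply(rest, heldOut);
            if (!predicted.equals(heldOut.label())) wrong++;
        }
        return (double) wrong / data.size();
    }
}
```

Any classifier, a decision tree learner or a nearest-neighbour rule, can be plugged in as the `trainAndClassify` function.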

Considering the sensitivities (the ability of a model to predict drug resistance when a case is truly resistant) of the newly constructed decision tree models against the previously published models, the results were comparable for most drugs except zalcitabine, didanosine, amprenavir and saquinavir. By using a larger dataset for learning I have therefore found classification models with relatively similar sensitivities, so the ability of a decision tree to predict resistance seems not to depend greatly on the size of the training dataset. Considering the specificities (the ability of a model to predict drug susceptibility when a case is truly susceptible), the new models offer an improvement across the board. Therefore, by using a larger training dataset I have found classification models that offer an improved ability to predict susceptibility.

Returning to the decision tree for saquinavir, by inspecting the attribute tests present in the simple and complex models we can get a picture of the genetic basis for saquinavir drug resistance. The simple model represents saquinavir resistance as being determined by mutational changes at the sequence positions 90, 48, 54, 84 and 72. These sequence positions have all been previously described as being associated with either high-level or intermediate-level resistance [14], except position 72. Here, positions associated with high-level resistance are placed closer to the root of the tree. In contrast, the complex model represents saquinavir drug resistance as being determined by mutational changes at the sequence positions 10, 71, 48, 37, 84, 73, 13, 90, 88, 60, 64, 14, 12, 30 and 63. Only the positions 10, 71, 48, 84, 73, 90 and 63 have been previously described as being associated with saquinavir resistance.
Here, the positions 10, 71, 90, 48 and 84 are placed closer to the root of the tree and are associated with high-level resistance, except for positions 10 and 71, which are regarded as only accessory resistance mutations. The other positions are not listed in [14] as being associated with saquinavir resistance. This may imply either that the decision tree has been able to identify as yet unknown resistance-associated mutations from the training data, or that the decision tree learning algorithm has been fooled by noise in the training data. The situation is similar for the reverse transcriptase inhibitor classification models. In other words, these models contain a mixture of known high-level, intermediate-level and accessory resistance-associated mutations. They also contain a number of mutations not listed in [15]. However, this situation does not extend to the other protease inhibitor classification models. For ritonavir, amprenavir and nelfinavir, the models tended to identify only previously known high-level, intermediate-level and accessory resistance-associated mutations.

Furthermore, higher-level resistance-associated mutations tended to be tested closer to the roots of their respective trees. Only the classification models for lopinavir and indinavir followed a similar pattern to that of saquinavir and tended to identify intermediate-level, accessory-level and previously unknown mutations.

These observations raise three important considerations when applying decision tree learning to the phenotype-prediction problem: choosing an appropriate attribute selection measure, the importance of the training data, and how deeply to grow a tree. In this study, amino acid positions were selected based on maximal information gain. However, judging from the types of mutations that are placed closer to the roots of the trees (in some cases accessory rather than high-level resistance-associated mutations), I may have obtained better classification models if I had used a different attribute selection measure. Specifically, information gain has a natural bias to prefer attributes with many values over those with fewer values. In the context of the phenotype-prediction problem this could prove problematic, because a certain amino acid position could exhibit a variety of mutations but nevertheless play no role in drug resistance. This amino acid position may then be selected, accidentally, as a good indicator of resistance due to coincidences in the training data.

The size and quality of the training data heavily dictate the quality of the eventual decision tree classifier. Of course, to obtain a decision tree classifier that generalises well to the entire population of HIV-1 cases, the training data should be distributed in such a way that it is representative of the entire population. This is extremely difficult in practice and is unrealistic.
However, with larger and larger datasets of matched phenotype-genotype pairs becoming available, it may become possible to probabilistically model the distribution of cases within the entire population. Training sets that are constructed to follow such distributions will therefore generate decision trees with stronger predictive merit. In this way the size of the training set is less important once we have reached a minimal number of examples; what matters is the examples themselves. A better decision tree will be grown if the data that it is derived from is of good quality and varied. By good quality I mean that the data is reliable (minimises error), and by varied I mean that for each phenotype we have a wide selection of genotypes. In this study I simply used the complete HIV-1 protease and reverse transcriptase drug-susceptibility datasets to generate decision tree classifiers with relatively low prediction

errors. However, since no attention was paid to determining the quality and variety of examples in the training, validation and test sets, these same decision trees may falter when applied to genotypes drawn randomly from the entire population of HIV-1 cases, similar to how the performance of the previously published decision trees faltered when applied to a different testing set.

The C4.5 and ID3 algorithms grow decision trees just deeply enough to perfectly classify the training examples, in accordance with Occam's razor. Whilst this is sometimes a reasonable strategy, it can lead to difficulties. In particular, for the phenotype-prediction problem we can only ever obtain a relatively small subset of training examples compared with the entire population, and this may lead to a classification model that does not generalise well to the true target function. There comes a point during decision tree learning when a tree starts to become specific to the target function represented by the training examples and fails to generalise to the remaining population. In other words, the tree has become overfitted. As previously mentioned, the effects of overfitting can be minimised by using a number of post-pruning strategies. However, as has been exhibited, the effects of basic reduced error pruning were minimal for some trees. We should therefore introduce a suitable weighting to force pruning to consider even smaller decision trees that may not perform well against the training set but nevertheless fare better over the remaining population.

5.1.2 Neural Network Models

The use of neural networks may be particularly suited to the phenotype-prediction problem for drugs such as lopinavir, where resistance is dictated by complex combinations of multiple mutations [2].
In particular, by representing the classification function Resistant: Genotype → Drug_Susceptibility_Classification using a neural network we are able to represent nonlinear functions of many variables, such as multiple mutations exhibiting large interdependence. Indeed, by using neural network learning we do not have to make any prior assumptions about the form of the target function. For example, feedforward networks containing three layers of sigmoid perceptrons are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer [16]. This is in contrast to decision tree learning, where we must make some judgement as to what size of tree should be preferred to others, i.e. what bias we should introduce.
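The kind of feedforward sigmoid network referred to above can be sketched as a forward pass through one hidden layer of sigmoid units into a sigmoid output unit. The weights below are arbitrary illustrative values, not a trained lopinavir model, and the class is my own sketch rather than anything from [2].

```java
public class SigmoidNet {
    public static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Forward pass: inputs -> hidden sigmoid units -> single sigmoid output.
    // wHidden[j][i] is the weight from input i to hidden unit j;
    // wOut[j] is the weight from hidden unit j to the output unit.
    public static double forward(double[] in, double[][] wHidden, double[] wOut) {
        double[] hidden = new double[wHidden.length];
        for (int j = 0; j < wHidden.length; j++) {
            double sum = 0;
            for (int i = 0; i < in.length; i++) sum += wHidden[j][i] * in[i];
            hidden[j] = sigmoid(sum);
        }
        double out = 0;
        for (int j = 0; j < hidden.length; j++) out += wOut[j] * hidden[j];
        return sigmoid(out);
    }
}
```

With nonlinear hidden units, the output is a nonlinear function of the inputs, which is what allows such networks to capture interacting mutations.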

Furthermore, representing the target function using a neural network gives the advantage over decision tree learning that the predicted drug susceptibility is quantitative. In other words, such a neural network can return a predicted fold-change value rather than simply a discrete classification such as resistant or susceptible. In [2] a neural network classification model was constructed from a dataset of phenotype-genotype pairs to predict resistance to the protease inhibitor lopinavir. The performance of the model was determined using a testing set of 177 examples and the results were expressed using the squared linear correlation coefficient R². The linear correlation coefficient R is a number between -1 and 1 that measures how close a set of points lie to a straight line; a value of 1 indicates that all the points fall on a straight line. In other words, it was used in this case to determine how well the predicted and actual fold-change values agree. For their best neural network, a correlation coefficient of 0.88 was obtained.

In order to compare the performance of this network against decision tree learning I obtained an equivalent decision tree classifier to predict lopinavir resistance, using a dataset of phenotype-genotype pairs obtained from the Stanford database. I obtained a relatively simple decision tree classification model that determines lopinavir resistance according to mutations at the positions 10, 82, 71, 72, 46 and 93. With the exception of position 72, all these positions have been previously recognised as being significant for lopinavir resistance [2]. Using an independent testing set consisting of 95 cases, the decision tree had a prediction error of 10.5% and a sensitivity and specificity of 0.92 and 0.86, respectively. In other words, for 89.5% of the testing cases the predicted classification agreed with the true classification.
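The linear correlation coefficient used to score the neural network can be computed from paired predicted and actual fold-change values as follows. This is an illustrative helper of my own, not the code from [2]:

```java
public class Correlation {
    // Pearson linear correlation coefficient R between two equal-length series.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;            // n * covariance
        double vx = sxx - sx * sx / n;             // n * variance of x
        double vy = syy - sy * sy / n;             // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }
}
```

Squaring the result gives the R² figure reported for the network; points lying exactly on a rising straight line give R = 1.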
This result is comparable to the correlation coefficient of 0.88 obtained for the neural network. However, some allowance should be made for differences in the size of the testing sets; in particular, the neural network model was tested against 177 examples, nearly double the number used for the decision tree. The decision tree model nevertheless has the advantage over the neural network model that it is easily interpreted. In other words, experts can easily understand and examine the knowledge portrayed by the decision tree. Using decision trees it is easy to derive a set of corresponding rules: tracing out a path from the root of the tree to a leaf yields a single rule, where the internal nodes of the tree induce the premises of the rule and the leaf determines its conclusion. Such rules can be presented as evidence for a classification.

This is in contrast to neural network models, which act as a black box. A genotype is given as input and a fold-change value is given as output, but no clue as to how the prediction was made is available. One may wish to look inside the black box, but all that will be found is a number of connected units with no real meaningful interpretation. An analogy can be made with looking inside at the workings of the human brain, which is itself made up of millions of similar interconnected units: it is not enough to simply understand how each unit processes signals; rather, we wish to know the knowledge that they portray. In addition, like decision tree learning, neural network learning is prone to overfitting. This occurs as training proceeds because some weights will continue to be updated in order to reduce the error over the training data. There are a number of methods available to help minimise the effects of overfitting; most introduce a bias to prefer neural networks with small weight values, i.e. to avoid learning complex decision surfaces. But compared to decision trees these solutions are not as aesthetically pleasing. Also, in practical terms, neural network training algorithms typically require longer training times than decision tree learning; training times range from a few seconds to many hours, depending on the number of weights to learn in the network and the amount of training examples.

5.1.3 Nearest-Neighbour Models

Nearest-neighbour models have an advantage over decision tree and neural network models in that no explicit representation of the target function needs to be made. In particular, a nearest-neighbour method simply stores the set of training examples and postpones generalising beyond these examples until a new case must be classified. Here we avoid the problems of overfitting and of estimating, once and for all, a target function that embodies the entire population.
In other words, a nearest-neighbour classifier represents the target function by a combination of many local approximations, whereas decision tree and neural network learning must commit at training time to a single global approximation. In this respect, a nearest-neighbour classifier effectively uses a richer hypothesis space than both decision tree and neural network learning. In addition, because all the training data is stored and reused in its entirety, the information that it contains is never lost. The main difficulty in

defining a nearest-neighbour classifier to address the phenotype-prediction problem lies in determining an appropriate distance metric for retrieving similar genotypes. In the scope of this study I derived a nearest-neighbour classifier to predict drug susceptibility from genotype, based on a novel distance metric. The performance of the nearest-neighbour classifiers was assessed, in the same way as the decision tree classification models, using a randomly selected independent test set of genotype cases. In comparison to the decision tree models created in this study, these classifiers fared poorly. In particular, I found prediction errors in the range of 18.0%-46.2%, compared to prediction errors in the range of 6.0%-24.7% for the decision trees. However, given these results we cannot conjecture that all nearest-neighbour methods will fare poorly in the context of the phenotype-prediction problem. This is clear when we consider the commercial success of Virco's VirtualPhenotype, which employs a nearest-neighbour classification scheme. Indeed, some studies have even shown that Virco's prediction scheme is useful as an independent predictor of the clinical outcome of antiretroviral therapy [17].

It is clear from these results that the distance metric used in this study is too naive: it is not able to entirely capture the genetic basis of drug resistance when comparing genotypes. In particular, it does not take into consideration any details of the mutations that are present; rather, it only considers the percentage of differences between two sequences. However, for the protease inhibitors this novel distance metric produced reasonable results. Specifically, I found prediction errors in the range of 18.0%-19.1%, except for nelfinavir, which had a prediction error of 26.4%. This suggests that for these drugs, drug resistance can be characterised in some way by the amount of shared genetic mutations.
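The naive metric discussed here, the percentage of positions at which two aligned sequences differ, combined with a 3-nearest-neighbour majority vote, can be sketched as below. This is an illustrative reconstruction; the identifiers are my own, not the thesis code, and the sequences are assumed to be aligned and of equal length.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NearestNeighbourSketch {
    // Percentage of positions at which two equal-length sequences differ.
    public static double percentDifference(String a, String b) {
        int diffs = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) diffs++;
        return 100.0 * diffs / a.length();
    }

    // A training case: an aligned sequence and its classification ("R" or "S").
    public record Labelled(String sequence, String label) {}

    // Majority vote among the three nearest training cases.
    public static String classify3NN(List<Labelled> train, String query) {
        List<Labelled> sorted = new ArrayList<>(train);
        sorted.sort(Comparator.comparingDouble(
                (Labelled e) -> percentDifference(e.sequence(), query)));
        int resistant = 0;
        for (Labelled e : sorted.subList(0, 3))
            if (e.label().equals("R")) resistant++;
        return resistant >= 2 ? "R" : "S";
    }
}
```

The metric is position-blind by design, which is exactly the weakness noted above: a mismatch at a known resistance position counts no more than one at an irrelevant position.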
A practical disadvantage of nearest-neighbour classifiers is that they are inefficient when classifying a new case, because all the processing is performed at query time rather than in advance, as in decision tree and neural network learners.

5.2 Suggestions For Further Work

5.2.1 Handling Ambiguity Codes

Within this study, ambiguity codes are not handled in an effective manner. An ambiguity code may occur during genotyping when a sample containing a population of HIV-1 sequences is found to contain a number of possible amino acids at a specific sequence position. When this occurs, some mutations may be represented by multiple amino acid codes, representing the detection of more than one amino acid at that sequence position. In this case it is ambiguous which amino acid should be used when modelling the genotype sequence. Within this study, the first amino acid code that is encountered is used for modelling. This is wholly inadequate and would have been improved on if more time had permitted.

5.2.2 Using A Different Attribute Selection Measure

As was mentioned earlier, decision attributes were selected based on maximal information gain. I may have obtained better classification models if I had used a different attribute selection measure that doesn't favour attributes with large numbers of values. A way to avoid this bias is to select decision attributes based on a measure called gain ratio. The gain ratio penalises attributes with a large number of values by incorporating a term called split information, which is sensitive to how uniformly a decision attribute splits the training data:

    split_information(S, A) = - Σ_i (|S_i|/|S|) log2(|S_i|/|S|)

where S_1, ..., S_c are the subsets of examples resulting from partitioning S by the c values of the attribute A. Using this measure, the gain ratio is then calculated as:

    gain_ratio(S, A) = ig(S, A) / split_information(S, A)

If more time had been permitted I would have experimented with this measure and other possible selection measures. Would this have made a big impact on the complexity and/or structure of the decision tree models?
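Under this partition-based definition of split information, gain ratio can be computed from a contingency table of attribute values against classes. A hypothetical sketch (identifiers are illustrative, not from the thesis code); counts[i][c] holds the number of training examples with the i-th value of attribute A and class c:

```java
public class GainRatio {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a class-count vector.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0;
        for (int c : classCounts)
            if (c > 0) { double p = (double) c / total; h -= p * log2(p); }
        return h;
    }

    // split_information(S, A) = -sum_i (|S_i|/|S|) log2(|S_i|/|S|)
    public static double splitInformation(int[][] counts) {
        int total = 0;
        int[] partitionSizes = new int[counts.length];
        for (int i = 0; i < counts.length; i++)
            for (int c : counts[i]) { partitionSizes[i] += c; total += c; }
        double si = 0;
        for (int size : partitionSizes)
            if (size > 0) { double p = (double) size / total; si -= p * log2(p); }
        return si;
    }

    // gain_ratio(S, A) = ig(S, A) / split_information(S, A).
    // Assumes A produces at least two non-empty partitions (split info > 0).
    public static double gainRatio(int[][] counts) {
        int total = 0;
        int[] classTotals = new int[counts[0].length];
        int[] partitionSizes = new int[counts.length];
        for (int i = 0; i < counts.length; i++)
            for (int c = 0; c < counts[i].length; c++) {
                classTotals[c] += counts[i][c];
                partitionSizes[i] += counts[i][c];
                total += counts[i][c];
            }
        double gain = entropy(classTotals);
        for (int i = 0; i < counts.length; i++)
            gain -= (double) partitionSizes[i] / total * entropy(counts[i]);
        return gain / splitInformation(counts);
    }
}
```

An attribute with many values spreads the examples over many partitions, driving split information up and the gain ratio down, which is the intended penalty.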

5.2.3 Handling Missing Information

The ability to effectively handle missing information was treated poorly in this study. A training set may only exhibit a subset of the possible mutations that can occur at a particular sequence position. Therefore, when growing the decision tree using this dataset, only this subset of amino acids will be considered valid at that particular position. This is problematic because the decision tree will fail to recognise the presence of other amino acids at that position. Within this study, if this situation arises at a decision node n, the classification process is halted and a classification of unknown is returned. This is not ideal; we would at least like to take into account the decision knowledge already portrayed by the decision tree before this point. Returning a classification of unknown is worthless. One possible strategy for dealing with this problem is to assign a classification based on the most common classification of the training examples associated with the decision node n. A second, more complex procedure begins by assigning a probability to each of the values for an attribute. These probabilities are based on the observed frequencies of the various values for the attribute among the training examples at node n. Now when we encounter the node n, instead of stopping, we continue down the most probable path and proceed as normal until we reach a classification. If more time had permitted I would have implemented the second of these strategies, in order to maximise the information that a decision tree uses when classifying examples taken from the entire population of HIV-1 cases.

5.2.4 Using A Different Distance Metric

As has already been highlighted, the distance metric used for nearest-neighbour classification was not strong enough to truly judge the similarity of two genotype sequences. If more time had permitted I would have experimented with different distance measures.
In particular, I would have investigated statistical measures based on the comparison of two dot-plots.

5.2.5 Receiver Operating Characteristic Curve

A good way to analyse the performance of a classification model is to plot a receiver operating characteristic (ROC) curve. A ROC curve is commonly used within clinical research to investigate the tradeoff between the sensitivity and specificity of a classification model. The x-axis of a ROC graph corresponds to the specificity of a model, i.e. the ability of the model to identify true negatives. Conversely, the y-axis corresponds to the sensitivity of a model, i.e. how well the model is able to predict true positives. In this way we are able to look more closely at the ability of a classification model to discriminate between drug-resistant and drug-susceptible genotypes: the greater the sensitivity at high specificity values, the better the model. Such an analysis also better facilitates the comparison of two or more classification models.

In order to plot a ROC curve for each of the decision tree models that I created, I would have to introduce a weighting parameter, α, to determine whether an example should be classified as resistant or not. In more detail, for each leaf node in which there is a prediction error (i.e. a leaf created as a result of pruning) we have a probability of an example being classified as resistant. We introduce α at these leaves and say that an example is resistant if and only if the probability of the example being resistant is greater than α. We can then create the ROC curve by varying α from 1 down to 0 and computing the sensitivity and specificity of the model for each value of α. During the scope of this project I had hoped to compare the performance of the classification models by comparing the ROC curves that they produce; however, I was unable to complete this work through lack of time.

5.2.6 Other Machine Learning Approaches

There are many other machine learning strategies available.
For example, by using genetic algorithms we can generate a set of classification rules similar to those produced by decision tree learning. Investigating the possible use of these methods for the phenotype-prediction problem presents a wide scope for future research. In particular, can we get better classification models by using different machine learning approaches, or even better results through the hybridisation of different approaches?

Appendix A
Pre-Processing Software

I developed a simple text parser, using Java, that takes as input a data file from the Stanford HIV Resistance Database and outputs a new data file containing the modelled samples, as previously described. This can be run from the command line using the command

    > java DataParser -i inputfile.txt -o outputfile.txt -g gene

where inputfile.txt is a data file obtained from the Stanford Database and gene is one of reverse transcriptase (rt) or protease (p). The parser reads each line in the original data file separately and interprets the information present in specific columns. This is a reasonable strategy considering the format of the file (tab-delimited); furthermore, future updates to this information will only ever alter the amount of data, not the way it is presented. On interpretation of each line in the original file, a new instance is created that has a unique identification, the fold-change values for each drug and a set of attributes as described previously. Individual instances are then written to the program's output, see below.

Data file from the Stanford HIV Resistance Database → Simple Text Parser → Instances:

SeqId   APV_Fold   ATV_Fold   NFV_Fold   P1  P2  P3  ...  P99
7439    4.3        -          46.1       P   Q   V   ...  F
7443    2.3        -          11.0       P   Q   V   ...  F
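The column-extraction strategy of the parser can be sketched as follows. This is an illustrative fragment, not the DataParser source: the record and the column indices are made up for the example and do not reflect the real layout of the Stanford file.

```java
// Sketch of per-line parsing of a tab-delimited record: split on tabs and
// pick out specific columns. Column indices here are illustrative only.
public class LineParser {

    /** Extracts {seqId, foldChange} from one tab-delimited record. */
    public static String[] parseRecord(String line, int seqIdCol, int foldCol) {
        String[] cols = line.split("\t", -1);   // limit -1 keeps trailing empty columns
        return new String[] { cols[seqIdCol], cols[foldCol] };
    }

    public static void main(String[] args) {
        String record = "7439\t4.3\t-\t46.1";   // SeqId, APV_Fold, ATV_Fold, NFV_Fold
        String[] parsed = parseRecord(record, 0, 3);
        System.out.println(parsed[0] + " NFV fold-change: " + parsed[1]);
    }
}
```

A missing measurement ("-") survives the split as an ordinary column value, so downstream code can decide how to treat it.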

Appendix B

Cultivated Phenotype

I developed a generic machine-learning program, called Cultivated Phenotype, that (1) represents and manipulates a dataset of phenotype-genotype pairs as obtained from the Stanford HIV-1 reverse transcriptase and protease drug-susceptibility datasets; (2) characterises a dataset; (3) constructs a decision tree from a training dataset; (4) prunes a decision tree using a validation dataset; (5) predicts drug susceptibility from genotype using either a decision tree or a 3-nearest-neighbour classifier; (6) displays performance statistics of a single decision tree on a testing dataset; and (7) compares the performance of newly constructed decision trees, hand-coded decision trees and 3-nearest-neighbour classifiers on a testing dataset.

The learning component was developed in Java because it was readily available and did not require any licences. Furthermore, a Java program can easily be converted into a form accessible from the World Wide Web. This is advantageous because it promotes the distribution of the knowledge the program encodes, and this knowledge could ultimately be harnessed by clinicians to help manage HIV-1 infection. In detail, the learning component was developed using an object-oriented methodology and the following classes were defined: CultivatedPhenotype, Example, Attribute, Reference, DrugWindow, AlterExperience, DataCharacteristics, Tree, BeerenwinkelModels, NearestNeighbour, QueryWindow, PerformanceWindow, Data, and Graph. They are summarised as follows:

CultivatedPhenotype: This is the main program thread; all subsequent functionality stems from it. It defines four tables: the first containing the complete dataset of phenotype-genotype examples; the second containing a subset of examples used for training; the third a subset used for validation; and the fourth a subset used for testing. Functionality realised by the procedure: void loadfile(string filename).

DrugWindow: Defines a number of antiretroviral drugs and associated fold-change thresholds. Provides functionality to filter a dataset according to a particular drug, retaining only information related to that drug. Functionality realised by the procedures: void filterdrug(string antiretroviral), void setthreshold(double foldvalue).

Example: Defines a single phenotype-genotype pair. In particular, an Example contains a fold-change value, a sequence identification and a set of Attributes. In addition, each Example contains a feature vector constructed from its set of Attributes and a Reference. Includes functionality to compute the similarity of this Example (feature vector) to another. Functionality realised by the procedures: boolean isresistant(), string getattribute(int index), void computecomparisionarray() and int distancefromthissequence(char[] othercomparisionarray).

AlterExperience: Defines the training, validation and test sets. Provides functionality to randomly set aside a percentage of the entire dataset for testing. Functionality realised by the procedure: void setanddisplaytrainingtestsets(int percentage).

DataCharacteristics: Describes a number of properties of the training, validation and testing datasets. For example, it determines the total number of examples, the number of examples classified as resistant and the distribution of fold-change values. Functionality realised by the procedure: string getdatacharacteristics().
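The split performed by AlterExperience can be sketched as follows. This is a minimal illustration under my own naming, not the program's code: it sets aside a percentage of the data for testing and then carves 20% of the remaining training data off for validation, sampling without replacement.

```java
// Sketch of a train/validation/test split: shuffle indices (sampling without
// replacement), reserve testPercent for testing, then 20% of the rest for validation.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DatasetSplit {

    /** Returns {training, validation, testing} index lists for n examples. */
    public static List<List<Integer>> split(int n, int testPercent, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));   // each example equally likely

        int testSize = n * testPercent / 100;
        List<Integer> testing = new ArrayList<>(indices.subList(0, testSize));
        List<Integer> rest = indices.subList(testSize, n);
        int valSize = rest.size() * 20 / 100;             // 20% of the training set
        List<Integer> validation = new ArrayList<>(rest.subList(0, valSize));
        List<Integer> training = new ArrayList<>(rest.subList(valSize, rest.size()));

        List<List<Integer>> out = new ArrayList<>();
        out.add(training); out.add(validation); out.add(testing);
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = split(100, 20, 42L);
        System.out.println(parts.get(0).size() + " train, "
                + parts.get(1).size() + " validation, " + parts.get(2).size() + " test");
    }
}
```

With 100 examples and 20% held out, this yields 64 training, 16 validation and 20 testing examples.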

Tree: Defines the gross structure of a decision tree. Includes functionality to grow a new decision tree using the ID3 algorithm, a set of examples and a set of attributes; to self-prune using reduced-error pruning and a set of validation examples; and to return a drug-susceptibility classification given a genotype sequence. Functionality realised by the procedures: void addbranch(tree subtree, string label), Tree id3(object[] examples, vector attributes), Attribute getbestattributegain(), double getentropy(object[] examples), void prune(object[] examples), string querytree(char[] sequence).

BeerenwinkelModels: Defines a number of Tree objects corresponding to the decision tree classifiers presented in [1].

NearestNeighbour: Defines a 3-nearest-neighbour classifier that imposes an ordering on the training and validation datasets according to a similarity measure defined on Example. Includes functionality to return a drug-susceptibility classification given a genotype sequence. Functionality realised by the procedures: string querynearestneighbour(example queryexample), string getclassification().

QueryWindow: Provides the ability to query a decision tree or nearest-neighbour classifier using either a mutation list or a nucleotide sequence. Includes functionality to obtain a classification and to output a drug-susceptibility classification together with an explanation of it. Functionality realised by the procedures: char[] getquerysequence(), char translatecodon(string codon).

Data: Defines a number of properties regarding the performance of a classifier on a testing dataset. For example, a Data object stores the total number of examples in a testing set; the number of examples correctly classified as resistant; the number incorrectly classified as resistant; the number correctly classified as susceptible; and the number incorrectly classified as susceptible. Includes functionality to compute the sensitivity, specificity, positive prediction value, negative prediction value, positive likelihood ratio and negative likelihood ratio.
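The statistics computed from the four confusion-matrix counts can be sketched as follows. This is an illustrative class under my own naming, not the Data class itself; it shows the standard definitions of each measure.

```java
// Sketch of classifier performance statistics derived from the four
// confusion-matrix counts (true/false positives and negatives).
public class PerformanceStats {
    final int tp, fp, tn, fn;

    public PerformanceStats(int tp, int fp, int tn, int fn) {
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    public double sensitivity()             { return tp / (double) (tp + fn); }
    public double specificity()             { return tn / (double) (tn + fp); }
    public double positivePredictionValue() { return tp / (double) (tp + fp); }
    public double negativePredictionValue() { return tn / (double) (tn + fn); }
    public double positiveLikelihoodRatio() { return sensitivity() / (1.0 - specificity()); }
    public double negativeLikelihoodRatio() { return (1.0 - sensitivity()) / specificity(); }
    public double percentCorrect()          { return 100.0 * (tp + tn) / (tp + fp + tn + fn); }

    public static void main(String[] args) {
        // Illustrative counts: 40 true positives, 5 false positives,
        // 45 true negatives, 10 false negatives.
        PerformanceStats s = new PerformanceStats(40, 5, 45, 10);
        System.out.printf("sens=%.2f spec=%.2f correct=%.1f%%%n",
                s.sensitivity(), s.specificity(), s.percentCorrect());
    }
}
```

With the counts above, sensitivity is 40/50 = 0.80, specificity is 45/50 = 0.90 and 85% of examples are correctly classified.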

Functionality realised by the procedures: double getsensitivity(), double getspecificity(), double getpositivelikelihoodratio(), double getnegativelikelihoodratio(), double getpositivepredictionvalue(), double getnegativepredictionvalue(), double getpercentagecorrectlyclassified().

PerformanceWindow: Displays the information stored in a Data object.

Graph: Plots the results of a number of experimental runs. Includes functionality to create an independent testing dataset. The remaining examples are then sampled to create a variety of learning experiences, and for each learning experience a new decision tree and nearest-neighbour classifier is constructed. The performance of each new decision tree and nearest-neighbour classifier is recorded on the testing dataset; in addition, the performance of a BeerenwinkelModel is computed on the same testing dataset. The performance of each model is plotted for each training experience. Functionality realised by the procedure: vector testdtree().

Each class above encompasses a number of important procedures. The role of these procedures is described below.

void loadfile(string filename): Reads a data file as constructed by the text parser described in Appendix A. Creates a table of phenotype-genotype pairs.

void filterdrug(string antiretroviral): Removes from the entire dataset of phenotype-genotype pairs the fold-change values associated with every drug except antiretroviral.

void setthreshold(double foldvalue): Initialises the fold-change threshold used to discriminate resistant from susceptible samples.

boolean isresistant(): Compares the fold-change value of the Example with the fold-change threshold. Returns true (resistant) if the fold-change value exceeds the threshold and false (susceptible) otherwise.

string getattribute(int i): Given an index i, returns the value of the ith attribute; in other words, returns the amino acid present at position i.

void computecomparisionarray(): Creates a feature vector for the Example. In particular, retrieves a reference sequence and compares it to the attribute values of the Example. For positions in which there was no change in amino acid, the feature vector is given a dummy value; for the other positions, the feature vector takes the attribute value.

int distancefromthissequence(char[] ot): Given a feature vector, ot, returns a score representing the distance between the feature vector of this Example and ot. The distance is computed as the sum of two factors: for positions in which both feature vectors contain non-dummy values, the percentage of these that differ; plus the percentage of the remaining positions that differ.

void setanddisplaytrainingtestsets(int i): Given a percentage of the entire dataset to allocate for testing, i, randomly samples the entire dataset to construct a training and a testing dataset. Furthermore, 20% of the training set is then randomly sampled to create a validation dataset. Examples are selected using a random number generator (without replacement), so each Example is equally likely to be picked.

string getdatacharacteristics(): Given the training, validation and testing datasets, outputs a description of their characteristics. In particular, it computes the total number of Examples in each dataset; the number of Examples currently classified as resistant; the sequences in each dataset with their phenotypes; and the distribution of phenotype values within each dataset.

void addbranch(tree subtree, string label): Creates a new branch from a tree node (amino-acid position) with a label (amino acid) and a subtree.

Tree id3(object[] examples, vector ats): Given a set of phenotype-genotype pairs, examples, and a set of attributes, ats, constructs a decision tree according to the ID3 algorithm.

Attribute getbestattributegain(): Using a set of phenotype-genotype pairs and a set of attributes, returns the attribute with maximal information gain.

double getentropy(object[] examples): Given a set of phenotype-genotype pairs, examples, computes the entropy of the dataset, a measure of its (im)purity.

void prune(object[] examples): Given a dataset of phenotype-genotype pairs, examples, considers each node in the tree for pruning. In particular, implements the reduced-error pruning strategy.

string querytree(char[] sequence): Given an unseen genotype sequence, sequence, obtains a drug-susceptibility classification by sorting the example down through the tree. In particular, each node in the tree tests a specific amino-acid position for certain amino acids.

string querynearestneighbour(example q): Given an unseen genotype Example, imposes an ordering on the Examples in the training and validation datasets as dictated by the distance measure computed by int distancefromthissequence(char[] ot).

string getclassification(): Provided that the Examples in the training and validation datasets are ordered, selects the three Examples closest to an unseen genotype and returns the majority drug-susceptibility classification.

char[] getquerysequence(): Constructs a genotype sequence from a set of mutations.

char translatecodon(string codon): Given a sequence of three nucleotides, codon, uses the genetic code to translate it into an amino acid.

double getsensitivity(): Given the number of true positives, tp, and false negatives, fn, returns tp / (tp + fn).

double getspecificity(): Given the number of false positives, fp, and true negatives, tn, returns tn / (fp + tn).

double getpositivelikelihoodratio(): Returns sensitivity / (1 - specificity).

double getnegativelikelihoodratio(): Returns (1 - sensitivity) / specificity.

double getpositivepredictionvalue(): Given the number of true positives, tp, and false positives, fp, returns tp / (tp + fp).

double getnegativepredictionvalue(): Given the number of true negatives, tn, and false negatives, fn, returns tn / (tn + fn).

double getpercentagecorrectlyclassified(): Given the total number of examples tested, tot, the number of true positives, tp, and the number of true negatives, tn, returns ((tp + tn) / tot) * 100.

vector testdtree(): Creates a testing set of phenotype-genotype pairs using 20% of the entire dataset. Constructs 20 different training experiences by randomly sampling the remaining data, and creates a new decision tree classifier for each learning experience. For each training experience, the performance over the entire testing set of each newly constructed decision tree, a Beerenwinkel model and the k-nearest-neighbour classifier is plotted.
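The entropy and information-gain computations at the heart of ID3 attribute selection can be sketched as follows for the two-class (resistant / susceptible) case. This is a minimal illustration with my own method names, not the Tree class's implementation.

```java
// Sketch of ID3's attribute selection criterion: binary entropy of a
// resistant/susceptible dataset, and the information gain of a candidate split.
public class Id3Entropy {

    /** Binary entropy (in bits) of a set with pos resistant and neg susceptible examples. */
    public static double entropy(int pos, int neg) {
        double total = pos + neg;
        double e = 0.0;
        for (int count : new int[] { pos, neg }) {
            if (count > 0) {
                double p = count / total;
                e -= p * (Math.log(p) / Math.log(2));   // log base 2
            }
        }
        return e;
    }

    /** Information gain of a split partitioning (pos, neg) into {pos_i, neg_i} subsets. */
    public static double gain(int pos, int neg, int[][] subsets) {
        double total = pos + neg;
        double remainder = 0.0;
        for (int[] s : subsets) {
            // weight each subset's entropy by the fraction of examples it holds
            remainder += ((s[0] + s[1]) / total) * entropy(s[0], s[1]);
        }
        return entropy(pos, neg) - remainder;
    }

    public static void main(String[] args) {
        // A perfectly separating attribute recovers the full parent entropy (1 bit here).
        System.out.println(gain(5, 5, new int[][] { { 5, 0 }, { 0, 5 } }));
    }
}
```

ID3 evaluates every remaining attribute this way and branches on the one with maximal gain, which is what getbestattributegain() returns.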

Appendix C The Complete Datasets Given below is a list of the sequences (given as sequence id s used in the Stanford HIV resistance database) along with the fold-change values that were associated with each drug. I used the following sequences from the Stanford HIV-1 protease drug susceptibility data set: Instance_Id Seq_Id APV_Fold ATV_Fold NFV_Fold RTV_Fold SQV_Fold LPV_Fold IDV_Fold i1 7439 4.3-46.1 34.1 47.4-20.0 i2 7443 2.3-11.0 62.2 574.2-12.0 i3 7459 - - - 4.0 15.0-3.0 i4 7460 - - - 23.0 1.0-7.0 i5 7461 - - - 120.0 8.0-21.0 i6 7462 - - - 20.0 37.0-8.0 i7 7463 - - - 175.0 580.0-100.0 i8 7464 - - - 8.0 1.0-18.0 i9 7465 - - - 8.0 1.0-15.0 i10 7466 - - - 3.0 21.0-4.0 i11 7467 - - - 23.0 121.0-45.0 i12 8084 8.3-161.1 170.2 100.0-100.0 i13 8085 1.6-3.1 7.4 0.5-3.3 i14 41692 14.9-42.2 104.8 41.0-22.3 i15 8090 2.3-23.4 51.6 37.8-32.7 i16 8093 2.3-37.7 22.2 56.4-22.8 i17 8113 - - 4.9 20.0 26.0 10.0 5.6 i18 40595 - - 1.3 0.8 0.7 0.7 0.9 i19 8253 5.3-16.4 24.5 23.8 5.3 8.8 i20 8493 0.1-2.9 0.6 0.6 0.5 1.0 i21 8637 - - 11.0 0.8 1.1 0.8 3.4 i22 8652 - - 54.0-7.0 - - i23 8654 - - 7.0-1.0 - - i24 8658 - - 4.0-1.0 - - i25 8660 - - 37.0-5.0 - - i26 8666 - - 2.0-1.0 - - i27 8670 - - 4.0-2.0 - - i28 8672 - - 1.0-1.0 - - i29 8674 - - 1.0-1.0 - - i30 9431 - - 12.0 1.3 0.9 0.8 2.8 i31 9490 - - 73.0 1.8 2.7 0.9 1.8 i32 9513 - - 34.0 0.7 0.6 0.5 1.0 i33 41885 - - 2.0 1.3 0.8 0.3 0.9 i34 9556 - - 3.0 1.6 1.0 1.2 1.2 i35 2613 - - 24.0 1.0 0.4-1.0 i36 2615 - - 19.0 115.0 4.0-14.0 i37 9706 - - 0.4 0.3 0.4 0.3 0.3 i38 9733 5.1-34.7 77.0 8.9 33.5 22.8 i39 9742 - - 14.0 0.3 0.3 0.4 0.4 i40 10191 - - 1.6 1.3 0.9 0.7 1.0 i41 10271 - - 59.0 2.3 2.5 1.4 1.4 i42 10286 - - 1.8 1.6 0.9 1.4 1.2 i43 10294 2.0-72.0 73.0 27.0 20.0 24.0 73

i44 10436 - - 100.0 90.0 58.0 24.0 48.0 i45 10539 0.4-0.7 0.8 0.8 0.6 0.6 i46 10558 - - 28.0 41.0 4.2 22.0 15.0 i47 10560 - - 2.6 9.0 1.2 1.2 1.4 i48 10574 - - 5.4 2.1 1.2 0.3 1.3 i49 10855 - - 8.0 4.4 1.7 0.9 3.5 i50 11224 - - 9.2 3.4 3.9 2.2 1.9 i51 11954 6.1-19.0 5.5 3.8-9.3 i52 11958 3.9-24.0 10.0 3.2-4.8 i53 2867 - - 33.0 119.0 9.0-24.0 i54 12377 - - 2.2 1.0 1.1 0.8 1.6 i55 12400 1.1-1.2 1.3 0.7-0.6 i56 12650 0.5-0.8 0.5 0.6-0.6 i57 12676 4.0-34.0 5.5 5.9-16.0 i58 12679 5.3-3.5 3.3 1.3-12.0 i59 40765 - - 47.0 1.3 2.3 0.9 1.5 i60 12861 0.4-7.1 0.4 0.5-0.5 i61 12862 0.8-24.7 1.0 0.9-1.2 i62 12863 3.0-2.2 16.1 1.0-2.8 i63 12864 4.4-3.6 22.4 1.7-3.9 i64 12865 3.6-6.2 12.0 9.0-3.6 i65 12866 4.5-11.5 11.3 4.9-4.2 i66 12867 2.3-3.0 4.4 12.9-3.7 i67 13255 - - 2.3 0.3 0.4-1.5 i68 13256 0.1-5.1 0.7 1.4-2.5 i69 13257 0.1-8.1 0.8 1.1-2.3 i70 13258 - - 12.8 1.3 1.2-3.0 i71 13259 0.1-12.2 1.3 1.4-2.9 i72 13260 0.3-0.4 0.4 0.3-0.4 i73 13261 1.0-1.4 1.2 1.0-1.1 i74 13262 0.9-2.4 1.7 1.2-1.7 i75 12928 3.2-4.0 12.1 3.1-6.4 i76 12930 5.2-34.7 100.0 91.0-20.3 i77 12932 2.2-3.4 5.9 0.8-4.4 i78 12936 0.4-11.1 0.9 0.4-0.9 i79 12938 2.3-2.0 2.1 1.2-1.2 i80 12940 0.4-2.3 1.8 0.9-0.9 i81 12942 10.8-121.7 203.2 29.1-32.2 i82 12944 13.9-97.2 203.8 37.2-33.2 i83 13235 - - 8.9 1.7 0.7-2.1 i84 13236 - - 21.0 1.5 0.2-6.1 i85 13237 0.1-11.0 1.2 1.5-3.1 i86 13238 0.1-15.8 3.3 1.2-1.4 i87 13239 0.1-29.9 1.4 1.8-6.3 i88 13240 0.1-16.7 1.8 1.4-5.4 i89 13241 0.1-24.6 3.3 1.2-7.6 i90 13242 0.2-18.6 1.9 1.8-7.5 i91 13244 0.2-15.1 1.5 2.0-5.5 i92 13247 0.3-1.6 8.1 0.3-1.0 i93 13248 0.3-63.0 6.2 8.9-13.5 i94 13250 0.3-3.5 4.3 0.6-1.1 i95 13252 0.3-27.6 0.6 0.4-1.2 i96 13254 0.3-29.1 1.1 1.1-1.3 i97 15492 12.0 0.5 1.0 8.0 2.0-1.0 i98 15493 14.0 0.5 1.0 3.0 2.0-1.0 i99 15494 16.0 0.6 1.0 2.0 0.1-2.0 i100 15495 0.4 0.6 4.0 16.0 1.0-3.0 i101 15496 0.6 0.6 2.0 5.3 0.6-1.0 i102 15497 8.0 0.6 2.0 7.0 1.0-1.0 i103 15500 0.3 1.0 32.0 0.7 1.0-0.5 i104 15501 1.0 1.0 1.0 2.0 17.0-1.0 i105 15502 0.3 1.0 27.0 0.4 
1.0-0.3 i106 15504 2.0 2.0 43.0 1.0 0.6-2.0 i107 15505 20.0-8.0 7.0 4.0-27.0 i108 15506 2.0 2.0 7.0 5.0 3.0-3.0 i109 15507 6.0 2.0 8.0 15.0 6.0-3.0 i110 15508 31.0 2.0 4.0 8.0 4.0-2.0 i111 15511 0.2 2.0 4.0 4.0 30.0-1.0 i112 15512 1.0 3.0 50.0 2.0 2.0-2.0 i113 15513 4.0 3.0 8.0 40.0 2.0-7.0 i114 15514 1.0 3.0 27.0 4.0 3.0-10.0 i115 15515 0.4 3.0 3.0 2.0 14.0-1.0 i116 15516 11.0 3.0 4.0 22.0 6.0-2.0 i117 15517 1.0 4.0 11.0 58.0 3.0-6.0 i118 15518 2.0 4.0 13.0 15.0 4.0-2.0 i119 15519 5.0 4.0 14.0 10.0 4.0-10.0 i120 15520 3.0 4.0 23.0 47.0 3.0-14.0 i121 15521 0.5 4.0 17.0 23.0 1.0-34.0 i122 15523-4.0 2.0 62.0 2.0-10.0 i123 15524 8.0 5.0-134.0 1.0-9.0 i124 15525 1.0 5.0 65.0 3.0 3.0-2.0 i125 15526 3.0 5.0 18.0 93.0 4.0-8.0 i126 15527 3.0 5.0 77.0 2.0 3.0-4.0 i127 15528 4.0 5.0 7.0 12.0 1.0-17.0 i128 15529 4.0 5.0 7.0 105.0 - - 4.0 i129 15530 3.0 6.0 15.0 33.0 4.0-13.0 i130 15531 4.0 6.0 19.0 40.0 4.0-16.0 i131 15532 4.0 6.0 17.0 33.0 3.0-15.0 i132 15533 7.0 8.0 10.0 37.0 4.0-10.0 i133 15534 5.0 9.0 8.0 18.0 11.0-11.0 i134 15535 4.0 10.0 21.0 17.0 23.0-13.0 i135 15536 5.0 12.0 17.0 11.0 11.0-17.0 i136 15537 0.1 14.0 13.0 2.0 1.0-2.0 i137 15538 7.0 14.0 35.0 85.0 9.0-28.0 i138 15539 8.0 16.0 13.0 51.0 9.0-19.0 i139 15540 13.0 22.0 41.0 34.0 82.0-35.0 i140 15542 8.0 25.0 36.0 203.0 43.0-32.0 Appendix C. The Complete Datasets 74

i141 15543 12.0 27.0 52.0 63.0 50.0-50.0 i142 15544 2.0 29.0 37.0 77.0 14.0-41.0 i143 15545 9.0 30.0 106.0 17.0 33.0-62.0 i144 15546 10.0 35.0 20.0 105.0 31.0-23.0 i145 15548 30.0 39.0 36.0 203.0 43.0-26.0 i146 15549 26.0 39.0 36.0 164.0 43.0-56.0 i147 15550 29.0 39.0 36.0 33.0 14.0-38.0 i148 15551 3.0 41.0 479.0 54.0 118.0-17.0 i149 15552 7.0 41.0 36.0 135.0 4.0-21.0 i150 15553 3.0 82.0 101.0 90.0 40.0-78.0 i151 15509 1.0 2.0 5.0 14.0 1.0-4.0 i152 16034 - - 12.0 6.5 6.5-6.1 i153 16397 - - 1.8 0.9 0.8 0.8 0.9 i154 25471 - - 34.5 90.9 54.4-20.5 i155 25498 - - 132.0 236.0 123.0-52.3 i156 25472 - - 229.0 94.4 51.8-79.4 i157 25473 - - 180.0 179.0 9.2-43.9 i158 25475 - - 12.3 47.8 1.5-5.4 i159 25476 - - 35.3 110.0 1.8-21.0 i160 25503 - - 15.6 74.4 1.5-5.9 i161 25477 - - 1646.0 40.6 34.0-25.2 i162 25504 - - 1646.0 72.3 60.3-116.0 i163 25478 - - 31.2 513.0 52.5-16.8 i164 25479 - - 422.0 846.0 153.0-9.6 i165 25506 - - 359.0 26.6 10.0-21.0 i166 25480 - - 7.4 75.7 49.4-5.2 i167 25508 - - 735.0 846.0 23.1-52.2 i168 25484 - - 97.4 71.5 54.1-4.6 i169 26090 0.5-4.0 1.6 0.9-1.6 i170 26092 0.9-7.6 3.3 2.0-3.1 i171 26094 7.6-34.3 127.0 3.6-35.9 i172 26096 2.5-20.8 6.1 4.3-9.4 i173 26117 4.4-3.0 1.9 0.6-12.5 i174 26121 3.4-4.3 8.0 0.3-2.7 i175 26122 2.3-11.5 22.1 0.9-8.3 i176 26124 2.1-8.9 49.8 1.0-6.5 i177 26100 10.2-65.8 128.0 12.3-50.7 i178 26078 0.9-27.1 47.1 4.7-24.0 i179 26079 1.0-28.8 48.9 5.1-28.1 i180 26080 1.2-30.8 49.5 6.5-22.9 i181 26105 3.2-9.3 4.2 2.6-7.5 i182 26108 2.8-7.7 15.2 6.5-7.7 i183 26110 2.9-7.5 17.3 11.0-6.6 i184 26081 1.1-1.7 2.2 1.0-2.2 i185 26082 1.1-3.4 2.3 1.5-3.7 i186 26112 0.3-1.0 0.6 0.9-0.9 i187 26113 1.5-1.6 1.3 1.1-1.1 i188 26084 9.1-7.2 9.2 2.9-17.2 i189 40789 1.3-34.0 22.0 60.0-5.0 i190 26899 0.5-0.6 0.5 0.7 0.5 0.5 i191 26903 0.8-1.4 0.8 0.7-1.0 i192 27084 12.0-102.0 105.0 186.0-12.0 i193 27086 - - 3.9-1.5-9.6 i194 27090 - - 4.1-0.3-2.4 i195 27093 - - 16.8-1.4-20.0 i196 27095 - - 6.4-0.7-4.5 i197 27091 - - 3.7-0.7-2.5 i198 27094 - - - 1.4 
1.5-0.5 i199 27092 - - 23.0-4.8-25.2 i200 27087 - - 27.4-4.8-34.6 i201 27096 - - 33.3-3.8-25.8 i202 27088 - - 37.0-4.4-15.8 i203 27089 - - 49.8-7.2-34.5 i204 27441 16.0-1.4 5.0 0.9 2.4 0.4 i205 27442 4.9-2.1 0.8 0.6 0.8 0.6 i206 27443 5.9-3.3 3.8 1.0 2.9 1.0 i207 27444 - - 1.1 0.5 0.1 0.4 1.4 i208 27445 4.5-0.8 1.6 0.9 4.5 1.0 i209 27447 9.6-1.4 8.3 0.8 2.0 0.5 i210 27448 10.3-1.3 4.6 0.6 3.2 1.1 i211 27449 5.0-3.7 3.6 0.5 5.7 2.0 i212 27450 3.4-2.3 2.6 0.8 1.4 0.8 i213 27453 2.0-6.2 2.0 0.9 1.8 1.7 i214 27454 7.6-5.9 1.8 0.7 1.8 2.1 i215 27455 0.7-0.5 0.7 1.2 0.7 0.6 i216 27456 5.3-1.0 2.5 1.3 4.0 1.3 i217 27457 4.0-2.4 0.9 0.7 1.2 0.5 i218 27459 2.0-1.5 3.9 0.3 0.9 1.7 i219 27460 7.0-2.0 2.0 0.4 4.5 1.2 i220 27461 2.6-1.1 0.7 0.6 2.9 1.0 i221 27462 5.4-2.3 1.8 0.5 2.8 1.5 i222 27463 5.9-2.1 1.3 0.3 3.4 1.0 i223 27465 1.4-7.4 12.1 1.5 5.2 10.2 i224 27468 3.4-33.5 22.6 32.4 2.5 12.6 i225 27471 0.6-54.7 4.5 3.4 0.7 15.7 i226 27474 - - 7.2 21.6 0.9 2.3 7.1 i227 27476 1.3-2.3 5.0 1.8 2.8 1.8 i228 27466 4.6-17.8 101.9 1.9 20.3 15.6 i229 27467 2.5-27.4 37.9 2.4 33.6 9.3 i230 27469 5.5-20.5 15.4 16.3 3.7 8.2 i231 27470 5.4-30.1 19.5 29.8 9.4 10.6 i232 27472 2.9-1.6 16.2 0.6 46.3 8.7 i233 27473 6.6-8.5 26.6 0.4 78.7 6.8 i234 27475 1.0-17.2 33.7 0.8 24.6 6.8 i235 27477 8.5-205.0 260.9 52.0 99.0 156.0 i236 28212 - - - - - - - i237 28218 3.6-45.0 5.9 2.6 4.3 15.0 Appendix C. The Complete Datasets 75

i238 28231 64.0-29.0 178.0 13.0-14.0 i239 28222 0.8-54.0 2.1 1.6 1.0 1.3 i240 28215 1.1-1.1 0.9 1.0-1.0 i241 28230 0.6-1.1 0.8 0.9-1.0 i242 28229 0.5-0.7 0.6 0.7-0.7 i243 28217 0.7-1.2 0.9 0.8 0.9 0.9 i244 28216 0.6-0.8 0.5 0.6-0.7 i245 28226 - - - - - 0.4 - i246 29039 1.9-160.4 8.2 9.6-2.8 i247 29040 1.3-74.0 4.0 7.1-3.2 i248 29041 1.1-124.0 3.6 4.4-2.6 i249 29042 2.0-57.0 3.4 9.3-5.3 i250 29043 11.4-109.0 4.7 6.1-1.1 i251 29044 3.7-171.4 5.7 38.1-3.9 i252 29045 0.4-32.8 2.1 3.7-1.3 i253 29046 2.7-140.1 10.2 21.0-5.2 i254 29047 1.5-219.0 17.0 24.0-5.8 i255 29048 2.3-550.0 35.0 72.0-8.4 i256 29049 1.2-47.0 2.5 5.0-1.7 i257 29050 1.0-67.0 3.1 3.9-2.3 i258 29051 28.0-550.0 31.0 45.0-6.8 i259 29052 1.3-74.0 3.3 5.2-3.5 i260 29053 2.2-140.0 27.0 46.0-12.0 i261 29054 130.0-550.0 275.0 290.0-25.0 i262 38708 17.0-96.0 120.0 2.4 67.0 51.0 i263 38710 - - 10.7 0.1 0.1 0.7 0.3 i264 38712 0.6-1.2 0.8 0.5 0.6 0.8 i265 38714 0.8-1.1 1.4 1.1 1.1 1.1 i266 38716 - - 1.2 0.5 0.8 0.7 0.6 i267 38718 - - 17.3 0.3 1.4 1.6 9.7 i268 38720 0.7-6.2 18.0 1.6 3.4 11.0 i269 38722 2.2-43.0 4.4 6.7 3.7 17.0 i270 38724 - - 31.4 0.6 1.6 0.8 1.6 i271 38726 - - 25.3 41.4 2.9 15.3 11.8 i272 38728 - - 4.2 17.3 0.4 4.6 5.8 i273 38730 4.0-54.0 148.0 27.0 34.0 22.0 i274 38732 - - 14.8 0.3 0.3 0.7 0.8 i275 38734 - - 6.9 3.1 0.7 14.0 18.6 i276 38736 - - 8.4 7.6 1.1 7.6 6.9 i277 38738 0.5-1.1 0.8 0.6 0.7 0.9 i278 38740 - - 1.2 1.3 1.4 0.8 1.5 i279 38742 7.0-31.0 32.0 4.0 14.0 25.0 i280 38744 - - 3.6 2.3 0.7 0.7 1.1 i281 38746 - - 3.5 2.1 0.6 1.2 1.7 i282 38748 0.6-21.0 0.5 0.4 0.7 1.3 i283 38750 - - 13.5 19.5 0.3 6.2 11.6 i284 38752 0.8-1.3 1.2 1.5 0.8 0.9 i285 38756 0.7-1.5 1.1 1.0 0.9 1.2 i286 38758 0.9-1.4 1.3 0.8 1.1 1.1 i287 38760 - - 1.3 1.4 8.9 0.7 2.1 i288 38762 0.6-54.2 4.5 3.4 0.7 15.7 i289 38766 0.7-6.7 30.0 0.4 4.2 3.4 i290 38768 2.0-53.0 84.0 7.6 22.0 20.0 i291 38770 - - 1.1 1.3 0.7 1.0 1.5 i292 38772 2.8-23.0 48.0 4.7 16.0 11.0 i293 38774 3.1-39.0 77.0 5.9 18.0 14.0 i294 38776 - - 2.0 0.8 0.8 
0.7 2.0 i295 38778 - - 2.7 27.1 2.8 4.6 7.4 i296 38780 - - 54.2 0.5 1.9 0.7 1.7 i297 38782 12.0-31.0 108.0 5.9 41.0 40.0 i298 38784 - - 1.7 1.1 0.8 0.9 0.9 i299 38786 0.6-1.5 0.9 0.5 0.5 0.9 i300 38788 1.3-34.0 74.0 181.0 20.0 26.0 i301 38790 - - 7.3 16.3 24.4 2.3 6.0 i302 38792 2.7-30.0 112.0 7.4 12.0 11.0 i303 38794 3.5-7.0 78.0 14.0 13.0 7.7 i304 38796 - - 6.0 2.7 0.1 4.2 9.8 i305 38798 - - 2.3 0.8 1.2 0.7 0.8 i306 38800 8.2-53.0 92.0 9.1 64.0 41.0 i307 38802 - - 1.8 7.3 0.4 1.3 1.7 i308 38804 - - 1.9 5.7 1.0 1.6 3.0 i309 38806 1.6-19.0 45.0 1.8 17.0 19.0 i310 38808 1.6-45.0 5.4 10.0 1.9 11.0 i311 38810 - - 7.7 20.3 1.2 8.4 4.6 i312 38812 3.2-96.0 315.8 265.0 96.0 16.0 i313 38816 - - 2.2 1.0 0.9 0.9 0.4 i314 38818 2.7-26.0 20.0 18.0 3.2 8.4 i315 38820 3.3-33.5 22.6 32.4 2.5 12.6 i316 38822 1.4-4.4 3.1 0.9 1.7 2.8 i317 38824 - - 36.4 1.1 1.6 1.2 1.9 i318 38826 1.4-10.0 4.6 0.9 1.1 4.6 i319 38828 5.8-6.1 45.0 1.3 7.0 7.4 i320 38834 - - 42.9 17.7 38.5 2.7 10.9 i321 38836 3.5-35.0 44.0 31.0 6.4 11.0 i322 38838 - - 49.7 47.0 4.6 26.0 26.0 i323 38840 - - 3.6 2.7 1.8 1.2 1.7 i324 38842 - - 0.8 0.7 0.1 0.7 1.2 i325 38844 1.8-15.0 10.0 6.2 2.3 3.7 i326 38846 2.8-27.0 57.0 8.8 16.0 15.0 i327 38848 - - 11.6 1.2 1.7 0.7 4.1 i328 38850 1.0-7.4 12.1 1.5 5.2 10.2 i329 38856 3.6-27.0 28.0 23.0 6.1 8.1 i330 38858 - - 9.9 8.9 3.1 2.7 4.0 i331 38860 3.6-21.0 75.0 6.2 44.0 18.0 i332 38862 0.8-1.9 1.5 1.1 0.8 1.1 i333 38864 5.2-58.0 109.0 155.0 39.0 73.0 i334 38866 0.7-1.5 1.1 0.9 0.8 1.0 Appendix C. The Complete Datasets 76

i335 38870 - - 20.4 0.1 0.2 0.7 0.5 i336 38872 6.0-43.0 79.0 7.1 33.0 22.0 i337 38874 1.2-3.1 3.5 1.2 1.3 1.5 i338 38876 - - 7.2 21.6 0.9 2.3 7.1 i339 38880 - - 1.1 1.1 0.4 0.8 0.9 i340 38882 11.0-131.0 129.0 545.5 34.0 58.0 i341 38884 29.0-23.0 48.0 24.0 4.2 6.6 i342 38890 - - 8.9 10.2 13.8 2.5 7.6 i343 38892 1.9-10.0 1.9 2.5 1.7 6.5 i344 38894 - - 0.8 1.0 0.7 0.7 0.8 i345 38896 - - 54.2 0.9 2.1 0.8 2.5 i346 38900 6.8-121.0 100.0 461.5 63.0 171.4 i347 38902 - - 20.9 0.5 0.5 0.7 0.8 i348 38904 - - 7.7 26.8 0.9 2.4 4.6 i349 38906 - - 20.0 8.6 0.5 4.5 20.3 i350 38908 0.7-1.7 1.1 0.7 0.7 1.0 i351 38910 1.9-11.0 28.0 4.6 14.0 10.0 i352 38912 - - 42.8 1.8 1.5 1.1 8.2 i353 38914 - - 19.5 6.3 2.6 1.4 7.6 i354 38916 - - 1.2 0.8 0.8 1.0 1.0 i355 38922 - - 2.4 1.3 0.5 0.8 2.1 i356 38926 5.0-8.6 105.0 18.0 11.0 8.1 i357 38928 9.8-18.0 17.0 20.0 4.3 11.0 i358 39551 0.1-14.6 0.6 2.4-2.5 i359 42159 8.8-16.0 88.0 33.0 35.0 9.4 i360 3901 - - 50.0 2.0 0.3-1.0 i361 2947 - - 5.0 9.0 0.6-6.0 i362 2904 - - 2.0 1.0 1.0-1.0 i363 2996 2.5-38.6 147.8 16.1-16.3 i364 39449 - - 31.0 1.0 1.0-6.0 i365 2987 - - 18.0 0.2 0.2-0.5 i366 44038 - - 4.5 1.5 2.9-0.7 i367 44038 - - 2.3 1.4 0.6-1.1 i368 44040 - - 0.8 1.0 0.2-1.2 i369 44040 - - 0.5 0.6 0.5-0.4 i370 44042 1.1-1.5 0.4 2.0-0.8 i371 44042 0.4-0.8 0.8 0.7-0.5 i372 44044 1.6-0.6 0.7 1.2-1.5 i373 44044 2.0-1.5 2.2 1.7-1.4 i374 44046 - - 0.8 0.7 0.9-1.1 i375 44046 - - 0.7 1.1 0.7-0.8 i376 44048 0.9-1.8 0.7 1.6-1.1 i377 44048 1.5-1.4 1.5 1.1-1.4 i378 44050 - - 1.1 0.2 0.7-1.0 i379 44050 - - 0.9 1.1 0.5-0.9 i380 44052 1.2-0.4 0.4 0.6-1.3 i381 44052 1.0-1.2 1.1 0.8-0.8 i382 44054 1.8-4.6 2.3 3.6-1.4 i383 44054 0.8-2.5 3.0 1.3-1.2 i384 44056 - - 2.9 1.8 1.3-1.7 i385 44056 - - 1.5 1.3 0.6-0.8 i386 44058 0.2-0.2 0.2 0.4-0.6 i387 44058 0.3-0.2 0.3 0.5-0.4 i388 44060 - - 3.5 1.6 2.2-0.7 i389 44060 - - 1.8 0.9 0.6-0.9 i390 44062 0.6-1.0 0.2 0.6-1.4 i391 44062 0.6-0.9 0.5 0.7-0.5 i392 44064 - - 1.0 0.8 1.1-1.0 i393 44064 - - 0.9 0.8 0.5-0.8 i394 44066 - - 
1.6 1.1 0.7-1.6 i395 44066 - - 0.6 0.8 0.5-0.6 i396 44068 - - 2.8 0.9 1.2-4.0 i397 44068 - - 1.1 0.5 0.3-0.8 i398 44070 - - 2.0 4.3 1.6-2.1 i399 44070 - - 0.5 0.6 0.4-0.5 i400 44072 - - 2.6 1.4 2.6-1.4 i401 44072 - - 1.2 0.9 0.8-0.8 i402 44074 - - 1.1 1.0 2.9-0.9 i403 44074 - - 0.4 0.5 0.1-0.3 i404 44078 - - 1.4 1.1 0.8-1.8 i405 44078 - - 1.3 1.3 0.7-0.8 i406 44080 - - 8.0 3.7 4.0-1.2 i407 44080 - - 2.6 2.3 1.1-0.9 i408 44082 0.9-1.3 1.2 1.5-1.7 i409 44082 0.3-1.4 0.5 0.3-0.5 i410 44084 0.9-0.5 0.2 0.4-0.9 i411 44084 0.8-0.8 1.0 1.2-0.8 i412 44086 1.3-0.2 0.4 0.9-1.3 i413 44086 0.6-0.6 0.8 0.9-0.7 i414 44088 1.0-0.7 0.3 0.3-1.0 i415 44088 0.5-1.0 0.9 0.7-0.7 i416 44090 0.6-0.7 0.3 1.3-0.8 i417 44090 0.6-0.9 1.0 0.9-0.8 i418 44092 0.7-0.7 0.3 0.4-0.4 i419 44092 1.3-0.8 1.4 0.9-0.8 i420 44094 0.9-0.8 0.5 1.0-0.7 i421 44094 2.1-2.5 2.8 1.3-1.3 i422 44096 1.3-1.0 0.9 0.9-2.3 i423 44096 2.0-2.7 2.3 1.2-1.4 i424 44098 1.9-3.2 2.2 1.5-2.5 i425 44098 1.3-2.0 1.8 1.2-1.1 i426 44100 1.5-5.3 5.1 2.4-1.7 i427 44100 1.3-3.2 3.0 2.6-1.9 i428 44102 - - 6.2 1.3 1.4-3.3 i429 44102 - - 1.4 0.9 0.5-0.9 i430 44104 0.9-2.6 1.4 1.3-2.8 i431 44104 0.7-1.4 1.1 0.8-1.0 Appendix C. The Complete Datasets 77

i432 44106 - - 1.4 1.0 1.0-1.7 i433 44106 - - 1.0 0.8 0.6-0.9 i434 44108 - - 1.3 1.0 0.5-0.8 i435 44108 - - 1.3 3.6 1.2-1.4 i436 44110 - - 1.0 0.4 0.5-0.6 i437 44110 - - 2.7 2.8 1.0-1.0 i438 44112 0.2-0.4 0.2 - - 0.3 i439 44112 1.1-1.9 1.9 1.1-1.1 i440 45038 0.4-0.7 0.3 0.4-0.4 i441 45040 1.0-60.0 1.5 2.0-1.9 i442 45042 0.9-2.7 1.5 1.1-1.7 i443 45044 1.1-1.7 0.9 0.8-0.7 i444 45046 0.4-0.5 0.4 0.5-0.3 i445 45057 0.9 4.1 83.0 1.3 2.4-2.4 i446 45058 3.9 49.0 232.0 13.0 33.0-8.2 i447 45059 0.7 1.1 1.2 1.2 0.7-1.0 i448 45060 0.6 4.9 0.7 0.2 0.3-0.3 i449 45061 0.8 0.9 1.1 1.1 0.9-0.9 i450 45062 0.5 1.9 25.0 0.6 0.5-1.1 i451 45063 0.5 0.5 0.5 0.6 0.3-0.4 i452 45064 0.8 2.9 5.8 2.4 0.8-1.3 i453 45065 0.2 6.0 7.7 0.8 1.3-4.2 i454 45066 1.6 141.0 86.0 9.8 111.0-27.0 i455 45067 0.8 0.8 1.2 1.1 0.7-1.0 i456 45069 0.4 7.9 0.8 0.2 0.5-0.4 i457 45071 1.6 2.7 7.4 1.6 1.0-1.8 i458 45072 0.3 31.0 66.0 3.6 12.0-12.0 i459 45073 4.3 3.8 15.0 13.0 3.5-2.6 i460 45074 0.5 0.6 1.1 0.7 0.5-0.8 i461 45076 0.3 9.9 1.0 0.1 0.3-0.5 i462 45078 0.6 1.5 31.0 0.9 1.1-1.2 i463 45079 1.1 12.0 72.0 2.8 6.9-2.8 i464 45080 0.9 0.8 1.3 0.8 0.8-0.8 i465 45082 0.4 11.0 0.9 0.3 0.5-0.4 i466 45084 0.4 0.7 1.5 1.0 0.7-0.8 i467 45086 0.8 8.8 0.9 0.2 0.3-0.4 i468 45088 1.4 7.6 142.0 0.8 - - 2.3 i469 45089 0.1 22.0 6.0 0.1 - - 0.1 i470 45090 0.5 0.7 12.0 0.7 0.6-0.4 i471 45091 4.3 22.0 56.0 19.0 24.0-24.0 i472 45092 0.6 0.7 0.8 0.6 0.7-0.7 i473 45094 0.2 13.0 1.6 0.3 - - 0.5 i474 45095 0.6 0.7 0.8 0.9 0.6-0.7 i475 45096 1.0 2.3 46.0 1.1 2.0-1.9 i476 45097 0.5 0.6 1.0 0.5 0.4-0.6 i477 45098 0.5 5.2 0.3 0.1 0.1-0.3 i478 45099 1.3 4.0 91.0 4.3 5.6-3.4 i479 45100 2.7 14.0 140.0 10.0 21.0-5.2 i480 45101 3.2 2.1 23.0 8.2 2.1-4.6 i481 45102 2.1 6.9 35.0 7.6 5.8-7.8 i482 45103 0.4 0.7 0.6 0.6 0.5-0.5 i483 45104 0.3 0.8 13.0 0.3 0.3-0.4 i484 45105 0.8 1.5 45.0 1.1 1.1-1.4 i485 45106 1.2 7.4 82.0 1.8 2.7-2.6 i486 45107 0.5 1.8 14.0 0.6 0.5-0.9 i487 45108 6.2 3.4 59.0 9.9 18.0-1.3 i488 45109 0.9 1.4 4.6 1.0 0.8-0.9 i489 
45110 0.6 2.4 34.0 0.8 1.0-1.0 i490 45111 0.6 9.5 0.4 0.3 0.3-0.4 i491 45112 1.5 1.5 2.1 2.2 1.3-1.6 i492 45113 1.0 1.0 1.0 1.0 - - 1.0 i493 45114 0.2 5.4 0.2 0.1 - - 0.1 i494 45115 0.7 2.4 1.1 0.5 - - 0.6 i495 45116 0.2 10.0 0.3 0.2 - - 0.3 i496 45117 1.0 1.0 1.0 1.0 - - 1.0 i497 45118 0.1 2.0 0.1 0.1 - - 0.1 i498 45119 0.3 2.0 1.3 0.7 - - 1.0 i499 45120 0.2 6.0 0.3 0.2 - - 0.3 i500 45121 1.0 1.0 1.0 1.0 - - 1.0 i501 45122 0.3 3.0 0.1 0.3 - - 0.1 i502 45123 0.6 0.9 1.4 2.0 - - 1.2 i503 45124 0.3 6.0 0.2 0.2 - - 0.1 i504 45125 0.8-2.9 21.5 0.9 2.4 2.2 i505 45129 0.5-0.4 1.5 0.8-1.3 i506 45131 0.7-1.7 1.8 0.4 1.0 1.2 i507 45133 0.8-0.5 2.1 0.3-1.6 i508 45135 2.3-13.4 1.7 4.6 1.8 1.0 i509 45137 2.2-2.8 5.8 1.0 4.1 1.7 i510 45139 0.5-1.0 0.2 - - - i511 45141 0.2 - - - - 0.3 - i512 45143 0.5-1.3 0.7 0.9 0.5 0.4 i513 45145 0.5-0.2 0.2-0.6 - i514 45147 0.4-0.3 1.8 1.2-0.7 i515 45149 0.2 - - - - - - i516 45151 6.1-29.0 39.3 0.5-37.2 i517 45153 0.4 - - - - - - i518 45155 1.3-95.7 - - - - i519 45157 5.8-2.1 130.2 3.1-3.8 i520 45159 2.3-26.9 112.1 41.1 - - i521 45161 0.4-0.2 1.4 0.2 0.3 - i522 45165 0.4-0.6 0.4 0.4 0.6 0.7 i523 45167 0.2 - - - - - - i524 45169 0.5-34.3 - - - - i525 45171 0.4-59.4 - - - - i526 45173 0.2 - - 0.2 0.4 - - i527 45177 2.7-3.5 3.0 1.0-3.2 i528 45179 0.7-0.9 0.9 0.6 1.3 0.7 Appendix C. The Complete Datasets 78

i529 45181 1.9-6.2 4.2 5.2 1.9 3.6 i530 45183 3.9-0.9 1.6 0.5 5.0 14.8 i531 45185 1.3-6.5 3.9 2.3 2.5 1.3 i532 45187 1.6-1.4 0.4 1.7 1.4 1.0 i533 45189 2.9-17.4 15.8 1.9 10.2 12.8 i534 45191 9.2-22.7 91.0 1.6-40.4 i535 45193 0.6-0.5 1.1 0.4 0.7 0.6 i536 45195 3.0-5.3 7.8 3.0-3.7 i537 45197 0.5-9.2 15.2 22.2 7.7 15.2 i538 45199 0.2 - - 0.1 0.4 0.3 - i539 45201 0.8-16.9 144.0 3.9-6.8 i540 45203 0.7-67.7 0.4 0.9 1.0 1.0 i541 45205 0.4-12.7 5.7 2.2 1.5 1.2 i542 45207 0.6-0.4 1.3 0.6-0.9 i543 45209 0.7-0.5 0.4 0.5-0.8 i544 45211 0.3-0.4 0.4 0.5 0.8 0.6 i545 45213 1.8-4.9 37.7 1.0-6.3 i546 45215 0.2-0.6 0.8 0.2 0.4 0.2 i547 45217 - - - 34.9 - - - i548 45219 0.4-6.5 7.8 20.2 13.5 9.0 i549 45225 1.3-12.6 18.4 42.3-7.5 i550 45227 1.5-3.5 8.3 0.7 1.4 4.8 i551 45231 1.0 - - 0.2 - - - i552 45233 0.2-0.5 0.6 0.5-0.4 i553 45235 0.9-0.3 0.8 0.7-0.6 i554 45237 0.7-3.8 6.2 0.4 10.8 2.6 i555 45239 1.8-11.8 17.9 0.4 9.8 9.4 i556 45241 3.2-53.4-12.6 48.4 - i557 45243 0.5-0.5 2.4 0.9 0.7 0.7 i558 45245 0.8-35.1 2.4 2.4-1.3 i559 45249 5.3-30.4 91.7 0.7 3.8 56.3 i560 45251 1.7-1.2 1.1 0.6-0.8 i561 45253 1.7-2.8 5.8 1.3-2.9 i562 45255 2.6-2.6 1.9 0.8-1.0 i563 45257 2.4-5.5 16.6 48.8 - - i564 45259 4.1-2.7 0.2 - - - i565 45261 1.1 - - 0.6 1.1 1.0 1.6 i566 45263 0.2-6.0 5.2 0.1 3.1 3.2 i567 45265 3.7-19.6 15.8 3.4 25.3 5.3 i568 45267 1.0-1.1 3.4 1.0-1.3 i569 45269 1.1-1.0 1.7 0.5-1.4 i570 45271 3.6-7.6 5.7 1.6-3.1 i571 45273 0.4-0.3 - - - - i572 45275 51.6-90.9 - - - - i573 45277 0.9-95.7 - - - - i574 45281 0.4-0.9 0.4 0.5 0.5 0.2 i575 45283 1.9-1.3 9.7 21.9 0.7 2.9 i576 45285 2.5-21.1 146.9 18.7 17.7 6.2 i577 45287 1.1-3.0 50.3 2.1 4.1 8.0 i578 45289 0.2 - - - - - - i579 45291 2.7-72.8 - - - - i580 45293 2.9-45.2-7.3 37.6 - i581 45295 12.3-17.8 60.9 9.2-17.2 i582 45297 - - 56.2-0.6 - - i583 45299 0.3 - - - - 0.8 - i584 45301 2.8-15.1 29.6 1.5-18.8 i585 45303 1.3-1.7 3.0 0.2 2.2 3.3 i586 45305 1.7-15.6 19.6 3.6 9.5 13.9 i587 45307 0.7-9.0 5.7 31.9 9.6 3.9 i588 45309 0.9-36.0 5.6 1.5-8.4 
i589 45311 7.0-28.8 13.5 38.1 - - i590 45313 0.4-0.7 0.3 - - - i591 45315 5.1-31.4 58.6 3.8 14.9 29.2 i592 45317 0.8-0.8 0.6 0.6-1.2 i593 45319 0.5-22.1 0.9 1.1 0.9 0.5 i594 45321 0.3-0.3 0.2 0.3 0.3 - i595 45323 0.6-0.5 0.8 0.5-0.9 i596 45325 7.3-11.4 220.3 1.8-9.9 i597 45327 0.5-0.3 0.4 0.3-0.5 i598 45329 1.8-16.0 31.7 8.8 5.4 9.6 i599 45331 0.2 - - - - 0.3 - i600 45333 3.6-54.5-14.4 18.7 - i601 45335 1.0-0.6 0.5 - - - i602 45337 1.7-41.9 - - - - i603 45341 1.0-56.0 - - - - i604 45343 1.1-1.0 0.4 1.1 1.5 1.1 i605 45345 2.8-26.8 211.1 - - - i606 45347 1.7-0.5 2.9 0.8-1.8 i607 45349 15.2-52.6-7.7 11.5 - i608 45351 1.7-24.8 9.1 24.5 1.7 10.8 i609 45353 1.1-6.4 1.4 0.3 1.7 0.7 i610 45355 7.3-15.4 18.0 8.1-41.9 i611 45359 5.7-28.9 12.5 17.5 4.4 28.5 i612 45361 5.3-53.2 - - - - i613 45363 0.7-0.9 0.3 0.7 0.9 0.6 i614 45365 0.3-0.5 0.4 0.4 0.4 0.3 i615 45367 0.9-21.4 0.7 0.6-1.7 i616 45369 2.0-20.2 2.5 2.8 3.0 7.6 i617 45371 0.2 - - 0.2 0.3 - - i618 45373 0.4-0.5 0.6 0.3 0.8 0.2 i619 45375 5.9-30.4 - - - - i620 45377 1.0-1.7 1.0 0.4 1.2 0.8 i621 45379 2.7-45.2 13.6 6.9 7.6 4.2 i622 45381 0.5-0.3 0.5 0.4-0.6 i623 45383 1.1-49.6 2.9 2.2 1.9 2.1 i624 45385 0.4-10.5 4.9 4.0 0.8 4.0 i625 45387 0.3-0.3 0.4 0.3 0.5 0.3 Appendix C. The Complete Datasets 79

i626 45389 0.5-1.0 0.8 0.9-0.7 i627 45391 0.3-14.2 0.6 0.7-0.6 i628 45393 0.4-26.9 1.1 1.1-3.3 i629 45395 1.4-40.5 - - 49.0 - i630 45397 1.5-11.7 46.0 1.8 13.7 11.4 i631 45399 0.3-4.0 0.2-0.3 - i632 45401 1.7-63.3 - - - - i633 45403 0.7-12.1 1.4 0.7 0.8 2.1 i634 45405 0.3-1.4 1.9-12.7 - i635 45407 0.3-1.6 0.5 0.6-0.5 i636 45409 0.8-1.0 0.9 0.8 0.7 0.8 i637 45411 0.8-0.6 0.7 0.8 0.8 1.8 i638 45413 12.3-61.7 26.0-47.9 - i639 45415 1.0-3.4 15.7 0.4 5.1 2.5 i640 45417 0.6-1.1 0.3 0.8-1.0 i641 45419 2.6-7.9 31.3 1.2 21.2 6.2 i642 45421 26.2-95.7 - - - - i643 45423 0.2 - - - - - - i644 45425 0.5-3.6 1.9 0.4 0.8 0.4 i645 45427 0.2 - - - - 1.0 - i646 45431 1.7-9.1 14.1 12.2 3.3 4.8 i647 45433 16.1-10.7 72.9 1.0-16.9 i648 45435 4.9-47.1 26.2 4.3 40.1 12.6 i649 45437 0.8-11.7 9.5 0.6 10.0 - i650 45439 2.0-14.4 44.1 29.0 - - i651 46679 3.4-30.3 21.0 30.3-9.1 i652 46681 0.9-2.5 1.4 1.0-1.3 i653 46685 0.6-1.1 0.9 0.6-0.7 i654 46687 0.7-47.0 1.4 1.5-1.0 i655 46691 1.1-1.5 1.7 0.9-1.1 i656 46693 0.8-1.4 1.3 1.1-1.0 i657 46695 0.6-38.9 1.0 0.9-1.6 i658 46697 1.2-21.0 61.2 3.8-17.1 i659 46699 3.0-4.5 16.5 0.7-4.3 i660 46701 22.3-165.4 119.0 249.8-186.9 i661 46703 0.8-1.8 1.4 1.2-1.3 i662 46705 0.5-1.3 0.9 1.0-0.7 i663 46709 1.7-25.0 5.9 4.7-9.6 i664 46711 6.0-18.4 12.0 18.0-10.5 i665 3818 - - 1.0 0.4 0.3-0.4 i666 3899 - - 3.0 1.0 1.0-1.0 i667 3830 - - 8.0 3.0 59.0-4.0 i668 3826 - - 1.0 0.3 0.3-0.3 i669 3893 - - 6.0 35.0 71.0-31.0 i670 3883 - - 1.0 0.3 1.0-0.3 i671 3810 - - 73.0 147.0 71.0-100.0 i672 3816 - - 4.1 4.4 0.6-7.4 i673 3905 - - 24.0 87.0 5.0-21.0 i674 40015 - - 23.0 1.8 0.4-1.5 i675 3928 - - 25.0 21.0 40.0-6.0 i676 3926 - - 10.0 20.0 13.0-8.0 i677 3910 - - 1.0 1.0 1.6-1.0 i678 3839 - - 58.0 19.0 14.0-14.0 i679 3966 - - 28.0 58.0 3.0-16.0 i680 3891 - - 98.0 118.0 43.0-72.0 i681 3885 - - 95.0 132.0 51.0-62.0 i682 3924 - - 1.0 1.0 0.3-1.0 i683 3922 - - 4.0 7.0 2.0-5.0 i684 4470 - - 6.0 66.0 1.0-11.0 i685 3918 - - 80.0 52.0 38.0-60.0 i686 3850 - - 7.0 3.0 2.0-1.0 i687 3854 - 
- 19.0 19.0 1.0-7.0 i688 3852 7.5-7.0 14.7 0.1 2.0 14.2 i689 41684 - - 48.0 85.0 6.0-18.0 i690 41686 3.4-33.3 60.8 6.7 54.1 12.6 i691 41621 - - 13.0 11.0 7.0-5.0 i692 3867 - - 24.0 28.0 0.5-15.0 i693 3865 3.0-22.4 49.7 3.7 5.6 13.5 i694 3869 - - 10.0 41.0 1.0-10.0 i695 3895 - - 42.0 17.0 22.0-17.0 i696 3916 - - 41.0 8.0 6.0-8.0 i697 50511 1.3-2.6 1.6 1.5 1.1 1.3 i698 50513 1.6-5.8 14.0 1.6 2.8 5.2 i699 50515 0.8-0.9 0.8 0.7 0.7 0.6 i700 52822 2.3 19.3 18.4 45.3 16.7 9.9 17.0 i701 52824 0.6 0.7 1.0 0.7 0.8 0.6 0.8 i702 52826 0.6 1.3 23.3 0.6 0.3 0.6 1.1 i703 52828 1.0 15.2 29.8 36.7 7.1 10.1 23.4 i704 52830 2.1 1.9 2.5 14.5 1.2 4.3 2.5 i705 52832 1.5 27.4 43.6 73.5 21.4 13.2 31.4 i706 52834 1.2 2.5 45.0 0.9 2.3 1.2 1.8 i707 52836 1.3 2.6 4.1 7.6 0.4 1.4 3.5 i708 52838 4.9 10.8 56.0 9.3 16.2 7.5 24.5 i709 52840 4.1 17.2 6.1 61.5 20.9 10.9 8.4 i710 52889 0.4-0.9 0.5 0.7 0.4 0.5 i711 52891 3.2-87.2 4.1 11.8 2.7 22.7 i712 52893 1.1-5.9 1.8 1.1 1.3 3.0 i713 52897 1.0-297.0 23.0 71.0 16.0 87.0 i714 52905 7.7-110.8-11.6 39.0 69.6 i715 52907 0.2-46.5 1.9 3.4 1.3 15.1 i716 52909 4.2-53.2 190.0 33.2 25.2 25.2 i717 52911 3.2-75.7 176.8 32.5 49.3 41.5 i718 52915 6.3-17.6 48.7 47.7 6.3 7.5 i719 53501 0.9-6.7 1.7 1.3 0.8 0.9 i720 53503 6.0-37.0 139.0 132.0 24.0 12.0 i721 53505 8.8-56.0 187.0 306.0 28.0 9.4 i722 53507 6.0-18.0 63.0 41.0 26.0 6.1 Appendix C. The Complete Datasets 80

i723 53509 6.5-20.0 63.0 25.0 35.0 11.0 i724 53515 2.3-15.0 65.0 20.0 26.0 6.3 i725 53517 6.2-17.0 77.0 38.0 21.0 5.0 i726 53853 0.7-2.2 0.6 0.7-1.0 i727 53857 0.6-0.5 0.3 0.4-0.7 i728 53862 1.4-2.9 0.7 0.7-0.5 i729 53866 1.1-7.3 2.4 1.9-2.9 i730 53868 2.4-3.2 2.0 0.9-0.5 i731 53881 0.6-1.1 0.8 0.9-0.7 i732 53885 0.6-1.2 0.7 0.7-0.6 i733 53889 0.6-0.9 2.2 0.9-0.7 i734 53891 0.6-0.6 1.2 0.4-0.7 i735 53909 1.5-2.6 3.3 1.2-1.2 i736 53915 0.5-1.0 1.1 1.1-1.7 i737 53922 1.2-2.8 2.8 0.4-1.2 i738 53924 0.9-2.6 3.5 0.7-1.2 i739 53932 0.6-3.9 1.4 0.9-0.9 i740 53934 0.7-0.7 0.9 0.5-0.8 i741 53951 0.6-0.9 1.5 0.5-0.6 i742 54162 1.7-15.3 1.6 6.1 0.4 1.2 i743 54163 5.4-58.7 8.6 14.1-1.3 i744 54164 8.6-74.3 10.0 55.9 1.6 4.1 i745 54165 10.4-395.3 9.1 92.0 3.1 10.7 i746 54166 19.8-50.7 15.3 40.8 3.1 3.8 i747 54167 64.1-143.8 21.6 24.4 46.8 62.9 i748 54168 23.0-600.0 59.8 1000.0 7.2 42.3 i749 54169 118.1-600.0 400.0 137.2 77.0 24.2 i750 54170 31.1-600.0 84.5 140.6-26.8 i751 54171 42.5-600.0 76.0 364.6 11.4 78.3 i752 54172 32.0-600.0 42.1 241.1 13.8 57.8 i753 54173 28.1-600.0 54.2 256.2 9.0 52.6 i754 54174 19.0-271.0 31.0 105.0 6.0 41.0 i755 54175 20.9-63.9 33.7 24.0 13.9 38.0 i756 54176 74.0-600.0 141.1 1000.0 123.5 92.1 i757 54177 252.2-600.0 86.2 265.5 94.4 400.0 i758 54178 1.6-3.0 4.9 3.9 1.9 1.3 i759 54179 24.8-3.0 7.2 2.0 37.5 8.6 i760 54180 2.3-1.8 4.7 2.8 1.4 1.9 i761 54181 1.6-4.3 2.8 1.8 1.9 3.2 i762 54182 2.2-2.3 4.6 1.9 2.0 1.4 i763 54183 12.5-8.5 5.0 3.5 28.9 33.9 i764 54184 42.6-30.8 92.9 30.4 80.2 33.6 i765 54185 40.0-14.0 33.3 14.3 79.0 16.7 i766 54186 13.2-1.8 22.4 1.6 43.4 5.6 i767 54187 4.6-8.4 22.4 4.5 9.5 6.3 i768 54188 39.5-101.9 84.5 43.1 99.3 79.2 i769 54189 52.7-136.6 275.0 162.3 102.7 41.5 i770 54300 - - 7.8 12.6 4.9-2.7 i771 54301 0.4-27.8 27.8 3.2-25.3 i772 54302 0.2-31.4 1.7 1.2-6.1 i773 54303 4.2-500.0 11.9 13.5-6.1 i774 54304 3.4-3.0 7.3 5.5-1.6 i775 54305 4.3-76.7 113.0 181.0-21.9 i776 54306 1.0-3.5 5.8 0.9-4.8 i777 54307 3.5-30.9-1.1 6.4 11.5 i778 
54308 16.5-20.5-12.4 54.2 7.7 i779 54309 1.6-25.8-111.2 14.6 17.4 i780 54310 4.8-28.9-57.5 14.5 19.9 i781 54311 4.3-18.7-60.8 9.7 9.7 i782 54312 400.0-30.9-453.3 294.7 22.8 i783 54313 3.8-18.1-55.0 24.5 13.8 i784 54314 6.1-18.9-24.6 8.7 26.0 i785 54315 13.0-23.3-24.7 58.3 15.4 i786 54316 9.7-24.9-4.1 24.5 39.5 i787 54317 6.5-14.8-46.5 23.4 13.4 i788 54318 9.6-47.4-86.5 20.4 15.5 i789 54319 7.5-22.7-33.0 31.5 7.9 i790 54320 25.5-22.5-29.1 64.8 13.7 i791 54321 26.7-53.7-3.3 172.4 181.7 i792 54322 3.8-93.5-1000.0 47.9 100.9 i793 54323 10.8-44.6-81.1 39.3 16.0 i794 54324 4.7-56.5-44.6 8.5 30.0 i795 54325 9.5-46.4-48.0 10.3 34.6 i796 54326 12.4-19.7-30.1 46.8 21.4 i797 54327 11.7-48.8-52.7 11.1 36.9 i798 54328 4.7-78.7-443.7 50.7 64.6 i799 54329 3.5-74.8-441.9 39.3 57.6 i800 54330 6.3-76.5-387.8 47.0 91.3 i801 54331 18.5-30.4-23.5 119.0 20.5 i802 54332 3.5-153.5-1000.0 84.1 172.2 i803 54333 31.6-43.5-120.0 80.4 13.6 i804 54334 6.3-30.7-91.7 14.4 8.1 i805 54335 14.7-54.6-122.7 5.2 12.2 i806 54336 16.7-49.8-83.8 73.3 37.1 i807 54337 3.7-45.5-33.1 47.2 35.0 i808 54338 46.8-54.0-8.4 124.0 26.4 i809 54339 5.9-35.0-33.9 45.2 17.5 i810 54340 31.3-60.9-75.6 52.3 26.3 i811 54341 25.0-99.7-1000.0 214.1 400.0 i812 54342 35.5-71.9-126.2 68.5 24.6 i813 54343 17.3-103.6-35.7 51.0 36.6 i814 54344 16.9-63.5-60.7 43.2 14.7 i815 54345 20.9-75.9-1000.0 83.6 75.3 i816 54346 10.1-191.0-1000.0 95.9 400.0 i817 54347 6.7-39.1-101.3 10.1 37.4 i818 54348 20.6-99.6-32.1 23.1 66.3 i819 54349 16.0-74.0-139.6 13.5 23.1 Appendix C. The Complete Datasets 81

i820 54350 400.0-147.8-69.7 63.1 27.2 i821 54351 12.7-78.6-73.6 56.0 50.8 i822 54352 11.4-94.5-101.0 79.9 55.7 i823 54353 18.4-89.8-286.8 112.5 49.0 i824 54354 66.4-159.4-258.7 164.3 57.7 i825 54355 15.4-132.9-128.6 209.0 400.0 i826 54356 22.4-600.0-1000.0 122.5 4000.0 i827 54392 1.8-7.6-24.7 1.2 6.2 i828 54393 3.2-4.1-1.3 5.3 5.1 i829 54394 2.5-3.6-1.8 2.3 2.4 i830 54395 0.8-30.1-0.4 0.8 1.2 i831 54396 7.8-1.4-1.1 3.6 0.6 i832 54397 0.6-1.1-0.6 0.8 1.3 i833 54398 2.9-4.8-8.0 2.4 2.6 i834 54399 3.5-6.0-5.7 3.8 3.5 i835 54400 3.9-30.0-25.4 18.1 31.1 i836 54401 9.9-48.6-91.4 22.3 29.4 i837 54402 11.0-143.6-350.0 - - i838 54407 1.8-1.3 1.1 1.3 1.6 1.5 i839 54408 8.4-3.4 4.0 3.7 7.7 3.2 i840 54409 21.0-2.9 5.2 1.3 19.0 1.6 i841 54410 48.0-4.1 6.5 2.0 31.0 2.4 i842 54411 4.6-2.6 3.6 2.7 6.2 2.1 i843 54412 18.0-3.3 5.0 1.4 15.0 1.2 i844 54419 0.5 - - - - - - i845 54420 0.7 - - - - - - i846 54421 0.8 - - - - - - i847 54422 0.8 - - - - - - i848 54423 0.9 - - - - - - i849 54424 0.9 - - - - - - i850 54425 1.1 - - - - - - i851 54426 1.1 - - - - - - i852 54427 1.4 - - - - - - i853 54428 1.4 - - - - - - i854 54429 1.7 - - - - - - i855 54430 1.9 - - - - - - i856 54431 1.9 - - - - - - i857 54432 2.2 - - - - - - i858 54433 2.4 - - - - - - i859 54434 2.4 - - - - - - i860 54435 2.5 - - - - - - i861 54436 2.6 - - - - - - i862 54437 2.8 - - - - - - i863 54438 3.4 - - - - - - i864 54439 3.5 - - - - - - i865 54440 4.4 - - - - - - i866 54441 4.4 - - - - - - i867 54442 4.7 - - - - - - i868 54443 5.4 - - - - - - i869 54444 5.8 - - - - - - i870 54445 6.5 - - - - - - i871 54446 12.1 - - - - - - i872 54447 0.1-2.4 0.4 0.5-1.6 i873 54448 0.1-5.1 0.8 1.4-2.6 i874 54449 0.1-8.2 0.9 1.2-2.3 i875 54450 0.1-12.9 1.4 1.2-3.1 i876 54451 0.1-12.2 1.3 1.5-3.0 i877 54452 0.4-0.4 0.5 0.4-0.5 i878 54453 1.1-1.4 1.3 1.0-1.1 i879 54454 0.9-2.5 1.7 1.2-1.7 i880 54595 26.4-31.0 10.9 0.3 91.5 15.1 i881 54596 13.3-5.7 0.5 0.3 102.5 - i882 54597 21.2-2.7 2.8 0.2 86.2 1.0 i883 54598 7.7-7.2 5.7 0.1 109.6 2.3 i884 
54599 47.3-45.3 14.0 0.7 95.8 95.0 i885 4432 1.5-2.2 1.8 1.1-1.0 i886 4482 3.9-21.6 74.1 9.2-20.2 i887 41179 - - 67.7 176.0 208.7 73.7 155.8 i888 4387 0.7-0.8 0.9 1.1-0.8 i889 4478 - - 46.0 1.0 0.4-2.0 i890 4538 - - 55.0 54.0 85.0 7.5 21.0 i891 4664 3.1-32.0 14.7 16.9-8.7 i892 4698 1.2-3.6 1.7 1.3-0.7 i893 5221 - - 1.2 1.2 0.7 0.8 0.8 i894 5464 2.1-24.7 47.8 104.8 7.8 22.2 i895 5681 - - 37.0 83.0 7.4 25.0 26.0 i896 6024 - - 22.0 5.9 3.4 3.0 8.3 i897 6028 - - 37.0 102.0 7.9 20.0 16.0 i898 41012 - - 0.9 1.1 0.8 1.0 0.8 i899 7038 - - 11.0 2.1 1.1 4.0 6.0 i900 7103 - - 11.0 0.6 0.4 0.7 0.7 i901 7139 15.5-64.0 36.7 74.4 17.0 41.1 i902 7235 - - 8.7 5.7 1.6 2.2 3.3 i903 7260 - - 16.0 45.0 18.0 15.0 33.0 i904 7393 - - 41.0 83.0 10.0 37.0 19.0 i905 7412 6.2-10.2 75.5 591.5-12.0 i906 7414 3.5-6.2 8.6 1.1-4.4 i907 7415 2.2-9.5 24.2 19.0-6.5 i908 40018 2.8-100.0 41.0 145.6-12.0 i909 7426 5.2-47.0 32.1 28.4-12.9 i910 7430 3.3-17.7 89.9 9.4 42.6 16.7 i911 7430 2.8-80.7 121.7 42.1-48.9 Appendix C. The Complete Datasets 82
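Each run of values above is one flattened table row per isolate: an id (iNNN), a sequence id, and per-drug fold values, with a dash marking a missing measurement (in the extracted text the dash is sometimes fused to a neighbouring number, e.g. "8651-4.0"). The rows can be recovered mechanically; this is a minimal sketch under those assumptions, and `parse_records` is a hypothetical helper, not part of the thesis:

```python
import re

def parse_records(text, n_cols):
    """Split a flattened run such as 'i33 8651-4.0 - - - - - 1.0 i34 ...'
    into one record per isolate. n_cols is the number of drug columns;
    a '-' token becomes None (missing measurement)."""
    # Re-separate dashes that were fused onto the preceding number.
    text = re.sub(r'(?<=[0-9])-', ' - ', text)
    records, row = [], None
    for tok in text.split():
        if re.fullmatch(r'i\d+', tok):          # an isolate id starts a new row
            row = {'id': tok, 'values': []}
            records.append(row)
        elif row is not None and re.fullmatch(r'-|[0-9.]+', tok):
            row['values'].append(None if tok == '-' else float(tok))
    # Keep only complete rows: sequence id plus n_cols fold values.
    return [r for r in records if len(r['values']) == n_cols + 1]
```

For example, `parse_records("i33 8651-4.0 - - - - - 1.0", 8)` yields one record with id `i33`, sequence id 8651.0, and fold values `[None, 4.0, None, None, None, None, None, 1.0]`.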

I used the following sequences from the Stanford HIV-1 reverse transcriptase drug susceptibility data set (values are fold changes in susceptibility; a dash marks a missing measurement):

Id Seq_Id 3TC_Fold ABC_Fold D4T_Fold DDC_Fold DDI_Fold DLV_Fold EFV_Fold NVP_Fold
i1 1 - - - - - - - 1.0
i2 7507 - - - - 1.9 - - 1.0
i3 7508 - - - - 8.9 - - 1.0
i4 7509 - - - - 4.5 - - 110.0
i5 7510 - - - - - - - 300.0
i6 39379 - - - - 9.4 - - 300.0
i7 7512 - - - - 19.0 - - 150.0
i8 7515 - - - - - - - -
i9 7516 - - - - - - - -
i10 7517 - - - - - - - -
i11 7518 - - - - - - - -
i12 7519 - - - - - - - -
i13 7878 100.0 14.5 6.4 5.4 2.7 0.1 0.2 0.5
i14 7879 100.0 7.7 1.7 2.4 1.5 0.2 0.3 0.5
i15 41691 - - - - - - - -
i16 7884 4.1 3.5 3.2 1.6 1.4 0.2 17.5 72.0
i17 7888 100.0 8.6 1.7 2.4 1.9 1.8 1.2 2.1
i18 7910 1.0 0.9 1.0 1.1 1.1 - - -
i19 7913 4.6 6.7 3.0 1.4 1.8 1.1 0.9 1.5
i20 40594 200.0 5.5 2.5 2.5 1.9 0.3 0.4 0.4
i21 8044 4.1 4.7 2.9 1.3 1.5 7.2 11.4 490.2
i22 8310 - - - - - - - -
i23 8313 100.0 - - - - - - -
i24 8311 666.0 - - - - - - -
i25 8319 - - - 2.0 1.6 - - -
i26 8320 - - - 1.0 0.8 - - -
i27 8321 - - - 1.0 1.0 - - -
i28 8322 - - - - - - - -
i29 8324 - - - 1.0 1.0 - - -
i30 8401 109.1 3.8 0.9 2.1 1.6 88.8 20.7 41.8
i31 8612 200.0 7.4 3.9 2.9 2.2 0.1 0.2 0.2
i32 8649 1.0 1.0 - - - - - -
i33 8651 - 4.0 - - - - - 1.0
i34 8653 - 4.0 - - - - - 1.0
i35 8655 - 2.0 - - - - - 1.0
i36 8659 - 1.0 - - - - - 3.0
i37 8661 - 5.0 - - - - - 1.0
i38 8663 - 5.0 - - - - - 1.0
i39 8667 - 3.0 1.0 - - - - -
i40 8669 - 5.0 - 2.0 - - - -
i41 8671 - 4.0 - - - - - -
i42 8673 - 4.0 1.0 - - - - -
i43 8675 - 7.0 2.7 - - - - -
i44 9270 200.0 8.0 3.3 3.1 2.1 0.7 1.3 4.5
i45 9342 200.0 6.9 2.3 2.1 1.9 0.1 0.2 0.2
i46 9363 - - - - - 250.0 96.0 400.0
i47 41884 - - - - - 1.0 1.0 0.9
i48 9405 1.1 0.8 0.8 1.0 0.9 0.7 0.5 0.7
i49 9425 200.0 4.2 1.6 1.9 1.3 0.3 0.3 0.4
i50 9576 6.0 3.0 2.0 2.0 2.0 - - -
i51 9607 4.0 2.0 1.0 1.0 1.0 - - -
i52 9608 1.0 1.0 1.0 1.0 1.0 - - -
i53 9609 210.0 - - - - - - -
i54 9610 15.0 5.0 5.0 3.0 2.0 - - -
i55 2616 78.4 6.6 0.9 3.6 1.3 0.5 1.1 4.2
i56 9641 109.1 5.1 1.9 2.2 1.9 2.2 1.5 3.4
i57 9650 200.0 3.0 0.6 2.3 1.4 0.9 0.7 0.6
i58 9927 - - - - - 2.2 4.4 9.0
i59 9988 200.0 3.5 0.8 2.0 1.6 1.5 1.0 1.0
i60 10003 - - - - - 0.7 0.5 0.6
i61 10011 200.0 4.3 0.7 2.2 1.6 0.7 0.7 0.5
i62 10158 200.0 5.6 2.0 2.2 1.6 58.0 700.0 35.0
i63 10492 200.0 3.8 0.7 1.8 1.5 1.0 0.6 0.8
i64 10496 96.2 2.2 0.7 1.8 1.5 72.0 69.0 191.0
i65 10514 - - - - - 0.1 0.2 0.4
i66 10516 - - - - - 0.8 0.8 0.6
i67 10719 2.5 2.3 2.0 1.1 1.2 0.3 0.2 0.2
i68 11150 3.0 2.5 2.4 1.0 1.2 53.0 41.0 155.0
i69 11333 - - - - - 190.0 240.0 240.0
i70 11602 - - 3.8 1.4 1.3 - - -
i71 11795 7.0 5.0 13.0 7.0 22.0 - - -
i72 11806 - - 3.3 - 2.0 - - -
i73 11807 - - 1.3 1.4 - - - -
i74 11809 1.0 - - - - - - -
i75 11811 170.0 - - - - - - -

Appendix C. The Complete Datasets i76 11810 170.0 - - - - - - - i77 11845 6.6 8.0 8.9 5.4 6.4 150.0 88.0 240.0 i78 11849 167.0 5.1 1.5 2.0 1.6 0.3 0.4 0.5 i79 2868 78.4 6.7 1.4 2.9 3.7 86.7 2.7 107.8 i80 12270 1.1 1.0 1.1 1.0 0.8 0.7 0.7 0.6 i81 12293 1.0 1.2 1.0 1.0 1.1 1.0 0.7 0.9 i82 40495 1.4 2.0 1.9 0.9 1.1 0.1 0.2 0.2 i83 12503 0.9 1.0 0.9 1.0 1.0 1.1 0.7 0.7 i84 12528 15.0 21.0 6.8 8.8 11.0 16.0 240.0 240.0 i85 12531 8.4 7.2-1.9 2.1 0.4 3.7 240.0 i86 40764 200.0 2.4 0.7 1.6 1.5 6.6 4.6 2.8 i87 12836 2.0 2.6 1.5 1.0 0.9 0.3 0.5 1.0 i88 12837 3.0 2.8 1.9 0.8 0.9 0.5 0.4 0.5 i89 12838 150.0 3.8 0.8 1.4 0.9 0.6 0.6 0.6 i90 12839 150.0 3.1 0.8 1.5 1.2 - - - i91 12840 150.0 3.4 1.4 1.8 1.4 0.4 0.3 0.3 i92 12841 1.1 1.0 1.4 0.4 0.5 - - - i93 12842 12.5 2.6 1.7 2.5 1.8 1.1 0.8 1.0 i94 12843 2.6-0.9 1.2 1.8 - - - i95 12844 3.5-1.9 1.4 1.2 - - - i96 12845 21.9 16.4 13.3 2.1 3.0 0.2 0.3 0.6 i97 12846 1.4-2.9 2.2 6.8 2.7 1.9 1.9 i98 12847 150.0 28.9 11.5 16.1 18.3 - - - i99 12848 - - - - - 30.0 10.0 3.1 i100 12849 - - - - - 4.9 5.0 11.6 i101 12850 - - - - - 52.0 26.0 55.0 i102 12851 - - - - - 5.0 1.7 64.0 i103 12852 - - - - - 1.3 1.7 3.0 i104 12853 - - - - - 35.0 3.0 161.0 i105 12854 - - - - - 9.0 109.0 500.0 i106 12855 - - - - - 0.5 7.6 75.0 i107 12856 - - - - - 0.4 47.0 206.0 i108 12857 - - - - - 2.0 123.0 500.0 i109 12858 - - - - - 1.4 500.0 500.0 i110 12859 - - - - - 250.0 31.0 500.0 i111 12860 - - - - - 37.0 213.0 500.0 i112 12964 4.0 - - - - - - - i113 12965 2.0 - - - - - - - i114 12966 4.0 - - - - - - - i115 12967 1.0 - - - - - - - i116 12968 22.0 - - - - - - - i117 12969 8.0 - - - - - - - i118 12970 2.0 - - - - - - - i119 12971 7.0 - - - - - - - i120 12972 32.0 - - - - - - - i121 12973 3.0 - - - - - - - i122 12974 14.0 - - - - - - - i123 12975 15.0 - - - - - - - i124 12976 78.0 - - - - - - - i125 12977 82.0 - - - - - - - i126 12978 85.0 - - - - - - - i127 12979 72.0 - - - - - - - i128 12980 82.0 - - - - - - - i129 12981 84.0 - - - - - - - i130 12927 150.0-1.6 
1.8 1.4 0.3-0.4 i131 12929 150.0-2.1 2.1 2.5 - - - i132 12931 150.0-1.5 2.2 1.2 145.0-195.0 i133 12933 1.6-0.8-1.2 3.0-580.0 i134 12937 0.9-0.7 0.7 1.0 0.6-0.4 i135 12939 150.0-1.1 2.0 1.8 1.4-4.7 i136 12941 150.0-2.8 0.7 1.5 0.2-0.2 i137 12943 3.5-2.0 0.4 0.9 0.2-0.2 i138 14405 21.2 4.4 1.2 4.7 3.0 - - - i139 14419 32.5 10.5 2.7 4.8 13.1 - - - i140 14421 1.4 1.0 0.8 1.2 0.8 - - - i141 14425 24.1 3.2 0.5 0.7 0.9 - - - i142 14435 43.3 1.7 0.8 2.7 2.2 - - - i143 14440 34.4 2.3 0.8 1.9 1.3 - - - i144 14442 23.6 9.5 1.4 5.0 3.4 - - - i145 14448 29.8 1.9 0.4 1.3 1.5 - - - i146 14462 43.3 1.9 1.8 2.1 1.0 - - - i147 14468 0.9 1.3 1.5 0.7 0.7 - - - i148 14473 43.3 7.7 0.2 0.9 1.5 - - - i149 14476 90.0 40.0 1.7 5.4 6.0 - - - i150 14478 90.0 40.0 15.0 33.0 48.0 - - - i151 14479 1.3 1.1 2.2 3.7 3.4 - - - i152 14481 2.1 1.2 2.9 1.1 1.1 - - - i153 14486 43.3 1.2 0.1 0.4 1.2 - - - i154 14496 32.5 5.9 0.8 10.9 23.4 - - - i155 14501 21.8 1.5 0.4 2.8 1.2 - - - i156 14506 43.3 2.8 0.5 3.5 5.1 - - - i157 14514 37.0 5.4 1.6 4.8 5.8 - - - i158 14517 7.6 0.6 0.2 0.4 0.6 - - - 84

Appendix C. The Complete Datasets i159 15331 1.0 4.4 2.6 0.8 0.9 - - - i160 15332 1.5 1.4 4.1 0.8 0.9 - - - i161 15333 62.0 7.1 2.2 5.0 8.0 - - - i162 15334 62.0 4.1 3.7 5.1 2.4 - - - i163 15335 62.0 8.7 4.0 6.7 6.7 - - - i164 15336 2.9 4.0 4.4 1.7 2.1 - - - i165 15337 11.0 9.6 4.0 2.8 5.0 - - - i166 15338 20.0 12.0 6.6 4.8 4.4 - - - i167 15339 4.2 7.0 6.9 2.8 3.8 - - - i168 15340 6.1 12.0 3.1 2.5 6.5 - - - i169 15341 11.0 12.0 17.0 3.5 6.5 - - - i170 15342 3.2 1.8 2.7 3.0 2.5 - - - i171 15343 2.0 2.8 3.0 5.4 3.5 - - - i172 15554 - - - - - 58.0 23.0 39.0 i173 15555 - - - - - 250.0 25.0 780.0 i174 15556 - - - - - 250.0 450.0 780.0 i175 15558 - - - - - 161.0 5.3 600.0 i176 15559 - - - - - 250.0 270.0 600.0 i177 16033 130.0-1.2 3.7 2.6 5.4 3.8 5.5 i178 16271 200.0 3.8 0.8 1.8 1.5 1.0 0.7 0.5 i179 16577 - - - - - 0.4 7.0 96.0 i180 16578 - - - - - 0.3 56.0 191.0 i181 16579 - - - - - - 150.0 356.0 i182 16580 - - - - - 6.0 300.0 600.0 i183 16581 - - - - - - 170.0 170.0 i184 16582 - - - - - 4.0 300.0 600.0 i185 25066 - - - - - 270.0 3.0 680.0 i186 25067 - - - - - 20.0 12.0 680.0 i187 25068 - - - - - - 0.5 3.0 i188 25069 - - - - - 3.0 8.0 680.0 i189 25070 - - - - - 3.0 3.0 8.0 i190 25071 - - - - - 0.2 2.0 3.0 i191 25072 - - - - - 14.0 167.0 150.0 i192 25073 - - - - - 1.0 1.0 1.0 i193 25074 - - - - - 75.0 150.0 680.0 i194 25075 - - - - - 190.0 270.0 680.0 i195 25076 - - - - - 53.0 1.0 2.0 i196 25077 - - - - - 190.0 5.0 165.0 i197 25078 - - - - - 190.0 7.0 680.0 i198 25079 - - - - - 190.0 680.0 69.0 i199 25080 - - - - - 1.0 270.0 680.0 i200 25081 100.0 6.9 1.3-1.3 0.1 0.3 0.4 i201 25082 100.0 20.9 2.4-6.8 1.5 1.2 1.5 i202 25512 182.0 31.8-28.7 17.5 - - 2261.0 i203 25513 182.0 30.2-27.4 20.1 - - 49.2 i204 25539 182.0 32.7-6.8 13.9 - - 6659.0 i205 25540 182.0 15.7-7.5 38.7 - - 6659.0 i206 25516 182.0 18.6-22.3 8.2 - - 6659.0 i207 25541 182.0 21.4-9.6 8.0 - - 1923.0 i208 25518 182.0 27.7-27.3 7.8 - - 17.9 i209 25529 182.0 36.9-35.8 20.9 - - 6659.0 i210 25510 182.0 9.9-6.4 7.9 - 
- 3.9 i211 25543 182.0 5.0-5.2 7.6 - - - i212 25520 182.0 19.5-9.4 7.6 - - 19.9 i213 25545 35.8 9.1-6.9 10.4 - - 6659.0 i214 25523 182.0 34.2-26.9 13.9 - - 26.6 i215 26040 200.0 5.1 0.8 2.0 1.5 0.6 0.4 0.6 i216 26042 200.0 6.8 1.1 2.2 2.1 0.2 0.3 0.4 i217 26044 - - - - - 11.0 0.3 47.1 i218 26046 200.0 5.0 1.9 1.9 1.3 - - - i219 26067 200.0 5.5 1.7-1.7 0.2-0.3 i220 26071 1.1 1.1 1.3 0.9 1.2 3.5 1.5 3.4 i221 26072 200.0 2.3 0.7 1.7 1.0 0.3 0.2 0.4 i222 26074 200.0 3.5 1.2 1.5 1.4 0.4 0.3 0.3 i223 26049 200.0 4.6 1.6 2.1 1.5 0.3 0.3 0.5 i224 26050 200.0 6.3 1.3 1.7 1.7 1.5 9.5 9.6 i225 26028 200.0 8.5 3.4 2.8 2.5 1.1 1.4 2.6 i226 26030 8.3 4.2 4.3 1.2 2.0 25.7 330.0 750.0 i227 26053 200.0 7.7 1.3 2.0 1.7 0.4 0.3 0.4 i228 26055 - - - - - 7.8 9.4 16.0 i229 26058 - - - - - 5.9 41.6 39.2 i230 26060 - - - - - 6.1 34.4 34.1 i231 26031 200.0 7.5 2.1 2.4 2.1 0.4 0.6 0.8 i232 26032 21.3 7.8 2.4 1.9 1.9 0.6 0.7 1.2 i233 26062 200.0 3.5 0.6 0.9 1.6 0.3-0.7 i234 26063 200.0 3.1 0.9 1.9 1.6 1.4 2.1 2.2 i235 26034 - - - - - 1.1 0.6 0.7 i236 26036 - - - - - 0.7-0.7 i237 26037 200.0 5.7 1.0 2.2 2.6 1.0 0.5 0.7 i238 40788-3.5 2.0-1.4 - - - i239 26506 1.5 0.7 0.8-0.8-0.5 1.0 i240 26512 80.0 5.6 1.9 2.3 2.2 0.2 0.2 0.4 i241 27085 30.0 10.0 20.0-4.4-85.0 218.0 85

Appendix C. The Complete Datasets i242 28210 40.0 5.4 1.5 2.1 1.5 - - - i243 28211 - - - - - - - - i244 28214 50.0 13.0 3.7 2.1 3.3 - - - i245 28247 80.0 5.9 2.9 2.2 2.1 38.0 57.0 195.0 i246 28243 80.0 8.8 3.4 4.3 3.1 80.0 80.0 207.0 i247 28234 80.0 4.4 0.9 1.8 1.7 7.5 44.0 100.0 i248 28233 2.6 3.2 1.8 1.2 1.1 100.0 100.0 100.0 i249 28244 80.0 3.7 0.8 1.9 1.7 - - - i250 28242 - - - - - 1.4 0.9 0.8 i251 28241 80.0 8.2 1.7 2.4 1.9 0.4 0.5 0.6 i252 28236 80.0 3.1 1.0 1.7 1.3 46.0 20.0 43.0 i253 28238 1.6 1.1 1.0 1.0 1.0 2.2 1.2 1.3 i254 28250 - - - - - - 5.2 6.2 i255 28252 - - - - - 30.0 24.0 5.3 i256 28253 - - - - - 3.5 6.9 14.0 i257 28254 - - - - - 6.0 5.6 3.2 i258 28255 - - - - - 24.0 19.0 44.0 i259 28256 - - - - - 68.0 36.0 50.0 i260 28257 - - - - - 1.0 0.6 0.9 i261 28258 - - - - - 13.0 3.8 120.0 i262 28259 - - - - - 1.3 1.1 1.9 i263 28260 - - - - - 0.9 1.6 2.8 i264 28261 - - - - - - 1.1 - i265 28262 - - - - - 33.0 1.1 100.0 i266 28263 - - - - - 5.0 3.8 1.9 i267 28264 - - - - - 17.0 140.0 1500.0 i268 28265 - - - - - 2.3 4.0 44.0 i269 28266 - - - - - 0.2 4.6 41.0 i270 28267 - - - - - 0.7 97.0 290.0 i271 28268 - - - - - 0.4 1.2 1.7 i272 28269 - - - - - - 0.9 3.1 i273 28270 - - - - - 66.0 0.6 1.7 i274 28271 - - - - - 0.8 110.0 590.0 i275 28272 - - - - - 870.0 2400.0 110.0 i276 28273 - - - - - 130.0 210.0 310.0 i277 28274 - - - - - 36.0 250.0 136.0 i278 28275 - - - - - 39.0 84.0 270.0 i279 28276 - - - - - 270.0 32.0 1600.0 i280 28277 - - - - - - 404.0 - i281 28278 - - - - - 170.0 1400.0 2600.0 i282 28279 - - - - - 97.0 3800.0 1100.0 i283 28280 - - - - - 11.0 100.0 140.0 i284 28281 - - - - - 2.0 57.0 260.0 i285 28282 - - - - - 12.0 625.0 268.0 i286 28897 116.0 19.0 5.2 10.0 25.0 - - 2.8 i287 28898 4.7 19.0 3.9 2.9 25.0 - - 2.6 i288 28899 8.6 6.1 4.3 4.1 25.0 - - 133.0 i289 28900 20.0 5.6 4.2 4.4 25.0 - - 4.3 i290 28901 116.0 19.0 1.7 3.7 25.0 - - 1.3 i291 38707 166.7 6.6 8.8 4.8 3.7 - - - i292 38709 90.9-0.5 0.8 0.8 1.8-2.0 i293 38711 - - - - - 1.2 0.8 0.6 i294 38713 
155.8 3.3 0.9 2.3 1.6 0.5 0.4 0.6 i295 38715 90.9-0.6 2.6 0.7 3.6-1.9 i296 38717 7.9-3.8 5.4 1.8 2.8-3.7 i297 38719 155.8 11.0 4.9 3.0 2.0 0.2 0.3 0.5 i298 38721 2.2 1.6 1.7 1.3 1.2 1.1 0.4 0.4 i299 38723 - - - - - 3.5-4.3 i300 38725 90.9-2.4 2.7 6.3 2.6-2.2 i301 38727 77.1-0.9 3.6 4.0 1.3 2.0 3.6 i302 38729 7.1 6.6 4.3 1.5 1.9 8.8 20.0 114.0 i303 38731 4.9-1.2 0.8 1.8 2.1-1.7 i304 38733 - - - - - 4.5-1.5 i305 38735 90.9-0.7 1.7 1.3 1.1-1.8 i306 38739 90.9-1.4 3.2 3.9 6.5-5.6 i307 38741 166.7 6.7 3.3 3.1 2.3 0.2 0.3 0.3 i308 38743 90.9-0.7 1.0 0.7 4.8-1.9 i309 38745 90.9-0.7 4.6 2.2 11.7-3.0 i310 38747 8.8 3.2 1.6 3.3 2.3 - - - i311 38749 - - - - - 5.0-2.0 i312 38751 - - - - - 1.6 0.8 1.5 i313 38753 166.7 4.2 1.7 2.3 1.8 0.4 0.3 0.3 i314 38755 1.0 0.8 1.0 1.1 1.1 1.3 0.8 0.9 i315 38757 1.0 0.9 1.1 1.1 1.1 2.1 1.0 1.1 i316 38759 - - - - - 1.3-4.4 i317 38763 155.8 3.3 0.8 1.9 1.7 1.5 0.8 0.8 i318 38765 166.7 5.1 3.1 1.9 2.1 0.2 0.2 0.2 i319 38767 4.8 3.6 5.1 1.4 1.6 0.1 0.5 1.2 i320 38769 60.6-0.9 1.6 1.0 1.1-1.0 i321 38771 166.7 4.0 1.1 1.9 1.4 0.7 0.4 0.4 i322 38773 166.7 3.8 1.4 1.9 1.6 0.1 0.2 0.1 i323 38775 90.9-1.3 2.2 3.2 5.6-3.3 i324 38777 90.9-0.6 1.2 1.2 3.6-3.2 86

Appendix C. The Complete Datasets i325 38779 - - - - - 2.3-1.9 i326 38781 3.2 2.5 3.0 1.7 1.9 0.6 0.4 0.6 i327 38783 90.9-0.9 2.7 1.5 4.3-1.0 i328 38785 155.8 2.1 0.8 0.6 1.2 0.8 0.8 0.5 i329 38787 166.7 6.3 1.7 1.5 1.5 13.0 4.7 11.0 i330 38789 3.1-4.4 4.5 1.8 1.3-1.8 i331 38791 3.5 3.0 2.2 1.2 1.5 1.2 0.9 0.6 i332 38793 - - - - - 0.3 0.3 0.6 i333 38795 90.9-0.5 0.8 0.7 3.4-1.4 i334 38797 90.9-0.5 0.5 0.7 3.5-2.2 i335 38799 3.3 3.5 2.3 1.4 1.3 0.6 0.7 0.7 i336 38801 90.9-4.7 3.5 2.0 4.1-7.0 i337 38803 90.9-0.5 2.3 1.2 1.1-1.3 i338 38805 155.8 6.7 2.1 2.3 2.5 0.4 0.5 0.9 i339 38807 166.7 7.0 1.9 2.3 1.9 0.7 0.8 2.7 i340 38809 3.1 1.0 1.7 1.9 2.0 0.9-0.6 i341 38811 - - - - - 0.3 0.3 0.4 i342 38813 90.9-1.3 2.5 1.0 0.6-0.2 i343 38815 90.9 - - 2.3 0.8 3.7-3.5 i344 38817 - - - - - 0.3 0.2 0.2 i345 38819 90.9-0.6 1.4 1.3 6.9 1.5 2.4 i346 38821 155.8 9.0 1.9 2.3 1.7 0.8 0.6 1.7 i347 38823 90.9-1.1 7.4 3.4 2.4-4.0 i348 38825 166.7 4.9 1.7 2.2 2.2 0.4 0.4 0.5 i349 38827 - - - - - 0.5 0.5 0.7 i350 38829 90.9 1.3 0.9 2.3 1.6 3.1-1.3 i351 38831 90.9-0.8 1.9 1.6 4.6-2.5 i352 38833 90.9-1.0 1.9 2.1 4.3-9.2 i353 38835 - - - - - 0.2 0.2 0.4 i354 38837 90.9-1.3 1.6 11.4 23.5-12.3 i355 38839 90.9-0.8 2.5 1.7 2.6-1.5 i356 38841 90.9-0.5 0.5 1.5 4.2-0.7 i357 38843 5.4 3.7 1.9 1.7 2.0 0.1 0.3 1.1 i358 38845 166.7 4.6 1.8 2.1 1.5 0.2 0.3 0.4 i359 38847 7.0-12.3 2.9 2.1 4.1-0.9 i360 38849 90.9-1.4 2.9 2.4 4.1 0.8 8.9 i361 38851 1.9 2.0 2.0 1.5 1.5 0.4 0.4 0.6 i362 38853 155.8 5.0 2.4 2.7 2.6 0.6 0.2 0.2 i363 38855 - - - - - 1.5 1.6 3.1 i364 38857 3.9 - - 3.0 2.7 3.9-3.5 i365 38859 166.7 5.9 2.7 2.5 2.3 0.2 0.2 0.2 i366 38861 - - - - - 1.6 0.8 0.7 i367 38863 196.7 6.0 3.0 2.6 2.2 - - - i368 38865 155.8 4.4 0.9 2.4 1.7 1.0 0.9 0.9 i369 38867 2.0-0.5 1.3 0.8 11.4-4.1 i370 38869 90.9-0.9 0.7 0.8 5.4-3.1 i371 38871 166.7 5.1 1.8 2.1 1.8 0.2 0.3 0.6 i372 38873 0.9 0.7 0.8 1.0 1.1 2.3 0.9 1.0 i373 38875 84.2-3.2 1.4 2.3 3.8-1.7 i374 38877 9.0 6.6 8.8 7.0 5.4 0.5 0.7 0.6 i375 38879 81.4-0.5 2.3 
0.9 1.4 2.2 4.5 i376 38881 4.4 2.3 3.6 1.4 1.6 0.1 0.1 0.3 i377 38883 3.0 3.1 2.6 1.5 1.7 0.4 0.4 0.3 i378 38885 155.8 7.9 2.2 2.9 1.8 0.2 0.2 0.4 i379 38887 1.3 1.2 1.6 1.0 0.9 0.3 0.2 0.2 i380 38889 90.9-2.8 1.9 1.5 4.2-1.3 i381 38891 166.7 7.5 3.5 3.7 2.4 0.5 0.6 0.7 i382 38893 0.8-0.7 0.8 0.8 5.5-1.1 i383 38895 90.9-17.3 2.6 3.9 2.1-7.6 i384 38897 90.9-0.6 1.7 0.7 1.9-1.5 i385 38899 166.7 9.2 3.9 2.8 2.3 0.2 0.2 0.4 i386 38901 90.9-0.6 1.3 1.6 6.5-3.6 i387 38903 90.9-0.5 0.6 1.0 21.4-2.7 i388 38905 0.8-0.9 0.5 0.7 4.5-1.3 i389 38907 155.8 4.6 0.9 2.5 1.6 1.3 0.7 0.8 i390 38909 196.7 7.7 1.4 2.5 2.1 0.2 0.1 0.2 i391 38911 90.9-0.5 1.4 0.9 2.2-0.8 i392 38913 90.9-0.8 1.5 2.0 0.5-0.5 i393 38915 81.4-0.8 2.4 1.5 1.9 0.3 0.6 i394 38917 - - - - - 1.0 0.3 0.3 i395 38919 90.9-1.0 0.7 0.9 0.5-1.5 i396 38921 - - - - - 0.9-0.3 i397 38923 90.9-0.8 1.2 3.5 0.5-2.3 i398 38925 155.8 6.7 1.6 2.5 2.3 0.5 0.4 1.3 i399 38927 166.7 6.6 3.2 2.6 2.5 0.6 0.2 0.4 i400 38929 - - - - - 0.5 0.4 0.6 i401 39552 3.9 2.1 1.7-0.7-33.6 - i402 42158 80.0 5.3 2.3 1.9 1.7 0.4 0.3 0.4 i403 2930 78.4 4.0 0.9 1.4 0.8 4.0 2.5 3.8 i404 2905 2.1 1.4 1.5 3.0 1.5 1.0 1.2 2.1 i405 2997 100.0 4.3 1.4 1.5 1.2 - - - i406 39448 65.4 2.9 0.7 3.6 1.3 0.6 1.0 1.1 i407 43993 1.1 1.1 1.0 1.1 1.0 2.1 1.2 1.6 87

Appendix C. The Complete Datasets i408 43994 4.1 4.2 2.6 1.0 1.7 1.7 1.6 3.6 i409 43995 2.0 2.3 2.2 0.9 1.2 49.0 2.8 400.0 i410 43996 3.6 3.8 2.3 1.4 1.5 3.0 2.8 5.3 i411 43997 1.7 1.3 1.3 0.9 1.1 6.1 1.0 85.0 i412 2986 65.4 1.5 0.9 2.4 1.1 88.7 45.0 79.4 i413 43998 2.5 2.5 1.5 1.3 1.6 6.4 0.8 77.0 i414 44000 200.0 4.6 2.3 2.2 1.4 62.0 15.0 400.0 i415 44001 200.0 3.4 2.5 2.3 1.8 60.0 15.0 400.0 i416 44002 200.0 4.8 2.0 2.2 1.7 15.0 3.7 280.0 i417 44003 200.0 1.6 0.6 1.1 1.4 250.0 46.0 400.0 i418 44004 200.0 1.5 0.9 1.0 0.7 250.0 84.0 400.0 i419 44005 200.0 70.0 11.0 47.0 23.0 2.3 0.9 1.0 i420 44006 200.0 6.0 2.2 2.0 1.7 9.2 9.6 26.0 i421 44007 200.0 6.4 2.2 2.6 1.6 0.4 0.4 0.5 i422 44008 200.0 7.7 3.9 3.2 2.4 0.4 0.5 0.7 i423 44009 7.5 8.4 6.7 1.8 2.3 9.2 5.1 26.0 i424 44010 200.0 7.7 2.9 2.3 1.8 0.7 0.6 0.7 i425 44011 200.0 7.7 3.0 2.4 1.9 0.6 0.5 0.6 i426 44012 7.2 6.6 4.2 1.8 1.6 41.0 15.0 55.0 i427 44013 1.7 2.4 1.7 1.1 1.4 250.0 700.0 400.0 i428 44014 1.4 1.2 1.5 0.7 0.9 250.0 700.0 400.0 i429 44015 2.4 2.9 1.8 1.3 1.5 250.0 132.0 400.0 i430 44016 1.0 1.2 1.2 0.7 0.9 250.0 700.0 400.0 i431 44018 200.0 4.9 2.5 2.1 1.7-0.1 0.3 i432 44019 4.3 2.8 2.0 1.2 1.2 2.1 76.0 400.0 i433 44020 5.9 3.4 2.4 1.4 1.4 0.4 4.3 85.0 i434 44021 5.4 7.1 11.0 17.0 11.0 0.6 0.4 0.4 i435 44022 3.0 5.0 7.1 10.0 8.1 1.2 1.1 1.3 i436 44023 200.0 8.4 5.4 3.8 3.3 0.1 0.1 0.1 i437 44024 200.0 7.3 2.3 2.4 2.3 0.3 0.4 0.7 i438 44025 200.0 22.0 9.0 4.1 3.3-700.0 400.0 i439 44026 0.9 0.9 1.1 0.5 0.7 250.0 700.0 29.0 i440 44027 0.8 0.8 1.4 0.5 0.8 250.0 700.0 33.0 i441 44028 200.0 6.5 1.5 2.4 1.7 0.8 1.0 2.0 i442 44029 89.0 70.0 20.0 28.0 28.0 250.0 700.0 400.0 i443 44031 200.0 5.0 1.6 2.5 1.6 0.4 0.4 0.6 i444 44033 200.0 6.1 1.9 2.3 1.9 0.7 0.5 0.5 i445 44034 200.0 5.5 1.8 2.1 1.8 0.3 0.2 0.3 i446 44035 200.0 6.8 2.3 2.3 1.9 0.5 0.3 0.5 i447 44037 2.1 0.7 1.3 2.1 0.5 2.1 0.9 1.5 i448 44037 1.4 0.8 1.2 1.3 1.2 1.0 1.1 1.0 i449 44039 0.5 0.4 0.9 1.2 0.1 0.8 1.2 2.1 i450 44039 1.1 1.0 1.0 1.0 1.0 
1.9 0.8 1.1 i451 44041 0.7 0.8 0.4 0.7 2.0 3.5 2.8 4.0 i452 44041 0.8 0.9 0.8 1.0 0.8 0.9 0.8 1.1 i453 44043 1.4 1.4 1.1 2.0 0.6 1.6 1.0 1.4 i454 44043 1.2 1.1 1.0 0.9 0.9 0.9 0.7 0.8 i455 44045 0.5 0.4 0.3 1.7 0.1 0.2 0.7 1.8 i456 44045 1.1 0.8 0.8 1.0 1.0 2.1 1.2 1.2 i457 44047 0.8 1.3 0.8 0.7 0.5 1.1 2.4 2.4 i458 44047 1.1 1.0 1.1 1.1 1.0 0.8 0.9 0.9 i459 44049 - - - - - 1.2 0.4 0.3 i460 44049 - - - - - 1.6 0.8 0.7 i461 44051 - - - - - 3.7 2.4 4.0 i462 44051 - - - - - 2.0 7.1 2.9 i463 44053 2.5 2.1 0.8 0.6 0.6 0.4 1.8 52.3 i464 44053 4.7 5.1 2.6 1.4 1.3 1.0 4.5 625.0 i465 44055 0.5 2.2 2.6 3.1 1.1 1.2 2.7 3.4 i466 44055 1.1 1.4 1.1 1.0 0.9 0.3 0.4 0.5 i467 44057 1.4 0.4 0.3 0.3 0.3 1.0 2.0 1.3 i468 44057 1.0 0.8 1.2 0.5 1.2 3.3 1.8 2.4 i469 44059 2.7 1.0 0.9 2.1 1.9 4.1 1.0 1.4 i470 44059 2.0 0.7 1.2 1.0 1.1 0.5 0.4 0.3 i471 44061 - - - - - 2.8 1.0 1.8 i472 44061 - - - - - 1.5 0.6 0.8 i473 44063 0.8 3.0 1.9 1.2 1.7-3.3 5.3 i474 44063 0.9 0.8 0.7 0.9 1.0 1.8 1.0 1.4 i475 44065 8.5 2.0 1.1 1.4 1.6 1.8 1.2 2.5 i476 44065 2.0 1.7 2.1 1.2 1.2 0.1 0.2 0.1 i477 44067 - - - - - 4.4 2.5 2.7 i478 44067 - - - - - 1.5 0.8 0.8 i479 44069 3.1 2.4 1.7 1.9 2.2 4.0 3.5 2.1 i480 44069 1.2 1.6 1.1 1.0 0.9 1.3 0.9 0.9 i481 44071 1.1 0.5 1.2 1.4 3.0 4.5 1.7 1.3 i482 44071 1.3 1.0 1.1 1.2 1.1 1.2 0.8 0.8 i483 44073 1.0 0.5 0.9 2.5 0.4 5.3 2.5 2.1 i484 44073 1.3 0.7 1.0 0.9 0.9 1.2 0.9 0.8 i485 44075 7.3 2.7 5.9 1.7 1.7 110.6 283.3 45.0 i486 44075 4.7 1.9 1.7 1.1 1.8 19.7 101.4 893.0 i487 44077 1.5 0.4 0.8 0.3 0.9 1.9 1.2 8.4 i488 44077 1.1 0.9 0.8 0.9 0.9 0.5 0.5 0.4 i489 44079 0.3 0.3 0.3 1.2 0.2 6.4 1.7 1.0 i490 44079 1.1 0.7 1.2 1.3 1.3 2.9 1.4 0.7 88

Appendix C. The Complete Datasets i491 44081 39.8 2.7 1.8 3.3-2.5 3.7 5.1 i492 44081 113.0 1.9 1.0 1.3 1.7 1.4 0.8 1.4 i493 44083 - - - - - 0.5 0.7 1.2 i494 44083 - - - - - 0.6 0.7 0.9 i495 44085 39.8 0.6 1.3 3.9 0.7 0.7 1.1 1.4 i496 44085 183.0 2.5 0.8 2.0 1.3 1.0 0.6 0.5 i497 44087 0.4 1.0 2.1 3.3 0.9 2.7 2.9 0.5 i498 44087 1.7 0.9 1.4 1.7 1.2 18.5 5.2 7.8 i499 44089 38.3 1.4 0.4 1.3 1.0 160.3 322.8 75.3 i500 44089 183.0 2.9 0.7 1.8 1.5 94.3 24.4 45.9 i501 44091 0.3 0.7 0.6 0.3 0.2 2.2 0.6 0.4 i502 44091 1.0 0.8 0.9 0.9 1.0 1.3 0.9 1.1 i503 44093 0.6 0.7 0.9 0.8 0.4 1.4 1.3 1.0 i504 44093 1.2 0.9 0.9 0.9 0.9 0.4 0.5 0.4 i505 44095 0.8 1.3 1.5 1.2 0.6 2.6 1.3 1.0 i506 44095 1.3 1.1 1.0 1.1 1.1 2.1 1.4 1.9 i507 44097 2.6 2.4 2.0 4.2 2.4 4.7 2.7 1.7 i508 44097 1.1 1.1 0.9 0.9 1.0 0.6 0.7 0.5 i509 44099 1.2 0.4 1.0 1.6 1.0 - - - i510 44099 0.9 0.8 0.7 0.8 0.8 - - - i511 44101 1.0 0.5 1.0 0.7 0.4 9.5 2.0 4.0 i512 44101 1.2 0.8 0.7 0.8 0.7 0.8 0.6 0.5 i513 44103 0.7 1.2 1.3 1.7 0.4 0.7 0.6 0.9 i514 44103 1.0 0.7 1.0 1.0 1.1 0.4 0.6 0.6 i515 44105 1.2 0.4 0.7 0.6 0.3-2.2 7.5 i516 44105 1.2 0.8 1.2 1.1 1.1 3.8 1.8 3.5 i517 44107 0.5 0.4 0.3 0.7 0.2 2.4 1.0 0.5 i518 44107 0.9 0.7 0.8 0.6 0.7 0.4 0.3 0.2 i519 44109 0.5 0.5 0.6 2.8 1.6 0.3 0.6 - i520 44109 2.1 0.6 1.3 1.3 1.2 1.4 1.2 2.0 i521 44111 0.9 0.3 0.4 0.7 0.5 0.2 0.4 0.2 i522 44111 1.3 1.4 1.1 1.1 1.3 0.6 1.0 0.6 i523 45037 100.0 2.4 0.7 1.5 1.2 111.0 31.0 110.0 i524 45039 100.0 2.5 0.8 2.4 2.0 1.1 0.7 0.8 i525 45041 100.0 1.9 0.7 2.4 1.7 100.0 100.0 100.0 i526 45043 0.7 0.9 1.0 0.9 1.1 2.4 1.1 1.5 i527 45045 100.0 5.7 1.5 2.1 1.8 42.0 87.0 230.0 i528 45126 32.4 4.0 2.2 3.3 2.5 3.8 1.4 0.9 i529 45130 31.2-0.7 2.2 208.4 80.5 - - i530 45132 42.4 - - - - 685.3 2.8 - i531 45134 31.2 - - - - - - - i532 45136 1.2 0.7 - - - - - - i533 45138 25.1 - - - - - - - i534 45140 42.9-0.8 0.3 - - - 1.1 i535 45142 0.7 2.2-0.5 2.8 144.7 42.7 - i536 45144 13.2 14.9 - - - 482.3 11.2 - i537 45146 31.2 - - - - - - - i538 45148 34.6 - - - 
- - 0.2 1.1 i539 45150 1.6 0.4 0.1 - - 57.9 - - i540 45152 20.5 2.7 0.6 1.0 1.5 4.3 1.7 1.1 i541 45154 33.9 - - - - - - 0.4 i542 45156 42.9 - - - - - - - i543 45158 134.5 - - - - - - - i544 45160 7.5 0.5 0.5 0.4 0.8 17.2 7.1 57.9 i545 45162 0.8 0.4 0.8 0.5 0.1 384.4 0.5 - i546 45164 34.6 - - - - - - 1.3 i547 45166 23.4 - - - - - - - i548 45168 1.0 2.2 0.3 0.9 0.5 276.0-1.5 i549 45170 31.3 - - - - - 3.1 3.3 i550 45172 2.0 0.6 1.2 0.5-0.2 0.5 1.2 i551 45174 23.4-1.4 1.3 0.7 390.6 0.5 - i552 45176 41.5 - - - - 63.2 88.3 - i553 45178 50.3 - - - - 81.8 121.9 - i554 45180 39.4 - - - - - - - i555 45182 21.5-2.1 0.6 2.4 0.9 1.2 0.5 i556 45184 42.4 8.0 1.1 4.7 4.2 107.1 172.4 103.6 i557 45186 32.8 11.5 2.7 2.9 1.3 191.3 541.6 115.3 i558 45188 1.2 1.5 0.6 0.3 0.4 321.4 209.3 - i559 45190 18.9 3.1 3.0 3.6 5.9 - - - i560 45192 31.2 - - - - - - - i561 45194 30.1-3.6 2.1 - - - 1.0 i562 45196 3.3 1.4 0.5 1.0 1.7 165.9 100.9 - i563 45198 32.4 - - - - - - 0.6 i564 45200 32.4 - - - 0.7 0.9 0.6 0.4 i565 45202 25.6 - - - - - - 0.8 i566 45204 42.9 - - - - - - - i567 45206 32.4 - - - - - - - i568 45208 31.2 - - - - - - - i569 45210 50.3 - - - - - - - i570 45212 25.1-0.5 0.8 144.7 42.7 - - i571 45214 31.2 - - - - - - - i572 45216 29.5-0.2 0.3 0.2 0.4 - - i573 45218 2.0 0.4 1.0 1.6 0.7 2.6 1.7 46.6 89

Appendix C. The Complete Datasets i574 45220 1.8 1.3 38.1 3.5 7.2 173.6 92.7 - i575 45222 0.2 0.3 0.4 0.2 0.1 - - - i576 45224 4.4 0.9 0.3 - - - - - i577 45226 1.4 1.3 1.4 1.1 1.0 7.4 2.9 1.8 i578 45228 32.4 - - - 1.1 0.4 0.6 0.4 i579 45230 30.4 0.8 0.5 0.5 1.0 171.3 415.0 77.8 i580 45232 4.1 2.1 3.1 0.6 1.1 161.7 57.9 - i581 45234 - - - - - 59.4 398.4 49.5 i582 45236 0.3 0.2 - - - - - 0.6 i583 45238 32.4 - - - - 384.4 0.9 - i584 45240 3.5 1.6 0.5 1.0 1.0 6.3 421.2 1.4 i585 45242 55.0 - - - - - - - i586 45244 68.5 - - - - - - - i587 45246 32.4 - - - - - - - i588 45248 34.6 - - - - - - - i589 45250 38.6 - - - - - - 2.6 i590 45252 88.7 - - - - 59.1 62.7 1.2 i591 45254 1.3 0.8 1.2 0.5 1.3 230.0 48.3 - i592 45256 1.0 0.9 0.6 0.8 1.6 9.4 98.8 49.5 i593 45258 44.1 - - - - - - - i594 45260 42.9 - - - - - 4.5 1.3 i595 45262 2.3 1.6 1.7 0.7 0.4 0.9 0.5 2.2 i596 45264 0.2 1.1 0.6 0.6 0.7 - - - i597 45266 - - - - - 0.6 1.0 1.0 i598 45268 31.2 - - - - - - - i599 45270 31.2 - - - - - - - i600 45272 32.4 - - - - - - - i601 45274 1.2 0.7 2.0 1.6 0.5 0.9 0.6 - i602 45276 42.9 - - - - 19.8 209.3 - i603 45278 42.9 - - - - - 1.9 1.6 i604 45280 30.4 - - - - - - - i605 45282 30.1 - - - - 482.3 0.5 - i606 45284 31.3 - - - - - 90.4 - i607 45286 30.1 - - - 1.3 0.5 2.6 0.9 i608 45288 2.0 1.3 1.2 0.8 1.2 - - - i609 45290 32.4 - - - - - - 1.1 i610 45292 4.8 3.3 2.1 0.6 1.3 1.1 0.8 1.1 i611 45294 38.4 - - - - - - - i612 45296 31.2 - - - - - - 1.5 i613 45298 55.4 - - - - - - - i614 45300 42.4 - - - - 170.9 103.6 - i615 45302 37.1 - - - - - - - i616 45304 2.2 1.0 1.2 0.5 0.7 0.7 6.4 32.1 i617 45306 2.6-0.7 0.7 1.0 1.6 1.2 - i618 45308 1.8 1.1 0.7 1.1 1.2 - - - i619 45310 41.5 - - - - - - - i620 45312 2.1 2.2 1.7 0.1 1.1 - - - i621 45314 2.8 0.6 1.4 1.5 0.9 0.8 0.6 1.9 i622 45316 47.7 - - - - - - - i623 45318 50.3 - - - - 18.0 121.9 - i624 45320 30.4 - - - - - - - i625 45322 2.3 0.6 0.5 1.3 1.0-384.4 32.1 i626 45324 31.2 - - - - 31.6 80.5 - i627 45326 31.2 - - - - - - - i628 45328 44.1 - - - - - 
1.1 0.5 i629 45330 0.5 0.3 0.5 0.3 0.2 144.7 42.7 - i630 45332 32.4 0.3-1.1 0.7 2.1 0.2 - i631 45334 30.1 - - - - - - 1.9 i632 45336 31.2 - - - - - - - i633 45338 32.4 - - - - - - 0.8 i634 45340 0.7 1.2 0.7 0.1 - - - 1.9 i635 45342 1.2 2.1 1.2 0.4 0.7 208.4-2.5 i636 45344 - - - - - 321.4 209.3 - i637 45346 25.1 - - - - 365.5 0.7 - i638 45348 31.2 - - - - 455.5 0.3 - i639 45350 47.7 - - - - - - - i640 45352 2.7 1.6 0.4 0.7 1.0 176.2-2.5 i641 45354 38.4 - - - - - - - i642 45356 31.2 - - - - - - - i643 45358 4.2 5.5 4.9 1.2 1.0 0.4 0.5 4.1 i644 45360 38.4 - - - - 45.2 57.9 - i645 45362 3.3 0.7 1.4 0.4 0.8 0.2 0.4 - i646 45364 42.9 - - - - - - 0.9 i647 45366 32.4 - - - - - - - i648 45368 37.5 - - - - - 0.7 0.1 i649 45370 30.4 - - - - - - - i650 45372 0.3 0.2 - - - - - - i651 45374 25.1 - - - - - - - i652 45376 65.6 - - - - - 1.3 3.1 i653 45378 47.7 - - - - - - - i654 45380 42.9 - - - - 91.7 209.3 - i655 45382 31.2 - - - - - - - i656 45384 47.7 3.8 2.3 2.7 2.0 7.1 4.6 4.1 90

Appendix C. The Complete Datasets i657 45386 32.4 - - - - - - - i658 45388 1.0 0.6 0.8 0.5 0.5 1.0 0.8 2.5 i659 45390 0.5 1.2 1.4 0.9 0.4 114.2-1.7 i660 45392 0.5 0.4 0.3 0.6 0.5 10.8 16.7 49.5 i661 45394 46.3 6.9 0.8 3.1 1.8 265.2-0.3 i662 45396 5.7 0.9 0.7 0.5 0.7 114.1 58.7 44.5 i663 45398 31.3 - - - - 456.2 0.8 - i664 45400 32.4 - - - - - - - i665 45402 3.8 1.4 - - - 685.3 7.0 - i666 45404 - - - - - 0.4 0.4 - i667 45406 41.5 - - - - - - - i668 45408 32.4 - - - - 357.1 0.3 - i669 45410 8.4 3.7 3.7 3.3 1.8 73.1 621.3 4.0 i670 45414 2.6 5.4 1.8 1.3 1.9 1.4 0.9 - i671 45416 32.4 - - - - - - - i672 45418 32.4 - - - - 357.1 0.3 - i673 45420 30.1 - - - - - - - i674 45422 42.9 - - - - 502.4 2.0 - i675 45424 32.4 0.5 0.2 0.7 0.3 1.4 0.6 0.6 i676 45426 1.3 1.2 0.4 0.1-0.3 - - i677 45428 32.4-2.1 0.1 - - - - i678 45430 - - - - - 2.8 1.5 0.7 i679 45434 62.1 - - - - 36.5 42.7 6.0 i680 45436 0.8 4.5 2.2 0.8 1.0 321.4-2.9 i681 45438 32.4-1.0 0.4 0.4 0.5 0.5 0.3 i682 45440 32.4 - - - - 5.8 32.1 - i683 46680 17.4 31.3 15.2 3.3 4.0 10.1 365.0 820.2 i684 46682 1.0 0.7 1.1 1.1 1.1 2.1 1.3 1.6 i685 46684 2.3 1.3 1.7 1.6 1.3 24.0 20.7 48.6 i686 46686 0.7 0.7 1.0 0.9 0.8 107.5 31.6 94.2 i687 46688 174.8 4.9 1.8 2.4 1.6 0.2 0.2 0.5 i688 46690 174.8 4.0 1.4 2.2 1.2 0.1 0.2 0.3 i689 46692 0.9 0.7 0.8 0.9 0.6 2.8 1.5 1.4 i690 46694 - - - - - 1.3 31.6 820.2 i691 46696 2.4 2.3 1.7 1.1 0.8 0.1 0.5 0.8 i692 46698 15.3 4.0 2.0 1.8 1.3-1.5 20.3 i693 46700 174.8 7.7 1.1 2.0 1.3 0.2 0.2 0.4 i694 46702 174.8 13.9 3.8 4.3 2.9 - - - i695 46704 174.8 5.1 1.4 2.1 1.5 0.6 0.4 0.8 i696 46706 - - - - - 1.6 1.0 1.1 i697 46708 174.8 8.4 2.5 2.9 1.6 0.2 0.3 1.3 i698 46710 10.6 5.8 6.1 3.2 2.4 64.0 4.0 381.9 i699 46715 1.2 1.1 0.7 1.8 0.9 - - 1.3 i700 46717 1.3 1.2 1.4 1.0 1.0 - - 1.2 i701 46718 49.4 6.9 2.6 7.3 2.4 - - 2.1 i702 46719 1.7 0.5 1.0 1.1 0.8 - - 2.7 i703 46720 0.7 0.9 1.4 1.0 0.9 - - 0.7 i704 46721 34.4 2.8 0.8 5.1 1.7 - - 0.8 i705 46726 0.5 0.1 0.7 0.7 0.3 - - 0.9 i706 46728 0.5 0.9 1.6 0.9 
0.5 - - 2.4 i707 46729 34.4 1.8 0.8 1.1 1.1 - - 4.1 i708 46730 0.8 0.4 0.3 0.6 0.4 - - 0.4 i709 46731 0.1 0.1 0.1-1.1 - - 0.6 i710 46732 38.1 1.1 0.2 0.7 0.6 - - 0.8 i711 46736 0.1 0.2 0.2 0.1 0.2 - - 0.4 i712 46737 30.2 1.8 0.6 1.2 1.2 - - 3.2 i713 46739 0.8 1.1 1.0 1.2 0.9 - - 3.9 i714 46740 19.6 1.0 0.3 0.5 0.3 - - 1.2 i715 46743 0.4 2.0 1.2 1.1 0.7 - - 4.4 i716 46744 27.9 0.7 0.2 0.9 0.7 - - 0.8 i717 46746 0.4 0.3 0.6 0.3 0.8 - - 4.5 i718 46748 0.9 1.3 1.3 0.9 1.0 - - 0.7 i719 46749 54.9 0.8 0.2 0.3 0.3 - - 0.5 i720 46757 45.0 4.2 1.8 5.1 4.5 - - 2.4 i721 46762 45.0 3.3 0.8 1.8 1.9 - - 3.4 i722 46763 1.2 1.0 1.0 1.1 0.2 - - 1.4 i723 46764 53.1 2.7 0.8 1.6 3.8 - - 2.1 i724 46765 1.1 1.0 1.6 1.1 0.8 - - 0.5 i725 46766 1.1 0.8 0.8 0.8 1.1 - - 1.2 i726 46771 0.1 0.4 0.2 0.7 0.4 - - 1.4 i727 46773 30.2 5.8 3.5 3.4 3.7 - - - i728 46774 7.2 2.8 2.7 0.8 0.9 - - - i729 46775 12.6 3.8 5.6 1.3 2.1 - - - i730 46778 1.0 0.8 1.0 0.8 1.0 - - 4.3 i731 46786 1.7 0.7 0.6 1.5 1.1 - - 4.6 i732 46787 34.4 2.6 0.9 1.1 1.2 - - 2.0 i733 46788 0.2 0.5 0.5 0.2 0.4 - - 0.4 i734 46789 74.0 0.5 0.1 0.5 1.0 - - - i735 46794 30.2 2.1 1.4 1.4 1.4 - - 4.4 i736 46799 0.5 0.8 1.2 0.6 0.6 - - 1.6 i737 46803 0.4 0.3 0.5 0.8 0.4 - - 0.9 i738 46804 24.0 1.9 0.7 1.3 0.8 - - 0.9 i739 46805 1.7 0.5 0.6 1.0 1.6 - - 2.0 91

Appendix C. The Complete Datasets i740 46811 0.8 0.9 0.9 0.5 2.4 - - 2.1 i741 46812 0.5 0.4 0.4 0.8 0.7 - - 0.7 i742 46813 10.4 0.5 0.1 0.9 0.7 - - 0.6 i743 46819 1.5 1.3 3.2 1.1 1.1 - - 5.3 i744 46822 0.7 0.5 0.3 0.8 0.4 - - 1.3 i745 46823 46.5 1.2 0.3 3.6 1.5 - - 2.1 i746 46824 69.9 4.4 0.8 2.8 1.6 - - 3.1 i747 46826 - - 0.5 0.1 0.1 - - 0.7 i748 46827 63.9 0.9 0.6 1.0 1.6 - - 2.0 i749 46828 0.3 0.3 0.2 0.2 0.2 - - 1.2 i750 46829 49.6 1.5 0.5 1.0 1.0 - - 1.4 i751 46833 0.3 0.6 0.6 0.9 0.7 - - 0.9 i752 46834 49.1 1.8 0.3 0.5 0.6 - - 0.6 i753 46835 0.9 1.0 1.1 0.7 1.6 - - 2.0 i754 46842 45.0 2.8 0.6 1.2 1.0 - - - i755 46844 0.5 0.5 0.6 0.4 0.3 - - 0.8 i756 46845 1.8 1.5 1.6 1.0 1.1 - - 0.8 i757 46846 1.2 1.1 1.0 1.7 0.8 - - 1.4 i758 46847 0.8 0.7 0.5 0.9 0.8 - - 2.0 i759 46848 0.5 0.4 0.8 1.0 0.6 - - 3.2 i760 46849 27.9 1.7 0.8 0.8 1.2 - - 10.4 i761 46850 0.1 0.3 0.3 0.2 0.1 - - 1.0 i762 46852 0.4 0.4 0.4 0.7 0.4 - - 0.8 i763 46854 30.2 1.4 0.5 1.1 0.8 - - 2.6 i764 46858 1.6 1.1 0.8 1.9 1.2 - - 4.1 i765 3819 65.4 2.4 3.0 2.5 0.6 1.3 1.1 1.1 i766 3900 65.4 2.8 0.6 2.4 1.7 0.3 0.6 0.8 i767 3831 66.8 3.7 2.9 2.7 1.5 1.4 17.7 105.9 i768 3827 - - - - - 0.9 0.7 0.5 i769 3894 1.4 1.9 0.8 0.4 0.6 0.1 0.5 0.7 i770 3884 65.4 3.9 1.0 3.6 2.4 3.6 1.2 2.4 i771 3811 4.0 3.9 7.9 1.5 3.5 1.2 0.5 2.6 i772 3817 78.4 12.0 - - 3.7 1.1 10.7 107.8 i773 3904 - - - - - 2.8 2.8 8.4 i774 3959 78.4 12.3 2.0 3.3 2.9 0.5 1.2 1.7 i775 40016 78.4 25.3 6.4 13.0 27.3 2.8 2.2 1.3 i776 3929 65.4 2.8 1.2 4.6 1.7 0.9 0.6 0.7 i777 3927 69.0 3.5 0.9 2.1 2.2 1.2 0.7 0.5 i778 3964 44.0 7.9 4.7 1.7 9.6 0.6 0.8 0.5 i779 3838 78.4 19.8 8.0 8.8 6.8 133.0 126.0 107.8 i780 40561 78.4 2.6 1.4 1.4 2.2 0.6 0.7 1.7 i781 3967 78.4 1.9 1.8 5.4 7.0 0.8 0.9 3.3 i782 3892 69.0 12.7 3.3 5.7 4.2 0.2 0.5 0.5 i783 3890 - - - - - 1.1 1.2 2.1 i784 39476 65.4 5.3 0.3 1.3 0.7 59.0 363.4 109.7 i785 3886 - - - - - 0.6 0.8 3.1 i786 3925 - - - - - 1.0 1.0 0.4 i787 3923 78.4 4.0 2.2 2.6 1.9 1.8 31.5 107.8 i788 4471 - - - - - - - - 
i789 3919 44.0 4.6 4.8 1.5 6.4 0.4 0.5 0.5 i790 3849 78.4 5.7 3.7 3.6 2.6 1.1 1.3 2.5 i791 3851 68.8 3.7 0.7 2.8 3.3 1.4 2.0 3.6 i792 3855 68.8 5.9 6.1 2.9 2.9 14.8 2.2 3.0 i793 3853 71.8 9.5 4.0 2.4 2.5 8.0 2.6 2.7 i794 41683 - - - - - - - - i795 41620 78.4 8.2 6.0 1.9 6.9 - - - i796 3868 4.0 2.6 1.5 1.5 1.8 0.6 0.6 0.5 i797 3866 - - - - - 1.1 0.6 0.3 i798 3870 78.4-1.7 3.0 1.2-1.7 6.0 i799 40564 33.6 0.9 0.6 1.5 1.0 0.4 0.8 0.9 i800 3896 65.4 6.1 1.3 3.1 2.3 0.8 0.6 0.6 i801 3917 - - - - - 1.0 2.9 0.9 i802 50506 - - - - - 5.6 270.0 600.0 i803 50508 - - - - - - 270.0 600.0 i804 50510 1.0 1.0 0.9 0.8 1.1 0.8 0.7 0.9 i805 50512 200.0 3.8 0.8 2.2 1.5 2.7 1.5 2.3 i806 50514 1.2 0.8 1.0 0.9 1.0 1.3 0.9 0.8 i807 50598 - - - - - - 270.0 600.0 i808 52823 200.0 23.9 9.0 5.3 4.4 0.4 0.3 0.7 i809 52825 200.0 15.1 10.9 45.2 19.7 3.0 1.3 1.3 i810 52827 200.0 3.2 1.0 1.4 1.2 0.3 0.3 0.5 i811 52829 7.0 4.1 3.3 1.4 1.7 15.3 201.3 400.0 i812 52831 7.8 6.6 6.9 2.2 2.3 0.1 0.1 0.3 i813 52833 2.8 1.6 3.2 0.8 0.8 1.0 10.0 400.0 i814 52835 200.0 6.1 1.7 1.8 1.6 30.8 19.5 42.6 i815 52837 200.0 2.7 1.5 1.5 1.3 63.1 30.4 51.8 i816 52839 4.2 2.3 2.3 0.9 1.1 0.3 0.3 0.5 i817 52841 200.0 2.5 1.3 1.3 1.2 0.1 0.2 0.1 i818 52892 - - - - - 0.2 0.3 0.4 i819 52894 1.0 1.5 1.3 1.0 1.2 103.4 273.5 22.1 i820 52898 143.0 25.0 11.0 35.0 15.0 2.0 1.0 1.0 i821 52906 79.2 6.6 1.3 2.6 1.6 0.1 0.2 0.2 i822 52908 123.2 6.4 2.7 2.3 2.6 0.3 0.5 0.6 92

Appendix C. The Complete Datasets i823 52910 109.1 7.2 5.3 3.8 2.7 0.4 0.2 0.2 i824 52912 111.6 5.5 2.1 2.6 1.6 27.4 1.7 483.4 i825 52914 121.5 5.8 2.0 2.2 2.0 0.7 0.7 0.9 i826 52916 87.6 4.7 1.1 2.0 1.6 1.4 0.7 0.8 i827 53157 0.5 0.5 1.1-0.2 - - - i828 53158 20.3 5.4 2.3-2.2 - - - i829 53159 0.7 0.3 1.2-1.2 - - - i830 53160 8.0 1.5 0.6-4.0 - - - i831 53162 0.9 1.0 2.1-1.4 - - - i832 53163 15.2 6.9 1.0-4.3 - - - i833 53166 1.5 0.2 2.4-1.2 - - - i834 53167 13.3 2.6 2.6-2.3 - - - i835 53168 1.8 0.9 1.7-1.8 - - - i836 53174 2.4 0.5 0.5-0.7 - - - i837 53175 5.9 4.6 5.5-10.1 - - - i838 53589 - - - - - 56.0 15.0 60.0 i839 53590 - - - - - 36.0 1.1 60.0 i840 53591 - - - - - 42.0 1.1 3.0 i841 53592 - - - - - 100.0 23.0 60.0 i842 53593 - - - - - 100.0 43.0 60.0 i843 53594 - - - - - 100.0 3.3 60.0 i844 53595 - - - - - 100.0 84.0 60.0 i845 53852 0.3 0.7 1.4 1.2 1.3 2.2 1.2 1.0 i846 53856 0.3 0.4 0.5 0.8 0.7 2.9 0.8 1.1 i847 53861 0.9 1.8 1.1 1.4 0.7 2.0 1.1 3.1 i848 53865 1.6 1.9 1.9 4.0 4.2 3.5 3.7 0.7 i849 53867 0.7 1.3 1.2 1.3 4.0 3.6 3.9 3.2 i850 53880 0.6 0.7 0.7 0.8 1.1 2.5 0.8 1.8 i851 53884 0.7 0.8 0.5 1.1 1.3 2.9 1.3 4.3 i852 53888 0.5 1.3 1.3 1.0 1.0 5.3 1.8 11.9 i853 53890 1.1 0.7 0.9 0.4 0.3 1.2 0.8 1.5 i854 53908 0.6 1.0 0.8 2.6 1.8 6.6 0.6 3.3 i855 53914 0.4 0.8 0.9 1.1 1.2 1.3 2.0 0.6 i856 53921 0.7 2.1 3.1 1.2 1.1 3.9 2.1 3.3 i857 53923 1.4 1.5 2.2 2.4 6.5 4.3 3.9 1.8 i858 53931 0.7 0.4 0.6 0.8 1.2 2.1 0.6 0.5 i859 53933 0.7 1.3 1.2 0.7 1.0 9.3 2.9 4.4 i860 53950 0.6 0.6 0.5 0.7 1.0 2.9 1.2 3.5 i861 54190 9.7 2.3 1.6-1.6 - - - i862 54191 9.0 2.4 1.8-1.6 - - - i863 54192 6.3 1.6 1.4-1.3 - - - i864 54269 1.3 2.4 2.3 2.9 1.8 - - - i865 54270 1.6 1.6 1.6 1.2 0.8 - - - i866 54271 2.7 2.0 1.4 1.4 1.5 - - - i867 54272 3.1 0.9 1.6 0.8 0.9 - - - i868 54273 22.0 4.8 3.2 2.7 2.1 - - - i869 54274 8.0 5.0 5.1 1.9 1.4 - - - i870 54275 6.6 3.2 3.0 1.8 1.3 - - - i871 54276 32.0 14.0 5.7 1.8 4.8 - - - i872 54277 14.0 7.4 4.9 3.1 2.2 - - - i873 54278 15.0 7.5 7.5 2.5 2.8 - - - 
i874 54279 62.0 6.3 3.0 3.4 4.3 - - - i875 54280 62.0 10.0 4.4 3.2 2.2 - - - i876 54286 - - - - - 1.0 8.0 75.0 i877 54287 - - - - - 1.0 47.0 300.0 i878 54288 - - - - - 1.0 1.0 1.0 i879 54289 - - - - - 2.0 11.0 19.0 i880 54290 - - - - - 5.0 2.0 64.0 i881 54291 - - - - - 21.0 1.0 744.0 i882 54292 - - - - - 35.0 3.0 161.0 i883 54357 - - - - - 1.2 1.1 2.4 i884 54358 - - - - - 14.0 7.5 29.1 i885 54359 - - - - - 10.8 6.5 35.0 i886 54360 - - - - - 12.9 1.3 8.5 i887 54361 - - - - - 9.5 1.0 - i888 54362 - - - - - 2.2 1.3 1.9 i889 54363 - - - - - 2.9 1.5 1.6 i890 54403 120.0 5.5 1.3 1.9 2.5 180.0 180.0 98.0 i891 54404 120.0 4.5 1.5 2.2 1.5 180.0 180.0 78.0 i892 54405 8.2 3.6 1.6 1.9 1.5 0.6 3.5 180.0 i893 54601 0.8 0.6 0.8 0.8 0.9 0.3 0.4 0.6 i894 54602 200.0 4.6 1.0 2.3 1.9 50.0 1.0 123.0 i895 54603 200.0 3.7 1.3 2.2 1.4 11.0 10.0 24.0 i896 54604 64.0 11.0 20.0 28.0 22.0 1.5 0.9 1.1 i897 54605 10.0 7.3 7.1 2.2 2.4 9.2 5.6 29.0 i898 54617 - - - - - - - - i899 54618 - - - - - - - - i900 54619 - - - - - - - - i901 54620 - - - - - - - - i902 54621 - - - - - - - - i903 54622 - - - - - - - - i904 54623 - - - - - - - - i905 54624 - - - - - - - - 93

Appendix C. The Complete Datasets i906 54625 - - - - - - - - i907 54626 - - - - - - - - i908 54627 - - - - - - - - i909 54641 - - - - - - - - i910 4433 - - - - - 1.1 0.7 0.7 i911 4483 100.0 8.8 1.9 2.9 2.1 0.2 0.3 0.4 i912 41178 3.0 2.2 2.1 1.0 1.3 21.2 5.8 563.5 i913 4388 100.0 8.1 1.8 2.8 1.7 0.5 0.5 0.6 i914 4479 0.8 0.9 0.7 0.4 1.3 3.1 119.8 98.4 i915 4539 200.0 3.7 2.0 1.8 1.3 - - - i916 4663 100.0 7.1 1.3 2.8 1.9 - - - i917 4697 100.0 5.9 1.3 1.9 1.7 0.8 0.5 0.5 i918 5222 200.0 5.7 1.2 3.1 2.0 0.5 0.3 0.3 i919 63 - - 1.2 3.0 - - - - i920 64 - - - - - - - - i921 65 - - 0.8 1.0 - - - - i922 71 - - 1.0 1.0 - - - - i923 5465 109.1 7.0 2.7 2.3 2.1 50.6 17.4 42.0 i924 5682 - - - - - 11.0 154.0 400.0 i925 6025 - - - - - 0.2 0.2 0.3 i926 6029 7.3 18.0 16.0 4.1 4.8 250.0 700.0 79.0 i927 6139 2.4 3.8 1.0 4.0 7.5 - - - i928 6138 - - - 1.0 1.3 - - - i929 6137 - - - 15.0 8.6 - - - i930 2378 - - - 4.0 1.0 - - - i931 6149 5.2 3.0 2.0 2.0 3.0 - - - i932 6148 0.6 2.1 1.0 1.0 1.0 - - - i933 6147 70.0 2.9 2.0 3.0 2.0 - - - i934 6146 12.0 3.6 1.0 4.0 4.0 - - - i935 6145 70.0 6.9 1.0 8.0 5.0 - - - i936 6144 70.0 10.0 1.0 7.0 6.0 - - - i937 6143 70.0 8.5 2.0 5.0 4.0 - - - i938 6142 1.3 4.0 1.0 2.0 4.0 - - - i939 6141 70.0 6.8 2.0 3.0 5.0 - - - i940 6140 70.0 11.0 2.0 4.0 5.0 - - - i941 6312 - - - - 8.0 - - 1.0 i942 6311-1.0 - - 1.0 - - 1.0 i943 6313 - - - - 0.9 - - 240.0 i944 6314 - - - - 6.4 - - 80.0 i945 6315 - - - - 7.1 - - 1.0 i946 6316 - - - - 7.1 - - 60.0 i947 6317 - - - - 1.1 - - 1.0 i948 6318 - - - - 0.4 - - 200.0 i949 6319 - - - - 1.4 - - 1.0 i950 6320 - - - - 1.0 - - 112.0 i951 6321 - - - - - - - 100.0 i952 6412 - - - - - - - - i953 6413 - - - - - - - - i954 6414 - - - - - - - - i955 6415 - - - - - - - - i956 6416 - - - - - - - - i957 6417 - - - - - - - - i958 6418 - - - - - - - - i959 6419 - - - - - - - - i960 6420 - - - - - - - - i961 6421 - - - - - - - - i962 6422 - - - - - - - - i963 6423 - - - - - - - - i964 6424 - - - - - - - - i965 6425 - - - - - - - - i966 6426 - 
- - - - - - - i967 6427 - - - - - - - - i968 6428 - - - - - - - - i969 6429 - - - - - - - - i970 6430 - - - - - - - - i971 6431 - - - - - - - - i972 6432 - - - - - - - - i973 6461 - - 1.0 1.0 1.0 - - - i974 6462 - - 1.7 18.0 6.0 - - - i975 6463 - - 7.6 19.0 5.5 - - - i976 6464 - - 3.3 15.0 - - - - i977 6465 - - 1.1 15.0 - - - - i978 6466 - - 2.2 30.0 - - - - i979 6467 1.0 - - 1.0 1.0 - - 1.0 i980 6468 2000.0 - - 3.0 2.8 - - 1.0 i981 6469 - - - - - - - 150.0 i982 6470 - - - 4.0 - - - 126.0 i983 6471 - - - 3.0 2.4 - - 1.0 i984 6472 - - - 6.0 3.9 - - 1.0 i985 6474 - - - 4.0 2.1 - - 80.0 i986 6475 - - - 2.0 1.9 - - 1.0 i987 6476 - - - 4.0 2.4 - - - i988 6477 - - - 4.0 1.6 - - 46.0 94

Appendix C. The Complete Datasets i989 6519 200.0 2.7 0.8 1.6 1.3 34.0 18.0 25.0 i990 6569 3.9 5.3 4.2 1.7 1.8 176.0 45.6 480.2 i991 40816 87.6 5.5 2.0 1.9 1.5 0.9 13.2 416.5 i992 41011 0.8 0.8 1.0 0.8 0.9 2.7 1.9 3.2 i993 6796 7.1 7.0 4.4 1.7 2.4 0.1 0.2 0.3 i994 6820 4.8 1.6 1.8 1.1 1.4 1.9 4.3 400.0 i995 6859 1.4 2.3 2.0 0.8 1.0 0.9 0.4 0.6 i996 6876 100.0 9.6 3.3 3.1 2.1 0.3 0.3 0.5 i997 7297 - - - 1.0 1.0 - - - i998 7298 - - - 5.4 1.1 - - - i999 7299 - - - 4.0 1.0 - - - i1000 7305 - - - 1.4 1.0 - - - i1001 7304 - - - 4.0 1.0 - - - i1002 7303 - - - 1.3 1.3 - - - i1003 7302 - - - 2.0 1.0 - - - i1004 7301 - - - 3.6 1.5 - - - i1005 7300 - - - 1.0 0.8 - - - i1006 7311 - - - - - - - - i1007 7314 - - - - - - - - i1008 7318 - - - - - - - - i1009 7328 3.6 1.7 1.5 0.9 1.1 5.4 24.0 400.0 i1010 7344 100.0 18.5 11.2 6.3 3.5 - - - i1011 7347 100.0 9.9 3.9 4.0 2.1 0.7 0.5 0.6 i1012 7348 100.0 7.7 2.3 2.3 1.8 0.7 0.4 0.3 i1013 7350 100.0 4.6 1.0 2.2 1.6 1.0 1.0 1.1 i1014 40017 100.0 20.0 11.3 65.4 20.1 1.4 0.8 0.6 i1015 7360 100.0 6.7 1.6 2.2 1.4 0.6 0.6 0.9 i1016 7364 - - - - - - - - i1017 7376 100.0 8.5 1.7 2.6 1.7 0.4 0.4 0.6 i1018 7377 100.0 29.3 10.1 9.0 4.1 - - - 95

Appendix D Original Decision Trees

Given below are the decision trees for ZDV (a), ddC (b), ddI (c), d4T (d), 3TC (e), ABC (f), NVP (g), DLV (h), EFV (i), SQV (j), IDV (k), NFV (m), and APV (n), as presented in [1]. Image taken from http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=123057&rendertype=figure&id=f3. Note that tree (a) is unable to offer a classification because, apart from one, no labels are attached to its leaves.
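To clarify how such trees classify a genotype: each internal node tests for the presence of a resistance-associated mutation, and each labelled leaf assigns a resistance category. The rules below are purely illustrative (the mutation positions and thresholds are hypothetical, not the trees of [1]):

```python
# Illustrative hand-coded decision tree: internal nodes test for a
# mutation in the genotype, leaves return a resistance label.
# The specific rules are hypothetical, not those published in [1].

def classify(mutations):
    """Walk a small decision tree over a set of observed RT mutations."""
    if "215Y" in mutations:
        return "resistant" if "41L" in mutations else "intermediate"
    return "resistant" if "151M" in mutations else "susceptible"

print(classify({"215Y", "41L"}))  # prints "resistant"
```

A tree whose leaves carry no labels, as in (a), would reach a terminal node but have no category to return, which is why it cannot classify.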