Phylogenetic Analysis of HIV Samples from a Single Host


Master Thesis
Rounak Vyas
November 20, 2011
Advisors: Prof. Niko Beerenwinkel, Dr. Osvaldo Zagordi
Computational Biology Group, ETH Zürich

Contents

Introduction
    AIDS
    HIV
    Longitudinal Studies
    HIV Sequencing
    Recent Studies of HIV-1
Materials and Methods
    Patient History
    Data Pre-processing
    Entropy Analysis
    Recombination Analysis
    Molecular Clock Estimation
    Poisson Fitter
    Sliding MinPD
    Rate of synonymous and nonsynonymous substitutions
Results
    Entropy Analysis
    Molecular Clock Rate Estimation and Phylogenetic Tree Construction
    Founder Virus Analysis
    Demographic Reconstruction
    Sliding MinPD
Conclusions
Bibliography

Chapter 1 Introduction

The earliest well-documented incident of AIDS dates back to the 1980s. Since then, more than 25 million people have died from AIDS [4]. The World Health Organization has declared AIDS a pandemic, and efforts are underway around the world to identify an effective cure for this condition.

1.1 AIDS

As the name suggests, Acquired Immunodeficiency Syndrome (AIDS) is a condition in which the patient's immune system becomes severely compromised, enabling opportunistic infections such as pneumonia, tuberculosis, herpes, and others. These infections ultimately result in the death of the patient. Clinically, AIDS is described as an advanced stage of an infection caused by Human Immunodeficiency Virus (HIV) in which the CD4+ cell count drops below a critical level. CD4+ cells are a special class of white blood cells that play a major role in recognizing foreign antigens (such as bacteria and viruses) within the body [12]. In their absence, the immune system cannot recognize and clear these foreign agents, which leads to sustained infections.

HIV infection can only be contracted from another infected individual through the exchange of body fluids such as blood, genital fluids and breast milk. Such exchange takes place when individuals share injection needles or engage in unprotected sexual intercourse. The infection cannot be acquired by ingestion of the virus. An individual may remain HIV positive for several years before developing AIDS. After the onset of AIDS, the patient's life expectancy shortens to 8 to 12 months [19]. Current therapies significantly increase the life span of infected individuals by delaying the onset of AIDS. Most of these

therapies act by interfering with one of the crucial steps in the replication, entry or release of the virus from the infected cell. However, owing to its unusually high mutation rate, HIV quickly develops resistance against these therapies and flourishes again. Hence there is presently no cure for AIDS [29]. AIDS is not a disease that targets a specific organ, but a condition characterized by progressive immune failure leading to infections in several organs.

To develop an effective therapy, it is imperative to gain insight into the processes through which the virus evolves to establish a persistent population within the host. Of particular interest are the evolutionary patterns the virus follows while subjected to the selective pressures of the host immune system, and the development of viral drug resistance. A detailed understanding of these processes may inform the development of effective treatment strategies.

1.2 HIV

The Human Immunodeficiency Virus belongs to the family of retroviruses [3], RNA viruses that use reverse transcriptase to transcribe their genetic material into DNA within a host cell. It is known to cause Acquired Immunodeficiency Syndrome [34]. There are two prominent types, HIV-1 and HIV-2, which differ in their virulence, infectivity, and prevalence [15]. They originated from the Simian Immunodeficiency Virus subtypes cpz and smm, which infect chimpanzees and Old World monkeys respectively. In this report we focus on HIV-1, as this is the virus type with which both patients were infected.

Figure 1.1: Diagram of Human Immunodeficiency Virus [1]

The virus carries two identical copies of the complete genome, encoded on two positive-sense single-stranded RNA molecules, and consists of nine genes encoding 19 viral proteins. The viral core is composed of the capsid protein (CA, p24), the matrix protein (MA, p17) and p6. Following reverse transcription of the viral genome by reverse transcriptase after host infection, the newly produced DNA is incorporated into the host genome by viral integrase [25]. The pre-proteins encoded by the viral genome are converted to fully functioning HIV proteins by protease. Following host infection, RNase H breaks down the retroviral genome.

The HIV genome codes for a series of proteins serving structural and regulatory functions. Structural proteins include gp120, which lies outside the virus particle, and gp41, which lies just inside the membrane and serves as a membrane anchor for gp120. Tat (transactivator) is a regulatory gene that accelerates the production of viral progeny and is known to be crucial for HIV replication. Rev stimulates the production of HIV proteins but suppresses the expression of other regulatory genes of HIV. Nef (negative replication factor gene) encodes proteins that are exposed to the cytoplasm of the host cell and are necessary for viral spread and disease progression by down-regulating CD4. Vif encodes the Viral Infectivity Factor, found inside the virus, which is responsible for the rapid spread of the virus. Vpr (Viral protein R) accelerates the production of HIV proteins and interferes with the host cell cycle, thus inhibiting cell division. Vpu (Viral protein U) helps in assembling new virus particles and in budding from the host cell, and accelerates the degradation of CD4 proteins.

Figure 1.2: HIV-1 genome HXB2 strain [17]

HIV Life Cycle

HIV enters lymphocytes by binding to the chemokine and CD4 receptors present on the cell surface [10, 40]. This binding is facilitated by the viral gp160 protein (gp120 and gp41 proteins) [10, 40]. Following binding, the viral envelope fuses with the cell membrane and releases the HIV capsid into the cell.
Besides CD4+ cells, HIV can also infect macrophages and dendritic cells [10, 40]. Once the viral capsid enters the cell, the viral RNA is reverse transcribed to a cDNA molecule [25]. This process is carried out by the reverse transcriptase enzyme, which is extremely error prone and also lacks the proofreading

capacity, leading to a misincorporation rate of $10^{-4}$ to $10^{-5}$ per base, or approximately one misincorporation per genome per replication cycle [38]. The cDNA and its complement form a double-stranded viral DNA, which is transported into the cell nucleus and integrated into the host genome with the help of the viral enzyme integrase [25]. Once incorporated, the viral DNA requires cellular transcription factors for the expression of viral proteins.

During viral replication, the proviral DNA is transcribed into mRNA, which is spliced and then transported to the cytoplasm, where it is translated into viral proteins (mainly the Tat and Rev proteins). Rev accumulates in the nucleus and inhibits mRNA splicing, so that unspliced full-length mRNA leaves the nucleus and enters the cytoplasm [32]. The full-length mRNA is in fact the viral genome; it binds to the Gag protein and is packaged into new viral particles. After processing by the host endoplasmic reticulum, gp160 is transported to the plasma membrane, where gp41 anchors gp120 to the membrane of the infected host cell. The viral capsid then assembles and buds out of the cell to infect other cells [25].

The high rate of mutation in HIV is due to the high error rate of reverse transcriptase while it transcribes the viral RNA genome into a DNA sequence that can be incorporated into the host cell genome for coding viral proteins [5]. Along with a high misincorporation rate of approximately one base per replication, reverse transcriptase lacks proofreading activity and therefore cannot check and rectify copying mistakes. This often results in several slightly different copies of HIV within a single patient. Distinct viral populations are referred to as quasispecies, and each genetically distinct individual is referred to as a haplotype.
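The link between the per-base misincorporation rate and "about one error per genome per replication" can be checked with a short calculation. The sketch below is illustrative only: the genome length (9,719 nt, the length of the HXB2 reference) and the Poisson approximation for the number of errors per copy are assumptions that the text above does not state.

```python
import math

# Assumption: length of the HXB2 reference genome in nucleotides.
GENOME_LENGTH_NT = 9719


def expected_errors(rate_per_base, genome_length=GENOME_LENGTH_NT):
    """Expected number of misincorporations in one reverse-transcription pass."""
    return rate_per_base * genome_length


def prob_at_least_one_error(rate_per_base, genome_length=GENOME_LENGTH_NT):
    """Probability that a copy carries at least one error.

    Assumption: errors per copy are Poisson distributed, i.e. sites mutate
    independently and rarely.
    """
    return 1.0 - math.exp(-expected_errors(rate_per_base, genome_length))


if __name__ == "__main__":
    for rate in (1e-4, 1e-5):
        print(
            f"rate={rate:g}: expected errors={expected_errors(rate):.3f}, "
            f"P(>=1 error)={prob_at_least_one_error(rate):.3f}"
        )
```

At the upper end of the quoted range the expected count is close to one error per genome copy, in line with the figure quoted above; at the lower end it is roughly one error in every ten copies.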
1.3 Longitudinal Studies

Because prospective HIV screening is prohibitively expensive, HIV studies are generally performed only on high-risk population groups, such as prison populations. Combined with additional ethical considerations, the result is that most studies enroll patients who are already symptomatic. In such populations the viral load is already established, so these studies offer little insight into the dynamics of the pre-seroconversion phase, i.e. the phase before any viral antibody production has taken place. To develop insight into the temporal evolution of the virus within a host, longitudinal studies of patient populations are of critical importance. Longitudinal studies combine data collected at multiple examinations, at intervals ranging from minutes to years, to give a more comprehensive picture of the viral dynamics than is possible from a single time point.

Figure 1.3: HIV Life Cycle [6]

While longitudinal studies are clearly desirable, they also present technical challenges, such as censoring of events due to the relocation, death or disenrollment of cohort members. Additionally, changing patient habits and lifestyle choices can complicate the analysis. Lastly, longitudinal studies are necessarily more involved and therefore more expensive than single time-point studies.

1.4 HIV Sequencing

Traditional Sanger-based sequencing methods read only the consensus genomic sequence of a heterogeneous viral population [36], obscuring the genomic variability present in the population, which is of potential importance for identifying gradually fixating resistance mutations. Next Generation Sequencing (NGS) technologies represent an improvement over Sanger sequencing and allow distinct haplotypes within a sample to be sequenced [30]. However, NGS reads are error prone and require sophisticated processing to achieve reliable haplotype reconstruction and frequency estimation. Currently, haplotypes with frequencies as low as 0.05% can be estimated with 99% confidence [24].

1.5 Recent Studies of HIV-1

In 1999, R. Shankarappa et al. conducted a landmark study investigating the evolution of HIV within an infected individual prior to the onset of AIDS [22]. They studied the evolution of C2-V5, a highly mutable region of the HIV-1 env gene, in nine patients over six to twelve years. They estimated the viral diversity within and between time points, identified mutations conferring a fitness advantage on the viral strain, and characterized three distinct phases: an early phase with a linear increase in diversity and in divergence from the founder virus strain, an intermediate phase with a linear increase in divergence but stabilization or decline in diversity, and a late phase with stabilization of divergence and a continued decrease in diversity.

More recently, Poon et al. used longitudinal deep sequencing data with coalescent analysis to estimate the date of HIV infection [26]. The time of infection was estimated using the time to the most recent common ancestor (TMRCA) of a time-calibrated phylogenetic tree relating sequences from all time points. This is justified by the argument that most HIV infections are established by a single viral strain owing to bottlenecks during transmission [39, 13].
Nineteen HIV-positive individuals were followed and seven genomic regions were analyzed. The authors compared the time since infection estimated by experimental methods with TMRCA estimates obtained with the BEAST software. They observed a stronger correlation between clinical and computational TMRCA estimates in highly variable regions of the HIV genome (such as env) than in conserved regions such as pol. The reduced correlation in the conserved regions is thought to result from a possible overestimation of time scales caused by the increased sensitivity of coalescent-based methods to the sampled genetic variation. Consequently, sequences with high divergence were found to be ideal for calibrating the evolutionary clock. In the case of infection by multiple founder viruses, this method was found to

overestimate the infection time.

In the same month, another study was published by Suzanne English et al. [23]. It discussed the reconstruction of the transmission history of HIV-1 infected individuals using phylogenetic methods. It showed that diversity is fairly limited in the early phase of the infection and is even compatible with the transmission of a single viral variant. It also provided evidence that a single donor can in principle transmit two distinct variants to two different individuals within a time span of a few hours. The transmission history was reconstructed with Bayesian and maximum likelihood approaches, using the inter-host variation observed in the env, gag and pol regions. The predicted inter-host genetic diversity varied with the degree of conservation of the region used for the prediction: the highest diversity was predicted from the env gene (least conserved), followed by the pol gene and then the gag gene. The BEAST software was used for this analysis.

A temporal study of HIV-1 by G. Achaz et al. was published in 2004 [11]. It used gag-pol sequence data collected over time from two chronically infected individuals to estimate the population structure and the neutral mutation rate of this region per site per generation. Neutral coalescent models were used for the analysis. For the genealogy construction, a coalescent approach identical to the one proposed by Felsenstein (1999) was used. Nineteen time points collected over a period of four years were used for the analysis, which compensated for the low mutation rate in the sequences.

A longitudinal study of viral evolution in early acute Hepatitis C Virus infection was carried out by Bull RA, Luciani F, McElroy K, Gaudieri S, Pham ST, et al.
published in 2011 [9]. We discuss this paper in greater detail because of its parallels with our study. The study aimed to identify genetic variants with frequencies as low as 0.1% and to quantify them over the course of infection. The authors also identified two sequential bottlenecks that occurred early in infection. The BEAST software was used to estimate changes in the effective population size of the virus over time, to reconstruct ancestor-descendant relationships among the viral samples from different time points, and to estimate the rate of evolution of the virus. An in-depth analysis of nonsynonymous and synonymous substitutions was carried out to identify any visible pattern of change. Entropy changes were measured across the whole genome and across patients, and indicated non-uniform evolution of HCV across the genome and over time. The single founder virus hypothesis was tested for each infection using Poisson Fitter, a freely available tool on the HIV database.

Several other time-series analyses of HIV-positive patients have been carried out to identify and understand the order in which resistance mutations accumulate when the patient is placed under drug therapy. However, we do not discuss these, since our patients did not show any drug resistance even after the therapy was discontinued.

Chapter 2 Materials and Methods

2.1 Patient History

HIV samples were collected from two patients enrolled at the department of infectious diseases at Universitätsspital Zürich. The protease coding region of HIV was deep-sequenced and analyzed. This region was chosen for the study because both patients were treated with a protease inhibitor drug.

Patient I.D.123

Figure 2.1: Viral load in patient I.D.123

This patient was part of the Primary HIV Infection study, which emphasizes beginning treatment in the early phase of infection and then discontinuing it. It is based on the assumption that the patient is likely to control the virus when treatment is started early. However, most patients

suffer from a viral rebound, like patient 123. As can be seen in figure 2.1, four samples collected over a period of 3.74 years after the patient tested HIV positive were sequenced. These samples are marked in red. The regions marked as ART in figure 2.1 show the periods of treatment with Lopinavir, an anti-retroviral drug. The exact dates of sample collection are listed in table 2.1.

Table 2.1: Sample collection time points, patient I.D.123
Sr.No. | Sample Name | Sample Collection Date
1 PR PR PR PR

Patient I.D.181

Figure 2.2: Viral load in patient I.D.181

The patient remained untreated until almost a year after testing HIV positive. During this phase, the viral load in blood plasma was monitored regularly. Samples from the three time points marked in red in figure 2.2 were deep-sequenced and analyzed. The exact dates of sample collection are listed in table 2.2.

Table 2.2: Sample collection time points, patient I.D.181
Sr.No. | Sample Name | Sample Collection Date
1 PR PR PR

2.2 Data Pre-processing

Haplotype reconstruction and error correction were performed using ShoRAH [24]. The output file contained haplotype sequences in FASTA format. The header of each haplotype contained two numbers. The first was the confidence in the haplotype sequence on a scale of 0 to 1. The second was the average read number of the haplotype, i.e. the number of times the sequences constituting the haplotype were sequenced; it can be used to calculate the frequency of the haplotype in the sample population.

These files often contained over a hundred sequences, of which only a few had a high read count and confidence. For a meaningful analysis, the files were filtered to select sequences with a confidence above 0.9. This reduced the number of haplotypes to one quarter or less. The threshold was chosen to balance the number of sequences available for analysis: too few sequences would not contain enough information, and too many would add noise to the result. This cutoff returned a reasonable number of haplotypes.

Since a functional protease is essential for HIV, any gaps present in the haplotype sequences were assumed to be sequencing errors. The haplotypes from a single run were first used to construct a consensus sequence with the EMBOSS tool cons, called through Biopython. Any gaps present in the consensus sequence were filled using the HXB2 protease reference sequence. The consensus sequence was then used to fill the gaps present in the haplotype sequences. The reading frame of every haplotype was also adjusted to start at the first nucleotide position. A Python script was written to perform all of the above tasks.

2.3 Entropy Analysis

HIV constantly accumulates mutations to cope with the selective forces exerted by the immune system and drug treatments.
The nucleotide sites that accumulate these mutations are mainly responsible for rendering the virus resistant to different therapies. In order to improve our current

methods, it would be fruitful to identify these sites and to understand how they maintain diversity in the viral population. For this purpose we calculate the entropy for the dataset of every time point and look for visible spatial or temporal patterns.

The entropy of a position in a sequence describes the uncertainty associated with the nucleotide present at that site. High entropy indicates that the site is variable. Let $X$ be a discrete random variable (bases when considering nucleotides, amino acids when considering proteins) taking a finite number of possible values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ such that $p_i \geq 0$ for $i = 1, 2, \ldots, n$ and $\sum_{i=1}^{n} p_i = 1$. The entropy is then given by

$$H_n(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_b p_i$$

Here $b$ is the base of the logarithm. A simple Python script was written to calculate and plot the entropy at every position in the alignment; we used the natural logarithm for our calculations. When deep-sequencing data was supplied as input, the script could take the average number of reads into account while calculating the entropy at every position.

2.4 Recombination Analysis

Recombination plays a crucial part in the evolution of retroviruses and is more prevalent in conserved regions [7]. Since we use a fairly conserved HIV region for our analysis, we performed a recombination detection study. If an alignment contains recombinant sequences, the relationships between different segments of the alignment cannot be described by a single phylogenetic tree. To uncover the true evolutionary relationships, it is necessary to identify the recombination break points, partition the alignment into the observed recombinant segments, and depict the evolutionary relationships in each partition using a separate phylogenetic tree.
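Returning briefly to Sections 2.2 and 2.3, the confidence filter and the read-weighted per-site entropy can be sketched together as follows. This is a minimal illustration under stated assumptions, not the script used in the thesis: the header layout `name|posterior=<confidence> ave_reads=<reads>` is a hypothetical format chosen for the example, the function names are invented, and gap characters are simply treated as one more symbol.

```python
import math
from collections import defaultdict


def parse_header(header):
    """Return (confidence, average reads) from a header line.

    Assumption: headers look like 'hap_0|posterior=0.97 ave_reads=153.2'.
    This layout is hypothetical and may differ from real ShoRAH output.
    """
    fields = dict(
        token.split("=", 1)
        for token in header.replace("|", " ").split()
        if "=" in token
    )
    return float(fields["posterior"]), float(fields["ave_reads"])


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def filter_haplotypes(records, min_confidence=0.9):
    """Keep (sequence, reads) for haplotypes whose confidence exceeds the cutoff."""
    kept = []
    for header, seq in records:
        confidence, reads = parse_header(header)
        if confidence > min_confidence:
            kept.append((seq, reads))
    return kept


def site_entropies(haplotypes):
    """Read-weighted Shannon entropy (natural log) of every alignment column."""
    length = len(haplotypes[0][0])
    if any(len(seq) != length for seq, _ in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")
    entropies = []
    for pos in range(length):
        weights = defaultdict(float)
        for seq, reads in haplotypes:
            weights[seq[pos]] += reads  # each haplotype counts once per read
        total = sum(weights.values())
        entropies.append(
            sum(-(w / total) * math.log(w / total) for w in weights.values() if w > 0)
        )
    return entropies


# Toy input, invented for illustration only.
EXAMPLE_FASTA = """>hap_0|posterior=0.99 ave_reads=75
ACGT
>hap_1|posterior=0.95 ave_reads=25
ACGA
>hap_2|posterior=0.40 ave_reads=3
TCGA
""".splitlines()

if __name__ == "__main__":
    kept = filter_haplotypes(read_fasta(EXAMPLE_FASTA))
    print(site_entropies(kept))
```

In the toy input the third haplotype is discarded by the 0.9 cutoff, so only the last column is variable, with symbol frequencies 0.75 and 0.25 after weighting by reads.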
If recombination events are not taken into account during a phylogenetic analysis, the results are likely to be meaningless. We used the Recombination Identification program [37], which was developed specifically to detect recombinants in HIV-1 nucleotide sequences. It accepts as input a set of nucleotide sequences from a single viral genomic region collected from a single patient. The program requires a background sequence, which is essentially the consensus sequence of the genomic region to be analyzed. This can be selected from the list available in the program; alternatively, the user is free to submit a consensus sequence with

the nucleotide data. In the latter case, the consensus should be aligned to the rest of the sequences. The program detects recombinants by sliding a window of pre-specified length along the alignment and calculating the Hamming distance of the query sequence from all other sequences. The best match within every window is identified, and the confidence in each match is calculated using a z-test. If two neighboring windows on the same sequence have their best matches with different sequences, the sequence is considered a recombinant. The program implicitly assumes that each site evolves independently but according to the same process. It also approximates the binomial distribution of the Hamming distances by a normal distribution.

2.5 Molecular Clock Estimation

This section closely follows "The Evolutionary Analysis of Measurably Evolving Populations using Serially Sampled Gene Sequences" by Allen Rodrigo et al. [21] and "Estimating Divergence Times" by J.L. Thorne and H. Kishino [28].

Our interest in estimating the rate of evolution comes from our desire to construct rooted, time-scaled phylogenetic trees using serially sampled data. A phylogenetic tree of contemporary sequences can be constructed using standard approaches such as maximum parsimony and the neighbor-joining method, which assume that all input sequences belong to a single time point and are therefore equally distant from the root of the tree [35]. This is not the case with serially sampled sequences, and care must be taken to scale the branches according to their time of sampling. The rate of mutation is required for this scaling. In standard tree construction techniques, branch lengths are calculated using the composite parameter $\mu t$, where $\mu$ is the substitution rate and $t$ is the sampling time. With serially sampled data, these two parameters can be decoupled into time and substitution rate, and the tree branches can be expressed in units of either.
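The decoupling of $\mu$ and $t$ can be illustrated with a deliberately simplified stand-in for the Bayesian estimation used later: if divergence from a common reference grows as $\mu t$, the slope of divergence against sampling time estimates $\mu$, and any branch length in substitutions per site can then be converted to time. The ordinary least-squares regression below is an assumption made for illustration, not the method implemented in BEAST, and the numbers in the usage example are invented.

```python
def estimate_rate(sample_times, divergences):
    """Least-squares slope and intercept of divergence against sampling time.

    sample_times: sampling times (e.g. years since the first sample).
    divergences:  mean divergence of each sample from a common reference,
                  in substitutions per site.
    The slope is a crude estimate of the substitution rate mu.
    """
    n = len(sample_times)
    if n != len(divergences) or n < 2:
        raise ValueError("need at least two (time, divergence) pairs")
    mean_t = sum(sample_times) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in sample_times)
    if sxx == 0:
        raise ValueError("sampling times must not all be identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(sample_times, divergences))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    return slope, intercept


def branch_length_in_time(length_subs_per_site, rate):
    """Convert a branch length from substitutions per site to time units."""
    return length_subs_per_site / rate


if __name__ == "__main__":
    # Invented example: four samples taken one year apart.
    times = [0.0, 1.0, 2.0, 3.0]
    divergence = [0.000, 0.004, 0.008, 0.012]
    mu, _ = estimate_rate(times, divergence)
    print(f"estimated rate: {mu:.4f} substitutions/site/year")
    print(f"a branch of 0.002 subs/site spans {branch_length_in_time(0.002, mu):.2f} years")
```

With contemporaneous sequences all sampling times coincide, the regression is undefined, and only the product $\mu t$ is identifiable, which is the situation described above for standard tree construction.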
The rate of molecular evolution is the outcome of a complex interplay between biological systems and their surroundings. Since these systems and their surroundings change over time, their evolutionary rates also fluctuate. Such fluctuations in rate over different periods are best described by a relaxed molecular clock. In the case of HIV, the rate of evolution is influenced by the rate of mutation, the generation length, and the probability of fixation of a mutation in the viral population. All these factors depend intricately on the biology as well as the population size of HIV. When the population size fluctuates, so does the fixation probability of a mutation, which changes the selection pressure on the virus. Hence changes in the population size must be taken into account when deciphering phylogenetic relationships. This is done using coalescent-based models, which use sequence data to determine the population genetic parameters (e.g. population size) that in turn determine the shape of the genealogy. Coalescent theory describes how a phylogenetic tree representing the shared ancestry of sampled genes (i.e. a genealogy) depends on changes in population size and structure [33].

BEAST implements a variable population size coalescent model that allows the past population dynamics to be determined. This option is known as the Bayesian skyline plot. It is a non-parametric model that uses time-calibrated sequence data to estimate demographic model parameters by Bayesian methods [14]. It can estimate the evolutionary rate, substitution model parameters, phylogeny and ancestral population dynamics within a single run. It then plots the past population size over time, beginning from the estimated root age of the phylogenetic tree.

It can also be argued that, depending on the period of observation, the evolutionary rates can be assumed to be approximately constant, implying a strict molecular clock. Such an assumption facilitates evolutionary studies, but one should always keep in mind the scenarios in which the weakness of this assumption outweighs its convenience.

Molecular clock model selection and rate estimation were performed using a Java-based tool, BEAST [27]. It was the natural choice for the analysis, since it implements substitution models, insertion-deletion models and demographic models for performing a series of coherent analyses. It can also explicitly model the rate of molecular evolution on every branch of the phylogenetic tree. This rate can be constrained to be constant over all branches or can be allowed to vary freely along different lineages.
This molecular clock model can be readily combined with other models that allow the rate of substitution to vary along the alignment while sharing some common parameters, such as the rate of transition or transversion. Since several models can be combined, many unnecessary simplifying assumptions can be avoided.

BEAST provides a Bayesian framework for testing hypotheses on biological data. Its three main kinds of analysis are constructing rooted, time-measured phylogenies, estimating population change over time using coalescent-based models, and demogeographic sequence analysis. We constructed time-calibrated phylogenies after estimating the clock rate, as well as population evolution plots; hence we discuss these two methods in detail. Demogeographic analysis uses the location of sample collection and includes this information when drawing statistical inferences.

BEAST is one of the few available platforms that can deal with time-stamped data and use relaxed or strict molecular clock models to construct rooted trees and calibrate internal node ages on absolute time scales. It uses the Metropolis-Hastings Markov Chain Monte Carlo

algorithm to provide sample-based estimates of the posterior distributions of the evolutionary parameters given a set of sequence data. It facilitates the analysis of multi-locus data, since the data can be appropriately partitioned and the evolutionary parameters can be linked or unlinked between partitions. This feature can be extremely helpful when dealing with viral sequences from genes such as pol and env, which have different rates of mutation. In such a situation, the demographic model parameters can be shared between partitions, assuming exponential or logistic growth, while the substitution model parameters can be unlinked across partitions.

Model Summary

The model first estimates a phylogenetic tree to explain the relationship between $n$ contemporaneous sequences. This is the genealogy, denoted by $g$. Coalescent events are assumed to occur only at the internal nodes of the tree, i.e. there can be at most $n - 1$ coalescent events on the tree. The population size might change or remain the same after a coalescent event. The indicator function $I_c(i)$ denotes whether the $i$-th event is a coalescent. The times at which the coalescent events occur are denoted by the vector $u = (u_1, u_2, \ldots, u_{n-1})$. A period during which the population size remains unchanged is called an interval, and the vector giving the number of coalescent events in each interval is $A = (a_1, a_2, \ldots, a_m)$. Here $m$ is the total number of such intervals, with $1 < m < n - 1$. The time at which each grouped interval ends is denoted by $w = (w_1, \ldots, w_m)$, and the vector of effective population sizes is denoted by $\Theta = (\theta_1, \theta_2, \ldots, \theta_m)$. The vector of effective population sizes, together with the genealogy $g$ and the vector $A$ of the number of coalescent events in each interval, constitute the demographic and coalescent time parameters.
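To make the role of $\Theta$ and $A$ concrete, the density of the coalescent waiting times under a piecewise-constant population size can be evaluated interval by interval. The sketch below uses the standard coalescent result that, with $k$ lineages and population parameter $\theta$, the waiting time to the next coalescence is exponential with rate $\binom{k}{2}/\theta$. It assumes contemporaneous sampling, takes the durations between successive coalescent events (ordered from the tips towards the root) as input rather than the event times $u$, and expects $\theta$ in the same time units as those durations. The function name and argument layout are invented for illustration and are not BEAST's implementation.

```python
import math


def skyline_log_density(durations, group_sizes, thetas):
    """Log density of coalescent waiting times under a piecewise-constant model.

    durations:   waiting times between successive coalescent events, ordered
                 from the tips towards the root (n - 1 values for n sequences).
    group_sizes: number of coalescent events in each grouped interval (vector A).
    thetas:      population size parameter of each grouped interval (vector Theta).
    """
    if len(group_sizes) != len(thetas):
        raise ValueError("need one theta per grouped interval")
    if sum(group_sizes) != len(durations):
        raise ValueError("group sizes must add up to the number of coalescent events")
    if any(theta <= 0 for theta in thetas):
        raise ValueError("population sizes must be positive")

    # Expand Theta so that every coalescent interval has its own theta.
    theta_per_event = [
        theta for size, theta in zip(group_sizes, thetas) for _ in range(size)
    ]

    lineages = len(durations) + 1  # n sequences sampled at the same time
    log_density = 0.0
    for duration, theta in zip(durations, theta_per_event):
        rate = lineages * (lineages - 1) / 2.0 / theta
        log_density += math.log(rate) - rate * duration
        lineages -= 1  # one coalescence merges two lineages
    return log_density


if __name__ == "__main__":
    # Three sequences, two coalescent events, a single population size.
    print(skyline_log_density([0.1, 0.5], [2], [1.0]))
    # Same genealogy, but the older interval has a larger population.
    print(skyline_log_density([0.1, 0.5], [1, 1], [1.0, 4.0]))
```

Splitting the events into more groups lets the population size change more often, which is the trade-off governed by the choice of $m$.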
The probability of the genealogy can be easily calculated and is denoted by $f_G(g \mid \Theta, A)$. BEAST uses a fixed number of intervals $m$, since the resulting posterior demographic function is consistent over a large range of its values. The vector of effective population sizes is sampled using an MCMC algorithm. Each new population size is sampled from an exponential distribution with a mean equal to the previous population size. This formulation represents our belief that the population size is autocorrelated through time. The posterior distribution sampled is the product of the likelihood of the piecewise demographic model and the priors:

$$f_{het}(\Theta, A, \Omega, g, \mu \mid D) = \frac{1}{Z} \Pr(D \mid \mu, g)\, f_G(g \mid \Theta, A)\, f_\Theta(\theta_1) \times f_A(A)\, f_\Omega(\Omega)\, f_\mu(\mu)$$

where $f_\Theta(\theta_1)$ is the scale-invariant prior for the first effective population size, and the remaining population sizes are drawn from an exponential distribution centered on the size of the previous population. $\Omega$ contains the parameters of the substitution model, and $\mu$ is the mutation rate that scales the genealogies (phylogenetic trees) from units of mutations per site to units of time.

If the sampled substitution model parameters and mutation rates are ignored, we obtain a list of states, each associated with a genealogy and demographic parameters. The demographic history can then be constructed as a piecewise function of time for each state. The marginal posterior distribution of the population size is calculated for each time point back to the time to the most recent common ancestor, along with the 95% confidence interval that accounts for phylogenetic and demographic uncertainty. The population estimates are usually smooth owing to the averaging effect of the sampling procedure.

2.6 Poisson Fitter

Poisson Fitter is freely available on the HIV database. The studies in [39, 13] showed that HIV undergoes genetic bottlenecks when the mode of transmission is sexual (horizontal transfer) or mother to child (vertical transfer). As a result, new infections are primarily initiated by homogeneous viral strains. Once the infection is established by a single viral strain, the virus is expected to grow exponentially until the host immune system initiates a response. This is a case of neutral evolution, in which the mutation counts are expected to follow a Poisson process. Once the host immune system triggers a response against the infection, or when the patient is placed under therapy, the virus population no longer grows exponentially, the accumulated mutations are no longer random, and the Poisson distribution can no longer describe the frequency distribution of pairwise Hamming distances.
Poisson Fitter [16] analyzes a set of HIV sequences, assumed to have been collected close to the time of infection, to estimate whether the infection was initiated by a single or by multiple founder viruses. It is based on a maximum likelihood approach: it first tests the hypothesis that a single founder virus strain initiated the infection and, if this condition is met, estimates the time of infection with a 95% confidence interval, provided the sample was drawn before the virus population was subject to any selection pressure. Poisson Fitter can read deep-sequencing datasets and so was the natural choice for performing this analysis. Another reason for selecting this tool was that it is specially designed for working with HIV and Hepatitis C virus datasets and makes use of their default substitution rates. It has been used in some other longitudinal studies, which are discussed in the literature review section.
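The core quantity behind this kind of test can be illustrated with a minimal Python sketch. It is not the Poisson Fitter implementation: it only tabulates pairwise Hamming distances for a set of aligned sequences, takes the mean distance as the maximum likelihood estimate of the Poisson parameter, and computes the pair counts expected under that Poisson distribution. The function names and the toy alignment are hypothetical, and the goodness-of-fit test, the star-phylogeny test and the dating step of the actual tool are left out.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def hamming_counts(seqs):
    """Return Y, where Y[i] is the number of sequence pairs at distance i."""
    dists = [hamming(a, b) for a, b in combinations(seqs, 2)]
    if not dists:
        raise ValueError("at least two sequences are required")
    counts = [0] * (max(dists) + 1)
    for d in dists:
        counts[d] += 1
    return counts


def poisson_lambda(counts):
    """Maximum likelihood Poisson rate: the mean pairwise Hamming distance."""
    return sum(i * y for i, y in enumerate(counts)) / sum(counts)


def expected_counts(counts, lam):
    """Pair counts expected at each observed distance under Poisson(lam)."""
    total = sum(counts)
    return [total * math.exp(-lam) * lam ** i / math.factorial(i)
            for i in range(len(counts))]


# Toy alignment (hypothetical data, for illustration only).
seqs = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGAACGT"]
Y = hamming_counts(seqs)          # observed pair counts per distance
lam = poisson_lambda(Y)           # fitted Poisson parameter
E = expected_counts(Y, lam)       # expected pair counts per distance
```

Comparing `Y` with `E` (for instance with a chi-square statistic) gives a rough impression of how well a single-founder, neutral model describes the sample. Any threshold for accepting the fit would be a further assumption.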

This tool compares the sample genetic diversity with the diversity expected under the neutral growth model, i.e. infection by a single viral strain accumulating random mutations, by performing statistical tests on the Hamming distance distribution and fitting a Poisson distribution to it using the maximum likelihood method. It also tests whether the phylogenetic tree for the sequences shows a star topology. The parameter of the Poisson distribution is then found to be

$$\lambda = \frac{\sum_{i=0}^{n} i\,Y_i}{\sum_{i=0}^{n} Y_i} = E(Y)$$

where $Y = (Y_0, \ldots, Y_n)$ are the numbers of pairs of sequences that have a Hamming distance equal to the subscript $i$. The model assumes a generation time of 2 days and a fixed mutation rate per site per replication, with a basic reproductive ratio $R_0 = 6$, based on the findings from [39, 13, 18]. When the sequence data shows a star phylogeny, one finds that $E(Y_i) = Y_i$. Once this condition is satisfied, the age of the root of the tree is the same as the time of HIV transmission to the patient. When the goodness of fit is low, it might indicate that the sample was collected after the onset of selection pressures or that the infection was initiated by multiple founder viruses. Deep sequencing data can also be analyzed using Poisson Fitter; the plots are then drawn on a log scale, since identical sequences far outnumber the ones that differ and this information gets masked on a linear scale.

2.7 Sliding MinPD

This section describes the methods in [8].

Traditional phylogenetic approaches treat all the sequence data as contemporaneous and deal with serially sampled data by merely scaling the tips of the leaves. These methods are also not able to account for recombination events. Furthermore, when the data is collected from quickly evolving viruses that exhibit complex substitution patterns, phylogenetic trees are not able to depict all the information.
In such a situation, an evolutionary network can be used to depict the ancestor-descendant relationships and recombination events. Sliding MinPD constructs an evolutionary network using serially sampled data and detects recombination events using a sliding window approach. It is based on the minimum pairwise distance approach combined with the sliding window method and recombination detection techniques. The algorithm consists of three phases. In the first phase, every sequence that does not belong to the first time point is treated as a query sequence and its pairwise distance is calculated against every sequence from the previous time point. In the second phase, the breakpoints in the recombinant sequences and their donor sequences are identified using the sliding window approach, where the best match is identified for every window along the alignment. In the final phase, potential ancestors from previous time points are identified. For the non-recombinant sequences these are the ones which had the shortest distance calculated in the first phase. The results of this program were found to be extremely sensitive to the specified window length. Hence the analysis was carried out for only a single patient.

2.8 Rate of synonymous and nonsynonymous substitutions

This section closely follows [31], the chapter Neutral and adaptive protein evolution by Ziheng Yang in [41] and the Hypothesis Testing for Phylogenies manual [20].

The rates of nonsynonymous and synonymous substitution provide an insight into the type of selection pressure acting on the viral population. When the ratio of the rates of nonsynonymous and synonymous substitutions is greater than one for a genomic region, that region is said to be under positive selection. For example, when an HIV patient is placed under drug therapy, the virus shows concerted substitutions towards acquiring a particular residue, which eventually becomes fixed in the population, making the virus drug resistant. This type of evolution is known as positive directional selection. Another kind of positive selection maintains amino acid diversity at certain sites which are potential targets of the host immune system. This is commonly known as diversifying positive selection. When the genomic region accumulates synonymous and nonsynonymous substitutions at the same rate, it is said to be under neutral evolution. In negative selection the rate of nonsynonymous substitutions is much lower than that of synonymous substitutions, causing selective removal of alleles that are deleterious.
It is also commonly known as purifying selection. A substitution is synonymous or nonsynonymous depending on the codon in which it occurs and on its position within the codon. For example, GGX → GGY is always a synonymous substitution, whereas CAX → CAY is synonymous if X → Y is a transition and nonsynonymous otherwise. Hence, while dealing with coding sequences, it is always meaningful to use codons as the units for selection analysis. We used the Mega software for calculating the rates of synonymous and nonsynonymous mutations, which is based on the method described by M. Nei and T. Gojobori. Here we describe the method in detail. First, the number of synonymous and nonsynonymous sites for each codon present in the sequence is computed. Let $S$ be the number of synonymous sites for a codon,

$$S = \sum_{i=1}^{3} f_i,$$

where $f_i$ is the fraction of synonymous changes at the $i$-th position in the codon. Then the number of nonsynonymous sites $N$ for the codon can be calculated as

$$N = 3 - S.$$

This can be understood by a simple example. In the case of TTA, which codes for leucine, $f_1 = \frac{1}{3}$ (T → C), $f_2 = 0$ and $f_3 = \frac{1}{3}$ (A → G), and so $S = \frac{2}{3}$ and $N = \frac{7}{3}$. The total numbers of synonymous and nonsynonymous sites in a sequence of $r$ codons are therefore given by $S = \sum_{j=1}^{r} S_j$ and $N = 3r - S$, where $S_j$ is the number of synonymous sites of the $j$-th codon.

The numbers of synonymous and nonsynonymous nucleotide differences between a pair of sequences are calculated by comparing the sequences codon by codon and counting the synonymous and nonsynonymous nucleotide differences for each pair of compared codons. This can be easily done when the codons differ at only a single position. When they differ at two nucleotide positions, there are two possible pathways through which this difference could have occurred. Both pathways are considered with equal probability, the synonymous and nonsynonymous substitutions are counted along each, and $S_d$ and $N_d$ are updated. For example, if the codon TTT is compared against GTA, the two pathways are

1. TTT (Phe) → GTT (Val) → GTA (Val): one synonymous and one nonsynonymous substitution
2. TTT (Phe) → TTA (Leu) → GTA (Val): two nonsynonymous substitutions

The value of $S_d$ becomes 0.5 and that of $N_d$ becomes 1.5. Similarly, when there are three nucleotide differences, six possible pathways between the codons, with three mutational steps within each pathway, are considered. The proportions of synonymous and nonsynonymous differences are then calculated using the equations

$$p_s = \frac{S_d}{S} \quad \text{and} \quad p_n = \frac{N_d}{N},$$

where $S$ and $N$ are the average numbers of synonymous and nonsynonymous sites for the two compared sequences.
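The counting steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Mega implementation: it assumes the standard genetic code, counts changes to stop codons as nonsynonymous, does not exclude pathways that pass through stop codons, and expects gap-free, in-frame sequences of equal length with at least one synonymous site. All function names are hypothetical.

```python
from itertools import permutations

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def synonymous_sites(codon):
    """S for one codon: sum over positions of the synonymous fraction f_i."""
    s = 0.0
    for pos in range(3):
        syn = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                syn += 1
        s += syn / 3  # three possible changes per position
    return s


def count_differences(c1, c2):
    """Average (S_d, N_d) over all orderings of the differing positions."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    if not diff:
        return 0.0, 0.0
    paths = list(permutations(diff))
    sd = nd = 0.0
    for order in paths:
        current = c1
        for pos in order:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                sd += 1
            else:
                nd += 1
            current = nxt
    return sd / len(paths), nd / len(paths)


def ng_proportions(seq1, seq2):
    """Return (p_s, p_n) for two aligned coding sequences."""
    codons1 = [seq1[i:i + 3] for i in range(0, len(seq1), 3)]
    codons2 = [seq2[i:i + 3] for i in range(0, len(seq2), 3)]
    # S and N are averaged over the two compared sequences.
    s_sites = (sum(map(synonymous_sites, codons1)) +
               sum(map(synonymous_sites, codons2))) / 2
    n_sites = 3 * len(codons1) - s_sites
    sd = nd = 0.0
    for a, b in zip(codons1, codons2):
        s, n = count_differences(a, b)
        sd += s
        nd += n
    return sd / s_sites, nd / n_sites
```

With these definitions, `synonymous_sites("TTA")` reproduces the value $S = 2/3$ of the leucine example, and `count_differences("TTT", "GTA")` reproduces $S_d = 0.5$ and $N_d = 1.5$.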
Further, the per-site substitutions are calculated using the Jukes and Cantor (1969) formula [31]:

$$d = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\,p\right)$$

where $p$ is $p_s$ or $p_n$ for synonymous and nonsynonymous substitutions, respectively. This method gives approximate estimates, and the formula is not applicable to two- and threefold degenerate sites. The program used by us for this analysis makes use of this method for calculating the rates of synonymous and nonsynonymous changes.
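As a small worked illustration of the correction step, the following Python sketch applies the Jukes and Cantor formula to a proportion of differences and forms the ratio of nonsynonymous to synonymous rates. The function names are hypothetical and the sketch makes no claim about how Mega handles edge cases. It simply rejects proportions of 0.75 or more, where the logarithm is undefined, and a synonymous distance of zero, where the ratio is undefined.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1 - 4 * p / 3)


def dn_ds_ratio(p_n, p_s):
    """Ratio of corrected nonsynonymous to synonymous distances."""
    d_n = jukes_cantor(p_n)
    d_s = jukes_cantor(p_s)
    if d_s == 0:
        raise ValueError("ratio undefined when the synonymous distance is zero")
    return d_n / d_s
```

A ratio above one would point towards positive selection, a ratio close to one towards neutral evolution, and a ratio well below one towards purifying selection, as described at the beginning of this section.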

Chapter 3 Results

3.1 Entropy Analysis

Patient I.D. 123

The entropy was calculated and plotted for every position in the alignment for all four datasets, as shown in figures 3.1, 3.2, 3.3 and 3.4. A set of constant peaks can be observed around positions 50 and 290 in all four plots. The sequence regions around these sites were explored and are summarized in table 3.1. The base number shows the nucleotide position whose neighboring sequence is being viewed. The following four columns show the entropy of the base at the different time points. The last column shows the neighboring sequence of the base. Constantly high entropies were found in the homopolymeric regions. These were most likely sequencing errors, since the 454 sequencing technique (used in our analysis) is known to suffer from a high base mis-incorporation rate in homopolymeric regions. These regions of the alignment were manually curated to remove the anomalies.

Figure 3.1: Patient I.D. 123: Entropy plot for samples collected in
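A per-position entropy profile of this kind can be sketched as follows. This is a minimal illustration rather than the exact procedure used for the plots: it assumes Shannon entropy with a base-2 logarithm, treats every symbol in a column (including a gap character) as a separate state, and gives every sequence in the alignment the same weight. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def alignment_entropy(seqs):
    """Entropy for every position of a list of equal-length aligned sequences."""
    if len(set(map(len, seqs))) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [column_entropy(col) for col in zip(*seqs)]


# Toy alignment (hypothetical data): conserved, conserved, variable, fully mixed.
profile = alignment_entropy(["AAAA", "AAAT", "AACG", "AAGC"])
```

Positions that stay at high entropy in every sample, as around the homopolymeric stretches here, would stand out as recurring peaks in such a profile.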

Figure 3.2: Patient I.D. 123: Entropy plot for samples collected in 2005

Figure 3.3: Patient I.D. 123: Entropy plot for samples collected in 2006

Figure 3.4: Patient I.D. 123: Entropy plot for samples collected in

Table 3.1: Patient I.D. 123: Neighboring sequences of high entropy sites
Base No. | Time1 | Time2 | Time3 | Time4 | Sequence preceding the site
 | | | | | GGGGGG (43-48)
 | | | | | GGGGGGC (43-49)
 | | | | | TTTAAATTTT ( )

In general, sequences from the first time point showed the highest entropy, which gradually decreased over time.

Patient I.D. 181

High entropy was measured in several regions, including a few homopolymeric regions. Regions suspected of containing sequencing errors are listed in table 3.2. These regions were manually corrected to remove the sequencing errors. No spatial or temporal trend was observed in the change in entropy.

Table 3.2: Patient I.D. 181: Neighboring sequences of high entropy sites
Base No. | Time1 | Time2 | Time3 | Sequence preceding the site
 | | | | AAACCAAAAA ( )
 | | | | AAACCAAAAA ( )
 | | | | TTTAAATTTT ( )

3.2 Molecular Clock Rate Estimation and Phylogenetic Tree Construction

Sequences from all the time points were used to simultaneously estimate the substitution rate and construct the phylogenetic tree depicting the ancestor-descendant relationships between sequences from all time points. The BEAST software was used for this analysis. After performing a number of test runs to understand the effect of every parameter on the run, the following settings were found to provide the best results in terms of a high log-likelihood of the estimated parameters, fast convergence of the MCMC chain and a low standard deviation in the distribution of the parameters. The phylogenetic tree construction runs for both patients were performed with the settings specified in table 3.3. The priors specified for the BEAST run are summarized in table 3.4. The operator settings used to explore the sample space of the parameters are summarized in table 3.5.

Table 3.3: Patient I.D. 123 and 181: Analysis settings. Results shown in fig. 3.7, 3.9
Site Models
1. Substitution Model: HKY
2. Base Frequencies: Estimated
3. Site Heterogeneity Model: Gamma + Invariant Sites
4. Partition in codon position: None
Clock Model
1. Model: Strict Clock
2. Estimate Rate: Yes
Demographic model
1. Tree Prior: Constant size
2. Starting Tree: Randomly Generated

Table 3.4: Patient I.D. 123 and 181: Priors for the BEAST run
Parameter | Prior | Bound | Description
Kappa | Lognormal[1,1.25] | [0,inf] | HKY transition-transversion parameter
Frequencies | uniform[0,1] | [0,1] | base frequencies
alpha | uniform[0,10] | [0,1000] | gamma shape parameter
pinv | uniform[0,1] | [0,1] | proportion of invariant sites parameter
clock.rate | uniform[5.4e-5,1] | [0,inf] | substitution rate
rootheight | Using Tree Prior | [3.761,inf] | root height of the tree
const.popsize | 1/x | [0,inf] | coalescent population size parameter

Let us now briefly discuss how the current choice of parameters was arrived at. A series of runs was set up with parameter-rich substitution models such as the general time reversible model. These initial runs took long to converge. As a result, a simpler substitution model, the HKY model, was selected for the analysis. This significantly decreased the convergence time of the chain. All initial test runs were made with the uncorrelated lognormal clock, which draws the rate of each branch from an underlying lognormal distribution. The estimated standard deviation of the clock rate (i.e. the ucld.stdev parameter) was close to zero for most of these runs. A value close to zero for this parameter indicates clock-like behavior of the dataset [2]. Thereafter, a set of runs was made with different coalescent models: the expansion growth model, the exponential growth model and the constant population size model. The Bayes Factor was used for selecting the best fitting model. For both patients, the constant population model could not be rejected.

Table 3.5: Choice of operator values for the rate estimation and phylogenetic tree construction, Patient I.D. 123 and 181
Operates on | Type | Tuning | Weight | Description
Kappa | scale | | | HKY transition-transversion parameter of partition
Frequencies | deltaexchange | | | frequencies
Alpha | scale | | | gamma shape parameter of partition
Clock.rate | scale | | | substitution rate of partition
Tree | subtreeslide | | | Performs the subtree-slide rearrangement of the tree
Tree | wideexchange | n/a | 3.0 | Performs global rearrangements of the tree
Tree | wilsonbalding | n/a | 3.0 | Performs the Wilson-Balding rearrangement of the tree
Tree | narrowexchange | n/a | 15.0 | Performs local rearrangements of the tree
rootheight | scale | | | root height of the tree of partition
Internal node heights | uniform | n/a | 30.0 | Draws new internal node heights uniformly
popsize | scale | | | coalescent population size parameter of partition
growthrate | randomwalk | | | exponential.growthrate
Substitution rate and heights | updown | | | Scales substitution rates inversely to node heights of the tree

Table 3.6 shows the clock rate statistics for patient I.D. 123. Table 3.7 summarizes the statistics for patient I.D. 181. The substitution rate in the HIV protease coding region for this patient was found to be slightly higher. The shape of the distribution was similar to the one of patient I.D. 123. The phylogenetic tree topologies and parameters were also sampled over the MCMC chain. Tree and parameter values were logged once every 10,000 steps, and the chain was run until the effective sample size, i.e. the number of independent draws from the posterior distribution, exceeded 250 [2]. In our case the effective sample sizes were well over 1000 for the relevant parameters.

Table 3.6: Patient I.D. 123: Clock rate statistics
mean | E-3
stderr of mean | 1.922E-5
median | E-3
geometric mean | E-3
95% HPD lower | E-4
95% HPD upper | E-3
auto-correlation time (ACT) |
effective sample size (ESS) |

Figure 3.5: Patient I.D. 123: Substitution rate distribution (frequency against clock.rate). The 95% confidence interval has been marked in blue. The red region shows the rates sampled outside the interval.

Figure 3.6: Patient I.D. 181: Substitution rate distribution (frequency against clock.rate). The 95% confidence interval has been marked in blue. The red region shows the rates sampled outside the interval.

Table 3.7: Patient I.D. 181: Clock rate statistics
mean | E-3
stderr of mean | E-5
median | E-3
geometric mean | E-3
95% HPD lower | E-4
95% HPD upper | E-2
auto-correlation time (ACT) |
effective sample size (ESS) |

Figure 3.7 shows the ancestor-descendant relationships between sequences collected at different times from patient I.D. 123. Most of the sequences from the first time point share nodes with the sequences from the second and the third time points. However, the sequences from the final time point (i.e. the ones collected in 2007) appear as a clade segregated from the rest of the tree, showing only a faint relationship with low frequency haplotypes from the first time point. In a more rigorous analysis, it was found that haplotype number 25 from the first sampling point (not shown in this tree, as the frequency of the haplotype was 0.02%) was identical to the low frequency haplotype number 5 from the final time point. None of the sequences from the other time points was found to be 100% identical to any of the haplotypes from the final time point. A series of identical haplotypes was found over time. Starting from one side of the tree, we see that haplotype 6 from 2003 branches out to haplotype 2 of 2005, which further branches to a haplotype 3 in a later sample. These three sequences were found to be identical over time and their frequencies fluctuated between 0.8% and 1%, after which the haplotype was no longer visible. There were 9 such haplotypes from the first time point that appeared again in the later stages. These are shown in figure 3.8. We see in this figure that the frequency of the haplotypes decreases from the first to the second time point but increases again in the next time point. The sequences from the final time point show little similarity with the ones from the previous time points.
A pairwise sequence alignment was performed between the majority haplotype from the third time point, with a frequency of 65%, and the haplotype from the final time point with a frequency of 86%. These two sequences showed 97% similarity. Another tree incorporating haplotypes with frequencies as low as 0.2% was constructed to trace any relationship of the sequences from the final time point with low frequency variants from the second and the third time points. However, there were no variants from 2005 and 2006 sharing a node with the sequences sampled in 2007. From a total of eight haplotypes with frequency greater than 0.5% in the first time point, four were found to be present in the second time point. Note that the second sample was collected after almost a year of anti-retroviral therapy. This leads us to conclude that the viral haplotypes were latent during the therapy period but were quick to rebound when the therapy was stopped. So, even though a year passed on the absolute time scale, little happened on the evolutionary time scale.

For patient I.D. 181, the number of haplotypes with frequency greater than 1% progressively increased over time. While there were only 3 haplotypes from the first time point with frequency greater than 1%, the number rose to 23 for the final time point. All the sequences from the first time point could be traced over the three time points. The haplotype with the highest frequency at the first time point remained the one with the highest frequency over the next two time points as well, but its frequency decreased over time from 87% to 38%. This can be clearly seen in figure 3.10.

3.3 Founder Virus Analysis

The analysis was carried out to detect whether the infection was initiated by a single founder virus haplotype. This knowledge is often necessary in tracing the source of infection. It is also useful for explaining the pattern of genetic diversity observed in the patient. If the phylogenetic tree constructed using sequences from only the first time point shows a star-like phylogeny, then the infection is likely to have been initiated by a single virus. When we see distinct clades in the tree, the infection can be assumed to have been initiated by multiple founder viruses. Another reason for the phylogenetic tree not to show a star-like topology can be that the sequences come from a time when immune selection had already started shaping the viral evolution. Care must be taken to use samples that have been collected very close to the time of infection, so that the intra-sample diversity is not higher than 10%.
When the sample shows a star-like phylogeny with low intra-sample diversity, the time of infection can be estimated using the Poisson Fitter tool. Figures 3.11 and 3.12 show the phylogenetic trees constructed using the samples from the first time points for patients I.D. 123 and 181, respectively. These trees clearly do not show the star topology that would indicate a single founder virus infection. Since the exact dates of infection are unknown, it might be the case that the samples used for the analysis were from viruses that were evolving under selective pressure exerted by the immune system. The sample from patient 181 that was used for the analysis was collected a few months after infection (this can be seen from the patient infection timeline shown in figure 2.2), hence this result might be misleading. We see a single virus with frequency equal to 86% in the first time point. It is likely that the infection was started by a single founder virus but, due to the selective immune pressure, the founder virus showed a divergent evolutionary pattern. Poisson Fitter analysis showed that the Hamming distance distribution of the first time point sample did not conform to the distribution expected under neutral evolution from a single founder virus. Another explanation for the observation could be that the samples were not from a time point close to the initiation of infection.

Figure 3.7: Patient I.D. 123: Phylogenetic tree showing the relationship between sequences with frequency greater than 0.5%. Blue marks sequences from 2003, green marks sequences from 2005, while red and orange show sequences from 2006 and 2007 respectively

Figure 3.8: Patient I.D. 123: Haplotypes traced over time with their measured frequencies

Figure 3.9: Patient I.D. 181: Phylogenetic tree showing the relationship between sequences with frequency greater than 0.5%. Blue marks sequences sampled on , green marks sequences from , while orange shows sequences sampled on

Figure 3.10: Patient I.D. 181: Haplotypes traced over time with their measured frequencies

Figure 3.11: Patient I.D. 123: Phylogenetic tree with sequences from 2003 with frequency greater than 0.5%

Figure 3.12: Patient I.D. 181: Phylogenetic tree with sequences from 2005 with frequency greater than 0.5%

Figure 3.13: Patient I.D. 123: Demographic construction of the viral population with frequency greater than 0.5%

3.4 Demographic Reconstruction

The effective population size of the virus was plotted over the period of infection for patient I.D. 123 (figure 3.13) and patient I.D. 181 (figure 3.14). The time points at which the sequences were collected are marked in the plots. We see a slight increase in the viral population over the course of infection for patient I.D. 123. The results of this analysis were robust to the use of different model settings. This indicated that the sequence data was informative and not too sensitive to slight model mis-specification. However, since the model could not be informed about the anti-retroviral therapy period, we do not know whether the results would be sensitive to this information. Since we previously concluded from figure 3.8 that the virus was latent during the therapy, one would assume that the demographic results should not be affected by the therapy phase. For patient I.D. 181, we see no change in the effective population size of the virus over the first year of infection. The times of sample collection are marked in figure 3.14.

As a proof of principle of the coalescent model used in our study, another analysis was performed on a sequence dataset collected over a period of 3 years from an influenza epidemic. The resulting demographic plot was in agreement with the variation observed during the epidemic. The data contained sequences of length 1700 base pairs, with samples collected every few months, giving high coverage over the three-year period. The results are not shown since these are freely available on the BEAST website.
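How a demographic plot of this kind can be assembled from posterior samples is sketched below in Python. This is a simplified illustration, not the BEAST procedure: each posterior state is reduced to a hypothetical pair of change times and piecewise-constant population sizes, time is assumed to run backwards from the most recent sample, and the interval bounds are crude empirical quantiles. The helper `sample_sizes` only generates toy states in which each size is drawn from an exponential distribution whose mean is the previous size. It does not sample from any real posterior.

```python
import random
import statistics


def sample_sizes(first_size, m, rng):
    """Toy generator: m sizes, each drawn with mean equal to the previous one."""
    sizes = [first_size]
    for _ in range(m - 1):
        # random.expovariate takes the rate, which is 1 / mean.
        sizes.append(rng.expovariate(1.0 / sizes[-1]))
    return sizes


def population_at(t, change_times, sizes):
    """Piecewise-constant size: sizes[j] applies before change_times[j]."""
    for boundary, size in zip(change_times, sizes):
        if t < boundary:
            return size
    return sizes[-1]  # oldest interval, beyond the last change time


def summarize(states, grid):
    """For each time in grid return (t, median, lower, upper) over all states."""
    summary = []
    for t in grid:
        values = sorted(population_at(t, ct, sz) for ct, sz in states)
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(round(0.975 * (len(values) - 1)))]
        summary.append((t, statistics.median(values), lower, upper))
    return summary


# Three hand-written states (hypothetical values) with one change point each.
states = [([1.0], [10.0, 20.0]), ([1.0], [12.0, 22.0]), ([1.0], [14.0, 24.0])]
curve = summarize(states, [0.5, 1.5])
toy = sample_sizes(100.0, 5, random.Random(1))
```

Plotting the median together with the lower and upper bounds against the time grid gives a curve with an uncertainty band of the kind shown in figures 3.13 and 3.14.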

Figure 3.14: Patient I.D. 181: Demographic construction of the viral population with frequency greater than 0.5%

3.5 Sliding MinPD

The evolutionary network was constructed for patient I.D. 123 using the Sliding MinPD software. Sequences from 2005, 2006 and 2007 were used in the analysis. The samples from the first time point were ignored, since the majority haplotypes were the same in the first and the second time points. A set of runs was set up with differing window sizes and sliding window sizes. The results of the runs were found to be sensitive to slight changes in the window length. Due to the lack of confidence in the results of this analysis, it was not carried out for patient I.D. 181. The results of the run are shown in figure 3.15 and The parameters for the runs, except the window and the sliding window sizes, are listed in table 3.8.

Table 3.8: Parameters used for constructing the evolutionary network using Sliding MinPD
Active Recombination Detection | Yes
Recombination Detection Option | Bootscan RIP
Crossover option | Many
PCC threshold | p0.4
bootstrap recomb. tiebreaker option | Yes
bootscan seed | E-3
bootscan threshold |
TN93 substitution model |
gamma shape - rate heterogeneity alpha | 0.5
show bootstrap values | No
markers for clustering | Yes
clustering distance threshold | T0.001
clustering option | by bases but post
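The dependence on the window length is easier to see with a stripped-down Python sketch of the sliding window idea. It is not the Sliding MinPD algorithm: it uses the plain proportion of differing sites instead of the TN93 distance, performs no bootscanning or significance testing, and resolves ties by taking the first candidate. The function names and the toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_ancestor(query, candidates):
    """Index and distance of the candidate with minimum pairwise distance."""
    dists = [p_distance(query, c) for c in candidates]
    best = min(range(len(candidates)), key=dists.__getitem__)
    return best, dists[best]


def window_best_matches(query, candidates, window, step):
    """For every window, the index of the closest earlier sequence."""
    matches = []
    for start in range(0, len(query) - window + 1, step):
        end = start + window
        idx, _ = closest_ancestor(query[start:end],
                                  [c[start:end] for c in candidates])
        matches.append((start, idx))
    return matches


def breakpoints(matches):
    """Window starts at which the best-matching candidate changes."""
    return [start for (start, idx), (_, prev)
            in zip(matches[1:], matches[:-1]) if idx != prev]


# Toy example: the query is a mosaic of the two earlier sequences.
earlier = ["AAAAAAAA", "CCCCCCCC"]
matches = window_best_matches("AAAACCCC", earlier, window=4, step=4)
putative = breakpoints(matches)
```

In this toy case a window of 4 places the change of donor at position 4. With a window of 8 the whole query is compared at once, the two candidates tie, and no change of donor is reported, which illustrates how the outcome can change with the window length.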
