Recursive partitioning of resistant mutations for longitudinal markers based on a U-type score


Biostatistics (2011), 0, 0, pp doi: /biostatistics/kxr011
Biostatistics Advance Access published May 19, 2011

CHENGCHENG HU
Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, AZ 85724, USA
hucc@ .arizona.edu

VICTOR DEGRUTTOLA
Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA

SUMMARY

Development of human immunodeficiency virus resistance mutations is a major cause of failure of antiretroviral treatment. We develop a recursive partitioning method to correlate high-dimensional viral sequences with repeatedly measured outcomes. The splitting criterion of this procedure is based on a class of U-type score statistics. The proposed method is flexible enough to apply to a broad range of problems involving longitudinal outcomes. Simulation studies are performed to explore the finite-sample properties of the proposed method, which is also illustrated through analysis of data collected in 3 phase II clinical trials testing the antiretroviral drug efavirenz.

Keywords: Antiretroviral drugs; Longitudinal data; Recursive partitioning; Repeated measurements; Resistance mutations; Tree method.

1. INTRODUCTION

Highly active antiretroviral therapy (HAART), which usually consists of 3 or more different antiretroviral drugs, has been very effective in suppressing the replication of human immunodeficiency virus (HIV) and slowing the progression to acquired immune deficiency syndrome (AIDS). After HAART initiation, viral load, which is quantified as the number of HIV RNA copies per milliliter of blood plasma, quickly declines. Under the selective pressure of antiretroviral drugs, however, viruses may develop drug resistance mutations, resulting in a slower decline or even a rebound of the viral load.
These mutations tend to persist after a change in therapy and affect subsequent treatment regimens, as there is considerable cross-resistance among drugs within the same drug class. Furthermore, minority species of virus with drug resistance mutations, present since initial infection, may emerge as the dominant species under drug pressure. Antiretroviral resistance is a major cause of treatment failure.

To whom correspondence should be addressed. © The Author. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

In clinical studies of HIV-infected patients, resistance mutations are generally monitored by sequencing the relevant portions of the HIV genome; in this paper, we focus on the reverse transcriptase (RT) region of the HIV genome,

corresponding to the classes of drugs used in the studies that motivated our research. These genetic regions have a total length of several hundred codons, and high dimensionality complicates the effort to understand the relationship between HIV genotype and viral response to therapy. To apply traditional statistical methods to viral genetic data, we need to reduce the dimensionality of the viral genome and to classify the genetic sequences into a small number of relatively homogeneous classes. Such a method may also be useful in stratifying patients in clinical trials and in guiding treatment choices.

Recursive partitioning is of particular interest in the study of HIV resistance mutations. Such methods naturally accommodate high-dimensional data, allow exploration of complex interactions among codons, and provide easily interpretable results. For an introduction to recursive partitioning methods, readers are referred to the seminal work of Breiman and others (1984), while Zhang and Singer (1999) give a summary of more recent developments. This method has been employed to investigate the association between drug resistance mutations and a single measurement of viral load or of the 50% inhibitory concentration (IC50) of a certain drug, which is a drug susceptibility phenotype. See Sevin and others (2000), Foulkes and DeGruttola (2002), Foulkes and others (2004), and Beerenwinkel and others (2002) for details.

Recursive partitioning methods have also been developed for repeatedly measured responses. Zhang (1998) studied multiple binary responses with joint probability distribution from the exponential family. The method of Segal (1992) for longitudinal responses requires the observations to be spaced equally in time, and the covariance structure needs to be modeled.
Lee (2006) proposed a recursive partitioning method based on generalized estimating equation (GEE) models, which also requires either a common set of observation time points for all study participants or a model for the covariance structure.

In this article, we develop a tree-based method to classify HIV genetic sequences with respect to longitudinal outcome measures such as plasma HIV-1 RNA (viral load) and CD4 count. Although in most studies all participants follow the same schedule of clinical visits, at which the biomarkers are measured, the actual visit time could be weeks or even months away from the scheduled time. Also, the trajectory of a patient's longitudinal viral load can be quite erratic, and parametric modeling could be very difficult.

Fig. 1. Viral load trajectories of 6 participants in the DMP-266 studies.

As examples, Figure 1 shows viral RNA trajectories of 6 subjects who started combination therapy

containing the drug efavirenz. Two subjects, represented by solid curves, kept the viral load suppressed below 400 copies/ml (the horizontal line) from a few weeks after treatment initiation. Two other subjects, represented by broken curves, initially had suppressed viral load, which then rebounded to high levels. The viral load of the remaining 2 subjects (dotted curves) was never suppressed. Finding an appropriate parametric model for viral load trajectories would be challenging: the variable is generally measured at different time points for different subjects, and the mean structure is hard to model.

The method we propose in Section 2 is fully nonparametric and does not require modeling the mean and covariance structure; it also allows irregular times of measurement. Performance of the method is studied by simulation in Section 3 under realistic settings with moderate sample sizes, and in Section 4, the method is applied to data from 3 phase II clinical trials, from which the 6 subjects shown in Figure 1 are selected. Discussions and potential extensions are relegated to a later section.

2. METHODS

2.1 Definition of U-type score

For a study of $n$ subjects, assume that the viral load of the $i$th subject ($i = 1, \ldots, n$) was measured at baseline $t_{i0} = 0$ and at $n_i$ subsequent time points $0 < t_{i1} < \cdots < t_{i,n_i}$, with values $Y_{i0}, Y_{i1}, \ldots, Y_{i,n_i}$, respectively. Viral load declines in the first few months of successful treatment; we restrict our attention to this phase of the study to focus on therapeutic efficacy. Similar to May and DeGruttola (2007), for 2 different subjects indexed by $i$ and $j$ and one viral load measurement from each, taken at $t_{ik} > 0$ and $t_{jl} > 0$, respectively, we define

$$
D\{(Y_{ik}, t_{ik}), (Y_{jl}, t_{jl})\} =
\begin{cases}
1 & \text{if } Y_{ik} < Y_{jl} \text{ and } t_{ik} \le t_{jl}, \text{ or } Y_{ik} = Y_{jl} \text{ and } t_{ik} < t_{jl}; \\
-1 & \text{if } Y_{ik} > Y_{jl} \text{ and } t_{ik} \ge t_{jl}, \text{ or } Y_{ik} = Y_{jl} \text{ and } t_{ik} > t_{jl}; \\
0 & \text{otherwise.}
\end{cases}
\tag{1}
$$
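As a concrete reading of the pointwise comparison just defined, the following minimal Python sketch encodes the three cases. The function name `pointwise_score` and its argument order are illustrative choices made here, not part of the paper.

```python
def pointwise_score(y_i, t_i, y_j, t_j):
    """Pointwise comparison of one measurement from subject i and one from subject j.

    Returns +1 when subject i reached a lower (or equal) value no later than
    subject j, -1 in the mirrored situation, and 0 when the pair is not comparable.
    """
    if (y_i < y_j and t_i <= t_j) or (y_i == y_j and t_i < t_j):
        return 1
    if (y_i > y_j and t_i >= t_j) or (y_i == y_j and t_i > t_j):
        return -1
    return 0


# Subject i is lower at an earlier time: i is judged better.
print(pointwise_score(2.0, 4.0, 3.0, 8.0))
# Subject i is lower but at a later time: no judgement is made.
print(pointwise_score(2.0, 8.0, 3.0, 4.0))
```

The score is antisymmetric by construction: swapping the two measurements flips the sign, which is what the expectation arguments in Section 2.2 rely on.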
This score compares 2 outcome measurements of 2 different subjects; if the $i$th subject achieved a lower viral load at $t_{ik}$ than did the $j$th subject at time $t_{jl} \ge t_{ik}$, a score of 1 is assigned. On the other hand, if the $i$th subject had a higher viral load at $t_{ik}$ than did the $j$th subject at time $t_{jl} \le t_{ik}$, a score of $-1$ is assigned. Since viral load is expected to decline, we assign a score of 0 if a subject achieved a lower viral load level at a later time than did another subject, as we are unable to judge which performance was superior. To compare the rates of decline for any 2 subjects, we average the scores of all pairwise comparisons between their viral load measurements:

$$
D(i, j) = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} D\{(Y_{ik}, t_{ik}), (Y_{jl}, t_{jl})\}.
\tag{2}
$$

This score compares the full viral load trajectories of the 2 subjects. A positive score implies that the $i$th subject had a better response to the therapy than did the $j$th subject.

2.2 Properties of U-type score

When the viral loads of 2 subjects follow exactly the same downward trend and the observations at different times follow the same process, we can easily show that the U-type score comparing them has expectation 0. Prospective studies, including clinical trials, normally require the study participants to visit clinics at a fixed set of time points for collection of laboratory and clinical data. The date at which the

visit is scheduled depends on the convenience of both the study subject and the clinic staff, and hence could differ from the pre-set timeline by a few days or even weeks. For the $i$th subject, let $t^*_{ik}$ be the $k$th time point ($k = 0, \ldots, m$) of scheduled clinical visit. Note that $\{t^*_{i0}, \ldots, t^*_{im}\}$ could differ from $\{t_{i0}, \ldots, t_{i,n_i}\}$ since the subject might miss one or more of the clinical visits. Let $\delta_{ik}$ be an indicator of whether the $i$th subject had the $k$th scheduled visit. In the decline phase of an HIV study, we can generally assume that $\{t^*_{i0}, \ldots, t^*_{im}\}$ and $\{\delta_{i0}, \ldots, \delta_{im}\}$ are identically and independently distributed across all study participants. We also assume that the visit indicators are independent of all other data, that is, visits are missing completely at random. Let $Y^*_{ik}$ be the (possibly unobserved) viral load of the $i$th subject at the scheduled time $t^*_{ik}$. It can be seen that

$$
D(i, j) = (n_i n_j)^{-1} \sum_{k=1}^{m} \sum_{l=1}^{m} \delta_{ik} \delta_{jl} D\{(Y^*_{ik}, t^*_{ik}), (Y^*_{jl}, t^*_{jl})\}.
$$

If the viral load process is the same for both the $i$th subject and the $j$th subject, the viral load data of the 2 subjects are identically distributed. For any $k$ and $l$, exchanging the roles of $i$ and $j$ and then using the antisymmetry of the score gives

$$
E\, D\{(Y^*_{ik}, t^*_{ik}), (Y^*_{jl}, t^*_{jl})\} = E\, D\{(Y^*_{jk}, t^*_{jk}), (Y^*_{il}, t^*_{il})\} = -E\, D\{(Y^*_{il}, t^*_{il}), (Y^*_{jk}, t^*_{jk})\}.
$$

In the special case of $k = l$, this reduces to $E\, D\{(Y^*_{ik}, t^*_{ik}), (Y^*_{jk}, t^*_{jk})\} = 0$. Since the visit indicators $\delta_{ik}$ are assumed to be independent of all other data, we get $E\, D(i, j) = 0$.

When the viral load processes for subjects $i$ and $j$ are different, assume $Y^*_{ik} = \mu_i(t^*_{ik}) + \epsilon_i(t^*_{ik})$ and $Y^*_{jl} = \mu_j(t^*_{jl}) + \epsilon_j(t^*_{jl})$, where $\mu_i(\cdot)$ and $\mu_j(\cdot)$ are deterministic functions representing the means of the time-varying viral loads for the 2 subjects, and $\epsilon_i(\cdot)$ and $\epsilon_j(\cdot)$ are independently and identically distributed mean-zero error processes that are also independent of all other data.
If $\mu_j(t) > \mu_i(t)$ for all positive time $t$ in the range we consider, we can easily prove that for $1 \le k < l \le m$,

$$
E\left[ D\{(Y^*_{ik}, t^*_{ik}), (Y^*_{jl}, t^*_{jl})\} + D\{(Y^*_{il}, t^*_{il}), (Y^*_{jk}, t^*_{jk})\} \right] > 0,
$$

and for $1 \le k \le m$, $E\, D\{(Y^*_{ik}, t^*_{ik}), (Y^*_{jk}, t^*_{jk})\} > 0$. We then have $E\, D(i, j) > 0$. Thus, when the viral loads of subjects $i$ and $j$ are identically distributed, the expectation of $D(i, j)$ is 0, and if the viral load of the $i$th subject has a mean curve lower than that of the $j$th subject, the expectation of $D(i, j)$ is positive. These properties demonstrate that $D(\cdot, \cdot)$ provides a valid pairwise comparison measure between any 2 subjects.

2.3 Alternative definition of score

Flexibility in the definition of the pairwise comparison score allows accommodation of special characteristics of the outcome measures. For example, in a study with longitudinal viral loads as the outcomes, we can revise the score $D(\cdot, \cdot)$ to take into account information about viral response. If a subject fails to attain viral load suppression below a certain threshold in the first few months of receiving a new therapy, the viral load is unlikely to become suppressed later on the same treatment, and the therapy is considered to have failed. If one subject always has higher viral load levels than does another, but neither attains viral load levels below a fixed threshold, both are considered treatment failures and we may not want to distinguish between them. Also, if the viral load of a subject who complies with therapy initially becomes suppressed but rebounds later, viral load is unlikely to be suppressed again on the same treatment, which is also considered to have failed. Again we might not want to distinguish between 2 such subjects. Regardless of the nature of treatment failure, it may be most appropriate to consider any subject who failed as having had a worse response than any subject who did not fail within the same interval of time.
Investigators may choose to consider subjects whose viral loads are never suppressed to have worse treatment response than those who have an initial viral suppression followed by a rebound. Alternatively, because subjects of the latter description are at greater risk of developing new resistance mutations than are those who never responded to treatment, we may consider transient response to be worse than no response. The choice may depend on the goal of the analysis: focusing on biological activity would lead to different choices for the score definition than focusing on clinical benefit. Viral suppression usually occurs within a few weeks after initiation of an effective new treatment. In HIV/AIDS studies, the most common thresholds for viral suppression are 400 copies/ml or 50 copies/ml, depending on assay characteristics. The definition of viral rebound varies across studies; in the studies

we consider, the definition was 2 consecutive viral loads of over 400 copies/ml (or 50 copies/ml) or one single viral load of over 4000 copies/ml after an initial viral suppression. For the $i$th subject, let

$$
V(i) =
\begin{cases}
0 & \text{for viral suppression;} \\
1 & \text{for initial viral suppression followed by rebound;} \\
2 & \text{for no suppression.}
\end{cases}
$$

Now, we can define an alternative score comparing 2 different subjects indexed by $i$ and $j$:

$$
\tilde{D}(i, j) =
\begin{cases}
1 & \text{if } V(i) < V(j); \\
-1 & \text{if } V(i) > V(j); \\
0 & \text{if } V(i) = V(j) > 0; \\
D(i, j) & \text{if } V(i) = V(j) = 0,
\end{cases}
\tag{3}
$$

where $D(i, j)$ is defined in (2). In this definition, an initial viral suppression followed by rebound is considered a response superior to no suppression.

Another marker of interest in HIV studies is CD4 T-lymphocyte count, which is expected to rise after initiation of an effective therapy. A more rapid rise is considered to be a more favorable response; hence, a score can be defined as in (1), reversing the directions of the inequalities involving the outcomes.

2.4 Construction of trees

Here, we introduce a recursive partitioning method to correlate baseline variables of potentially high dimension, like viral genetic mutations, with trajectories of a longitudinal biomarker, like viral load. A splitting criterion based on the U-type score $D(\cdot, \cdot)$ defined in (2) measures the difference between 2 mutually exclusive groups of study subjects. Based on a viral genetic mutation or other baseline variable chosen by the splitting criterion, the cohort of all subjects (defined as the root node) is divided into 2 subgroups, called daughter nodes. The same splitting rule is then applied recursively to each daughter node to build a large tree. A pruning procedure is then developed to remove some branches to obtain a tree of proper size.
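To make the trajectory score (2) and the alternative score of Section 2.3 concrete, here is a minimal NumPy sketch. It assumes viral loads are given on the copies/ml scale for post-baseline visits only, and it encodes the rebound rule quoted above (2 consecutive values over the suppression threshold, or a single value over 4000, after an initial suppression). The function names and default thresholds are illustrative assumptions, not the authors' code.

```python
import numpy as np


def trajectory_score(y_i, t_i, y_j, t_j):
    """Average of the pointwise scores over all post-baseline measurement pairs."""
    y_i, t_i, y_j, t_j = (np.asarray(a, dtype=float) for a in (y_i, t_i, y_j, t_j))
    dy = y_i[:, None] - y_j[None, :]   # outcome differences for every (k, l) pair
    dt = t_i[:, None] - t_j[None, :]   # time differences for every (k, l) pair
    plus = ((dy < 0) & (dt <= 0)) | ((dy == 0) & (dt < 0))
    minus = ((dy > 0) & (dt >= 0)) | ((dy == 0) & (dt > 0))
    return float(plus.sum() - minus.sum()) / (len(y_i) * len(y_j))


def response_class(y, threshold=400.0, single_rebound=4000.0):
    """V(i): 0 = suppressed, 1 = suppressed then rebounded, 2 = never suppressed."""
    y = np.asarray(y, dtype=float)
    below = np.flatnonzero(y < threshold)
    if below.size == 0:
        return 2
    after = y[below[0] + 1:]           # values after the first suppressed one
    high = after > threshold
    two_consecutive = bool(np.any(high[:-1] & high[1:]))
    one_very_high = bool(np.any(after > single_rebound))
    return 1 if (two_consecutive or one_very_high) else 0


def alternative_score(y_i, t_i, y_j, t_j):
    """Alternative score: compare response classes first, trajectories only if both suppressed."""
    v_i, v_j = response_class(y_i), response_class(y_j)
    if v_i < v_j:
        return 1.0
    if v_i > v_j:
        return -1.0
    if v_i > 0:
        return 0.0
    return trajectory_score(y_i, t_i, y_j, t_j)


weeks = [4, 8, 12]
print(alternative_score([1000, 300, 100], weeks, [5000, 350, 200], weeks))
```

For a CD4-type marker that is expected to rise, the same sketch could be reused by negating the outcome values before calling `trajectory_score`, which reverses the outcome inequalities as described in the text.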
The methods are motivated by those of Breiman and others (1984) and also by the survival tree methodology of LeBlanc and Crowley (1993), who used the log-rank statistic as the splitting criterion.

For a study of $n$ subjects, assume all had the relevant genetic sequence at study entry. For the $i$th subject ($i = 1, \ldots, n$), let $Z_{ir}$ be an indicator of mutation at the $r$th codon ($r = 1, \ldots, R$). Let $v$ be an arbitrary node, which is a subset of study participants. For any $r$, let $w_r$ be the subset of $v$ consisting of all study subjects having a mutation at the $r$th codon: $w_r = \{i \in v : Z_{ir} = 1\}$. Thus, $v \setminus w_r$ consists of subjects with a wild-type $r$th codon. Let

$$
G(r, v) = \frac{1}{\hat{\sigma}_r^2} \left\{ \sum_{i \in w_r} \sum_{j \in v \setminus w_r} D(i, j) \right\}^2,
$$

where $\hat{\sigma}_r$ is an estimator for the standard deviation of the sum of U-type scores in the formula above.

As for general tree methods, a binary split of $v$ can also be defined by dichotomizing a baseline continuous or ordinal variable $Z$, where the 2 daughter nodes are defined as $\{Z \ge z\}$ and $\{Z < z\}$ for a threshold $z$, or the split can be defined by dichotomizing a baseline nominal (unordered categorical) variable $W$, where the 2 daughter nodes are defined as $\{W \in A\}$ and $\{W \notin A\}$ for a subset $A$ of all levels of $W$.
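The splitting statistic can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it takes a precomputed matrix of pairwise scores for the subjects in a node, and it standardizes the mean score $U_w$ using the two-sample U-statistic variance estimate described later in this section, so that the returned value equals $U_w^2 / (\hat{\sigma}_{10}^2/m_1 + \hat{\sigma}_{01}^2/m_2)$. Returning `nan` when a daughter node has fewer than 2 subjects or the variance estimate is not positive is an assumption made here; the paper does not say how such cases are handled.

```python
import numpy as np


def split_statistic(D, in_w):
    """Standardized squared score for splitting a node into w and its complement.

    D[i, j] holds the pairwise score D(i, j) for subjects in the node;
    in_w is a boolean mask marking the subjects that fall in w.
    """
    D = np.asarray(D, dtype=float)
    in_w = np.asarray(in_w, dtype=bool)
    block = D[np.ix_(in_w, ~in_w)]             # rows: w, columns: complement of w
    m1, m2 = block.shape
    if m1 < 2 or m2 < 2:
        return float("nan")                    # assumed guard: estimator undefined
    u = block.mean()                           # U_w
    row_sum, col_sum = block.sum(axis=1), block.sum(axis=0)
    # Sum over j != j' of D(i,j) D(i,j') equals (row sum)^2 minus the sum of squares.
    s10 = np.mean((row_sum**2 - (block**2).sum(axis=1)) / (m2 * (m2 - 1))) - u**2
    s01 = np.mean((col_sum**2 - (block**2).sum(axis=0)) / (m1 * (m1 - 1))) - u**2
    var_u = s10 / m1 + s01 / m2                # estimated variance of U_w
    if var_u <= 0:
        return float("nan")                    # assumed guard: non-positive estimate
    return float(u**2 / var_u)


# Toy node of 6 subjects: the first 3 carry the mutation.
block = np.array([[1, 1, 1], [1, 1, 1], [0, 0, 0]], dtype=float)
D = np.zeros((6, 6))
D[:3, 3:] = block
D[3:, :3] = -block.T                           # scores are antisymmetric
mask = np.array([True, True, True, False, False, False])
print(split_statistic(D, mask))
```

Because the scores are antisymmetric, exchanging the roles of the two daughter nodes leaves the statistic unchanged, so only the partition matters and not which side is labeled $w$.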

The variance estimator $\hat{\sigma}_r$ can be obtained from general theory on two-sample U-statistics, which is covered in textbooks like Serfling (1980). For an arbitrary binary split of $v$ into $w$ and $v \setminus w$, let $m_1 = |w|$ and $m_2 = |v \setminus w|$ be the sizes of $w$ and $v \setminus w$, respectively, and let $m = m_1 + m_2$. Define

$$
U_w = (m_1 m_2)^{-1} \sum_{i \in w} \sum_{j \in v \setminus w} D(i, j).
$$

We have

$$
\sqrt{m}\,(U_w - E\, U_w) \xrightarrow{d} N\left(0,\ \sigma_{10}^2 / p + \sigma_{01}^2 / (1 - p)\right),
$$

where $\sigma_{10}^2 = \mathrm{Cov}\{D(i, j), D(i, j')\}$, $\sigma_{01}^2 = \mathrm{Cov}\{D(i, j), D(i', j)\}$, and $p$ is the limit of $m_1 / m$. Here, $i, i' \in w$ with $i \ne i'$, while $j, j' \in v \setminus w$ with $j \ne j'$. These quantities can be easily estimated from the data:

$$
\hat{\sigma}_{10}^2 = (m_1)^{-1} \sum_{i \in w} \{m_2 (m_2 - 1)\}^{-1} \sum_{j \ne j'} D(i, j) D(i, j') - U_w^2,
$$

and

$$
\hat{\sigma}_{01}^2 = (m_2)^{-1} \sum_{j \in v \setminus w} \{m_1 (m_1 - 1)\}^{-1} \sum_{i \ne i'} D(i, j) D(i', j) - U_w^2.
$$

As mentioned earlier, a binary split can be defined based on any ordered or unordered baseline variable (Breiman and others, 1984). For brevity of notation, however, we assume that there is no baseline variable other than the mutation indicators. We now split the node $v$ based on the mutation status at the $r^*$th codon, where $G(r^*, v) = \max_{1 \le r \le R} G(r, v)$. The 2 daughter nodes are then $w_{r^*}$ and $v \setminus w_{r^*}$, with $G(v) \equiv G(r^*, v)$ a goodness of split measure indicating the difference in viral load trajectories of the 2 daughter nodes. We apply this procedure first to the root node of all subjects, and then recursively to the resulting daughter nodes. The procedure does not stop until every terminal node satisfies one of the criteria: the node size is smaller than a prespecified number, say 5; the node contains only homogeneous subjects with respect to the splitting criterion, that is, $G(r, \cdot) = 0$ for all $1 \le r \le R$ or all subjects in the node have the same baseline viral sequence. This splitting process generally results in a large tree, and pruning is needed to avoid overfitting.

2.5 Determination of proper tree size

To describe the pruning procedure, we introduce some notation.
For any tree or any branch of a tree $T$, $\tilde{T}$ denotes the set of all terminal nodes of $T$, and $T^o \equiv T \setminus \tilde{T}$ denotes all internal nodes of $T$. When $T_1$ is a subtree of $T_2$, we write $T_1 \preceq T_2$, while $T_1 \prec T_2$ specifies that $T_1$ is a proper subtree of $T_2$. Also $T_0$ denotes the large tree obtained in the splitting procedure. The goodness of split measure for a tree or a branch of a tree $T$, $G(T) = \sum_{v \in T^o} G(v)$, is the sum of goodness of split measures for all internal nodes of $T$. When further splits are made, $G(T)$ can only increase. As in Breiman and others (1984) and LeBlanc and Crowley (1993), we add a penalty term to the splitting criterion, penalizing the complexity of the tree. For $\alpha \ge 0$, define $G_\alpha(T) = G(T) - \alpha |T^o|$. Thus, $\alpha$ is the penalty deducted from the splitting criterion for each additional split. Our goal is to find a subtree of $T_0$ that maximizes $G_\alpha(\cdot)$ for any $\alpha \ge 0$. As pointed out by LeBlanc and Crowley (1993), $\alpha$ can be chosen between 2 and 4. Using $\alpha = 2$ is in the same spirit as the Akaike information criterion, while a penalty of $\alpha = 4$ is roughly equivalent to setting the significance level at 0.05 for each split.

For any tree $T$, if $T' \preceq T$ and $G_\alpha(T') = \max_{T'' \preceq T} G_\alpha(T'')$, we call $T'$ an optimally pruned subtree of $T$ for the penalty value $\alpha$. For $\alpha \ge 0$, let $T(\alpha)$ denote the smallest optimally pruned subtree of $T$ with respect to $\alpha$. For any node $v$ of $T$, let $T_v$ be the branch of $T$ rooted at $v$. As in Breiman and others (1984), we can establish the existence of $T(\alpha)$ by induction on $|T|$ and further prove that $T(\alpha) = \{v \in T : G_\alpha(T_{v'}) > 0 \text{ for any ancestor } v' \text{ of } v\}$. Note that $T_0(0) = T_0$.

To identify $T_0(\alpha)$ for all $\alpha \ge 0$, for any node $v \in T_0$, we define

$$
g_0(v) =
\begin{cases}
G(T_{0v}) / |T^o_{0v}| & \text{if } v \in T_0^o, \\
+\infty & \text{otherwise.}
\end{cases}
$$

Here, $T_{0v}$ is the branch of $T_0$ rooted at $v$. Let $\alpha_1 = \min_{v \in T_0} g_0(v)$. Then $T_0 = T_0(\alpha)$ for all $\alpha < \alpha_1$, and $T_1 = T_0(\alpha_1)$ can be obtained by removing from $T_0$ every branch $T_{0v}$ with $g_0(v) = \alpha_1$. Now, for any $v \in T_1$, define

$$
g_1(v) =
\begin{cases}
G(T_{1v}) / |T^o_{1v}| & \text{if } v \in T_1^o, \\
+\infty & \text{otherwise.}
\end{cases}
$$

Let $\alpha_2 = \min_{v \in T_1} g_1(v)$. Then $T_1 = T_0(\alpha)$ for $\alpha_1 \le \alpha < \alpha_2$, and $T_2 = T_0(\alpha_2)$ can be obtained by removing from $T_1$ every branch $T_{1v}$ with $g_1(v) = \alpha_2$. This process is repeated until for some $S > 0$, $T_S$ is reduced to the tree containing the root node of $T_0$ only. We then have a sequence of trees $T_S \prec \cdots \prec T_1 \prec T_0$ and a sequence of numbers $\infty = \alpha_{S+1} > \alpha_S > \cdots > \alpha_1 > \alpha_0 = 0$ so that $T_s = T_0(\alpha)$ for $\alpha_s \le \alpha < \alpha_{s+1}$.

Like LeBlanc and Crowley (1993), we define the cost for a node $v$ as $R(v) = \sum_{v' \in T^o_{0v}} G(v')$ if $v \in T_0^o$, and 0 otherwise. For any $T \preceq T_0$, let $R(T) = \sum_{v \in \tilde{T}} R(v)$ and $R_\alpha(T) = R(T) + \alpha |\tilde{T}|$. This implies that $G_\alpha(T) = G(T_0) + \alpha - R_\alpha(T)$, so our proposed pruning process can be described in terms of $R_\alpha(\cdot)$ instead of $G_\alpha(\cdot)$, and all optimal properties of the pruning process can be proved as in Chapter 10 of Breiman and others (1984).

Since $G_\alpha$ is calculated from the same data used to grow and prune the tree, it is a biased estimate of the expected $G_\alpha$ for the same tree applied to a new independent data set. If the sample size is large, we can randomly split the data into a training sample and a test sample. A tree will be grown and pruned on the training sample, and the goodness of split measures can be calculated using the test sample, resulting in an unbiased estimate of $G_\alpha$. Data sets of moderate size do not allow for separate training and test samples, so resampling methods are used to calculate a proper estimate of $G_\alpha$ as in LeBlanc and Crowley (1993).

Let $X_1$ and $X_2$ be 2 samples of data, and let $G(X_1; X_2, T)$ be the $G$ statistic calculated from $X_1$ for a tree $T$ built on $X_2$. Then $G(T) = G(X; X, T)$ for the data set $X$, and what we need is an estimator for $G^*(T) = E\{G(X^*; X, T)\}$, where $X^*$ is a sample independent of $X$. Let $\alpha'_s = \sqrt{\alpha_s \alpha_{s+1}}$ for $1 \le s \le S - 1$. We use the following resampling method to identify the proper tree size.
First, draw $B$ bootstrap samples from $X$ and denote them by $X_b$ ($b = 1, \ldots, B$). Then grow a large tree $T_b$ for each $b$, and for each $s$ find $T_b(\alpha'_s)$, the optimally pruned subtree of $T_b$ for the per-split penalty $\alpha'_s$. The overoptimism in $G(T_b(\alpha'_s))$ can be estimated by $O_{bs} = G(X_b; X_b, T_b(\alpha'_s)) - G(X; X_b, T_b(\alpha'_s))$. Then $\hat{\omega}_s = (\sum_{b=1}^{B} O_{bs}) / B$ is a reasonable estimate for the overoptimism in $G(T(\alpha'_s))$. For any fixed per-split penalty $\alpha$, we can then choose the optimal subtree as the $T(\alpha'_s)$ that maximizes

$$
\hat{G}_\alpha(T(\alpha'_s)) = G(X; X, T(\alpha'_s)) - \hat{\omega}_s - \alpha |T^o(\alpha'_s)|.
$$

If $\hat{G}_\alpha(T(\alpha'_s))$ is maximized at $s = S - 1$ and $\hat{G}_\alpha(T(\alpha'_{S-1})) < 0$, the optimal subtree contains only the root node.

3. SIMULATION STUDIES

Simulation studies are carried out under various settings. In each setting, $n = 200$ subjects have baseline ($t = 0$) viral load and viral genetic sequences and are scheduled to have 5 viral load measurements at $t = 1, 2, \ldots, 5$. Thus, $n_i = 5$ for $i = 1, \ldots, n$. For any scheduled $t > 0$, the actual measurement time follows a uniform distribution on $[t - 0.25, t + 0.25]$. For $r = 1, \ldots, 30$, let $C_{ir}$ be the indicator of mutation at the $r$th codon; these indicators are all independent in the simulation. Let $Y_{ik}$ be the $\log_{10}$ viral load at time $t_{ik}$ for the $i$th subject. In all scenarios, we assume $Y_{ik} = \beta_0 + \beta_{1i} t_{ik} + \epsilon_{ik}$, where the random error $\epsilon_{ik} \sim N(0, \sigma^2)$ is independent of other variables. We set $\sigma = \log_{10} 2$, so that a change of one standard deviation in the error is equivalent to doubling or halving the viral load on the original scale. The baseline mean $\beta_0$ is set to be $\log_{10}(50000)$, close to the median observation in the efavirenz studies to be analysed in the next section. The slope $\beta_{1i}$ depends on the resistance status of each patient. All simulated viral loads below 50 copies/ml are imputed at 50, reflecting the threshold of accurate quantification for commonly used assays. At any split, the left daughter node is associated with better response. Data are simulated under 4 general scenarios:

I. Each codon has a 0.5 chance to be mutant and none is resistant. Two values of the slope $\beta_{1i}$ are simulated: $\beta_{1i} = -\log_{10}(2)$ (rapid decline) and $\beta_{1i} = -\log_{10}(2)/2.5$ (slow decline). Here, the correct tree is one with root node only.

II. Each codon has a 0.5 chance to be mutant, and a mutant first codon confers resistance to the study drug. For resistant subjects $\beta_{1i} = 0$, and for other subjects sensitive to drugs, we simulate 2 separate settings of rapid decline ($\beta_{1i} = -\log_{10}(2)$) and slow decline ($\beta_{1i} = -\log_{10}(2)/2.5$). Here, the correct tree has a single split on codon 1.

III. Each codon has a 0.5 chance to be mutant, and a subject is resistant to drugs only if the first and second codons are both mutant. Hence, about 25% of all subjects are resistant. For resistant subjects, the slope $\beta_{1i} = 0$, and for the sensitive subjects, we simulate 2 separate settings of rapid decline ($\beta_{1i} = -\log_{10}(2)$) and slow decline ($\beta_{1i} = -\log_{10}(2)/2.5$). Here, the correct tree has 2 splits on the first 2 codons in either order, and the second split should be at the right daughter node of the root.

IV. The first codon has probability 1/3 to be mutant and the other codons have a 0.5 chance to be mutant. A subject is resistant if $C_{i1} = 1$, with slope $\beta_{1i} = 0$. A subject is sensitive to drugs if $C_{i1} = C_{i2} = 0$, with the slope $\beta_{1i}$ set to be negative. If $C_{i1} = 0$ and $C_{i2} = 1$, the viral load declines initially at the first 3 time points after baseline and then rebounds to baseline level. For these rebounding subjects, the slope $\beta_{1i}$ is negative for $k \le 3$ and $\beta_{1i} = 0$ for $k = 4, 5$. Under this scenario, the cohort is divided into 3 subsets of approximately equal sizes: resistant, sensitive, and rebounding. Again we simulate 2 separate settings of rapid decline ($\beta_{1i} = -\log_{10}(2)$) and slow decline ($\beta_{1i} = -\log_{10}(2)/2.5$) for the sensitive subjects and for the initial decline phase of the rebounding subjects.
Multiple trees are considered correct: the root node is split on codon 1 and its left daughter node is split on codon 2; the root node is split on codon 2 and its left daughter node is split on codon 1; and also the latter tree with an extra split added to the right daughter node based on codon 1.

One thousand data sets are generated for each setting. In the pruning process, we use 50 bootstrap samples to calculate the overoptimism in the goodness of split measure. Two values are used for the per-split penalty ($\alpha = 2$ and 4). Table 1 provides the proportions of getting various types of trees for each setting. The correct trees under all scenarios have been described above. When there should be at least one split but the resulting tree has none, we refer to it as a null tree. A noisy tree contains all correct split(s) and also some extra noise. A partially correct tree (denoted by partial in the table) contains some but not all correct splits, and an incorrect tree has none of the correct split(s) but some noise.

For the first scenario of no resistance mutations, the method performs well for all settings, with >95% power to obtain the correct tree. Not surprisingly, the performance is slightly better in the slow decline setting, where observations at different time points are closer to each other. For the second scenario with one resistance mutation, the method performs well in all settings, with >97% power to obtain the correct tree. For the small number of data sets where the correct tree is not obtained, all selected trees have the root node correctly split on codon 1, with some added noise. For the third scenario with an interaction effect of 2 codons, the power to identify the correct tree is very high (96%) in the rapid decline setting. In the slow decline setting, viral load trajectories of resistant and sensitive subjects are not as well separated as in the rapid decline setting, and the power is moderately high (>80%).
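The data-generating model of scenario II can be sketched as below. This is an illustrative reconstruction, not the authors' simulation code: the function name and seed handling are assumptions, and the measurement-time window is taken to be symmetric (0.25 on each side of the scheduled time), which is an assumption where the transcription of the interval is incomplete.

```python
import numpy as np


def simulate_scenario_ii(n=200, n_codons=30, n_visits=5, rapid=True, seed=0):
    """Simulate codon indicators, visit times, and log10 viral loads for scenario II."""
    rng = np.random.default_rng(seed)
    codons = rng.integers(0, 2, size=(n, n_codons))        # each codon mutant with prob. 0.5
    scheduled = np.arange(1, n_visits + 1)
    # Assumed symmetric jitter around each scheduled visit time.
    times = scheduled + rng.uniform(-0.25, 0.25, size=(n, n_visits))
    beta0 = np.log10(50000)
    decline = -np.log10(2) if rapid else -np.log10(2) / 2.5
    slope = np.where(codons[:, 0] == 1, 0.0, decline)      # mutant codon 1 means resistant
    noise = rng.normal(0.0, np.log10(2), size=(n, n_visits))
    log_vl = beta0 + slope[:, None] * times + noise
    log_vl = np.maximum(log_vl, np.log10(50))              # values below 50 copies/ml set to 50
    return codons, times, log_vl


codons, times, log_vl = simulate_scenario_ii()
resistant = codons[:, 0] == 1
print(log_vl[resistant, -1].mean(), log_vl[~resistant, -1].mean())
```

Scenarios I, III, and IV differ only in how `slope` is derived from the codon indicators (and, for IV, in the mutation probability of codon 1 and the rebound after the third visit), so the same skeleton could be adapted to them.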
For the fourth scenario with resistant, sensitive and rebounding subjects, the power is very high (>97%) in the rapid decline setting and is moderately high (>76%) in the slow decline setting. The 2 penalty values α = 2 or 4 induced similar performance in all scenarios. From Table 1, it can be seen that the average tree sizes are close for these 2 α values. Simulation studies are also carried out for the smaller sample sizes of n = 100 and n = 30, and results are summarized in Section 1 of the web-based supplementary material available at Biostatistics online. For n = 100, the proposed method still performs well except for the slow decline settings of scenarios III and IV, where the power is low to detect the resistant mutations. This is due to the small number of

subjects with resistance mutations (e.g., 25 in scenario III). When n = 30 we see some further reduction in the power to detect resistance mutations.

Table 1. Results of simulation studies. Columns: Scenario; Rate of decline; α; Proportion correct (%); Proportion null (%); Proportion noisy (%); Proportion partial (%); Proportion incorrect (%); Mean number of nodes. Rows: scenarios I to IV, each under rapid and slow decline. The table entries are missing from this transcription.

For scenarios II and III, we also study through simulation the power of a parametric approach using the true functional form of the viral load over time. For each node, a GEE model is fitted and a splitting criterion based on the residual sum of squares is used. The algorithm has almost 100% power to identify the correct resistant codons when n = 200 and 100. The power deteriorated substantially only in the slow decline setting of scenario III when n = 30. As discussed in Section 1, viral trajectories can be quite complicated as they reflect a complex interaction of viral dynamics, evolutionary dynamics, intercurrent illnesses, and behavior. These factors induce a high level of heterogeneity in the data, so parametric modeling will be very difficult, and sometimes even impossible.

4. PHASE II EFAVIRENZ STUDIES

Data from 3 phase II clinical trials of efavirenz (DMP , 004 and 005) were analyzed using the proposed methods. Most study participants were previously exposed to nucleoside reverse transcriptase inhibitors (NRTIs), often leading to development of various resistance mutations to NRTIs. The drug under study belongs to the drug class of nonnucleoside reverse transcriptase inhibitors (NNRTIs) and proved to be useful in construction of potent regimens. Because most study participants received efavirenz in combination with 2 NRTIs, baseline NRTI mutations could predict response to treatment. The details of these studies can be found in Bacheler and others (2000).
A total of 156 subjects had viral genetic sequences and viral load available at baseline, and at least one viral load measurement after study entry. The December 2009 version of the International AIDS Society-USA drug resistance mutations list (Johnson and others, 2009) specifies 25 NRTI and NNRTI resistance sites, and 12 of them appeared in the baseline data of these 3 studies. They are all considered as potential predictors for future viral load, along with the baseline viral load dichotomized at 50,000 copies/ml, which is close to the median of 55,115 copies/ml. These predictors are summarized in Table 2.

Table 2. Frequencies of baseline mutations and high viral load. Codon positions and frequency values are missing from this transcription.

Variable    Frequency    95% confidence interval
RT                       (0.101, 0.220)
RT                       (0.040, 0.131)
RT                       (0.031, 0.115)
RT                       (0.080, 0.191)
RT                       (0.000, 0.035)
RT                       (0.000, 0.035)
RT                       (0.002, 0.046)
RT                       (0.000, 0.035)
RT                       (0.278, 0.433)
RT                       (0.022, 0.099)
RT                       (0.145, 0.277)
RT                       (0.040, 0.131)
RNA 50,                  (0.457, 0.618)

We first investigate plasma HIV-1 RNA trajectories in the first 12 weeks after study entry. The U-type score $D(\cdot, \cdot)$ defined in (2) serves as the basis of the first analysis; another analysis makes use of the alternative score $\tilde{D}(\cdot, \cdot)$ defined in (3). In the latter, viral suppression is defined as an HIV-1 RNA level below 400 copies/ml, and rebound as 2 consecutive values above 400 copies/ml or a single one above 4000 copies/ml after an initial suppression. As viral load testing did not always happen on the scheduled day, we allow a 1-week window before and after the scheduled dates. In this way, we include all viral loads measured in the first 13 weeks in the 12-week analysis. In calculation of both scores, viral load measurements of 2 different subjects taken within 7 days of each other were considered to have occurred at the same time. During the period of time under consideration, the median number of viral load measurements was 7, and nearly 95% of subjects had at least 4 viral load measurements. For the alternative score, the numbers of subjects with $V(i) = 0$ (viral suppression), 1 (initial viral suppression followed by rebound), and 2 (no suppression) were 55 (35%), 37 (24%), and 64 (41%), respectively.

To identify the appropriate size of a tree, the resampling method described at the end of Section 2 was used to correct for the overoptimism in $G(T)$, with 50 bootstrap samples. The per-split penalty $\alpha$ is first set to be 2, and the analysis is repeated with $\alpha = 4$.
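The 7-day rule described above could be folded into the pointwise comparison as in the following sketch, where times are measured in days. Treating two measurements within the window as exactly simultaneous is one plausible reading of the rule; the function name and the `window` parameter are illustrative assumptions.

```python
def pointwise_score_windowed(y_i, t_i, y_j, t_j, window=7.0):
    """Pointwise score in which times within `window` days count as simultaneous."""
    if abs(t_i - t_j) <= window:
        t_j = t_i                      # treat the two measurements as taken at the same time
    if (y_i < y_j and t_i <= t_j) or (y_i == y_j and t_i < t_j):
        return 1
    if (y_i > y_j and t_i >= t_j) or (y_i == y_j and t_i > t_j):
        return -1
    return 0


# Day 30 versus day 25: within the window, so the lower value wins.
print(pointwise_score_windowed(2.0, 30.0, 3.0, 25.0))
# With no window, the lower value was reached later, so no judgement is made.
print(pointwise_score_windowed(2.0, 30.0, 3.0, 25.0, window=0.0))
```

Without such a window, small differences in visit dates would turn many otherwise informative comparisons into scores of 0.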
Figure 2 shows the tree constructed from the original scores with α = 2; subjects with better response were placed in the left daughter nodes. The root node was split on baseline viral load, with low levels associated with better response. Among subjects with low baseline viral load, baseline mutation at RT184 predicted worse response, as did mutation at RT210 among those with high viral load. Subjects with low viral load and mutant RT184 were split on RT69, and those among them with wild-type RT69 were further split on RT219. The mean profile of the trajectories in each terminal node is shown in Figure 1 of the web-based supplementary material available at Biostatistics online. The alternative scores gave a very similar tree (not shown), except that it did not contain the split on RT210 for subjects with high baseline viral load. A smaller tree is expected, since the alternative score contains less information than the original, but the 2 types of scores gave a quite consistent picture of the resistance mutations. When α was set to 4, the resulting tree was unchanged for both scores.

Fig. 2. Tree for DMP-266 studies based on viral load (VL) of the first 12 weeks.

When viral load trajectories in the first 16 weeks were studied, setting α = 2 and using the original scores gave a tree (Figure 3) similar to that for response in the first 12 weeks. The tree has one fewer split: that associated with RT69. The mean profile of the trajectories in each terminal node is shown in Figure 2 of the web-based supplementary material available at Biostatistics online. The alternative scores, on the other hand, did not give any split. When α was set to 4, both types of scores gave a tree with no split. We see that the association of baseline viral load and mutations with viral response during the 16-week period was weaker. This may reflect the impact of new resistance mutations arising after study entry.

Fig. 3. Tree for DMP-266 studies based on viral load (VL) of the first 16 weeks.

As suggested by a referee, we also carried out a traditional variable selection procedure using GEE models. All predictors significant at the 0.10 level in single-covariate models were included in a backward elimination process at the 0.05 significance level. Similar models were obtained for the 12-week and the 16-week periods; they show that high baseline viral load and mutations at RT75 and RT101 were associated with poor outcome, while mutation at RT106 was associated with favorable outcome. One cause of the difference between the results of this approach and those of our recursive partitioning model could be the additive versus interaction nature of the methods. Also, the proposed recursive partitioning method gives a predictive model, while GEE is used to find associations between predictors and the outcome.

5. DISCUSSION

We have proposed a recursive partitioning method to correlate baseline variables, which could be of high dimension, with the entire observed trajectory of a longitudinally measured marker. No functional form is assumed for the marker trajectory, and no parametric assumptions are made on the marker distribution. This method is especially useful for markers like the viral load in HIV/AIDS studies, where parametric modeling can be very complicated. The definition of the U-type score is very flexible and can depend on the nature of specific markers and the goals of analysis. We explored different ways of defining the score for markers expected to fall (e.g. viral load) and those expected to rise (e.g. CD4 T-cell count). We can also incorporate qualitative information like suppression and rebound. We showed through simulation that our method works very well for modest sample sizes. The pruning process is able to simultaneously identify signals in the data and exclude noise. The choice of per-split penalty α = 2 or 4 had little impact on the trees, an observation that also applied in the analysis of the DMP-266 data. The choice of α did not affect the resulting tree for the strong signal observed during the first 12 weeks of the study. During that time period, similar trees were generated from the original and alternative scores, despite the very different score definitions.
During the 16-week period, in which the associations appeared to be somewhat weaker, the original scores, containing more information, led to a greater number of splits, as did the choice of α = 2 compared to α = 4.

In the development of the score, we assumed that longitudinal marker measurements are missing completely at random. More general missingness mechanisms will be explored in future research. When viral loads are missing at random, the missingness mechanism will be modeled, and an inverse probability-weighted version of D(i, j) can be defined.

In Section 3, we see that the power of the proposed method is low for some scenarios when n 100. The score comparing 2 viral load observations (Y_1, t_1) and (Y_2, t_2) from 2 different subjects is set to 0 if Y_1 < Y_2 but t_1 > t_2. This score avoids specification of the mean trajectory and parametric assumptions for the longitudinal viral load, but it does not utilize all the information contained in such a pair of observations. Making use of such information, when possible, could improve the power of the proposed method. Further research on this issue will use smoothing techniques in the calculation of scores.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Biostatistics online.

ACKNOWLEDGMENTS

The authors would like to thank Bristol-Myers Squibb Company for providing the DMP-266 data, and 2 reviewers and the associate editor for helpful suggestions and comments. Conflict of Interest: None declared.

FUNDING

National Institutes of Health (AI51164).

REFERENCES

BACHELER, L., ANTON, E., KUDISH, P., BAKER, D., BUNVILLE, J., KRAKOWSKI, K., BOLLING, L., AUJAY, M. X. W., ELLIS, D., BECKER, M. AND OTHERS (2000). Human immunodeficiency virus type 1 mutations selected in patients failing efavirenz combination therapy. Antimicrobial Agents and Chemotherapy 44.

BEERENWINKEL, N., SCHMIDT, B., WALTER, H., KAISER, R., LENGAUER, T., HOFFMANN, D., KORN, K. AND SELBIG, J. (2002). Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proceedings of the National Academy of Sciences of the United States of America 99.

BREIMAN, L., FRIEDMAN, J., OLSHEN, R. AND STONE, C. (1984). Classification and Regression Trees. Boca Raton, FL: Chapman & Hall/CRC Press.

FOULKES, A. S. AND DEGRUTTOLA, V. (2002). Characterizing the relationship between HIV-1 genotype and phenotype: prediction based classification. Biometrics 58.

FOULKES, A. S., DEGRUTTOLA, V. AND HERTOGS, K. (2004). Combining genotype groups and recursive partitioning: an application to human immunodeficiency virus type 1 genetics data. Applied Statistics 53.

JOHNSON, V. A., BRUN-VÉZINET, F., CLOTET, B., GÜNTHARD, H. F., KURITZKES, D. R., PILLAY, D., SCHAPIRO, J. M. AND RICHMAN, D. D. (2009). Update of the drug resistance mutations in HIV-1: December 2009. Topics in HIV Medicine 17.

LEBLANC, M. AND CROWLEY, J. (1993). Survival trees by goodness of split. Journal of the American Statistical Association 88.

LEE, S. K. (2006). On classification and regression trees for multiple responses and its application. Journal of Classification 23.

MAY, S. AND DEGRUTTOLA, V. (2007). Nonparametric tests for dependent observations obtained at varying time points. Biometrics 63.

SEGAL, M. R. (1992). Tree-structured methods for longitudinal data. Journal of the American Statistical Association 87.

SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.

SEVIN, A. D., DE GRUTTOLA, V., NIJHUIS, M., SCHAPIRO, J. M., FOULKES, A. S., PARA, M. F. AND BOUCHER, C. A. (2000). Methods for investigation of the relationship between drug-susceptibility phenotype and human immunodeficiency virus type-1 genotype with applications to AIDS clinical trials group 333. Journal of Infectious Diseases 182.

ZHANG, H. P. (1998). Classification trees for multiple binary responses. Journal of the American Statistical Association 93.

ZHANG, H. P. AND SINGER, B. (1999). Recursive Partitioning in the Health Sciences. New York: Springer.

[Received July 20, 2010; revised April 11, 2011; accepted for publication April 20, 2011]


Chapter 23. Inference About Means. Copyright 2010 Pearson Education, Inc. Chapter 23 Inference About Means Copyright 2010 Pearson Education, Inc. Getting Started Now that we know how to create confidence intervals and test hypotheses about proportions, it d be nice to be able

More information

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5

PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science. Homework 5 PSYCH-GA.2211/NEURL-GA.2201 Fall 2016 Mathematical Tools for Cognitive and Neural Science Homework 5 Due: 21 Dec 2016 (late homeworks penalized 10% per day) See the course web site for submission details.

More information

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia

Nonparametric DIF. Bruno D. Zumbo and Petronilla M. Witarsa University of British Columbia Nonparametric DIF Nonparametric IRT Methodology For Detecting DIF In Moderate-To-Small Scale Measurement: Operating Characteristics And A Comparison With The Mantel Haenszel Bruno D. Zumbo and Petronilla

More information

breast cancer; relative risk; risk factor; standard deviation; strength of association

breast cancer; relative risk; risk factor; standard deviation; strength of association American Journal of Epidemiology The Author 2015. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail:

More information

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES

MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES OBJECTIVES 24 MULTIPLE LINEAR REGRESSION 24.1 INTRODUCTION AND OBJECTIVES In the previous chapter, simple linear regression was used when you have one independent variable and one dependent variable. This chapter

More information

Individualized Treatment Effects Using a Non-parametric Bayesian Approach

Individualized Treatment Effects Using a Non-parametric Bayesian Approach Individualized Treatment Effects Using a Non-parametric Bayesian Approach Ravi Varadhan Nicholas C. Henderson Division of Biostatistics & Bioinformatics Department of Oncology Johns Hopkins University

More information

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES

11/18/2013. Correlational Research. Correlational Designs. Why Use a Correlational Design? CORRELATIONAL RESEARCH STUDIES Correlational Research Correlational Designs Correlational research is used to describe the relationship between two or more naturally occurring variables. Is age related to political conservativism? Are

More information

Studying the effect of change on change : a different viewpoint

Studying the effect of change on change : a different viewpoint Studying the effect of change on change : a different viewpoint Eyal Shahar Professor, Division of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona

More information

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj

Statistical Techniques. Masoud Mansoury and Anas Abulfaraj Statistical Techniques Masoud Mansoury and Anas Abulfaraj What is Statistics? https://www.youtube.com/watch?v=lmmzj7599pw The definition of Statistics The practice or science of collecting and analyzing

More information

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY

A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY A COMPARISON OF IMPUTATION METHODS FOR MISSING DATA IN A MULTI-CENTER RANDOMIZED CLINICAL TRIAL: THE IMPACT STUDY Lingqi Tang 1, Thomas R. Belin 2, and Juwon Song 2 1 Center for Health Services Research,

More information

Epidemiological Model of HIV/AIDS with Demographic Consequences

Epidemiological Model of HIV/AIDS with Demographic Consequences Advances in Applied Mathematical Biosciences. ISSN 2248-9983 Volume 5, Number 1 (2014), pp. 65-74 International Research Publication House http://www.irphouse.com Epidemiological Model of HIV/AIDS with

More information

Selection and estimation in exploratory subgroup analyses a proposal

Selection and estimation in exploratory subgroup analyses a proposal Selection and estimation in exploratory subgroup analyses a proposal Gerd Rosenkranz, Novartis Pharma AG, Basel, Switzerland EMA Workshop, London, 07-Nov-2014 Purpose of this presentation Proposal for

More information

Clinical Trials A Practical Guide to Design, Analysis, and Reporting

Clinical Trials A Practical Guide to Design, Analysis, and Reporting Clinical Trials A Practical Guide to Design, Analysis, and Reporting Duolao Wang, PhD Ameet Bakhai, MBBS, MRCP Statistician Cardiologist Clinical Trials A Practical Guide to Design, Analysis, and Reporting

More information

Perspective Resistance and Replication Capacity Assays: Clinical Utility and Interpretation

Perspective Resistance and Replication Capacity Assays: Clinical Utility and Interpretation Perspective Resistance and Replication Capacity Assays: Clinical Utility and Interpretation Resistance testing has emerged as an important tool for antiretroviral management. Research continues to refine

More information

Anumber of clinical trials have demonstrated

Anumber of clinical trials have demonstrated IMPROVING THE UTILITY OF PHENOTYPE RESISTANCE ASSAYS: NEW CUT-POINTS AND INTERPRETATION * Richard Haubrich, MD ABSTRACT The interpretation of a phenotype assay is determined by the cut-point, which defines

More information

Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials

Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials 2007 FDA/Industry Statistics Workshop Marie Davidian Department of Statistics North Carolina State

More information

Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials

Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials Mathematical-Statistical Modeling to Inform the Design of HIV Treatment Strategies and Clinical Trials Marie Davidian and H.T. Banks North Carolina State University Eric S. Rosenberg Massachusetts General

More information

Reliability, validity, and all that jazz

Reliability, validity, and all that jazz Reliability, validity, and all that jazz Dylan Wiliam King s College London Published in Education 3-13, 29 (3) pp. 17-21 (2001) Introduction No measuring instrument is perfect. If we use a thermometer

More information

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition

List of Figures. List of Tables. Preface to the Second Edition. Preface to the First Edition List of Figures List of Tables Preface to the Second Edition Preface to the First Edition xv xxv xxix xxxi 1 What Is R? 1 1.1 Introduction to R................................ 1 1.2 Downloading and Installing

More information

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data

Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Adjusting for mode of administration effect in surveys using mailed questionnaire and telephone interview data Karl Bang Christensen National Institute of Occupational Health, Denmark Helene Feveille National

More information

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation Institute for Clinical Evaluative Sciences From the SelectedWorks of Peter Austin 2012 Using Ensemble-Based Methods for Directly Estimating Causal Effects: An Investigation of Tree-Based G-Computation

More information

Student Performance Q&A:

Student Performance Q&A: Student Performance Q&A: 2009 AP Statistics Free-Response Questions The following comments on the 2009 free-response questions for AP Statistics were written by the Chief Reader, Christine Franklin of

More information

A Memory Model for Decision Processes in Pigeons

A Memory Model for Decision Processes in Pigeons From M. L. Commons, R.J. Herrnstein, & A.R. Wagner (Eds.). 1983. Quantitative Analyses of Behavior: Discrimination Processes. Cambridge, MA: Ballinger (Vol. IV, Chapter 1, pages 3-19). A Memory Model for

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA

CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA Data Analysis: Describing Data CHAPTER 3 DATA ANALYSIS: DESCRIBING DATA In the analysis process, the researcher tries to evaluate the data collected both from written documents and from other sources such

More information

MEA DISCUSSION PAPERS

MEA DISCUSSION PAPERS Inference Problems under a Special Form of Heteroskedasticity Helmut Farbmacher, Heinrich Kögel 03-2015 MEA DISCUSSION PAPERS mea Amalienstr. 33_D-80799 Munich_Phone+49 89 38602-355_Fax +49 89 38602-390_www.mea.mpisoc.mpg.de

More information

PO Box 19015, Arlington, TX {ramirez, 5323 Harry Hines Boulevard, Dallas, TX

PO Box 19015, Arlington, TX {ramirez, 5323 Harry Hines Boulevard, Dallas, TX From: Proceedings of the Eleventh International FLAIRS Conference. Copyright 1998, AAAI (www.aaai.org). All rights reserved. A Sequence Building Approach to Pattern Discovery in Medical Data Jorge C. G.

More information

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T.

Introduction. We can make a prediction about Y i based on X i by setting a threshold value T, and predicting Y i = 1 when X i > T. Diagnostic Tests 1 Introduction Suppose we have a quantitative measurement X i on experimental or observed units i = 1,..., n, and a characteristic Y i = 0 or Y i = 1 (e.g. case/control status). The measurement

More information

Recursive Partitioning Method on Survival Outcomes for Personalized Medicine

Recursive Partitioning Method on Survival Outcomes for Personalized Medicine Recursive Partitioning Method on Survival Outcomes for Personalized Medicine Wei Xu, Ph.D Dalla Lana School of Public Health, University of Toronto Princess Margaret Cancer Centre 2nd International Conference

More information

Fixed Effect Combining

Fixed Effect Combining Meta-Analysis Workshop (part 2) Michael LaValley December 12 th 2014 Villanova University Fixed Effect Combining Each study i provides an effect size estimate d i of the population value For the inverse

More information

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects

Method Comparison for Interrater Reliability of an Image Processing Technique in Epilepsy Subjects 22nd International Congress on Modelling and Simulation, Hobart, Tasmania, Australia, 3 to 8 December 2017 mssanz.org.au/modsim2017 Method Comparison for Interrater Reliability of an Image Processing Technique

More information