Latent Variable Models in Diagnostic Medicine

KU Leuven
Biomedical Sciences Group
Faculty of Medicine
Department of Public Health and Primary Care
L-Biostat

Latent Variable Models in Diagnostic Medicine with Applications to Visceral Leishmaniasis Research

Joris MENTEN

Jury:
Promoter: Prof. Dr. E. Lesaffre
Co-promoter: Prof. Dr. M. Boelaert
Chair: Prof. Dr. G. Verbeke
Jury members: Prof. Dr. C. Matheï, Prof. Dr. D. Berkvens, Prof. Dr. P. Lemey, Prof. Dr. N. Speybroeck, Prof. Dr. F. Tuerlinckx, Prof. Dr. A.H. Zwinderman

Dissertation presented in partial fulfilment of the requirements for the degree of Doctor in Biomedical Sciences

March 2015

This doctoral thesis was prepared in collaboration with the Institute of Tropical Medicine, Antwerp and the Interuniversity Biostatistics and Statistical Bioinformatics Centre, I-Biostat.

2015. Groep Biomedische Wetenschappen, Campus Gasthuisberg O&N2, Herestraat 49, bus 700, 3000 Leuven, Belgium.

All rights reserved. No part of the publication may be reproduced or made public in any form, by print, photoprint, microfilm, electronic or any other means, without prior written permission from the publisher.

List of Publications

This thesis corresponds to a collection of the following original publications.

Chapter 2: Menten, J., Boelaert, M. and Lesaffre, E. (2008). Bayesian latent class models with conditionally dependent diagnostic tests: A case study. Statistics in Medicine, 27(22), 4469-4488. doi: 10.1002/sim.3317.

Chapter 3: Menten, J., Boelaert, M. and Lesaffre, E. (2012). An application of Bayesian growth mixture modeling to estimate infection incidences from repeated serological tests. Statistical Modelling, 12(6), 551-578. doi: 10.1177/1471082X12465797.

Chapter 4: Menten, J., Boelaert, M. and Lesaffre, E. (2013). Bayesian meta-analysis of diagnostic tests allowing for imperfect reference standards. Statistics in Medicine, 32(30), 5398-5413. doi: 10.1002/sim.5959.

Chapter 5: Menten, J. and Lesaffre, E. (2015). Bayesian meta-analysis of comparative diagnostic test accuracy studies. Submitted.

List of Abbreviations

C        Specificity
CI       Credible Interval
CRS      Combined Reference Standard
D        Disease Status
DAT      Direct Agglutination Test
DIC      Deviance Information Criterion
DOR      Diagnostic Odds Ratio
DTA      Diagnostic Test Accuracy
ELISA    Enzyme-linked Immunosorbent Assay
HMM      Hidden Markov Model
ITN      Insecticide Treated Net
KA       Kala-Azar
LCA      Latent Class Analysis
LCM      Latent Class Model
MCMC     Markov Chain Monte Carlo
OD       Optical Density
OR       Odds-Ratio
PCR      Polymerase Chain Reaction
RCT      Randomized Controlled Trial
RDT      Rapid Diagnostic Test
rdor     Relative Diagnostic Odds Ratio
ROC      Receiver Operating Characteristic (Curve)
RR       Relative Risk
S        Sensitivity
T        Test Result
VL       Visceral Leishmaniasis

Table of Contents

List of Publications
List of Abbreviations
Table of Contents

Part I: Introduction

Chapter 1: Introduction and Background Material
    1.1 General Introduction
    1.2 Visceral Leishmaniasis
    1.3 The Problem of Imperfect Reference Tests
        1.3.1 Notation
        1.3.2 Imperfect reference tests
    1.4 Latent Class Analysis
        1.4.1 The Basic LCA Model
        1.4.2 Conditional Dependence
    1.5 Meta-Analysis of Diagnostic Accuracy Studies
    1.6 Bayesian Model Estimation
        1.6.1 The Basis of Bayesian Statistics
        1.6.2 Model Estimation and Software
        1.6.3 Use of Bayesian Statistics in This Thesis
    1.7 Aims of the Thesis

Part II: Bayesian Latent Class Models Applied to Visceral Leishmaniasis Research

Chapter 2: Bayesian Latent Class Models with Conditionally Dependent Diagnostic Tests: a Case Study
    2.1 Introduction
    2.2 Study Design
    2.3 Fixed and Random Effects LCMs
        2.3.1 Conditional Independence Model
        2.3.2 Fixed Effects LCMs
        2.3.3 Random Effects LCMs
        2.3.4 Comparison of Fixed and Random Effects LCMs and Alternative Approaches
    2.4 Model Identifiability and Selection
    2.5 Application to the VL Data
        2.5.1 Study Design and Data Structure
        2.5.2 Model Descriptions
        2.5.3 Results
    2.6 Discussion
    Appendices
    References

Chapter 3: An Application of Bayesian Growth Mixture Modeling to Estimate Infection Incidences from Repeated Serological Tests
    3.1 Introduction
    3.2 Motivating Example
        3.2.1 Study Description
        3.2.2 Description of the Study Data
    3.3 The Model
        3.3.1 Overview
        3.3.2 Response Model for the Observed Data
        3.3.3 Structural Model for the Unobserved Infection Status
        3.3.4 Model Estimation and Priors
    3.4 Application to the Kalanet Data
        3.4.1 Response Model
        3.4.2 Structural Model
        3.4.3 Model Assessment
    3.5 Simulation Study
        3.5.1 Setup
        3.5.2 Results
    3.6 Discussion
    References

Chapter 4: Bayesian Meta-analysis of Diagnostic Tests Allowing for Imperfect Reference Standards
    4.1 Introduction
    4.2 Motivating Example
        4.2.1 Introduction
        4.2.2 Study Data
    4.3 The Bivariate Model
        4.3.1 Using Data from Studies Using a Perfect Reference Standard
        4.3.2 Latent Class Analysis with Informative Prior Distributions for Reference Test Accuracy
        4.3.3 Using Plug-in Estimates from Primary Studies that Are Based on Latent Class Analysis
        4.3.4 Combining Data from Different Sources in the Bivariate Model
    4.4 Simulation Study
        4.4.1 Setup
        4.4.2 Results
        4.4.3 Conclusions
    4.5 Application
        4.5.1 Study Description
        4.5.2 Results
    4.6 Discussion
    References

Chapter 5: Bayesian Meta-analysis of Comparative Diagnostic Test Accuracy Studies
    5.1 Background
    5.2 Methods
        5.2.1 Measures of Relative Value of Diagnostic Tests
        5.2.2 Models for the Comparative Meta-Analysis of Diagnostic Tests
        5.2.3 Model Estimation and Prior Specification
        5.2.4 Simulation Study
        5.2.5 Real Data Example
    5.3 Results
        5.3.1 Measures of Relative Value of Diagnostic Tests
        5.3.2 Model Performance: Simulation Study
        5.3.3 Real Data Example: Diagnostic Tests for Visceral Leishmaniasis
    5.4 Discussion
    5.5 Conclusions
    References
    Figures

Part III: Concluding Remarks

Chapter 6: General Conclusions and Further Research
    6.1 General Conclusions
    6.2 The Utility of Latent Class Models
    6.3 Further Research
    References

Part IV: Supplementary Material

Appendix A: Supplementary Material for Chapter 4
    A.1 Software Code
    A.2 Additional Figures
    A.3 Summary of studies included in the motivating example
    A.4 Priors used in the simulation and motivating example analysis
    A.5 Expert opinion of 7 experts on the diagnostic accuracy of the two reference standards used in the VL study

Appendix B: Supplementary Material for Chapter 5
    B.1 Simulation Study for the Selection of Appropriate Statistics for a Comparative DTA Review
    B.2 Software Code
    B.3 Simulation Study of the Modeling Approach
    B.4 Visceral Leishmaniasis Data

Professional Career
Abstract
Samenvatting (summary in Dutch)

Part I: Introduction

Chapter 1: Introduction and Background Material

1.1 General Introduction

An accurate diagnosis is the first step to an effective treatment of a patient [1]. Diagnosis of a disease can sometimes be made on the basis of clinical signs and symptoms, but accurate diagnosis often requires the use of diagnostic tests. To be appropriate for use in a specific setting, a test must meet a number of criteria. It must be easy to use in the intended setting, inexpensive, rapid, and stable under varying conditions of storage, transport and use, but above all it must be accurate. The evaluation of the accuracy of diagnostic tests is thus crucial [2]. This evaluation must be done in the setting for which the test is intended: on a relatively large sample of clinically suspected patients, by the health care workers who will later use the test in practice, possibly in difficult circumstances.

The diagnostic accuracy of a test is its ability to discriminate between patients who have and do not have the target disease. The accuracy has two dimensions: how well the test identifies diseased subjects, and how well it rules out non-diseased subjects. Even though other measures of diagnostic accuracy exist [1, 3], the sensitivity and the specificity are the most commonly used. The sensitivity is the proportion of diseased subjects that show a positive test result. The specificity is the proportion of non-diseased subjects that show a negative test result.

The work in this thesis is inspired by an application in visceral leishmaniasis (VL). VL, also known as Kala-Azar, is a deadly disease caused by the protozoan parasites Leishmania donovani and L. infantum and is transmitted by sandflies. It occurs mainly in rural areas of Eastern Africa, Southern Asia, and Latin America. There are 200,000 to 400,000 new VL cases and 20,000 to 40,000 VL deaths each year [4].

VL patients present with general signs and symptoms of persistent systemic infection, such as fatigue, weakness, fever, and weight loss. The disease manifests itself through enlarged lymph nodes, spleen and liver. Symptoms may persist for weeks or months and, if appropriate medical care is not provided, patients ultimately die from bacterial co-infections, massive bleeding or severe anaemia. Many patients do not receive proper medical attention, as VL diagnosis and treatment are often only available in tertiary health centers. To reach more patients at the primary-care level, safe and effective drugs as well as simple, robust diagnostic tests are needed [5].

Diagnosing VL can be difficult because several other causes of febrile splenomegaly, notably malaria [6], exist in the endemic areas of the disease. Up until the 1990s, diagnosis of VL was, and in many areas still is, based on the microscopical examination of parasites in tissue samples or cultures [7]. False positive results, where a non-diseased subject is diagnosed with the disease through microscopy techniques, are expected to occur rarely. However, many diseased subjects fail to show parasites in the samples taken. The likelihood of these false negative test results depends on the parasite load of the patient and on the site from which the sample is taken. When samples are taken from the spleen, only approximately 5% of patients show false negative results. When samples are taken from bone marrow or lymph nodes this proportion may be much higher, up to 40% for bone marrow and 50% for lymph nodes [8]. Due to the risks inherent in spleen aspiration, it is contraindicated in patients with severe anaemia or bleeding tendency and in restless children. After the procedure the patient has to be observed in a facility where blood transfusion and surgery are available [2]. In endemic settings the resources needed to support tissue diagnosis (skilled technicians, good smears, proper stains, appropriately maintained and working microscopes) are often unavailable [6].

Consequently, one of the priorities of VL research has been the development of a simple, highly sensitive, specific, reliable and affordable diagnostic test for VL that can be used in first-line health services in endemic countries. Over the last decades, a number of noninvasive serological tests, commonly referred to as rapid diagnostic tests (RDTs), have been developed and commercialized for VL [7]. Some tests have been developed into an immunochromatographic test format, either as a strip test or as a cassette. These tests are easy to perform, can be stored at ambient temperature, and can be carried to remote areas.
A village health worker can be trained in a few hours, allowing early detection of the disease.

The performance of these new tests needs to be evaluated in endemic regions on clinically suspected patients, a design called a phase IV design [1, 3]. Case-control studies based on (possibly stored) samples of known positive patients identified by clinical cultures and non-diseased or non-endemic control subjects can lead to overestimation of the diagnostic accuracy of the test under consideration [9]. Studies of clinical suspects recruited at referral centers rather than in primary health care may also give biased results, due to differences in patient population and the presence of better trained staff and more sophisticated equipment. Another source of bias is the exclusion of subjects with unclear disease manifestations or uncertain diagnosis, which amounts to cherry-picking the most readily diagnosed patients [9].

However, primary health care centers that will use the RDTs are not well equipped to perform the gold standard reference tests needed to diagnose VL accurately. In particular, spleen aspiration is often not performed, or only on a minority of patients. Many studies assessing the diagnostic accuracy of a novel RDT for VL do so by comparing the RDT results with those of microscopical examination or cultures of bone marrow or lymph nodes. In others an attempt is made to perform spleen aspiration, but in practice this may be done only on a minority of clinical suspects due to the contra-indications of spleen aspiration [10].

We will discuss the technical details in Section 1.3, but it can be easily understood that the use of these microscopical techniques as reference test may lead to bias when assessing the diagnostic accuracy of a new RDT. Imagine that an investigator wishes to assess the diagnostic accuracy of an RDT, the index test, for VL using as reference test the microscopical examination of bone marrow samples.
She considers those who show parasites in the bone marrow sample to be VL cases and those who do not to be controls. However, of 100 true VL cases, only 60 may be detected by this reference test while 40 may show false negative results. We show this situation in Table 1.1. So, unbeknownst to the investigator, among the 140 subjects she considers controls, there are 40 VL cases. The RDT may identify a number of these patients correctly as VL cases. However, in her analyses the investigator will count these subjects as false positive results for the RDT, resulting in an underestimation of the specificity, and thus of the diagnostic accuracy, of the test.

                          True Disease Status
Reference Test Result     Not-Diseased    Diseased
Negative                  100             40
Positive                  0               60
Total                     100             100

Table 1.1: Hypothetical results of an imperfect reference test with false negative results.

Investigators may wish to counteract these effects by constructing a composite reference test, combining the results of two reference tests, to improve the sensitivity of the reference test [11]. In VL, studies that assess the performance of a dipstick-based RDT often combine parasitology with a laboratory-based serological test, the direct agglutination test (DAT). Cases are then defined as subjects that either show Leishmania parasites in a tissue sample or show a positive result on the DAT. Controls are subjects that show negative results on both tests. However, this combined reference test is not guaranteed to be 100% sensitive, i.e. it may still give false negative results, and it may also give false positive results due to cross-reaction of the DAT with other parasitic diseases or due to prior or asymptomatic VL infection (Table 1.2). Moreover, if DAT and the RDT are based on a similar immunological response, the false positive results of DAT and RDT may be correlated. Again, this may result in bias when estimating the diagnostic accuracy of the RDT. In this case, however, the direction of the bias is less clear, and under- as well as over-estimation of the RDT's diagnostic accuracy may occur.

                          True Disease Status
Reference Test Result     Not-Diseased    Diseased
Negative                  90              20
Positive                  10              80
Total                     100             100

Table 1.2: Hypothetical results of an imperfect reference test with false negative and false positive results.

The imperfect reference test bias is a well known problem in diagnostic test research and a number of solutions have been proposed. Reitsma et al. [12] classify the approaches in 4 categories: (1) methods that impute missing data on the reference standard, (2) methods that correct estimates of accuracy obtained with an imperfect reference standard, (3) methods that construct a reference standard by combining multiple test results through a predefined rule or through statistical modeling, (4) methods that relate index test results to relevant clinical data, such as history, future clinical events, and response to therapy. In this thesis, we will study in depth the use of Latent Class Analysis (LCA) to address this problem. LCA is classified by Reitsma et al. [12] in the third category.

LCA was formally developed in the mid-twentieth century by Lazarsfeld [13] but has its origins in ideas which were developed much earlier. In 1884, Peirce already suggested that the structure of a two-by-two contingency table could better be understood by considering it as constructed from a mixture of two different sets of results [14, 15]. Lazarsfeld developed latent structures to understand the attitudes of American soldiers to the Army [16, 17]. By assuming the population consists of two types of soldiers, one set with high morale and another set with low morale, Lazarsfeld and his colleagues analyzed a set of questions posed to 1000 draftees to the American army during World War II. During the second half of the twentieth century, LCA was mainly used within the social sciences. Only near the end of the twentieth century did LCA start to be applied to the problem of diagnostic test validation without a gold standard [18]. Since then, LCA for diagnostic tests has been widely studied in the statistical literature. Recently, these methods have been increasingly applied in diagnostic test accuracy studies [19].

In LCA, a statistical model is used for combining multiple test results to estimate the diagnostic accuracy of each of the tests under consideration. Even though it has been described as a black-box approach, the basic premise of LCA can easily be understood. Imagine that three diagnostic tests are performed on each patient. In our motivating example of VL diagnosis, these may for example be the DAT ($T_1$), a dipstick RDT ($T_2$), and microscopical examination of a bone marrow sample ($T_3$). For each patient we can describe the results of the three tests as a pattern. For example, a patient may show positive results on $T_1$ and $T_2$, and a negative result on $T_3$. We write this pattern as + + -.
We can count the total number of subjects with each test pattern and summarize this in a table (Table 1.3). In our example, there are 17 patients with the pattern + + -.

$T_1$    $T_2$    $T_3$    Observed Frequency
+        +        +        55
+        +        -        17
+        -        +        17
+        -        -        6
-        +        +        8
-        +        -        19
-        -        +        2
-        -        -        166

Table 1.3: Observed VL diagnostic test results for DAT ($T_1$), dipstick RDT ($T_2$) and microscopical examination of a bone marrow sample ($T_3$) obtained in Sudan (see Chapter 2; simplified version of Table 2.I).

Intuitively, we may suspect that subjects with all tests positive (pattern + + +) are very likely to be diseased, while those with all tests negative (pattern - - -) are likely to be not diseased. For a subject with an intermediate pattern, for example - - +, the situation is less clear. The patient may show a false positive result on $T_3$, or false negative results on $T_1$ and $T_2$. The probability that a patient with such a pattern of outcomes is diseased will depend on the diagnostic accuracy of the three tests. If we knew that $T_3$ is highly specific, then it would be less likely that the result for $T_3$ is false positive. Similarly, if we knew that $T_1$ and $T_2$ lack sensitivity, the probability of false negative results on $T_1$ and $T_2$ would be increased. In fact, if we knew the sensitivity and specificity of the 3 tests, the prevalence of the disease in the study sample, and the covariation of errors of the 3 tests, we would be able to predict how many diseased and non-diseased subjects would show each of the possible outcome patterns.

However, the reason to perform the diagnostic study is to estimate the disease prevalence and the sensitivity and specificity of the tests. Consequently the situation is the inverse. We know the distribution of test subjects over the different outcome patterns: these are the data obtained in the study. From these data, we want to estimate the diagnostic accuracy of the tests. By using the observed frequency of the outcome patterns as data and a statistical model to describe the relationship between the unknown disease status of the study subjects and their test results, we can, under well specified circumstances, estimate the disease prevalence and diagnostic accuracy of the tests under study. This is the premise of LCA, which we develop further in Section 1.4. In Chapter 2 we describe a case study on the estimation and interpretation of latent class models in VL diagnostics.

1.2 Visceral Leishmaniasis

Throughout this thesis, we use the diagnosis of VL and Leishmania infection as motivation and illustration for our statistical model development. Leishmania is a genus of protozoan parasites of which more than 20 species afflict humans. It is related to other pathogenic parasites such as Trypanosoma, the causative agent of sleeping sickness and Chagas disease. Leishmania parasites are transmitted to humans through the bite of infected sandflies (Phlebotominae). Transmission of Leishmania parasites to humans can either be of the form human-sandfly-human, i.e. anthroponotic, or the disease can be transmitted from wild or domestic animals to humans via sandflies, i.e. zoonotic [20, 21].

Leishmania infection may cause different disease manifestations and remains a major public health problem [22]. In this thesis we focus on the most severe: VL. This disease form affects the vital organs of afflicted people and is fatal if left untreated. The main symptoms are fever, weight loss, anaemia and enlargement of the spleen (splenomegaly) and liver (hepatomegaly). Other clinical manifestations of Leishmania infection are cutaneous and mucosal leishmaniasis. Cutaneous leishmaniasis is the most common form of the disease. Its symptoms include rash and ulcers on exposed parts of the body. It may occur as part of primary Leishmania infection, or after recovery from VL. In the latter case it is labeled post-kala-azar dermal leishmaniasis (PKDL). Mucosal leishmaniasis affects mucous membranes in the nose, mouth and throat and can cause facial disfigurement [20]. Both VL and PKDL are most commonly caused by L. donovani, and to a lesser extent by L. infantum. Cutaneous and mucocutaneous leishmaniasis are more commonly caused by other species of the genus Leishmania.

VL is endemic in Eastern Africa, Southern Asia, and Latin America. It afflicts 200,000 to 400,000 people yearly, resulting in 20,000 to 40,000 deaths each year. The disease primarily affects poor households in small clusters in remote rural areas. Epidemiologically, disease burden has been linked with poor housing and unhealthy living habitats [23]. In turn, Leishmania-related diseases can have significant effects on the financial situation of those afflicted, resulting in a vicious circle of poverty, malnutrition and ill health [22]. For adequate control of the disease, field-applicable techniques for diagnosis, treatment, and prevention are urgently needed [20].
Leishmania infection and VL are treatable conditions; however, the most effective treatments can be difficult to administer and can have toxic effects. Prevention and control of the disease can have different aspects: case finding and treatment, control of sandflies, and control of the animal reservoir in areas where the disease is a zoonosis [22]. Vector control usually consists of indoor spraying with residual insecticides, i.e. insecticides which remain effective for some length of time at the application site. Individual protection with insecticide-impregnated bednets can also be used. However, the effectiveness of these strategies has not been adequately documented [24, 25].

Both treatment and control of leishmaniasis require adequate diagnostic and screening tools. Diagnosis of leishmaniasis can be done by demonstration of the parasite in tissue samples through microscopical techniques. This requires adequate samples and trained technicians. The success rate, i.e. the sensitivity, depends on the parasite load in the patient and the site from which tissue samples have been taken. In the case of VL, the sensitivity varies from 95% for spleen samples to 60% for bone marrow samples and 50% for lymph nodes [8]. The specificity of these parasitological techniques, if performed by trained technicians, can be close to 100%, but may be lower if laboratory staff are not adequately trained to distinguish Leishmania parasites from similar organisms (for example Histoplasma species) [20]. Novel techniques based on PCR or immunodiagnostic techniques have the potential to show better diagnostic accuracy or to be more suited to field conditions. Further assessment of the diagnostic performance and applicability to field conditions of these novel tests is needed [20]. The need to correctly estimate the diagnostic accuracy of these novel tests, which may be more accurate than the available reference standards, is the motivation for the work in this thesis.

1.3 The Problem of Imperfect Reference Tests

In this section, we formalize some of the concepts introduced in Section 1.1.

1.3.1 Notation

Diagnostic tests aim to determine the unknown disease status of a patient or test subject. In the first instance, we presume the disease status to be dichotomous: a patient either has the disease of interest or she has not. The true disease status is denoted by the binary variable $D$, with

\[
D = \begin{cases} 1 & \text{for diseased;} \\ 0 & \text{for not-diseased;} \end{cases}
\]

and disease prevalence $P(D = 1) = \pi$. The variable $T$ indicates the result of the diagnostic test, with a positive test result being indicative of disease:

\[
T = \begin{cases} 1 & \text{positive;} \\ 0 & \text{negative.} \end{cases}
\]

We limit the discussion to the situation in which a binary diagnostic test is used to diagnose a binary disease status, but the methods can be extended to multicategorical outcomes and multicategorical or continuous test results. Many diagnostic tests offer continuous or multicategorical outcomes which are subsequently dichotomized using a threshold value. However, it is equally possible to analyze the continuous test results directly using mixture modeling [26, 27]. A case study of this approach is given in Chapter 3, where we estimate the effect of distributing insecticide-impregnated bednets to prevent Leishmania infection through mixture modeling of the original continuous data. The use of multicategorical disease outcomes is less common, but several disease states may occur. For example, in infectious diseases subjects may be not infected, asymptomatically infected or symptomatic. In differential diagnosis, the aim is to pick the most appropriate diagnosis among several possible diseases.

When combining a binary disease status and a binary test result, four possible outcomes emerge (Table 1.4). A subject is correctly classified by the diagnostic test if the test result is negative and the subject is truly not diseased (true negative, a) or if the test result is positive while the subject is truly diseased (true positive, d). A subject is incorrectly classified by the diagnostic test if the test result is negative and the subject is in reality diseased (false negative, b) or if the test result is positive and the subject is in reality not diseased (false positive, c).

                 True Disease Status
Test Result      Not-Diseased    Diseased
Negative         a               b
Positive         c               d

Table 1.4: Classification of test results by disease status.

We define the probability of correct classification of diseased (sensitivity) or non-diseased (specificity) subjects as follows: the sensitivity $S = P(T = 1 \mid D = 1)$, estimated as d/(b + d); the specificity $C = P(T = 0 \mid D = 0)$, estimated as a/(a + c). Other summaries of the diagnostic accuracy are available. These include the probability of misclassification, positive and negative predictive values, diagnostic likelihood ratios and the diagnostic odds-ratio. All measures are interrelated and can be calculated from the test sensitivity, specificity and prevalence of the disease [1, 3]. In this thesis, we limit the discussion mostly to the joint estimation of the sensitivity-specificity pair $\{S, C\}$. In Chapters 4 and 5, we will also discuss the diagnostic odds-ratio, which summarizes the accuracy of a test in a single value:

\[
\mathrm{DOR} = \frac{S \, C}{(1 - S)(1 - C)},
\]

estimated as (a d)/(b c) [28].
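
As a small illustration of these definitions, the following Python sketch computes the sensitivity, specificity and diagnostic odds-ratio from the four cell counts of Table 1.4. The function name `accuracy_from_counts` and the example counts are illustrative assumptions and not part of the original analysis; the counts reuse the hypothetical numbers of Table 1.2.

```python
def accuracy_from_counts(a, b, c, d):
    """Estimate sensitivity, specificity and DOR from the cells of Table 1.4.

    a: true negatives, b: false negatives,
    c: false positives, d: true positives.
    """
    sensitivity = d / (b + d)  # proportion of diseased subjects testing positive
    specificity = a / (a + c)  # proportion of non-diseased subjects testing negative
    # The estimate (a*d)/(b*c) is undefined when b or c is zero;
    # Python raises ZeroDivisionError in that case.
    dor = (a * d) / (b * c)
    return sensitivity, specificity, dor


# Hypothetical counts (assumption): 100 non-diseased and 100 diseased subjects.
sens, spec, dor = accuracy_from_counts(a=90, b=20, c=10, d=80)
print(f"S = {sens:.2f}, C = {spec:.2f}, DOR = {dor:.1f}")
```

With these counts the sketch gives S = 0.80, C = 0.90 and DOR = 36, which agrees with (S C)/[(1 - S)(1 - C)] = 0.72/0.02.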

1.3.2 Imperfect reference tests

The diagnostic accuracy of a new test is usually calculated by comparing the results of the new test (index test) with those of a reference test. Assuming this reference test accurately reflects the underlying disease status of the study subjects, we can replace the true disease status in Table 1.4 by the reference test result and subsequently estimate S and C as shown above. However, in many instances the reference test used is imperfect, and the results are more accurately described by Table 1.5. In this case, the true S of the index test is (c2 + d2)/(a2 + b2 + c2 + d2), while it is estimated, through comparison with the reference test, as (d1 + d2)/(b1 + b2 + d1 + d2). Similarly, the true C of the index test is (a1 + b1)/(a1 + b1 + c1 + d1), while it is estimated, through comparison with the reference test, as (a1 + a2)/(a1 + a2 + c1 + c2). The resulting bias tends to underestimate the diagnostic accuracy of the index test. Only if the errors of the index and reference tests are highly correlated will the diagnostic accuracy of the index test be overestimated [3].
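
To make the difference between the true and the apparent accuracy concrete, the sketch below evaluates the four expressions above on a hypothetical set of cell counts labeled as in Table 1.5. The counts are an assumption chosen to mirror the scenario of Table 1.1 (a reference test with 60% sensitivity and 100% specificity in 100 diseased and 100 non-diseased subjects), combined with an assumed index test with S = 0.90 and C = 0.95 whose errors are independent of those of the reference test. The function name is likewise illustrative.

```python
def true_and_apparent_accuracy(a1, a2, b1, b2, c1, c2, d1, d2):
    """Cell labels follow Table 1.5.

    a, b: index test negative; c, d: index test positive.
    a, c: reference test negative; b, d: reference test positive.
    Suffix 1: not diseased; suffix 2: diseased.
    """
    true_s = (c2 + d2) / (a2 + b2 + c2 + d2)
    true_c = (a1 + b1) / (a1 + b1 + c1 + d1)
    apparent_s = (d1 + d2) / (b1 + b2 + d1 + d2)
    apparent_c = (a1 + a2) / (a1 + a2 + c1 + c2)
    return true_s, true_c, apparent_s, apparent_c


# Hypothetical counts (assumption): the reference test misses 40 of 100 cases,
# of which the index test detects 36 (c2) and misses 4 (a2).
true_s, true_c, apparent_s, apparent_c = true_and_apparent_accuracy(
    a1=95, a2=4, b1=0, b2=6, c1=5, c2=36, d1=0, d2=54
)
print(f"true S = {true_s:.3f}, apparent S = {apparent_s:.3f}")
print(f"true C = {true_c:.3f}, apparent C = {apparent_c:.3f}")
```

In this example the apparent sensitivity equals the true value (0.90), but the apparent specificity drops to 99/140, about 0.71, against a true value of 0.95, because 36 of the 40 cases missed by the reference test are counted as false positives of the index test.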

                        Reference Test Negative          Reference Test Positive
                        Not-Diseased    Diseased         Not-Diseased    Diseased
Index Test Negative     a1              a2               b1              b2
Index Test Positive     c1              c2               d1              d2

Table 1.5: Classification of index and imperfect reference test results by disease status.

To gain some insight into the bias resulting from the imperfect reference test, we can first study the situation in which the errors of the index and reference test are not correlated. This situation is called conditional independence, meaning that test results are independent given the disease status. If the results of index and reference tests are independent in non-diseased subjects, the false positive results are uncorrelated. If the results of index and reference tests are independent in diseased subjects, the false negative results are uncorrelated. In this case, Table 1.6 gives the expected numbers of test subjects in the cross-classification of index and reference test as a function of S and C of the tests. From this table, the apparent specificity of the index test, $C_{I,app}$, i.e. the proportion of index test negatives among reference test negatives, is:

\[
C_{I,app} = \frac{(1 - \pi) \, C_I \, C_R + \pi \, (1 - S_I)(1 - S_R)}{(1 - \pi) \, C_R + \pi \, (1 - S_R)},
\]

which depends in a complex way on the disease prevalence, the S and C of the index test and the S and C of the reference test.

Figures 1.1 and 1.2 show the apparent C and S of the index test for varying reference test S and C values. Focusing on the case of uncorrelated errors (Figures 1.1.A and 1.2.A), we observe that the apparent index test specificity $C_{I,app}$ is influenced the most by the reference test sensitivity $S_R$ (Figure 1.1.A). The reference test specificity $C_R$ influences $C_{I,app}$ to a lesser extent in this example, appearing to amplify the effects of a reduced $S_R$. The first observation is explained as follows: the more cases the reference test misses, the more likely it is that the index test will detect some of them as positive. This results in apparent false positives for the index test.
The relationship between C I,app and C R is less intuitive and is best studied using simulation studies as shown in Figure 1.1. Similar observations can be made for the apparent index test sensitivity S I,app, which is mainly influenced by the reference test specificity C R and to a lesser extent the reference test sensitivity S R (Figure 1.2.A). In most cases, S I,app and C I,app are lower than the true C I and S I, respectively, indicating an underestimation of the index test accuracy. In our example, C I,app only exceeds C I when the reference test is highly sensitive (S R > 0.95), but has lower specificity, and false positive results of index and reference test are correlated (Figure 1.1.A). Again this corresponds to our intuition: if index and reference tests make the same false positive errors, these errors are not recognized. In a similar way, S I,app only exceeds S I when the reference test is highly specific (C R > 0.95), but has lower sensitivity, and false negative results of index and reference test are correlated (Figure 1.2.A). 9
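The dependence of the apparent accuracy on the reference test can be explored numerically. The following is a minimal sketch, not the code behind Figures 1.1 and 1.2: it evaluates the apparent sensitivity and specificity from the expected cell probabilities of Table 1.6. The function name `apparent_accuracy` and the covariance arguments `cov_d` and `cov_n` (added to the concordant cells within diseased and non-diseased subjects, respectively) are assumptions introduced here for illustration; setting both to zero gives the conditional independence case. The numerical values are illustrative and are not taken from the thesis.

```python
def apparent_accuracy(prev, s_i, c_i, s_r, c_r, cov_d=0.0, cov_n=0.0):
    """Apparent (sensitivity, specificity) of an index test judged against an
    imperfect reference test.

    cov_d / cov_n are hypothetical covariance terms between index and reference
    results within diseased / non-diseased subjects (0 = conditional independence).
    """
    # Joint probabilities of concordant results within each disease class
    both_pos_d = s_i * s_r + cov_d                # diseased, both positive
    both_neg_d = (1 - s_i) * (1 - s_r) + cov_d    # diseased, both negative
    both_pos_n = (1 - c_i) * (1 - c_r) + cov_n    # non-diseased, both positive
    both_neg_n = c_i * c_r + cov_n                # non-diseased, both negative

    # Marginal probabilities of the reference test result
    ref_pos = prev * s_r + (1 - prev) * (1 - c_r)
    ref_neg = prev * (1 - s_r) + (1 - prev) * c_r

    s_app = (prev * both_pos_d + (1 - prev) * both_pos_n) / ref_pos
    c_app = (prev * both_neg_d + (1 - prev) * both_neg_n) / ref_neg
    return s_app, c_app


# Illustrative values only
prev, s_i, c_i = 0.3, 0.9, 0.9

# Conditional independence: a reference test that misses cases lowers C_app
_, c_app_indep = apparent_accuracy(prev, s_i, c_i, s_r=0.8, c_r=0.95)

# Correlated false positives combined with a highly sensitive reference test
_, c_app_dep = apparent_accuracy(prev, s_i, c_i, s_r=0.99, c_r=0.8, cov_n=0.05)

print(f"independent errors:         C_app = {c_app_indep:.3f}")
print(f"correlated false positives: C_app = {c_app_dep:.3f}")
```

With these inputs the apparent specificity falls below the true value of 0.9 under independent errors and rises above it when false positives are correlated, in line with the pattern described in the text.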

| Index test result | Reference negative, not diseased | Reference negative, diseased | Reference positive, not diseased | Reference positive, diseased |
|---|---|---|---|---|
| Negative | $N(1-\pi)\,C_I C_R$ | $N\pi\,(1-S_I)(1-S_R)$ | $N(1-\pi)\,C_I(1-C_R)$ | $N\pi\,(1-S_I)\,S_R$ |
| Positive | $N(1-\pi)(1-C_I)\,C_R$ | $N\pi\,S_I(1-S_R)$ | $N(1-\pi)(1-C_I)(1-C_R)$ | $N\pi\,S_I S_R$ |
| Total | $N(1-\pi)\,C_R$ | $N\pi\,(1-S_R)$ | $N(1-\pi)(1-C_R)$ | $N\pi\,S_R$ |

Table 1.6: Expected cell counts for the classification of index and imperfect reference test results by disease status, assuming conditional independence of index and reference test results. ($S_I$, $C_I$) and ($S_R$, $C_R$) are the sensitivity and specificity of index and reference test, respectively. $N$ is the total sample size, and $\pi$ the disease prevalence in the sample.
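As a worked step, the apparent accuracy measures follow from Table 1.6 by summing over the unobserved disease status within each reference test column. The derivation below is a sketch in the notation of Table 1.6; the expression for $S_{I,app}$ is not written out in the text and is obtained here by the same argument as the one used for $C_{I,app}$.

$$
\begin{aligned}
C_{I,app} &= \frac{E[\text{index negative, reference negative}]}{E[\text{reference negative}]}
 = \frac{N(1-\pi)\,C_I C_R + N\pi\,(1-S_I)(1-S_R)}{N(1-\pi)\,C_R + N\pi\,(1-S_R)}, \\
S_{I,app} &= \frac{E[\text{index positive, reference positive}]}{E[\text{reference positive}]}
 = \frac{N\pi\,S_I S_R + N(1-\pi)(1-C_I)(1-C_R)}{N\pi\,S_R + N(1-\pi)(1-C_R)}.
\end{aligned}
$$

The sample size $N$ cancels in both ratios. With a perfect reference test ($S_R = C_R = 1$) the expressions reduce to $C_I$ and $S_I$. Under conditional independence, $C_{I,app}$ is a weighted average of $C_I$ and $1 - S_I$, so it cannot exceed $C_I$ as long as $C_I \geq 1 - S_I$; the same holds for $S_{I,app}$ relative to $S_I$.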

Figure 1.1: Apparent specificity $C_{I,app}$ of an index test with true specificity $C_I = 0.9$ when compared to an imperfect reference test. Results are given for conditional independence between index and reference test (panel A), for conditional dependence in diseased subjects (panel B) and for conditional dependence in non-diseased subjects (panel C).

Figure 1.2: Apparent sensitivity $S_{I,app}$ of an index test with true sensitivity $S_I = 0.9$ when compared to an imperfect reference test. Results are given for conditional independence between index and reference test (panel A), for conditional dependence in diseased subjects (panel B) and for conditional dependence in non-diseased subjects (panel C).

1.4 Latent Class Analysis

Many approaches have been suggested to deal with the problem caused by the use of imperfect reference standards in diagnostic research. Often these approaches have been the subject of controversy. For example, one approach, discrepant analysis, has been shown to be inappropriate in general [29, 30], but may on the other hand offer some assistance and may lead to less biased estimates than a naive analysis assuming the reference test is perfect [31]. The approach we discuss in this thesis, latent class analysis (LCA), has also been the subject of debate [11, 32, 33]. However, this technique has been increasingly used in the last decade and can produce valid estimates of the accuracy of diagnostic tests even in the absence of a perfect reference test [19]. In this section, we give a non-technical introduction to LCA. More technical details are given in Chapter 2.

1.4.1 The Basic LCA Model

Table 1.7 shows the results of the VL diagnostic study with sample size N = 273 in Sudan (Chapter 2) for the DAT ($T_1$) and microscopical examination of a bone marrow sample ($T_2$). If $T_1$ reflected the disease status accurately, the prevalence of VL in the sample would be 28.6%; if $T_2$ were 100% accurate, the prevalence would be 23.8%. Clearly, either or both tests must have given some incorrect results.

| | Microscopy ($T_2$) negative | Microscopy ($T_2$) positive | Total |
|---|---|---|---|
| DAT ($T_1$) negative | 185 | 10 | 195 |
| DAT ($T_1$) positive | 23 | 55 | 78 |
| Total | 208 | 65 | 273 |

Table 1.7: Results of the VL diagnostic study for DAT ($T_1$) and microscopical examination of a bone marrow sample ($T_2$) obtained in Sudan (see Chapter 2; simplified version of Table 2.I).

For example, there are 185 subjects with negative test results on both tests (Table 1.7).
This does not necessarily mean that these subjects are not diseased: each of them may be a non-diseased subject with correct negative results on both tests (i.e., a true negative) or a diseased subject with incorrect negative results on both tests (i.e., a false negative). So these 185 subjects consist of an unknown number of true negative subjects and an unknown number of false negative subjects. Given that there are in total $N\pi$ diseased subjects and $N(1-\pi)$ non-diseased subjects, the number of subjects with negative results on both tests is:

$$N\pi\, P(T_1 = 0, T_2 = 0 \mid D = 1) + N(1-\pi)\, P(T_1 = 0, T_2 = 0 \mid D = 0).$$

If we assume the test results are independent within the diseased subjects and within the non-diseased subjects, this corresponds to:

$$N\left[\pi\, P(T_1 = 0 \mid D = 1)\, P(T_2 = 0 \mid D = 1) + (1-\pi)\, P(T_1 = 0 \mid D = 0)\, P(T_2 = 0 \mid D = 0)\right]$$

or

$$N\left[\pi\, (1 - S_{T_1})(1 - S_{T_2}) + (1-\pi)\, C_{T_1} C_{T_2}\right],$$

where $S_{T_1}$, $C_{T_1}$ and $S_{T_2}$, $C_{T_2}$ are the sensitivity and specificity of $T_1$ and $T_2$, respectively. The corresponding equations for all cells of Table 1.7 are given in Table 1.8. Combining the data from Table 1.7 and the equations from Table 1.8, we obtain a set of 4 equations:

$$
\begin{aligned}
N\left[(1-\pi)\, C_{T_1} C_{T_2} + \pi\, (1 - S_{T_1})(1 - S_{T_2})\right] &= 185 \\
N\left[(1-\pi)\, C_{T_1} (1 - C_{T_2}) + \pi\, (1 - S_{T_1})\, S_{T_2}\right] &= 10 \\
N\left[(1-\pi)\, (1 - C_{T_1})\, C_{T_2} + \pi\, S_{T_1} (1 - S_{T_2})\right] &= 23 \\
N\left[(1-\pi)\, (1 - C_{T_1})(1 - C_{T_2}) + \pi\, S_{T_1} S_{T_2}\right] &= 55
\end{aligned}
$$

| | Microscopy ($T_2$) negative | Microscopy ($T_2$) positive |
|---|---|---|
| DAT ($T_1$) negative | $N[(1-\pi)\, C_{T_1} C_{T_2} + \pi\, (1 - S_{T_1})(1 - S_{T_2})]$ | $N[(1-\pi)\, C_{T_1} (1 - C_{T_2}) + \pi\, (1 - S_{T_1})\, S_{T_2}]$ |
| DAT ($T_1$) positive | $N[(1-\pi)\, (1 - C_{T_1})\, C_{T_2} + \pi\, S_{T_1} (1 - S_{T_2})]$ | $N[(1-\pi)\, (1 - C_{T_1})(1 - C_{T_2}) + \pi\, S_{T_1} S_{T_2}]$ |

Table 1.8: Expected cell counts for the classification of index and imperfect reference test results.

Finding the solution to this set of equations gives us estimates of $\pi$, $S_{T_1}$, $C_{T_1}$, $S_{T_2}$, and $C_{T_2}$ and is the essence of the use of LCA in diagnostic medicine. However, in this set of equations there are 5 unknowns and only 3 independent equations, the fourth equation being a linear combination of the 3 previous equations (the four cell counts sum to $N$). Consequently, this problem is non-identifiable, as the degrees of freedom of the data (3) are fewer than the number of unknown parameters (5). The results in Table 1.7 could be obtained from an infinite set of possible situations. For example, (1) $\pi = 0.25$, $S_{T_1} = 1$, $C_{T_1} = 0.95$, $S_{T_2} = 0.8$, and $C_{T_2} = 0.95$; (2) $\pi = 0.33$, $S_{T_1} = 0.85$, $C_{T_1} = 0.99$, $S_{T_2} = 0.72$, and $C_{T_2} = 1$; (3) $\pi = 0.21$, $S_{T_1} = 0.96$, $C_{T_1} = 0.89$, $S_{T_2} = 1$, and $C_{T_2} = 0.96$, etc. This is similar to fitting a regression line $y = a + bx$ through a single data point $\{x_1, y_1\}$.

If we have 3 conditionally independent diagnostic tests, we have 8 possible outcome patterns (+++, ++−, +−+, +−−, −++, −+−, −−+, −−−), resulting in 7 independent equations.
In this situation, we equally have 7 unknowns: the prevalence and the S and C of each of the 3 tests. Consequently, the degrees of freedom of the data equal the number of unknown parameters and a single analytical solution is then possible. For example, for Table 1.7 the solution is $\pi = 0.37$, $S_{T_1} = 0.88$, $C_{T_1} = 0.99$, $S_{T_2} = 0.77$, $C_{T_2} = 0.91$, $S_{T_3} = 0.77$, $C_{T_3} = 1$ (see footnote 1). This corresponds to the situation of fitting a regression line $y = a + bx$ through two data points $\{x_1, y_1\}$, $\{x_2, y_2\}$. There will always be a single line through the two points, but it will be impossible to assess whether the assumption of a linear relationship between $x$ and $y$ is warranted. Similarly, a basic latent class model with three binary tests will always fit the data perfectly. No degrees of freedom are left to assess whether the main assumption underlying the basic latent class model is valid. This assumption is that the test results are independent, given the disease status of the patient. This assumption is crucial to the identifiability of this model. Given its importance, we describe this assumption in more detail in Section 1.4.2.

Footnote 1: Actually, two numerically valid solutions are possible: (1) $\pi = 0.37$, $S_{T_1} = 0.88$, $C_{T_1} = 0.99$, $S_{T_2} = 0.77$, $C_{T_2} = 0.91$, $S_{T_3} = 0.77$, $C_{T_3} = 1$ and (2) $\pi = 0.63$, $S_{T_1} = 0.01$, $C_{T_1} = 0.12$, $S_{T_2} = 0.09$, $C_{T_2} = 0.23$, $S_{T_3} = 0$, $C_{T_3} = 0.23$. The second solution mirrors the first, with the labels "diseased" and "non-diseased" exchanged. Only the first solution makes clinical sense, however.

By collecting data on more than 3 diagnostic tests, it becomes possible to assess the fit of the data to the latent class model. No analytical solution to the set of equations is possible. The parameters of interest ($\pi$, $S_{T_j}$, $C_{T_j}$) will be estimated using statistical techniques such as weighted least squares, maximum likelihood, or Bayesian methods. Again, this is similar to fitting a regression line through 3 or more points. The $a$ and $b$ coefficients of the regression line would be estimated through a statistical model, and the assumption of linearity can be assessed and relaxed if needed, for example by including higher order terms in the regression equation. In the same vein, the basic latent class model can be extended to incorporate the conditional dependence between the different tests [34]. However, a general conditionally dependent model is not identifiable, as the number of possible dependence terms is larger than the number of available degrees of freedom [35]. In the case of 4 diagnostic tests, 15 parameters can be estimated, but while a conditionally independent model only has 9 parameters, a fully conditionally dependent model has 31 [35]. Consequently, to arrive at estimates of the disease prevalence and diagnostic accuracy of the tests under consideration, constraints need to be added to the model. Models with different constraints can be constructed, the fit of the data to the models can be assessed using frequentist (Pearson chi-square), maximum likelihood (log-likelihood chi-square) or Bayesian techniques (Deviance Information Criterion, Bayesian p-value), and adequately fitting models retained. We research this problem further in Chapter 2.
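The counting argument and the non-identifiability of the two-test model can be checked numerically. The sketch below is illustrative only: the helper names `expected_counts`, `data_df`, `n_params_independent` and `n_params_saturated` are introduced here and are not part of the thesis. It evaluates the expected cell counts of Table 1.8 for parameter sets (2) and (3) quoted earlier in this section, and compares the degrees of freedom of the data with the number of parameters for k binary tests. The saturated count assumes a free distribution over the outcome patterns in each disease class, which is one way to arrive at the 31 parameters cited for 4 tests.

```python
def expected_counts(n, prev, s1, c1, s2, c2):
    """Expected 2x2 counts (rows: T1 -/+, columns: T2 -/+) under the basic
    latent class model with conditionally independent tests (Table 1.8)."""
    neg_neg = n * ((1 - prev) * c1 * c2 + prev * (1 - s1) * (1 - s2))
    neg_pos = n * ((1 - prev) * c1 * (1 - c2) + prev * (1 - s1) * s2)
    pos_neg = n * ((1 - prev) * (1 - c1) * c2 + prev * s1 * (1 - s2))
    pos_pos = n * ((1 - prev) * (1 - c1) * (1 - c2) + prev * s1 * s2)
    return [[neg_neg, neg_pos], [pos_neg, pos_pos]]


def data_df(k):
    """Degrees of freedom of the data: 2^k outcome patterns minus 1."""
    return 2 ** k - 1


def n_params_independent(k):
    """Prevalence plus one sensitivity and one specificity per test."""
    return 2 * k + 1


def n_params_saturated(k):
    """Prevalence plus a free pattern distribution in each disease class
    (an assumed reading of the fully dependent model)."""
    return 2 * (2 ** k - 1) + 1


observed = [[185, 10], [23, 55]]  # Table 1.7

# Parameter sets (2) and (3) from the text: (prev, S_T1, C_T1, S_T2, C_T2)
candidates = {
    "(2)": (0.33, 0.85, 0.99, 0.72, 1.00),
    "(3)": (0.21, 0.96, 0.89, 1.00, 0.96),
}

for label, params in candidates.items():
    fitted = expected_counts(273, *params)
    print(label, [[round(x, 1) for x in row] for row in fitted])

for k in (2, 3, 4):
    print(k, data_df(k), n_params_independent(k), n_params_saturated(k))
```

Both parameter sets give expected counts within about one subject of every cell of Table 1.7, although they imply quite different prevalences and test accuracies. With 2 tests there are fewer degrees of freedom than parameters, with 3 tests the two counts are equal, and only from 4 tests onwards do degrees of freedom remain to assess model fit.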
1.4.2 Conditional Dependence

As explained above, an important assumption underlying the basic latent class model is the assumption of conditional independence. Under this assumption the amount of agreement between two tests is fully explained by the underlying disease status. However, it is common that results of two tests are dependent even within categories of the disease of interest. Conditionally dependent tests tend to give the same result, even when this result is wrong [29]. This conditional dependence may be due to a common biological phenomenon on which the two tests are based. Examples of dependent tests are numerous. For example, contamination with nucleic acid may make 2 genetic tests that test for the same organism both positive regardless of whether the pathogen is present, while the presence of an inhibitor of the test process would make both tests negative [29]. Two tests that detect a parasite through the presence of antigens or through microscopical examination may both be more likely to be positive in infected patients with a high parasite load and more likely to be negative in infected patients with a low parasite load.

The presence of dependence between an imperfect reference test and index test results complicates the selection of the optimal tests. An index test that is conditionally dependent on the reference test will appear to be more accurate than one that is conditionally independent. This will result in bias when comparing the accuracy of two or more tests. We describe this bias in more detail in Chapter 5 of this thesis. Latent class models may also be biased if the assumption of conditional independence is made when it is not warranted [11, 19, 36, 37, 38]. Consequently, it is important to account for this dependence when analyzing the results of a diagnostic study. If more than 3 tests are available, the assumption of conditional independence can be statistically assessed. However, differently specified latent class models can be successfully fitted to the same set of test results, with relevant differences in disease prevalence and test accuracy estimates [19, 32]. In Chapter 2, we describe a case study where different dependence models result in similar fits to the data while resulting in different conclusions.

1.5 Meta-Analysis of Diagnostic Accuracy Studies

As for interventional studies, there is an expanding need to summarize data across studies. The increased importance of evidence-based medicine makes it essential that reliable summaries of the large volume of clinical research are constructed [39]. This is even truer for diagnostic studies, as they tend to be small and may study diverse populations. The recent interest in the meta-analysis of diagnostic studies has resulted in the establishment of a Diagnostic Test Accuracy (DTA) Working Group within the Cochrane Collaboration. The Cochrane Collaboration is an organization that links clinical researchers and other people interested in health with the aim of improving health care decisions. One of the main activities of the Collaboration is to aid researchers in making high quality systematic reviews and to provide a forum for the publication of these results [40]. Since the initiation of the DTA working group, 34 reviews of diagnostic test accuracy studies have been published in the Cochrane Library.

As for all research projects, a meta-analysis should be performed using a structured approach. In a first step, the research objectives should be identified and a detailed research protocol prepared [41].
The research protocol should detail the exact diagnostic tests under study (the index tests), the endpoints that are evaluated, the patient groups and clinical settings to which the results should apply, the search strategy of the clinical literature, the data extraction procedures and the statistical analysis methodology [42]. A meta-analysis of diagnostic studies has broadly the same aims as a meta-analysis of intervention studies: to provide more precise estimates of the overall parameters of interest, to assess if results vary by subgroups, to study apparent conflicts in study results, to understand the heterogeneity among study results and to provide summaries that are generalizable to a larger population [39, 40, 42].

The aim of a meta-analysis of diagnostic studies should not only be to construct a summary or pooled estimate of a certain parameter, for example S or C, but also to develop an understanding of the variability in parameter estimates between studies. For DTA reviews, it is generally assumed that there is underlying heterogeneity of the diagnostic accuracy among studies. Consequently, only random effects models are usually considered adequate for meta-analyses. The amount of heterogeneity between studies should be described to assess possible variability of the diagnostic accuracy of the test. This suggests that it may be appropriate to include a wide range of studies, and use modeling approaches to understand and describe the variation among studies. This may also ensure a larger generalizability of the meta-analysis results than a meta-analysis based on a limited set of strictly selected studies, where the accuracy of the index test may not be representative of that in lower-resource settings. It is sometimes suggested that studies that use an imperfect reference test should be excluded from the systematic review [41].
This could, however, bias the selection of studies towards those performed in reference centers, which may have better trained personnel, more extensive resources, and higher quality laboratory equipment. It may be better to allow for the inclusion of such studies and deal with this in the analysis. In a similar vein, some authors state that all studies in a meta-analysis should use the same reference standard [43]. This implies that these reference tests are by definition imperfect. If two perfect reference standards are available, their results should be identical and it should not matter which one is selected. Again, we suggest it may be better to include studies with a variety of reference standards and model the differences in accuracy among the reference standards.

In the past, meta-analysis has been criticized in the clinical and epidemiological literature; a summary of these critiques is found in Borenstein et al., 2009 [44]. Changes to meta-analysis techniques, specifically the use of random effects models and meta-regression and the work on the detection and alleviation of publication bias, make most of these critiques obsolete. This is however only true if a meta-analysis is correctly performed and reported. If substantial heterogeneity between studies is observed, then the analysis should focus on this heterogeneity rather than trying to summarize the data in a single parameter [44], or in the case of diagnostic accuracy studies, a single (S, C) pair. This can be done by focusing on prediction regions for the diagnostic accuracy of the test in a new setting rather than confidence (or credible) regions [45].

1.6 Bayesian Model Estimation

We estimate all models with Bayesian methods, using Markov chain Monte Carlo (MCMC) techniques. Bayesian statistics is one of the two main approaches to statistical inference [46]; the other approach is the classical, or frequentist, approach. Many excellent introductions to Bayesian statistics are available, to which we refer for an in-depth treatment of the fundamentals of Bayesian statistics [46, 47, 48, 49].

1.6.1 The Basis of Bayesian Statistics

Bayesian statistics is based on updating our beliefs in the face of observations.
We start with a prior distribution describing our uncertainty about some parameters of interest, collect data, and move to a posterior distribution incorporating our prior information and the information contained in the data [50]. The basic principle of Bayesian statistics is to start with a certain belief concerning a parameter $\theta$, described by the prior distribution $p(\theta)$. Subsequently, some data $y$ are observed. The aim of statistical inference is to obtain a new distribution describing our belief about the parameter $\theta$ given the observed data $y$. This is the posterior distribution $p(\theta \mid y)$. Bayes' theorem describes how the posterior distribution can be obtained from the prior distribution of $\theta$, the distribution of the data given the parameter values (the likelihood function $p(y \mid \theta)$), and the marginal distribution of the data $p(y)$ [47]:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}.$$

This is not fundamentally different from how a physician diagnoses a patient. The close link between diagnostic medicine and Bayesian statistics has led some authors to posit that clinicians are natural Bayesians [51]. When a doctor first faces a patient with a certain set of symptoms, he will have some idea of the probability that the patient has a certain disease. For example, in an endemic region of VL, a patient presents with fever and splenomegaly. The physician who observes the patient may estimate, on the basis of his experience or epidemiological data, that there is a 1 in 2 (50%) chance that the patient has VL. To confirm his suspicion,