
ABSTRACT

ZHANG, JINGYU. Partially Observable Markov Decision Processes for Prostate Cancer Screening. (Under the direction of Dr. Brian T. Denton.)

Prostate cancer is the most common solid tumor affecting American men. Screening is carried out using prostate-specific antigen (PSA) tests and biopsies. This dissertation investigates the optimal design of screening policies that trade off the cost and harm of screening to patients against the benefits of early detection. We report on partially observable Markov decision processes (POMDPs) for the study of prostate cancer screening decisions. A Markov process represents the occurrence and progression of prostate cancer in our models. The core states are the patient's prostate cancer-related health states; PSA test results and biopsy results are the observations. First, a POMDP model is proposed for prostate biopsy referral decisions assuming the patient undergoes annual PSA screening. The objective is to maximize expected quality-adjusted life years (QALYs). Several structural properties that give insight into the optimal biopsy referral policy over the course of a patient's lifetime are proved. An age-specific prostate biopsy referral policy is obtained, and sensitivity analysis is used to show how the optimal policy and value depend on the model parameters. Next, a POMDP model is proposed for optimizing both PSA screening and biopsy referral decisions. We use this model to compute the optimal policy for PSA testing or biopsy at each decision epoch over the course of a patient's lifetime. The objective of the model is to maximize the difference between rewards for QALYs and the costs of screening, biopsy, and treatment. The optimal policy is compared to no screening and the traditional guideline from the published literature.

Benefits of screening are shown in terms of the expected QALYs and costs, and sensitivity analysis is performed with respect to cost parameters. Finally, a multi-stage POMDP is proposed to coordinate prostate cancer screening and treatment decisions. Multiple treatment options, including active surveillance and radical prostatectomy, are considered in the model. The model is extended to include additional actions, core states, and observations at each decision epoch. A new sampling-based approximation method is developed to solve the extended POMDP model. Structural properties of the model are discussed, and a method that exploits the underlying structure is incorporated into the approximation method. Computational experiments comparing the new approximation method to previously proposed methods are presented to show its effectiveness and efficiency. Empirical results for the optimal screening and treatment policy are presented. Sensitivity analysis is used to show how the availability of active surveillance (AS) influences the optimal screening policy and the expected QALYs.

Partially Observable Markov Decision Processes for Prostate Cancer Screening

by
Jingyu Zhang

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Operations Research

Raleigh, North Carolina
2011

APPROVED BY:
Shu-Cherng Fang
Julie S. Ivy
Thom J. Hodgson
Brian T. Denton (Chair of Advisory Committee)

DEDICATION

To my family.

BIOGRAPHY

Jingyu Zhang was born in Leshan, Sichuan Province, China. After finishing junior middle school in Leshan, Sichuan, he attended the Experimental Class in Sciences, sponsored by the Chinese Department of Education, at the High School Attached to Tsinghua University in Beijing, China. He received his Bachelor of Science degree in Mathematics and Physics from the Fundamental Science Class, School of Sciences, Tsinghua University, Beijing, China, and his Master's in Operations Research from North Carolina State University. After his Ph.D. final defense, he joined Philips Research North America as a Member Research Staff in April 2011.

ACKNOWLEDGMENTS

This thesis would not have been possible without the invaluable support and guidance of my advisor, Dr. Brian T. Denton, who always believed in me and encouraged me to succeed during this challenging process. I am very thankful to him for his endless support. I would also like to acknowledge support for this research, which was funded in part by grant CMMI from the National Science Foundation. I would like to thank my committee members, Dr. Thom J. Hodgson, Dr. Shu-Cherng Fang, and Dr. Julie S. Ivy, for agreeing to be on my committee and providing feedback on my work. I would like to thank my collaborators, Dr. Brant A. Inman, Dr. Hari Balasubramanian, and Dr. Nilay D. Shah, for their efforts and suggestions. I would also like to thank Daniel Underwood for his technical assistance. I thank my father, my mother, and all my family and friends for their unconditional love and support. Finally, I thank my girlfriend, Chuan Tian, for everything she did for me and for her endless love.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction

2 A Literature Review
   Introduction
   General POMDP Model
   POMDP Applications in Medical Decision Making
   Structural Properties of POMDPs
   Computational Methods
      Exact Algorithms
      Approximation Algorithms
   Contributions to the Literature

3 Optimization of Prostate Biopsy Referral Decisions
   Introduction
   Prostate Cancer Background
   Literature Review
   POMDP Model
   Transition Probability Matrices and Reward Vectors
   POMDP Structural Properties
   Results
   Data Description
   Estimating Parameters
   Computational Experiments and Sensitivity Analysis
   Benefits of Prostate Cancer Screening
   Discussion
   Conclusions

4 Optimization of PSA Screening Decisions
   Introduction
   POMDP Model
   Transition Probability Matrices and Reward Vectors
   Optimality Equations
   Results
   Optimal Screening Policies
   Benefits of Prostate Cancer Screening
   Sensitivity Analysis
   Discussion
   Conclusions

5 Optimal Coordination of Prostate Cancer Screening and Treatment
   Introduction
   Model Formulation
   Transition Probability Matrices and Reward Vectors
   Bayesian Updating
   Optimality Equations
   Methodology
   Upper and Lower Bounds
   Sampling-Based Approximation Method
   Core State Reduction
   Results
   Model Parameter Estimation
   Computational Experiments
   Optimal Screening and Treatment Policy
   Discussion
   Conclusions

6 Conclusions

Bibliography

Appendices
   Appendix A

LIST OF TABLES

Table 3.1  Detailed description of model parameters defining transition probabilities and rewards for the core state process.
Table 3.2  The age-specific values of the prostate cancer incidence rate, w_t.
Table 3.3  Parameters, their sources, and specific values used in our base-case analysis.
Table 3.4  The bounds on w_t derived from [1] for one-way sensitivity analysis.
Table 3.5  Sensitivity analysis of expected QALYs for a 40-year-old patient assuming π_40(C) = 0, comparing the optimal policy to the case of no screening. Base-case values are shown in bold.
Table 4.1  Detailed description of model parameters defining transition probabilities and rewards for the core state process.
Table 4.2  Sensitivity analysis of the expected benefits of PSA screening for 40-year-old healthy men for different ϵ and µ. The optimal policy is compared to the case of no screening and to the case of annual PSA tests with a 4.0 ng/ml threshold for biopsy, from both the patient and societal perspectives. The latter perspective is based on a societal willingness to pay of β = 50,000. Since the expected QALYs of the traditional guideline are obtained by simulation, 95% confidence intervals are presented in parentheses. The base case is shown in bold.
Table 4.3  Sensitivity analysis of the total costs of the optimal policy and the traditional guideline for 40-year-old healthy men from the societal perspective for varying β and costs. Other parameters, including ϵ and µ, are set according to the base case.
Table 5.1  Detailed description of model parameters defining transition probabilities and rewards for the core state process.
Table 5.2  Detailed description of model parameters defining transition probabilities and rewards for the core state process.
Table 5.3  Comparison of the approximation methods to IP. LBA denotes the lower bound algorithm, UBA the upper bound algorithm, LBACR the lower bound algorithm with core state reduction, and UBACR the upper bound algorithm with core state reduction.
Table 5.4  The performance of our sampling-based approximation method under different budget constraints k = 11, 12, 13, 30, 300, and 3000 for a given c. LBA denotes the lower bound algorithm, UBA the upper bound algorithm, LBACR the lower bound algorithm with core state reduction, and UBACR the upper bound algorithm with core state reduction.
Table 5.5  The performance of our sampling-based approximation method under different budget constraints c = 6, 7, 8, 9, 10, and 15, with k = 30. LBA denotes the lower bound algorithm, UBA the upper bound algorithm, LBACR the lower bound algorithm with core state reduction, and UBACR the upper bound algorithm with core state reduction.
Table 5.6  The performance of our sampling-based approximation method under different randomized policies, Pr(B) = 0.1, 0.5, and 0.9. LBA denotes the lower bound algorithm, UBA the upper bound algorithm, LBACR the lower bound algorithm with core state reduction, and UBACR the upper bound algorithm with core state reduction.
Table 5.7  The performance of our sampling-based approximation methods, compared with no screening and with the optimal policy assuming RP upon prostate cancer detection, in terms of the expected QALYs for people with prior belief π_40(NC) = 1. OPRP denotes the optimal policy assuming RP upon detection, IP the IP algorithm, LBA the lower bound algorithm, UBA the upper bound algorithm, LBACR the lower bound algorithm with core state reduction, and UBACR the upper bound algorithm with core state reduction.
Table 5.8  Sensitivity analysis of the expected benefits of PSA screening for 40-year-old healthy men for different ϵ and µ. The optimal screening and treatment policy is compared to the case of RP immediately upon detection. Base-case values are in bold.
Table 5.9  Parameters, their sources, and specific values used in our sensitivity analysis.

LIST OF FIGURES

Figure 3.1  An ROC curve illustrating the imperfect nature of PSA tests for diagnosing prostate cancer. The different points on the curve correspond to different PSA thresholds used to distinguish a suspicious from a likely benign test. The curve was generated using the dataset described in the Data Description section of Chapter 3.
Figure 3.2  Illustration of the typical stages of prostate cancer screening and treatment, including PSA screening, biopsy, and treatment.
Figure 3.3  POMDP model simplification: aggregating the three non-metastatic prostate cancer stages after detection into a single core state T. Solid lines denote transitions related to prostate cancer; dotted lines denote the action of biopsy and subsequent treatment; dashed lines in (c) denote death from other causes.
Figure 3.4  Optimal biopsy referral policy. The solid line denotes the optimal threshold for the base case.
Figure 3.5  One-way sensitivity analysis for parameters: w_t, d_t, b_t, e_t, z_t, f, µ, ϵ, γ, and λ. Solid lines denote the base-case policy and dashed lines denote the bounds.
Figure 3.6  One-way sensitivity analysis on optimal values for model parameters: d_t, w_t, ϵ, z_t, µ, γ, e_t, f, and b_t.
Figure 4.1  Illustration of the decisions and outcomes associated with the prostate cancer screening problem. The dashed rectangle denotes the prostate biopsy decision problem solved in Chapter 3, which is a subproblem of the screening problem we consider in this chapter.
Figure 4.2  Transitions among health states of the prostate cancer Markov model. Note that death from other causes is possible from any health state in our model but is omitted for simplicity.
Figure 4.3  Optimal prostate cancer screening policies from the patient and societal perspectives. Lines denote thresholds for PSA testing and biopsy. If the patient's age and probability of having prostate cancer fall in area B, the patient is referred for biopsy; DB means defer the biopsy referral until obtaining the PSA test result at the next decision epoch; DP means defer both the biopsy referral and the PSA test at the next decision epoch.
Figure 5.1  Recurring screening and treatment decision process for prostate cancer at decision epochs t, t + 1, .... π_t^(i) denotes the patient's belief state (the probability of being in each health state) at decision stage i; PSA/no-PSA denotes whether to have a PSA test, B/no-B whether to have a biopsy, and RP/no-RP whether to perform RP. Death is possible but not shown in this figure.
Figure 5.2  Markovian transitions among the prostate cancer states. Partially observable states are in the dotted box, completely observable states are in the solid box, and the triangle denotes death from prostate cancer; transitions due to RP are represented by dashed lines and other transitions by solid lines; death from other causes is possible from all states in the model but is not shown in this figure.
Figure 5.3  Illustration of the bounds of the approximation algorithm for a two-core-state POMDP, in which a scalar fully represents the belief state. (a) illustrates the true value function represented by a minimal α-vector set; (b) illustrates a lower bound on the value function formed by an outer linearization represented by a subset of the minimal α-vector set; and (c) illustrates an upper bound on the value function formed by an inner linearization represented by a set of sampled belief points and their values (in red).
Figure 5.4  Illustration of core state reduction and how it improves the sampling efficiency of the sampling-based approximation method, using a two-dimensional example with k = 7. (a) shows how the sampling-based approximation method samples. If the subspace π(s_1) = 0 (shown as a solid line in (b) and (c)) can be pre-solved, it is no longer necessary to sample belief points in the pre-solved subspace. Therefore, a higher sampling density in the region π(s_1) > 0 can be achieved in (c) than in (b) under the same budget constraint, k = 7.
Figure 5.5  One-way sensitivity analysis on optimal values for model parameters: d_t, w_t, µ, ϵ, γ, z_t, g_t, and f.

Chapter 1

Introduction

Population screening programs are critical to the early detection of chronic diseases. For many chronic diseases, such as cancer, early detection can add years or even decades to an individual's lifetime. Early detection can also reduce costs to the health system by avoiding the high costs associated with late stages of disease. Until recently, many life-threatening diseases were detected only when late-stage symptoms manifested themselves. Recent discoveries of biomarkers for certain diseases have enabled the development of screening programs with the goal of early detection and treatment. Unfortunately, most biomarkers are imperfect and can produce false positive or false negative outcomes. Therefore, a patient's true health status is often not known with certainty. This presents difficult decisions for physicians and patients who must decide whether to proceed with more invasive and expensive testing. A partially observable Markov decision process (POMDP) is a sequential decision-making model that explicitly considers uncertainty about the state of a system. POMDPs have been applied widely in many contexts during the last 30 years, including machine maintenance and repair, educational applications, estimating the location of a moving object, and sensor networks. POMDPs are also very well suited to the study of medical decision making in the context of diagnostic tests that provide imperfect information about a patient's true health state.

The focus of this thesis is the investigation of new POMDP models and solution methods for the optimization of prostate cancer screening decisions. This application has potentially important societal impacts, since prostate cancer is the most common solid tumor in American men and the best screening policy for prostate cancer is highly debated [2, 3]. This dissertation is structured as follows. First, a literature review of POMDP models, theoretical properties, algorithms, and their applications is provided in Chapter 2. This literature review also summarizes some of the most important applications of MDPs and POMDPs in the medical decision-making context. In Chapter 3, a POMDP model is proposed for prostate biopsy referral decisions assuming a patient undergoes annual PSA screening. The objective is to maximize expected quality-adjusted life years (QALYs). Several structural properties are proved, including the existence of a control-limit type policy for the biopsy referral decision and the condition under which screening should be discontinued. These structural properties give insight into the optimal biopsy referral policy over the course of a patient's lifetime. An age-specific prostate biopsy referral policy is obtained. Sensitivity analysis is used to evaluate how the optimal policy and expected QALYs are affected by changes to parameters in the model. In Chapter 4, a POMDP model is proposed for simultaneously making PSA screening and biopsy referral decisions. We use this model to investigate optimal policies for whether and when to have a PSA test or biopsy over the course of a patient's lifetime. The objective of the model is to maximize the difference between the reward for QALYs and the cost of screening, biopsy, and treatment. The optimal policy is compared to the case of no screening and to the traditional PSA screening guideline from the published medical literature. The value of screening is measured in terms of the expected QALYs and cost.

Sensitivity analysis is performed with respect to several model inputs, including cost parameters. In Chapter 5, a multi-stage POMDP is proposed for coordinating screening and treatment decisions in the presence of multiple treatment options, including active surveillance and radical prostatectomy. The model extends the models of the previous chapters to include multiple stages of actions at each epoch (PSA testing, biopsy referral, and treatment), additional core states (cancer grades), and additional observations. Due to the large scale of the resulting model, a new sampling-based approximation method with computational budget constraints is developed to solve the extended POMDP model. A core state reduction approach suited to the particular structure of our POMDP is also used to further improve the efficiency of the sampling-based approximation method. We use computational experiments to measure the effectiveness and efficiency of our new method compared to an existing method, and demonstrate that our method obtains a high-quality solution in reasonable time. Empirical results for the optimal screening and treatment policy are presented. Sensitivity analysis shows how the availability of active surveillance (AS) influences the optimal screening policy and the expected QALYs. In Chapter 6, we summarize the most significant findings from Chapters 3, 4, and 5. We also discuss some of the limitations of our POMDP models from the perspective of prostate cancer screening. Finally, we discuss opportunities for future research.

Chapter 2

A Literature Review

2.1 Introduction

A Markov decision process (MDP) defines a sequential decision process in which decisions must be made without perfect knowledge of the future. MDPs are defined by states (the status of the process), actions (interventions that determine the evolution of the process), and rewards (the outcomes associated with states and actions). The states of a Markov process satisfy the Markov property, which means future states and rewards depend only on the current state and action and are independent of the history of states and actions. For discrete-time, discrete-state MDPs, the process is described by transition matrices that define the probabilities of transitions between states during decision epochs of some defined duration. Finally, a reward vector associates rewards with the different states and actions of the system throughout the decision horizon. A partially observable Markov decision process (POMDP) is a generalization of an MDP in which the states are not completely observable. The unobservable states are called core states, and they satisfy the Markov property. In this sequential decision process, the decision maker does not know exactly which core state the process is in at each decision epoch; however, the probability of being in each core state can be inferred from observations of the system.
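To make these ingredients concrete, the following minimal sketch encodes a discrete-time, finite-horizon MDP and solves it by backward induction. All numbers and names are illustrative placeholders, not values from this dissertation.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; every number below is made up.
n_states, n_actions, N = 3, 2, 10
lam = 0.97  # discount factor

# P[a, s, s'] = probability of moving from state s to s' under action a.
P = np.array([
    [[0.90, 0.10, 0.00], [0.00, 0.80, 0.20], [0.00, 0.00, 1.00]],
    [[0.95, 0.05, 0.00], [0.00, 0.90, 0.10], [0.00, 0.00, 1.00]],
])
# r[a, s] = one-period reward for taking action a in state s.
r = np.array([[1.0, 0.5, 0.0], [0.9, 0.4, 0.0]])

# Backward induction: by the Markov property, the value at epoch t depends
# only on the current state, so one backup per epoch suffices.
v = np.zeros(n_states)  # terminal values
for t in reversed(range(N)):
    q = r + lam * P @ v          # q[a, s]: value of taking a in s at epoch t
    policy_t = q.argmax(axis=0)  # optimal action for each state at epoch t
    v = q.max(axis=0)            # optimal value function at epoch t
```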

While MDPs are defined by a transition probability matrix and a reward vector, POMDPs additionally require a definition of observable states and an information matrix comprising the conditional probabilities of the observations given that the process is in each of the underlying core states. Furthermore, the actions in a POMDP are defined on the belief state, which is a vector of probabilities of being in each of the core states. POMDPs are particularly attractive for medical decisions in which a patient's true health status is not directly observable, such as the presence of prostate cancer. In such situations, physicians rely on test results that provide estimates of the probability that a patient is in a certain health state, and a POMDP describes the decision-making process more accurately than an MDP. The importance of formulating medical and healthcare decision problems in the POMDP framework was first suggested by Smallwood et al. [4]. However, due to the computational effort required to solve large POMDPs, medical decision making did not become a serious area of POMDP research until the late 1990s [5]. A detailed review of POMDP applications in medical decision making is provided in Section 2.3. There are a number of literature reviews about POMDPs. Monahan [6], Lovejoy [7], White [8], Kaelbling et al. [9], Cassandra [10], and Littman [11] all provide extensive reviews focusing on theoretical properties and solution methodologies. Cassandra [12] provides a detailed review of POMDP applications. The remainder of this chapter differs from the above-referenced reviews in the following respects. First, we focus on medical decision making and healthcare applications of POMDPs. Second, this chapter is more recent than previous reviews, capturing the latest literature on POMDPs, including recent theoretical structural properties, computational methods, and applications.

This chapter is structured as follows. Section 2.2 provides a mathematical description of a general POMDP model and defines notation used throughout this thesis. Section 2.3 reviews POMDP applications in the medical decision-making context. Section 2.4 provides some important definitions and reviews classic structural properties of optimal policies for POMDPs. Section 2.5 summarizes exact and approximation algorithms that have been proposed for solving POMDPs. Finally, Section 2.6 concludes with a description of open opportunities and challenges for research on POMDPs.

2.2 General POMDP Model

A POMDP is an MDP with a core process satisfying the Markov property, whose core states are partially observable through observations of a message process. Core states define the true state of the system at decision epoch t and are denoted by s_t ∈ S. The core process is a Markov process on the core states with transition probabilities p_t(s_{t+1} | s_t, a_t); P(a_t) denotes the corresponding transition probability matrix, where a_t ∈ A is the action in decision epoch t. At each decision epoch, the decision maker makes an observation of the message space. The core states are inferred from the observations through the conditional probabilities q_t(l_t | s_t) (Q_t denotes the corresponding matrix, called the information matrix), where l_t ∈ M denotes an observation of the message process. Bayesian updating is used to combine the observation collected at each decision epoch with the prior belief to define the current belief state. π_t(s) ∈ [0, 1] denotes the probability (belief) of being in core state s at decision epoch t. We let π_t = {π_t(1), π_t(2), ..., π_t(|S|)} denote the corresponding vector of beliefs for all s_t ∈ S.

The POMDP, defined on the finite core state set S with finite action set A and finite observation set M, can be transformed into a continuous, completely observable MDP defined on a continuous |S|-dimensional probability space π_t ∈ Π, where Π = [0, 1]^|S| is the belief state space. After this transformation, the reward defined on the belief state,

   r_t(π_t, a_t(π_t)) = Σ_{s_t ∈ S} π_t(s_t) r_t(s_t, a_t(π_t)),

is the expected reward over the core states in epoch t. The continuous belief state transition from π_t to π_{t+1} is defined by the Bayesian updating process

   π_{t+1}(s_{t+1}) = [ q_{t+1}(l_{t+1} | s_{t+1}) Σ_{s_t ∈ S} p_t(s_{t+1} | s_t, a_t(π_t)) π_t(s_t) ] / [ Σ_{s_{t+1} ∈ S} q_{t+1}(l_{t+1} | s_{t+1}) Σ_{s_t ∈ S} p_t(s_{t+1} | s_t, a_t(π_t)) π_t(s_t) ].   (2.1)

Based on these definitions, the optimality equations of the continuous-state MDP can be written as

   v_t(π_t) = max_{a_t(π_t) ∈ A} { r_t(π_t, a_t(π_t)) + λ Σ_{l_{t+1} ∈ M} v_{t+1}(π_{t+1}) p_t(l_{t+1} | π_t, a_t(π_t)) },  π_t ∈ Π,   (2.2)

where

   p_t(l_{t+1} | π_t, a_t(π_t)) = Σ_{s_{t+1} ∈ S} q_{t+1}(l_{t+1} | s_{t+1}) Σ_{s_t ∈ S} p_t(s_{t+1} | s_t, a_t(s_t)) π_t(s_t),   (2.3)

and λ is the discount factor. Decision epochs increase to infinity for infinite-horizon POMDPs [13]. Finite-horizon POMDPs, on the other hand, have a finite terminal decision epoch, N, in which the value function v_N(π_N) depends only on the terminal reward, r_N(s_N, a_N(π_N)), as follows:

   v_N(π_N) = Σ_{s ∈ S} π_N(s) r_N(s, W),  π_N ∈ Π.
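As a concrete illustration of these equations, the sketch below implements the Bayesian update (2.1), the observation probability (2.3), and a one-step backup of (2.2) for a single belief vector. It is a minimal sketch with illustrative names (P, Q_next, pi, v_next); it is not code from this dissertation, and v_next stands in for whatever representation of v_{t+1} an algorithm maintains.

```python
import numpy as np

def obs_prob(pi, P_a, Q_next, obs):
    """p_t(l_{t+1} | pi_t, a_t), the denominator of (2.1); see (2.3).
    pi: belief over |S| core states; P_a: |S|x|S| transition matrix for
    action a; Q_next: |S|x|M| information matrix; obs: observation index."""
    return float((pi @ P_a) @ Q_next[:, obs])

def belief_update(pi, P_a, Q_next, obs):
    """Return pi_{t+1} from pi_t, P(a_t), Q_{t+1}, and l_{t+1}, as in (2.1)."""
    pred = pi @ P_a                 # sum_s p(s'|s,a) pi(s), for each s'
    joint = Q_next[:, obs] * pred   # numerator of (2.1), for each s'
    return joint / joint.sum()      # normalize by p_t(l_{t+1} | pi_t, a_t)

def backup(pi, P, r_t, Q_next, v_next, lam):
    """One-step backup of (2.2) at belief pi. P[a] and r_t[a] are per-action
    arrays; v_next maps a belief to its epoch-(t+1) value; lam = discount."""
    best = -np.inf
    for a in range(len(P)):
        val = float(pi @ r_t[a])    # expected immediate reward r_t(pi, a)
        for obs in range(Q_next.shape[1]):
            p_l = obs_prob(pi, P[a], Q_next, obs)
            if p_l > 0.0:
                val += lam * p_l * v_next(belief_update(pi, P[a], Q_next, obs))
        best = max(best, val)
    return best
```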

POMDPs are often more difficult to solve than MDPs because they are defined on a continuous belief state. Furthermore, the number of possible policies increases super-exponentially with the decision horizon. For instance, a policy tree for a finite-horizon POMDP with horizon length N contains

   Σ_{t=0}^{N-1} |M|^t = (|M|^N − 1) / (|M| − 1)

possible observation nodes. At each observation node, |A| actions can be chosen, which makes the total number of possible policies |A|^[(|M|^N − 1)/(|M| − 1)] [14]. For example, with only |M| = 2 observations, |A| = 2 actions, and horizon N = 4, a policy tree has 15 observation nodes and there are 2^15 = 32,768 possible policies. As a result of the computational challenges of solving POMDPs, there has been a significant amount of research on solution methods. We review several of the proposed methods in Section 2.5.

2.3 POMDP Applications in Medical Decision Making

POMDPs have been successfully applied in many industrial application areas. Machine maintenance and replacement [15, 16] and education [17] were among the first areas of application. Other industrial applications include structural inspection [18], elevator control policies [19], fisheries [20], and autonomous robot navigation [21]. However, only a small number of studies consider POMDP applications in health care and medical decision making. Although POMDPs have not been widely applied, there are many applications of MDPs in medical decision making (see Schaefer et al. [22] for a comprehensive review of MDPs in the context of medical decision making). For instance, Alagoz et al. [23] studied the living-donor liver transplantation timing problem. A stationary infinite-horizon MDP is used to obtain the optimal time of liver transplant. They used the model for end-stage liver disease (MELD) score to define the health state of the patient.

There are two actions, wait and transplant, and the objective is to maximize the expected quality-adjusted lifespan of the patient, where lifespan is composed of pre- and post-transplant portions. Structural properties, such as the existence of a control-limit policy, were proved under specific assumptions about the rewards and transition probabilities. The optimal transplant strategy was reported for different disease groups of patients. Denton et al. and Kurt et al. [24, 25] studied the optimal start time of statin therapy for patients with type 2 diabetes. Total cholesterol and high-density lipoprotein (HDL) were used to define a finite set of health states. The authors considered the objectives of maximizing expected QALYs from the patient perspective and maximizing the weighted difference between rewards for QALYs and the costs of treatment. Additional applications of MDPs to medical decision making are reviewed in Schaefer et al. [22]. Existing POMDP applications in health care and medical decision making are much fewer than MDP applications. However, since the true health or disease states are usually unknown or difficult to know, POMDP applications have recently become more common. The first POMDP application to medical decision making was proposed by Smallwood et al. [4]. The authors define the core states as the patients' disease status and provide an information state diagram to visualize the belief states. However, their idea was quite general: they did not formulate a POMDP model for a specific medical decision-making problem. Hu et al. [26] formulated an optimal drug infusion problem with uncertain pharmacokinetics as a POMDP. They use their POMDP to choose a drug infusion regimen that keeps the concentration of the drug in the patient's blood plasma at a predetermined level. The state space comprises finite intervals of the volume, clearance, and current drug concentration. However, they did not solve this POMDP model to optimality.

Instead, they defined some easy-to-implement drug infusion policies and examined and compared their performance in simulations. Hauskrecht et al. [5, 27] applied a POMDP formulation to the problem of treating patients with ischemic heart disease. This appears to be the first example of solving a real POMDP in the context of medical decision making. The core states are based on the health status of a patient, including death as an observable absorbing state. Observations of the message process are the test results (e.g., ischemia level, catheter coronary artery result, and stress test result) and the history of surgical procedures; actions include treatment actions (wait, medication, angioplasty, and coronary artery bypass graft surgery) and investigative actions (stress test and angiogram investigation) to collect information relevant to the core state of the patient. They acquired the parameters for transition probabilities and rewards from the medical literature or inferred them from data available at a particular medical center. Bounds on the optimal policies were obtained by proposed approximation algorithms. One of the proposed algorithms, the fast informed bound method, selects the best linear function for every observation and every current state separately. Another, the incremental linear function approach, gradually improves the convex and piecewise linear lower bound of a finite fixed-grid approximation method. Due to the complexity of the model, the gap between the upper and lower bounds was significant in some of their numerical experiments. Peek [28] formulated time-critical management problems in medicine as a POMDP model. The author discussed the clinical treatment of children with a ventricular septal defect as an example of a time-critical clinical management problem. Although the author provided detailed descriptions of the decision horizon, states, actions, transition probabilities, observations, and rewards, he neither provided values of the parameters nor solved the POMDP model for this problem.

Tusch [29] modeled the optimal therapy plan for liver transplantation using a POMDP. His goal was to find an optimal clinical management strategy based on a risk assessment of patients. The core states were risk and non-risk; the actions included therapeutic actions, such as surgery, and test actions. There were a total of 24 possible clinical tests, grouped into three scores, resulting in three observations in the restricted model. The problem was reduced to a three-decision-epoch constrained POMDP. The author used artificial neural networks to estimate the probabilities in the information matrix. By considering the POMDP as a classification procedure, the proposed constrained POMDP formulation was transformed into a non-linear optimization problem solved using robust partial classification methods such as artificial neural networks and linear discriminant analysis. Kreke [30] used a finite-horizon POMDP model to answer the question of when to test for cytokine levels (a predictor of sepsis patient survival) using potentially costly and inaccurate hospital testing procedures. The decision horizon was the time from a patient's admission to the hospital to discharge. The objective was to maximize the patient's expected survival time. Unobservable core states defined the patient's health status. Observations were the patient's measured cytokine levels, which are subject to error and may not reflect the patient's true cytokine level. Actions comprised discharging the patient from the hospital without testing, ordering a cytokine test, and keeping the patient in the hospital for one more decision epoch. The cost of ordering a cytokine test was converted into patient life days using cost-effectiveness analysis and then incorporated into the rewards. A finite fixed-grid method was employed to transform the POMDP into an MDP to solve. Although control-limit type policies were observed in empirical results, the author neither proved nor provided conditions for the existence of a control-limit type policy for the proposed POMDP model.

Sensitivity analysis was performed for test accuracy and cost. Fard et al. [31] investigated the comparative effectiveness of different treatments provided sequentially to patients suffering from depression using a POMDP, although the focus of the paper was to propose a method for estimating the bias and variance of the value function. The unobservable states are the levels of depression, and the observations are a numerical score called the quick inventory of depressive symptomatology, which roughly indicates the level of depression. They used this medical application as an example to evaluate the precision of the proposed method. They compared policies with different choices of medications and gave intervals for different policies. Goulionis et al. [32] employed POMDP models in Parkinson's disease treatment optimization. The core states are three levels of a patient's Parkinson's disease status. The observations are the characteristics obtained from clinical examinations. The actions are medical treatment, with incomplete monitoring, and surgical treatment. Their goal was to find the belief threshold for surgery that minimizes an objective combining QALYs and monetary values. They used the policy iteration algorithm from [13] to solve the resulting infinite-horizon stationary POMDP. Their core states were the Parkinson's disease degrees, and Bayesian updating was based on various diagnostic observations. Optimal average-cost policies for patients with Parkinson's disease with three deterioration levels were obtained based on clinical data from Athens, Greece. Ivy [33] formulated a POMDP model for breast cancer decisions and treatment. She considered both the third-party payer's and the patient's perspectives. The payer's perspective was to minimize the cost associated with monitoring and treating breast cancer, and the patient's perspective was to maximize the expected discounted total utility (QALYs).

In her model, cancer states were the partially observable core states, and actions were screening and treatment options. Results of clinical breast exams and mammograms are the observations. An algorithm that sequentially selects the policy for the constrained POMDP was used to obtain the optimal policy and construct trade-off curves between cost and utility. Maillart et al. [34] used a partially observable Markov chain to study breast cancer screening policies based on mammography. They evaluated age-dependent screening policies and studied the trade-off between the lifetime mortality risk of breast cancer and the expected number of mammograms. They generated the efficient frontier for the evaluated policies, measured by lifetime mortality risk and expected mammogram count, and demonstrated the robustness of the resulting frontier. Chhatwal et al. [35] studied a breast cancer biopsy optimization problem based on mammography observations. In their MDP model, they use a set of discretized probabilities of breast cancer as the states, estimating the probabilities with a mammography Bayesian network. They also proposed a POMDP model that assumes the mammography Bayesian network is not perfect. They conclude that their POMDP model does not perform as well as the MDP model because they lack good estimates of the core-state transition and observation probabilities.

2.4 Structural Properties of POMDPs

Some POMDPs have optimal policies that exhibit special structure. For instance, the existence of a control-limit type policy means there exist hyperplanes separating the decision space (the belief state space in a POMDP) into regions within which different actions are optimal.

To develop a rigorous theoretical description of these properties, we first provide definitions of some common terms used in the literature.

Definition 2.1. An n × n matrix A is totally positive of order k, denoted TP_k, if A_p is nonnegative for all p = 1, ..., k, where A_p is the pth compound matrix of A, defined as the C(n, p)-square matrix of the p × p minors of A.

Definition 2.2. An m × n matrix A has the increasing failure rate (IFR) property if and only if

   Σ_{j=k}^{n} A_{ij} ≤ Σ_{j=k}^{n} A_{i'j},  for all i < i' ∈ {1, ..., m} and k ∈ {1, ..., n}.

Definition 2.3. Stochastic dominance (first order): the mass function p is stochastically less than or equal to the mass function q, denoted p ≤_s q, if

   Σ_{k=m}^{N} p(k) ≤ Σ_{k=m}^{N} q(k),  for all m, 0 ≤ m ≤ N.

Definition 2.4. The mass function p is less than or equal to the mass function q in the sense of monotone likelihood ratio (MLR), denoted p ≤_r q, if q(k)/p(k) is a nondecreasing function of k (excluding k such that p(k) = q(k) = 0).

Definition 2.5. Blackwell ordering: let X and Y be standard Borel spaces. Given two transition probabilities P and Q from X to Y, we say that P is less informative than Q (P ≤_B Q) if there exists a transition probability K from Y to Y such that

   P(x; C) = ∫ Q(x; dy) K(y; C),  for all x ∈ X and all measurable C ⊆ Y.

Structural properties of POMDPs have been investigated for more than thirty years. Sondik and Smallwood [36, 37] showed that the optimal value function of a maximization problem is piecewise linear and convex in the belief state for any given decision horizon.
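The orderings above are straightforward to test numerically. The sketch below, with illustrative function names and under the stated simplifying assumptions, checks the IFR property (Definition 2.2) and the MLR ordering (Definition 2.4).

```python
import numpy as np

def is_ifr(A, tol=1e-12):
    """Definition 2.2: the tail sums sum_{j>=k} A[i, j] are nondecreasing
    in the row index i for every column k."""
    tails = np.cumsum(A[:, ::-1], axis=1)[:, ::-1]  # tails[i, k] = sum_{j>=k} A[i, j]
    return bool(np.all(np.diff(tails, axis=0) >= -tol))

def mlr_leq(p, q, tol=1e-12):
    """Definition 2.4: p <=_r q iff q(k)/p(k) is nondecreasing in k, skipping
    indices with p(k) = q(k) = 0. For simplicity this sketch assumes
    p(k) > 0 wherever q(k) > 0."""
    keep = ~((np.abs(p) < tol) & (np.abs(q) < tol))
    ratios = q[keep] / p[keep]
    return bool(np.all(np.diff(ratios) >= -tol))

# Hypothetical stochastic matrix whose rows shift mass rightward, so it is IFR.
A = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
assert is_ifr(A)
```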

This is the basis of many proposed POMDP algorithms. For infinite-horizon POMDPs, Sondik [13] showed that convexity still holds; however, piecewise linearity may not. Instead, the optimal value function can be approximated arbitrarily closely by a piecewise linear and convex function. Albright [38] gave conditions under which two-core-state, two-action POMDPs have monotone value functions and control-limit type policies. The author proposed two types of models: one obtains an observation after each state transition, and the other first has a state transition followed by an observation. Both models have similar structural properties. In order to derive monotonicity results, the author first showed that an m × n matrix is TP_2 if and only if the matrix has the IFR property and n = 2. The author also provided sufficient conditions for π_{t+1} to be isotone in π_t, a_t, and l_{t+1}, and established monotonicity of the value function given that the information matrix, Q, and the core state transition probability matrix, P, are both TP_2 and that there exists an ordering of states under which the reward function is nondecreasing. Monotonicity of the optimal policy at each decision epoch also requires the reward function to be superadditive or subadditive. White [39] provided conditions for the existence of optimal control-limit type policies for the special cases of completely observed and completely unobserved POMDPs, the two extreme cases of the general POMDP. There are two main contributions of this paper. First, it demonstrates that the sufficient conditions for the existence of monotone optimal control laws for general POMDPs are restrictive and difficult to verify; second, it emphasizes the potential usefulness of the two extreme cases in determining bounds on the optimal solution of a POMDP. Lovejoy [40] presents weaker conditions for monotonicity of policies for more general POMDPs.

The author showed that the optimal value function of a discrete-time, finite-core-state POMDP is monotone on the space of belief vectors ordered by likelihood ratios. He required the state probability (information) vectors to be MLR-ordered and used a machine replacement example to illustrate the conditions. Rieder [41] proposed conditions for monotonicity of the value function and optimal policy based on the TP_2 and Blackwell stochastic orderings. The author also proposed a more general POMDP formulation than those in [40] and other earlier research, showed how the value functions depend on the observations using Blackwell ordering, and presented conditions for a lower bound on the optimal policy. The results carry over from the finite horizon to the discounted infinite-horizon case. These results extend and complete the investigations of Albright [38], White [39], and Lovejoy [40], and they can be used to derive further structural properties of optimal policies for some special types of partially observed control models, such as Bayesian control models. Recently, Grosfeld-Nir [42] proved that dominance in expectation, which is weaker than stochastic dominance, suffices for the optimal policy to be of control-limit type in two-core-state, two-action problems. This can be regarded as an extension of Albright [38].

2.5 Computational Methods

Computational methods for POMDPs were first discussed in the 1970s [36]. Since then, many exact and approximation algorithms have been proposed and developed in the operations research, computer science, and artificial intelligence communities. In this section, we review exact and approximation algorithms. The reader is referred to [9] for a recent and detailed review of algorithmic methods for POMDPs.

2.5.1 Exact Algorithms

Sondik and Smallwood [36, 37] first proved that the finite-horizon POMDP value function is piecewise linear and convex at each decision epoch for a maximization problem. Each given sequence of actions and observations results in a specific vector (hyperplane) in the belief space, commonly referred to as an α-vector. The set of all vectors corresponding to all policies is called the α-vector set. The convex hull representing the optimal value function is constructed from the epigraph of the vectors of all possible policies. Since some vectors in the α-vector set are dominated, the epigraph can often be represented by a smaller subset of α-vectors, called the minimal α-vector set, or parsimonious representation of the value function. The first exact algorithm, the one-pass algorithm, was also proposed in [36, 37] and standardized later by Monahan [6]. To obtain the minimal α-vector set, the one-pass algorithm solves a linear program for every α-vector. However, the computational effort can be large, since the number of constraints in each linear program equals the total number of α-vectors. This shortcoming became the target of later algorithmic improvements that try to find the minimal α-vector set more efficiently [14, 43]. White [8] proposed a more efficient routine (also known as Lark's method, since it was originally proposed by J. W. Lark in a private communication with the author) to reduce the set of α-vectors to the minimal set. This routine generates the minimal α-vector set beginning with the null set; thus, the linear program used to identify dominance has fewer constraints than the linear program in the one-pass algorithm, which enumerates all α-vectors. More recently, Littman [14] proposed the witness algorithm, which improves on the algorithm provided by White [8].

The witness algorithm divides the problem into small subproblems according to the different actions in order to reduce the number of constraints in each linear program used to identify the minimal α-vector set. Each linear program finds a witness belief point at which another α-vector is found that dominates all other α-vectors in the current minimal set, and this vector is added to the minimal α-vector set of an action. Finally, the union of the minimal α-vector sets for the different actions is purged to the minimal α-vector set of the optimal value function. The author used lexicographic ordering to break ties and guarantee that the final α-vector set is of minimum size. Additionally, the author analyzed the performance of finite-horizon approximations to infinite-horizon POMDPs. Zhang [43] developed an algorithm called incremental pruning, which can be viewed as an extension of the witness algorithm. It does not search the regions of the entire state space; instead, it constructs each possible α-vector in the minimal set in an incremental fashion by taking advantage of the decomposable nested structure of the value function of a POMDP. In Cassandra et al. [44], this algorithm is specified as incrementally purging the α-vectors associated with different observations with respect to the α-vector set of a specific action. More specifically, the minimal α-vector set associated with a specific action can be decomposed into vector subsets according to the corresponding observations. The vector subsets are added one by one, and the dominated vectors are pruned each time a new subset is added; the algorithm is named after this pruning procedure. This algorithm is shown to be more efficient than previous exact algorithms, including the witness algorithm.
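To illustrate the dominance test at the heart of these pruning routines, the sketch below uses a linear program (via scipy) to search for a witness belief at which a candidate α-vector strictly beats every other vector, keeping only vectors that are maximal somewhere on the belief simplex. Names are illustrative, ties are ignored for simplicity (exact algorithms break them lexicographically, as noted above), and this is a sketch of the idea rather than any specific published implementation.

```python
import numpy as np
from scipy.optimize import linprog

def witness_point(alpha, others, n):
    """Maximize delta s.t. pi . alpha >= pi . alpha' + delta for all alpha'
    in `others`, over beliefs pi on the simplex. Returns (pi, delta); alpha
    is dominated if delta <= 0. Variables: x = (pi_1, ..., pi_n, delta)."""
    if not others:
        return np.full(n, 1.0 / n), np.inf
    c = np.zeros(n + 1)
    c[-1] = -1.0                                   # linprog minimizes, so maximize delta
    # pi.(alpha - alpha') >= delta  <=>  -pi.(alpha - alpha') + delta <= 0
    A_ub = np.array([np.append(-(alpha - o), 1.0) for o in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # pi sums to one
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]     # delta is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], -res.fun

def prune(vectors, eps=1e-9):
    """Keep only alpha-vectors that strictly dominate at some belief."""
    kept = []
    for i, alpha in enumerate(vectors):
        rest = [np.asarray(v) for j, v in enumerate(vectors) if j != i]
        _, delta = witness_point(np.asarray(alpha), rest, len(alpha))
        if delta > eps:
            kept.append(alpha)
    return kept
```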

2.5.2 Approximation Algorithms

Approximation algorithms are often necessary to obtain good (hopefully near-optimal) solutions to large-scale POMDPs. Approximation algorithms for POMDPs have been developed for decades, and many algorithms have been proposed. In this section, we review some of the more common approximation algorithms; a more thorough review can be found in [7]. The most intuitive approximation algorithm for a POMDP is to discretize the continuous belief state and solve the result as an MDP. Eckles [45] was the first to use this idea to solve POMDP problems. Continuous belief states are discretized into finite fixed grids. Approximate optimal value functions are computed at the belief points on the grid at each decision epoch t from the approximate optimal values of all possible posterior belief points at epoch t + 1. Linear interpolation is used to approximate the value function at belief points between adjacent grid points. This method is known as the fixed-grid method. Lovejoy [7] provides a detailed review of these methods, including bounds on the approximation error, and Kaelbling [9] discusses extensions such as nonlinear interpolation and related approximations. Finite-memory approximation uses information from a fixed future horizon from the current decision epoch to approximate the objective function. The method was introduced by Sondik [36, 13] to approximate infinite-horizon POMDPs. Platzman [46] proposed a finite-memory approximation using a finite window of the most recent actions and observations, and generalized the finite-memory idea to finite memory states, which can be aggregations of recent observations and actions. A memory state transition occurs when there is a new action or observation. Platzman also presents methods to bound the approximation of the optimal value, and random-
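The fixed-grid idea described above can be sketched in a few lines for a POMDP with a small number of core states: evaluate the value function only at grid beliefs and interpolate everywhere else. The sketch below uses an illustrative resolution and a crude nearest-grid-point lookup standing in for the linear interpolation used by actual fixed-grid methods; it is meant only to convey the approximation, and the values at grid points would be computed with a backup like the one sketched after the equations of Section 2.2.

```python
import numpy as np
from itertools import product

def belief_grid(n_states, resolution):
    """All beliefs whose entries are multiples of 1/resolution and sum to 1."""
    pts = [np.array(k, dtype=float) / resolution
           for k in product(range(resolution + 1), repeat=n_states)
           if sum(k) == resolution]
    return np.array(pts)

def interpolate(pi, grid, values):
    """Nearest-grid-point stand-in for the linear interpolation between
    adjacent grid points used by true fixed-grid methods."""
    i = int(np.argmin(np.linalg.norm(grid - pi, axis=1)))
    return values[i]

grid = belief_grid(3, 10)  # 66 grid beliefs on the 2-simplex
```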


Jennifer E. Mason   (803) Jennifer E. Mason http://people.engr.ncsu.edu/jemason2/ jemason2@ncsu.edu, (803) 608-0727 Research Interests Stochastic dynamic programming and stochastic models Applications in health care delivery and

More information

Bayes Linear Statistics. Theory and Methods

Bayes Linear Statistics. Theory and Methods Bayes Linear Statistics Theory and Methods Michael Goldstein and David Wooff Durham University, UK BICENTENNI AL BICENTENNIAL Contents r Preface xvii 1 The Bayes linear approach 1 1.1 Combining beliefs

More information

Solutions for Chapter 2 Intelligent Agents

Solutions for Chapter 2 Intelligent Agents Solutions for Chapter 2 Intelligent Agents 2.1 This question tests the student s understanding of environments, rational actions, and performance measures. Any sequential environment in which rewards may

More information

Chapter 02. Basic Research Methodology

Chapter 02. Basic Research Methodology Chapter 02 Basic Research Methodology Definition RESEARCH Research is a quest for knowledge through diligent search or investigation or experimentation aimed at the discovery and interpretation of new

More information

Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data. Technical Report

Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data. Technical Report Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street

More information

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Comparison of Collaborative Filtering Methods for Medication Reconciliation A Comparison of Collaborative Filtering Methods for Medication Reconciliation Huanian Zheng, Rema Padman, Daniel B. Neill The H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, 15213,

More information

Gene Selection for Tumor Classification Using Microarray Gene Expression Data

Gene Selection for Tumor Classification Using Microarray Gene Expression Data Gene Selection for Tumor Classification Using Microarray Gene Expression Data K. Yendrapalli, R. Basnet, S. Mukkamala, A. H. Sung Department of Computer Science New Mexico Institute of Mining and Technology

More information

A Framework for Sequential Planning in Multi-Agent Settings

A Framework for Sequential Planning in Multi-Agent Settings A Framework for Sequential Planning in Multi-Agent Settings Piotr J. Gmytrasiewicz and Prashant Doshi Department of Computer Science University of Illinois at Chicago piotr,pdoshi@cs.uic.edu Abstract This

More information

Data-Driven Management of Post-Transplant Medications: An APOMDP Approach

Data-Driven Management of Post-Transplant Medications: An APOMDP Approach Data-Driven Management of Post-Transplant Medications: An APOMDP Approach Alireza Boloori Industrial Engineering, School of Computing, Informatics and Decision Systems Engineering, Arizona State University,

More information

Introduction to Bayesian Analysis 1

Introduction to Bayesian Analysis 1 Biostats VHM 801/802 Courses Fall 2005, Atlantic Veterinary College, PEI Henrik Stryhn Introduction to Bayesian Analysis 1 Little known outside the statistical science, there exist two different approaches

More information

Mammogram Analysis: Tumor Classification

Mammogram Analysis: Tumor Classification Mammogram Analysis: Tumor Classification Term Project Report Geethapriya Raghavan geeragh@mail.utexas.edu EE 381K - Multidimensional Digital Signal Processing Spring 2005 Abstract Breast cancer is the

More information

Folland et al Chapter 4

Folland et al Chapter 4 Folland et al Chapter 4 Chris Auld Economics 317 January 11, 2011 Chapter 2. We won t discuss, but you should already know: PPF. Supply and demand. Theory of the consumer (indifference curves etc) Theory

More information

Data-Driven Management of Post-Transplant Medications: An APOMDP Approach

Data-Driven Management of Post-Transplant Medications: An APOMDP Approach Data-Driven Management of Post-Transplant Medications: An APOMDP Approach Alireza Boloori Industrial Engineering, School of Computing, Informatics and Decision Systems Engineering, Arizona State University,

More information

Identification of Tissue Independent Cancer Driver Genes

Identification of Tissue Independent Cancer Driver Genes Identification of Tissue Independent Cancer Driver Genes Alexandros Manolakos, Idoia Ochoa, Kartik Venkat Supervisor: Olivier Gevaert Abstract Identification of genomic patterns in tumors is an important

More information

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections

Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections Review: Logistic regression, Gaussian naïve Bayes, linear regression, and their connections New: Bias-variance decomposition, biasvariance tradeoff, overfitting, regularization, and feature selection Yi

More information

Predicting Breast Cancer Survivability Rates

Predicting Breast Cancer Survivability Rates Predicting Breast Cancer Survivability Rates For data collected from Saudi Arabia Registries Ghofran Othoum 1 and Wadee Al-Halabi 2 1 Computer Science, Effat University, Jeddah, Saudi Arabia 2 Computer

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write

More information

Unit 1 Exploring and Understanding Data

Unit 1 Exploring and Understanding Data Unit 1 Exploring and Understanding Data Area Principle Bar Chart Boxplot Conditional Distribution Dotplot Empirical Rule Five Number Summary Frequency Distribution Frequency Polygon Histogram Interquartile

More information

A Brief Introduction to Bayesian Statistics

A Brief Introduction to Bayesian Statistics A Brief Introduction to Statistics David Kaplan Department of Educational Psychology Methods for Social Policy Research and, Washington, DC 2017 1 / 37 The Reverend Thomas Bayes, 1701 1761 2 / 37 Pierre-Simon

More information

Computerized Mastery Testing

Computerized Mastery Testing Computerized Mastery Testing With Nonequivalent Testlets Kathleen Sheehan and Charles Lewis Educational Testing Service A procedure for determining the effect of testlet nonequivalence on the operating

More information

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm

Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Journal of Social and Development Sciences Vol. 4, No. 4, pp. 93-97, Apr 203 (ISSN 222-52) Bayesian Logistic Regression Modelling via Markov Chain Monte Carlo Algorithm Henry De-Graft Acquah University

More information

Decision Analysis. John M. Inadomi. Decision trees. Background. Key points Decision analysis is used to compare competing

Decision Analysis. John M. Inadomi. Decision trees. Background. Key points Decision analysis is used to compare competing 5 Decision Analysis John M. Inadomi Key points Decision analysis is used to compare competing strategies of management under conditions of uncertainty. Various methods may be employed to construct a decision

More information

References. Christos A. Ioannou 2/37

References. Christos A. Ioannou 2/37 Prospect Theory References Tversky, A., and D. Kahneman: Judgement under Uncertainty: Heuristics and Biases, Science, 185 (1974), 1124-1131. Tversky, A., and D. Kahneman: Prospect Theory: An Analysis of

More information

Estimating the number of components with defects post-release that showed no defects in testing

Estimating the number of components with defects post-release that showed no defects in testing SOFTWARE TESTING, VERIFICATION AND RELIABILITY Softw. Test. Verif. Reliab. 2002; 12:93 122 (DOI: 10.1002/stvr.235) Estimating the number of components with defects post-release that showed no defects in

More information

POND-Hindsight: Applying Hindsight Optimization to POMDPs

POND-Hindsight: Applying Hindsight Optimization to POMDPs POND-Hindsight: Applying Hindsight Optimization to POMDPs Alan Olsen and Daniel Bryce alan@olsen.org, daniel.bryce@usu.edu Utah State University Logan, UT Abstract We present the POND-Hindsight entry in

More information

Chapter 17 Sensitivity Analysis and Model Validation

Chapter 17 Sensitivity Analysis and Model Validation Chapter 17 Sensitivity Analysis and Model Validation Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski and Dominic C. Marshall Learning Objectives Appreciate that all models possess inherent limitations

More information

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES

NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES NEW METHODS FOR SENSITIVITY TESTS OF EXPLOSIVE DEVICES Amit Teller 1, David M. Steinberg 2, Lina Teper 1, Rotem Rozenblum 2, Liran Mendel 2, and Mordechai Jaeger 2 1 RAFAEL, POB 2250, Haifa, 3102102, Israel

More information

Adversarial Decision-Making

Adversarial Decision-Making Adversarial Decision-Making Brian J. Stankiewicz University of Texas, Austin Department Of Psychology & Center for Perceptual Systems & Consortium for Cognition and Computation February 7, 2006 Collaborators

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Michèle Sebag ; TP : Herilalaina Rakotoarison TAO, CNRS INRIA Université Paris-Sud Nov. 9h, 28 Credit for slides: Richard Sutton, Freek Stulp, Olivier Pietquin / 44 Introduction

More information

Exploring Experiential Learning: Simulations and Experiential Exercises, Volume 5, 1978 THE USE OF PROGRAM BAYAUD IN THE TEACHING OF AUDIT SAMPLING

Exploring Experiential Learning: Simulations and Experiential Exercises, Volume 5, 1978 THE USE OF PROGRAM BAYAUD IN THE TEACHING OF AUDIT SAMPLING THE USE OF PROGRAM BAYAUD IN THE TEACHING OF AUDIT SAMPLING James W. Gentry, Kansas State University Mary H. Bonczkowski, Kansas State University Charles W. Caldwell, Kansas State University INTRODUCTION

More information

Technical Specifications

Technical Specifications Technical Specifications In order to provide summary information across a set of exercises, all tests must employ some form of scoring models. The most familiar of these scoring models is the one typically

More information

Seminar Thesis: Efficient Planning under Uncertainty with Macro-actions

Seminar Thesis: Efficient Planning under Uncertainty with Macro-actions Seminar Thesis: Efficient Planning under Uncertainty with Macro-actions Ragnar Mogk Department of Computer Science Technische Universität Darmstadt ragnar.mogk@stud.tu-darmstadt.de 1 Introduction This

More information

Electronic Health Record Analytics: The Case of Optimal Diabetes Screening

Electronic Health Record Analytics: The Case of Optimal Diabetes Screening Electronic Health Record Analytics: The Case of Optimal Diabetes Screening Michael Hahsler 1, Farzad Kamalzadeh 1 Vishal Ahuja 1, and Michael Bowen 2 1 Southern Methodist University 2 UT Southwestern Medical

More information

Probabilistic Graphical Models: Applications in Biomedicine

Probabilistic Graphical Models: Applications in Biomedicine Probabilistic Graphical Models: Applications in Biomedicine L. Enrique Sucar, INAOE Puebla, México May 2012 What do you see? What we see depends on our previous knowledge (model) of the world and the information

More information

Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes

Using Eligibility Traces to Find the Best Memoryless Policy in Partially Observable Markov Decision Processes Using Eligibility Traces to Find the est Memoryless Policy in Partially Observable Markov Decision Processes John Loch Department of Computer Science University of Colorado oulder, CO 80309-0430 loch@cs.colorado.edu

More information

Using AUC and Accuracy in Evaluating Learning Algorithms

Using AUC and Accuracy in Evaluating Learning Algorithms 1 Using AUC and Accuracy in Evaluating Learning Algorithms Jin Huang Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 fjhuang, clingg@csd.uwo.ca

More information

Data-Driven Management of Post- Transplant Medications: An APOMDP Approach Faculty Research Working Paper Series

Data-Driven Management of Post- Transplant Medications: An APOMDP Approach Faculty Research Working Paper Series Data-Driven Management of Post- Transplant Medications: An APOMDP Approach Faculty Research Working Paper Series Alireza Boloori Arizona State University Soroush Saghafian Harvard Kennedy School Harini

More information

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n.

Citation for published version (APA): Ebbes, P. (2004). Latent instrumental variables: a new approach to solve for endogeneity s.n. University of Groningen Latent instrumental variables Ebbes, P. IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document

More information

Macroeconometric Analysis. Chapter 1. Introduction

Macroeconometric Analysis. Chapter 1. Introduction Macroeconometric Analysis Chapter 1. Introduction Chetan Dave David N. DeJong 1 Background The seminal contribution of Kydland and Prescott (1982) marked the crest of a sea change in the way macroeconomists

More information

Cost-effectiveness ratios are commonly used to

Cost-effectiveness ratios are commonly used to ... HEALTH ECONOMICS... Application of Cost-Effectiveness Analysis to Multiple Products: A Practical Guide Mohan V. Bala, PhD; and Gary A. Zarkin, PhD The appropriate interpretation of cost-effectiveness

More information

Strategic Level Proton Therapy Patient Admission Planning: A Markov Decision Process Modeling Approach

Strategic Level Proton Therapy Patient Admission Planning: A Markov Decision Process Modeling Approach University of New Haven Digital Commons @ New Haven Mechanical and Industrial Engineering Faculty Publications Mechanical and Industrial Engineering 6-2017 Strategic Level Proton Therapy Patient Admission

More information

Representation and Analysis of Medical Decision Problems with Influence. Diagrams

Representation and Analysis of Medical Decision Problems with Influence. Diagrams Representation and Analysis of Medical Decision Problems with Influence Diagrams Douglas K. Owens, M.D., M.Sc., VA Palo Alto Health Care System, Palo Alto, California, Section on Medical Informatics, Department

More information

Personalized Decision Modeling for Intervention and Prevention of Cancers

Personalized Decision Modeling for Intervention and Prevention of Cancers University of Arkansas, Fayetteville ScholarWorks@UARK Theses and Dissertations 8-2017 Personalized Decision Modeling for Intervention and Prevention of Cancers Fan Wang University of Arkansas, Fayetteville

More information

Probability II. Patrick Breheny. February 15. Advanced rules Summary

Probability II. Patrick Breheny. February 15. Advanced rules Summary Probability II Patrick Breheny February 15 Patrick Breheny University of Iowa Introduction to Biostatistics (BIOS 4120) 1 / 26 A rule related to the addition rule is called the law of total probability,

More information

Estimating and comparing cancer progression risks under varying surveillance protocols: moving beyond the Tower of Babel

Estimating and comparing cancer progression risks under varying surveillance protocols: moving beyond the Tower of Babel Estimating and comparing cancer progression risks under varying surveillance protocols: moving beyond the Tower of Babel Jane Lange March 22, 2017 1 Acknowledgements Many thanks to the multiple project

More information

Bayesian Reinforcement Learning

Bayesian Reinforcement Learning Bayesian Reinforcement Learning Rowan McAllister and Karolina Dziugaite MLG RCC 21 March 2013 Rowan McAllister and Karolina Dziugaite (MLG RCC) Bayesian Reinforcement Learning 21 March 2013 1 / 34 Outline

More information

A Belief-Based Account of Decision under Uncertainty. Craig R. Fox, Amos Tversky

A Belief-Based Account of Decision under Uncertainty. Craig R. Fox, Amos Tversky A Belief-Based Account of Decision under Uncertainty Craig R. Fox, Amos Tversky Outline Problem Definition Decision under Uncertainty (classical Theory) Two-Stage Model Probability Judgment and Support

More information

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15) ON THE COMPARISON OF BAYESIAN INFORMATION CRITERION AND DRAPER S INFORMATION CRITERION IN SELECTION OF AN ASYMMETRIC PRICE RELATIONSHIP: BOOTSTRAP SIMULATION RESULTS Henry de-graft Acquah, Senior Lecturer

More information

Lecture Outline Biost 517 Applied Biostatistics I

Lecture Outline Biost 517 Applied Biostatistics I Lecture Outline Biost 517 Applied Biostatistics I Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 2: Statistical Classification of Scientific Questions Types of

More information

Irrationality in Game Theory

Irrationality in Game Theory Irrationality in Game Theory Yamin Htun Dec 9, 2005 Abstract The concepts in game theory have been evolving in such a way that existing theories are recasted to apply to problems that previously appeared

More information

Fuzzy Decision Tree FID

Fuzzy Decision Tree FID Fuzzy Decision Tree FID Cezary Z. Janikow Krzysztof Kawa Math & Computer Science Department Math & Computer Science Department University of Missouri St. Louis University of Missouri St. Louis St. Louis,

More information

Appendix I Teaching outcomes of the degree programme (art. 1.3)

Appendix I Teaching outcomes of the degree programme (art. 1.3) Appendix I Teaching outcomes of the degree programme (art. 1.3) The Master graduate in Computing Science is fully acquainted with the basic terms and techniques used in Computing Science, and is familiar

More information

Interaction as an emergent property of a Partially Observable Markov Decision Process

Interaction as an emergent property of a Partially Observable Markov Decision Process Interaction as an emergent property of a Partially Observable Markov Decision Process Andrew Howes, Xiuli Chen, Aditya Acharya School of Computer Science, University of Birmingham Richard L. Lewis Department

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 1, Jan Feb 2017 RESEARCH ARTICLE Classification of Cancer Dataset in Data Mining Algorithms Using R Tool P.Dhivyapriya [1], Dr.S.Sivakumar [2] Research Scholar [1], Assistant professor [2] Department of Computer Science

More information

Determining the optimal stockpile level for combination vaccines

Determining the optimal stockpile level for combination vaccines Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 10-12-2017 Determining the optimal stockpile level for combination vaccines Sheetal Aher ssa8811@rit.edu Follow

More information

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012

STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION XIN SUN. PhD, Kansas State University, 2012 STATISTICAL METHODS FOR DIAGNOSTIC TESTING: AN ILLUSTRATION USING A NEW METHOD FOR CANCER DETECTION by XIN SUN PhD, Kansas State University, 2012 A THESIS Submitted in partial fulfillment of the requirements

More information

Support system for breast cancer treatment

Support system for breast cancer treatment Support system for breast cancer treatment SNEZANA ADZEMOVIC Civil Hospital of Cacak, Cara Lazara bb, 32000 Cacak, SERBIA Abstract:-The aim of this paper is to seek out optimal relation between diagnostic

More information

Inference Methods for First Few Hundred Studies

Inference Methods for First Few Hundred Studies Inference Methods for First Few Hundred Studies James Nicholas Walker Thesis submitted for the degree of Master of Philosophy in Applied Mathematics and Statistics at The University of Adelaide (Faculty

More information

Cost-utility of initial medical management for Crohn's disease perianal fistulae Arseneau K O, Cohn S M, Cominelli F, Connors A F

Cost-utility of initial medical management for Crohn's disease perianal fistulae Arseneau K O, Cohn S M, Cominelli F, Connors A F Cost-utility of initial medical management for Crohn's disease perianal fistulae Arseneau K O, Cohn S M, Cominelli F, Connors A F Record Status This is a critical abstract of an economic evaluation that

More information

Generating Reward Functions using IRL Towards Individualized Cancer Screening

Generating Reward Functions using IRL Towards Individualized Cancer Screening Generating Reward Functions using IRL Towards Individualized Cancer Screening Panayiotis Petousis 1[0000 0002 0696 608X], Simon X. Han 1[0000 0002 1001 4727], William Hsu 1,2[0000 0002 5168 070X], and

More information

An Empirical and Formal Analysis of Decision Trees for Ranking

An Empirical and Formal Analysis of Decision Trees for Ranking An Empirical and Formal Analysis of Decision Trees for Ranking Eyke Hüllermeier Department of Mathematics and Computer Science Marburg University 35032 Marburg, Germany eyke@mathematik.uni-marburg.de Stijn

More information

Summary HTA. HTA-Report Summary

Summary HTA. HTA-Report Summary Summary HTA HTA-Report Summary Prognostic value, clinical effectiveness and cost-effectiveness of high sensitivity C-reactive protein as a marker in primary prevention of major cardiac events Schnell-Inderst

More information

A Game Theoretical Approach for Hospital Stockpile in Preparation for Pandemics

A Game Theoretical Approach for Hospital Stockpile in Preparation for Pandemics Proceedings of the 2008 Industrial Engineering Research Conference J. Fowler and S. Mason, eds. A Game Theoretical Approach for Hospital Stockpile in Preparation for Pandemics Po-Ching DeLaurentis School

More information

Challenges in Developing Learning Algorithms to Personalize mhealth Treatments

Challenges in Developing Learning Algorithms to Personalize mhealth Treatments Challenges in Developing Learning Algorithms to Personalize mhealth Treatments JOOLHEALTH Bar-Fit Susan A Murphy 01.16.18 HeartSteps SARA Sense 2 Stop Continually Learning Mobile Health Intervention 1)

More information

BayesOpt: Extensions and applications

BayesOpt: Extensions and applications BayesOpt: Extensions and applications Javier González Masterclass, 7-February, 2107 @Lancaster University Agenda of the day 9:00-11:00, Introduction to Bayesian Optimization: What is BayesOpt and why it

More information

SUPPLEMENTAL MATERIAL

SUPPLEMENTAL MATERIAL 1 SUPPLEMENTAL MATERIAL Response time and signal detection time distributions SM Fig. 1. Correct response time (thick solid green curve) and error response time densities (dashed red curve), averaged across

More information