Statistical methods for knowledge discovery in adverse drug reaction surveillance

Size: px

Start display at page:

Download "Statistical methods for knowledge discovery in adverse drug reaction surveillance"

Antony Bridges
6 years ago
Views:

1 Statistical methods for knowledge discovery in adverse drug reaction surveillance

3 Statistical methods for knowledge discovery in adverse drug reaction surveillance G. Niklas Norén Stockholm University

4 c G. Niklas Norén, Stockholm 2007 Cover photography by G. Niklas Norén ISBN pp Typeset by L A T E X Printed in Sweden by Universitetsservice AB, Stockholm 2007 Distributor: Department of Mathematics, Stockholm University

5 Abstract Collections of individual case safety reports are the main resource for early discovery of unknown adverse reactions to drugs once they have been introduced to the general public. The data sets involved are complex and based on voluntary submission of reports, but contain pieces of very important information. The aim of this thesis is to propose computationally feasible statistical methods for large-scale knowledge discovery in these data sets. The main contributions are a duplicate detection method that can reliably identify pairs of unexpectedly similar reports and a new measure for highlighting suspected drug drug interaction. Specifically, we extend the hit-miss model for database record matching with a hit-miss mixture model for scoring numerical record fields and a new method to compensate for strong record field correlations. The extended hit-miss model is implemented for the WHO database and demonstrated to be useful in real world duplicate detection, despite the noisy and incomplete information on individual case safety reports. The Information Component measure of disproportionality has been in routine use since 1998 to screen the WHO database for excessive adverse drug reaction reporting rates. Here, it is further refined. We introduce improved credibility intervals for rare events, post-stratification adjustment for suspected confounders and an extension to higher order associations that allows for simple but robust screening for potential risk factors. A new approach to identifying reporting patterns indicative of drug drug interaction is also proposed. Finally, we describe how imprecision estimates specific to each prediction of a Bayes classifier may be obtained with the Bayesian bootstrap. Such case-based imprecision estimates allow for better prediction when different types of errors have different associated loss, with a possible application in combining quantitative and clinical filters to highlight drug ADR pairs for clinical review.

7 List of Papers This thesis is based on the following original publications, which are referred to in the text by their Roman numerals. I II Norén, G. N., Orre, R., Bate, A., Edwards, I. R. (2007). Duplicate detection in adverse drug reaction surveillance. Data Mining and Knowledge Discovery. Published on-line. Norén, G. N., Bate, A., Orre, R., Edwards, I. R. (2006). Extending the methods used to screen the WHO drug safety database towards analysis of complex associations and improved accuracy for rare events. Statistics in Medicine, 25(21): III Hopstadius, J, Norén, G. N., Bate, A., Edwards, I. R. (2007). Adjustment for potential confounders in adverse drug reaction surveillance. Submitted for publication. IV V Norén, G. N., Sundberg, R., Bate, A., Edwards, I. R. (2007). A statistical methodology for drug drug interaction surveillance. Submitted for publication. Norén, G. N., Orre, R. (2005). Case based imprecision estimates for Bayes classifiers with the Bayesian bootstrap. Machine Learning, 58(1): Reprints of I, II and V were made with kind permission from the publishers.

9 Contents Part I: Thesis summary 1 Introduction Aim Adverse drug reaction surveillance Individual case safety reports The WHO database Adverse drug reaction signal detection Knowledge discovery in adverse drug reaction surveillance Context Process Disproportionality Shrinkage Pattern discovery and detection Facilitating interpretation Future directions Overview of the papers Paper I Paper II Paper III Paper IV Paper V Acknowledgements Bibliography

11 Part I: Thesis summary

13 1. Introduction It is in the nature of pharmaceutical development that the full safety profile of a new medicinal product will not be known at the time it is introduced to the general public. Because randomised clinical trials are limited in both the types and numbers of patients exposed, continued safety monitoring of drugs is in the interest of patients, regulatory authorities and pharmaceutical companies (Finney 1966, Evans 2000). Individual case safety reports are submitted by health professionals based on suspected adverse drug reaction (ADR) incidents (Edwards and Aronson 2000) observed in real world clinical practice. They remain one of the best resources for early post-marketing discovery of potential public health or patient safety issues. They are rich sources of information, but anecdotal in nature. The reliance on voluntary submission, the variation in quality of information and the large number of new reports submitted to national and international organisations every year provide a range of interesting statistical challenges. 1.1 Aim The overall aim of this thesis is to propose improved statistical methods for knowledge discovery in collections of individual case safety reports. I proposes a new method for automated duplicate detection based on the hit-miss model introduced for statistical record linkage (matching records across data sets) by Copas and Hilton (1990). An extended hit-miss model that handles numerical record fields and compensates for correlations between record fields is implemented for the WHO database and demonstrated to be useful in real world duplicate detection. II proposes improved credibility intervals, a poststratification approach to adjustment for confounding variables and an extension to higher order associations for the Information Component (IC) measure of disproportionality used to screen the WHO database for excessive ADR relative reporting rates. III demonstrates that the post-stratification adjustment of the observed-to-expected ratio for suspected confounders adopted for the IC in II may lead to spurious underestimation in the presence of any very small strata in a stratified data set. A comparison to a literature reference indicates that while routine adjustment for some potential confounders in first pass screening of collections of individual case safety reports does improve 1

14 performance, the magnitude of this improvement is modest compared to the improvement from a triage (prioritisation) criterion requiring reports from at least two countries before a drug ADR pair is highlighted for clinical review. This suggests that confounding may have less impact on the analysis of individual case safety reports than previously believed. IV introduces a new measure of drug drug interaction for collections of individual case safety reports. Unlike methods proposed previously for this purpose, it defines interaction as departure from a baseline model with independent attributable risk. V introduces a Bayesian bootstrap method for estimating the uncertainty in Bayes classification associated with each individual prediction. We demonstrate how this information can be used to improve performance, when different types of errors have different associated loss, with a possible application in selecting drug ADR pairs for detailed clinical review. 2

15 2. Adverse drug reaction surveillance The analysis of individual case safety reports is the cornerstone of early postmarketing ADR detection (Rawlins 1988). Whereas large, formal drug safety studies are useful to test specific hypotheses related to drug safety, they are not suitable for continuous monitoring with the aim of detecting previously unsuspected ADRs, as early as possible. In the context of this PhD thesis, ADR surveillance refers exclusively to drug safety monitoring based on individual case safety reports. It thus excludes other post-marketing efforts such as the intensive monitoring programs of New Zealand and the United Kingdom, as well as safety monitoring based on health registries and hospital-based safety monitoring. For comprehensive overviews of post-marketing ADR surveillance, see Lindquist (2003) and Bate (2003). 2.1 Individual case safety reports Individual case safety reports communicate genuine clinical concerns from observant health professionals (Edwards 1999). As they are based on actual patients in real world clinical practice, their collection and analysis increase the chance to discover ADRs that are due to drug drug interaction, affect patients with certain medical predispositions or that belong to patient subgroups that tend to be excluded from pre-marketing clinical trials, such as children or pregnant women. In addition, the large numbers of patients exposed and the unlimited follow-up time available considerably increase the chance to detect ADRs that are rare or that occur only after extended periods of use. An example of an authentic individual case safety report is provided in Figure 2.1. Much of the information on these reports can be originally provided as free text, some of which is later encoded as structured information upon database entry. This is usually done by trained personnel at pharmaceutical companies or at national authorities. The encoding of observed ADR incidents in terms of standardised terminology is a critical part of the preprocessing. One potential pitfall is the risk of misinterpretation when the ADR encoding is performed by someone who has never actually met the patient. Variation in coding across regions and time periods may lead to systematic differences that can affect subsequent data analysis. A general problem is that several ADR terms are often applicable to a given incident. Thus, exploratory analysis 3

16 Figure 2.1: Sample individual case safety report. Reprinted with kind permission of the Adverse Drug Reactions Unit at the Therapeutic Goods Administration of Australia 4

17 focusing on single ADR terms may fail to include all relevant reports a phenomenon which has been referred to as signal fragmentation (Purcell 2003). While in the follow-up of specific issues, this can be remedied by specifying groups of relevant ADR terms for the issue of interest, it is not obvious how such strategies can be easily automated for routine exploratory analysis. Individual case safety reports refer to suspected ADR incidents and some adverse events observed in association with drug prescription will in reality be coincidental, due to concomitant medication or natural progression of the underlying disease. At the same time, not all ADR incidents that actually occur are identified as such and eventually reported to the national drug safety centres. The degree of under-reporting is unknown but can be expected to vary with the severity of the suspected ADR, across geographical regions and time periods. There may also be variation in the propensity to report suspected ADRs during the life-span of a drug and in response to any attention to suspected drug safety issues in the public or scientific media. The categories of health professionals who are allowed to submit reports also differ over time and between regions. Some countries allow only medical doctors to submit reports, whereas others accept reports from medical nurses and pharmacists as well. In addition, some countries encourage direct consumer reporting. Unsurprisingly, the propensity to report suspected ADRs of different types varies considerably between different categories of reporters (Savage 1985). An important characteristic of individual case safety report submission is that separate reports sometimes have a common origin and therefore cannot be considered as independent pieces of information (Finney 1973). This may distort automated knowledge discovery and mislead clinical review. The most obvious cause of non-independent reports is report duplication, where a single suspected ADR incident results in several reports. This phenomenon is discussed at some length in I. More subtle examples include groups of reports provided by the same health professional, such as those from the Norwegian dentist discovered in I, reports from the same clinical study (sometimes mislabelled as spontaneous reports) or separate reports for the same patient at different points in time. If single individuals are responsible for encoding large numbers of reports, this may also induce superficial similarity between reports. A potential example of this is the group of over 600 very similar reports originally collected by a single law firm, discovered in IV. Violated independence assumptions differ from other data quality issues in that they do not relate to the quality of single reports, but to the quality of collections of reports. Even upon the confirmation that a pair of reports are indeed duplicates it is not obvious how to proceed: should the suspected duplicates be flagged or should one of them perhaps be removed from the data set (if so, which one)? 5

18 Drugs Anatomical Class Therapeutic Selective serotonin reuptake inhibitors (SSRI) Antiinfl. prep. nonsteroids for topical use (NSAID) Platelet aggregation inhibitors excl. heparin ADRs Reports System Organ Class Reports 193,939 Body as a whole - general disorders 180,770 Skin and appendages disorders 179,226 Gastro-intestinal system disorders ACE inhibitors, plain 171,706 Central & peripheral nervous system disorders deriva- Benzodiazepine tives 1,218,425 1,070, , , ,898 Psychiatric disorders 677,227 Table 2.1: The most commonly reported groups of drugs and ADRs in the WHO database (note that each report may list more than one drug and more than one ADR). 2.2 The WHO database The Uppsala Monitoring Centre maintains and analyses the world s largest collection of individual case safety reports. As of December 2006, the WHO database contained over 3.8 million reports, with a current yearly growth of over 200,000 reports (see Figure 2.2). The database is held on behalf of the countries participating in the WHO Programme for International Drug Monitoring, whose number has continued to grow from the founding 10 countries in 1968 to over 80 member countries at the end of The international coverage allows rare but important public health or patient safety issues to be detected earlier after drug launch than if based on isolated analysis of national data sets (Olsson 1998). Variation between countries in the range of available drug substances, populations at risk, reporting culture and regulation may influence relative reporting rates and make knowledge discovery more complicated in international data sets. At the same time, this diversity is an invaluable asset in detecting public health or patient safety issues related to for example ethnic or dietary ADR risk factors. Thus, even though most reports in the WHO database come from the USA and other industrialised nations, the worldwide coverage of the WHO programme is perhaps its greatest strength. As is clear from Figure 2.2, the vast majority of reports in the WHO database are so-called spontaneous reports that refer to observations in regular clinical practice. However, a small minority are from intensive monitoring programs or clinical studies. Such atypical reports should in principle be labelled as 6

19 Number of reports 4,000,000 3,000,000 2,000,000 1,000, Year a. Database growth United States United Kingdom Germany Canada France Australia Other countries 20% 5% 5% Spontaneous reports Other 5% 47% 12% 6% 5% 95% b. Biggest contributors c. Types of reports 4,000,000 2,000,000 Number of reports 3,000,000 2,000,000 1,000,000 Number of reports 1,500,000 1,000, , Number of drugs per report Number of ADRs per report d. Number of drugs per report e. Number of ADRs per report Number of reports 50,000 40,000 30,000 20,000 10, Patient age (years) f. Patient age distribution Figure 2.2: Characteristics of the WHO database 7

20 such, but occasional mislabellings do occur. Thus, they cannot reliably be excluded from the analysis. Table 2.1 indicates what groups of drugs and ADRs have been reported most often during the entire life span of the WHO database. From Figure 2.2, it is clear that most reports list only one suspected drug and between one and four ADRs, but there are reports that deviate from this general pattern, and list very large numbers of drugs and ADRs. The most striking aspect of the empirical age distribution in Figure 2.2 is perhaps the large number of reports for children less than two years of age. A large proportion of these relate to suspected adverse reactions to vaccines. Another interesting phenomenon is the digit preference on 0 and 5 for encoding patient age. 2.3 Adverse drug reaction signal detection The detection of early warnings related to potential public health or patient safety issues is the main aim of collecting and analysing individual case safety reports. In the context of ADR surveillance, the WHO defines a signal as: "Reported information on a possible causal relationship between an adverse event and a drug, the relationship being unknown or incompletely documented previously. Usually more than a single report is required to generate a signal, depending upon the seriousness of the event and quality of the information." (Edwards and Biriell 1994) As is clear from the definition, single reports in isolation rarely motivate the communication of an early warning of a potential ADR, but there are exceptional examples where single reports of very high quality do (Meyboom et al. 1997). Particularly valuable pieces of information in this respect are those that indicate the effect on the ADR of withdrawing the suspected medication (socalled dechallenge intervention), and the effect of re-exposing the patient to the suspected treatment, after a successful dechallenge (so-called rechallenge intervention) (Edwards et al. 1990). Moreover, Aronson and Hauben (2006) argue that there are certain types of ADRs for which single, well documented incidents may motivate early warning, much in the spirit of the triage algorithms proposed by Ståhl et al. (2004). Early warning of a potential ADR is possible even in the absence of any individually very strong reports, if there is a large enough number of reports on the drug ADR pair of interest (Edwards et al. 1990). This is true in particular when alternative systematic explanations to excessive reporting rates, such as reporting biases or strong confounding, can be dismissed and the relative 8

21 Figure 2.3: Signal detection process reporting rate remains excessive even after suspected duplicates have been removed. The aim of ADR signal detection is to generate, strengthen and refine hypotheses related to suspected drug toxicity. Hypothesis testing is not possible on account of the inherently non-systematic nature of data collection and the lack of proper comparison groups. In-depth clinical evaluation and scrutiny of reports remain at the core of the ADR signal detection process. However, the WHO database receives tens of thousands of reports every month and this massive inflow of reports require efficient computational methods to help clinical experts focus on the groups of reports most likely to represent important public health or patient safety issues (Meyboom et al. 2002). As indicated in Figure 2.3, the signal detection process in routine use on the WHO database consists of a combination of automated knowledge discovery methods (Bate et al. 1998), triage (prioritisation) algorithms and clinical review (Ståhl et al. 2004). The knowledge discovery methods highlight drug ADR pairs with unexpectedly large numbers of reports relative to the average reporting rates in the database. Triage algorithms use a combination of quantitative and qualitative information to focus attention on the most urgent issues for follow-up (Ståhl et al. 2004). Reports related to drug ADR pairs singled out by the triage algorithm are forwarded to a panel of international experts for clinical review. In the context of the clinical review, pattern discovery methods may often be useful to profile larger groups of reports and suggest alternative explanations to observed excessive reporting rates. Hypotheses of suspected ADRs first highlighted in automated knowledge discovery that remain after clinical review are routinely communicated to the drug safety community, and some have been published in the mainstream medical literature (Coulter 9

22 et al. 2001, Sanz et al. 2005). However, the risk of distortion from undiscovered data quality problems and the difficulty of obtaining complete, detailed information on reported ADR incidents mean that signals of suspected ADRs often remain tentative, even after clinical review. 10

23 3. Knowledge discovery in adverse drug reaction surveillance Vast improvements in data storage capacity over the last decades have spurred ever increasing ambitions to analyse large, complex data sets not originally collected for the purpose of statistical analysis. Such investigations require data analysis methodology that scales well with increasing amounts of data and that focuses on discovery and exploration rather than on inference. This area of research and application, on the border between mathematical statistics and computer science, is referred to as knowledge discovery or data mining. Fayyad et al. (1996) describe data mining as one step in a more general knowledge discovery process. Mannila (1996) and Hand (1998) emphasise the similarity between data mining and exploratory statistical analysis, the latter characterising the difference as one primarily related to data set size and properties: in data mining, contamination, nonstationarity and biases are standard. On account of the complex data sets involved, interpretability is often a main consideration, which may favour simplicity at the expense of prediction accuracy (Glymour et al. 1997). An important dividing line is the choice between model based inference and algorithmic approaches (Breiman 2001). Whereas much of the research on knowledge discovery has been driven by computer science, key contributions from the statistical community include the clarification of inferential processes underlying algorithmic methods, insight into the bias variance trade-off in determining model complexity, methods for quantifying uncertainty and placing emphasis of the impact on interpretation of potential distortions such as confounding or selection biases (Elder and Pregibon 1996, Glymour et al. 1997, Efron 2001). In contrast with the more rigid framework for hypothesis testing, knowledge discovery is usually an interactive and iterative process of increasingly refined hypothesis generation. In my view, it should combine an unintimidated attitude towards the analysis of problematic and complex data sets with a proper understanding and clear statement of the limitations in nature and strength of the conclusions that can be drawn. 11

24 3.1 Context Collections of individual case safety reports clearly contain important pieces of rich and very useful information (Finney 1973, Edwards 1997), but they constitute an inherently non-random sample. The presence of reporting biases and violated independence assumptions discussed in Section 2 render summary statistics potentially deceptive. In particular, the presence of nonindependent reports can lead to optimistic precision estimates and invalidate standard tests for association (Finney 1971). As a consequence, the place for statistical methodology in the analysis of collections of individual case safety reports is somewhat out of the ordinary. Its main focus is on providing a framework for effective hypothesis generation and refinement, rather than on hypothesis testing (Bate 2003). Methods for reliably identifying elevated ADR reporting rates in collections of individual case safety reports are already part of routine drug safety signal detection (Ståhl et al. 2004). In the future, methods for highlighting suspected drug drug interaction, groups of nonindependent reports or reporting patterns involving larger sets of drugs and ADRs should allow for even more sophisticated use of this valuable source of information. The emphasis on hypothesis generation and refinement applies throughout this thesis: the aim of the record matching algorithm in I is to highlight likely duplicates for manual review and the aim of II, III and IV is to determine the most effective approach to highlighting apparently excessive ADR reporting rates for further follow-up. The purpose of implementing the methods in V for prioritisation of drug ADR pairs for further follow-up as discussed in Section 4.5 would also be effective hypothesis generation. In knowledge discovery, large numbers of possible associations and patterns are considered simultaneously. Familywise error rates that reflect the probability that any highlighted association corresponds to a false positive are usually less relevant in this context, because all open-ended investigations are bound to produce some false positives. Performance is better evaluated in terms of measures that indicate the proportion of false positives that can be anticipated in a specific study, such as false discovery rates. In our work, we have used two related measures of performance from the literature on Information Retrieval: precision (the number of true positives over the sum of true and false positives) and recall (the number of true positives over the sum of true positives and false negatives). Precision recall graphs that indicate how the precision and recall vary by the threshold for clinical review are used in both I and III. They provide an informative overview of performance, independent of the selected threshold. 12

25 Figure 3.1: Exploratory analysis of collections of individual case safety reports 3.2 Process Fayyad et al. (1996) define knowledge discovery as: "The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." The knowledge discovery process is not limited to actual data analysis but includes: data collection, cleaning and preparation; reduction and projection; data analysis and interpretation, and finally dissemination, incorporation into existing structures and action based on discovered knowledge. It thus entails the entire ADR signal detection process outlined in Section 2.3, from the collection of reports and their pooling in an international database, through data preparation and transformation including conversion from free text to structured information, data cleaning and duplicate detection, via disproportionality analysis and triage algorithms to clinical review, and finally communication to national centres, pharmaceutical companies and the general public. The statistical methodology developed in the context of this thesis is applied at two different stages of the knowledge discovery process for ADR surveillance, as indicated in Figure 2.3. On one hand, disproportionality analysis is a core component in screening for excessive ADR reporting rates in first pass analysis of the database. On the other hand, pattern discovery methods are 13

26 useful in assisting clinical review and highlighting interesting aspects of specific groups of reports in more detailed investigations. Figure 3.1 proposes a general framework for such exploratory analysis. For the purpose of illustration, assume that the data subset of interest consists of all reports involving a particular drug D. At the outset of the exploratory analysis, simple descriptive information such as the total number of reports listing D and from what countries and during what time periods they have been submitted, may be very useful. Together with lists of the most commonly co-reported drugs and ADRs, as well as empirical distributions for patient age and gender, this provides a descriptive overview of the reporting of D which can serve as a useful reference for subsequent discoveries. Experienced data analysts may react directly to descriptive information that contradicts their subject matter knowledge. For example, a domain expert familiar with the WHO database may react to the observation that a suspiciously large proportion of the reports in a subgroup of interest have been submitted from a country with a low overall reporting rate. The middle box in Figure 3.1 is an attempt to formalise such comparative data analysis. Contrasts between the group of reports of interest and a comparison group (e.g. the database as a whole or all reports involving a drug in the same class of drugs as D) provide insight into what properties of the data subset differentiate it from the comparison group. For example, it may turn out that the relative reporting rate of a rare ADR for D by far exceeds that in the database as a whole. Such discrepancies may well be more enlightening than information on what the most commonly reported ADR is in absolute terms. The discussion of such disproportionality analysis is further extended in Section 3.3. Both descriptive and comparative studies may be misleading when the group of interest contains distinct subgroups. For example, if D is prescribed on one hand to young males and on the other hand to elderly females, the summary information that the average patient age on reports listing D is 43 years and that the overall proportion of females is 52%, conveys a very insufficient overview. Clustering algorithms allow for automated partitioning of data, with the aim of detecting latent structure, and may allow for much more relevant subsequent descriptive or comparative data analysis, as indicated by Figure 3.1. In addition to the iteration of automated partitioning, description and comparison described above, there are other methods for pattern discovery in collections of individual case safety reports. Record matching methods such as that adapted for duplicate detection in I can be used to detect groups of unexpectedly similar reports. Modified Hopfield networks and clustering algorithms such as those evaluated in Orre et al. (2005) may allow groups of often recurring ADRs (syndromes) to be identified. Similarly, interaction detection methods such as those in II and IV can be used to highlight suspected ADR risk factors. 14

27 It is rarely possible to specify at the outset of a large exploratory study, a fully automated, all-purpose approach to exploratory data analysis appropriate for all possible questions and patterns of potential interest. In addition, knowledge discovery often produces results that relate not to the primary study objective, but to fundamental properties of the data or of the data collection process. Thus, data cleaning and analysis are in practice intertwined, so that the correction of a data quality problem highlighted in initial data analysis allows for more refined subsequent data analysis. For example, in screening the WHO database for reporting patterns indicative of suspected drug interaction in IV, some larger groups of non-independent reports were highlighted. Their removal may allow for more accurate subsequent studies of drug drug interaction in the WHO database. 3.3 Disproportionality The frequency or relative frequency of a certain event (or set of events) in a database is sometimes of direct interest. However, in many knowledge discovery applications the discrepancy between the observed (relative) frequency and its expected value under some baseline model is of greater interest. An example from the analysis of purchasing patterns in supermarket sales data is that even if milk is the product most commonly purchased together with the product of interest, because this is true of most products, it may be more enlightening to point out that, for instance, grapefruit juice is purchased four times as often together with the product of interest as overall in the database. Such contrasts provide the basis of disproportionality analysis, which focuses on identifying events whose relative frequency in a given subgroup deviates substantially from the relative frequency of the same event in a given comparison group. Most modern methods for screening collections of individual case safety reports for excessive ADR reporting rates are based on disproportionality relative to the rest of the database. This is true of the Information Component (IC) (Bate et al. 1998), the Empirical Bayes Geometric Mean (EBGM) (DuMouchel 1999), the Proportional Reporting Ratio (PRR) (Evans et al. 2001) and the Reporting Odds Ratio (ROR) (Egberts et al. 2002). All these measures compare the number of reports on a certain drug ADR pair to an expected number of reports conditional on the overall reporting rates for the drug and the ADR in the database. The original idea of making comparisons with the database itself as reference goes back to the early days of ADR surveillance (Patwary 1969, Finney 1974). In addition to the lack of reliable external estimates for the international usage of different drugs, an advantage of disproportionality analysis is that marginal reporting biases that affect only the drug or only the ADR, cancel out (at least approximately) in 15

28 a measure of disproportionality. Thus, even though the reporting rates are likely to be higher for serious than for harmless ADRs, this does not have a considerable impact on the measures of disproportionality, as long as the reporting bias affects all drugs to an equal extent. The main drawback of disproportionality measures is that they rely on comparison to the reporting of other drug ADR pairs. Thus, if a particular drug ADR pair is massively reported, it will inflate the overall reporting rates for both the drug and the ADR, sometimes to the extent that excessive reporting rates for the same drug with another ADR or for another drug with the same ADR are masked (Evans 2004, Hauben et al. 2005). Assume the following contingency table based on the cross-classification of reports according to whether they involve a drug x and an ADR y: y not y x a b not x c d The basis for pairwise disproportionality analysis in the WHO database is an observed-to-expected ratio OE contrasting the relative reporting rate of y given x to the overall relative reporting rate of y in the database. With the annotation used in the above contingency table, the observed number of reports on y given x is a, and the expected number of reports conditional on the table marginals is the product of the marginal relative reporting rate of y and the total number of reports on x: is: a+c a+b+c+d (a + b). The observed-to-expected ratio OE = a/(a + b) (a + c)/(a + b + c + d) (3.1) The same measure of disproportionality has been used also in the context of association rule analysis (Agrawal et al. 1996), where it is referred to as the lift or the interest of an association rule involving x and y (Silverstein et al. 1998, Hastie et al. 2001). The similarity between the observed-to-expected ratio and other measures of disproportionality proposed for the analysis of individual case safety reports is clear. The Proportional Reporting Ratio (PRR) based on the above contingency table is (Evans et al. 2001): PRR = a/(a + b) c/(c + d) (3.2) and the corresponding Reporting Odds Ratio (ROR) is (Egberts et al. 2002): ROR = a/b c/d (3.3) 16

29 The IC measure of disproportionality used in routine knowledge discovery for the WHO database is essentially a conservative version of log 2 OE, that tends to 0 for rare drug ADR pairs. The moderation in magnitude is referred to as shrinkage (for details see Section 3.4 below). The availability of thoroughly evaluated shrinkage measures is the main advantage of the OE ratio over the PRR and the ROR. Other strengths are the link to Bayes classifiers described in Norén (2005) and the somewhat better robustness to zero counts in the contingency table than for the PRR and ROR (van Puijenbroek et al. 2002). The main limitation is that the observed-to-expected ratio provides a less distinct contrast between the group of interest and the reference group by including the group of interest in the reference. Another limitation is that the observed-toexpected ratio for a given pair of events by definition cannot exceed the inverse of the marginal relative reporting rate for each event. For example, if one of the events has an overall relative reporting rate of 0.5, then observed-to-expected ratios involving this event can at most reach 2 (if the relative reporting rate of the first event conditional on the other event is 1.00). In practice this limits the usefulness of observed-to-expected ratio as a measure of disproportionality to events that are reasonably rare. While disproportionality analysis is usually carried out at an early stage of the exploratory analysis of collections of individual case safety reports, there are sometimes requests to compute a measure of disproportionality for a drug ADR pair highlighted for review based on clinical judgement or on account of one or a few very strong reports. If the drug ADR pair turns out to be disproportionally reported, this may indeed lend added support. However, observed disproportionality must always be interpreted with caution. The possibility of alternative explanations such as report duplication, violated independence assumptions, publication biases or confounding must always be analysed and clearly stated. 3.4 Shrinkage Shrinkage is an attempt to regularise and reduce the volatility of a measure or parameter estimate of interest, by trading an increase in bias for a decrease in variance. In large and sparse data sets such as national or international collections of individual case safety reports, raw measures of disproportionality tend to sometimes yield very large values based on extremely low numbers of reports, but disproportionality based on just 1 or 2 reports is rarely of practical interest. The problem is that for rare drugs and ADRs, the expected number of reports may be very close to 0, relative to which even a single observed report may constitute a substantial deviation. Very low expected numbers of reports occur in the analysis of collections of individual case safety reports because the 2 by 2 contingency table of Section 3.3 is usually very unbalanced. Even 17

30 Figure 3.2: The simplified IC shrinkage measure plotted against the standard IC shrinkage measure for 10,000 randomly selected drug ADR pairs in the WHO database for the most common drugs and ADRs in the WHO database, the number of reports that do not involve either the drug or the ADR, d, is around 3,000,000, whereas b and c are generally in the order of 100 or 1,000 and a is even smaller (a 10 for 80% of the drug ADR pairs in the database). In order to reduce the vulnerability to spurious associations, two shrinkage measures of disproportionality have been proposed for the analysis of collections of individual case safety reports: the IC (Bate et al. 1998) and the EBGM (DuMouchel 1999). These measures of disproportionality are versions of the (logarithm of the) observed-to-expected ratio in (3.1) moderated towards a baseline value in the absence of large amounts of data. For the IC, the baseline value is 0 which corresponds to an observed-to-expected ratio of 1. Such shrinkage provides a robust measure of disproportionality moderated towards less extreme values for rare drugs and ADRs. However, as data accumulates it tends to log 2 OE as desired. The IC shrinkage measure is defined in II as a Bayesian maximum à posteriori estimate of a parameter related to the logarithm of the observed-to-expected ratio in (3.1). It is well approximated by the following simplified shrinkage measure based on observed and expected counts O xy and E xy : IC log 2 O xy + 1/2 E xy + 1/2 (3.4) A comparison between the IC shrinkage measure in II and that in (3.4) for 10,000 randomly selected drug ADR pairs in the WHO database is presented in Figure 3.2. Clearly, the difference between the two shrinkage measures is negligible. The main advantages of the simplified IC shrinkage measure are 18

31 that it is easier to compute and that it provides a general recipe for shrinkage that can be applied to any measure expressed in terms of an observed-toexpected ratio, such as the Ω measure of drug drug interaction in IV. This shrinkage can also be implemented for the PRR and ROR, after re-expression in terms of observed-to-expected ratios with O xy = a and E xy = (a+b)c c+d for the PRR, and with O xy = a and E xy = bc d for the ROR. Empirical Bayes estimation provides an alternative framework for shrinkage, where the prior distribution for a group of parameters is estimated based on the empirical distribution of maximum likelihood estimates for the group. The main advantage of empirical Bayes estimators is that they borrow strength from similar observations to improve the overall accuracy. However, with respect to each parameter, its estimate will only improve under the assumption that it is indeed related to the other parameters. Unlike the IC prior distribution, an empirical Bayes prior for the observed-to-expected ratio will not necessarily be centred at 1, and thus may inflate individual disproportionality measures rather than shrink them towards less extreme values. A practical issue is that for drug ADR pairs that have never been co-reported, the maximum likelihood estimate of the observed-to-expected ratio is 0. In practice, these drug ADR pairs appear to be ignored in the estimation of the empirical prior distribution in DuMouchel (1999), and the potential bias due to this is unclear. Berry and Berry (2004) propose a hierarchical empirical Bayes estimator for the observed-to-expected measure of disproportionality, where each measure of disproportionality is shrunk towards the group mean for a smaller group of more closely related ADRs. This should allow for more sophisticated empirical Bayes shrinkage, but the identification of appropriate groups of related ADR terms remains a challenging research problem in its own right. 3.5 Pattern discovery and detection Pattern recognition is the attempt to partition a group of data points into classes, based on a given set of explanatory variables (Webb 2002). Distinction is made between supervised and unsupervised pattern recognition: in supervised pattern recognition (or discrimination) a classifier is constructed based on training data consisting of labelled data points with the aim of accurately categorising unseen data points; in unsupervised classification (or clustering), the aim is to identify a natural partitioning of the available data set, without labelled training data available, or even a specification of what the classes of interest may be. For our purposes, the distinction between patterns and models in the context of pattern discovery and detection is more relevant. Hand and Bolton (2004) characterise patterns as related to local features of a data set involving only 19

32 subsets of the data points and/or subsets of the variables. Whereas a global model provides a high level description of the most important general features of a data set, a pattern may highlight one or a few outlying observations or a strong correlation between two variables. Hand and Bolton (2004) propose the following general definition: "A pattern is a local structure that generates data with an anomalously high density compared with that expected under the (global) baseline model." The focus on deviation from a global baseline model applies broadly to the methods described in this PhD thesis. The very aim of disproportionality analysis, is to identify groups of events that are co-reported more often than would be expected, based on a baseline independence model. Similarly, in duplicate detection and other record matching applications, the aim is to identify pairs (or small subsets) of unexpectedly similar reports whose similarity deviates from a global baseline model assuming all reports have been submitted independently. With the exception of the work on Bayes classifiers in V (which relates primarily to supervised pattern recognition by the above definition), this thesis focuses on unsupervised pattern discovery. The aim is to discover structure in data, without strict à priori specification of what the structure of interest is. At the same time, completely open-ended hypothesis generation is not possible as the type of potential patterns is determined by the choice of pattern discovery method, as well as implicitly by a range of other choices such as the variables considered in a given study (Hand 1994, p 319). Thus, while disproportionality analysis may highlight a variety of patterns related to anything from a suspected drug ADR association to an elevated reporting rate of a certain drug in one particular country, the type of patterns in such studies is restricted to unexpectedly high (or low) relative reporting rates. Similarly, record matching may highlight a variety of non-independent reports, but all highlighted patterns will refer to unexpected report similarity. 3.6 Facilitating interpretation Interpretation is one of the final steps in the knowledge discovery process (Fayyad et al. 1996), and a key component of the ADR signal detection process. Transparency is of particular importance in the analysis of non-systematically collected data such as individual case safety reports, where the use of overly complex statistical methodology may give a false sense of security and distract domain experts from limitations with the data (Hauben et al. 2005). Breiman (1985) refers to the application of 20

33 advanced statistical methodology to hide inadequacies with the data as edifice building ; in the ADR signal detection process, the use of overly complex statistical methods may divert clinical experts from careful consideration of alternative explanations to apparently excessive ADR relative reporting rates. Since the primary aim of applying knowledge discovery methods to collections of individual case safety reports is to guide and support domain experts in their manual review, better transparency is a strong argument in favour of choosing a simple method over a more complicated one. Indeed, better transparency is perhaps the strongest argument for choosing the simple IC shrinkage measure over the more complicated one as discussed in Section 3.4. Statistical sophistication does not necessarily rule out transparency, however. The hit-miss model record matching algorithm in I is based on a rather intricate probabilistic model, but its basis for highlighting a given record pair as suspected duplicates is immediately clear from an overview such as that presented in Figure 4.2 of Section 4. While sophisticated statistical methods are sometimes required to make the most of the available data, knowledge discovery results should always be presented as transparently as possible. For example, while shrinkage measures of disproportionality have proved a very powerful basis for filtering individual case safety reports for interesting reporting patterns, they may confuse domain experts, with little interest in the statistical methodology. Moreover, it is difficult to evaluate the impact of data quality issues such as suspected duplication or reporting biases on shrinkage measures of disproportionality. Observed and expected counts provide a more transparent explanation for why certain drug ADR pairs have been highlighted for manual review. In the presence of suspected data quality issues, simple arithmetic will indicate to what extent an excessive reporting rate may be due to a group of suspected duplicates, for instance. At the same time, domain experts often do want a sense of whether an observed disproportionality is likely to be due to chance or not. Credibility intervals around measures of disproportionality give some such indication, although the potential for violated independence assumptions means that precision can be overestimated. Adjustment for potential confounders may complicate interpretation of shrinkage measures of disproportionality. However, as commented on in III, adjusted observed-to-expected ratios sometimes correspond closely to stratum specific ones, and translating adjusted observed-to-expected ratios to stratum specific ones may simplify interpretation. For example, in the example on hypertension and zimeldine in Appendix E.4 of Hopstadius (2006), the IC increases from to when adjusted for time of reporting and country of origin. A closer investigation of the detailed data available in Appendix F.2 of the same thesis indicates that the discrepancy 21

34 is due to hypertension being more than twice as common on US reports (1.7%) as on reports from other countries (0.7%), whereas zimeldine was never used in the USA. Additionally the overall relative reporting rate of hypertension has increased in recent years, whereas zimeldine was primarily used in the early 1980 s. Thus, the crude IC which contrasts the observed relative reporting rate of hypertension given zimeldine to the overall relative reporting rate of hypertension in the entire database underestimates the disproportionality. Arguably, the best information to present to clinical experts in this case would be the observed number of reports on hypertension for zimeldine, and the expected number of such reports based on the relative reporting rate of hypertension in the countries and time period in which it was available. To guide domain experts to appropriate interpretation is clearly as important a challenge as method development in knowledge discovery research. 3.7 Future directions The new methodology proposed in this thesis provides a strong basis for future improvement and further research on knowledge discovery methods for collections of individual case safety reports. The method for drug drug interaction detection goes beyond simple drug ADR disproportional reporting rates, and could potentially be used also to screen for other types of ADR risk factors, such as related to patient gender or age. In general, we must aim to make better use of the rich information available on individual case safety reports. Virtually all knowledge discovery methods of today (including those described in this thesis) are based on raw numbers of reports (Hauben et al. 2005). They do not account for the amount or quality of information on each report nor for suspected duplication. This is in stark contrast with clinical review, in which both the quality of single reports and the quality of sets of reports as a group is carefully scrutinised (Meyboom et al. 1997). Indeed, given that the overall aim of applying knowledge discovery methods to collections of individual case safety reports is to assist and direct clinical review, an important challenge for the future is to achieve better alignment between automated knowledge discovery and clinical review. A first step may be to develop new and improved quality criteria for individual case safety reports similar to those discussed by Edwards et al. (1990). Based on such quality criteria, the number of high quality, distinct reports referring to a particular drug ADR pair can be identified and potentially provide a useful triage criterion. The possibility to highlight single high quality reports is interesting in its own right. The extended hit-miss model record matching algorithm has proved very useful for duplicate detection in the WHO database. Its importance is likely to 22

A Data Mining Approach for Signal Detection and Analysis

SHORT COMMUNICATION Drug Safety 2002; 25 (6): 393-397 0114-5916/02/0006-0393/$25.00/0 Adis International Limited. All rights reserved. A Data Mining Approach for Signal Detection and Analysis Andrew Bate,