IR Meets EHR: Retrieving Patient Cohorts for Clinical Research Studies

Size: px

Start display at page:

Download "IR Meets EHR: Retrieving Patient Cohorts for Clinical Research Studies"

Doreen Davis
5 years ago
Views:

1 IR Meets EHR: Retrieving Patient Cohorts for Clinical Research Studies References William Hersh, MD Department of Medical Informatics & Clinical Epidemiology School of Medicine Oregon Health & Science University Web: Blog: Anonymous (2009). Initial National Priorities for Comparative Effectiveness Research. Washington, DC, Institute of Medicine. Buckley, C and Voorhees, E (2000). Evaluating evaluation measure stability. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece. ACM Press Buckley, C and Voorhees, EM (2004). Retrieval evaluation with incomplete information. Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Sheffield, England. ACM Press Cormack, GV and Lyman, TR (2007). Online supervised spam filter evaluation. ACM Transactions on Information Systems. 25(3): Article 11. Demner-Fushman, D, Abhyankar, S, et al. (2012). NLM at TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute for Standards and Technology Demner-Fushman, D, Abhyankar, S, et al. (2011). A knowledge-based approach to medical records retrieval. The Twentieth Text REtrieval Conference Proceedings (TREC 2011), Gaithersburg, MD. National Institute for Standards and Technology Edinger, T, Cohen, AM, et al. (2012). Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC Medical Records Track. AMIA 2012 Annual Symposium, Chicago, IL Friedman, CP, Wong, AK, et al. (2010). Achieving a nationwide learning health system. Science Translational Medicine. 2(57): 57cm29. Hanbury, A, Müller, H, et al. (2015). Evaluation-as-a-service: overview and outlook, arxiv. Harman, DK (2005). The TREC Ad Hoc Experiments. TREC: Experiment and Evaluation in Information Retrieval. E. Voorhees and D. Harman. Cambridge, MA, MIT Press: Hersh, W, Müller, H, et al. (2009). The ImageCLEFmed medical image retrieval task test collection. Journal of Digital Imaging. 22: Hersh, W and Voorhees, E (2009). TREC genomics special issue overview. Information Retrieval. 12: 1-15.

2 Hersh, WR (2009). Information Retrieval: A Health and Biomedical Perspective (3rd Edition). New York, NY, Springer. Hersh, WR, Weiner, MG, et al. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 51(Suppl 3): S30-S37. Hripcsak, G, Knirsch, C, et al. (2011). Bias associated with mining electronic health records. Journal of Biomedical Discovery and Collaboration. 6: Jarvelin, K and Kekalainen, J (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems. 20: King, B, Wang, L, et al. (2011). Cengage Learning at TREC 2011 Medical Track. The Twentieth Text REtrieval Conference Proceedings (TREC 2011), Gaithersburg, MD. National Institute for Standards and Technology Müller, H, Clough, P, et al., Eds. (2010). ImageCLEF: Experimental Evaluation in Visual Information Retrieval. Heidelberg, Germany, Springer. Roegiest, A and Cormack, GV (2016). An architecture for privacy-preserving and replicable highrecall retrieval experiments. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, Pisa, Italy Safran, C, Bloomrosen, M, et al. (2007). Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. Journal of the American Medical Informatics Association. 14: 1-9. Smith, M, Saunders, R, et al. (2012). Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. Washington, DC, National Academies Press. Voorhees, E and Hersh, W (2012). Overview of the TREC 2012 Medical Records Track. The Twenty- First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology Voorhees, EM (2013). The TREC Medical Records Track. Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, Washington, DC Voorhees, EM and Harman, DK, Eds. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA, MIT Press. Voorhees, EM and Tong, RM (2011). Overview of the TREC 2011 Medical Records Track. The Twentieth Text REtrieval Conference Proceedings (TREC 2011), Gaithersburg, MD. National Institute of Standards and Technology Washington, V, DeSalvo, K, et al. (2017). The HITECH era and the path forward. New England Journal of Medicine. 377: Wu, S, Liu, S, et al. (2017). Intra-institutional EHR collections for patient-level information retrieval. Journal of the American Society for Information Science & Technology: in press. Wu, ST, Sohn, S, et al. (2013). Automated chart review for asthma cohort identification using natural language processing: an exploratory study. Annals of Allergy, Asthma & Immunology. 111: Yilmaz, E, Kanoulas, E, et al. (2008). A simple and efficient sampling method for estimating AP and NDCG. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Singapore Zhu, D, Wu, ST, et al. (2014). Using large clinical corpora for query expansion in text-based cohort identification. Journal of Biomedical Informatics. 49:

IR Meets EHR: Retrieving Patient Cohorts for Clinical Research Studies William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology School of Medicine Oregon Health

3 IR Meets EHR: Retrieving Patient Cohorts for Clinical Research Studies William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology School of Medicine Oregon Health & Science University Portland, OR, USA Overview Re-use of clinical data Primer on information retrieval and challenge evaluations TREC Medical Records Track Mayo-OHSU Clinical Text for Cohort Identification Project Some caveats for re-use of clinical data 2 1

Re-use of clinical data Many re-uses (or secondary uses ) of electronic health record (EHR) data, including (Safran, 2007) Clinical and translational research generating hypotheses and facilitating

4 Re-use of clinical data Many re-uses (or secondary uses ) of electronic health record (EHR) data, including (Safran, 2007) Clinical and translational research generating hypotheses and facilitating research Public health surveillance for emerging threats Healthcare quality measurement and improvement Opportunities in US facilitated by growing incentives for meaningful use of EHRs in the HITECH Act (Washington, 2017), aiming toward the learning healthcare system (Friedman, 2010; Smith 2012) But also limitations on ability to re-use of clinical data (Hripcsak, 2011; Hersh, 2013) 3 Information retrieval (IR) (Hersh, 2009) Focus on indexing and retrieval of knowledgebased information Historically centered on text in knowledge-based documents, but increasingly associated with many types of content 4 2

5 Evaluation of IR systems System-oriented how well system performs Historically focused on relevance-based measures Recall and precision proportions of relevant documents retrieved User-oriented how well user performs with system e.g., performing task, user satisfaction, etc. 5 System-oriented IR evaluation Usually assessed with test collections, consisting of Content fixed yet realistic collections of documents, images, etc. Topics statements of information need that can be fashioned into queries entered into retrieval systems Typically need for stability of results (Buckley, 2000) Relevance judgments by expert humans for which content items should be retrieved for which topics Evaluation consists of runs using a specific IR approach with output for each topic measured and averaged across topics 6 3

6 Recall and precision Recall (sensitivity) # retrieved and relevant documents R = # relevant documents incollection Precision (positive predictive value) # retrieved and relevant documents P = # retrieved documents 7 Recall and precision can be combined into single measure Mean average precision (MAP) is mean of average precision for each topic (Harman, 2005) Bpref accounts for when relevance information is significantly incomplete (Buckley, 2004) Normal discounted cumulative gain (NDCG) allows for graded relevance judgments (Jarvelin, 2002) MAP and NCDG can be inferred when there are incomplete judgments (Yilmaz, 2008) All of above typically vary from 0 (worst) 1 (best) 8 4

7 Challenge evaluations Develop common task, content collection, topics, evaluation metrics, etc., ideally aiming for real-world size and representation for data, tasks, etc. Release of document collection to participating researchers Experimental runs and submission of results Relevance judgments Analysis of results Results usually show wide variation between topics and between systems Should be viewed as relative, not absolute performance Averages can obscure variations 9 Some well-known challenge evaluations in IR Text Retrieval Conference (TREC, Voorhees, 2005) sponsored by National Institute for Standards and Technology (NIST), started in 1992 Many tracks of interest, such as routing/filtering, Web searching, question-answering, etc. Mostly non-biomedical; first domain-specific track was Genomics Track (Hersh, 2009) Conferences and Labs of the Evaluation Forum (CLEF, Initial focus on retrieval across languages, European-based Additional focus on image retrieval, includeing medical image retrieval tasks (Hersh, 2009; Müller, 2010) TREC has inspired other challenge evaluations, e.g., i2b2 NLP Shared Task,

data due to Privacy issues Task issues Facilitated with development of large-scale, de-identified data set from

8 TREC Medical Records Track Most work using EHR data has involved natural language processing over small collections; IR approaches allow to scale up IR research always been easier with knowledge-based content than patient-specific data due to Privacy issues Task issues Facilitated with development of large-scale, de-identified data set from University of Pittsburgh Medical Center (UPMC) Launched in 2011, repeated in Test collection 12 (Voorhees, 2013) 6

9 Some issues for test collection De-identified to remove protected health information (PHI), e.g., age number range De-identification precludes linkage of same patient across different visits (encounters) UPMC only authorized use for TREC 2011 and TREC 2012 but nothing else, including any other research (unless approved by UPMC) 13 Wide variations in number of documents per visit visits > 100 reports; max report size Number of reports in visit 14 (Voorhees, 2013) 7

10 Topic development and relevance assessments Task Identify patients who are possible candidates for clinical studies/trials Had to be done at visit level due to deidentification of records 2011 topics derived from 100 top critical medical research priorities in comparative effectiveness research (IOM, 2009) Selected 35 topics from 54 assessed for appropriateness for data and with at least some relevant visits Relevance judgments by OHSU informatics students who were physicians 15 Participation in 2011 (Voorhees, 2011) Runs consisted of ranked list of up to 1000 visits per topic for each of 35 topics Automatic no human intervention from input of topic statement to output of ranked list Manual everything else Up to 8 runs per participating group Subset of retrieved visits contributed to judgment sets Because resources for judging limited, could only judge relatively small sample of visits, necessitating use of BPref for primary evaluation measure 127 runs submitted from 29 research groups 16 8

11 Evaluation results for top runs manual auto unjudged bpref P(10) 0 NLMManual CengageM11R3 SCAIMED7 UTDHLTCIR* udelgn* WWOCorrect* uogtrdenio* NICTA6* EssieAuto* buptpris01 IRITm1QE1 SCAIMED1* UCDCSIrun3* mayo1brst* ohsumanall merckkgaamer 17 BUT, wide variation among topics Num Rel Best Median

12 Easy and hard topics Easiest best median bpref 105: Patients with dementia 132: Patients admitted for surgery of the cervical spine for fusion or discectomy Hardest worst best bpref and worst median bpref 108: Patients treated for vascular claudication surgically 124: Patients who present to the hospital with episodes of acute loss of vision secondary to glaucoma Large differences between best and median bpref 125: Patients co-infected with Hepatitis C and HIV 103: Hospitalized patients treated for methicillin-resistant Staphylococcus aureus (MRSA) endocarditis 111: Patients with chronic back pain who receive an intraspinal pain-medicine pump 19 Failure analysis for 2011 topics (Edinger, 2012) Reasons for Incorrect Retrieval Visits Topics Visits Judged Not Relevant Topic terms mentioned as future possibility 16 9 Topic symptom/condition/procedure done in the past 22 9 All topic criteria present but not in time/sequence specified by the topic description 19 6 Most, but not all, required topic criteria present 17 8 Topic terms denied or ruled out Notes contain very similar term confused with topic term Non-relevant reference in record to topic terms Topic terms not present unclear why record was ranked highly 14 8 Topic present record is relevant disagree with expert judgment Visits Judged Relevant Topic not present record is not relevant disagree with expert judgment Topic present in record but overlooked in search Visit notes used a synonym or lexical variant for topic terms Topic terms not named in notes and must be inferred 3 2 Topic terms present in diagnosis list but not visit notes

13 2012 Track same collection and methods, new topics (Voorhees, 2012) 21 What approaches did (and did not) work? Best results in 2011 and 2012 obtained from NLM group (Demner-Fushman, 2011; Demner-Fushman, 2011) Top results from manually constructed queries using Essie domain-specific search engine (Ide, 2007) Other automated processes fared less well, e.g., creation of PICO frames, negation, term expansion, etc. Best automated results in 2011 obtained by Cengage (King, 2011) Filtered by age, race, gender, admission status; terms expanded by UMLS Metathesaurus Benefits of approaches commonly successful in IR provided small or inconsistent value for this task in 2011 and 2012 Document focusing, term expansion, etc

Semi-Structured IR in Clinical Text for Cohort Identification Mayo Clinic-OHSU collaboration Funded by NLM R01 Hongfang Liu, Mayo Clinic, Co-PI Stephen Wu, OHSU, OHSU, Co-PI William Hersh, OHSU, Co-I

14 Semi-Structured IR in Clinical Text for Cohort Identification Mayo Clinic-OHSU collaboration Funded by NLM R01 Hongfang Liu, Mayo Clinic, Co-PI Stephen Wu, OHSU, OHSU, Co-PI William Hersh, OHSU, Co-I Adding natural language processing (NLP) and language modeling (LM) to base IR methods on large amounts of unmodified (not deidentified) text from EHR Preliminary data showed improvement over baseline IR techniques with TREC Medical Record Track collection (Wu, 2013; Zhu, 2014) 23 Overall work of project 24 12

Parallel collections of patient data (Wu, 2017) OHSU Extraction of patients from Research Data Warehouse (RDW) having inpatient or outpatient encounters in primary care departments (Internal

15 Parallel collections of patient data (Wu, 2017) OHSU Extraction of patients from Research Data Warehouse (RDW) having inpatient or outpatient encounters in primary care departments (Internal Medicine, Family Medicine, or Pediatrics) with 3 or more encounters 5 or more text entries Between 1/1/2009 and 12/31/2013 Stored on (highly!) secure server Mayo using comparable approach and quantities 25 OHSU data model and statistics 26 13

16 Topics From OHSU Derived from clinical study data requests by researchers from the Oregon Clinical and Translational Research Institute (OCTRI), querying Research Data Warehouse (RDW) (29 topics) From Mayo Phenotype KnowledgeBase (PheKB) (7 topics) Rochester Epidemiology Project (REP) (9 topics) 2 topics from its own RDW Similar topics merged to avoid redundancy (1 OHSU/REP topic, and 2 OHSU/PheKB topics), resulting in total of 56 topics 27 Evaluation across sites Relevance assessments done behind firewall at each site Have developed Web-based relevance judging system based on one used for TREC being used at both sites Only aggregate results (no PHI) will be shared across sites Failure analysis will be done at each site and reported in aggregate manner 28 14

17 Next step: Evaluation-as-a-Service (EaaS) (Hanbury, 2015, Roegiest, 2016) 29 EaaS pros, cons, and plans Pros Data providers can facilitate wider participation, including those who do not have data Can be applied to many types of tasks (not just IR) Organization of challenge evaluations Cons Data consumers cannot see most of data Plans Grant proposal under review 15

Caveats for use of operational EHR data (Hersh, 2013) may be Inaccurate Incomplete Transformed in ways that undermine meaning Unrecoverable Of unknown provenance Of insufficient granularity

18 Caveats for use of operational EHR data (Hersh, 2013) may be Inaccurate Incomplete Transformed in ways that undermine meaning Unrecoverable Of unknown provenance Of insufficient granularity Incompatible with research protocols 31 Many idiosyncrasies of clinical data (Hersh, 2013) Left censoring First instance of disease in record may not be when first manifested Right censoring Data source may not cover long enough time interval Data might not be captured from other clinical (other hospitals or health systems) or nonclinical (OTC drugs) settings Bias in testing or treatment Institutional or personal variation in practice or documentation styles Inconsistent use of coding or standards 32 16

19 Lessons learned and future directions IR-scale evaluation of EHR data is feasible Biggest challenges Sensitivity of data protecting privacy vs. facilitating informatics research Quantity of data aim to be realistic at scale Models for multisite evaluation How to perform IR evaluation on highly personal data similar challenges in other domains, e.g., spam detection (Cormack, 2007) Can we perform secure and scalable evaluation of tasks using EHR data? 33 Thank You! William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology School of Medicine Oregon Health & Science University Portland, OR, USA Web: Blog:

Information Retrieval from Electronic Health Records for Patient Cohort Discovery

Information Retrieval from Electronic Health Records for Patient Cohort Discovery References William Hersh, MD Professor and Chair Department of Medical Informatics & Clinical Epidemiology Oregon Health