How can Natural Language Processing help MedDRA coding? April 16 2018 Andrew Winter Ph.D., Senior Life Science Specialist, Linguamatics
Summary
  About NLP and NLP in life sciences
  Uses of NLP with MedDRA
  Examples in MedDRA coding of adverse events in FDA drug labels
  How NLP could feed into MedDRA development
Use of NLP in Life Sciences
  Advanced text analytics delivers value along the pipeline:
  Gene-disease mapping; Regulatory Submission QC; HEOR; Target ID/selection; Trial site selection and study design; IDMP; Pharmacovigilance; Toxicity analysis and prediction; Safety; Competitive intelligence; Mutation/expression analysis; SAR; Biomarker discovery; Real World Evidence; Comparative Effectiveness; Drug repurposing; Patent analysis; KOL identification; Opportunity scouting; Voice of the Customer analysis; Social media analysis
NLP Turns Text into Actionable Insights
  Transform unstructured or semi-structured data into insights to advance human health.
  Turn text into structured data using sophisticated queries, to drive analytics.
  Diagram labels: Natural Language Processing; Ontologies; Statistical Methods; Machine Learning; Chemistry; Regular Expressions etc.; Enterprise Warehouse; Analytics
NLP finds information however it is expressed
  Different word, same meaning: cyclosporine; ciclosporin; Neoral; Sandimmune
  Different expression, same meaning: Non-smoker; Does not smoke; Does not drink or smoke; Denies tobacco use
  Different grammar, same meaning: 5mg/kg of cyclosporine per day; 5mg/kg per day of cyclosporine; cyclosporine 5mg/kg per day
  Same word, different context: Diagnosed with diabetes; Family history of diabetes; No family history of diabetes
Blend of powerful rule- and machine learning-based methods to transform unstructured data into structured data
  Linguistic Processing: precise linguistic relationships, sentence co-occurrence; precise negation, e.g. pressure not blood pressure; multiple languages
  Terminologies/Ontologies: search for concepts and their synonyms with spelling and optical character recognition (OCR) correction; out-of-the-box or custom ontologies
  Quantitative Data: quantitative and pattern-based data extraction at scale, e.g. numerical data, dates, gene mutations; range search
  Results Normalization: ontology- and rule-based normalization of results; essential for organizing structured output; enables indirect relations, filtering/faceting results, etc.
  Chemistry: identify and extract chemicals in context based on substructure and chemical similarity
  Table & Region Processing: unique capability to capture knowledge from tables embedded in documents; fielded search within regions of a document
Data normalization: always treat the same concept in the same way (the key to structured results)
  Concept        Text                                                      Normalized Value
  Diseases       breast cancer; carcinoma of the breast                    Breast Neoplasm
  Genes          Raf-1; Raf I                                              RAF1
  Dates          27th Feb 2014; 2014/02/27                                 20140227
  Measurements   0.2g; Two hundred milligrams                              200 mg
  Mutations      Val 158 Met; Val by Met at codon 158                      V158M
  Behaviours     denies alcohol and tobacco use; is not a cigarette smoker Non-smoker
  Relationships  ...nimesulide, a selective COX2 inhibitor, inhibits       Entrez ID: 5743
  Overview
    Data normalization converts text into a standard format.
    It is a fundamental component in transforming text into structured data and driving actionable insights.
  Key benefits
    Find concepts however they are expressed
    Join results to discover new indirect relationships
    Cluster or facet results by concept or quantity
    Compare measurements with different units, e.g. kg vs. lbs
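The Dates and Measurements rows of the normalization table can be illustrated with a minimal Python sketch. The function names `normalize_date` and `normalize_mass` are hypothetical, the supported input formats are only the ones shown on the slide, and spelled-out quantities such as "Two hundred milligrams" are deliberately not handled. This is an illustration of the idea, not how any particular NLP product implements it.

```python
import re
from datetime import datetime

def normalize_date(text: str) -> str:
    """Normalize a date string to YYYYMMDD (hypothetical helper)."""
    # Drop ordinal suffixes such as the "th" in "27th".
    cleaned = re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text.strip())
    # Assumption: only these two input layouts are supported.
    for fmt in ("%d %b %Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

# Conversion factors to milligrams (assumption: only mg and g are needed).
TO_MG = {"mg": 1, "g": 1000}

def normalize_mass(text: str) -> str:
    """Normalize a numeric mass such as '0.2g' to milligrams."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(mg|g)\s*", text, re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized mass: {text!r}")
    milligrams = float(match.group(1)) * TO_MG[match.group(2).lower()]
    return f"{milligrams:.10g} mg"

dates = [normalize_date("27th Feb 2014"), normalize_date("2014/02/27")]
mass = normalize_mass("0.2g")
```

Both date spellings map to the same value, which is what allows results to be joined, clustered, or compared downstream.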
Use of NLP with MedDRA
  Errors in Regulatory Submissions
  Social Media
  Adverse Events in Drug Labels
Table: Most Frequently Reported Medical Conditions ( 5% in Any Treatment Group)
  Number (%) of Subjects
  Study                              2000 Pooled Studies         2003 Pooled Study
  Total Number Subjects              Rx N=997      Pbo N=927     Rx N=1021     Pbo N=956
  Cardiac disorders                  70 (7.0)      32 (3..5)     108 (10.6)    101 (10.6)
  Angina pectoris                    4 (0.4)       5 (0.5)       74 (7.2)      71 (7.4)
  Dyspepsia                          174 (17.5)    120 (12.9)    3 (0.3)       2 (0.2)
  GERD                               83 (8.3)      52 (5.6)      30 (2.9)      27 (2.8%)
  Metabolic / nutritional disorders  253 (25.4)    165 (17.8)    194 (19.0)    212 (22.2)
  Dyslipedaemia                      1 (0.1)       0 (0)         15 (1.5)      19 (2.0)
  Hypercholesterolaemia              65 (6.5)      50 (5.4)      88 (8.6)      103 (10.8)
  Hyperlipidaemia                    147 (14.7)    79 (8.5)      56 (5.5)      66 (6.9)
  Osteoarthritis                     102 (10.2)    57 (6.6)      12 (1.2)      11 (1.2)
  Nervous system disorders           628 (63.0)    409 (44.1)    28 (2.7)      19 (2.0)
  Headache                           413 (41.4)    280 (30.2)    9 (0.9)       7 (0.7)
  Psychiatric disorders              137 (13.7)    81 (8.7)      14 (1.4)      15 (1.6)
  Insomnia                           84 (8.4)      47 (5.1)      9 (0.9)       8 (0.8)
Commonly reported conditions included Seasonal allergies, Back pain, and Hypercholesterolaemia. The majority of AEs were considered treatment related in all cohorts and the relationship between treatment groups and between cohorts was similar to that observed for all-causality AEs. Permanent discontinuations were reported at higher rates in the Rx groups than in the placebo groups in the 3 pooled cohorts. The majority of AEs leading to permanent discontinuation were considered treatment related in both treatment groups in all cohorts. The single most frequently reported event was headache, which was reported in approximately 40% of Rx subjects and 20% of placebo subjects in the 2000 Pooled cohort. Other AEs reported across all cohorts at rates greater in Rx subjects than placebo subjects included Seasonal allergies and Insomnia (2000 8.4% vs 5.4%, 2003 0.9% vs 0.8%, 2006 14.0% vs 10.1%; Rx vs placebo respectively).
Key
  Incorrect formatting: doubled period, incorrect number of decimal places, addition of percent sign
  Incorrect calculation: number of patients divided by total number does not agree with percent term
  Incorrect threshold: presence of row does not agree with table title
  Text-table inconsistency: numbers in the table do not agree with numbers in the accompanying text
Sample table and text highlighting, showing inconsistencies between data. The highlight colour lets the reviewer rapidly assess where the errors are and of what type, and then correct them appropriately.
Linguamatics 2016
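Two of the checks in the key, incorrect formatting and incorrect calculation, can be sketched in Python for a single "count (percent)" cell. The function name `check_cell`, the strict one-decimal format, and the tolerance are assumptions made for illustration; the sample cells are taken from the table on this slide. The threshold and text-table checks are not covered here.

```python
import re

# Strict layout assumed for a clean cell: "count (percent)" with one decimal.
STRICT = re.compile(r"^\d+ \(\d+\.\d\)$")
# Lenient parse that still reads cells with doubled periods or a stray "%".
LENIENT = re.compile(r"^\s*(\d+)\s*\(\s*(\d+)(?:\.+(\d+))?\s*%?\s*\)\s*$")

def check_cell(cell: str, n_subjects: int, tolerance: float = 0.05) -> list:
    """Return the list of issues found in one table cell (hypothetical helper)."""
    issues = []
    if not STRICT.match(cell):
        issues.append("incorrect formatting")
    parsed = LENIENT.match(cell)
    if parsed is None:
        return issues + ["unparseable"]
    count = int(parsed.group(1))
    reported = float(f"{parsed.group(2)}.{parsed.group(3) or 0}")
    # Recompute the percentage from the count and the column total.
    expected = round(100 * count / n_subjects, 1)
    if abs(reported - expected) > tolerance:
        issues.append("incorrect calculation")
    return issues

# Cells from the sample table, with the N of their column.
results = {
    "70 (7.0)": check_cell("70 (7.0)", 997),
    "32 (3..5)": check_cell("32 (3..5)", 927),
    "27 (2.8%)": check_cell("27 (2.8%)", 956),
    "57 (6.6)": check_cell("57 (6.6)", 927),
}
```

The lenient parse is what lets a badly formatted cell still be checked for arithmetic, so a single cell can be reported with more than one error type.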
Use Case: Automated Blinded Data Review for Regulatory Submissions
  Before unblinding a clinical trial, data are checked for errors and inconsistencies.
  Among the many checks performed, MedDRA terms for Adverse Event Reports are verified, including:
    Is the Preferred Term valid in any version of MedDRA?
      The reporter may have inserted the Investigator Entry in the wrong field, or used an LLT.
    Are multiple MedDRA versions in use in the same trial?
      Reporter error, or an error when generating the blinded data.
    Does the specified version of MedDRA agree with the Preferred Terms being reported?
      The reporter may have used a more precise MedDRA term from a more recent version of MedDRA.
    Does the Preferred Term agree with the declared System Organ Class?
  Automation of this process is in use at large pharma.
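The four MedDRA checks listed above can be sketched as simple dictionary lookups. Everything concrete here is an assumption for illustration: the version labels `vA` and `vB`, the toy term lists (MedDRA itself is licensed and far larger), the record fields, and the simplification that each PT has exactly one SOC, whereas real MedDRA is multi-axial.

```python
# Toy stand-in for MedDRA releases: {version: {PT: primary SOC}}.
# Version labels and contents are invented for this example.
TOY_MEDDRA = {
    "vA": {"Headache": "Nervous system disorders",
           "Insomnia": "Psychiatric disorders"},
    "vB": {"Headache": "Nervous system disorders",
           "Insomnia": "Psychiatric disorders",
           "Angina pectoris": "Cardiac disorders"},
}

def check_ae_records(records, dictionaries):
    """Return (record index, issue) pairs; index None is a trial-level issue."""
    findings = []
    versions = {rec["version"] for rec in records}
    if len(versions) > 1:
        findings.append((None, "multiple MedDRA versions in use"))
    for i, rec in enumerate(records):
        valid_in = [v for v, terms in dictionaries.items() if rec["pt"] in terms]
        if not valid_in:
            findings.append((i, "PT not valid in any version"))
        elif rec["version"] not in valid_in:
            findings.append((i, "PT not in declared version"))
        elif dictionaries[rec["version"]][rec["pt"]] != rec["soc"]:
            findings.append((i, "PT does not agree with declared SOC"))
    return findings

records = [
    {"pt": "Headache", "soc": "Nervous system disorders", "version": "vA"},
    {"pt": "Angina pectoris", "soc": "Cardiac disorders", "version": "vA"},
    {"pt": "Insomnia", "soc": "Nervous system disorders", "version": "vA"},
    {"pt": "bad headache", "soc": "Nervous system disorders", "version": "vA"},
]
findings = check_ae_records(records, TOY_MEDDRA)
```

The last record mimics an Investigator Entry typed into the Preferred Term field, and the second mimics a term taken from a more recent version than the one declared.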
Use Case: Social Media Analysis
  Social media: plenty of AEs are mentioned, but the language is informal.
  Linguistic patterns can find mentions of AEs without using a dictionary.
  Using MedDRA LLTs finds only one of the following 4 examples.
Use Case: Extraction of Adverse Events using MedDRA
  Extraction of adverse events, MedDRA terms and frequency of occurrence, clustered by medicinal product.
  Structured results can be used to populate a database, e.g. IDMP.
  Different customers have different MedDRA requirements, e.g. PT vs LLT, which is easy to accommodate.
  Results table (background) and highlighted source document (foreground) are shown.
Extraction of AEs from FDA Drug Labels
  FDA drug labels are not structured.
  Goal: compare AEs found in Real World Evidence with known AEs.
  Find AEs within text and within tables.
  Extract the frequency values when filtering to include only AEs that occur at a greater rate than placebo.
Use of NLP terminology features in extracting AEs
  Increase recall with:
    Morphological variants
    Spelling correction
    Matching across conjunctions
    Mapping multiple concepts to MedDRA PT
  Increase precision with:
    Excluding inappropriate contexts
    Use of document sections to exclude inappropriate terms
Increase recall: morphological variants
  MedDRA PT: Congenital anomaly
  Additional hits when using morphological variants
Increase recall: spelling correction
  MedDRA PT: Hypersensitivity
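As a rough illustration of spelling correction against a terminology, the sketch below uses fuzzy string matching from Python's standard `difflib` module. This is a swapped-in stand-in technique, not the method used in the presentation; the function name `correct_to_pt`, the cutoff value, and the misspelling are invented, and the term list is just the PTs named on nearby slides.

```python
import difflib

# Small illustrative term list (PTs mentioned on the surrounding slides).
PREFERRED_TERMS = [
    "Hypersensitivity",
    "Congenital anomaly",
    "Hepatic neoplasm",
    "Thyroid neoplasm",
    "Blood creatinine increased",
]

def correct_to_pt(token, preferred_terms, cutoff=0.85):
    """Return the closest PT for a possibly misspelled token, or None."""
    lookup = {pt.lower(): pt for pt in preferred_terms}
    # get_close_matches ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(
        token.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

corrected = correct_to_pt("hypersensitvity", PREFERRED_TERMS)
unrelated = correct_to_pt("headache", PREFERRED_TERMS)
```

A high cutoff keeps recall gains from turning into false matches between genuinely different terms; the right value would need tuning on real label text.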
Increase recall: MedDRA matching across conjunctions
  MedDRA PT: Hepatic neoplasm OR Thyroid neoplasm
Increase recall: mapping multiple concepts to a MedDRA PT
  Searching for the MedDRA PT "Blood creatinine increased" and its synonyms has low recall:
    Blood creatinine increased; Creatinine blood increased; Creatinine high; Creatinine increased; Creatinine serum increased; Increased serum creatinine; Plasma creatinine increased; Raised serum creatinine; Serum creatinine increased
  Combining the MedDRA PT "Blood creatinine" (Blood creatinine; Creatinine; Plasma creatinine; Serum creatinine) with the relation "Increase" (Increase; Elevate; Raise...) in a linguistic pattern allowing flexibility in expression gives significant additional recall.
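The contrast between synonym-string matching and a concept-plus-relation pattern can be sketched with regular expressions. This is only an approximation of a linguistic pattern: the three-word gap, the word stems, and the function names are assumptions, and negation ("no increase in creatinine") is not handled.

```python
import re

# Synonym strings for the PT "Blood creatinine increased", from the slide.
SYNONYMS = [
    "blood creatinine increased", "creatinine blood increased",
    "creatinine high", "creatinine increased", "creatinine serum increased",
    "increased serum creatinine", "plasma creatinine increased",
    "raised serum creatinine", "serum creatinine increased",
]

# Concept: "creatinine" with an optional blood/plasma/serum qualifier.
CONCEPT = r"(?:(?:blood|plasma|serum)\s+)?creatinine(?:\s+(?:blood|plasma|serum))?"
# Relation: crude stems for increase/elevate/raise, plus "high".
RELATION = r"(?:increas|elevat|rais)\w*|high"
# Assumption: allow up to three intervening words between the two parts.
GAP = r"(?:\W+\w+){0,3}?\W+"

PATTERN = re.compile(
    rf"\b(?:(?:{RELATION})\b{GAP}(?:{CONCEPT})|(?:{CONCEPT}){GAP}(?:{RELATION}))\b",
    re.IGNORECASE,
)

def string_match(text: str) -> bool:
    """Plain synonym lookup: True if any synonym string occurs in the text."""
    lowered = text.lower()
    return any(synonym in lowered for synonym in SYNONYMS)

def pattern_match(text: str) -> bool:
    """Concept and relation in either order, within a short window."""
    return PATTERN.search(text) is not None

example = "Elevation of blood creatinine"
by_string, by_pattern = string_match(example), pattern_match(example)
```

The example phrase is missed by the synonym list but found by the pattern, which is the additional recall the slide describes.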
Increase precision: exclusion of hits in inappropriate contexts when searching for adverse events
  There are thousands of examples of MedDRA concepts that are not AEs.
  Linguistic patterns can filter out inappropriate contexts.
Increase precision: using document regions
  Exclusion of PTs that occur in Indications when searching for AEs.
  Such hits can be removed on the basis of the same PT appearing in Indications.
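The region-based exclusion can be sketched as a set difference over extracted hits. The hit structure, the section labels `indications` and `adverse_reactions`, and the sample hits are all hypothetical; a real pipeline would take these from the region processing of the label.

```python
def exclude_indication_pts(hits):
    """Keep AE-section hits whose PT does not also appear in Indications."""
    indication_pts = {h["pt"] for h in hits if h["section"] == "indications"}
    return [
        h for h in hits
        if h["section"] == "adverse_reactions" and h["pt"] not in indication_pts
    ]

# Invented hits for one label: the product is indicated for insomnia, so a
# mention of insomnia in the adverse reactions text is treated as suspect.
hits = [
    {"pt": "Insomnia", "section": "indications"},
    {"pt": "Insomnia", "section": "adverse_reactions"},
    {"pt": "Headache", "section": "adverse_reactions"},
    {"pt": "Hypersensitivity", "section": "adverse_reactions"},
]
kept = [h["pt"] for h in exclude_indication_pts(hits)]
```

Filtering on the normalized PT rather than the matched string means the exclusion still applies when the two sections word the condition differently.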
How NLP could feed into MedDRA development: improved coverage of terminology
  Terms appearing with MedDRA terms in the same list
  Explicit constructions such as "AEs such as", or from tables
  Look for terms in appropriate contexts, e.g. "made me ?"
Noun phrases occurring in a list after "adverse events such as", and which are not already in MedDRA
Noun phrases occurring in the same list as another MedDRA term
Summary
  NLP is required to rule out inappropriate contexts, improving precision.
  NLP techniques, e.g. morphological variants and OCR correction, improve recall.
  String-based synonym matching cannot cope with all the variation found in real text, e.g. "Elevation of blood creatinine". Here linguistic patterns are required.
  Region and table processing are often required to get the right context.