Clinical NLP, PubGene
Clinical trials in Coremine Oncology
Text processing and information extraction for the surgery planning form
November 2017
Dag Are Steenhoff Hov, PubGene AS
PubGene, founded 2001
- ArrayIt H25K microarray
- Integration of structured and unstructured information
- Interpretation of biomedical analysis data
- General information
- Specialized information analysis
- Scientific literature
- Coremine networks
- COREMINE Oncology, COREMINE Medical, COREMINE Platform
Clinical NLP in PubGene - examples
- Clinical trials in Coremine Oncology
- PubGene in Ahus Optique
Courtesy of DNV-GL (Tore Hartvigsen)
Coremine Oncology
AIM: To enable oncologists to make better treatment decisions
HOW: Combine data from relevant sources to aid interpretation of oncogenomics data from NGS and other platforms
Input: Somatic mutations, copy number changes, gene expression, or similar quantities
Output: Gene/biomarker annotations, related drugs and drug sensitivity, pathways, clinical trials, etc.
Coremine Oncology: Our Scope
We focus on:
- Analysis of called events; we assume that normalization and data quality considerations have already been taken care of
- Collecting and integrating information for interpretation
- Linking to potentially relevant treatments
- Linking to clinical trials related to the input data
Coremine Oncology
Currently three types of input data:
- (Somatic) mutations
- Copy number changes
- Gene expression
Analysis/Interpretation module to display information (annotations) about:
- Mutation
- Gene/Protein
- Protein domains
Summary module to show patient-level information with respect to:
- Statistics on mutations
- Related drugs for targets with change (in progress: also biomarker and sensitivity info)
- Pathways for targets with change
- Relevant clinical trials for aberrations
Example: Somatic mutations input data
Input for Coremine Oncology, case from lung cancer
Columns: Chromosome number, Position, Reference nucleotide, Alternate nucleotide
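A minimal sketch of reading such an input file, assuming one mutation per line with whitespace-separated columns in the order listed on the slide. The actual file format accepted by Coremine Oncology is not shown here; the `Mutation` type, the `parse_mutations` function, and the coordinate values are hypothetical and purely illustrative.

```python
from typing import List, NamedTuple


class Mutation(NamedTuple):
    chromosome: str
    position: int
    ref: str  # reference nucleotide
    alt: str  # alternate nucleotide


def parse_mutations(text: str) -> List[Mutation]:
    """Parse lines of the form: chromosome position ref alt."""
    mutations = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment/header lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        chrom, pos, ref, alt = line.split()[:4]
        mutations.append(Mutation(chrom, int(pos), ref.upper(), alt.upper()))
    return mutations


# Made-up values, not taken from the lung cancer case on the slide.
example = """# chromosome position ref alt
7 1000 T G
12 2000 C A
"""

for mutation in parse_mutations(example):
    print(mutation)
```

Keeping the chromosome as a string allows values such as X or Y without special handling.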
View of imported data file
Mutation annotation 1 patient - 1 missense mutation
Clinical Trials for Cetuximab
Clinical Trials for biomarkers
AIM: To map biomarkers from patient data to relevant clinical trials
METHOD: Identify how biomarkers are mentioned (referred to) in clinical trials
- Download and index data from clinicaltrials.gov
- Develop dictionaries of biomarkers and methods for detecting these in trial descriptions
- Focus on eligibility
CHALLENGES
- Text mining is difficult!
- Biomarkers are described, or referred to, in many ways
- Ultimately, we want to identify biomarkers related to eligibility, but this is not straightforward
- Complicated logic in inclusion/exclusion criteria, e.g., negation
- Also need to check title, description, and condition for biomarkers
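One reason eligibility is hard: the same biomarker mention means the opposite when it appears under exclusion rather than inclusion. The sketch below splits eligibility text into the two parts, assuming the free text uses "Inclusion Criteria" and "Exclusion Criteria" headings. That heading convention, the `split_criteria` function, and the sample text are assumptions for illustration, not the PubGene implementation, and real trial records do not always follow this layout.

```python
import re
from typing import Dict

# Headings assumed to separate the two parts of the eligibility text.
HEADING = re.compile(r"(inclusion|exclusion)\s+criteria\s*:?", re.IGNORECASE)


def split_criteria(text: str) -> Dict[str, str]:
    """Return the text found under inclusion and exclusion headings."""
    sections = {"inclusion": "", "exclusion": ""}
    matches = list(HEADING.finditer(text))
    for i, match in enumerate(matches):
        # A section runs until the next heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        key = match.group(1).lower()
        sections[key] += text[match.end():end].strip() + "\n"
    return {key: value.strip() for key, value in sections.items()}


# Made-up eligibility text.
sample = (
    "Inclusion Criteria:\n"
    "- Documented EGFR mutation\n"
    "Exclusion Criteria:\n"
    "- Known ALK rearrangement\n"
)
print(split_criteria(sample))
```

Sentence-level negation ("no known EGFR mutation") would still need separate handling inside each section.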
Clinical Trials text data mining
Compiled several lists of biomarkers of different types:
- Single-nucleotide mutations (COSMIC)
- Polymorphisms
- Fusion genes
- Gene regulation (Exp-up/down)
- Copy number changes
Several strategies for finding these in text:
- Detect explicit mentions
- Detect patterns based on gene name and marker type, e.g., GENE amplification, GENE activating mutation
- Curated list of cancer types matched with conditions
Statistics for patterns:
- Expression: 135
- CNV: 32
- Other (positive/negative): 20/10
- Mutation: 37
- Fusion/rearrangement/translocation: 10
Indexing statistics:
- 5350 trials with at least one biomarker
- 855 different biomarkers with hits
- Top markers: BCR/ABL1 (907), ERBB2 positive (725), ERBB2 negative (603), ESR1 positive (467), ERBB2 exp-up (403)
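The "gene name plus marker type" strategy can be sketched with regular expressions. The gene list, the three patterns, and the labels below are a tiny hypothetical subset chosen for illustration; they are not the dictionaries or the pattern set counted in the statistics above.

```python
import re
from typing import List, Tuple

# Illustrative subset of gene names (assumption, not the real dictionary).
GENES = ["ERBB2", "EGFR", "BRAF"]

# Marker type label -> pattern that must follow the gene name (assumptions).
MARKER_PATTERNS = {
    "CNV gain": r"amplifi(?:cation|ed)",
    "Mutation": r"(?:activating\s+)?mutation",
    "Exp-up": r"over-?expression",
}


def find_biomarkers(text: str) -> List[Tuple[str, str]]:
    """Return (gene, marker type) pairs for 'GENE <marker phrase>' mentions."""
    hits = []
    for gene in GENES:
        for label, pattern in MARKER_PATTERNS.items():
            regex = re.compile(
                rf"\b{re.escape(gene)}\s+{pattern}\b", re.IGNORECASE
            )
            if regex.search(text):
                hits.append((gene, label))
    return hits


# Made-up trial sentence.
sentence = "Patients with ERBB2 amplification or an EGFR activating mutation."
print(find_biomarkers(sentence))
```

A pattern like this only catches the gene name directly followed by the marker phrase; reversed or split mentions ("amplification of ERBB2") would need additional patterns.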
Clinical Trials for example case: NSCLC and Erlotinib
Clinical Trials for copy number data (CNV)
Trials matching patient biomarkers and disease
- Cancer type, e.g., NSCLC
- Filter
- Manual curation
- Domain knowledge
- Clinical Trials
- GUI or command line
- Biomarker types: CNA, SNP, EXP, INDEL, SNA, FUSION
Clinical Trials for combined data
- NSCLC
- BRAF G469A
- BRAF D594G
- BRAF V600E
- EGFR T790M
- KIF5B/RET
- CD74/ROS1
- KIF5B/ALK
- BCR/ABL1
Details from Clinical Trial information: NCT01922583
Clinical Trials matching to patient data
Various levels of stringency for matching trial to patient:
- Perfect match
- Other alteration (incl. same effect)
- Same gene (other biomarker)
- Related gene
S = weighted sum of scores
Biomarker-specific scoring models, due to different prioritization of the relevance of other alterations
AIM: To better map/identify other alterations with the same or a similar effect, e.g., amplification/up-regulation with activating mutation
Example: Patient: ERBB2 Exp up
Trial:
1. Perfect match: ERBB2 Exp up
2. Same effect: ERBB2 CNV gain
3. Similar effect: ERBB2 Positive
4. Other alteration: ERBB2 mutation
5. Likely opposite effect: ERBB2 Neg.
6. Opposite effect: ERBB2 Exp down or ERBB2 CNV loss
7. Gene only: ERBB2
8. Related gene: EGFR
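A minimal sketch of the weighted sum S, assuming each patient biomarker is matched to a trial at one of the eight levels in the example. The numeric level scores and the per-biomarker weights are hypothetical: the slide gives only the ranking of the levels, not the values, and the biomarker-specific models are reduced here to a single weight per biomarker.

```python
from typing import Dict, List, Tuple

# Hypothetical scores; only their order follows the ranking in the example.
LEVEL_SCORES: Dict[str, float] = {
    "perfect_match": 1.0,
    "same_effect": 0.8,
    "similar_effect": 0.7,
    "other_alteration": 0.5,
    "likely_opposite_effect": 0.4,
    "opposite_effect": 0.3,
    "gene_only": 0.2,
    "related_gene": 0.1,
}


def trial_score(
    matches: List[Tuple[str, str]],
    biomarker_weights: Dict[str, float],
) -> float:
    """S = sum over matched patient biomarkers of weight * level score."""
    return sum(
        biomarker_weights.get(biomarker, 1.0) * LEVEL_SCORES[level]
        for biomarker, level in matches
    )


# Patient has ERBB2 Exp up; three made-up trials match at different levels.
trials = {
    "trial_A": [("ERBB2 Exp up", "perfect_match")],
    "trial_B": [("ERBB2 Exp up", "same_effect")],
    "trial_C": [("ERBB2 Exp up", "related_gene")],
}
weights = {"ERBB2 Exp up": 1.0}
ranked = sorted(trials, key=lambda t: trial_score(trials[t], weights), reverse=True)
print(ranked)
```

Sorting by S in descending order puts the trial with the perfect match first and the related-gene match last.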
Clinical NLP in PubGene - examples
- PubGene in Ahus Optique
Akershus University Hospital (Ahus) Optique project
Increase patient safety by providing easier access to existing information
Human touch and empathy with professional skill
Courtesy of DNV-GL (Tore Hartvigsen)
The Surgery Planning Form is completed in 3 stages
Surgery Planning Form (The Green Form):
- Stage 1: Examination
- Stage 2: Preparations
- Stage 3: Check/QA
Systems:
- DIPS Ahus (structured data, text)
- Metavision Ahus (Metavision O, Metavision I, Metavision DKS)
- Additional systems
To complete the form, data must be collected from a number of systems! Today this is done manually.
Leave the data in the source systems!
- Users: expert users, «ordinary» users, researchers/analysts
- A semantic IT solution and ontology for clinical use in health care
- Data warehousing is an option
- Ahus research databases
- Ahus production databases: DIPS (EPJ), Metavision O, Metavision I, Metavision DKS
We want to «lift» the data out of the silos!
- Users: expert users, «ordinary» users
- A semantic IT solution and ontology for clinical use in health care
- Structured data
- Unstructured data (text)
- Solutions provided by the Optique project
- Text mining
PubGene in Ahus Optique, information extraction
Unstructured information: Height 1,83 m
Fields: ASA, BMI, Height, Weight, Pulse, Blood pressure, Temperature, Diagnosis codes, Treatment codes
Structured information: name=height, type=int, unit=cm, value=183
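The height example can be sketched as a single regular expression plus unit normalization. This is an illustration under stated assumptions, not the Optique implementation: the keyword list (including the Norwegian "høyde"), the accepted units, and the `extract_height` function are all hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Keyword, optional colon, a number with '.' or ',' as decimal mark, and a unit.
HEIGHT_RE = re.compile(
    r"\b(?:height|høyde)\s*:?\s*(\d+(?:[.,]\d+)?)\s*(cm|m)\b",
    re.IGNORECASE,
)


def extract_height(text: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return a structured height field in cm, or None if no height is found."""
    match = HEIGHT_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    if match.group(2).lower() == "m":
        value *= 100  # normalize metres to centimetres
    return {"name": "height", "type": "int", "unit": "cm", "value": round(value)}


print(extract_height("Height 1,83 m"))
```

The other fields on the slide (weight, pulse, blood pressure, and so on) would each need their own keyword, unit, and plausibility rules.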
PubGene in Ahus Optique, allergy information
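PubGene in Ahus Optique, status on smoking
Sentence (English gloss) -> Status
- Røyker. (Smokes.) -> Yes
- Røyker 15-20 om dagen. (Smokes 15-20 a day.) -> Yes
- Ifølge datter er han også storrøyker, 40/ dag siste 50 år. (According to his daughter he is also a heavy smoker, 40 a day for the last 50 years.) -> Yes
- Røykeplaster? (Nicotine patch?) -> Uncertain
- Tidligere storrøyker. (Former heavy smoker.) -> Stopped
- Ikke røyker og drikker ikke alkohol, tidligere, måteholdent alkoholbruk. (Non-smoker and does not drink alcohol; previously, moderate alcohol use.) -> No
- Eks-røyker, lite alkohol. (Ex-smoker, little alcohol.) -> Stopped
Text analysis:
- Separate the text into sentences and detect sentences containing røyke, røyki, røykt
- Classify sentences based on recognition of keywords and word or sentence patterns
NB: Based on a small database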
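The two text analysis steps can be sketched as follows. The stems come from the slide; the negation and "stopped" keyword lists and the rule order are assumptions chosen so that the seven example sentences receive the statuses shown, not the actual rule set, and they would not generalize far beyond such a small sample.

```python
import re
from typing import List, Optional

SMOKING_STEMS = ("røyke", "røyki", "røykt")  # stems listed on the slide
NEGATION_PATTERNS = ("ikke røyk", "røyker ikke")  # assumed keywords
STOPPED_KEYWORDS = ("tidligere", "eks-røyk", "sluttet")  # assumed keywords


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


def smoking_status(sentence: str) -> Optional[str]:
    """Classify one sentence; None means it does not mention smoking."""
    s = sentence.lower()
    if not any(stem in s for stem in SMOKING_STEMS):
        return None
    if s.rstrip().endswith("?"):
        return "Uncertain"
    # Negation is checked before 'tidligere', which may refer to alcohol.
    if any(pattern in s for pattern in NEGATION_PATTERNS):
        return "No"
    if any(keyword in s for keyword in STOPPED_KEYWORDS):
        return "Stopped"
    return "Yes"


note = "Røyker 15-20 om dagen. Røykeplaster? Tidligere storrøyker."
for sentence in split_sentences(note):
    print(sentence, "->", smoking_status(sentence))
```

The splitter would break on abbreviations and decimal points followed by a space, so a clinical-text tokenizer would be the more robust choice in practice.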
Ahus Optique screenshots
Courtesy of DNV-GL (Tore Hartvigsen)
Page for surgery planning form
BMI
Allergy
Smoking
Surgery planning form
Further development, text processing/analysis
A large set of options and potential:
- Far more effective collection of more relevant information, e.g., by filling in surgery forms (The Green Form)
- Improved quality through automatic detection of errors in documents and consistency checks against structured data
Further steps for Ahus Optique:
- Simple: extraction of more static fields, like lab results
- Information about medication
- Information on heart function, lung function
- Exploit document structure and information on document types
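One possible consistency check against structured data, sketched under assumptions: BMI is recomputed from height and weight values extracted from text and compared with the BMI stored in the structured record. The function names and the tolerance of 0.5 are hypothetical; only the BMI formula (weight in kg divided by the square of height in m) is standard.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / (height_m ** 2)


def bmi_is_consistent(
    recorded_bmi: float,
    weight_kg: float,
    height_cm: float,
    tolerance: float = 0.5,  # assumed threshold, allows for rounding
) -> bool:
    """True if the recorded BMI agrees with the extracted height and weight."""
    return abs(bmi(weight_kg, height_cm) - recorded_bmi) <= tolerance


# Made-up values: height 183 cm and weight 80 kg give a BMI of about 23.9.
print(bmi_is_consistent(23.9, 80, 183))
print(bmi_is_consistent(29.3, 80, 183))
```

A failed check does not say which value is wrong, only that the document and the structured record should be reviewed.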