Schema-Driven Relationship Extraction from Unstructured Text


Wright State University CORE Scholar
Kno.e.sis Publications, The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis), 2007
Schema-Driven Relationship Extraction from Unstructured Text
Cartic Ramakrishnan, Wright State University - Main Campus
Follow this and additional works at: http://corescholar.libraries.wright.edu/knoesis
Part of the Bioinformatics Commons, Communication Technology and New Media Commons, Databases and Information Systems Commons, OS and Networks Commons, and the Science and Technology Studies Commons
Repository Citation: Ramakrishnan, C. (2007). Schema-Driven Relationship Extraction from Unstructured Text. http://corescholar.libraries.wright.edu/knoesis/266
This Presentation is brought to you for free and open access by The Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis) at CORE Scholar. It has been accepted for inclusion in Kno.e.sis Publications by an authorized administrator of CORE Scholar. For more information, please contact corescholar@www.libraries.wright.edu.

Schema-Driven Relationship Extraction from Unstructured Text Cartic Ramakrishnan Kno.e.sis Center, Wright State University, Dayton, OH

Outline
  Motivation
  Problem Description & Approach
  Results
  Future Work

Anecdotal Example: Undiscovered Public Knowledge
Discovering connections hidden in text
[Figure: a graph connecting Harry Potter, Nicolas Flamel, Nicolas Poussin, Victor Hugo, Leonardo Da Vinci, The Hunchback of Notre Dame, Holy Blood, Holy Grail, The Da Vinci Code, the Priory of Sion, Et in Arcadia Ego, The Last Supper, The Mona Lisa, The Vitruvian Man, The Louvre, and Santa Maria delle Grazie through the relationships mentioned_in, member_of, painted_by, written_by, displayed_at, and cryptic_motto_of.]

Motivation 1: Undiscovered Public Knowledge in Biology
[Figure: Swanson's discoveries in PubMed. Migraine and Magnesium are linked (?) through Stress, Calcium Channel Blockers, and Spreading Cortical Depression.]
These associations were discovered in 1986.
Associations were discovered based on keyword searches, followed by manual analysis of text to establish possible relevant relationships.

Motivation 2: Hypothesis-Driven Retrieval of Scientific Literature
Keyword query: Migraine[MH] + Magnesium[MH]
Complex query: [Figure: a hypothesis graph over Migraine, Stress, Magnesium, Patient, and Calcium Channel Blockers with the relationships affects, inhibit, and isa.]
The query is run against PubMed and supporting document sets are retrieved.

Motivation 3: Growth Rate of Public Knowledge
Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
How much is that? Compare it to the estimate of the total words ever spoken by humans = 12 exabytes.
A small but significant portion is text data:
  PubMed: 16 million abstracts
  MedlinePlus: health information
  OMIM: catalog of human genes and genetic disorders
Undiscovered public knowledge may have also increased by a large amount.

Our past work in Connection Discovery
Semantic Associations over RDF graphs: discovery and ranking
Assumption: rich semantic metadata containing entities related by a diverse set of relationships.
It is therefore critical to bridge the gap between unstructured and structured data by extracting entities and relationships between entities, resulting in semantic metadata.
[Figure: semantically connected entities Migraine, Magnesium, Stress, Patient, and Calcium Channel Blockers with the relationships affects, inhibit, and isa.]

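Problem: Extracting relationships between MeSH terms from PubMed
[Figure: three layers.
  UMLS Semantic Network: the classes Biologically active substance, Lipid, and Disease or Syndrome, with the relationships complicates, causes, and affects.
  MeSH: Fish Oils (instance_of Lipid) and Raynaud's Disease (instance_of Disease or Syndrome), with the relationship between them marked "???????".
  PubMed: document sets of 9284 documents, 5 documents, and 4733 documents.]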

Background knowledge used
UMLS
  A high-level schema of the biomedical domain
  136 classes and 49 relationships
  Synonyms of all relationships obtained using variant lookup (tools from NLM)
  49 relationships + their synonyms = ~350, mostly verbs
  Example variants: T147 effect, T147 induce, T147 etiology, T147 cause, T147 effecting, T147 induced
MeSH
  22,000+ topics organized as a forest of 16 trees
  Used to query PubMed
PubMed
  Over 16 million abstracts
  Abstracts annotated with one or more MeSH terms
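The variant lookup described above can be pictured as a table from surface verb forms to a relationship identifier. The following is a minimal sketch, assuming a plain dictionary stands in for the NLM variant tools; only the six T147 entries come from the slide, and `RELATION_VARIANTS` and `lookup_relation` are hypothetical names introduced for illustration.

```python
# Hypothetical variant table: surface form -> UMLS Semantic Network relation ID.
# Only these six T147 entries appear on the slide; the full table (~350 entries) is not shown.
RELATION_VARIANTS = {
    "effect": "T147",
    "induce": "T147",
    "etiology": "T147",
    "cause": "T147",
    "effecting": "T147",
    "induced": "T147",
}


def lookup_relation(token):
    """Return the relation ID for a token, or None if it is not a known variant."""
    return RELATION_VARIANTS.get(token.lower())


print(lookup_relation("Induced"))
print(lookup_relation("binds"))
```

A token that is not in the table simply yields `None`, so a sentence contributes a named relationship only when its verb is a known variant.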

Method: Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)

Example parse:
  (TOP
    (S
      (NP
        (NP (DT An) (JJ excessive)
            (ADJP (JJ endogenous) (CC or) (JJ exogenous))
            (NN stimulation))
        (PP (IN by) (NP (NN estrogen))))
      (VP (VBZ induces)
        (NP
          (NP (JJ adenomatous) (NN hyperplasia))
          (PP (IN of) (NP (DT the) (NN endometrium)))))))

Entities (MeSH terms) in sentences occur in modified forms:
  "adenomatous" modifies "hyperplasia"
  "An excessive endogenous or exogenous stimulation" modifies "estrogen"
Entities can also occur as composites of 2 or more other entities:
  "adenomatous hyperplasia" and "endometrium" occur as "adenomatous hyperplasia of the endometrium"
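To make the bracketed parser output easier to work with, it can be read into a nested structure. The sketch below is an illustration only, not the SS-Parser API: `parse_tree` and `leaves` are hypothetical helpers, and the tree string is the example sentence from the slide with its brackets restored.

```python
def parse_tree(text):
    """Parse a Penn-Treebank-style bracketed string into (label, children) tuples.

    Leaves are plain strings; every other node is a (label, [children]) pair.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


SENTENCE_TREE = """
(TOP (S (NP (NP (DT An) (JJ excessive)
                (ADJP (JJ endogenous) (CC or) (JJ exogenous))
                (NN stimulation))
            (PP (IN by) (NP (NN estrogen))))
        (VP (VBZ induces)
            (NP (NP (JJ adenomatous) (NN hyperplasia))
                (PP (IN of) (NP (DT the) (NN endometrium)))))))
"""

tree = parse_tree(SENTENCE_TREE)
print(" ".join(leaves(tree)))
```

Reading the leaves back in order recovers the original sentence, which is a quick check that the brackets were parsed consistently.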

Method: Identify Entities and Relationships in the Parse Tree
[Figure: the parse tree of the example sentence (TOP, S, NP, ADJP, PP, VP nodes over "An excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium"), with modifiers, modified entities, and composite entities highlighted.]

Entities: the simple, the modified, and the composite
To capture the various types of entities we define:
Simple entities: MeSH terms.
Modifiers: siblings of entities that are
  Determiners, e.g. "Y induces no X"
  Noun phrases, e.g. "An excessive endogenous or exogenous stimulation"
  Adjective phrases, e.g. "adenomatous"
  Prepositional phrases, e.g. "M is induced by the X in the Z"
Modified entities: any entity that has a sibling which is a modifier.
Composite entities: any entity that has another entity as a sibling.
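The sibling-based definitions above can be sketched as a small classifier over one group of sibling nodes. This is a simplified reading, not the author's implementation: a subtree counts as an entity when it contains a word from a toy MeSH vocabulary, `JJ` is treated as a modifier label so that a bare adjective such as "adenomatous" qualifies, and `MESH_TERMS`, `MODIFIER_LABELS`, and `classify` are all assumptions made for the example.

```python
# Assumed toy vocabulary; the real system uses MeSH (22,000+ topics).
MESH_TERMS = {"estrogen", "hyperplasia", "endometrium"}
# Assumed set of parse labels that may act as modifiers.
MODIFIER_LABELS = {"DT", "JJ", "ADJP", "NP", "PP"}


def words(node):
    """Collect the words under a (label, children) node."""
    _, children = node
    out = []
    for child in children:
        out.extend([child] if isinstance(child, str) else words(child))
    return out


def contains_entity(node):
    return any(w.lower() in MESH_TERMS for w in words(node))


def classify(siblings):
    """Label a group of sibling nodes as 'composite', 'modified', 'simple', or None."""
    entities = [n for n in siblings if contains_entity(n)]
    if not entities:
        return None
    if len(entities) >= 2:
        return "composite"  # an entity with another entity as a sibling
    others = [n for n in siblings if not contains_entity(n)]
    if any(n[0] in MODIFIER_LABELS for n in others):
        return "modified"  # an entity with a modifier as a sibling
    return "simple"


hyperplasia_np = ("NP", [("JJ", ["adenomatous"]), ("NN", ["hyperplasia"])])
endometrium_pp = ("PP", [("IN", ["of"]),
                         ("NP", [("DT", ["the"]), ("NN", ["endometrium"])])])

print(classify(hyperplasia_np[1]))
print(classify([hyperplasia_np, endometrium_pp]))
```

Under these assumptions the inner noun phrase is classified as a modified entity and the pair of siblings as a composite entity, in line with the slide's example.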

Resulting RDF
[Figure: an RDF graph for the example sentence.
  Nodes: modified_entity1, modified_entity2, composite_entity1, estrogen, adenomatous hyperplasia, endometrium, and the modifier "An excessive endogenous or exogenous stimulation".
  Edges: hasmodifier (2), haspart (4), and induces (from modified_entity1 to composite_entity1).
  Legend: Modifiers, Modified entities, Composite Entities.]
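One plausible reading of this graph can be written out as triples. The sketch below is an assumption about how the edges connect (the flattened figure does not preserve that), the namespace is hypothetical, and modifiers are treated as literals so the output separates instance-property-instance from instance-property-literal statements.

```python
NS = "http://example.org/extraction#"  # hypothetical namespace, not from the slides

# Assumed edge assignment: each modified entity has one modifier and one part,
# and the composite entity has two parts.
TRIPLES = [
    ("modified_entity1", "hasmodifier", "An excessive endogenous or exogenous stimulation"),
    ("modified_entity1", "haspart", "estrogen"),
    ("modified_entity2", "hasmodifier", "adenomatous"),
    ("modified_entity2", "haspart", "hyperplasia"),
    ("composite_entity1", "haspart", "modified_entity2"),
    ("composite_entity1", "haspart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]


def to_ntriple(subject, predicate, obj):
    """Format one triple in an N-Triples-like line; modifiers become literals."""
    s = f"<{NS}{subject}>"
    p = f"<{NS}{predicate}>"
    o = f'"{obj}"' if predicate == "hasmodifier" else f"<{NS}{obj}>"
    return f"{s} {p} {o} ."


for triple in TRIPLES:
    print(to_ntriple(*triple))

literal_count = sum(1 for _, p, _ in TRIPLES if p == "hasmodifier")
instance_count = len(TRIPLES) - literal_count
```

With this reading the sentence yields five instance-property-instance statements and two instance-property-literal statements.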

Results: Dataset 1, Swanson's discoveries
Associations between Migraine and Magnesium [Hearst99]:
  stress is associated with migraines
  stress can lead to loss of magnesium
  calcium channel blockers prevent some migraines
  magnesium is a natural calcium channel blocker
  spreading cortical depression (SCD) is implicated in some migraines
  high levels of magnesium inhibit SCD
  migraine patients have high platelet aggregability
  magnesium can suppress platelet aggregability

Results: Creation of Dataset 1
Keyword pairs (e.g. stress + migraine) are run against PubMed; each query returns the PubMed abstracts that are annotated (by NLM) with both terms.
The 8 pairs of terms in this scenario result in 8 subsets of PubMed.
Semantic metadata is represented in RDF:
  With complex entities and the relationships connecting them
  With pointers to the original document and sentence
  Size: ~2MB of RDF for the Migraine-Magnesium subset of PubMed
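As an illustration of the keyword-pair step, the sketch below builds one query string per pair. It assumes the "+" on the slides means a boolean AND and uses PubMed's `[MH]` (MeSH Terms) field tag; `mesh_pair_query` is a hypothetical helper, and only the two pairs named on the slides are used.

```python
def mesh_pair_query(term_a, term_b):
    """Build a PubMed query matching abstracts annotated with both MeSH terms."""
    def tagged(term):
        # Quote multi-word terms so the field tag applies to the whole phrase.
        text = f'"{term}"' if " " in term else term
        return f"{text}[MH]"

    return f"{tagged(term_a)} AND {tagged(term_b)}"


# Only pairs that are named on the slides; the remaining pairs are not listed there.
PAIRS = [("stress", "migraine"), ("migraine", "magnesium")]

for a, b in PAIRS:
    print(mesh_pair_query(a, b))
```

Each resulting string would be submitted as one search, giving one subset of PubMed per pair.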

Evaluating the Result of Extraction
Ideal method to evaluate the extraction:
  Domain experts read a set of abstracts, given a set of relationship names and entities to look for.
  In addition, they are given the extracted triples and entities.
  For every abstract, the expert judge counts the correct, incorrect, and missed triples.
  Precision and recall are then measured.
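Given an expert's counts of correct, incorrect, and missed triples, precision and recall follow directly. The sketch below uses the standard definitions; the function name and the counts in the example are made up for illustration and are not results from this work.

```python
def precision_recall(correct, incorrect, missed):
    """Compute precision and recall from an expert's triple counts.

    precision = correct / (correct + incorrect)   # share of extracted triples that are right
    recall    = correct / (correct + missed)      # share of true triples that were extracted
    """
    extracted = correct + incorrect
    relevant = correct + missed
    precision = correct / extracted if extracted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Invented counts for a single abstract, for illustration only.
precision, recall = precision_recall(correct=8, incorrect=2, missed=8)
print(precision, recall)
```

Counts from several abstracts can be summed before calling the function to obtain corpus-level figures.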

Evaluating the Result of Extraction
In the absence of a domain expert, we focus on getting a feel for the utility of the extracted data.
We know the associations manually discovered between Migraine and Magnesium.
We locate paths of various lengths between them and manually inspect these paths.
If the paths are indicative of the manually discovered associations, the extracted data is useful.

Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than haspart or hasmodifier.
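The path criterion can be sketched as a bounded search over the extracted triples followed by a filter. This is an illustration under assumptions: edges are traversed in both directions, paths are limited to a few edges, and the toy graph and the names `find_paths` and `is_interesting` are invented for the example rather than taken from the extracted dataset.

```python
from collections import defaultdict

STRUCTURAL = {"haspart", "hasmodifier"}


def find_paths(triples, start, goal, max_len=4):
    """Enumerate simple paths of at most max_len edges, ignoring edge direction.

    Each path is a list of (node, relation, next_node) steps in traversal order.
    """
    adjacency = defaultdict(list)
    for s, p, o in triples:
        adjacency[s].append((p, o))
        adjacency[o].append((p, s))

    paths = []

    def dfs(node, visited, path):
        if node == goal:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for relation, neighbor in adjacency[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                path.append((node, relation, neighbor))
                dfs(neighbor, visited, path)
                path.pop()
                visited.remove(neighbor)

    dfs(start, {start}, [])
    return paths


def is_interesting(path):
    """A path is interesting if it has a relationship other than haspart/hasmodifier."""
    return any(relation not in STRUCTURAL for _, relation, _ in path)


# Invented toy graph, for illustration only.
TOY_TRIPLES = [
    ("me_1", "haspart", "migraine"),
    ("me_1", "caused_by", "me_2"),
    ("me_2", "haspart", "magnesium"),
    ("ce_1", "haspart", "migraine"),
    ("ce_1", "haspart", "magnesium"),
]

all_paths = find_paths(TOY_TRIPLES, "migraine", "magnesium")
interesting = [p for p in all_paths if is_interesting(p)]
```

In the toy graph the path through `ce_1` uses only haspart edges and is filtered out, while the path through `me_1` and `me_2` is kept because it contains caused_by.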

An example of such a path
[Figure: a path from migraine (D008881) to magnesium (D008274) through platelet (D001792) and collagen (D003094), using the relationships stimulated, caused_by, and haspart, and the nodes
  me_2286: _13%_and_17%_adp_and_collagen_induced_platelet_aggregation
  me_3142: by_a_primary_abnormality_of_platelet_behavior]

Results: Dataset 2, Neoplasms (C04)
For the subtree of MeSH rooted at Neoplasms, all topics under this subtree are used as query terms against PubMed.
The resulting dataset contains ~500,000 PubMed abstracts.
The extraction process run on this data returns ~150MB of RDF.
Processing the tagged and parsed sentences for Dataset 2 (Neoplasms) to generate RDF took approx. 5 minutes.
Stats:
  211 different named relationships found
  500,000 instance-property-instance statements
  260,000 instance-property-literal statements
Currently setting up to extract RDF from all of PubMed.

Future Extensions to the Extraction Process
Short-term goals (1 month):
  MeSH qualifiers (blood pressure, contraindications)
  Curate and release the Migraine-Magnesium RDF
Long-term goals:
  More complex structures
    Conjunctions
    "X causes Y to inhibit Z"
  Rule-action language to test new extraction rules
  Finding new terms to enrich existing vocabularies
  Perhaps ontology enrichment

The projected future of research in Biology
From: hypothesis-driven wet lab experiments
To: data-driven reduction/pruning of the hypothesis space, leading to new insight and possibly discovery
What challenges does this transition bring?

Use of Generated Semantic Metadata
Semantic browsing of PubMed based on named relationships between MeSH terms
Path/hypothesis-based document retrieval
Knowledge discovery from literature:
  Corpus-based complex relationship discovery and ranking
  Corpus-based relevant connection subgraph discovery

Support such retrieval and discovery operations across multiple data sources
Extract semantic metadata about entities in all of these databases that might occur in PubMed text.
The resulting metadata will contain relationships between genes (OMIM), diseases (MeSH), and nucleotide anomalies (SNP), supporting hypothesis validation and knowledge discovery in biology.

THANK YOU!