Data driven Ontology Alignment. Nigam Shah


Data driven Ontology Alignment Nigam Shah nigam@stanford.edu

What is Ontology Alignment?
- Alignment = the identification of near-synonymy relationships between terms from different ontologies.
- Mapping = the identification of some relationship between terms from different ontologies.
- Alignment (CS) = the process of detecting potential mappings.
[Diagram: Alignment, Mapping, Alignment (CS)]
25-Jul-06

Approaches to alignment
- Pre-defined, during the process of creation of the ontology
  - The OBO Foundry paradigm (http://obofoundry.org)
  - Authors discuss, argue, vote and reach a consensus
  - Takes a long time!
- Post-hoc, after the relevant ontologies have been in use for some time
  - Human curated: does not scale
  - Algorithm driven (PROMPT, FOAM)
  - Data driven (which we discuss today)

Steps in Alignment (CS)
- Anchor identification
  - Identify similar class labels in the ontologies to be aligned
  - Usually done by string matching
- Ontology structure
  - Use the similar classes as anchors and examine the local [graph] structure around them to inform the similarity metric
[Diagram: two ontology graphs, one with nodes Root and Term-1 to Term-5, the other with nodes R and t1 to t7]
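The two steps above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by PROMPT, FOAM, or this project: the label normalization, the parent-overlap score, and the toy ontologies are all assumptions made for the example.

```python
import re

def normalize(label: str) -> str:
    """Lowercase a class label and turn punctuation/underscores into single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())

def find_anchors(labels_a, labels_b):
    """Anchor identification: pair labels whose normalized strings are identical."""
    index = {}
    for b in labels_b:
        index.setdefault(normalize(b), []).append(b)
    return [(a, b) for a in labels_a for b in index.get(normalize(a), [])]

def neighbor_overlap(a, b, parents_a, parents_b, anchors):
    """Structural step (simplified): fraction of a's parents anchored to a parent of b."""
    pa, pb = parents_a.get(a, set()), parents_b.get(b, set())
    if not pa:
        return 0.0
    anchored = set(anchors)
    hits = sum(1 for x in pa if any((x, y) in anchored for y in pb))
    return hits / len(pa)

# Hypothetical toy ontologies (invented for illustration only).
labels_a = ["Bladder_Carcinoma", "Carcinoma", "Sarcoma"]
labels_b = ["bladder carcinoma", "carcinoma", "Lymphoma"]
parents_a = {"Bladder_Carcinoma": {"Carcinoma"}}
parents_b = {"bladder carcinoma": {"carcinoma"}}

anchors = find_anchors(labels_a, labels_b)
score = neighbor_overlap("Bladder_Carcinoma", "bladder carcinoma",
                         parents_a, parents_b, anchors)
print(anchors, score)
```

Real alignment tools use richer string similarity and look at more of the graph than direct parents; the point here is only the order of operations: anchors first, structure second.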

How can the annotated data help?
- Provide anchors from annotated data
- Ontology [graph] structure based step
[Diagram: the two ontology graphs (Root, Term-1 to Term-5; R, t1 to t7) with the pairs Term-2 / t1 and Term-5 / t5 highlighted]

Annotated data (biomedical)
- Annotation = a statement declaring a relationship between a biomedical thing and a term [class name] (or an instance of a class) from an ontology, e.g. p53 <associated_with> cell death.
- Annotations tell us what biologists believe to be true (in particular or in general).
- Most annotations are created after particular observations and are then generalized during interpretation by a biologist.
- Annotations of clinical / medical data are usually NOT generalized but remain at the particular (or instance) level.

Example annotated data set
Each donor block in the TMA has semi-structured text associated with it.

ID    Organ        Diagnosis  Subclass 1         Subclass 2            Subclass 3      Subclass 4
2334  Ovary        MMMT
3335  Prostate     Carcinoma  Adeno              intraductal
7022  Bladder      Carcinoma  Transitional cell  In situ
7288  Testis       teratoma   immature           Embryonal carcinoma
8060  Liver        Carcinoma  hepatocellular     No vascular invasion  HepC cirrhosis
6662  Soft tissue  Sarcoma    Leiomyo            epithelioid
6663  lung         Sarcoma    Leiomyo            epithelioid
4713  stomach      carcinoma  unknown

Map text to ontology terms
- Make all possible permutations
- Rules to weed out bad permutations
- Check for an exact match with NCI and SNOMED-CT terms (and/or synonyms)
- Rules to weed out bad matches

Example: "Prostate Carcinoma Adeno intraductal" gives 24 permutations, which map to Prostate_Ductal_Adenocarcinoma:
  Prostate Carcinoma Adeno intraductal
  :
  Carcinoma Prostate intraductal Adeno
  :
  Adeno Carcinoma intraductal Prostate
  :
  Prostate intraductal Adeno Carcinoma
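The permutation-and-lookup idea can be sketched as follows. This is an assumption-laden illustration: the synonym table is invented (the real lookup was against NCI Thesaurus and SNOMED-CT terms and synonyms), only full-length orderings are tried, and the slide's rules for weeding out bad permutations and bad matches are not specified, so they are omitted.

```python
from itertools import permutations

def candidate_strings(tokens):
    """Yield every ordering of the annotation tokens as one lowercase string."""
    for perm in permutations(tokens):
        yield " ".join(perm).lower()

def map_to_terms(tokens, synonym_to_term):
    """Return ontology terms whose name or synonym exactly equals some permutation."""
    matches = set()
    for text in candidate_strings(tokens):
        term = synonym_to_term.get(text)
        if term is not None:
            matches.add(term)
    return sorted(matches)

# Hypothetical synonym table, invented for this example.
synonym_to_term = {
    "intraductal adeno carcinoma prostate": "Prostate_Ductal_Adenocarcinoma",
    "hepatocellular carcinoma liver": "Hepatocellular_Carcinoma",
}

tokens = ["Prostate", "Carcinoma", "Adeno", "intraductal"]
n_permutations = sum(1 for _ in candidate_strings(tokens))  # 4! orderings
matched = map_to_terms(tokens, synonym_to_term)
print(n_permutations, matched)
```

Four tokens give 4! = 24 orderings, matching the count on the slide. The number of orderings grows factorially with the number of tokens, which is one reason an index-based lookup is attractive for longer annotations.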

Sample matches
Each record shows ID, then Organ, Diagnosis, Subclass 1, Subclass 2, Subclass 3, followed by the matched ontology terms.

2334  Ovary, MMMT
      Malignant_Mixed_Mesodermal_Mullerian_Tumor
3335  Prostate, Carcinoma, Adeno, intraductal
      Prostate_Ductal_Adenocarcinoma
7022  Bladder, Carcinoma, Transitional cell, In situ
      Stage_0_Transitional_Cell_Carcinoma
      Transitional_Cell_Carcinoma
      Bladder_Carcinoma
      Carcinoma_in_situ
7288  Testis, teratoma, immature, Embryonal carcinoma
      Immature Teratoma
      Testicular_Embryonal_Carcinoma
      Immature_Teratoma
8060  Liver, Carcinoma, hepatocellular, No vascular invasion, HepC cirrhosis
      Hepatocellular_Carcinoma
6662  Soft tissue, Sarcoma, Leiomyo, epithelioid
      Soft_Tissue_Sarcoma
      Leiomyosarcoma
      Epithelioid_Sarcoma
6663  lung, Sarcoma, Leiomyo, epithelioid
      Lung_Sarcoma
      Leiomyosarcoma
      Epithelioid_Sarcoma
4713  stomach, carcinoma, unknown
      Gastric_carcinoma

Some boring results (and validation)
- Mapped the term-sets for 8495 records, which correspond to 783 distinct term-sets.
- 577 term-sets (6614 records) matched to the NCI Thesaurus.
- 365 term-sets (3465 records) matched to SNOMED-CT.
- In total, mapped 6871 (80%) of the annotated records in TMAD (641 distinct term-sets) to one or more ontology terms.

Validation
             NCI                         SNOMED-CT
             Appropriate  Inappropriate  Appropriate  Inappropriate
Set-1        41           9              41           9
Set-2        42           8              43           7
Set-3        46           4              38           12
Total        129          21             122          28
Average (%)  43.0 (86%)   7.0 (14%)      40.66 (81%)  9.33 (19%)
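The percentages on this slide can be re-derived from the raw counts. The short check below uses only numbers from the slide; the variable names are illustrative.

```python
# Coverage: mapped records out of all annotated records.
records_total = 8495
records_mapped = 6871
coverage = records_mapped / records_total

# Validation totals as (appropriate, inappropriate) over the three sets.
validation = {"NCI": (129, 21), "SNOMED-CT": (122, 28)}
rates = {name: ok / (ok + bad) for name, (ok, bad) in validation.items()}

print(f"coverage: {coverage:.1%}")
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} appropriate")
```

Each validation set has 50 matches per ontology (for example 41 + 9), so the three-set totals are out of 150.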

Context for the project

Click on the Red Node link to get data

Annotations performed using multiple ontologies are the key
[Diagram: sample S1 annotated with Term-2 and t1; sample S2 annotated with Term-5 and t5]
- The relationship [blue arrows] embodied in this annotation is fuzzy, but that's life.
- However, (depending on the data) this gives a way to say:
    Term-2 <is synonymous to> t1
    Term-5 <is synonymous to> t5
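One way to read this slide is that any sample annotated with terms from both ontologies proposes candidate anchor pairs. The sketch below counts such co-annotations; the record layout, the `min_support` threshold, and the sample S3 are assumptions added for illustration and are not taken from the slides.

```python
from collections import Counter

def coannotation_anchors(records, min_support=1):
    """Count (ontology-A term, ontology-B term) pairs that annotate the same record.

    records maps a sample ID to (terms from ontology A, terms from ontology B).
    Pairs seen on fewer than min_support records are dropped.
    """
    counts = Counter()
    for terms_a, terms_b in records.values():
        for a in terms_a:
            for b in terms_b:
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Toy records modelled on the slide's S1 and S2 (S3 is hypothetical).
records = {
    "S1": ({"Term-2"}, {"t1"}),
    "S2": ({"Term-5"}, {"t5"}),
    "S3": ({"Term-5"}, {"t5"}),
}

all_pairs = coannotation_anchors(records)
strong_pairs = coannotation_anchors(records, min_support=2)
print(all_pairs, strong_pairs)
```

Because the annotation relationship is fuzzy, a record with several terms on each side yields every cross pair; a support threshold is one simple way to prefer pairs that recur across samples.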

How good are the anchors?
Strategy: evaluate against a manually defined gold standard [UMLS]
- Find the CUI of the NCI term (Nt) from the UMLS.
- Find the CUI of the SNOMED-CT term (St) from the UMLS.
- Examine if the CUIs are the same or within two links of each other.
Results: the CUIs were
- identical for 2335 records
- at one link from each other for 403 records
- at two links from each other for 189 records.
Overall, Nt-St pairs from 2927 records (= 259 distinct terms) were appropriately aligned. [259 = 88%]
The CUIs for the Nt-St pairs for 281 records (corresponding to 36 distinct terms) were separated by more than two links.
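The "within two links" test amounts to a bounded shortest-path search over concept relations. The sketch below uses a breadth-first search over a plain adjacency dictionary; the graph and the concept identifiers are hypothetical, and it does not use any UMLS API or file format.

```python
from collections import deque

def link_distance(graph, start, goal, max_links=2):
    """Return the number of links between two concepts, or None if it exceeds max_links."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_links:
            continue  # do not expand beyond the allowed number of links
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical concept graph: a chain A - B - C - D of related concepts.
graph = {
    "CUI-A": {"CUI-B"},
    "CUI-B": {"CUI-A", "CUI-C"},
    "CUI-C": {"CUI-B", "CUI-D"},
    "CUI-D": {"CUI-C"},
}

for target in ("CUI-A", "CUI-B", "CUI-C", "CUI-D"):
    print(target, link_distance(graph, "CUI-A", target))
```

A pair would count as appropriately aligned when the function returns 0, 1, or 2, and as too distant when it returns None. The record counts on the slide are consistent with this breakdown: 2335 + 403 + 189 = 2927.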

We might improve alignment
- Provide anchors from annotated data
- Ontology [graph] structure based step
[Diagram: sample S2 annotated with Term-5 and t5, shown next to the two ontology graphs (Root, Term-1 to Term-5; R, t1 to t7) with the pairs Term-2 / t1 and Term-5 / t5 highlighted]

Better Text-mapping, Better Alignment

                                        2/17/2006  7/23/2006
Distinct Terms                          783        791
Distinct Terms with NCI match           577        620
Distinct Terms with SNOMEDCT match      365        610
Distinct Terms with any match           641        654
Distinct Terms with both match          295        576

[Bar chart: the five counts above for 2/17/2006 and 7/23/2006; vertical axis 0 to 900]

Validation of the [new] alignment
- Identify anchors using [standard] methods for the set of terms aligned using annotated data
  - Run the structural step of the alignment
- Use anchors identified using annotated data
  - Run the structural step using the annotation derived anchors
  - Also looking at indexing for text-mapping [instead of permutation generation], with Sean Falconer
- Compare the two alignments
  - Either using an expert created gold standard (UMLS)
  - Or by direct review by experts
- We will have results at the next Protégé conference ;)
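The slide does not say which measure is used to compare the two alignments. As one possible reading, the sketch below scores each alignment against a gold standard using precision and recall; the metric choice and all of the pairs are assumptions for illustration.

```python
def precision_recall(alignment, gold):
    """Score a set of (term_a, term_b) pairs against a gold-standard set of pairs."""
    alignment, gold = set(alignment), set(gold)
    true_positives = len(alignment & gold)
    precision = true_positives / len(alignment) if alignment else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold standard and two hypothetical alignments.
gold = {("A", "a"), ("B", "b"), ("C", "c"), ("D", "d")}
standard_anchor_alignment = {("A", "a"), ("B", "x")}
annotation_anchor_alignment = {("A", "a"), ("B", "b"), ("C", "c")}

standard_scores = precision_recall(standard_anchor_alignment, gold)
annotation_scores = precision_recall(annotation_anchor_alignment, gold)
print(standard_scores, annotation_scores)
```

Direct review by experts, the other option on the slide, would replace the `gold` set with per-pair judgments collected after the fact.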

Use of more structured annotations
[Diagram: the two ontology graphs (Root, Term-1 to Term-5; R, t1 to t7) with sample S2 annotated with Term-5 and t5]
- If the relationship embodied in this annotation is well defined (the blue arrows), we might be able to say:
    Term-5 <has this relationship with> t5
- If S2 is an instance of Term-5 and/or t5, we might be able to propagate the relationship to the parents of Term-5 and t5 (until we see a counter example).
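The propagation idea can be sketched as a walk up both hierarchies that stops at the first counterexample. The sketch makes strong simplifying assumptions that are not in the slides: each term has a single parent, the two hierarchies are climbed in lockstep, and the parent links and counterexample set are invented.

```python
def propagate(term_a, term_b, parent_a, parent_b, counterexamples=frozenset()):
    """Collect ancestor pairs for a related term pair until a counterexample is hit.

    parent_a and parent_b map each term to its single parent (None at the root).
    """
    pairs = []
    a, b = term_a, term_b
    while a is not None and b is not None:
        if (a, b) in counterexamples:
            break  # stop propagating once the data contradicts the pair
        pairs.append((a, b))
        a, b = parent_a.get(a), parent_b.get(b)
    return pairs

# Hypothetical single-parent hierarchies, loosely named after the slide's nodes.
parent_a = {"Term-5": "Term-2", "Term-2": "Root"}
parent_b = {"t5": "t1", "t1": "R"}

unrestricted = propagate("Term-5", "t5", parent_a, parent_b)
stopped = propagate("Term-5", "t5", parent_a, parent_b,
                    counterexamples={("Root", "R")})
print(unrestricted, stopped)
```

In a real ontology, terms can have several parents and the two hierarchies need not have the same depth, so a practical version would need a policy for which ancestor pairs to try.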

Mappings/Alignments at various granularity levels
- Ontologies at different scales of granularity: Reactome, Machine Prose, BioPAX, PaTO, GO, FMA
- Relations with varying degrees of formality: is_a, part_of, has_quality, has_participant, has_reaction, effects, induces

Acknowledgements
- Natasha Noy
- Kaustubh Supekar
- Daniel Rubin
- Mark Musen
- York Sure
- (Tricia d'Entremont) Pictorial Ontology Navigation
National Center for Biomedical Ontology, www.bioontology.org