PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes


PhenDisco: a new phenotype discovery system for the database of genotypes and phenotypes
Son Doan, Hyeoneui Kim
Division of Biomedical Informatics, University of California San Diego
Open Access Journal Club, 09/05/2013

Roadmap to the Presentation
- Background
  - dbGaP
  - Challenges in using dbGaP
  - pFINDR program
- PhenDisco development
  - User requirement analysis for PhenDisco
  - Data standardization (variables, study metadata)
  - System development: technical details
- PhenDisco demo
- Performance evaluation

Background

Overview of dbGaP
- Database of Genotypes and Phenotypes
- Developed by NCBI
- Stores and distributes the data and outputs of studies on the interactions of genotypes and phenotypes
- Provides 2 levels of access
  - Open access: variable information, including summary statistics and study information
  - Controlled access: raw data upon approval by NIH DAC

A Typical Challenge in Using dbGaP
"Potentially, dbGaP is great: it contains so many different types of studies and their data! However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study-related information such as study design, analysis methods, and analysis data produced by the studies. Even if I find the studies that seem fitting to my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need. Of course, dealing with the data values in all sorts of different formats is another challenge to go through."
(Erin Smith, PhD, Division of Genome Information Science, UCSD)

http://www.ncbi.nlm.nih.gov/gap

pFINDR (phenotype Finding IN Data Repositories)
- Funded by NHLBI
- To facilitate dbGaP use by improving accuracy and completeness of search returns
  - Standardized phenotype variables
  - Searchable study-related information

User Requirement Analysis

Use-Case Driven Development
- User requirements collected from
  - Analysis of data use descriptions from data requests available in dbGaP (14,287 requests)
  - Online user survey (17 users)
  - User interviews (8 local dbGaP users)
  - NIH officers/Scientific Advisory Board recommendations and suggestions

Data Request Analysis
[Figure: data requests grouped by concept category. Categories labeled with a share: Neoplasm/Cancer (30%); Psychiatric Disease (13%); Congenital Abnormality (8.6%); Cardiovascular Disease (8.1%). Other labels in the figure: Health Care; Daily Function or Activity; Activity; Food; Organism; Social Function; Behavior; Other; Mental Process; Qualitative Concept; Mood, Emotion, and Individual Behavior; Disease; Genetic Disease; Clinical Attributes; Diagnostic Procedure; Signs or Symptoms; Pathologic Function; Laboratory Procedure or Test; Research Activity; Chemical or Biological Substance; Therapeutic or Preventive Procedure.]

Interviews, Survey and SAB/NIH Officers Feedback
- Functions that maximize search efficiency
- Examples
  - Option to expand search terms through synonyms
  - Studies displayed in order of relevancy
  - Select studies from the returned list and save them for later review
  - Search results organized in a way that supports quick browsing

Problems We Addressed
- Focus areas
  - Completeness and accuracy of search results
    - Abbreviation expansion
    - Concept-based search
  - Ease of result review
    - Sorting the results by relevancy
    - Highlighting search keywords in the retrieved records
- Additional functionality
  - Export of selected study and variable information
  - Categorization of variables

Data Standardization
- Variable Standardization
- Study Level Metadata Generation

Phenotype Variable Standardization

Variable ID | Variable Name | Variable Description
phv00116192.v2.p2 | C41RPACE | Get pain when walk at ordinary pace?

- Used variable descriptions
- Focused on identifying
  - Topic (main theme: pain, walking)
  - Subject of information (SOI), i.e., the bearer: study subject
- Mapped the topic and SOI concepts to the UMLS Metathesaurus


Phenotype Variable Standardization
Input: variable descriptions (135,608 variables)
Running example: "77 age mom diagnosed stroke (tia)"

Normalization
- Spell out abbreviations and shorthand expressions
- Drop question numbers and other unimportant characters
- Result: "age mother diagnosed stroke (tia)"

MetaMap Processing
- Generate CUIs, concept names, semantic types
- Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]

Semantic Role Assignment
- Semantic types and keyword-based role identification
- Evaluation on a random sample of 500: 73% accuracy
- Result: topic = C0001779: age, C0038454: Stroke; subject of information = C0026591: Mother

Variable Categorization
- Semantic types and keyword-based categorization
- Evaluation on a random sample of 500: 71% accuracy
- Result: family history, demographics
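The normalization and semantic role assignment steps can be illustrated with a minimal sketch. It is not the PhenDisco implementation: the abbreviation table, the rule that treats the "family group" semantic type as the subject of information, and the names `ABBREVIATIONS`, `SUBJECT_SEMTYPES`, `normalize`, and `assign_roles` are all assumptions made for illustration. The MetaMap output is copied from the slide example rather than produced by calling MetaMap.

```python
import re

# Hypothetical abbreviation table; the real list is not given in the slides.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}

# Assumption: concepts with these semantic types name the subject of information.
SUBJECT_SEMTYPES = {"family group"}


def normalize(description):
    """Drop a leading question number and spell out known abbreviations."""
    text = re.sub(r"^\s*\d+[.)]?\s+", "", description)
    return " ".join(ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split())


def assign_roles(concepts):
    """Split (cui, name, semantic_type) triples into topic and subject roles."""
    roles = {"topic": [], "subject of information": []}
    for cui, name, semtype in concepts:
        key = "subject of information" if semtype in SUBJECT_SEMTYPES else "topic"
        roles[key].append((cui, name))
    return roles


# MetaMap output copied from the slide example (not generated by this code).
concepts = [
    ("C0001779", "age", "organism attribute"),
    ("C0026591", "Mother", "family group"),
    ("C0038454", "Stroke", "disease or syndrome"),
]

normalized = normalize("77 age mom diagnosed stroke (tia)")
roles = assign_roles(concepts)
print(normalized)
print(roles)
```

A rule this simple would also strip a leading number that carries meaning (for example a description that starts with a dose or a duration), which may be one reason the slides report accuracy on a sample rather than assuming the rules are exact.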

Category Examples

Variable Description | Topics | Subject of Information | Variable Categories
Gender of the participant | gender | study subject | Demographics
Last known smoking status | smoking | study subject | Smoking History
Cigarettes/day, exam 1 | smoking, medical examination | study subject | Smoking History, Healthcare Activity Finding
Age in years at uric acid measurement | age, uric acid measurement | study subject | Demographics, Lab Tests
AGE of living mother | age | mother | Demographics - Family
Age at dementia onset as defined by the DSM IV definition | age, dementia | study subject | Demographics, Medical History

Phenotype Variable Standardization (in progress)
- Identification of Similar Variables: same CUI, similar keywords, and same category
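The three criteria for similar variables (same CUI, similar keywords, same category) can be sketched as a single predicate. The slides mark this step as in progress and do not define "similar keywords", so the Jaccard overlap, the 0.5 threshold, the reading of "same CUI" as identical CUI sets, and the example records are all assumptions.

```python
def keyword_overlap(a, b):
    """Jaccard overlap between two keyword collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def are_similar(var_a, var_b, min_overlap=0.5):
    """Assumed rule: identical CUI sets, same category, enough shared keywords."""
    return (
        set(var_a["cuis"]) == set(var_b["cuis"])
        and var_a["category"] == var_b["category"]
        and keyword_overlap(var_a["keywords"], var_b["keywords"]) >= min_overlap
    )


# Hypothetical variable records, loosely modeled on the slide example.
v1 = {"cuis": ["C0001779", "C0026591", "C0038454"],
      "keywords": ["age", "mother", "diagnosed", "stroke"],
      "category": "Demographics - Family"}
v2 = {"cuis": ["C0038454", "C0026591", "C0001779"],
      "keywords": ["age", "mother", "stroke", "onset"],
      "category": "Demographics - Family"}
v3 = dict(v2, category="Medical History")

print(are_similar(v1, v2), are_similar(v1, v3))
```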

Study Level Metadata Annotation
- Manual annotation of 422 studies (07/31/13)
- Metadata items generated
  - Disease topics (encoded with UMLS)
  - Geographical information (encoded with ISO 3166-2 subdivision code: state and country)
  - IRB approval (required or not)
  - Consent type (not restricted, restricted, unspecified)
  - Sample demographics (race and/or ethnicity, gender, age)
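One way to picture the annotated metadata items is as a single record per study. The sketch below is an illustration only: the class name `StudyMetadata`, its field names, and the example values (including the placeholder accession `phs000000`) are hypothetical and do not reflect the actual PhenDisco schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Consent values listed on the slide.
CONSENT_TYPES = {"not restricted", "restricted", "unspecified"}


@dataclass
class StudyMetadata:
    study_id: str                      # placeholder accession, hypothetical
    disease_topics: List[str]          # UMLS concept identifiers (CUIs)
    locations: List[str]               # ISO 3166-2 subdivision codes
    irb_required: Optional[bool]       # None when not annotated
    consent_type: str = "unspecified"
    races: List[str] = field(default_factory=list)
    genders: List[str] = field(default_factory=list)
    age_range: Optional[str] = None

    def __post_init__(self):
        # Reject consent values outside the three listed categories.
        if self.consent_type not in CONSENT_TYPES:
            raise ValueError(f"unknown consent type: {self.consent_type!r}")


example = StudyMetadata(
    study_id="phs000000",
    disease_topics=["C0038454"],       # Stroke, the CUI used in the slides
    locations=["US-CA"],
    irb_required=True,
    consent_type="restricted",
    genders=["female", "male"],
)
print(example)
```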

System Development: Integration

PhenDisco: Put-it-all-together
[Architecture diagram. Components shown: dbGaP; NLP tools + MetaMap; Information Model Mapping; sdGaP; Free text; Query parser; Relevant studies; BM25 ranking algorithm; Ranked studies.]

System Development: Query Parser

Contextual Query Language
- Query types
  - Simple queries: keywords, phrases
  - Boolean logic: AND, OR, NOT
  - Index values can be processed, e.g., age > 40
- Build a language guideline: BNF form

BNF form

    cqlquery ::= prefixassignment cqlquery | scopedclause
    prefixassignment ::= '>' prefix '=' uri | '>' uri
    scopedclause ::= scopedclause booleangroup searchclause | searchclause
    booleangroup ::= boolean [modifierlist]
    boolean ::= 'and' | 'or' | 'not' | 'prox'
    searchclause ::= '(' cqlquery ')' | index relation searchterm | searchterm
    relation ::= comparitor [modifierlist]
    comparitor ::= comparitorsymbol | namedcomparitor
    comparitorsymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
    namedcomparitor ::= identifier
    modifierlist ::= modifierlist modifier | modifier
    modifier ::= '/' modifiername [comparitorsymbol modifiervalue]
    prefix, uri, modifiername, modifiervalue, searchterm, index ::= term
    term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
    identifier ::= charstring1 | charstring2
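To make the grammar concrete, here is a minimal recursive-descent sketch of a small subset: search terms, quoted phrases, `index relation term` clauses, parentheses, and left-associative `and`/`or`/`not`. It uses only the standard library as a stand-in. The slides list pyparsing as the toolkit, and this sketch does not reproduce that implementation or the prefix, modifier, `prox`, and `sortby` parts of the grammar. The function names and the tuple-based tree are assumptions.

```python
import re

# Two-character comparators must be tried before single-character ones.
TOKEN = re.compile(r'\s*(>=|<=|<>|==|[()=<>]|"[^"]*"|[^\s()=<>"]+)')
BOOLEANS = {"and", "or", "not"}
RELATIONS = {"=", ">", "<", ">=", "<=", "<>", "=="}


def tokenize(query):
    """Split a query string into parentheses, comparators, phrases, and words."""
    tokens, pos, query = [], 0, query.strip()
    while pos < len(query):
        match = TOKEN.match(query, pos)
        if not match:
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_clause(tokens):
    """searchclause ::= '(' query ')' | index relation term | term"""
    if not tokens:
        raise ValueError("unexpected end of query")
    head = tokens[0]
    if head == "(":
        tree, rest = parse_query(tokens[1:])
        if not rest or rest[0] != ")":
            raise ValueError("missing ')'")
        return tree, rest[1:]
    if head == ")" or head in RELATIONS or head.lower() in BOOLEANS:
        raise ValueError(f"unexpected token {head!r}")
    if len(tokens) >= 3 and tokens[1] in RELATIONS:
        return ("rel", head, tokens[1], tokens[2].strip('"')), tokens[3:]
    return ("term", head.strip('"')), tokens[1:]


def parse_query(tokens):
    """scopedclause ::= scopedclause boolean searchclause | searchclause"""
    left, tokens = parse_clause(tokens)
    while tokens and tokens[0].lower() in BOOLEANS:
        op = tokens[0].lower()
        right, tokens = parse_clause(tokens[1:])
        left = (op, left, right)
    return left, tokens


def parse(query):
    tree, rest = parse_query(tokenize(query))
    if rest:
        raise ValueError(f"unexpected trailing tokens: {rest}")
    return tree


tree = parse('"breast cancer" AND (age > 40 OR smoking)')
print(tree)
```

The left-recursive `scopedclause` rule is implemented as a loop, which gives the same left-to-right grouping without recursing forever.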

System Development: Study Ranking

BM25 ranking algorithm
[Formula not preserved in this transcript.]
Notation:
- N: total number of studies
- n_t: number of studies that contain the term t
- c: a field in study d
- w_c: boost factor for field c
- tf: term frequency
- idf: inverse document frequency
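Since the formula itself is not preserved, the following is a sketch of a commonly used field-weighted BM25 (BM25F-style) score written with the notation above. It is an assumption that PhenDisco's ranking has this shape; the parameters `k1` and `b`, the smoothed idf variant, the function name `bm25f_score`, and the toy corpus are not from the slides.

```python
import math


def bm25f_score(query_terms, doc, corpus, boosts, k1=1.2, b=0.75):
    """Score one study (dict: field -> token list) against the query terms.

    Assumes a non-empty corpus.
    """
    n_docs = len(corpus)                                   # N
    avg_len = {
        c: sum(len(d.get(c, [])) for d in corpus) / n_docs for c in boosts
    }
    score = 0.0
    for t in query_terms:
        # n_t: number of studies containing t in any boosted field
        n_t = sum(any(t in d.get(c, []) for c in boosts) for d in corpus)
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        weighted_tf = 0.0
        for c, w_c in boosts.items():
            tokens = doc.get(c, [])
            if not tokens or not avg_len[c]:
                continue
            # Length-normalized term frequency in field c, scaled by w_c
            norm = 1.0 - b + b * len(tokens) / avg_len[c]
            weighted_tf += w_c * tokens.count(t) / norm
        score += idf * weighted_tf / (k1 + weighted_tf)
    return score


# Toy corpus for illustration only.
studies = [
    {"title": ["copd", "genetics"], "description": ["lung", "function", "cohort"]},
    {"title": ["heart", "study"], "description": ["copd", "was", "recorded"]},
    {"title": ["asthma", "cohort"], "description": ["children", "lung", "function"]},
]
boosts = {"title": 2.0, "description": 1.0}
scores = [bm25f_score(["copd"], s, studies, boosts) for s in studies]
print(scores)
```

With a larger boost on the title field, a study that mentions the query term in its title outranks one that mentions it only in the description, and a study without the term scores zero.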

Technical Infrastructure
- URL: http://pfindr-data.ucsd.edu/_phdver1/
- Linux machine: Ubuntu, 64-bit
- Memory: 32 GB RAM
- Database: MySQL 14.14
- Web server: Apache 2.2.20
- Programming languages: PHP, Python, JavaScript
- Python toolkits: pyparsing, Whoosh

System Demonstration

System Evaluation
- Search Accuracy
- User Interface

Evaluation on Basic Search (as of July 7, 2013)

Basic Search | dbGaP Recall | dbGaP Precision | PhenDisco Recall | PhenDisco Precision
COPD | 100% | 41.67% | 80.00% | 100%
macular degeneration AND white | 100% | 42.86% | 100% | 85.71%
breast cancer AND breast density | 100% | 66.67% | 50.00% | 100%
schizophrenia | 100% | 46.88% | 86.67% | 92.86%
cardiomyopathy | 100% | 35.00% | 100% | 100%
Average | 100% | 46.61% | 83.33% | 95.71%

Average F-measure: dbGaP 0.64, PhenDisco 0.89
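The reported F-measures are consistent with the balanced F-measure, F = 2PR / (P + R), computed from the averaged precision and recall. The short check below uses the averages from the table; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Averages from the basic search table, as fractions.
dbgap_f = f_measure(precision=0.4661, recall=1.0)
phendisco_f = f_measure(precision=0.9571, recall=0.8333)
print(round(dbgap_f, 2), round(phendisco_f, 2))
```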

Evaluation on Advanced Search (as of July 7, 2013)

Advanced Search in PhenDisco | Recall | Precision
macular degeneration AND white AND [whole genome genotyping] | 100% | 66.67%
breast cancer AND breast density AND [IRB not required] AND [whole genome genotyping] | 100% | 100%
schizophrenia AND [female] AND [AFFY_6.0] | 100% | 100%
cardiomyopathy AND [copy number variant analysis] | 100% | 100%
Average | 100% | 91.67%

Average F-measure: 0.96

Feedback on the User Interface (N=6)

Trainees
- Post-doctoral trainees
  - Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization, Evaluation)
  - Mindy Ross, MD, MBA (Study Abstraction, Ontology Building)
  - Neda Alipanah, PhD (Ontology Building)
  - Xiaoqian Jiang, PhD (Ranking Algorithm)
  - Mike Conway, PhD (Study Abstraction)
- Undergraduate trainees
  - Alexander Hsieh (Standardization)
  - Vinay Venkatesh (System Development)
  - Rafael Talavera (Evaluation)
  - Karen Truong (Study Abstraction)
  - Asher Garland (System Development)

Acknowledgements
- Lucila Ohno-Machado (PI)
- Collaborator: Hua Xu
- Other contribution: Jihoon Kim, Wendy Chapman, Melissa Tharp
- Staff: Stephanie Feudjio Feupe, MS; Seena Farzaneh, MS; Rebecca Walker, BS
- Funding: UH2HL108785 from NHLBI, NIH

Questions?
- Project homepage: http://pfindr.net
- PhenDisco: http://pfindr-data.ucsd.edu/_phdver1/index.php
- Contact: lohnomachado@ucsd.edu, hyk038@ucsd.edu, sondoan@ucsd.edu