PhenDisco: A New Phenotype Discovery System for the Database of Genotypes and Phenotypes
Son Doan, Hyeoneui Kim
Division of Biomedical Informatics, University of California San Diego
Open Access Journal Club, 09/05/2013
Roadmap to the Presentation
  - Background
      - dbGaP
      - Challenges in using dbGaP
      - PFINDR program
  - PhenDisco development
      - User requirement analysis for PhenDisco
      - Data standardization (variables, study metadata)
      - System development: technical details
  - PhenDisco demo
  - Performance evaluation
Background
Overview of dbGaP
  - Database of Genotypes and Phenotypes, developed by NCBI
  - Stores and distributes the data and results of studies on the interaction of genotypes and phenotypes
  - Provides two levels of access
      - Open access: variable information, including summary statistics, and study information
      - Controlled access: raw data, upon approval by an NIH Data Access Committee (DAC)
A Typical Challenge in Using dbGaP
"Potentially, dbGaP is great: it contains so many different types of studies and their data! However, I find it very hard to reuse dbGaP data because there is no easy but robust way to filter studies by important study-related information such as study design, analysis methods, and analysis data produced by the studies. Even if I find the studies that seem fitting to my needs, I still need to make sure that the studies have the genotype and/or the phenotype information that I need. Of course, dealing with the data values with all sorts of different formats is another challenge to go through."
(Erin Smith, PhD, Division of Genome Information Science, UCSD)
http://www.ncbi.nlm.nih.gov/gap
PFINDR (Phenotype Finding IN Data Repositories)
  - Funded by NHLBI
  - Aims to facilitate dbGaP use by improving the accuracy and completeness of search results
      - Standardized phenotype variables
      - Searchable study-related information
User Requirement Analysis
Use-Case Driven Development
User requirements were collected from:
  - Analysis of the data use descriptions in data requests available in dbGaP (14,287 requests)
  - Online user survey (17 users)
  - User interviews (8 local dbGaP users)
  - Recommendations and suggestions from NIH officers and the Scientific Advisory Board (SAB)
Data Request Analysis
[Chart: topics of dbGaP data requests]
  - Labeled shares: Neoplasm/Cancer (30%), Psychiatric Disease (13%), Congenital Abnormality (8.6%), Cardiovascular Disease (8.1%)
  - Remaining chart labels: Health Care; Daily Function or Activity; Activity; Food; Organism; Social Function; Behavior; Other; Mental Process; Qualitative Concept; Mood, Emotion, and Individual Behavior; Disease; Genetic Disease; Clinical Attributes; Diagnostic Procedure; Signs or Symptoms; Pathologic Function; Laboratory Procedure or Test; Research Activity; Chemical or Biological Substance; Therapeutic or Preventive Procedure
Interviews, Survey, and SAB/NIH Officer Feedback
Functions that maximize search efficiency. Examples:
  - Option to expand search terms through synonyms
  - Studies displayed in order of relevance
  - Select studies from the returned list and save them for later review
  - Search results organized in a way that supports quick browsing
Problems We Addressed
Focus areas:
  - Completeness and accuracy of search results
      - Abbreviation expansion
      - Concept-based search
  - Ease of result review
      - Sorting the results by relevance
      - Highlighting search keywords in the retrieved records
  - Additional functionality
      - Export of selected study and variable information
      - Categorization of variables
Data Standardization
  - Variable Standardization
  - Study Level Metadata Generation
Phenotype Variable Standardization
Example variable:
  Variable ID: phv00116192.v2.p2
  Variable Name: C41RPACE
  Variable Description: Get pain when walk at ordinary pace?
  - Used variable descriptions
  - Focused on identifying
      - Topic (main theme: pain, walking)
      - Subject of information (SOI, i.e., the bearer: study subject)
  - Mapped the topic and SOI concepts to the UMLS Metathesaurus
Phenotype Variable Standardization
Input: variable descriptions (135,608 variables)
Example: 77 age mom diagnosed stroke (tia)
  1. Normalization
      - Spell out abbreviations and shorthand expressions
      - Drop question numbers and other unimportant characters
      - Result: age mother diagnosed stroke (tia)
  2. MetaMap Processing
      - Generate CUIs, concept names, and semantic types
      - Result: C0001779: age [organism attribute]; C0026591: Mother [family group]; C0038454: Stroke [disease or syndrome]
  3. Semantic Role Assignment
      - Semantic type and keyword-based role identification
      - Evaluation on a random sample of 500: 73% accuracy
      - Result: topic: C0001779: age, C0038454: Stroke; subject of information: C0026591: Mother
  4. Variable Categorization
      - Semantic type and keyword-based categorization
      - Evaluation on a random sample of 500: 71% accuracy
      - Result: family history, demographics
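The normalization and semantic role assignment steps can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the abbreviation table, the set of semantic types treated as subjects of information, and the function names `normalize` and `assign_roles` are all assumptions, and MetaMap is not called (its output is copied from the slide).

```python
import re

# Hypothetical abbreviation table; the project's actual list is not shown in the slides.
ABBREVIATIONS = {"mom": "mother", "dad": "father"}

# Assumed rule: concepts of these semantic types are subjects of information,
# everything else is a topic. The real rules also use keywords.
SUBJECT_TYPES = {"family group"}


def normalize(description):
    """Drop a leading question number and spell out known abbreviations."""
    text = re.sub(r"^\s*\d+[.)]?\s+", "", description)
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)


def assign_roles(concepts):
    """Split (cui, name, semantic_type) triples into topics and subjects of information."""
    roles = {"topic": [], "subject_of_information": []}
    for cui, name, semantic_type in concepts:
        key = "subject_of_information" if semantic_type in SUBJECT_TYPES else "topic"
        roles[key].append((cui, name))
    return roles


# Concepts as listed on the slide for the example description.
concepts = [
    ("C0001779", "age", "organism attribute"),
    ("C0026591", "Mother", "family group"),
    ("C0038454", "Stroke", "disease or syndrome"),
]

normalized = normalize("77 age mom diagnosed stroke (tia)")
roles = assign_roles(concepts)
print(normalized)
print(roles)
```

Note that the leading-number rule in this sketch would also strip a meaningful number at the start of a description, so a production version would need a more careful pattern.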
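Category Examples
  - Variable description: Gender of the participant
      Topics: gender | Subject of information: study subject | Categories: Demographics
  - Variable description: Last known smoking status
      Topics: smoking | Subject of information: study subject | Categories: Smoking History
  - Variable description: Cigarettes/day, exam 1
      Topics: smoking, medical examination | Subject of information: study subject | Categories: Smoking History, Healthcare Activity Finding
  - Variable description: Age in years at uric acid measurement
      Topics: age, uric acid measurement | Subject of information: study subject | Categories: Demographics, Lab Tests
  - Variable description: AGE of living mother
      Topics: age | Subject of information: mother | Categories: Demographics - Family
  - Variable description: Age at dementia onset as defined by the DSM IV definition
      Topics: age, dementia | Subject of information: study subject | Categories: Demographics, Medical History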
Phenotype Variable Standardization: Identification of Similar Variables (in progress)
  - Follows normalization, MetaMap processing, semantic role assignment, and variable categorization
  - Criteria: same CUI, similar keywords, and same category
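One way to read the three criteria for similar variables (same CUI, similar keywords, same category) is sketched below. Since this step is described as in progress, everything here is an assumption: the dictionary layout, the use of Jaccard overlap for keyword similarity, the 0.5 threshold, and the second and third example variables, which are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def are_similar(v1, v2, min_keyword_overlap=0.5):
    """Assumed rule: identical CUI sets, identical categories, enough keyword overlap."""
    return (
        set(v1["cuis"]) == set(v2["cuis"])
        and set(v1["categories"]) == set(v2["categories"])
        and jaccard(v1["keywords"], v2["keywords"]) >= min_keyword_overlap
    )


# First record follows the "AGE of living mother" example; the others are made up.
living_mother_age = {
    "cuis": {"C0001779", "C0026591"},
    "keywords": {"age", "living", "mother"},
    "categories": {"Demographics - Family"},
}
mother_age = {
    "cuis": {"C0001779", "C0026591"},
    "keywords": {"age", "mother"},
    "categories": {"Demographics - Family"},
}
mother_age_other_category = dict(mother_age, categories={"Medical History"})

print(are_similar(living_mother_age, mother_age))
print(are_similar(living_mother_age, mother_age_other_category))
```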
Study Level Metadata Annotation
  - Manual annotation of 422 studies (as of 07/31/13)
  - Metadata items generated
      - Disease topics (encoded with UMLS)
      - Geographical information (encoded with ISO 3166-2 subdivision codes: state and country)
      - IRB approval (required or not)
      - Consent type (not restricted, restricted, unspecified)
      - Sample demographics (race and/or ethnicity, gender, age)
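The metadata items listed above could be held in a record like the one sketched here. The class name, field names, and example values (including the placeholder study ID) are hypothetical; only the list of items and the three consent types come from the slide.

```python
from dataclasses import dataclass, field
from typing import List

# Consent types as listed on the slide.
CONSENT_TYPES = ("not restricted", "restricted", "unspecified")


@dataclass
class StudyMetadata:
    """Hypothetical container for one manually annotated study."""
    study_id: str
    disease_topics: List[str] = field(default_factory=list)  # UMLS CUIs
    locations: List[str] = field(default_factory=list)       # ISO 3166-2 codes
    irb_required: bool = True
    consent_type: str = "unspecified"
    race_ethnicity: List[str] = field(default_factory=list)
    gender: List[str] = field(default_factory=list)
    age: str = ""

    def __post_init__(self):
        # Reject consent values outside the three annotated types.
        if self.consent_type not in CONSENT_TYPES:
            raise ValueError("unknown consent type: " + self.consent_type)


# Placeholder values for illustration only, not a real dbGaP study.
record = StudyMetadata(
    study_id="phs_example",
    disease_topics=["C0038454"],
    locations=["US-CA"],
    irb_required=True,
    consent_type="restricted",
)
print(record)
```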
System Development: Integration
PhenDisco: Putting It All Together
[Architecture diagram. Components shown: dbGaP; NLP tools + MetaMap; information model mapping; sdGaP; free-text query; query parser; relevant studies; BM25 ranking algorithm; ranked studies]
System Development: Query Parser
Contextual Query Language
  - Query types
      - Simple queries: keywords, phrases
      - Boolean logic: AND, OR, NOT
      - Index values with comparison operators, e.g., age > 40
  - Language definition written in BNF form
BNF form
    cqlquery ::= prefixassignment cqlquery | scopedclause
    prefixassignment ::= '>' prefix '=' uri | '>' uri
    scopedclause ::= scopedclause booleangroup searchclause | searchclause
    booleangroup ::= boolean [modifierlist]
    boolean ::= 'and' | 'or' | 'not' | 'prox'
    searchclause ::= '(' cqlquery ')' | index relation searchterm | searchterm
    relation ::= comparitor [modifierlist]
    comparitor ::= comparitorsymbol | namedcomparitor
    comparitorsymbol ::= '=' | '>' | '<' | '>=' | '<=' | '<>' | '=='
    namedcomparitor ::= identifier
    modifierlist ::= modifierlist modifier | modifier
    modifier ::= '/' modifiername [comparitorsymbol modifiervalue]
    prefix, uri, modifiername, modifiervalue, searchterm, index ::= term
    term ::= identifier | 'and' | 'or' | 'not' | 'prox' | 'sortby'
    identifier ::= charstring1 | charstring2
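To make the grammar concrete, here is a small hand-written parser for a subset of it: search terms, quoted phrases, `index relation searchterm` clauses, parentheses, and left-associative boolean operators. It is only a sketch under stated assumptions: the slides say the system uses pyparsing, whereas this example uses plain regular expressions, it omits prefix assignments, modifiers, and `sortby`, and the tuple-based tree format and function names are invented for illustration.

```python
import re

TOKEN = re.compile(r'\(|\)|>=|<=|<>|==|=|>|<|"[^"]*"|[^\s()=<>]+')
BOOLEANS = {"and", "or", "not", "prox"}
COMPARATORS = {"=", ">", "<", ">=", "<=", "<>", "=="}


def parse(text):
    """Parse a query string into a nested tuple tree."""
    tokens = TOKEN.findall(text)
    tree, pos = _query(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


def _query(tokens, pos):
    # scopedclause ::= scopedclause booleangroup searchclause | searchclause
    left, pos = _clause(tokens, pos)
    while pos < len(tokens) and tokens[pos].lower() in BOOLEANS:
        operator = tokens[pos].lower()
        right, pos = _clause(tokens, pos + 1)
        left = (operator, left, right)
    return left, pos


def _clause(tokens, pos):
    # searchclause ::= '(' cqlquery ')' | index relation searchterm | searchterm
    if pos >= len(tokens):
        raise ValueError("unexpected end of query")
    token = tokens[pos]
    if token == "(":
        tree, pos = _query(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing ')'")
        return tree, pos + 1
    if token == ")" or token in COMPARATORS:
        raise ValueError("unexpected token: " + token)
    if pos + 1 < len(tokens) and tokens[pos + 1] in COMPARATORS:
        if pos + 2 >= len(tokens):
            raise ValueError("missing value after comparator")
        value = tokens[pos + 2].strip('"')
        return ("rel", token, tokens[pos + 1], value), pos + 3
    return ("term", token.strip('"')), pos + 1


tree = parse('"macular degeneration" and age > 40')
print(tree)
```

As in the grammar, all boolean operators have equal precedence and group to the left, so parentheses are needed to force a different grouping. Multi-word phrases must be quoted in this sketch.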
System Development: Study Ranking
BM25 ranking algorithm
  - N: total number of studies
  - n_t: number of studies that contain the term t
  - c: a field in study d
  - w_c: boost factor for field c
  - tf: term frequency
  - idf: inverse document frequency
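The slide's formula did not survive extraction, so the sketch below shows one common field-weighted BM25 variant (often called BM25F) that uses the same ingredients: N, n_t, per-field boosts w_c, term frequency, and inverse document frequency. The exact idf form, the parameters `k1` and `b`, the boost values, and the toy corpus are assumptions rather than the system's actual settings.

```python
import math


def bm25f_score(query_terms, doc, doc_freq, n_docs, avg_len, boosts, k1=1.2, b=0.75):
    """Score one study (a dict of field name -> token list) against the query terms."""
    score = 0.0
    for term in query_terms:
        n_t = doc_freq.get(term, 0)
        if n_t == 0:
            continue
        # One common smoothed idf form (assumed here); always positive.
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        weighted_tf = 0.0
        for name, tokens in doc.items():
            tf = tokens.count(term)
            # Length normalization per field, then scaling by the field boost w_c.
            norm = 1 - b + b * len(tokens) / avg_len[name]
            weighted_tf += boosts.get(name, 1.0) * tf / norm
        score += idf * weighted_tf / (k1 + weighted_tf)
    return score


# Toy corpus invented for illustration; each study has tokenized fields.
studies = {
    "study_a": {"title": ["copd", "cohort"], "description": ["lung", "function", "study"]},
    "study_b": {"title": ["heart", "study"], "description": ["copd", "lung", "measures"]},
    "study_c": {"title": ["asthma", "cohort"], "description": ["lung", "function", "study"]},
}
boosts = {"title": 2.0, "description": 1.0}  # assumed w_c values
fields = ["title", "description"]
avg_len = {c: sum(len(s[c]) for s in studies.values()) / len(studies) for c in fields}

query = ["copd"]
doc_freq = {
    t: sum(any(t in tokens for tokens in s.values()) for s in studies.values())
    for t in query
}
scores = {
    name: bm25f_score(query, s, doc_freq, len(studies), avg_len, boosts)
    for name, s in studies.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With the title boosted above the description, a match in the title ranks a study ahead of one that matches only in the description, and a study with no match scores zero.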
Technical Infrastructure
  - URL: http://pfindr-data.ucsd.edu/_phdver1/
  - Linux machine: Ubuntu, 64-bit
  - Memory: 32 GB RAM
  - Database: MySQL 14.14
  - Web server: Apache 2.2.20
  - Programming languages: PHP, Python, JavaScript
  - Python toolkits: pyparsing, Whoosh
System Demonstration
System Evaluation
  - Search Accuracy
  - User Interface
Evaluation on Basic Search
                                       dbGaP                 PhenDisco
Query                                  Recall   Precision    Recall   Precision
COPD                                   100%     41.67%       80.00%   100%
macular degeneration AND white         100%     42.86%       100%     85.71%
breast cancer AND breast density       100%     66.67%       50.00%   100%
schizophrenia                          100%     46.88%       86.67%   92.86%
cardiomyopathy                         100%     35.00%       100%     100%
Average                                100%     46.61%       83.33%   95.71%
Average F-measure                      0.64                  0.89
Results as of July 7, 2013.
Evaluation on Advanced Search
Advanced search in PhenDisco
Recall   Precision   Query
100%     66.67%      macular degeneration AND white AND [whole genome genotyping]
100%     100%        breast cancer AND breast density AND [IRB not required] AND [whole genome genotyping]
100%     100%        schizophrenia AND [female] AND [AFFY_6.0]
100%     100%        cardiomyopathy AND [copy number variant analysis]
100%     91.67%      Average
Average F-measure: 0.96
Results as of July 7, 2013.
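The reported average F-measures are consistent with the balanced F-measure (harmonic mean) applied to the average precision and average recall in each column. The short check below assumes that is how they were derived; the function name is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall as reported on the evaluation slides.
dbgap_basic = round(f_measure(0.4661, 1.0), 2)
phendisco_basic = round(f_measure(0.9571, 0.8333), 2)
phendisco_advanced = round(f_measure(0.9167, 1.0), 2)

print(dbgap_basic, phendisco_basic, phendisco_advanced)  # 0.64 0.89 0.96
```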
Feedback on the User Interface (N=6)
Trainees
Post-doctoral trainees
  - Ko-Wei Lin, DVM, PhD (Study Abstraction, Standardization, Evaluation)
  - Mindy Ross, MD, MBA (Study Abstraction, Ontology Building)
  - Neda Alipanah, PhD (Ontology Building)
  - Xiaoqian Jiang, PhD (Ranking Algorithm)
  - Mike Conway, PhD (Study Abstraction)
Undergraduate trainees
  - Alexander Hsieh (Standardization)
  - Vinay Venkatesh (System Development)
  - Rafael Talavera (Evaluation)
  - Karen Truong (Study Abstraction)
  - Asher Garland (System Development)
Acknowledgements
  - PI: Lucila Ohno-Machado
  - Collaborator: Hua Xu
  - Other contributions: Jihoon Kim, Wendy Chapman, Melissa Tharp
  - Staff: Stephanie Feudjio Feupe, MS; Seena Farzaneh, MS; Rebecca Walker, BS
  - Funding: UH2HL108785 from NHLBI, NIH
Questions?
  - Project homepage: http://pfindr.net
  - PhenDisco: http://pfindr-data.ucsd.edu/_phdver1/index.php
  - Contact: lohnomachado@ucsd.edu, hyk038@ucsd.edu, sondoan@ucsd.edu