Automated Annotation of Biomedical Text Kevin Livingston, Ph.D. Postdoctoral Fellow Pharmacology Department, School of Medicine University of Colorado Anschutz Medical Campus Kevin.Livingston@ucdenver.edu http://compbio.ucdenver.edu/hunter_lab/livingston
Biomedical researchers are interested in understanding their data in the context of all known background knowledge: curated databases and literature.
Muscle Cell Development
Biomedical Data Sources: 1,380 databases in 2012; total GO annotations: 132,425,702; total manual GO annotations: 1,116,848; PubMed articles referenced: 94,518
PubMed Growth Rate [Figure: new entries (thousands) and total entries (millions) per year, 1987 to 2011, with exponential fits y = ~e^(0.0405x), R² = 0.99 and y = ~e^(0.0402x), R² = 0.94.] 973,499 PubMed entries in 2011 (>2,600 per day): 2 journal articles per minute!
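The per-day and per-minute figures on this slide can be checked with a few lines of arithmetic. The sketch below also derives a doubling time from one of the fitted exponents, under the assumption (not stated on the slide) that x is measured in years; which fit belongs to which series is not recoverable from the extracted text, so 0.0405 is used only as an example.

```python
import math

ENTRIES_2011 = 973_499  # PubMed entries in 2011, as quoted on the slide

# Average rate over a 365-day year.
per_day = ENTRIES_2011 / 365
per_minute = per_day / (24 * 60)


def doubling_time(rate):
    """Doubling time of exp(rate * x) growth, in the units of x.

    Assumption: x is in years; the slide does not say so explicitly.
    """
    return math.log(2) / rate


if __name__ == "__main__":
    print(f"{per_day:.0f} entries per day")
    print(f"{per_minute:.2f} entries per minute")
    print(f"{doubling_time(0.0405):.1f} years to double (if x is in years)")
```

The per-minute rate comes out a little under 2, which the slide rounds up.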
Vision [Diagram with components: DBs, Ontologies, Texts, Text Mining, Knowledge Base, Intelligent Applications]
Annotation for Computation: computer understandable; composable; provenance of compositions traceable
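Compositional Annotation & Knowledge [Diagram: text annotation 3 denotes the composed concept "vertebrate pigmentation", linked by subclassof to GO:0043474 pigmentation and by occurs_in to TAXON:7742 Vertebrata; text annotation 3 is basedon text annotation 1 and text annotation 2, each of which denotes one of the two component concepts; source document: CRAFT PMID:14737183]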
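A minimal sketch of how basedon links make the provenance of a composed annotation traceable. The `Annotation` class and `provenance` function are hypothetical names introduced for illustration, and the assignment of text annotations 1 and 2 to pigmentation and Vertebrata is an assumption; the slide only shows that each denotes one of the two.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    ident: str
    denotes: str
    based_on: list = field(default_factory=list)  # annotations this one composes


def provenance(annotation):
    """Return the ids of every annotation reachable through based_on links."""
    found = set()
    todo = list(annotation.based_on)
    while todo:
        current = todo.pop()
        if current.ident not in found:
            found.add(current.ident)
            todo.extend(current.based_on)
    return sorted(found)


# Assumption: which of annotations 1 and 2 denotes which concept.
a1 = Annotation("text annotation 1", "GO:0043474 pigmentation")
a2 = Annotation("text annotation 2", "TAXON:7742 Vertebrata")
a3 = Annotation("text annotation 3", "vertebrate pigmentation", [a1, a2])

if __name__ == "__main__":
    print(provenance(a3))
```

Because the trace follows links transitively, an annotation built on top of text annotation 3 would report all three underlying annotations.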
Understanding language requires relating what has been said to existing knowledge structures.
Typical / Pipeline Model: Corpus -> POS tagging -> Entity Recognition -> Word Sense -> Syntactic Parse -> Semantic Forms -> Discourse Model? -> Integration With Existing Knowledge (KB)
Direct Memory Access Parsing (DMAP) [Diagram: the same pipeline stages (Corpus, POS tagging, Entity Recognition, Word Sense, Syntactic Parse, Semantic Forms, Discourse Model?, Integration With Existing Knowledge, KB), plus a DMAP element]
Direct Memory Access Parsing (DMAP): identifies concepts in text using patterns composed of lexical and semantic concepts; incremental, continuous recognition of concepts in text; parsing is fundamentally about recognition and integration with existing knowledge
Standardized Representations [Diagram relating a GOA record (GOA record 147: UniProtKB O75151, GO:0032452) and text annotations (Text annotation 43, 71, and 89, over the PMID:12345678 phrase "PHF2-mediated demethylation of histones") to shared concepts: GO:0003824 catalytic activity, GO:0032452 histone demethylase activity, and UniProt:O75151 PHF2, using the relations denotes, subclassof, and has_agent]
OpenDMAP Patterns: [plasma_membrane] = membrane of [cell]. Example text: "membrane of a muscle cell", with "muscle cell" marked as [muscle cell].
OpenDMAP Patterns: [plasma_membrane] = membrane of [cell]; [nuclear_membrane] = membrane of [nucleus]
OpenDMAP Patterns: [plasma_membrane] = membrane of [cell]; [nuclear_membrane] = membrane of [nucleus]; [transmembrane_transport] = transportation through [membrane]
OpenDMAP Patterns (with slots): [plasma_membrane] = membrane of [cell], slot membrane_of: [cell]; [nuclear_membrane] = membrane of [nucleus], slot membrane_of: [nucleus]; [transmembrane_transport] = transportation through [membrane], slot membrane_crossed: [membrane]
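To make the idea of patterns that mix literal words with concept slots concrete, here is a toy bottom-up pattern rewriter over the three patterns above. It is not OpenDMAP's implementation or syntax, and it works in batch rather than incrementally. The is-a hierarchy, the lexical patterns for "cell", "nucleus", and "muscle cell", the stop-word list, and all function names are assumptions added for illustration; slot names such as membrane_of are not modeled.

```python
# Assumption: a tiny is-a hierarchy (child -> parent), not taken from OpenDMAP.
ISA = {
    "plasma_membrane": "membrane",
    "nuclear_membrane": "membrane",
    "muscle_cell": "cell",
}

# Assumption: determiners are simply skipped.
STOP_WORDS = {"a", "an", "the"}

# (concept, sequence): bracketed elements are concept slots, others literal words.
PATTERNS = [
    ("plasma_membrane", ["membrane", "of", "[cell]"]),
    ("nuclear_membrane", ["membrane", "of", "[nucleus]"]),
    ("transmembrane_transport", ["transportation", "through", "[membrane]"]),
    # Hypothetical lexical patterns so that plain words can fill slots.
    ("muscle_cell", ["muscle", "cell"]),
    ("cell", ["cell"]),
    ("nucleus", ["nucleus"]),
]


def is_a(concept, target):
    """True if concept equals target or has target as an ancestor in ISA."""
    while concept is not None:
        if concept == target:
            return True
        concept = ISA.get(concept)
    return False


def element_matches(element, item):
    """A slot matches a recognized concept of the right type; a word matches itself."""
    if element.startswith("["):
        return item.startswith("[") and is_a(item[1:-1], element[1:-1])
    return item == element


def recognize(text):
    """Rewrite matching spans into bracketed concepts until nothing changes."""
    items = [w for w in text.lower().split() if w not in STOP_WORDS]
    ordered = sorted(PATTERNS, key=lambda p: len(p[1]), reverse=True)
    rewritten = True
    while rewritten:
        rewritten = False
        for concept, sequence in ordered:
            n = len(sequence)
            for start in range(len(items) - n + 1):
                window = items[start:start + n]
                if all(element_matches(e, i) for e, i in zip(sequence, window)):
                    items[start:start + n] = ["[" + concept + "]"]
                    rewritten = True
                    break
            if rewritten:
                break
    return items


if __name__ == "__main__":
    print(recognize("membrane of a muscle cell"))
    print(recognize("transportation through the membrane of a muscle cell"))
```

In this sketch a recognized [plasma_membrane] can fill the [membrane] slot of the transport pattern only because of the assumed is-a link, which illustrates how recognition leans on existing knowledge rather than on surface strings alone.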
Example Biomedical Annotations [Diagram for the sentence "Interferons inhibit activation of STAT6": resource annotations ra5, ra6, and ra7 each denote a resource; graph annotation ga2 denotesgraph g2, in which P1 is a subclassof Positive Regulation of Biological Process and resultsinregulationof STAT6; graph annotation ga3 denotesgraph g3, in which N1 is a subclassof Negative Regulation of Biological Process, resultsinregulationby Interferon, and regulates P1; basedon links connect each graph annotation to the annotations it builds on]
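The nested structure on this slide can be sketched as named graphs of triples plus annotations that point at them. This is an illustrative data layout only, not the author's implementation: the dictionary shapes and the `supporting_triples` function are hypothetical, and the mapping of ra5, ra6, and ra7 to particular resources, as well as the exact basedon targets, are assumptions because the extracted slide text does not preserve them.

```python
# Named graphs: graph id -> set of (subject, predicate, object) triples.
GRAPHS = {
    "g2": {
        ("P1", "subclassof", "Positive Regulation of Biological Process"),
        ("P1", "resultsinregulationof", "STAT6"),
    },
    "g3": {
        ("N1", "subclassof", "Negative Regulation of Biological Process"),
        ("N1", "resultsinregulationby", "Interferon"),
        ("N1", "regulates", "P1"),
    },
}

# Assumption: which resource each ra* denotes and what each ga* is based on.
ANNOTATIONS = {
    "ra5": {"denotes": "Interferon", "basedon": []},
    "ra6": {"denotes": "Positive Regulation of Biological Process", "basedon": []},
    "ra7": {"denotes": "STAT6", "basedon": []},
    "ga2": {"denotesgraph": "g2", "basedon": ["ra6", "ra7"]},
    "ga3": {"denotesgraph": "g3", "basedon": ["ra5", "ga2"]},
}


def supporting_triples(annotation_id):
    """Collect the triples of every named graph reachable through basedon."""
    triples = set()
    visited = set()
    todo = [annotation_id]
    while todo:
        current = todo.pop()
        if current in visited:
            continue
        visited.add(current)
        annotation = ANNOTATIONS[current]
        graph_id = annotation.get("denotesgraph")
        if graph_id is not None:
            triples |= GRAPHS[graph_id]
        todo.extend(annotation["basedon"])
    return triples


if __name__ == "__main__":
    for triple in sorted(supporting_triples("ga3")):
        print(triple)
```

Following basedon from ga3 pulls in the positive-regulation assertions from g2, so the "inhibit" reading stays connected to the "activation of STAT6" reading it was built from.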
How is Cav3 involved in muscle cell development? [Diagram linking Cav3 to muscle cell through several sources. Nodes: CL: muscle cell, Caveola of muscle cell, CC: Caveola, Caveolin, Caveolin3, Caveolin3 Protein or CAV3 gene, Cav3 gene, CC: T-tubule, CC: membrane raft, CC: membrane fraction. Edge labels: part_of, is-a, contains, has_go:cc_annotation, Translation of Cav3. Sources: Text Book, Protein Ontology [PRO], Gene-Protein, Gene Ontology Annotation.]
[Figure: Attempted representation of CAV3 example using existing ontologies and relations. M. Bada & H. Tipney, 04.22.09. Legend: existing terms and relationships; new terms and relationships; semi-official relationships (i.e. provisional cross products); hypotheses/guesses; Protein [Gene Symbol, Entrez GeneID] and associated (useful) GO annotations. Proteins: Caveolin 3 [Cav3, GeneID: 859, 12391]; (human) insulin [INS, GeneID: 3630]; (mouse) glucose transporter [Slc2a4, GeneID: 20528]. Ontology-prefixed terms: BP: glucose import; BP: positive regulation of glucose import; BP: glucose transport; BP: transport; BP: vesicle-mediated transport; BP: membrane budding; BP: transcytosis; CC: membrane-bounded vesicle; CC: vesicle membrane; CC: caveola; CC: plasma membrane; CC: plasma membrane part; CC: endosome; CC: early endosome; CC: recycling endosome; MF: glucose transmembrane transporter activity; CHEBI: glucose; CL: muscle cell; CL: cell == CC: cell; BFO: continuant. Terms without an ontology prefix: glucose transmembrane transport; glucose transporter; transcytotic caveolar budding in muscle cell; membrane-bounded vesicle budded from caveola of muscle cell; membrane of vesicle budded from caveola of muscle cell; transcytosis of glucose transporter in muscle cell; transcytotic plasma membrane to early endosome transport of glucose transporter in muscle cell; transcytotic early endosome to recycling endosome transport of glucose transporter in muscle cell; transcytotic recycling endosome to plasma membrane transport of glucose transporter in muscle cell; Caveola of muscle cell; plasma membrane of muscle cell; early endosome of muscle cell; recycling endosome of muscle cell. Relations: is_a, part_of, has_part, has_function, precedes, occurs_in, positively_regulates, results_in_transport_of, results_in_transport_from, results_in_transportation_of, results_in_formation_of.]
Annotation for Consumers? The linguistic community typically uses annotation as training data or for specific tasks; there is an abundance of tools (tools for computational linguistics) that can produce annotations in the specific format of those resources. Biomedical annotation is typically used for curating, indexing, or enrichment analysis. But what about re-using annotations and tools in other contexts and for other purposes?
Summary: a model that covers syntactic and semantic annotation (linguistic annotation; entity-based annotation). It captures complex content that is not necessarily best represented via a single URI: created a GraphAnnotation that denotes an RDF named graph. Added kiao:basedon to enable annotation compositions and provenance tracking (annotation-level; assertion-level).
Acknowledgements. University of Colorado, Hunter Lab: Larry Hunter, Mike Bada, Bill Baumgartner, Chris Roeder. National ICT Australia: Karin Verspoor. Funding: NIH/NLM training grant; Andrew W. Mellon Foundation.