Leveraging Machines for Causal Modeling

Similar documents
Scientific Discovery as Link Prediction in Influence and Citation Graphs

Measuring impact. William Parienté UC Louvain J PAL Europe. povertyactionlab.org

OncoPPi Portal A Cancer Protein Interaction Network to Inform Therapeutic Strategies

Clinical Genome Knowledge Base and Linked Data technologies. Aleksandar Milosavljevic

Myanmar Food and Nutrition Security Profiles

The Evolving Global Nutrition Situation: Why Forests and Trees Matter

Skin cancer reorganization and classification with deep neural network

How to strengthen the link between biology and implementation for sustainable action

Introduction. Introduction

Nature Genetics: doi: /ng Supplementary Figure 1. SEER data for male and female cancer incidence from

Cancer Informatics Lecture

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Myanmar - Food and Nutrition Security Profiles

Laos - Food and Nutrition Security Profiles

Executive Board meeting

Tuvalu Food and Nutrition Security Profiles

How can Natural Language Processing help MedDRA coding? April Andrew Winter Ph.D., Senior Life Science Specialist, Linguamatics

Democratic Republic of Congo

Identification of Tissue Independent Cancer Driver Genes

Breast cancer. Risk factors you cannot change include: Treatment Plan Selection. Inferring Transcriptional Module from Breast Cancer Profile Data

Madagascar. Monitoring, Evaluation, Accountability, Learning (MEAL) COUNTRY DASHBOARD MADAGASCAR

38 Int'l Conf. Bioinformatics and Computational Biology BIOCOMP'16

Projecting the Economic Consequences of Malnutrition in Lao PDR

The Human Behaviour-Change Project

The global evidence-base for what different sectors can do to contribute to undernutrition

Undernutrition & risk of infections in preschool children

New methods for collecting evidence

CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

Nauru Food and Nutrition Security Profiles

Draft of the Rome Declaration on Nutrition

PCxN: The pathway co-activity map: a resource for the unification of functional biology

Research Proposal Development. Saptawati Bardosono

Semantic Alignment between ICD-11 and SNOMED-CT. By Marcie Wright RHIA, CHDA, CCS

Central African Republic

Monitoring, Evaluation, Accountability, Learning (MEAL) Enabling Environment Finance for. Nutrition

Mapping evolutionary pathways of HIV-1 drug resistance using conditional selection pressure. Christopher Lee, UCLA

Monitoring, Evaluation, Accountability, Learning (MEAL) Enabling Environment Finance for. Nutrition

The Curing AI for Precision Medicine. Hoifung Poon

MALNUTRITION: NIGERIA S SILENT CRISIS

Papua New Guinea. Monitoring, Evaluation, Accountability, Learning (MEAL) COUNTRY DASHBOARD PAPUA NEW GUINEA

Monitoring, Evaluation, Accountability, Learning (MEAL) Enabling Environment Finance for. Nutrition

Monitoring, Evaluation, Accountability, Learning (MEAL) Enabling Environment Finance for. Nutrition

Monitoring, Evaluation, Accountability, Learning (MEAL) Enabling Environment Finance for. Nutrition

Uganda. Monitoring, Evaluation, Accountability, Learning (MEAL) COUNTRY DASHBOARD UGANDA

Childhood Undernutrition: a biological perspective

Innovative Risk and Quality Solutions for Value-Based Care. Company Overview

Critical Issues in Child and Maternal Nutrition. Mainul Hoque

The role of double duty actions in addressing the double burden of malnutrition

SUPPLEMENTARY INFORMATION

Take-Home Final Exam: Mining Regulatory Modules from Gene Expression Data

Cook Islands Food and Nutrition Security Profiles

National Academies Next Generation SAMPLE Researchers TITLE Initiative HERE

Nature Methods: doi: /nmeth.3115

Autism Pathways Analysis: A Functional Framework and Clues for Further Investigation. Martha Herbert PhD MD Ya Wen PhD July 2016

WHO Updates Essential Nutrition Actions: Improving Women s, Newborn, Infant and Young Child Health and Nutrition

Clinical Decision Support Technologies for Oncologic Imaging

Schema-Driven Relationship Extraction from Unstructured Text

Case Studies on High Throughput Gene Expression Data Kun Huang, PhD Raghu Machiraju, PhD

Marshall Islands Food and Nutrition Security Profiles

Indonesia - Food and Nutrition Security Profiles

Selecting a research method

Knowledge networks of biological and medical data An exhaustive and flexible solution to model life sciences domains

2018 Edition The Current Landscape of Genetic Testing

Clustered mutations of oncogenes and tumor suppressors.

Global database on the Implementation of Nutrition Action (GINA)

IPA Advanced Training Course

Predicting the Effect of Diabetes on Kidney using Classification in Tanagra

Bjoern Peters La Jolla Institute for Allergy and Immunology Buenos Aires, Oct 31, 2012

Learning the Fine-Grained Information Status of Discourse Entities

Experimental Methods. Anna Fahlgren, Phd Associate professor in Experimental Orthopaedics

CHAPTERS 1-2. Developmental Psychology. A Topical Approach to Lifespan Development

Brendan O Connor,

Welcome. Nanostring Immuno-Oncology Summit. September 21st, FOR RESEARCH USE ONLY. Not for use in diagnostic procedures.

What Smokers Who Switched to Vapor Products Tell Us About Themselves. Presented by Julie Woessner, J.D. CASAA National Policy Director

Variant Classification. Author: Mike Thiesen, Golden Helix, Inc.

www. epratrust.com Impact Factor : p- ISSN : e-issn :

Building Cognitive Computing for Healthcare

Reviewers' comments: Reviewer #1 (Remarks to the Author):

Watson Summit Prague 2017

Philippines - Food and Nutrition Security Profiles

Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods

Support for Cognitive Science Experiments

Solomon Islands Food and Nutrition Security Profiles

Translational Bioinformatics: Connecting Genes with Drugs

How_Scientific_Teams_Develop_New_Anti- Cancer_Drugs English mp4_ [00:00:00.00]

Catalyzing Progress Toward the Global Nutrition Targets: Three Potential Financing Packages

THE ROME ACCORD ICN2 zero draft political outcome document for 19 November 2014

Cellular Signaling Pathways. Signaling Overview

Improving HIV/AIDS management in children:nutrition as a vital component

Asthma Surveillance Using Social Media Data

Social Determinants of Health

Chapt 15: Molecular Genetics of Cell Cycle and Cancer

Genetic alterations of histone lysine methyltransferases and their significance in breast cancer

Trial Designs. Professor Peter Cameron

2018 Global Nutrition

Update on the nutrition situation in the Asia Pacific region

Malnutrition Experience in Sultanate of Oman. Dr Salima almamary Family physician Nutrition Department

arxiv: v1 [cs.lg] 4 Feb 2019

JOINT FAO/WHO FOOD STANDARDS PROGRAMME

Child Survival Basic Training

Transcription:

Leveraging Machines for Causal Modeling MIHAI SURDEANU October 25, 2016

PROBLEM STATEMENT AND DATA ANALYSIS METHODOLOGY Goal: Build machines that help humans to model and analyze very complicated systems by reading fragmented literatures, and assembling and reasoning with models. Reading Assembly Reasoning Predict, explain, test, curate, etc. Bill & Melinda Gates Foundation 2

PROBLEM STATEMENT AND DATA ANALYSIS METHODOLOGY Goal: Build machines that help humans to model and analyze very complicated systems by reading fragmented literatures, and assembling and reasoning with models. Use case 1: modeling cancer pathways, the solution to which involves reasoning over models of these systems, which have been assembled, usually from text sources, by automatic or semiautomatic methods. Use case 2: understanding biological and socio-economic factors that impact children s health, through automated reading of literature. Bill & Melinda Gates Foundation 3

KEY FINDINGS AND INSIGHTS High-quality machine reading at scale is possible Read all PubMed Open Access literature at 5 seconds/paper Accuracy comparable to human cancer experts, at much higher throughput Use case 1: discovery of novel cancer driving mechanisms Works for 11 different cancer types New hypotheses for precision medicine for cancer treatment Use case 2: discovery of causal mechanisms in children s health Reduced model development time from ~1 month to 2 days Bill & Melinda Gates Foundation 4

HOW WE STARTED, AND WHY This effort is initiated by DARPA s Big Mechanism program. Technology to understand complicated systems Bill & Melinda Gates Foundation 5

TECHNOLOGY TO UNDERSTAND COMPLICATED SYSTEMS Our reach exceeds our grasp, which can be dangerous For many problems and many reasons methodology, cognitive limitations, specialization we have component-level, not system-level understanding. Understanding means cause è effect. Big data associations are not enough. If humans can t understand complicated systems, then machines must help! Flash crash of US t-bills; Oct. 24, 2014 Slide courtesy of: Paul Cohen, DARPA Bill & Melinda Gates Foundation 6

WHICH COMPLICATED SYSTEM? Big Mechanism is not cancer biology program, but molecular signaling in cancer, specifically Ras-driven cancers, is hugely complicated. Slide courtesy of: NIH Ras Initiative Bill & Melinda Gates Foundation 7

WHICH COMPLICATED SYSTEM? Thousands of molecular interactions within two hops of Ras in Pathway Commons. Tiny fragment of a model built by computers reading the literature: Image courtesy of: Andrey Rzhetsky, University of Chicago Bill & Melinda Gates Foundation 8

OVERVIEW Machine reading system Use case 1: discovery of novel cancer drivers Use case 2: modeling children s health Bill & Melinda Gates Foundation 9

OVERVIEW Machine reading system Use case 1: discovery of novel cancer drivers Use case 2: modeling children s health Bill & Melinda Gates Foundation 10

MACHINE LEARNING DOES NOT APPLY HERE Black box approaches to reading using machine learning not a great choice, especially in inter-disciplinary projects! Unstructured text Machine learning Structured data There was no training data to match the Big Mechanism vision Bill & Melinda Gates Foundation 11

NEW PLATFORM So we ended up designing a new platform for machine reading: information extraction language + runtime environment Mark the object of the verb phosphorylate as a protein being phosphorylated, and the subject as the controller of the interaction Expressive: patterns over syntax or over plain text Captures complex language structures such as nested interactions Handles language ambiguity Fast runtime Bill & Melinda Gates Foundation 12

HIGH-LEVEL ARCHITECTURE Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 13

HIGH-LEVEL ARCHITECTURE Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 14

WALKTHROUGH EXAMPLE FOR READING OF CAUSAL RELATIONS Protein ubiquitination of Protein Heavy use of bio ontologies here inhibits Order of layers is automa0cally detected, or it can be manually configured Bill & Melinda Gates Foundation 15

FEW GRAMMAR RULES Type Syntax Surface Total Entities 0 15 15 Generic entities 0 2 2 Modifications 0 6 6 Mutants 0 9 9 Total entities 0 32 32 Simple events 15 11 26 Binding 30 7 37 Hydrolysis 8 2 10 Translocation 12 0 12 Positive regulation 16 4 20 Negative regulation 14 3 17 Total events 95 27 122 Total 95 59 154 Bill & Melinda Gates Foundation 16

HIGH-LEVEL ARCHITECTURE Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 17

COREFERENCE RESOLUTION Cells were transfected with N540K, G380R, R248C, Y373C, K650M and K650E- FGFR3 mutants and analyzed for activatory STAT1(Y701) phosphorylation 48 hours later. [...] In 293T and RCS cells, all six FGFR3 mutants induced activatory ERK(T202/Y204) phosphorylation (Fig. 2). Unspecified mutations, leading to 6 x 2 events LL-37 forms a complex together with the IGF-1R... and this binding results in IGF-1R activation Common noun phrase refers to a specific interaction Bill & Melinda Gates Foundation 18

HEDGING We hypothesize that O 2 inhibits the Cdc25-mediated wt Ras GDP dissociation The result suggests that both the intrinsic and the Cdc25-mediated wt Ras GDP dissociation are insensitive to H2O2. Degree of substantiation None 43% Partial 60% Full 80% Precision Analysis courtesy of: Dayne Freitag, SRI Bill & Melinda Gates Foundation 19

HIGH-LEVEL ARCHITECTURE Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 20

CONTEXTUALIZING THE EXTRACTIONS IS CRUCIAL Species Cell type The ability of CCL18 to inhibit CCL11- and CCL13-induced chemotactic responses of human eosinophils has previously been reported [27]. We show here that it is able to inhibit the chemotactic responses of other CCR3 agonists (Figure 1A). Slide courtesy of: Clayton Morrison, UA Bill & Melinda Gates Foundation 21

BIOLOGICAL CONTEXT Biological context serves as an index to the type of biological system: which mechanisms are (possibly) present / active. Context as specification of biological containers: Species: human, yeast Organ: liver, lung Tissue type: lymphoid, embryo Cell type: t-cell, endothelial Cell line: PCS-100-020 (human, artery, endothelial) Subcellular locations: endosome, nucleus Slide courtesy of: Clayton Morrison, UA Bill & Melinda Gates Foundation 22

HIGH-LEVEL ARCHITECTURE Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 23

ASSEMBLY DRIVEN BY CAUSAL PRECEDENCE Causal precedence = Interaction A precedes B if and only if the output of A is necessary for the successful execution of B Bill & Melinda Gates Foundation 24

WHY CAUSAL PRECEDENCE IS IMPORTANT The SH2 domain of c-src binds VEGFR2, when VEGFR2 is phosphorylated at Y1057 precedes Bill & Melinda Gates Foundation 25

WHY CAUSAL PRECEDENCE IS IMPORTANT The SH2 domain of c-src binds VEGFR2, before when VEGFR2 is phosphorylated at Y1057 follows (skipping the technical details in the interest of time) Bill & Melinda Gates Foundation 26

A COMPLETE MACHINE READING SYSTEM Reading Causal Relations Assembly Discourse Analysis Context Extraction Bill & Melinda Gates Foundation 27

HOW WELL DOES MACHINE READING WORK? Human domain experts are around here 1.00 Precision 0.75 0.50 Team 1 Team 3 Team 4(B) Our System Team 4 Team 2 0.25 0.00 0 250 500 750 1000 Throughput Bill & Melinda Gates Foundation 28

OVERVIEW Machine reading system Use case 1: discovery of novel cancer drivers Use case 2: modeling children s health Bill & Melinda Gates Foundation 29

THE MUTEX ALGORITHM We have huge amounts of data relating gene expression levels to cancers. We don t know the underlying mechanisms. Mutex insight: If a tumor wants to disable a mechanism, it will mutate something upstream, but it generally won t pay for two mutations that do the same thing. So mutually exclusive mutations plus a good model can tell us which mechanisms the tumor disables. Y is a kill switch X that prevents proliferation B Y C Z Blue and Brown are mutations associated with cancer, but which mechanism(s) do they influence? If you wanted to disable Y, and you could do it by mutating Blue or Brown, but you wouldn t get additional value from mutating both, then a population with disabled Y will have some Blue and some Brown mutations but very few instances of both. Now reverse the reasoning: If Blue and Brown are mutually exclusive mutations, then their purpose is probably to disable Y, not X or Z. Thus, Y is part of the cancer mechanism. Slide courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 30

OBSERVATION OF MUTUAL EXCLUSIVITY OF GENOMIC ALTERATIONS IN CANCER For instance, in Glioblastoma: each column is a patient, each row is a gene One alteration in these genes is enough for activating E2F1 function. These alterations are not random but they are selected by the cancer. When one is altered, selection pressure is lifted from others, hence generating the mutual exclusion pattern. If those were random alterations, then we would observe more overlaps of those alterations in patients. Slide courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 31

WHY IT MATTERS Why it matters: Mutations often disable kill switches that prevent proliferation. Fixing the downstream effects of one mutation is not enough: The cell will mutate again to find another way to have the same downstream effects. Therapy should disable all upstream mutations, or the driver itself. Knowing more is better. Bill & Melinda Gates Foundation 32

HOW MUTEX WORKS Mutex does a graph search on the signaling network to find subgraphs of genes that Are altered in mutually exclusive manner, and Have a common downstream signaling target. The important issue is what data is used to search for downstream targets. Bill & Melinda Gates Foundation 33

SEARCHING FOR DOWNSTREAM PROTEINS Previously Demir et al. were searching over Pathway Commons (PC), a manually-curated database of biochemical interactions. PC is estimated to cover 1% of the literature. We expanded the dataset automatically by reading all PubMed Open Access papers using the system introduced before. Bill & Melinda Gates Foundation 34

DATA COMPARISON Pathway Commons v7 215,611 relations REACH extracted information 632,509 relations 10,913 relations Bill & Melinda Gates Foundation 35

POTENTIAL CANCER DRIVING MECHANISMS DISCOVERED HUS1B PIK3CA PTEN post-translational control (protein-level) expression control (RNA-level) Yellow indicates new discovery after adding our data CHEK1 HUS1B CHEK1: Hus1 loss results in abnormal gammah2ax localization and increased CHK1 phosphorylation. PTEN CHEK1: PTEN also induces phosphorylation and monoubiquitination of DNA damage checkpoint kinase, Chk1 Members of a mutex group are shown in a compound node, where its label indicates sample-coverage of the alterations. Highlighted edges do not exist in Pathway Commons (PC), and highlighted nodes cannot be detected using PC only. PIK3CA PTEN: p110alpha inactivation can inhibit the impact of PTEN loss Slide courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 36

POTENTIAL CANCER DRIVING MECHANISMS FOR TCGA BREAST CANCER DATASET (BRCA) post-translational control (protein-level) expression control (RNA-level) Yellow indicates new discovery after adding our data Members of a mutex group are shown in a compound node, where its label indicates sample-coverage of the alterations. Highlighted edges do not exist in Pathway Commons (PC), and highlighted nodes cannot be detected using PC only. Slide courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 37

GBM MUTEX RESULTS AFTER ADDING THE NEW DATA Highlighted genes can be detected after adding our data Image courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 38

LGG MUTEX RESULTS AFTER ADDING THE NEW DATA Highlighted genes can be detected after adding our data Image courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 39

LUAD MUTEX RESULTS AFTER ADDING THE NEW DATA Highlighted genes can be detected after adding our data Image courtesy of: Emek Demir, OHSU Bill & Melinda Gates Foundation 40

OVERVIEW Machine reading system Use case 1: discovery of novel cancer drivers Use case 2: modeling children s health Bill & Melinda Gates Foundation 41

NEW USE CASE: CHILDREN S HEALTH Do exactly the same, but model children s health Collaboration with Lyn Powell HBGDki-qPM team Very preliminary work: we had < 2 weeks to adapt our system Turns out biology was the easy use case Many factors impact children s health, from biology to socio-economic No ontologies Bill & Melinda Gates Foundation 42

PROCESS Search PubMed for relevant papers Query: (undernutrition OR stunting OR ) AND (infant OR child OR...) Result: 1K papers Apply our modified machine reading system Result: 117 causal pathways, 158 unique causal relations The system was modified to function without ontologies Expert inspect the generated pathways and constructs curated model Bill & Melinda Gates Foundation 43

RESULTING MODEL DE - MODIFIED 1K OUTPUT health practitioner informed parent supplementary Finding food one of these childhood links obesity manually EBF rural setting obesity may well take 1-3 days sometimes later obesity acute malnutrition overweight obesity MGNREGR one is lucky and comes across a HIV poverty malnutrition review paper where someone Breastfeeding has childhood/infant mal. Infection under-nutrition already drawn a fairly Diarrhea Disease mature IMR mental inadequate nutrition (incl ci) Death model PEM but filling in the details and CVD EBF food insecurity making sure they are not biased birth ARI NCD can economic decisions asphyxia in Bas-Congo take much longer. economic growth Campylobacter nutrition (including all positive food supply changes) excess high protein intake Adult obesity weight gain in infancy Stunting cheap processed food nutrient enriched diet rapid urbanisation government subsidy HFD pneumonia mms Nutrition education low maternal cd4 cell count socio-economic changes parent caregiver poor maternal heath/death Inadequate infant care Immune system abnormality poor nutrition Improved water access provision enforcement of state regulations early childhood obesity prevention Constructed in 2 days. Normally, it takes 1 month. Model courtesy of: Lyn Powell, HBGDki-qPM team Bill & Melinda Gates Foundation 6 Bill & Melinda Gates Foundation 44

KEY FINDINGS AND INSIGHTS High-quality machine reading at scale is possible Read all PubMed Open Access literature at 5 seconds/paper Accuracy comparable to human cancer experts, at much higher throughput Use case 1: discovery of novel cancer driving mechanisms Works for 11 different cancer types New hypotheses for precision medicine for cancer treatment Use case 2: discovery of causal mechanisms in children s health Reduced model development time from ~1 month to 2 days Bill & Melinda Gates Foundation 45

THANK YOU! Acknowledgments: Vision: Paul Cohen, DARPA Big Mechanism program manager Actually implementing it: the University of Arizona team: Marco A. Valenzuela-Escarcega, Gus Hahn-Powell, Dane Bell, Thomas Hicks, Enrique Noriega-Atala, Clayton Morrison Use case 1: Emek Demir, Özgün Babur Oregon Health & Science University Use case 2: Lyn Powell HBGDki-qPM team Bill & Melinda Gates Foundation 46