Contents. Just Classifier? Rules. Rules: example. Classification Rule Generation for Bioinformatics. Rule Extraction from a trained network


Transcription:

Classification Rule Generation for Bioinformatics
Hyeoncheol Kim

Contents
  Rule Extraction from Neural Networks
    Algorithm
    Example: Promoter Domain
  Hybrid Model of Knowledge and Learning
    Knowledge refinement
    Network refinement
  Rule Generation with Genetic Algorithm
    SARS CoV Protease Cleavage Site Prediction
    MHC Binding Peptide Prediction

Just Classifier?
  NN and SVM: good classifiers.
  Just classification, but no description and no explanation.
  Symbolic knowledge is VERY important in the medical and molecular biological domains. Domain experts want the why and how too, in addition to just answers.

Rule Extraction from a trained network
  Provides the symbolic interpretation!
  [Diagram labels: Domain data; Neural network learning; Connection weights; Rule extraction; Symbolic rules; Domain understanding, knowledge acquisition, data mining, etc.]

Rules
  Example of an If-Then rule: 2 dimensions, each multi-valued. DNA sequence of size 2.
  [Figure: grid with axes p1 and p2, each taking the values A, C, G, T]
  If p1=c and p2=t, then Class 1   (most specific; covers just 1 case)
  If p1=c, then Class 1
  If p1=~c, then Class 1
  If *, then Class 1   (most general; covers all the cases)
  4^2 = 16 cases; 5^2 = 25 rules.
  Looking for the rules that are maximally general and accurate.

Rules: example
  A dataset of 362 amino acid sequences, each of length 8: 8 dimensions, each dimension 20-valued.
  Our representation, for convenience (x means "don't care"):
    xxxfxpxx => if F@4 & P@6, then Cleaved. Covers 20^6 = 64,000,000 instances.
    xxxxxexx => if E@6, then Cleaved. Covers 20^7 = 1,280,000,000 instances.
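The don't-care representation above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the function names (matches, coverage, describe) are hypothetical, and the alphabet size of 20 is taken from the slide's "each dimension is 20-valued".

```python
ALPHABET_SIZE = 20  # 20-valued dimensions, as stated on the slide


def matches(rule: str, peptide: str) -> bool:
    """True if every non-'x' position of the rule equals the peptide residue."""
    if len(rule) != len(peptide):
        return False
    return all(r == "x" or r == p for r, p in zip(rule.lower(), peptide.lower()))


def coverage(rule: str, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct sequences a rule covers: one free choice per 'x'."""
    return alphabet_size ** rule.lower().count("x")


def describe(rule: str) -> str:
    """Render a rule in the slide's 'F@4 & P@6' notation (1-based positions)."""
    conds = [f"{r.upper()}@{i}" for i, r in enumerate(rule, start=1) if r.lower() != "x"]
    return " & ".join(conds) if conds else "*"


if __name__ == "__main__":
    for rule in ("xxxfxpxx", "xxxxxexx"):
        print(rule, "=> if", describe(rule), "then Cleaved; covers", coverage(rule))
```

Each extra don't-care position multiplies the coverage by 20, which is why the one-condition rule covers 20 times as many instances as the two-condition rule.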

If-Then rules cover rectangular areas, and thus cannot cover 100%. However, in most cases maximally general rules are good enough.
  [Figure: regions labelled class1 and class2]

Issues
  Types of rules?
  Descriptive accuracy
  Human understandability
  How to generate the rules?
  Would the rules generated from DT, NN and SVM be different?

Rule extraction from NN
  Input vector: 5 dimensions, binary inputs x1, x2, x3, x4, x5. Output vector: y1.
  Input pattern x x 1 0 x: min/max output?
  The rule xx10x covers 2^3 = 8 instances.
  If Min_output > 0.7, then we get the rule xx10x with 100% accuracy.

Two approaches
  Black-box approach: the NN is a black box. Just look at input-output pairs to induce a set of rules.
  White-box (decompositional) approach:
    1. Extract rules from each hidden and output node.
    2. Aggregate the intermediate rules to form the composite rule base.

Decompositional Approaches: Rule extraction from a single node
  Inputs x1, x2, x3, x4, x5 with weights W = 1, 2, -6, 3, 0.5; threshold = -1.
  Consider a combination (x1, x2): weight-sum(x1, x2) = 1*1 + 1*2 + ?*(-6) + ?*3 + ?*0.5.
  There are 2^(5-2) = 8 possible inputs.
  Lowest weight-sum(x1, x2) = 1*1 + 1*2 + 1*(-6) + 0*3 + 0*0.5 = -3 < threshold; thus the rule (x1, x2) is NOT valid.
  Examples of valid combinations: (x1, x2, x4), (not x3).

Decompositional Approaches: Rule extraction from a single node
  Find a complete set of rules (not a single rule or a partial set of rules) that are:
    valid (i.e., lowest weight-sum > threshold),
    maximally general (i.e., smaller size),
    not subsumed; e.g., (x1, x2) subsumes (x1, x2, x4).
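The single-node validity test can be checked mechanically. The sketch below assumes the weights 1, 2, -6, 3, 0.5 belong to x1 through x5 in order and uses the threshold -1 from the slide. The exhaustive search is an illustration only: it is not one of the heuristics named in these slides, and all function names are hypothetical.

```python
from itertools import combinations, product

WEIGHTS = [1, 2, -6, 3, 0.5]  # assumed order: weights of x1..x5
THRESHOLD = -1


def lowest_weight_sum(literals, weights=WEIGHTS):
    """Worst-case weighted sum for a rule.

    `literals` maps a 0-based input index to its fixed value (1 = x, 0 = not x).
    Every free input takes the value that minimises the sum: 1 when its
    weight is negative, 0 when its weight is positive.
    """
    total = 0.0
    for i, w in enumerate(weights):
        total += w * literals[i] if i in literals else min(w, 0)
    return total


def is_valid(literals, weights=WEIGHTS, threshold=THRESHOLD):
    """A rule is valid when even its worst case exceeds the threshold."""
    return lowest_weight_sum(literals, weights) > threshold


def maximally_general_rules(weights=WEIGHTS, threshold=THRESHOLD):
    """Brute-force search, smallest rules first, skipping subsumed rules."""
    n = len(weights)
    found = []
    for size in range(n + 1):
        for idxs in combinations(range(n), size):
            for vals in product((0, 1), repeat=size):
                lits = dict(zip(idxs, vals))
                if any(g.items() <= lits.items() for g in found):
                    continue  # a more general valid rule already subsumes it
                if is_valid(lits, weights, threshold):
                    found.append(lits)
    return found


if __name__ == "__main__":
    print(lowest_weight_sum({0: 1, 1: 1}))  # the (x1, x2) combination
    print(maximally_general_rules())
```

With these weights the search returns (not x3), (x1, x2, x4) and (x2, x4, x5). The brute-force loop visits every conjunction of literals, so it is only practical for a handful of inputs.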

Decompositional Approaches: Complexity
  Rule search space: exponential complexity in the number of attributes. n attributes: rule space 3^n.
  Heuristics: KT [Fu], MofN [Towell], RX [Setiono], OAS [Kim]

Promoter Domain
  Data obtained from the public domain (Univ. of Wisconsin)
  106 instances
  Each instance string comprises 57 sequential nucleotides:
    50 nucleotides before the transcription start point
    6 nucleotides following the transcription start point
  An instance is Positive if the promoter region is present in the string, Negative otherwise.

Promoter Domain
  Domain data
    2 classes (promoter and non-promoter)
    57 attributes (discretized into 57*4 = 228 binary attributes)
    106 instances (53 promoters, 53 non-promoters)
  Neural network (MLP)
    228-4-1 architecture

Promoter: experiment 1
  NN trained: 100% correct on the 106 instances
  7 rules extracted

Hybrid System (Knowledge + Experience)
  [Diagram labels: Domain-specific model; Domain data; Domain knowledge]
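The discretization of 57 nucleotide attributes into 57*4 = 228 binary attributes can be illustrated with a one-hot encoding. This is a sketch under assumptions: the slides do not give the bit order, so the "acgt" ordering and the function names are hypothetical. The second function restates the 3^n rule-space size, reading each binary attribute as present, negated, or absent.

```python
NUCLEOTIDES = "acgt"  # assumed bit order; the slides do not specify one


def one_hot(sequence: str) -> list[int]:
    """Encode each nucleotide as 4 binary attributes (exactly one bit set)."""
    bits = []
    for base in sequence.lower():
        if base not in NUCLEOTIDES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        bits.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return bits


def rule_space_size(n_attributes: int) -> int:
    """Each attribute is positive, negated, or absent in a rule: 3^n rules."""
    return 3 ** n_attributes


if __name__ == "__main__":
    example = "acgt" * 14 + "a"  # made-up 57-nucleotide string, not real data
    print(len(one_hot(example)))  # number of binary inputs to the network
    print(rule_space_size(5))     # rule space for a 5-input node
```

For 228 binary attributes the same formula gives 3^228 candidate rules, which is why the slides turn to heuristics rather than exhaustive search.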

My Work on Neural Networks
  1. Knowledge extraction
  2. Knowledge revision
  3. Network revision based on knowledge

Knowledge Revision using Neural Networks
  [Diagram labels: Knowledge base (rules); mapping; Neural network; train; extract; Revised neural network; Revised knowledge base; Revised domain knowledge]

Knowledge-based Neural Network training

Experiment: Promoter domain
  1. Initial domain knowledge: 14 rules
  2. Mapped to KBNN (228-4-1)
  3. KBNN trained on 106 instances
  4. Interpret the KBNN into 14 rules: revised knowledge

Experiment: Promoter domain
  The revised theory improves considerably over the initial theory.
  Error rate: 53/106 -> 3/106

Neural Networks based on Self-Extracted Knowledge
  Restructuring process
  [Diagram labels: Domain data; Neural network learning; Connection weights; Rule extraction; Architecture (# of nodes, connections, etc.); Symbolic rules: concise, available, complete, less noisy, etc.]

Experiment: Promoter
HIV Cleavage Site Prediction (HIV Protease Cleavage)

  Network                 # of layers   # of connections   Ave. generalization
  Original NN (228-4-1)   3             921                81.1%
  RBNN                    5             64                 82.1%
  ARBNN (224-5-1)         3             12                 84.9%

Genetic Algorithm
  [Diagram: initial population (1 0 0 1 1 1 0 1 1 1 0 0 1 1 1 1); evolution process; final population (1 1 0 0 1 1 1 1 1 1 0 0 1 1 1 1)]
  Problem to be solved; encoding; crossover, mutation, natural selection (fitness function); decoding; problem solved.
  The fittest ones survive (fitted to the fitness function).

Rule Space of Cleavage Site Prediction
  8-residue amino acid sequences
  A total of 21^8 (= 37,822,859,361) different rules are possible.
  What search strategy?
    Random search
    Exhaustive search
    Heuristic search
    Genetic algorithm search

Genetic Algorithm
  Drawbacks of GA: sensitive to the initial population
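The encode, crossover, mutate, select loop named above can be sketched on bit-string chromosomes. This is a generic textbook-style GA written for illustration, not the procedure used in the slides: truncation selection, the parameter values and the function names are all assumptions. The constant at the top reproduces the 21^8 rule-space size, presumably 20 amino acids plus the don't-care symbol per position.

```python
import random

RULE_SPACE = 21 ** 8  # size of the 8-residue rule space quoted on the slide


def crossover(a, b, rng):
    """Single-point crossover at a random cut; returns two children."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def evolve(fitness, length=8, pop_size=10, generations=50, rate=0.05, seed=0):
    """Minimal GA: keep the fitter half, refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]  # natural selection (assumed scheme)
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            c1, c2 = crossover(p1, p2, rng)
            children += [mutate(c1, rate, rng), mutate(c2, rate, rng)]
        pop = survivors + children[: pop_size - len(survivors)]
    return max(pop, key=fitness)


if __name__ == "__main__":
    print(RULE_SPACE)
    print(evolve(fitness=sum))  # toy fitness: number of 1 bits
```

Because the starting population is random, different seeds can end in different final populations, which is the sensitivity to the initial population that the slides list as a drawback.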

Genetic Algorithm
  Knowledge-Based Genetic Algorithm (KBGA)
  Knowledge-based:
    1. Domain expert
    2. Data-oriented knowledge

Genetic Algorithm
  Population size: 10
  Chromosome representation: e.g., xfxxaxxl
  Random crossover point
  Random mutation
  Fitness function
    Generality: # of x symbols in a chromosome
    Accuracy: over the dataset

SARS CoV Protease Cleavage Site Prediction
  Experiment: mutant of coronavirus
  Data set
    Instance: a sequence of 8 amino acids
    70 positive, 1267 negative instances
    All positive instances include Q@p1
  [Diagram labels: Data set; Sequence logo; DT; NN; KBGA]
  [Figure: sequence logo, GA]
  Three consensus patterns: LQ; LQ[S/A]; [T/S/A]x[L/F]Q[S/A/G]
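The two fitness ingredients listed above, generality and accuracy, can be computed directly on a rule chromosome such as xfxxaxxl. The slides do not say how the two terms are combined, so the weighted sum below is an assumption, as are the function names and the toy peptides, which are made up for illustration and are not taken from the SARS data set.

```python
def covers(rule: str, peptide: str) -> bool:
    """A rule covers a peptide when all of its non-'x' positions match."""
    return len(rule) == len(peptide) and all(
        r == "x" or r == p for r, p in zip(rule, peptide)
    )


def generality(rule: str) -> int:
    """Number of don't-care symbols in the chromosome."""
    return rule.count("x")


def accuracy(rule: str, positives, negatives) -> float:
    """Fraction of instances classified correctly by 'covered => positive'."""
    correct = sum(covers(rule, p) for p in positives)
    correct += sum(not covers(rule, n) for n in negatives)
    return correct / (len(positives) + len(negatives))


def fitness(rule: str, positives, negatives, weight: float = 0.1) -> float:
    """Assumed combination: accuracy plus a small bonus for generality."""
    return accuracy(rule, positives, negatives) + weight * generality(rule) / len(rule)


# Made-up toy peptides (lower case, length 8), for illustration only.
TOY_POSITIVES = ["aaalqsaa", "ttalqaaa"]
TOY_NEGATIVES = ["aaalksaa"]

if __name__ == "__main__":
    for rule in ("xxxlqxxx", "xxxxxxxx"):
        print(rule, fitness(rule, TOY_POSITIVES, TOY_NEGATIVES))
```

Keeping the generality weight small means that, between two rules of equal accuracy, the more general one scores higher, while a fully general rule that misclassifies negatives still loses to an accurate one.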

GA: Prediction of MHC class I binding peptides
  T-cells are key players in regulating a specific immune response. Activation of cytotoxic T-cells requires recognition of specific peptides bound to Major Histocompatibility Complex (MHC) class I molecules. Prediction of MHC-binding properties is therefore a very important issue for the rational design of peptide vaccines aimed at boosting the immune response against a foreign antigen. Only one in 100 to 200 potential binders actually binds to a given MHC molecule, so a good prediction method for MHC class I binding peptides can reduce the number of candidate binders that need to be synthesized and tested.

Experiment
  [Diagram labels: MHC data set; DT; NN; SVM; KBGA]
  Each instance belongs to the binder or non-binder class.
  Performance: ANN, SVM

Experimental Results: HLA-A*0201 (9) (accuracy > 75%)
Experimental Results: HLA-A*0201 (10) (accuracy > 70%)

Experimental Results: HLA-A1 (accuracy > 75%)
Experimental Results: HLA-A3 (accuracy > 75%)
Experimental Results: HLA-B*8 (accuracy > 75%)
Experimental Results: HLA-B*2705 (accuracy > 75%)

Sequence Logo of MHC peptides

Knowledge Issues
  [Diagram labels: Domain; DT; NN; GA]

Rule extraction from SVM
  Nahla Barakat and Joachim Diederich: Learning-based Rule-Extraction from Support Vector Machines.
  Núñez, Angulo and Català: Rule extraction from support vector machines, 2002.
  Glenn Fung, Sathyakama Sandilya, R. Bharat Rao: Rule extraction from Linear Support Vector Machines.