The Development and Application of Bayesian Networks Used in Data Mining Under Big Data

Similar documents
ERA: Architectures for Inference

Lecture 3: Bayesian Networks 1

High-level Vision. Bernd Neumann Slides for the course in WS 2004/05. Faculty of Informatics Hamburg University Germany

International Journal of Software and Web Sciences (IJSWS)

Bayesian Inference Bayes Laplace

3. L EARNING BAYESIAN N ETWORKS FROM DATA A. I NTRODUCTION

Sparse Coding in Sparse Winner Networks

Hierarchical Bayesian Modeling of Individual Differences in Texture Discrimination

TEACHING YOUNG GROWNUPS HOW TO USE BAYESIAN NETWORKS.

Subjective randomness and natural scene statistics

CS 4365: Artificial Intelligence Recap. Vibhav Gogate

A Brief Introduction to Bayesian Statistics

Evaluating Classifiers for Disease Gene Discovery

Some Connectivity Concepts in Bipolar Fuzzy Graphs

Introduction. Patrick Breheny. January 10. The meaning of probability The Bayesian approach Preview of MCMC methods

Artificial Intelligence For Homeopathic Remedy Selection

Logistic Regression and Bayesian Approaches in Modeling Acceptance of Male Circumcision in Pune, India

MBios 478: Systems Biology and Bayesian Networks, 27 [Dr. Wyrick] Slide #1. Lecture 27: Systems Biology and Bayesian Networks

Using Bayesian Networks to Direct Stochastic Search in Inductive Logic Programming

Neuro-Inspired Statistical. Rensselaer Polytechnic Institute National Science Foundation

Lecture 9 Internal Validity

Outline. What s inside this paper? My expectation. Software Defect Prediction. Traditional Method. What s inside this paper?

AB - Bayesian Analysis

Bayesian Belief Network Based Fault Diagnosis in Automotive Electronic Systems

Bayesian (Belief) Network Models,

Advanced Bayesian Models for the Social Sciences. TA: Elizabeth Menninga (University of North Carolina, Chapel Hill)

Type II Fuzzy Possibilistic C-Mean Clustering

Advanced Bayesian Models for the Social Sciences

Chapter 2. Knowledge Representation: Reasoning, Issues, and Acquisition. Teaching Notes

Bayesian Estimation of a Meta-analysis model using Gibbs sampler

A Comparison of Collaborative Filtering Methods for Medication Reconciliation

A Fuzzy Improved Neural based Soft Computing Approach for Pest Disease Prediction

Decisions and Dependence in Influence Diagrams

Introduction to Survival Analysis Procedures (Chapter)

Bayesian Networks in Medicine: a Model-based Approach to Medical Decision Making

Data Mining in Bioinformatics Day 4: Text Mining

On the Representation of Nonmonotonic Relations in the Theory of Evidence

Rethinking Cognitive Architecture!

Using Dynamic Bayesian Networks for a Decision Support System Application to the Monitoring of Patients Treated by Hemodialysis

A Vision-based Affective Computing System. Jieyu Zhao Ningbo University, China

Biomedical Machine Learning

STATISTICAL INFERENCE 1 Richard A. Johnson Professor Emeritus Department of Statistics University of Wisconsin

Improved Intelligent Classification Technique Based On Support Vector Machines

BAYESIAN NETWORK FOR FAULT DIAGNOSIS

Support system for breast cancer treatment

A Bayesian Network Model of Knowledge-Based Authentication

Bayesian Networks Representation. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University

Representation Theorems for Multiple Belief Changes

Fuzzy Decision Tree FID

Commentary: Practical Advantages of Bayesian Analysis of Epidemiologic Data

The 29th Fuzzy System Symposium (Osaka, September 9-, 3) Color Feature Maps (BY, RG) Color Saliency Map Input Image (I) Linear Filtering and Gaussian

Multi Parametric Approach Using Fuzzification On Heart Disease Analysis Upasana Juneja #1, Deepti #2 *

International Journal of Pharma and Bio Sciences A NOVEL SUBSET SELECTION FOR CLASSIFICATION OF DIABETES DATASET BY ITERATIVE METHODS ABSTRACT

Machine Learning-based Inference Analysis for Customer Preference on E-Service Features. Zi Lui, Zui Zhang:, Chenggang Bai-', Guangquan Zhang:

Combining Risks from Several Tumors Using Markov Chain Monte Carlo

in Engineering Prof. Dr. Michael Havbro Faber ETH Zurich, Switzerland Swiss Federal Institute of Technology

Representation and Analysis of Medical Decision Problems with Influence. Diagrams

Causal Knowledge Modeling for Traditional Chinese Medicine using OWL 2

Progress in Risk Science and Causality

EEL-5840 Elements of {Artificial} Machine Intelligence

Object recognition and hierarchical computation

Artificial Intelligence Programming Probability

A NOVEL VARIABLE SELECTION METHOD BASED ON FREQUENT PATTERN TREE FOR REAL-TIME TRAFFIC ACCIDENT RISK PREDICTION

A Taxonomy of Decision Models in Decision Analysis

Application of Bayesian Network Model for Enterprise Risk Management of Expressway Management Corporation

BIOSTATISTICAL METHODS AND RESEARCH DESIGNS. Xihong Lin Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA

COHERENCE: THE PRICE IS RIGHT

Russian Journal of Agricultural and Socio-Economic Sciences, 3(15)

Hypothesis-Driven Research

Ordinal Data Modeling

BAYESIAN NETWORKS AS KNOWLEDGE REPRESENTATION SYSTEM IN DOMAIN OF RELIABILITY ENGINEERING

Spatial Cognition for Mobile Robots: A Hierarchical Probabilistic Concept- Oriented Representation of Space

Credal decision trees in noisy domains

A Study of Methodology on Effective Feasibility Analysis Using Fuzzy & the Bayesian Network

Clinical Decision Support Systems. 朱爱玲 Medical Informatics Group 24 Jan,2003

A prediction model for type 2 diabetes using adaptive neuro-fuzzy interface system.

AClass: A Simple, Online Probabilistic Classifier. Vikash K. Mansinghka Computational Cognitive Science Group MIT BCS/CSAIL

Clinical decision support (CDS) and Arden Syntax

Keywords Missing values, Medoids, Partitioning Around Medoids, Auto Associative Neural Network classifier, Pima Indian Diabetes dataset.

Bayesian Networks for the Human Element Leroy A. "Jack" Jackson and MAJ Jonathan K. Alt, US Army Training and Doctrine Command

Using Test Databases to Evaluate Record Linkage Models and Train Linkage Practitioners

Bayesian models of inductive generalization

A Network Partition Algorithm for Mining Gene Functional Modules of Colon Cancer from DNA Microarray Data

A Multichannel Deep Belief Network for the Classification of EEG Data

Models for Inexact Reasoning. Imprecision and Approximate Reasoning. Miguel García Remesal Department of Artificial Intelligence

Implementation of Inference Engine in Adaptive Neuro Fuzzy Inference System to Predict and Control the Sugar Level in Diabetic Patient

Statistical Analysis Using Machine Learning Approach for Multiple Imputation of Missing Data

Data Mining Approaches for Diabetes using Feature selection

Predicting Breast Cancer Survivability Rates

THE INDIRECT EFFECT IN MULTIPLE MEDIATORS MODEL BY STRUCTURAL EQUATION MODELING ABSTRACT

CLASSIFICATION OF BRAIN TUMOUR IN MRI USING PROBABILISTIC NEURAL NETWORK

Probabilistic Graphical Models: Applications in Biomedicine

Response to the ASA s statement on p-values: context, process, and purpose

Downloaded from ijbd.ir at 19: on Friday March 22nd (Naive Bayes) (Logistic Regression) (Bayes Nets)

Challenges of Automated Machine Learning on Causal Impact Analytics for Policy Evaluation

Knowledge Discovery for Clinical Decision Support

Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis

Bayes Linear Statistics. Theory and Methods

BACKPROPOGATION NEURAL NETWORK FOR PREDICTION OF HEART DISEASE

Introduction to Bayesian Analysis 1

Transcription:

2017 International Conference on Arts and Design, Education and Social Sciences (ADESS 2017) ISBN: 978-1-60595-511-7 The Development and Application of Bayesian Networks Used in Data Mining Under Big Data JINGWEI ZHANG ABSTRACT Big data era has arrived. With the rapid development of big data, the machine learning based on probability statistics has drawn great attention in recent years from both industry and academia. It also has great applications in the fields of the Internet, finance, natural language, biology, among which the Bayesian network has experienced significant development last year and has become a kind of important machine learning method. This paper summarizes the data mining concept principle and basic method of data mining under the background of big data. Since data mining has strong statistic characteristic, it can be integrated with the Bayesian networks naturally. Therefore, this paper starts with the Bayesian networks, stretching out the Bayesian formula, Bayesian network technology, as well as the advantages of the combination of data mining and the Bayesian networks. Finally, with the biomedicine as an example, it summarizes its development and applications. KEYWORDS Big data, Bayesian network, data mining, Baycs statistics, machine learning INTRODUCTION With the rapid development of big data, the machine learning based on probability statistics has drawn great attention in recent years from both industry and academia. It also has great applications in the fields of the Internet, finance, natural language, biology, among which the Bayesian network has experienced significant development last year and has become a kind of important machine learning method. Bayesian network is model which describes the causal relationship among random variables. It is a combination of probability theory, causal reasoning and graph theory, as well as the unified method of artificial intelligence and data statistics method. It is beyond the artificial intelligence based on the logical theoretical basis and the expert rules based on the category of science and technology. One of its important application is the representation and reasoning of random variable causal knowledge [1]. Jingwei Zhang, 2176246207@qq.com, College of Letters and Science, UC Berkeley, Berkeley 94703, US 911

Bayesian network consists of structure and parameters, which are used to describe the causal relationship between the variables of the qualitative and quantitative analysis. It has the characteristics of versatility, effectiveness and openness (Bayesian network is a platform where can integrate other intelligent technology and other statistical methods), which can effectively transform data into knowledge (with visual and intuitionistic knowledge representation and can organically combined with expert knowledge). With this knowledge, it can reason (with similar reasoning mode of human thinking), in order to solve the uncertainty problems, its effectiveness has been implemented in financial risk analysis, information security, DNA analysis, intelligent medical software the diagnosis, system analysis and control and many other areas, Bayesian network also has a wide application prospect in the analysis of time series causal aspects [2]. Bayesian network has more implementation in statistics, decision analysis, and artificial intelligence and so on. Since data mining has strong statistic characteristic and it origins from Bayesian statistics, it can be integrated with the Bayesian networks naturally. DATA MINING TECHNOLOGY Data mining [1] refers to the progress of extracting the implicit, not known by people in advance, but potentially useful information from large, incomplete, noisy, fuzzy and random data. It is a comprehensive application of statistics, database technology and artificial intelligence technology, and is a set of technology extracted from a large data set by comprehensively using the methods of statistics and machine learning in the database management system. The main data mining methods include association rule learning, cluster analysis, classification analysis, sequence analysis, deviation detection, prediction analysis, pattern similarity mining and regression analysis. The basic process of data mining is shown in Figure 1. Fig. 1. Basic process of Data Mining. 912

BASIC METHODS OF DATA MINING Data mining combines the knowledge of database technology, machine learning, statistics and other fields, and explores the effective, novel, and practical patterns existing in the data from a deeper level. From the specific implementation technology, data mining has the following methods: Data and concepts summarization: data aggregation always induces a data set into higher conceptual level information. The concept summarization is to abstract related data in the database from the low concept layer to the high concept layer. There are mainly two methods which are data cube and object-oriented induction. (2) The decision tree method: Using the information gain in the information theory to find the database with the largest amount of information in the field, create a node to build the decision tree, then according to the different values of the establishment of field tree branches in each sub tree, repeat the establishment of lower nodes and branches. (3) Neural network method: neural network attempts to build the neural structure of human brain. At present, there are three kinds of neural network models: feed-forward network, feedback network and self-organizing network. Neural network is a nonlinear learning model, which can accomplish data mining tasks, such as classification, clustering and feature extraction. (4) Bayesian Network Method: Extracts network models and parameters from the database to infer any knowledge of interest. This is the method that will be discussed in detail in this article. BAYESIAN STATISTICS Bayesian statistical thought was first proposed by Thomas Bayes in the middle of the eighteenth century [4], and he mentioned that the prior distribution of parameters can be obtained by combining the prior distribution of the parameters and the likelihood function, which is the famous Bayesian formula. The posteriori distribution of the parameters obtained by this formula is the basis for the statistical description and deduction of Bayesian statistics. Therefore, compared with the classical statistics, Bayesian statistics not only makes use of the general information and sample information, but also uses the a priori information. Thomas Bayes' s paper pioneered a new idea of statistical inference [5]. Bayesian formula Assume that the two events are denoted as A and B, respectively, and satisfy: Then under condition B, the conditional probability of event Ai is:, (1) The formula (1) is called the Bayesian formula [6]. More generally, the formula can be written as, 913

Assuming that B is the final observation, and Ai is one of the causes of event B, then means that when the event B is observed, it is the probability caused by reason Ai. Bayesian inference is also based on this idea. The above is the Bayesian formula of the event form, in addition, there are Bayesian formula density function form. BAYESIAN NETWORK Bayesian networks has drawn attention from researchers in a number of areas such as data mining, artificial intelligence, and pattern recognition. The structure of the Bayesian network is shown in Figure 2 [7]. Bayesian network is also known as the belief network, is a probability network, which is used to describe the statistical relationship between multiple variables [7]. The Bayesian network is a graphical network based on probability theory, which provides an intuitive representation of the complex relationships between multiple variables. The Bayesian network was originally proposed to solve the problem of uncertainty, because it can express the large-scale complex relationship with a simple straightforward way, hence it is accepted by people of all age. At present, Bayesian network has been paid attention to by researchers in many fields such as data mining, artificial intelligence and pattern recognition. The structure of the Bayesian network is shown in Figure 2 [7]. In this directed acyclic graph structure S, where each node represents a variable, attribute, state, object, proposition, or other entity, the arc between nodes reflects the dependencies between variables. The nodes pointing to the node X are called the parent nodes of X. The Bayesian network can be thought of as a system representation based on probabilistic rules. In addition to what is known as the Bayesian network, it has other terms: causal networks, probabilistic causal networks, and belief networks. Where Xl,X2 are the parent nodes of X4. COMBINATION OF DATA MINING AND BAYESIAN NETWORK The Bayesian network expresses the conditional probability distribution and causality of the object in the event, and the uncertainty knowledge and rules implied by it are the main tools for inaccurate reasoning. Its structure implies the rules, and the conditional probability of each node expresses some kind of knowledge. Bayesian network has more and more implementations in the statistics, decision analysis, artificial intelligence and other fields. In recent years, it has been found that the data mining with the use of Bayesian network can dig out the multi-layer and multi-point causal concept link from the database, which is a common relationship between the objective world. Since data mining has strong statistic characteristic, it can be integrated with the Bayesian networks naturally. Bayesian network for data mining has the following advantages: 914

Fig. 2. Bayesian network structure diagram. (1) The Bayesian network describes the causal links between variables, and the confidence of such connections is expressed in probabilities. Probability allows the sample in Bayesian Network to be incomplete and allows the presence of noise data. (2) Digging out the implication of knowledge. Data mining with the Bayesian network is essentially a network structure from the database or a conditional probability table for finding variables under the premise that the structure is known. Only can people get the knowledge, concepts and decision information by reasoning and interpreting the Bayesian network. (3) Bayesian networks have good comprehensibility and logic. It naturally combines prior knowledge with probabilistic reasoning, which is close to real problems and helps to optimize people's decisions. (4) The Bayesian network combines a priori knowledge and describes the interrelationships between the data in the form of graphs, making it easy to predict and analyze. THE APPLICATION OF BAYESIAN STATISTICS IN BIOMEDICINE Bayesian statistical theory has become more and more mature and has wide applications in many fields such as financial risk analysis, information security, DNA analysis, software intelligence, medical diagnosis, system analysis and control under the background of big data. In the case of biomedical science, Bayesian statistics were introduced into the medical field before the high-dimensional integral problem was solved, and it was used to solve some medical problems with other methods [8-10]. In 1982, Bayesian statistics were used to find the treatment of kidney transplant patients [8]. In 1989, the Bayesian formula was used to adjust the odds ratio [OR] in the epidemiological study [9]. With the computational difficulties in Bayesian analysis, the Bayesian statistics began to be used independently to solve some medical problems [11-14]. The Bayesian hierarchy model was used to identify outliers in hospital data in the fourth session of the Valencia conference papers held in 1994 [11]. In 1996, cross validation was performed on experimental data and observed data, the Meta-analysis was used to explore the problem of model fitting [12]; In 1995 and 1998, Bayesian statistics were used for standard diagnostic trial evaluation [13,14]. Over the past decade, the depth and breadth of Bayesian statistics have been further strengthened, including clinical trials, observational studies, diagnostic and screening tests, and metaanalysis [15], processed data includes complex temporal and spatial data, genetic data, missing data, and survival data [16,17]. 915

APPLICATION OF BAYESIAN THEORY IN PHOSPHORYLATION Phosphorylation Proteomics is an important subject in today's bioinformatics field. With the development of biotechnology, a large number of phosphorylation data accumulated. Therefore, how to dig out from these biological data of the inherent regulation of phosphorylation and then to explain its mechanism is the problem that bioinformatics experts need to solve. The machine learning method is an effective tool for solving such problems which uses the relevant biometrics to build the model of training data to classify or return the biological data of the unknown tag. There are 518 genes encoding protein kinases in the human genome [18]. Therefore, it is a typical multi-class classification problem to predict which protein kinase phosphorylation or phosphorylation site of an amino acid residue corresponds to which kinase. Xue et al. introduced Bayesian decision theory into kinase-specific phosphorylation site prediction studies [19]. The amino acid frequencies of each pair of positive and negative data sets were counted according to the conservativeness of the sequences. The amino acid frequencies of 9 amino acids were calculated by selecting 8 amino acids in the neighbor phosphorylation sites. They calculated the probability distribution for each position and each amino acid in each sample. CONCLUSION This paper summarizes the concepts and methods of data mining, and focuses on the concept and development of Bayesian statistics. It is necessary to elaborate on Bayesian statistics because the idea of the Bayesian network was born entirely in Bayesian statistics. We can only understand the principles, methods and applications of Bayesian statistics in order to study the Bayesian network more effectively and deeply. At the same time, the paper summarizes the advantages of data mining and Bayesian network. Through the analysis, we realized that Bayesian network has great theoretical and practical value in fuzzy theory, control theory, speech recognition and natural language understanding (machine translation). It is expected that this paper will provide guidance and help to the researchers of Bayesian network big data mining. REFERENCES 1. D Eaton, KP Murphy. Exact Bayesian structure learning from uncertain interventions [R]. Ai & Statistics in, 2007, 2: 107-114. 2. G Li, TY Leong. Active learning for causal Bayesian network structure with non-symmetrical entropy [C]. In: Proceedings of 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009, 290-301. 3. J Han, M Kamber. Data mining: concepts and techniques [M]. Morgan Kaufmann Publishers, 2001. 4. RT Bayes. An essay toward solving a problem in the doctrine of chances [J]. Philosophical ransactions of the Royal Society of London, 1763, 53: 370-418. 5. DR Cox, DV Hinkley. Theoretical Statistics [M]. Chapman & Hall: London, 1974. 6. P Congdon. Bayesian statistical modeling [M]. John Wiley & Sons, Ltd, Chichester, 2006. 7. MM Zhu, W Liu, YL Yang. Construction algorithm of MPD-JT for Bayesian networks based on full conditional independence [J]. Systems Engineering and Electronics, 2010, 32(6): 8-11. 8. AFM Smith, GO Roberts. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods [J]. Journal of the Royal Statistical Society, 1993, 55(1): 3-23. 9. WR Gilks, S Richardson, DJ Spiegelhalter. Markov chain Monte Carlo in practice [M]. Chapman & Hall: London, 1996. 10. CN Morris, CL Christiansen. Hierarchical models for ranking and for identifying extremes, with applications [C]. In Bayersian Statistics, Clarendon Press: Oxford, 1996, 277-296. 916

11. W DuMouchel. Predictive cross-validation of Bayesian meta-analyses [C]. In Bayesian Statistics, Clarendon Press: Oxford, 1996, 107-127. 12. AE Raftery, D Madigan, CT Bolinsky. Accounting for model uncertainty in survival analysis improves predictive performance [C]. In Bayesian Statistics, Clarendon Press: Oxford, 1996, 323-349. 13. N Dendukuri, L Joseph. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests [J]. Biometrics, 2015, 57(1): 158-167. 14. MP Georgiadis, WO Johnson, IA Gardner, S Ramanpreet. Correlation-adjusted estimation of sensitivity and specificity of two diagnostic tests [J]. Journal of the Royal Statistical Society, 2010, 52(1): 63-76. 15. P Liu, HT Yang, LY Qiang, H Jin, S Xiao, ZX Shi. Evaluation of 30 commercial assays for the detection of antibodies to HIV in China using classical and Bayesian statistics [J]. Journal of Virological Methods, 2010, 170(1-2): 73-79. 16. D Lunn, D Spiegelhalter, A Thomas, N Best. The BUGS project: Evolution, critique and future directions [J]. Statistics in Medicine, 2009, 28(25): 3073-3074. 17. JM Ren, Y Zhao, F Chen, et al. Hot-deck imputation for missing data on matched data [J]. Chinese Journal of Health Statistics, 2004, 21(5): 303-306. 917