MTAT.05.113 Bayesian Networks
Introductory Lecture
Sven Laur, University of Tartu
Motivation

Probability calculus can be viewed as an extension of classical logic. We use many imprecise and heuristic rules in everyday life, and often even experts find it difficult to formalise their knowledge. One therefore often wants to infer comprehensible rules directly from data. Bayesian networks provide one possible solution:
- The network structure reveals causal relations between attributes.
- Each individual table of conditional probabilities is comprehensible.
- The inference can be completely automated.
- Bayesian networks can be used to support decision making.

MTAT.05.113 Bayesian Networks, Introductory Lecture, 11 February, 2008
A simple example

[Diagram: a causal graph of a phishing attack, with nodes Successful phishing, Trojan in a computer, Security breach, Attack against server, Attack is successful, Attack is detected, Money transfer to attacker's account, Money transfer to mule's account, Mule is caught.]

Arrows indicate direct causal relations between events or indicators. For each node v, we have to specify Pr[v | parents of v]. For nodes u without parents, we have to specify prior probabilities Pr[u].
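The recipe above can be sketched in a few lines of code. The network below is a hypothetical three-node chain (Phishing → Breach → Transfer) with made-up probabilities, not the full graph from the slide; it only illustrates how specifying Pr[u] for roots and Pr[v | parents of v] for the rest determines every joint and marginal probability.

```python
# Minimal sketch: a chain Phishing -> Breach -> Transfer.
# All numbers are illustrative, not taken from the lecture.
from itertools import product

pr_phishing = {True: 0.1, False: 0.9}            # prior Pr[u] for the root node
pr_breach = {True:  {True: 0.8,  False: 0.2},    # Pr[Breach | Phishing]
             False: {True: 0.05, False: 0.95}}
pr_transfer = {True:  {True: 0.6,  False: 0.4},  # Pr[Transfer | Breach]
               False: {True: 0.01, False: 0.99}}

def joint(p, b, t):
    # Chain rule over the network: the product of Pr[v | parents of v].
    return pr_phishing[p] * pr_breach[p][b] * pr_transfer[b][t]

# Marginal probability of a money transfer: sum out the other two nodes.
pr_t = sum(joint(p, b, True) for p, b in product([True, False], repeat=2))
```

For larger networks this brute-force summation is replaced by the belief-propagation algorithms covered later in the course.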
Three main tasks

Knowledge representation:
- Encoding of prior beliefs (expert knowledge).
- Parameter estimation from experimental data.
- Structural inference from experimental data.

Inference and belief propagation:
- Fast computation of marginal probabilities.
- Coherence and sensitivity analysis.

Decision theory:
- Optimal and near-optimal strategies.
- Relevance of observations.
- Sensitivity analysis.
Presentation topics

1. Interpretations of Probabilities (1)
2. Construction of Bayesian Networks (1)
3. Inference and Belief Propagation (2)
4. Analysis of Bayesian Networks (1)
5. Parameter Estimation (1)
6. Network Structure Estimation (1)
7. Bayesian Networks as Classifiers (1)
8. Elements of Decision Theory (2)
9. Optimal Strategies and Ways to Find Them (2)
10. Some Properties of Decision Problems (1)
What is a probability?
Five main interpretations

The Stanford Encyclopedia of Philosophy lists five interpretations:
- Classical probability is the ratio between favourable and all possible events.
- Logical probability assigns plausibility over a set of formal statements.
- The frequentist interpretation states that probability is the relative frequency in a finite or infinite trial sequence, i.e., it is a property of the sequence.
- The propensity interpretation states that probability is a property of physical objects, which manifests itself in experiments.
- Subjective probability is a normalised degree of belief that a rational entity assigns to plausible events based on observations.

Kolmogorov's calculus of probabilities is interpretation agnostic. It provides a universal, consistent axiomatisation, and all rules for manipulating probabilities can be derived within this theory.
Three dominant schools of thought

[Diagram: a cycle connecting three notions — Knowledge of the Long Run (Frequentism), Fair Price (Bayesianism) and Probability (Mathematical Statistics).]

The three notions in the graph form a vicious cycle. Depending on the starting point we get different interpretations. Each of them has its own application area and weaknesses.
Approximate time-line

1713  J. Bernoulli: Ars Conjectandi
1764  T. Bayes: Bayes' Theorem
1774  P. Laplace: Bayes' Theorem
1810  P. Laplace: Central Limit Theorem
1834  A. Cournot: Finite frequentism
1900  K. Pearson: χ²-test
1919  R. von Mises: Kollektivs
1921  J. M. Keynes: Logical probability
1931  F. Ramsey: Subjective probability
1933  A. Kolmogorov: Grundbegriffe
1937  B. de Finetti: Coherence principle
1954  L. Savage: Subjective utility
1969  P. Martin-Löf: Random sequences

Kolmogorov's neat axiomatisation of probability as a measure upset the balance, and mathematical statistics quickly became the dominant school. It took decades for the other interpretations to return. The resurrection of von Mises's theory of kollektivs is particularly interesting.
Main points of controversy

Is Bayes' Theorem really a theorem?
- Bayesianists have to make a big effort to prove it.
- In von Mises's and Kolmogorov's axiomatisations it is just a tautology.

Is there any probability left when the coin has landed?
- For a Bayesianist it depends on the observed data.
- Frequentists do not consider individual observations; the measurement is outside the realm of mathematical statistics.

What about the average behaviour of an inference algorithm?
- An orthodox Bayesianist considers only individual events.
- Analysis of average-case behaviour is the core of mathematical statistics.
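The "tautology" claim can be made concrete: when probabilities are read off as relative frequencies from a joint record of outcomes, Bayes' rule Pr[A | B] = Pr[B | A] Pr[A] / Pr[B] is an identity of counts. The sketch below uses a hypothetical simulated sequence of paired outcomes; the specific probabilities are arbitrary.

```python
# Hedged illustration: with probabilities defined as relative frequencies,
# both sides of Bayes' rule reduce to the same ratio of counts.
import random

random.seed(0)
# Hypothetical record of joint outcomes (A, B) over many trials.
trials = [(random.random() < 0.3, random.random() < 0.6)
          for _ in range(100_000)]

n = len(trials)
pa  = sum(a for a, _ in trials) / n        # relative frequency of A
pb  = sum(b for _, b in trials) / n        # relative frequency of B
pab = sum(a and b for a, b in trials) / n  # relative frequency of A and B

lhs = pab / pb               # Pr[A | B] computed directly from counts
rhs = (pab / pa) * pa / pb   # Pr[B | A] * Pr[A] / Pr[B]
# lhs == rhs up to rounding: the identity holds by construction.
```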
Strict Frequentism
Main ideas in one slide

Von Mises: probability is a property of an infinite sequence. A sequence x ∈ {0,1}^∞ is a kollektiv if it satisfies the following conditions:
- The relative frequency has a limiting value P(x).
- For any admissible sub-sequence x', the corresponding relative frequency must converge to the same value P(x).

A sub-sequence is admissible if it is chosen by a method that uses only the values x_1, ..., x_i to decide whether to take x_{i+1} or not. Additionally, there is a construction for creating conditional events so that Bayes' theorem holds. Most results of classical probability theory can be proved in this theory.
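Both conditions can be checked empirically on a finite prefix. The sketch below uses a pseudo-random 0/1 sequence as a stand-in for a kollektiv (a real kollektiv is an infinite, non-constructible object) and one example of an admissible place selection that looks only at values seen so far.

```python
# Empirical sketch of von Mises's two conditions on a finite stand-in sequence.
import random

random.seed(1)
x = [random.random() < 0.5 for _ in range(200_000)]

# Condition 1: the relative frequency settles toward a limiting value P(x).
freq = sum(x) / len(x)                        # close to 1/2

# Condition 2: an admissible place selection -- here "take x_{i+1} whenever
# x_i = 1", which uses only already-seen values -- must give the same limit.
sel = [x[i] for i in range(1, len(x)) if x[i - 1]]
sel_freq = sum(sel) / len(sel)                # also close to 1/2
```

Conversely, a selection rule that peeked ahead (e.g. "take x_i whenever x_i = 1") would trivially break condition 2, which is why admissibility is restricted to prefix-based methods.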
Main reasons why the theory was rejected

A kollektiv is a non-constructible object:
- There are no kollektivs without restrictions on admissible selections.
- It is impossible to derive the law of the iterated logarithm.

Even if we restrict the set of admissible selections, there are kollektivs with weird properties:
- There are kollektivs for which the relative frequency approaches the limit from above, so on every finite prefix a gambler wins more than he loses.
- There are kollektivs such that a gambler can win an infinite amount of money if he or she varies the bet prices.
Bayesianism
Main ideas in one slide

Objective Bayesianism:
- Internal consistency.
- Quantitative correspondence with common sense.
- Inference results should be acceptable for most of us.
- Tries to minimise the amount of personal prior information.

Subjective Bayesianism:
- Probabilities are formalised as prices of bets.
- Dutch Book argument: a rational person does not give money away.
- Betting prices are continuous, i.e., Pr[A_n] → Pr[lim_{n→∞} A_n] as n → ∞.
- The outcome of the inference procedure is inherently individual.
Dutch Book argument

Let p(A) denote the price a rational entity is willing to pay for a lottery ticket that pays 1 if the event A happens.

First note that p(A) + p(¬A) = 1, for otherwise our entity is willing to buy or sell the ticket pair A, ¬A for a price slightly above or below 1. Analogously, the prices of mutually exclusive events A and B must satisfy p(A) + p(B) = p(A ∪ B), or we can trick the entity into buying or selling the tickets A, B and A ∪ B at prices that guarantee it a sure loss, which is again irrational.
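A small numeric sketch of the first step, with made-up incoherent prices: if the agent prices A and ¬A so that the prices sum to more than 1, an opponent sells it both tickets and keeps a sure profit, since exactly one ticket pays out.

```python
# Dutch Book sketch with hypothetical incoherent prices: p(A) + p(not A) > 1.
p_a, p_not_a = 0.7, 0.4          # sums to 1.1, violating p(A) + p(not A) = 1

# The opponent sells the agent both tickets, collecting the prices now
# and paying out exactly 1 later, whichever of A, not-A occurs.
profit_if_a     = (p_a + p_not_a) - 1
profit_if_not_a = (p_a + p_not_a) - 1
# Either way the opponent keeps 0.1: the agent gives money away for sure.
```

If the prices summed to less than 1, the opponent would instead buy both tickets from the agent, so coherence forces equality exactly.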
Mathematical Statistics
Main ideas in one slide

We are interested in the average-case behaviour of inference algorithms:
- What is the probability, over the data, that the true value lies in the interval?
- Does the expected value of an estimate coincide with the true value?
- What is the maximal false negative ratio for a fixed false positive ratio?
- What is the probability of getting a sample from the distribution H_0?

Since the measurement is outside the scope of the theory, we have to connect average-case guarantees with real-world measurements.
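The second question can be made concrete with a simulation. The sketch below (all parameters hypothetical) draws many data sets from a distribution with a known true mean and checks that the sample-mean estimator is correct on average, even though any single estimate misses the true value.

```python
# Average-case sketch: is the sample mean an unbiased estimator?
import random

random.seed(2)
true_mean, n, runs = 5.0, 30, 2_000   # illustrative parameters

estimates = []
for _ in range(runs):
    sample = [random.gauss(true_mean, 2.0) for _ in range(n)]
    estimates.append(sum(sample) / n)          # the estimate on one data set

avg_estimate = sum(estimates) / runs           # average over many data sets
# avg_estimate is close to true_mean: the estimator is correct on average,
# which is a statement about the data distribution, not about one sample.
```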
Cournot principle

Events with a sufficiently small probability do not happen.

P-values: if the probability of getting a sample x from a distribution H_0 is below 10^-6, then the sample is not from the distribution H_0.

Confidence intervals: if an inference method returns an interval [a, b] that contains the true value with probability 95% over the assumed data distribution, then the interval returned by the algorithm contains the true value.

More precisely, the algorithm works on typical data samples, and our losses are tolerable in the long run.
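The confidence-interval reading can be checked by simulation. The sketch below (parameters hypothetical, known-variance case) repeats the experiment many times and verifies that the standard 95% interval for a Gaussian mean covers the true value at roughly the advertised rate; the Cournot principle then licenses treating the one interval we actually computed as containing the true value.

```python
# Coverage sketch: a 95% interval misses the true value in about 5% of runs.
import math
import random

random.seed(3)
true_mean, sigma, n, runs = 0.0, 1.0, 50, 4_000   # illustrative parameters

hits = 0
for _ in range(runs):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    m = sum(sample) / n
    half = 1.96 * sigma / math.sqrt(n)   # known-variance 95% half-width
    hits += (m - half) <= true_mean <= (m + half)

coverage = hits / runs                   # close to 0.95
```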