CISC453 Winter 2010 Probabilistic Reasoning Part B: AIMA3e Ch 14.5-14.8

Overview
- a roundup of approaches from AIMA3e 14.5-14.8
- 14.5 a survey of approximate methods
  - alternatives to exact computation of posterior probabilities in Bayesian networks
  - using sampling to estimate the probabilities requested by queries
  - specifics of several sampling approaches
- 14.6 relational & first-order probability approaches
  - expanding the reach of Bayesian network models
  - moving toward first-order logic semantics
- 14.7 alternative uncertain reasoning approaches: rule-based, Dempster-Shafer, fuzzy sets & fuzzy logic
- 14.8 summary

Rationale for non-exact approaches
- the complexity of Bayesian network calculations
  - even the best exact algorithms are, in the worst case, exponential in the number of variables
  - though for simple networks & queries about individual variables, efficient exact calculation is possible
- but alternatives to exact calculation are feasible
  - we have available the priors and conditional probability tables associated with the nodes, supplied when the network was constructed
  - given these, we can get approximate answers to query probabilities through randomized sampling (Monte Carlo) algorithms
  - by repeated sampling we estimate the desired posterior probabilities & can control the accuracy of the approximation by the number of samples generated (see the sketch that follows)
  - of the multiple techniques available, we look at two: direct sampling & Markov chain sampling
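As a warm-up, here is a minimal Python sketch (not from the slides) of the basic Monte Carlo idea: estimate a probability as the fraction of random samples in which the event occurs, with accuracy improving as the sample count grows. The distribution used (a fair six-sided die) is purely an illustrative assumption.

    import random

    def estimate_probability(event, sample, n):
        """Estimate P(event) as the fraction of n random samples satisfying it."""
        hits = sum(1 for _ in range(n) if event(sample()))
        return hits / n

    def roll():
        return random.randint(1, 6)        # a fair die, purely illustrative

    def at_least_five(x):
        return x >= 5                      # event of interest; true probability is 1/3

    for n in (100, 10_000, 1_000_000):     # accuracy improves with sample count
        print(n, estimate_probability(at_least_five, roll, n))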

Probability estimation by sampling
(1) direct sampling
- given (a) a known probability distribution & (b) a source of uniformly distributed random numbers, we can simulate the sampling of actual events by sampling the distribution appropriately, as we do to simulate dice rolls, for example
- we can apply this to a Bayesian net by using the prior and conditional probabilities associated with its nodes
  - we start at the top & sample in the order given by the topology of the network, using the sampled results from parent nodes to select the distribution to sample from at each subsequent node
  - at the end this produces a sampled event: one sampled value for each variable in the net
- see the algorithm on the next slide & examples for the "wet grass" Bayesian net on subsequent slides

Direct Sampling Algorithm
- it generates events from a Bayesian network: one pass over the topology, using at each node the probability row selected by the values already generated for its parent nodes

function PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, ..., Xn)
  x ← an event with n elements
  foreach variable Xi in X1, ..., Xn do
    x[i] ← a random sample from P(Xi | parents(Xi))
  return x
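A minimal Python sketch of PRIOR-SAMPLE on the "wet grass" network follows. The CPT numbers are the usual AIMA values for this example (an assumption here, since the slides only show the figure); each variable is Boolean and is sampled in topological order.

    import random

    # variable -> (parent list, table of P(variable = true) per combination of parent values)
    # CPT values assumed from the standard AIMA "wet grass" example
    NET = {
        "Cloudy":    ([], {(): 0.5}),
        "Sprinkler": (["Cloudy"], {(True,): 0.1, (False,): 0.5}),
        "Rain":      (["Cloudy"], {(True,): 0.8, (False,): 0.2}),
        "WetGrass":  (["Sprinkler", "Rain"],
                      {(True, True): 0.99, (True, False): 0.90,
                       (False, True): 0.90, (False, False): 0.00}),
    }
    ORDER = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]   # a topological order

    def p_true(var, state):
        """Look up P(var = true) given the already-sampled parent values in state."""
        parents, cpt = NET[var]
        return cpt[tuple(state[p] for p in parents)]

    def prior_sample():
        """PRIOR-SAMPLE: sample every variable in topological order."""
        event = {}
        for var in ORDER:
            event[var] = random.random() < p_true(var, event)
        return event

For example, prior_sample() might return {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}, which is the event traced through on the following slides.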

Direct Sampling Algorithm: walkthrough on the "wet grass" network
- the sample network for the "wet grass" problem (figure)
- the following steps illustrate the operation of the algorithm on that network; in the original slides, yellow marks the distribution being sampled, green a sampled value of true, and red a sampled value of false
- the sampled value for Cloudy is true
- next, sample from the corresponding rows of the CPTs of the child nodes
- for the Sprinkler variable, the sampled value is false
- for the Rain variable, the sampled value is true
- sample from the (Sprinkler = false, Rain = true) row of the WetGrass CPT
- for the WetGrass variable, the sampled value is true

Direct Sampling Algorithm
- this process yields sample events that reflect the probabilities in the joint distribution given by the Bayesian net
- for a given total number of sample events, the fraction that matches a specific event is the answer to the query about that event, an estimate of its probability
  - in the large-sample limit this converges to the actual probability of the event, so the process produces consistent estimates
- for partially specified events (on m variables, where m ≤ n), the estimate is just the fraction of all complete events produced by the algorithm that match the partially specified event
- the REJECTION-SAMPLING algorithm computes conditional probabilities, using PRIOR-SAMPLE to generate events
  - it rejects sampled events that are not consistent with the evidence, then counts, among the remaining events, the proportion in which each value of the query variable occurs

Rejection Sampling
- the REJECTION-SAMPLING algorithm (a Python sketch follows)

function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: C, a vector of counts for each value of X, initially zero
                   (the bold N of AIMA3e)
  for j = 1 to N do
    x ← PRIOR-SAMPLE(bn)
    if x is consistent with e then
      C[v] ← C[v] + 1 where v is the value of X in x
  return NORMALIZE(C)
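A minimal Python sketch, reusing NET, ORDER, p_true, and prior_sample from the earlier sketch (and still assuming the AIMA wet-grass CPT values):

    def rejection_sampling(X, e, n=10000):
        """Estimate P(X | e): keep only prior samples consistent with evidence e."""
        counts = {True: 0, False: 0}
        for _ in range(n):
            x = prior_sample()
            if all(x[var] == val for var, val in e.items()):
                counts[x[X]] += 1
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()} if total else None

    # e.g. rejection_sampling("Rain", {"Sprinkler": True}) converges to roughly
    # {True: 0.3, False: 0.7} with the CPT values assumed above

Note how wasteful this is: every sample inconsistent with the evidence contributes nothing to the estimate, which is the inefficiency discussed on the next slide.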

Rejection Sampling
- the rejection sampling approach yields estimates that converge to the true probabilities as the number of samples increases, with the standard deviation of the error proportional to 1/√n, where n is the number of samples used in the estimate
- but as E (the set of evidence variables) grows, the proportion of samples consistent with e decreases exponentially & the process becomes impractical for complex problems
  - it begins to look like counting (rare) real-world events to estimate conditional probabilities
- a partial solution to the inefficiency of rejection sampling is offered by likelihood weighting

Likelihood Weighting
- likelihood weighting improves the efficiency of sampling
  - it avoids generating large numbers of events that are irrelevant to the conditional probability of interest, by producing only events consistent with the evidence values e
  - so it fixes the evidence variables E and samples only the nonevidence variables; then every event generated is consistent with the evidence
- but fixing the evidence variables means it isn't enough just to count events; rather, we must weight each sampled event by the likelihood that it accords with the evidence
  - the weight is the product of the conditional probabilities of each evidence variable, given its parents
  - the result is that events in which the actual evidence appears unlikely are given less weight
- see AIMA3e pp. 533-534 for the weight calculation procedure; a sketch follows below
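A minimal Python sketch of the idea, again reusing NET, ORDER, and p_true from the earlier sketch (the CPT values there are an assumption):

    def weighted_sample(e):
        """Fix evidence variables to e; sample the rest; weight by evidence likelihood."""
        weight, x = 1.0, {}
        for var in ORDER:
            pt = p_true(var, x)              # P(var = true | fixed/sampled parents)
            if var in e:
                x[var] = e[var]
                weight *= pt if e[var] else 1.0 - pt
            else:
                x[var] = random.random() < pt
        return x, weight

    def likelihood_weighting(X, e, n=10000):
        """Estimate P(X | e) from weighted counts of the query variable's values."""
        totals = {True: 0.0, False: 0.0}
        for _ in range(n):
            x, w = weighted_sample(e)
            totals[x[X]] += w
        z = sum(totals.values())
        return {v: t / z for v, t in totals.items()}

Every generated sample is used, but samples whose evidence values would have been unlikely under the sampled nonevidence values receive correspondingly small weights.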

Likelihood Weighting
- issues with likelihood weighting
  - for a non-evidence variable Zi, its sampled values will be influenced by evidence among its ancestor nodes, but Zi will not be influenced by evidence variables that are not its ancestors
  - given the query P(Rain | Cloudy = true, WetGrass = true), the sampled Sprinkler & Rain values will include some samples with both false, even though the (non-ancestor) WetGrass evidence rules out those events

Likelihood Weighting
- as with rejection sampling, the likelihood weighting estimates can be shown to be consistent
- since all generated samples are used, the algorithm can be more efficient
- problems arise as the number of evidence variables grows, since most samples will have very low weights, with only a small fraction agreeing appreciably with the evidence
- it is even worse if evidence variables happen to come late in the variable ordering, so that the non-evidence variables have no evidence among their parents & ancestors to guide the sample generation; then the samples will be mostly unrelated to the evidence in the query

Markov Chain Monte Carlo (MCMC)
- another algorithmic approach to estimating posterior probabilities by sampling
  - note that this class of algorithms includes the WALKSAT and simulated annealing examples seen in earlier chapters
- it operates by generating samples "incrementally", each derived from the previously sampled state by randomly applying changes
- one specific MCMC algorithm is Gibbs sampling:
  1. fix the evidence variables
  2. start the other variables in some arbitrary state
  3. update by sampling a value for some nonevidence variable Xi, conditioned on the current values of the variables in its Markov blanket (its parents + its children + the other parents of its children; see the next slide for an example)
  4. repeat, moving in the space of complete assignments, always keeping the evidence variables fixed & sampling a new value for a nonevidence variable

Markov Chain Monte Carlo (MCMC)
- illustration: a partial Bayes network showing the Markov blanket for X (figure)

The Gibbs Sampling Algorithm
- here's the algorithm (a Python sketch follows)

function GIBBS-ASK(X, e, bn, N) returns an estimate of P(X | e)
  inputs: X, the query variable; e, observed values for variables E
          bn, a Bayesian network; N, the total number of samples to be generated
  local variables: C, a vector of counts for each value of X, initially zero
                   (the bold N of AIMA3e)
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    foreach Zi in Z do
      set the value of Zi in x by sampling from P(Zi | mb(Zi))
      C[v] ← C[v] + 1 where v is the value of X in x
  return NORMALIZE(C)
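A minimal Python sketch of Gibbs sampling for the same wet-grass network, reusing NET, ORDER, and p_true from the earlier sketch (CPT values assumed); the Markov-blanket distribution is obtained by multiplying the variable's own CPT entry by its children's CPT entries and normalizing:

    def p_markov_blanket_true(var, state):
        """P(var = true | Markov blanket of var), normalized over var's two values."""
        children = [c for c, (parents, _) in NET.items() if var in parents]
        score = {}
        for val in (True, False):
            s = dict(state, **{var: val})
            p = p_true(var, s) if val else 1.0 - p_true(var, s)
            for child in children:
                pc = p_true(child, s)
                p *= pc if s[child] else 1.0 - pc
            score[val] = p
        return score[True] / (score[True] + score[False])

    def gibbs_ask(X, e, n=10000):
        """Estimate P(X | e) by counting states visited along the Markov chain."""
        nonevidence = [v for v in ORDER if v not in e]
        state = dict(e, **{v: random.random() < 0.5 for v in nonevidence})
        counts = {True: 0, False: 0}
        for _ in range(n):
            for z in nonevidence:
                state[z] = random.random() < p_markov_blanket_true(z, state)
                counts[state[X]] += 1
        total = sum(counts.values())
        return {v: c / total for v, c in counts.items()}

As in the pseudocode, one count is recorded after every single-variable update; the evidence variables stay fixed throughout.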

Markov Chain Monte Carlo (MCMC)
- properties of Gibbs sampling MCMC
  - each state visited contributes a sample toward the query variable's posterior probability, which is estimated from the proportions of states visited
  - it can be shown that Gibbs sampling returns consistent probability estimates
    - this relies on the sampling process reaching a condition of dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability
    - there is a rather long & detailed proof in AIMA3e, conveniently omitted here

Relational & First-Order Probability
- Bayesian nets are "propositional" in their basics: a fixed, finite set of random variables, each with a fixed domain of values
- we could extend their use to many more problems if we somehow included the first-order abilities to capture relations among objects and to quantify over variables that stand for objects
- the example: an online book seller wishes to capture customer evaluations of books, summarized as a posterior distribution over book quality, given the evidence
  - any simple summary statistic fails to capture variations in the kindness &/or honesty of evaluators
  - examining a Bayesian net representation of the recommendation relationships shows that it becomes impractical once the number of customers & books is non-trivial; see the figure on the next slide

Relational & First-Order Probability
- the book recommendation problem as Bayes nets (figure)
  - (a) shows the net for 1 book & 1 customer
  - (b) shows the corresponding net for 2 books & 2 customers
  - for both, Honest(Ci) is Boolean, & the other variables are assumed to be on a 1-to-5 integer scale
  - the net structure in (b) shows repetition for Recommendation(c, b) & indicates that the CPTs for all Recommendation(c, b) variables will be identical (as will those for Honest(c) and so on)
- our goal: capture this commonality in a first-order-like way

Relational & First-Order Probability
- we return to the idea of possible worlds, for both probability & first-order representations
- if we could assign probabilities to the possible worlds of FOL, that is, to the models that result from an interpretation: a mapping of constants to objects, predicates to relations, & function symbols to functional relations
- then the probability of a FOL sentence φ could be calculated as in Bayes nets, by summing over the possible worlds in which it holds:
  P(φ) = Σ_{ω : φ is true in ω} P(ω)

Relational & First-Order Probability
- relational & first-order models
  - as we've seen before, one problem is that with function symbols, the set of FOL models is infinite
  - a solution is to adopt alternative semantics borrowed from database systems, in particular the unique names assumption & domain closure
    - unique names: distinct constant names denote distinct objects
    - domain closure: the only objects are those that are explicitly named
- (figure: top, FOL possible-worlds semantics; bottom, database semantics)

Relational & First-Order Probability
- these database semantics yield relational probability models (RPMs)
  - RPMs do not make the CWA (closed world assumption), which says that unknown facts are false, since it is just those unknown facts we want to reason about probabilistically
  - remaining points are referenced on later slides
- also note that, unfortunately, RPMs fail when the assumptions don't hold
  - examples from the book recommendation problem: multiple ISBNs for the same "logical" book; multiple customer IDs for a single customer, particularly one who wants to skew a recommendation system (in what is called a Sybil attack)
  - both existence uncertainty (what are the real objects?) & identity uncertainty (which symbols actually refer to the same object?) mean that we'll eventually need to adopt a fuller version of FOL semantics

Relational Probability Models
- RPMs include constant, function & predicate symbols; we treat predicates as functions that return a truth value
- we assume there's a type signature for each function, giving the type of each argument & of the function's value
  - this eliminates spurious possible worlds, provided we know the type of each object
- we define an RPM in terms of the types & type signatures; here are examples for the book-recommendation problem
  - types: Customer, Book
  - signatures: Honest: Customer → {true, false}; Kindness: Customer → {1, 2, 3, 4, 5}; Quality: Book → {1, 2, 3, 4, 5}; Recommendation: Customer × Book → {1, 2, 3, 4, 5}
  - constants: the customer & book names that the retailer records: C1, C2, B1, B2
- the random variables of the RPM are derived from these

Relational Probability Models
- the random variables instantiate each function with each possible combination of objects
- we write out the dependencies that govern the random variables:
  Honest(c) ~ <0.99, 0.01>
  Kindness(c) ~ <0.1, 0.1, 0.2, 0.3, 0.3>
  Quality(b) ~ <0.05, 0.2, 0.4, 0.2, 0.15>
  Recommendation(c, b) ~ RecCPT(Honest(c), Kindness(c), Quality(b))
  where RecCPT is a conditional distribution with 2 × 5 × 5 = 50 rows of 5 entries each
- the semantics of the RPM are given by instantiating these dependencies for all known constants (see part (b) of the earlier figure) to form a Bayesian network that defines a joint distribution over the RPM's random variables (a sketch of this instantiation follows)
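A minimal Python sketch of the instantiation step, just to make the structure concrete; the constant names match the slide, and the code only builds the ground variables and their parent links (it does not attach the numeric CPTs):

    # constants from the example
    customers, books = ["C1", "C2"], ["B1", "B2"]

    parents = {}                       # ground variable -> list of parent variables
    for c in customers:
        parents[f"Honest({c})"] = []
        parents[f"Kindness({c})"] = []
    for b in books:
        parents[f"Quality({b})"] = []
    for c in customers:
        for b in books:
            parents[f"Recommendation({c},{b})"] = [
                f"Honest({c})", f"Kindness({c})", f"Quality({b})"]

    # every Recommendation(c,b) node would share the same conditional table
    # (RecCPT), so the unrolled network grows with |customers| x |books| ground
    # variables while reusing only a handful of distinct CPTs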

Relational Probability Models
- RPMs: context-specific independence
  - a variable is independent of some of its parents, given certain values of others
  - so Recommendation(c, b) is independent of Kindness(c) & Quality(b) when Honest(c) = false
  - we can also capture the idea that a fan of a particular author will always give that author's books a 5, independent of quality
  - these dependencies are expressed in a form that resembles a programming-language conditional statement, but the inference algorithm does not "know" the value of the conditional test
  - rather, the posterior probabilities will reflect a high probability that a customer is a fan of an author when the customer gives 5s only to books by that author and otherwise is not particularly kind

Relational Probability Models
- how do you do inference in an RPM?
  - convert it to an equivalent Bayes net by a process called "unrolling"
  - the methods that unroll an RPM into a Bayesian network are analogous to the propositionalization process for FOL inference
  - just as FOL resolution instantiates logical variables only as needed, lifting the inference process above the level of ground sentences, similar lifted techniques can be applied to the random variables of RPMs when ground random variables differ only in the constant symbols used to generate them

Open Universe Probability Models
- another acronym: OUPMs
- many real-world problems do not allow the unique names and domain closure assumptions and require OUPMs, which adopt standard FOL semantics
- these in turn extend the "generative" property of the system
  - Bayes nets generate possible worlds as assignments of values to variables
  - RPMs generate sets of events through instantiation of logical variables in predicates/functions
  - OUPMs generate possible worlds through the addition of objects
- there are inference algorithms that can derive consistent posterior probabilities for FOL queries, given the revised representation

FOL & Probabilistic Reasoning
- the state of the art?
  - first-order probabilistic reasoning is a relatively new research area & its techniques are not yet well established
  - they may be applicable to many real-world problems that involve uncertain information
  - AIMA3e cites example domains including computer vision, text understanding, & military intelligence analysis, and, even more generally, the interpretation of sensor data

More Uncertain Reasoning
- 14.7 Other Approaches
- though using probability to model uncertainty has a long history in many sciences, AI was slow to adopt probabilistic techniques
  - one reason was probability's numeric character, at a time when many in AI saw intelligence as requiring symbolic & qualitative approaches
- alternative AI approaches have included
  - default reasoning systems: conclusions don't have a degree of belief, but can be superseded when a better reason for some alternative is found; these have had a degree of success
  - rule-based systems: the rules carry some associated numeric uncertainty measure; these were popular in expert systems
  - Dempster-Shafer theory: it uses interval-valued degrees of belief to capture knowledge about the probability of a proposition
- probability & logic share the ontological commitment that propositions are true or false, though an agent may be uncertain about which holds; the fuzzy logic ontology, by contrast, allows vagueness about the degree of truth of a proposition

Rule-Based Uncertain Reasoning
- rule-based systems have some desirable properties, including truth-functionality: the truth of a complex logical sentence depends only on the truth of its components
  - but for probability this holds only when strong independence assumptions are satisfied
- over the history of AI there have been attempts to develop uncertain reasoning schemes that retain the advantages of logical representation
  - but simple examples show that the truth-functional property is not appropriate for general uncertain reasoning
  - it succeeds only with highly restricted tasks & carefully engineered rule bases, and as the rule base expands it is difficult to avoid undesirable interactions among rules
- as a result, Bayesian networks have mostly replaced rule-based approaches to uncertainty

Dempster-Shafer theory
- Dempster-Shafer theory takes another approach: it deals not with the probabilities of propositions but with the probability that the evidence supports propositions
- the measure of belief is a belief function, written Bel(A)
  - Bel(A) and Bel(¬A) need not sum to 1.0, depending on the evidence available; the difference is a "gap"
  - when there is a gap, decision problems can be posed for which a Dempster-Shafer system cannot reach a decision
- the Bayesian model can handle these cases: as shown in the text's biased-coin example, it uses the evidence of the observed coin flips to compute the posterior distribution of a Bias random variable, and this allows beliefs to change as further information is gathered

Fuzzy Sets, Fuzzy Logic
- vagueness in fuzzy sets & fuzzy logic
  - fuzzy set theory provides a method of specifying how well some object matches a vague description
  - this is not uncertainty about the world, but a degree of matching to some linguistically vague term like "tall"
  - Tall becomes a fuzzy predicate that, for a specific object, has a truth value between 0 & 1, and thus defines a set whose membership boundaries are not sharp
- fuzzy logic provides tools for reasoning over expressions involving membership in fuzzy sets
  - fuzzy logic is truth-functional (the truth value of a complex sentence depends only on the truth values of its components)
  - this causes problems because there may be interactions among the components that the formalism does not account for (see the sketch below)
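A minimal Python sketch of the truth-functionality issue; the membership function for "tall" and the min-based conjunction are illustrative assumptions, not taken from the slides:

    def tall(height_cm):
        """An assumed fuzzy membership function for the vague predicate Tall."""
        return max(0.0, min(1.0, (height_cm - 165) / 30))

    def fuzzy_and(a, b):    # a common truth-functional choice for conjunction
        return min(a, b)

    def fuzzy_not(a):
        return 1.0 - a

    t = tall(178)                          # degree of truth, about 0.43
    print(fuzzy_and(t, t))                 # Tall AND Tall -> about 0.43
    print(fuzzy_and(t, fuzzy_not(t)))      # Tall AND NOT Tall -> about 0.43, not 0
    # the last line shows how truth-functional combination ignores the
    # interaction (here, outright contradiction) between the components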

Fuzzy Sets, Fuzzy Logic
- fuzzy control systems
  - these use fuzzy rules to map from real-valued input parameters to output parameters
  - they are used successfully & commercially in many systems, but that may just reflect the concise, intuitive form of the mappings rather than any general applicability to uncertain reasoning
- when accounts of fuzzy logic appeal to probability theory, they may give up the truth-functional property
- such accounts are not yet fully developed or understood as a way of properly representing the relationship between linguistic observations & continuous quantities

Summary
- stochastic sampling techniques (likelihood weighting, Markov chain Monte Carlo) provide consistent estimates of posterior probabilities even for large networks (too large for the exact algorithms)
- we can combine probability theory & FOL representational tools to get systems that reason under uncertainty
  - Relational Probability Models adopt semantics that allow a well-defined probability distribution with an equivalent Bayesian network
  - Open Universe Probability Models relax those semantic restrictions to allow existence & identity uncertainty, & define probability distributions over an infinite space of first-order possible worlds
- among the alternative systems proposed for reasoning under uncertainty, truth-functional systems typically fail to capture interactions among components