
class: middle, center, title-slide

Introduction to Artificial Intelligence

Lecture 5: Probabilistic reasoning



Prof. Gilles Louppe
[email protected]


Today

  • Bayesian networks
    • Semantics
    • Construction
    • Independence relations
  • Inference
  • Parameter learning

.center.width-65[]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

Representing uncertain knowledge


class: middle

The explicit representation of the joint probability distribution grows exponentially with the number of variables.

Independence and conditional independence assumptions reduce the number of probabilities that need to be specified. They can be represented explicitly in the form of a Bayesian network.


Bayesian networks

.grid[ .kol-3-4[

A Bayesian network is a .bold[directed acyclic graph] where

  • each node corresponds to a random variable;
    • observed or unobserved
    • discrete or continuous
  • each edge is directed and indicates a direct probabilistic dependency between two variables;
  • each node $X_i$ is annotated with a conditional probability distribution $${\bf P}(X_i | \text{parents}(X_i))$$ that defines the distribution of $X_i$ given its parents in the network.

] .kol-1-4.width-100[] ]

???

In the simplest case, conditional distributions are represented as conditional probability tables (CPTs).


class: middle

.center.width-40[]

Example 1

  • Variables: $\text{Burglar}$, $\text{Earthquake}$, $\text{Alarm}$, $\text{JohnCalls}$, $\text{MaryCalls}$.
  • The network topology can be defined from domain knowledge:
    • A burglar can set the alarm off
    • An earthquake can set the alarm off
    • The alarm can cause Mary to call
    • The alarm can cause John to call

.footnote[Image credits: CS188, UC Berkeley.]

???

I am at work, neighbor John calls to say my alarm is ringing, but neighbor Mary does not call. Sometimes it's set off by minor earthquakes. Is there a burglar?


class: middle

.center.width-90[]

???

Blackboard: example of calculation, as in the next slide.


Semantics

A Bayesian network implicitly encodes the full joint distribution as a product of local distributions, that is

$$P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | \text{parents}(X_i)).$$

Proof:

  • By the chain rule, $P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | x_1, ..., x_{i-1})$.
  • Provided that we assume conditional independence of $X_i$ with its predecessors in the ordering given the parents, and provided $\text{parents}(X_i) \subseteq \{ X_1, ..., X_{i-1}\}$, we have $$P(x_i | x_1, ..., x_{i-1}) = P(x_i | \text{parents}(X_i)).$$
  • Therefore, $P(x_1, ..., x_n) = \prod_{i=1}^n P(x_i | \text{parents}(X_i))$.

class: middle

Example 1 (continued)

$$ \begin{aligned} P(j, m, a, \lnot b, \lnot e) &= P(j|a) P(m|a)P(a|\lnot b,\lnot e)P(\lnot b)P(\lnot e)\\\ &= 0.9 \times 0.7 \times 0.001 \times 0.999 \times 0.998 \\\ &\approx 0.00063 \end{aligned} $$
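A quick numerical check of this product, as a minimal Python sketch that hard-codes the five CPT entries appearing in the factorization above:

```python
import math

# One joint entry is the product of one CPT entry per variable,
# here for the assignment (j, m, a, not b, not e) of the alarm network.
cpt_entries = {
    "P(j | a)":      0.90,
    "P(m | a)":      0.70,
    "P(a | ~b, ~e)": 0.001,
    "P(~b)":         0.999,
    "P(~e)":         0.998,
}
print(math.prod(cpt_entries.values()))  # ~0.000628
```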


class: middle

Example 2

.grid[ .kol-1-2[.width-90[]] .kol-1-2[.width-100[]] ]

The dentist's scenario can be modeled as a Bayesian network with four variables, as shown on the right.

By construction, the topology of the network encodes conditional independence assertions. Each variable is independent of its non-descendants given its parents:

  • $\text{Weather}$ is independent of the other variables.
  • $\text{Toothache}$ and $\text{Catch}$ are conditionally independent given $\text{Cavity}$.

.footnote[Image credits: CS188, UC Berkeley.]

???

A dentist is examining a patient's teeth. The patient has a cavity, but the dentist does not know this. However, the patient has a toothache, which the dentist observes.


class: middle

.grid.center[ .kol-1-3[.width-70[]] .kol-2-3[.width-80[]

] ]

Example 3

Edges may correspond to causal relations.

.grid.center[ .kol-1-5[.width-60[]] .kol-2-5[ ${\bf P}(R)$

| $R$ | $P$ |
| --- | --- |
| $\text{r}$ | $0.25$ |
| $\lnot\text{r}$ | $0.75$ |
] .kol-2-5[ ${\bf P}(T|R)$

| $R$ | $T$ | $P$ |
| --- | --- | --- |
| $\text{r}$ | $\text{t}$ | $0.75$ |
| $\text{r}$ | $\lnot\text{t}$ | $0.25$ |
| $\lnot\text{r}$ | $\text{t}$ | $0.5$ |
| $\lnot\text{r}$ | $\lnot\text{t}$ | $0.5$ |
]
]

.footnote[Image credits: CS188, UC Berkeley.]

???

Causal model


class: middle

.center.width-60[]

Example 3 (bis)

... but edges need not be causal!

.grid.center[ .kol-1-5[.width-60[]] .kol-2-5[ ${\bf P}(T)$

| $T$ | $P$ |
| --- | --- |
| $\text{t}$ | $9/16$ |
| $\lnot\text{t}$ | $7/16$ |
] .kol-2-5[ ${\bf P}(R|T)$

| $T$ | $R$ | $P$ |
| --- | --- | --- |
| $\text{t}$ | $\text{r}$ | $1/3$ |
| $\text{t}$ | $\lnot\text{r}$ | $2/3$ |
| $\lnot\text{t}$ | $\text{r}$ | $1/7$ |
| $\lnot\text{t}$ | $\lnot\text{r}$ | $6/7$ |
]
]

.footnote[Image credits: CS188, UC Berkeley.]

???

Diagnostic model


Construction

Bayesian networks can be constructed with their variables added in any order, provided that the resulting conditional independence assertions are respected.

Algorithm

  1. Choose some ordering of the variables $X_1, ..., X_n$.
  2. For $i=1$ to $n$:
    1. Add $X_i$ to the network.
    2. Select a minimal set of parents from $X_1, ..., X_{i-1}$ such that $P(x_i | x_1, ..., x_{i-1}) = P(x_i | \text{parents}(X_i))$.
    3. For each parent, insert a link from the parent to $X_i$.
    4. Write down the CPT.

class: middle

.center.width-100[ ]

.question[Do these networks represent the same distribution? Are they as compact?]

???

For the left network:

  • P(J|M) = P(J)? No
  • P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No
  • P(B|A, J, M) = P(B|A)? Yes
  • P(B|A, J, M) = P(B)? No
  • P(E|B, A, J, M) = P(E|A)? No
  • P(E|B, A, J, M) = P(E|A, B)? Yes

Independence relations

Since the topology of a Bayesian network encodes conditional independence assertions, it can be used to answer questions about the independence of variables given some evidence.




.center.width-45[]

.center[Example: Are $X$ and $Z$ necessarily independent?]


class: middle

Cascades

.grid[ .kol-1-2[ Is $X$ independent of $Z$? No.

Counter-example:

  • Low pressure causes rain causes traffic, high pressure causes no rain causes no traffic.
  • In numbers:
    • $P(y|x)=1$,
    • $P(z|y)=1$,
    • $P(\lnot y|\lnot x)=1$,
    • $P(\lnot z|\lnot y)=1$ ] .kol-1-2.center[.width-100[]

$X$: low pressure, $Y$: rain, $Z$: traffic.

$P(x,y,z)=P(x)P(y|x)P(z|y)$] ]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

.grid[ .kol-1-2[ Is $X$ independent of $Z$, given $Y$? Yes.

$$\begin{aligned} P(z|x,y) &= \frac{P(x,y,z)}{P(x,y)} \\\ &= \frac{P(x)P(y|x)P(z|y)}{P(x)P(y|x)} \\\ &= P(z|y) \end{aligned}$$

We say that the evidence along the cascade blocks the influence.

] .kol-1-2.center[.width-100[]

$X$: low pressure, $Y$: rain, $Z$: traffic.

$P(x,y,z)=P(x)P(y|x)P(z|y)$] ]

.footnote[Image credits: CS188, UC Berkeley.]
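To make the cascade argument concrete, here is a small numerical sanity check (a sketch only; the CPT values below are made up, and any non-degenerate choice works): it builds the joint $P(x,y,z)=P(x)P(y|x)P(z|y)$ and confirms that $P(z|x) \neq P(z)$ while $P(z|x,y) = P(z|y)$.

```python
# Made-up CPT values for the cascade X -> Y -> Z (assumption: any
# non-degenerate numbers illustrate the same point).
P_x = {1: 0.4, 0: 0.6}
P_y_given_x = {1: 0.9, 0: 0.2}   # P(y = 1 | x)
P_z_given_y = {1: 0.7, 0: 0.1}   # P(z = 1 | y)

joint = {(x, y, z): P_x[x]
                    * (P_y_given_x[x] if y else 1 - P_y_given_x[x])
                    * (P_z_given_y[y] if z else 1 - P_z_given_y[y])
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def P(event):
    """Probability of the event described by a predicate on (x, y, z)."""
    return sum(p for (x, y, z), p in joint.items() if event(x, y, z))

# Marginally, X influences Z: P(z | x) differs from P(z)...
print(P(lambda x, y, z: z and x) / P(lambda x, y, z: x))  # 0.64
print(P(lambda x, y, z: z))                               # 0.388
# ...but given Y, the influence is blocked: P(z | x, y) = P(z | y).
print(P(lambda x, y, z: z and x and y) / P(lambda x, y, z: x and y))  # 0.7
print(P_z_given_y[1])                                                  # 0.7
```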


class: middle

.grid[ .kol-1-2[

Common parent

Is $X$ independent of $Z$? No.

Counter-example:

  • Project due causes both forums busy and lab full.
  • In numbers:
    • $P(x|y)=1$,
    • $P(\lnot x|\lnot y)=1$,
    • $P(z|y)=1$,
    • $P(\lnot z|\lnot y)=1$ ] .kol-1-2.center[.width-80[]

$X$: forum busy, $Y$: project due, $Z$: lab full.

$P(x,y,z)=P(y)P(x|y)P(z|y)$] ]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

.grid[ .kol-1-2[ Is $X$ independent of $Z$, given $Y$? Yes

$$\begin{aligned} P(z|x,y) &= \frac{P(x,y,z)}{P(x,y)} \\\ &= \frac{P(y)P(x|y)P(z|y)}{P(y)P(x|y)} \\\ &= P(z|y) \end{aligned}$$

Observing the parent blocks the influence between the children. ] .kol-1-2.center[.width-80[]

$X$: forum busy, $Y$: project due, $Z$: lab full.

$P(x,y,z)=P(y)P(x|y)P(z|y)$] ]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

.grid[ .kol-1-2[

v-structures

Are $X$ and $Y$ independent? Yes.

  • The ballgame and the rain cause traffic, but they are not correlated.
  • (Prove it!)

Are $X$ and $Y$ independent given $Z$? No!

  • Seeing traffic puts the rain and the ballgame in competition as explanations.
  • This is backwards from the previous cases. Observing a child node activates influence between parents. ] .kol-1-2.center[.width-80[]

$X$: rain, $Y$: ballgame, $Z$: traffic.

$P(x,y,z)=P(x)P(y)P(z|x,y)$] ]

.footnote[Image credits: CS188, UC Berkeley.]
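A short numerical check of both claims (a sketch with made-up numbers; the collider CPT $P(z|x,y)$ below is an arbitrary assumption):

```python
# Made-up parameters for the v-structure X -> Z <- Y.
P_x, P_y = 0.3, 0.5
P_z_given = {(1, 1): 0.95, (1, 0): 0.8, (0, 1): 0.7, (0, 0): 0.05}  # P(z = 1 | x, y)

joint = {(x, y, z): (P_x if x else 1 - P_x)
                    * (P_y if y else 1 - P_y)
                    * (P_z_given[(x, y)] if z else 1 - P_z_given[(x, y)])
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def P(event):
    """Probability of the event described by a predicate on (x, y, z)."""
    return sum(p for (x, y, z), p in joint.items() if event(x, y, z))

# Marginally independent: P(x, y) = P(x) P(y).
print(P(lambda x, y, z: x and y), P_x * P_y)                              # 0.15  0.15
# Dependent once Z is observed: P(x, y | z) != P(x | z) P(y | z).
pz = P(lambda x, y, z: z)
print(P(lambda x, y, z: x and y and z) / pz)                              # ~0.27
print(P(lambda x, y, z: x and z) / pz * P(lambda x, y, z: y and z) / pz)  # ~0.37
```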

???

Proof:

$$P(x,y,z) = P(x)P(y)P(z|x,y)$$

and

$$P(x,y,z) = P(x,y)P(z|x,y)$$

therefore

$$P(x,y) = P(x)P(y)$$


class: middle

d-separation

Let us assume a complete Bayesian network. Are $X_i$ and $X_j$ conditionally independent given evidence $Z_1=z_1, ..., Z_m=z_m$?

Consider all (undirected) paths from $X_i$ to $X_j$:

  • If one or more paths are active, then independence is not guaranteed.
  • Otherwise (i.e., if all paths are inactive), then independence is guaranteed.

class: middle

.grid[ .kol-2-3[

A path is active if each triple along the path is active:

  • Cascade $A \to B \to C$ where $B$ is unobserved (either direction).
  • Common parent $A \leftarrow B \rightarrow C$ where $B$ is unobserved.
  • v-structure $A \rightarrow B \leftarrow C$ where $B$ or one of its descendants is observed.

] .kol-1-3.width-100[] ]

.footnote[Image credits: CS188, UC Berkeley.]
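These rules translate almost directly into code. Below is a small, unoptimized sketch of the d-separation test (an illustration, not an efficient algorithm): it enumerates the undirected paths between two nodes and checks every triple against the three rules above, using the alarm network as the example graph.

```python
def d_separated(x, y, observed, edges):
    """True iff every undirected path from x to y is blocked given `observed`."""
    children, parents = {}, {}
    for a, b in edges:                       # directed edge a -> b
        children.setdefault(a, set()).add(b)
        parents.setdefault(b, set()).add(a)

    def descendants(node):                   # all descendants of a node
        out, stack = set(), [node]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def paths(cur, visited):                 # all simple undirected paths cur ~> y
        if cur == y:
            yield [cur]
            return
        for nxt in children.get(cur, set()) | parents.get(cur, set()):
            if nxt not in visited:
                for rest in paths(nxt, visited | {nxt}):
                    yield [cur] + rest

    def active(a, b, c):                     # is the triple a - b - c active?
        if b in children.get(a, set()) and c in children.get(b, set()):
            return b not in observed         # cascade a -> b -> c
        if b in children.get(c, set()) and a in children.get(b, set()):
            return b not in observed         # cascade a <- b <- c
        if a in children.get(b, set()) and c in children.get(b, set()):
            return b not in observed         # common parent a <- b -> c
        return b in observed or bool(descendants(b) & observed)  # v-structure

    return not any(all(active(p[i], p[i + 1], p[i + 2]) for i in range(len(p) - 2))
                   for p in paths(x, {x}))

edges = [("B", "A"), ("E", "A"), ("A", "J"), ("A", "M")]
print(d_separated("J", "M", {"A"}, edges))   # True: common parent A is observed
print(d_separated("B", "E", set(), edges))   # True: v-structure at A is inactive
print(d_separated("B", "E", {"J"}, edges))   # False: a descendant of A is observed
```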


class: middle

.grid[ .kol-1-2[

Example

  • $L \perp T' | T$?
  • $L \perp B$?
  • $L \perp B|T$?
  • $L \perp B|T'$?
  • $L \perp B|T, R$?

] .kol-1-2.width-80.center[] ]

???

  • Yes
  • Yes
  • (maybe)
  • (maybe)
  • Yes

exclude: true
class: middle

Local semantics

.center.width-60[]

A node $X$ is conditionally independent of its non-descendants (the $Z_{ij}$) given its parents (the $U_i$).


exclude: true
class: middle

Global semantics

.center.width-60[]

A node $X$ is conditionally independent of all other nodes in the network given its Markov blanket.


class: middle

Inference


class: middle

Inference is concerned with the problem of .bold[computing a marginal and/or a conditional probability distribution] from a joint probability distribution:

.grid[ .kol-1-3.center[Simple queries:] .kol-2-3[${\bf P}(X_i|e)$] ] .grid[ .kol-1-3.center[Conjunctive queries:] .kol-2-3[${\bf P}(X_i,X_j|e)={\bf P}(X_i|e){\bf P}(X_j|X_i,e)$] ] .grid[ .kol-1-3.center[Most likely explanation:] .kol-2-3[$\arg \max_q P(q|e)$] ] .grid[ .kol-1-3.center[Optimal decisions:] .kol-2-3[$\arg \max\_a \mathbb{E}_{p(s'|s,a)} \left[ V(s') \right]$] ]

.center.width-30[]

.footnote[Image credits: CS188, UC Berkeley.]

???

Explain what $\arg \max$ means.

Insist on the importance of inference. Inference <=> reasoning.


Inference by enumeration

Start from the joint distribution ${\bf P}(Q, E_1, ..., E_k, H_1, ..., H_r)$.

  1. Select the entries consistent with the evidence $E_1, ..., E_k = e_1, ..., e_k$.
  2. Marginalize out the hidden variables to obtain the joint of the query and the evidence variables: $${\bf P}(Q,e_1,...,e_k) = \sum_{h_1, ..., h_r} {\bf P}(Q, h_1, ..., h_r, e_1, ..., e_k).$$
  3. Normalize:

$$\begin{aligned} Z &= \sum_q P(q,e_1,...,e_k) \\\\ {\bf P}(Q|e_1, ..., e_k) &= \frac{1}{Z} {\bf P}(Q,e_1,...,e_k) \end{aligned}$$

class: middle

.width-25.center[]

Consider the alarm network and the query ${\bf P}(B|j,m)$. We have $$\begin{aligned} {\bf P}(B|j,m) &= \frac{1}{Z} \sum_e \sum_a {\bf P}(B,j,m,e,a) \\ &\propto \sum_e \sum_a {\bf P}(B,j,m,e,a). \end{aligned}$$ Using the Bayesian network, the full joint entries can be rewritten as the product of CPT entries $$\begin{aligned} {\bf P}(B|j,m) &\propto \sum_e \sum_a {\bf P}(B)P(e){\bf P}(a|B,e)P(j|a)P(m|a). \end{aligned}$$

???

&\propto P(B) \sum_e P(e) \sum_a P(a|B,e)P(j|a)P(m|a)


class: middle

.center.width-80[]

Inference by enumeration is slow because the whole joint distribution is joined up before summing out the hidden variables.

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

Factors that do not depend on the variable being summed over can be moved out of the corresponding sum. Marginalization therefore does not have to be done at the very end, which saves computation.

For the alarm network, we have $$\begin{aligned} {\bf P}(B|j,m) &\propto \sum_e \sum_a {\bf P}(B)P(e){\bf P}(a|B,e)P(j|a)P(m|a) \\ &= {\bf P}(B) \sum_e P(e) \sum_a {\bf P}(a|B,e)P(j|a)P(m|a). \end{aligned}$$
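As a sanity check, here is a small enumeration sketch for this query. The CPT values are the usual textbook values for the alarm network (an assumption here, since the figure with the tables is not reproduced in the text).

```python
# CPTs of the alarm network (standard textbook values).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a = true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(j = true | A)
P_M = {True: 0.70, False: 0.01}                       # P(m = true | A)

def unnormalized(b):
    """sum_e sum_a P(b) P(e) P(a | b, e) P(j | a) P(m | a), with j, m observed."""
    total = 0.0
    for e in (True, False):
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_a * P_J[a] * P_M[a]
    return total

scores = {b: unnormalized(b) for b in (True, False)}
Z = sum(scores.values())
print({b: s / Z for b, s in scores.items()})   # ~{True: 0.284, False: 0.716}
```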


class: middle

.center.width-100[]

Same complexity as DFS: $O(n)$ in space, $O(d^n)$ in time.

???

  • $n$ is the number of variables.
  • $d$ is the size of their domain.

class: middle

Evaluation tree for $P(b|j,m)$

.center.width-80[]

Despite the factoring, inference by enumeration is still inefficient. There are repeated computations!

  • e.g., $P(j|a)P(m|a)$ is computed twice, once for $e$ and once for $\lnot e$.
  • These can be avoided by storing intermediate results.

???

Inefficient because the product is evaluated left-to-right, in a DFS manner.


Inference by variable elimination

The .bold[Variable Elimination] algorithm carries out summations right-to-left and stores intermediate factors to avoid recomputations. The algorithm interleaves:

  • Joining sub-tables
  • Eliminating hidden variables

.center.width-80[![](figures/lec5/elimination.png)]

.footnote[Image credits: CS188, UC Berkeley.]


class: middle

Variable Elimination

Query: ${\bf P}(Q|e_1, ..., e_k)$.

  1. Start with the initial factors (the local CPTs, instantiated by the evidence).
  2. While there are still hidden variables:
    1. Pick a hidden variable $H$
    2. Join all factors mentioning $H$
    3. Eliminate $H$
  3. Join all remaining factors
  4. Normalize

class: middle

Factors

  • Each factor $\mathbf{f}_i$ is a multi-dimensional array indexed by the values of its argument variables. E.g.: .grid[ .kol-1-2[ $$ \begin{aligned} \mathbf{f}_4 &= \mathbf{f}_4(A) = \left(\begin{matrix} P(j|a) \\ P(j|\lnot a) \end{matrix}\right) = \left(\begin{matrix} 0.90 \\ 0.05 \end{matrix}\right) \\ \mathbf{f}_4(a) &= 0.90 \\ \mathbf{f}_4(\lnot a) &= 0.05 \end{aligned}$$ ] ]
  • Factors are initialized with the CPTs annotating the nodes of the Bayesian network, conditioned on the evidence.

class: middle

Join

The pointwise product $\times$, or join, of two factors $\mathbf{f}_1$ and $\mathbf{f}_2$ yields a new factor $\mathbf{f}_3$.

  • Exactly like a database join!
  • The variables of $\mathbf{f}_3$ are the union of the variables in $\mathbf{f}_1$ and $\mathbf{f}_2$.
  • The elements of $\mathbf{f}_3$ are given by the product of the corresponding elements in $\mathbf{f}_1$ and $\mathbf{f}_2$.

.center.width-100[]
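For concreteness, a toy join in Python (a sketch; the factor values below are made up), assuming Boolean factors $\mathbf{f}_1(A,B)$ and $\mathbf{f}_2(B,C)$ stored as numpy arrays:

```python
import numpy as np

f1 = np.array([[0.3, 0.7],    # f1(A, B): rows a / not a, columns b / not b
               [0.9, 0.1]])
f2 = np.array([[0.2, 0.8],    # f2(B, C): rows b / not b, columns c / not c
               [0.6, 0.4]])

# The join aligns the shared variable B and multiplies pointwise;
# the result f3(A, B, C) ranges over the union of the variables.
f3 = f1[:, :, None] * f2[None, :, :]
print(f3.shape)      # (2, 2, 2)
print(f3[0, 1, 0])   # f3(a, not b, c) = f1(a, not b) * f2(not b, c) = 0.7 * 0.6
```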


class: middle

Elimination

Summing out, or eliminating, a variable from a factor is done by adding up the sub-arrays formed by fixing the variable to each of its values in turn.

For example, to sum out $A$ from $\mathbf{f}_3(A, B, C)$, we write:

$$\begin{aligned} \mathbf{f}(B,C) &= \sum_a \mathbf{f}_3(a, B, C) = \mathbf{f}_3(a, B, C) + \mathbf{f}_3(\lnot a, B, C) \\\ &= \left(\begin{matrix} 0.06 & 0.24 \\\ 0.42 & 0.28 \end{matrix}\right) + \left(\begin{matrix} 0.18 & 0.72 \\\ 0.06 & 0.04 \end{matrix}\right) = \left(\begin{matrix} 0.24 & 0.96 \\\ 0.48 & 0.32 \end{matrix}\right) \end{aligned}$$
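With factors stored as numpy arrays, elimination is just a sum over the corresponding axis; the short sketch below reproduces the numbers above.

```python
import numpy as np

f3 = np.array([[[0.06, 0.24],    # f3(a, B, C)
                [0.42, 0.28]],
               [[0.18, 0.72],    # f3(not a, B, C)
                [0.06, 0.04]]])

f_BC = f3.sum(axis=0)            # sum out A (axis 0)
print(f_BC)                      # [[0.24 0.96]
                                 #  [0.48 0.32]]
```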


class: middle

.center.width-35[]


.question[Run the variable elimination algorithm for the query ${\bf P}(B|j,m)$.]


class: middle

Relevance

Consider the query ${\bf P}(J|b)$: $${\bf P}(J|b) \propto P(b) \sum_e P(e) \sum_a P(a|b,e) {\bf P}(J|a) \sum_m P(m|a)$$

  • $\sum_m P(m|a) = 1$, therefore $M$ is irrelevant for the query.
  • In other words, ${\bf P}(J|b)$ remains unchanged if we remove $M$ from the network.

.italic[Theorem.] $H$ is irrelevant for ${\bf P}(Q|e)$ unless $H \in \text{ancestors}(\{Q\} \cup E)$.


class: middle

Complexity

.center.width-50[]

Consider the query ${\bf P}(X_n|y_1,...,y_n)$.

Work through the two elimination orderings:

  • $Z, X_1, ..., X_{n-1}$
  • $X_1, ..., X_{n-1}, Z$

What is the size of the maximum factor generated for each of the orderings?

  • Answer: $2^{n+1}$ vs. $2^2$ (assuming boolean values)

class: middle

The computational and space complexity of variable elimination is determined by the largest factor.

  • The elimination ordering can greatly affect the size of the largest factor.
  • The optimal ordering is NP-hard to find. There is no known polynomial-time algorithm to find it.

Approximate inference

Exact inference is intractable for most probabilistic models of practical interest (e.g., models involving many variables, both continuous and discrete, or whose graphs contain undirected cycles).

We must resort to approximate inference algorithms:

  • Sampling methods: produce answers by repeatedly generating random numbers from a distribution of interest.
  • Variational methods: formulate inference as an optimization problem.
  • Belief propagation methods: formulate inference as a message-passing algorithm.
  • Machine learning methods: learn an approximation of the target distribution from training examples.

class: middle

Parameter learning


class: middle

When modeling a domain, we can choose a probabilistic model specified as a Bayesian network. However, specifying the individual probability values is often difficult.

A workaround is to use a parameterized family ${\bf P}(X | \theta)$ (sometimes also written ${\bf P}_\theta(X)$) of models, and to estimate the parameters $\theta$ from data.

???

Connect back to the Kolmogorov axioms: we have upgraded $P$ to a family of distributions $P_\theta$.


class: middle

.center.width-100[]


Maximum likelihood estimation

Suppose we have a set of $N$ i.i.d. observations $\mathbf{d} = \{x_1, ..., x_N\}$.

The likelihood of the parameters $\theta$ is the probability of the data given the parameters $$P(\mathbf{d}|\theta) = \prod_{j=1}^N P(x_j | \theta).$$

The maximum likelihood estimate (MLE) $\theta^*$ of the parameters is the value of $\theta$ that maximizes the likelihood $$\theta^* = \arg \max_\theta P(\mathbf{d}|\theta).$$


class: middle

In practice,

  1. Write down the log-likelihood $L(\theta) = \log P({\bf d}|\theta)$ of the parameters $\theta$.
  2. Write down the derivative $\frac{\partial L}{\partial \theta}$ of the log-likelihood of the parameters $\theta$.
  3. Find the parameter values $\theta^*$ such that the derivatives are zero (and check whether the Hessian is negative definite).

???

Note that:

  • evaluating the likelihood may require summing over hidden variables, i.e., inference.
  • finding $\theta^*$ may be hard; modern optimization techniques help.

class: middle

Case (a)

What is the fraction $\theta$ of cherry candies?

Suppose we unwrap $N$ candies, and get $c$ cherries and $l=N-c$ limes. These are i.i.d. observations, therefore $$P(\mathbf{d}|\theta) = \prod_{j=1}^N P(x_j | \theta) = \theta^c (1-\theta)^l.$$ Maximize this w.r.t. $\theta$, which is easier for the log-likelihood and leads to $$\begin{aligned} L(\mathbf{d}|\theta) &= \log P(\mathbf{d}|\theta) = c \log \theta + l \log(1-\theta) \\ \frac{\partial L(\mathbf{d}|\theta)}{\partial \theta} &= \frac{c}{\theta} - \frac{l}{1-\theta}=0. \end{aligned}$$ Hence $\theta=\frac{c}{N}$.
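A quick numerical confirmation of the closed form (a sketch with made-up counts; here $c=7$ cherries out of $N=10$):

```python
import numpy as np

c, N = 7, 10                       # hypothetical counts: 7 cherries out of 10
l = N - c

thetas = np.linspace(1e-3, 1 - 1e-3, 1001)
log_lik = c * np.log(thetas) + l * np.log(1 - thetas)
print(thetas[np.argmax(log_lik)])  # ~0.7, the grid point closest to c / N
print(c / N)                       # 0.7
```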


class: middle

Case (b)

Red and green wrappers depend probabilistically on flavor. E.g., the likelihood for a cherry candy in green wrapper is $$\begin{aligned} &P(\text{cherry}, \text{green}|\theta,\theta_1, \theta_2) \\ &= P(\text{cherry}|\theta,\theta_1, \theta_2) P(\text{green}|\text{cherry}, \theta,\theta_1, \theta_2) \\ &= \theta (1-\theta_1). \end{aligned}$$

The likelihood for the parameters, given $N$ candies, $r_c$ red-wrapped cherries, $g_c$ green-wrapped cherries, etc., is $$\begin{aligned} P(\mathbf{d}|\theta,\theta_1, \theta_2) &= \theta^c (1-\theta)^l \, \theta_1^{r_c}(1-\theta_1)^{g_c} \, \theta_2^{r_l} (1-\theta_2)^{g_l} \\ L &= c \log \theta + l \log(1-\theta) \\ &\quad + r_c \log \theta_1 + g_c \log(1-\theta_1) \\ &\quad + r_l \log \theta_2 + g_l \log(1-\theta_2). \end{aligned}$$


class: middle

The derivatives of $L$ yield $$\begin{aligned} \frac{\partial L}{\partial \theta} &= \frac{c}{\theta} - \frac{l}{1-\theta} = 0 \Rightarrow \theta = \frac{c}{c+l} \\ \frac{\partial L}{\partial \theta_1} &= \frac{r_c}{\theta_1} - \frac{g_c}{1-\theta_1} = 0 \Rightarrow \theta_1 = \frac{r_c}{r_c + g_c} \\ \frac{\partial L}{\partial \theta_2} &= \frac{r_l}{\theta_2} - \frac{g_l}{1-\theta_2} = 0 \Rightarrow \theta_2 = \frac{r_l}{r_l + g_l}. \end{aligned}$$
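The estimates decouple into simple count ratios, which a few lines of Python make explicit (the counts below are made up):

```python
# Hypothetical counts: red/green wrappers for cherry and lime candies.
r_c, g_c, r_l, g_l = 30, 10, 5, 15
c, l = r_c + g_c, r_l + g_l

theta = c / (c + l)              # fraction of cherry candies
theta_1 = r_c / (r_c + g_c)      # P(red wrapper | cherry)
theta_2 = r_l / (r_l + g_l)      # P(red wrapper | lime)
print(theta, theta_1, theta_2)   # 0.666..., 0.75, 0.25
```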

???

Again, results coincide with intuition.


class: middle

.question[In case (a), if we unwrap 1 candy and get 1 cherry, what is the MLE? How confident are we in this estimate?]

  • With small datasets, maximum likelihood estimation can lead to overfitting.
  • The MLE does not provide a measure of uncertainty about the parameters.

Bayesian parameter learning

We can treat parameter learning as a .bold[Bayesian inference] problem:

  • Make the parameters $\theta$ random variables and treat them as hidden variables.
  • Specify a prior distribution ${\bf P}(\theta)$ over the parameters.
  • Then, as data arrives, update our beliefs about the parameters to obtain the posterior distribution ${\bf P}(\theta|\mathbf{d})$.

.question[How should Figure 20.2 (a) be updated?]


class: middle

Case (a)

What is the fraction $\theta$ of cherry candies?

We assume a Beta prior $$P(\theta) = \text{Beta}(\theta|a,b) = \frac{1}{Z} \theta^{a-1} (1-\theta)^{b-1}$$ where $Z$ is a normalization constant.

Then, observing a cherry candy yields the posterior $$\begin{aligned} P(\theta|\text{cherry}) &\propto P(\text{cherry}|\theta) P(\theta) \\ &\propto \theta \, \theta^{a-1} (1-\theta)^{b-1} \\ &= \theta^{a} (1-\theta)^{b-1} \\ &\propto \text{Beta}(\theta|a+1,b). \end{aligned}$$
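More generally, after $c$ cherries and $l$ limes the posterior is $\text{Beta}(\theta|a+c, b+l)$. A small sketch with scipy (the prior hyperparameters and counts below are made up):

```python
from scipy.stats import beta

a, b = 2, 2        # Beta(a, b) prior over theta
c, l = 3, 1        # observed: 3 cherries, 1 lime

posterior = beta(a + c, b + l)               # conjugate update: Beta(a + c, b + l)
print(posterior.mean())                      # posterior mean: (a + c) / (a + b + c + l) = 0.625
print((a + c - 1) / (a + b + c + l - 2))     # posterior mode: 2/3
```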


class: middle

Case (b)

.center.width-100[]


class: middle

Maximum a posteriori estimation

When the posterior cannot be computed analytically, we can use maximum a posteriori (MAP) estimation, which consists in approximating the posterior with the point estimate $\theta^*$ that maximizes the posterior distribution, i.e., $$\theta^* = \arg \max_\theta P(\theta|\mathbf{d}) = \arg \max_\theta P(\mathbf{d}|\theta) P(\theta).$$
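For case (a) with a Beta prior, the MAP estimate has a simple closed form, which the grid-search sketch below recovers (the prior hyperparameters are made up; the counts match the earlier question: one candy unwrapped, one cherry observed).

```python
import numpy as np

a, b = 2, 2          # Beta(a, b) prior
c, N = 1, 1          # one candy unwrapped, one cherry observed
l = N - c

thetas = np.linspace(1e-3, 1 - 1e-3, 1001)
log_post = (c * np.log(thetas) + l * np.log(1 - thetas)                  # log P(d | theta)
            + (a - 1) * np.log(thetas) + (b - 1) * np.log(1 - thetas))   # + log P(theta)
print(thetas[np.argmax(log_post)])        # ~0.667, instead of the MLE 1.0
print((c + a - 1) / (N + a + b - 2))      # closed form: 2/3
```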


class: middle, center

(demo)


Summary

  • A Bayesian network specifies a full joint distribution. BNs are often exponentially smaller than an explicitly enumerated joint distribution.
  • The topology of a Bayesian network encodes conditional independence assumptions between random variables.
  • Inference is the problem of computing a marginal and/or a conditional probability distribution from a joint probability distribution.
    • Exact inference is possible for simple Bayesian networks, but is intractable for most probabilistic models of practical interest.
    • Approximate inference algorithms are used in practice.
  • Parameters of a Bayesian network can be learned from data using maximum likelihood estimation or Bayesian inference.

class: end-slide, center
count: false

The end.