Lei Mao


Introduction

The factorization theorem is fundamental to the representation of joint distributions in Bayesian networks. However, I found that many online tutorials, notes, and even books describe the theorem quite superficially, providing only a proof by example. The materials that do contain formal proofs are often incorrect or incomplete.


In this blog post, I would like to prove the factorization theorem for Bayesian networks with no steps omitted.

Prerequisites

Ordered Graph

In an ordered graph, the parents of a node are the nodes that are adjacent to it and precede it in the ordering. To create an ordered graph from a directed acyclic graph (DAG), we can simply run a topological sort algorithm on the graph. A Bayesian network is also a DAG.
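To make the ordering concrete, here is a minimal sketch of a topological sort using Kahn's algorithm; the node names and edges below are made up purely for illustration:

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm: return an ordering of a DAG in which every
    node appears after all of its parents."""
    parents = {v: set() for v in nodes}
    children = {v: set() for v in nodes}
    for u, v in edges:
        parents[v].add(u)
        children[u].add(v)
    # Start from the nodes with no parents (the roots).
    queue = deque(v for v in nodes if not parents[v])
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:
            parents[v].discard(u)
            if not parents[v]:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order

# Hypothetical example DAG: A -> B, A -> C, B -> D, C -> D.
print(topological_sort(["A", "B", "C", "D"],
                       [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]))
```

In the resulting ordering, the parents of each node are exactly the adjacent nodes that precede it, which is the property the proofs below rely on.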

D-Separation

Nodes $X$ and $Y$ are d-separated in a graph $G$ given a set of nodes $Z$ if there is no active trail in $G$ between $X$ and $Y$ given $Z$, i.e., $X \bot Y | Z$.


Note that any node in $G$ is d-separated from its non-descendants given its parents. Formally, $X_i \bot \text{NonDesc}(X_i) | \text{Pa}(X_i)$ where $\text{NonDesc}(X_i)$ is the non-descendants of node $X_i$ and $\text{Pa}(X_i)$ is the parents of node $X_i$.

Independency Mapping

Given a multivariate distribution, we have the independencies among the variables in the distribution. Given a probabilistic graph, we have the independencies among the nodes in the graph. The independence mapping (I-map) relates these two concepts.


We define $I(P)$ to be the set of independencies of the form $(X \bot Y | Z)$ that hold in the probability distribution $P$.


Similarly, we define $I(G)$ to be the set of d-separated independencies of the form $(X \bot Y | Z)$ in the graph $G$.


We say $G$ is an I-map of $P$ if $I(G) \subseteq I(P)$.


In fact, it is trivial to see the following facts for Bayesian networks:

  • $(X_i \bot \text{Desc}(X_i) | \text{Pa}(X_i)) \notin I(G)$ and thus $(X_i \bot \text{Desc}(X_i) | \text{Pa}(X_i)) \notin I(P)$.
  • $(X_i \bot \text{Desc}(X_i) | \text{Desc}(X_i)) \notin I(G)$ and thus $(X_i \bot \text{Desc}(X_i) | \text{Desc}(X_i)) \notin I(P)$.
  • $(X_i \bot \text{NonDesc}(X_i) | \text{Desc}(X_i)) \notin I(G)$ and thus $(X_i \bot \text{NonDesc}(X_i) | \text{Desc}(X_i)) \notin I(P)$.

We define $I_L(G)$ to be the set of d-separated independencies of the form $(X_i \bot \text{NonDesc}(X_i) | \text{Pa}(X_i))$ in the graph $G$.


Therefore, $I(G) = I_L(G)$ for Bayesian networks.

Factorization

Given a distribution $P$ and a graph $G$, $P$ is said to factorize over $G$ if $P$ can be expressed as the product of each variable's conditional probability given its parents. Formally,

\[P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))\]

This is a much more compact representation of the joint distribution compared with the chain rule expansion of the joint distribution.
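To see the compactness concretely, here is a rough parameter-count comparison, assuming a chain $X_1 \to X_2 \to \cdots \to X_n$ of binary variables (the chain structure is an assumption of this sketch, not something from the theorem):

```python
# Number of free parameters needed to specify the joint distribution of
# n binary variables, with and without the factorized (chain) form.
n = 10
# Full joint table: one probability per assignment, minus one for normalization.
full_joint = 2 ** n - 1
# Factorized chain: P(X1) needs 1 parameter; each P(Xi | X_{i-1}) needs 2
# (one probability per parent value).
factorized = 1 + (n - 1) * 2
print(full_joint, factorized)  # prints: 1023 19
```

The gap widens exponentially as $n$ grows, which is precisely why the factorized representation matters in practice.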

Factorization Theorem

I-Map -> Factorization

If $G$ is an I-map of $P$, then $P$ factorizes over $G$, i.e., $P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$.


This means that the conditional independencies encoded in $G$ imply factorization according to $G$.


Proof


Without loss of generality, we assume $X_1, X_2, \cdots, X_n$ is an ordering consistent with $G$, i.e., $G$ is an ordered graph.


By the chain rule of the joint probability distribution,

\[P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, X_2, \cdots, X_{i-1})\]

From the ordering assumption, we have

\[\text{Pa}(X_i) \subseteq \{X_1, X_2, \cdots, X_{i-1}\}\]

and

\[\{X_1, X_2, \cdots, X_{i-1}\} - \text{Pa}(X_i) \subseteq \text{NonDesc}(X_i)\]

Note that $\text{NonDesc}(X_i)$ might also contain some $X_j$ with $j > i$.


Because $G$ is an I-map of $P$, and $(X_i \bot \text{NonDesc}(X_i) | \text{Pa}(X_i))$ is d-separated in $G$, this independence must be in $I(P)$. Therefore,

\[\begin{align} P(X_i | X_1, X_2, \cdots, X_{i-1}) &= P(X_i | \text{Pa}(X_i), \{X_1, X_2, \cdots, X_{i-1}\} - \text{Pa}(X_i)) \\ &= P(X_i | \text{Pa}(X_i)) \\ \end{align}\]

Therefore, the chain rule expansion simplifies, and $P$ factorizes over $G$, i.e.,

\[P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))\]

This concludes the proof.
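The key step above, that conditioning on all predecessors collapses to conditioning on the parents alone, can be checked numerically. The sketch below assumes a hypothetical three-node chain $X_1 \to X_2 \to X_3$ with made-up binary CPTs:

```python
import itertools

# Hypothetical CPTs for the chain X1 -> X2 -> X3 (all binary, made up).
p1 = {0: 0.6, 1: 0.4}                                        # P(X1)
p2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}    # P(X2 | X1)
p3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}    # P(X3 | X2)

# The joint built by the factorization P(X1, X2, X3) = P(X1) P(X2|X1) P(X3|X2).
joint = {(a, b, c): p1[a] * p2[(a, b)] * p3[(b, c)]
         for a, b, c in itertools.product([0, 1], repeat=3)}

def cond(c, a, b):
    """P(X3 = c | X1 = a, X2 = b), computed from the joint by definition."""
    num = joint[(a, b, c)]
    den = sum(joint[(a, b, x)] for x in [0, 1])
    return num / den

# Conditioning on all predecessors equals conditioning on the parent alone:
# P(X3 | X1, X2) == P(X3 | X2) for every assignment.
for a, b, c in itertools.product([0, 1], repeat=3):
    assert abs(cond(c, a, b) - p3[(b, c)]) < 1e-12
print("ok")
```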

Factorization -> I-Map

If $P$ factorizes over $G$, i.e., $P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$, then $G$ is an I-map of $P$.


This means that factorization according to $G$ implies the associated conditional independencies.


Proof


Without loss of generality, we assume $X_1, X_2, \cdots, X_n$ is an ordering consistent with $G$, i.e., $G$ is an ordered graph.


We would like to first prove a lemma: if $P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$, then $P(X_1, X_2, \cdots, X_k) = \prod_{i=1}^{k} P(X_i | \text{Pa}(X_i))$ for $k \leq n$.

\[\begin{align} P(X_1, X_2, \cdots, X_{n-1}) &= \sum_{x_n} P(X_1, X_2, \cdots, X_{n-1}, X_{n}) \\ &= \sum_{x_n} \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i)) \\ &= \sum_{x_n} \bigg( \prod_{i=1}^{n-1} P(X_i | \text{Pa}(X_i)) P(X_n | \text{Pa}(X_n)) \bigg) \\ &= \prod_{i=1}^{n-1} P(X_i | \text{Pa}(X_i)) \sum_{x_n} \bigg( P(X_n | \text{Pa}(X_n)) \bigg) \\ &= \prod_{i=1}^{n-1} P(X_i | \text{Pa}(X_i)) \times 1\\ &= \prod_{i=1}^{n-1} P(X_i | \text{Pa}(X_i)) \\ \end{align}\]

We have proved $P(X_1, X_2, \cdots, X_k) = \prod_{i=1}^{k} P(X_i | \text{Pa}(X_i))$ for $k = n - 1$. Using (strong) induction, it is then trivial to prove $P(X_1, X_2, \cdots, X_k) = \prod_{i=1}^{k} P(X_i | \text{Pa}(X_i))$ for $k \leq n$.
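The lemma can also be verified numerically. The sketch below uses a hypothetical binary chain $X_1 \to X_2 \to X_3$ with made-up CPTs and checks that summing out the last variable leaves exactly the product of the remaining factors:

```python
import itertools

# Hypothetical CPTs for X1 -> X2 -> X3 (all binary, made up),
# used to build a factorized joint distribution.
p1 = {0: 0.6, 1: 0.4}                                        # P(X1)
p2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}    # P(X2 | X1)
p3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}    # P(X3 | X2)
joint = {(a, b, c): p1[a] * p2[(a, b)] * p3[(b, c)]
         for a, b, c in itertools.product([0, 1], repeat=3)}

# Marginalize out the last variable X3; the lemma says the result is still
# the product of the remaining factors, P(X1) P(X2 | X1).
for a, b in itertools.product([0, 1], repeat=2):
    marginal = sum(joint[(a, b, c)] for c in [0, 1])
    assert abs(marginal - p1[a] * p2[(a, b)]) < 1e-12
print("lemma holds")
```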


With this lemma proved, let us come back to the original statement.


We need to show that given $P(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))$, $I(G) \subseteq I(P)$, i.e., $I_L(G) \subseteq I(P)$. This is equivalent to showing that each $(X_i \bot \text{NonDesc}(X_i) | \text{Pa}(X_i)) \in I(P)$.


There is a common mistake that people make in this proof. To avoid it, we have to note that the following two sets are not necessarily equal in an ordered graph (even a topologically sorted one); in general, we only have

\[\{X_1, X_2, \cdots, X_i\} \subseteq \{X_i, \text{NonDesc}(X_i)\}\]

However, we can always reorder the nodes of $\{X_i, \text{NonDesc}(X_i)\}$ as $\{Y_1, Y_2, \cdots, Y_m\}$, where $m \geq i$, the mapping is one-to-one, and $Y_m = X_i$, without violating the ordered-graph property. Appending the remaining nodes, which are the descendants of $X_i$, as $Y_{m+1}, \cdots, Y_n$ then gives a full ordering $Y_1, Y_2, \cdots, Y_n$ that is still consistent with $G$. (Please try to understand why this is the case; drawing diagrams of topologically sorted DAGs may help.)


Because the reordering $Y_1, Y_2, \cdots, Y_n$ is still consistent with $G$,

\[\begin{align} P(X_1, X_2, \cdots, X_n) &= \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i)) \\ &= P(Y_1, Y_2, \cdots, Y_n) \\ &= \prod_{i=1}^{n} P(Y_i | \text{Pa}(Y_i)) \\ \end{align}\]

So $P(Y_1, Y_2, \cdots, Y_k) = \prod_{i=1}^{k} P(Y_i | \text{Pa}(Y_i))$ for $k \leq n$ still holds.

\[\begin{align} P(X_i | \text{NonDesc}(X_i) ) &= \frac{P(X_i, \text{NonDesc}(X_i) )}{P(\text{NonDesc}(X_i))} \\ &= \frac{P(Y_1, Y_2, \cdots, Y_m) }{P(Y_1, Y_2, \cdots, Y_{m-1})} \\ &= \frac{\prod_{j=1}^{m} P(Y_j | \text{Pa}(Y_{j})) }{\prod_{j=1}^{m-1} P(Y_j | \text{Pa}(Y_{j}))} \\ &= P(Y_m | \text{Pa}(Y_{m})) \\ &= P(X_i | \text{Pa}(X_{i})) \\ \end{align}\]

This means each $(X_i \bot \text{NonDesc}(X_i) | \text{Pa}(X_i)) \in I(P)$.


This concludes the proof.
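As a numerical sanity check of this direction, the sketch below builds a factorized joint for a hypothetical v-structure $X_1 \to X_3 \leftarrow X_2$ with made-up CPTs and confirms the local independence $X_1 \bot X_2$ (here $\text{Pa}(X_1)$ is empty and $X_2 \in \text{NonDesc}(X_1)$):

```python
import itertools

# Hypothetical CPTs for the v-structure X1 -> X3 <- X2 (all binary, made up).
p1 = {0: 0.3, 1: 0.7}   # P(X1); Pa(X1) is empty
p2 = {0: 0.6, 1: 0.4}   # P(X2); Pa(X2) is empty
# P(X3 = 1 | X1, X2), indexed by (x1, x2); P(X3 = 0 | ...) is the complement.
p3_one = {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.4, (1, 1): 0.9}

# Build the joint from the factorization P(X1) P(X2) P(X3 | X1, X2).
joint = {}
for a, b, c in itertools.product([0, 1], repeat=3):
    p3 = p3_one[(a, b)] if c == 1 else 1 - p3_one[(a, b)]
    joint[(a, b, c)] = p1[a] * p2[b] * p3

# X2 is a non-descendant of X1 and Pa(X1) is empty, so the theorem predicts
# X1 and X2 are marginally independent under the factorized joint.
for a, b in itertools.product([0, 1], repeat=2):
    p_ab = sum(joint[(a, b, c)] for c in [0, 1])
    assert abs(p_ab - p1[a] * p2[b]) < 1e-12
print("independent")
```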

Caveats

The factorization -> I-Map proofs in the references are incorrect or incomplete.

References