### Lei Mao


# Markov Random Field and Conditional Random Field

### Introduction

Markov random fields and conditional random fields are two common classes of undirected probabilistic graphical models.

In this blog post, I would like to discuss the definitions, relationships, caveats, tricks, and applications of Markov random fields and conditional random fields from a mathematical perspective.

### Markov Random Field

Suppose we have a set of factors,

$\Phi = \{ \phi_1(\boldsymbol{D}_1), \phi_2(\boldsymbol{D}_2), \cdots, \phi_k(\boldsymbol{D}_k) \}$

where $\boldsymbol{D}_i$ is a clique of the undirected graph, $\boldsymbol{D}_i \subseteq \{ X_1, X_2, \cdots, X_n \}$, and $\bigcup_{i=1}^{k} \boldsymbol{D}_i = \{ X_1, X_2, \cdots, X_n \}$.

We define the joint distribution $P(X_1, X_2, \cdots, X_n)$ as

$P(X_1, X_2, \cdots, X_n) = \frac{\widetilde{P}_{\Phi}(X_1, X_2, \cdots, X_n)}{Z_{\Phi}}$

where $\widetilde{P}_{\Phi}(X_1, X_2, \cdots, X_n)$ is the unnormalized measure and $Z_{\Phi}$ is the partition function. Concretely,

$\widetilde{P}_{\Phi}(X_1, X_2, \cdots, X_n) = \prod_{i=1}^{k} \phi_i(\boldsymbol{D}_i)$

and

$Z_{\Phi} = \sum_{X_1, X_2, \cdots, X_n} \widetilde{P}_{\Phi}(X_1, X_2, \cdots, X_n)$

Note that $Z_{\Phi}$ is in general intractable to compute, since the sum ranges over exponentially many joint assignments, even when each $X_i$ is a binary variable.
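To make the definitions concrete, here is a minimal sketch (the factor tables are hypothetical, not from the post) of a tiny MRF over three binary variables with two pairwise factors, computing the unnormalized measure and the partition function by brute force. The $O(2^n)$ sum in the last step is exactly why $Z_{\Phi}$ is intractable at scale.

```python
import itertools

# Hypothetical factor tables: phi[(x_a, x_b)] -> non-negative real.
phi_1 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
phi_2 = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0}

def unnormalized_measure(x1, x2, x3):
    # P~_Phi(X1, X2, X3) = phi_1(X1, X2) * phi_2(X2, X3)
    return phi_1[(x1, x2)] * phi_2[(x2, x3)]

# Z_Phi sums the unnormalized measure over all 2^3 joint assignments.
partition = sum(
    unnormalized_measure(x1, x2, x3)
    for x1, x2, x3 in itertools.product((0, 1), repeat=3)
)

def joint(x1, x2, x3):
    # P(X1, X2, X3) = P~_Phi(X1, X2, X3) / Z_Phi
    return unnormalized_measure(x1, x2, x3) / partition
```

With $n$ variables the sum has $2^n$ terms, so this brute-force normalization is only viable for toy models.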

### Conditional Random Field

A conditional random field is to a Markov random field what a conditional probability distribution is to a joint probability distribution.

Suppose we have a set of factors, and the random variables can be partitioned into two disjoint sets $\boldsymbol{X}$ and $\boldsymbol{Y}$,

$\Phi = \{ \phi_1(\boldsymbol{D}_1), \phi_2(\boldsymbol{D}_2), \cdots, \phi_k(\boldsymbol{D}_k) \}$

where $\boldsymbol{D}_i$ is a clique of the undirected graph, $\boldsymbol{D}_i \subseteq \boldsymbol{X} \cup \boldsymbol{Y}$, and $\bigcup_{i=1}^{k} \boldsymbol{D}_i = \boldsymbol{X} \cup \boldsymbol{Y}$.

We define the conditional distribution $P(\boldsymbol{Y} | \boldsymbol{X})$ as

$P(\boldsymbol{Y} | \boldsymbol{X}) = \frac{\widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y})}{Z_{\Phi}(\boldsymbol{X})}$

where $\widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y})$ is the unnormalized measure which is the same as the one in Gibbs distribution and $Z_{\Phi}(\boldsymbol{X})$ is a different partition function. Concretely,

$\widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y}) = \prod_{i=1}^{k} \phi_i(\boldsymbol{D}_i)$

and

$Z_{\Phi}(\boldsymbol{X}) = \sum_{\boldsymbol{Y}} \widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y})$

Note that computing $Z_{\Phi}(\boldsymbol{X})$ is still intractable in general. However, if the number of variables in $\boldsymbol{Y}$ is small and each takes only a few possible values, computing $Z_{\Phi}(\boldsymbol{X})$ becomes feasible. For example, suppose $\boldsymbol{X} = \{X_1, X_2, \cdots, X_{1920 \times 1080 \times 3}\}$ are the pixel values of a $1080$p RGB image, and $\boldsymbol{Y} = \{Y\}$ is just the label of the image with $10$ possible values; then computing $Z_{\Phi}(\boldsymbol{X})$ requires summing only $10$ terms and is very fast.
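This asymmetry can be sketched in a few lines (the input and the scoring function below are hypothetical stand-ins): because $\boldsymbol{X}$ is observed, $Z_{\Phi}(\boldsymbol{X})$ sums over $\boldsymbol{Y}$ only, so normalization costs $10$ terms no matter how large $\boldsymbol{X}$ is.

```python
import math
import random

# A toy observed input X: a stand-in for the ~6.2 million pixel values of a
# 1080p RGB image (shortened here so the example runs quickly).
random.seed(0)
x = [random.random() for _ in range(10_000)]
num_labels = 10

def score(x, y):
    # Hypothetical log-factor f(X, Y) = log P~_Phi(X, Y); any function of
    # (x, y) would do, here a toy linear score on a few entries of x.
    return (y + 1) * (x[0] + x[1] - x[2])

# Z_Phi(X) sums over Y only: 10 terms, regardless of the size of X.
log_scores = [score(x, y) for y in range(num_labels)]
z_x = sum(math.exp(s) for s in log_scores)
p_y_given_x = [math.exp(s) / z_x for s in log_scores]
```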

In many discriminative tasks, conditional random fields are preferable to Markov random fields.

### Exponential Form

We could rewrite the unnormalized measure $\widetilde{P}_{\Phi}(\boldsymbol{X})$ or $\widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y})$ in the exponential form used by the exponential family. For example, for $\widetilde{P}_{\Phi}(\boldsymbol{X})$, we have

\begin{align} \widetilde{P}_{\Phi}(\boldsymbol{X}) &= \prod_{i=1}^{k} \phi_i(\boldsymbol{D}_i) \\ &= \prod_{i=1}^{k} \exp \big( \log \phi_i(\boldsymbol{D}_i) \big) \\ &= \exp\bigg( \sum_{i=1}^{k} \log \phi_i(\boldsymbol{D}_i) \bigg) \\ &= \exp\bigg( \sum_{i=1}^{k} f_i(\boldsymbol{D}_i) \bigg) \\ \end{align}

We further tweak the above expression.

\begin{align} \widetilde{P}_{\Phi}(\boldsymbol{X}) &= \exp\bigg( \sum_{i=1}^{k} f_i(\boldsymbol{D}_i) \bigg) \\ &= \exp\bigg( - \sum_{i=1}^{k} \psi_i(\boldsymbol{D}_i) \bigg) \\ \end{align}

where $\psi_i(\boldsymbol{D}_i) = -f_i(\boldsymbol{D}_i)$.

Therefore,

$P(\boldsymbol{X}) = \frac{\exp\bigg( - \sum_{i=1}^{k} \psi_i(\boldsymbol{D}_i) \bigg)}{Z_{\Phi}}$

Now, it just looks like a Boltzmann distribution (Gibbs distribution).

We call $\psi_i(\boldsymbol{D}_i)$ the potential function for clique $\boldsymbol{D}_i$, and the energy of the state, $E(\boldsymbol{X})$, sometimes referred to as the energy function, is defined as

$E(\boldsymbol{X}) = \sum_{i=1}^{k} \psi_i(\boldsymbol{D}_i)$
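The equivalence between the product-of-factors form and the energy form can be verified numerically. A minimal sketch, assuming a single hypothetical pairwise factor: since $\psi_i = -\log \phi_i$, taking $\exp(-E(\boldsymbol{X}))$ recovers the original factor values exactly.

```python
import math

# A hypothetical single pairwise factor over two binary variables.
phi = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}

def energy(x1, x2):
    # E(X) = sum_i psi_i(D_i) with psi_i = -log phi_i; here there is a
    # single clique, so the energy is just one potential term.
    return -math.log(phi[(x1, x2)])

# The unnormalized measure recovered from the energy, exp(-E(X)),
# matches the original factor value phi(X) for every assignment.
recovered = {a: math.exp(-energy(*a)) for a in phi}
```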

### Conditional Random Field and Conventional Classification Model

Suppose $k = 1$ and $\boldsymbol{D}_1 = \boldsymbol{X} \cup \boldsymbol{Y}$; we then have

\begin{align} \widetilde{P}_{\Phi}(\boldsymbol{X}, \boldsymbol{Y}) &= \exp \Big( f(\boldsymbol{X}, \boldsymbol{Y}) \Big) \\ \end{align}

So the conditional probability distribution $P(\boldsymbol{Y} | \boldsymbol{X})$ is

$P(\boldsymbol{Y} | \boldsymbol{X}) = \frac{\exp \Big( f(\boldsymbol{X}, \boldsymbol{Y}) \Big)}{\sum_{\boldsymbol{Y}} \exp \Big( f(\boldsymbol{X}, \boldsymbol{Y}) \Big)}$

Doesn’t this look familiar? It is just the softmax function used in conventional classification models, where $f(\boldsymbol{X}, \boldsymbol{Y})$ is the logit for class $\boldsymbol{Y}$ given input $\boldsymbol{X}$.
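The correspondence can be sketched directly (the logits below are hypothetical, as if produced by a classifier head for one input $\boldsymbol{X}$): a single-factor CRF over a small label set is exactly the softmax over $f(\boldsymbol{X}, \boldsymbol{Y})$.

```python
import math

logits = [2.0, -1.0, 0.5]  # hypothetical f(X, Y) for Y in {0, 1, 2}

def softmax(scores):
    # Subtracting the max is the usual numerical-stability trick; it
    # cancels in the ratio, leaving P(Y | X) unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax(logits)  # P(Y | X), normalized over the 3 labels
```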

### Choices for $f$ or $\phi$

The definitions of Markov random fields and conditional random fields do not specify the form of $f$ or $\phi$, so in principle they could be any functions. In practice, however, weighted sums of binary indicator functions are commonly used, which reduces the number of parameters in the model. Concretely,

$\phi_i(\boldsymbol{D}_i) = \exp \bigg( - \sum_{j=1}^{l_i} w_j g_j(\boldsymbol{D}_i) \bigg)$

$f_i(\boldsymbol{D}_i) = - \sum_{j=1}^{l_i} w_j g_j(\boldsymbol{D}_i)$

where each $g_j$ is a binary indicator function, and usually $l_i = | \boldsymbol{D}_i |$.
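A minimal sketch of such a log-linear factor (the feature functions and weights below are illustrative assumptions): two binary indicator features $g_j$ on a pairwise clique, combined with weights $w_j$ so that $f_i(\boldsymbol{D}_i) = -\sum_j w_j g_j(\boldsymbol{D}_i)$ and $\phi_i(\boldsymbol{D}_i) = \exp(f_i(\boldsymbol{D}_i))$.

```python
import math

def g_agree(x_a, x_b):
    # Indicator feature: 1 if the two variables in the clique agree.
    return 1.0 if x_a == x_b else 0.0

def g_both_on(x_a, x_b):
    # Indicator feature: 1 if both variables are 1.
    return 1.0 if x_a == 1 and x_b == 1 else 0.0

weights = [-1.5, 0.7]  # w_j, one per feature; learned in practice
features = [g_agree, g_both_on]

def f(x_a, x_b):
    # f_i(D_i) = -sum_j w_j g_j(D_i)
    return -sum(w * g(x_a, x_b) for w, g in zip(weights, features))

def phi(x_a, x_b):
    # phi_i(D_i) = exp(f_i(D_i))
    return math.exp(f(x_a, x_b))
```

Only the weights are learned; the indicator features are fixed, which is what keeps the parameter count small.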

### Intractable Partition Function

The intractable partition function in Markov random fields and conditional random fields is a headache. To make it tractable to compute, we usually have to impose some structure on the model when we create it.

\begin{align} Z_{\Phi} &= \sum_{X_1, X_2, \cdots, X_n} \widetilde{P}_{\Phi}(X_1, X_2, \cdots, X_n) \\ &= \sum_{X_1, X_2, \cdots, X_n} \exp\bigg( \sum_{i=1}^{k} f_i(\boldsymbol{D}_i) \bigg) \\ \end{align}

For example, suppose $n = 2k$ is even, $\boldsymbol{D}_1 = \{X_1, X_2\}$, $\boldsymbol{D}_2 = \{X_3, X_4\}$, $\cdots$, $\boldsymbol{D}_k = \{X_{2k-1}, X_{2k}\}$, $f = f_1 = f_2 = \cdots = f_k$, and $X_i$ is a binary variable for all $i \in [1, n]$. Then

\begin{align} Z_{\Phi} &= \sum_{X_1, X_2, \cdots, X_n} \exp\bigg( \sum_{i=1}^{k} f_i(\boldsymbol{D}_i) \bigg) \\ &= \sum_{X_1, X_2, \cdots, X_n} \exp\bigg( \sum_{i=1}^{k} f(\boldsymbol{D}_i) \bigg) \\ &= \sum_{X_1, X_2, \cdots, X_n} \exp\bigg( \sum_{i=1}^{k} f(X_{2i-1}, X_{2i}) \bigg) \\ &= \sum_{X_1, X_2, \cdots, X_n} \prod_{i=1}^{k} \exp\Big( f(X_{2i-1}, X_{2i}) \Big) \\ &= \bigg( \sum_{X_1, X_2} \exp\Big( f(X_{1}, X_{2}) \Big) \bigg) \bigg( \sum_{X_3, X_4} \exp\Big( f(X_{3}, X_{4}) \Big) \bigg) \cdots \bigg( \sum_{X_{2k-1}, X_{2k}} \exp\Big( f(X_{2k-1}, X_{2k}) \Big) \bigg) \\ &= \bigg( \sum_{X_1, X_2} \exp\Big( f(X_{1}, X_{2}) \Big) \bigg)^k \\ \end{align}

So with the original form, we have to perform on the order of $k 2^n = \frac{n 2^n}{2}$ additions, whereas with the refined structured form, we only have to sum $4$ terms and raise the result to the $k$-th power.

The structure we can impose on the partition function varies from problem to problem. However, the example above demonstrates that imposing structure on the random variables in the model can make computing the partition function feasible.
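The factorization above can be checked numerically. A minimal sketch (the shared log-factor $f$ below is a hypothetical choice) with $n = 2k$ binary variables and disjoint pairwise cliques: the brute-force $2^n$-term sum agrees with the single 4-term sum raised to the $k$-th power.

```python
import itertools
import math

def f(x_a, x_b):
    # A hypothetical shared log-factor on one pair of binary variables.
    return 0.3 * x_a + 0.5 * x_b - 0.2 * x_a * x_b

k = 4
n = 2 * k

# Brute force: sum over all 2^n joint assignments.
z_brute = sum(
    math.exp(sum(f(xs[2 * i], xs[2 * i + 1]) for i in range(k)))
    for xs in itertools.product((0, 1), repeat=n)
)

# Structured form: one 4-term sum over a single pair, raised to the k-th power.
z_pair = sum(math.exp(f(a, b)) for a, b in itertools.product((0, 1), repeat=2))
z_structured = z_pair ** k
```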

### Inference

To find the optimal $\boldsymbol{X}^{\ast}$ that maximizes $P(\boldsymbol{X})$, computing the partition function $Z_{\Phi}$ or $Z_{\Phi}(\boldsymbol{X})$ is not necessary, as it is just a constant.

Because

\begin{align} E(\boldsymbol{X}) &= \sum_{i=1}^{k} \psi_i(\boldsymbol{D}_i) \\ &= - \log \big( Z_{\Phi} P(\boldsymbol{X}) \big) \\ &= - \log Z_{\Phi} - \log P(\boldsymbol{X}) \\ \end{align}

we have

\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax}

\begin{align} \boldsymbol{X}^{\ast} &= \argmax_{\boldsymbol{X}} P(\boldsymbol{X}) \\ &= \argmin_{\boldsymbol{X}} \big( - \log P(\boldsymbol{X}) \big) \\ &= \argmin_{\boldsymbol{X}} \big( - \log Z_{\Phi} - \log P(\boldsymbol{X}) \big) \\ &= \argmin_{\boldsymbol{X}} E(\boldsymbol{X}) \\ \end{align}
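This equivalence can be checked on a toy model (the potentials below are hypothetical): maximizing $P(\boldsymbol{X})$ by brute force and minimizing $E(\boldsymbol{X})$ select the same assignment, and the latter never touches $Z_{\Phi}$.

```python
import itertools
import math

def energy(x1, x2, x3):
    # E(X) as a sum of hypothetical potentials: two pairwise disagreement
    # penalties plus a small unary term on x1 to break ties.
    psi_12 = 0.0 if x1 == x2 else 1.0
    psi_23 = 0.0 if x2 == x3 else 1.0
    return psi_12 + psi_23 + 0.5 * x1

states = list(itertools.product((0, 1), repeat=3))

# MAP via the normalized probability: requires the partition function.
z = sum(math.exp(-energy(*s)) for s in states)
map_by_prob = max(states, key=lambda s: math.exp(-energy(*s)) / z)

# MAP via energy minimization: no partition function needed.
map_by_energy = min(states, key=lambda s: energy(*s))
```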

However, it is still very difficult to find the optimal $\boldsymbol{X}^{\ast}$, especially when each variable in $\boldsymbol{X}$ is discrete, since the search space grows combinatorially.

There are some optimization algorithms that approximately find the optimum when the energy function $E(\boldsymbol{X})$ is of certain form. We will probably discuss some of them in the future.