Lei Mao

Introduction

The application of autoregressive models and autoregressive decoding to sequence to sequence tasks has been discussed in my article “Autoregressive Model and Autoregressive Decoding for Sequence to Sequence Tasks”. Autoregressive modeling and decoding come very naturally to human intelligence. However, their major drawback is that the autoregressive decoding process cannot be parallelized, which limits their application in latency-constrained systems.


In this blog post, I would like to discuss whether it is feasible to apply non-autoregressive models and non-autoregressive decoding to sequence to sequence tasks, under reasonable conditional independence assumptions, so that the decoding process is tractable and can be parallelized.

Naive Non-Autoregressive Model

Given a sequence of input variables $X = \{X_1, X_2, \cdots, X_{T^{\prime}}\}$ and a sequence of output variables $Y = \{Y_1, Y_2, \cdots, Y_{T}\}$, it is very easy to propose a naive non-autoregressive model for sequence to sequence tasks.

\[\begin{align} P(Y | X; \theta) = P(T | X_{1:T^{\prime}}; \theta) \prod_{t=1}^{T} P(Y_t | X_{1:T^{\prime}}; \theta) \end{align}\]

where $P(T | X_{1:T^{\prime}}; \theta)$ models the conditional probability distribution for the length of output sequence, and $P(Y_t | X_{1:T^{\prime}}; \theta)$ models the conditional probability distribution for the token at each position of the output sequence.


This means that, given some context, we know the output sequence length, and for each token in the output sequence, we know what it is without even having to consider the preceding and following tokens. During inference, the generation of all the tokens in the output sequence can be parallelized. The decoding process requires $O(T)$ total work but only $O(1)$ sequential steps when fully parallelized, which is tractable.
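As a sketch, naive non-autoregressive greedy decoding can be written as follows. The vocabulary, length distribution, and per-position token distributions are hypothetical stand-ins for model outputs (`schoen` stands in for “schön”):

```python
import numpy as np

# A sketch of naive non-autoregressive greedy decoding. The vocabulary
# and probability tables are hypothetical stand-ins for model outputs.
vocab = ["Danke", "schoen", "Vielen", "Dank"]

def naive_nar_decode(length_probs, token_probs):
    """Pick the most likely length, then argmax each position
    independently; every position could be decoded in parallel because
    of the conditional independence assumption."""
    t = int(np.argmax(length_probs)) + 1             # lengths are 1-indexed
    token_ids = np.argmax(token_probs[:t], axis=-1)  # one argmax per position
    return [vocab[i] for i in token_ids]

length_probs = np.array([0.4, 0.6])             # P(T = 1 | X), P(T = 2 | X)
token_probs = np.array([[0.5, 0.1, 0.3, 0.1],   # P(Y_1 | X)
                        [0.2, 0.4, 0.1, 0.3]])  # P(Y_2 | X)

print(naive_nar_decode(length_probs, token_probs))  # ['Danke', 'schoen']
```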


However, this is immediately problematic: the conditional independence assumption we made above for the variables in the output sequence is not valid in many sequence to sequence tasks. Therefore, the model accuracy will likely be poor.

Autoregressive Model VS Naive Non-Autoregressive Model

Let’s take a look at an example of translating English “Thank you” to German. The corresponding translation in German could be “Danke”, “Danke schön”, or “Vielen Dank”. That is to say, we have a dataset consisting of (“Thank you”, “Danke”), (“Thank you”, “Danke schön”), and (“Thank you”, “Vielen Dank”). We will apply both the autoregressive model with autoregressive decoding and the naive non-autoregressive model with non-autoregressive decoding to this sequence to sequence task.


After the autoregressive model is trained on the dataset, we must have,

\[P(\text{Danke}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Vielen}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{schön}_2 | \langle \text{BOS} \rangle_0 \text{Danke}_1, \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Dank}_2 | \langle \text{BOS} \rangle_0 \text{Vielen}_1, \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Danke}_2 | \langle \text{BOS} \rangle_0 \text{Danke}_1, \text{Thank}_1 \text{you}_2; \theta) = 0\] \[P(\text{schön}_2 | \langle \text{BOS} \rangle_0 \text{Vielen}_1, \text{Thank}_1 \text{you}_2; \theta) = 0\]

In autoregressive decoding, the chain rule is then applied to get the joint conditional probability. If the reader is not familiar with the math of the autoregressive model and the autoregressive decoding, please read my article “Autoregressive Model and Autoregressive Decoding for Sequence to Sequence Tasks”.

\[\begin{align} P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) &= P(\text{Danke}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Danke}_1 \text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(\text{Danke}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) P(\text{schön}_2 | \langle \text{BOS} \rangle_0 \text{Danke}_1, \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Vielen}_1 \text{Dank}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(\text{Vielen}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) P(\text{Dank}_2 | \langle \text{BOS} \rangle_0 \text{Vielen}_1, \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Danke}_1 \text{Danke}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(\text{Danke}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) P(\text{Danke}_2 | \langle \text{BOS} \rangle_0 \text{Danke}_1, \text{Thank}_1 \text{you}_2; \theta) \\ &= 0 \\ \end{align}\] \[\begin{align} P(\text{Vielen}_1 \text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(\text{Vielen}_1 | \langle \text{BOS} \rangle_0, \text{Thank}_1 \text{you}_2; \theta) P(\text{schön}_2 | \langle \text{BOS} \rangle_0 \text{Vielen}_1, \text{Thank}_1 \text{you}_2; \theta) \\ &= 0 \\ \end{align}\]

All the joint conditional probabilities make sense.


After the naive non-autoregressive model is trained on the dataset, we must have,

\[P(T = 1 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(T = 2 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Vielen}_1 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\] \[P(\text{Danke}_2 | \text{Thank}_1 \text{you}_2; \theta) \neq 0\]

In the naive non-autoregressive decoding, the conditional independence assumption is applied.

\[\begin{align} P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) &= P(T = 1 | \text{Thank}_1 \text{you}_2; \theta) P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Danke}_1 \text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(T = 2 | \text{Thank}_1 \text{you}_2; \theta) P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) \\ &\qquad P(\text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Vielen}_1 \text{Dank}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(T = 2 | \text{Thank}_1 \text{you}_2; \theta) P(\text{Vielen}_1 | \text{Thank}_1 \text{you}_2; \theta) \\ &\qquad P(\text{Dank}_2 | \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Danke}_1 \text{Danke}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(T = 2 | \text{Thank}_1 \text{you}_2; \theta) P(\text{Danke}_1 | \text{Thank}_1 \text{you}_2; \theta) \\ &\qquad P(\text{Danke}_2 | \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\] \[\begin{align} P(\text{Vielen}_1 \text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) &= P(T = 2 | \text{Thank}_1 \text{you}_2; \theta) P(\text{Vielen}_1 | \text{Thank}_1 \text{you}_2; \theta) \\ &\qquad P(\text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta) \\ &\neq 0 \\ \end{align}\]

We find that $P(\text{Danke}_1 \text{Danke}_2 | \text{Thank}_1 \text{you}_2; \theta)$ and $P(\text{Vielen}_1 \text{schön}_2 | \text{Thank}_1 \text{you}_2; \theta)$ are not exactly $0$. If they were very small probabilities, the model would still be fine. In fact, however, these probabilities are not even small. This significantly violates the ground truth, in which both probabilities are exactly $0$.
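To make this concrete, here is a toy numeric check, assuming the naive model learns the maximum-likelihood marginals from the three training pairs (the probability tables are illustrative assumptions; `schoen` stands in for “schön”):

```python
from fractions import Fraction as F

# Maximum-likelihood marginals a naive non-autoregressive model would
# learn from the three training pairs ("Danke", "Danke schoen",
# "Vielen Dank") -- assumed values for illustration.
p_len = {1: F(1, 3), 2: F(2, 3)}                # P(T | Thank you)
p_pos1 = {"Danke": F(2, 3), "Vielen": F(1, 3)}  # P(Y_1 | Thank you)
p_pos2 = {"schoen": F(1, 2), "Dank": F(1, 2)}   # P(Y_2 | Thank you), length-2 pairs

# Under the conditional independence assumption, the joint factorizes,
# so the invalid translation "Vielen schoen" gets substantial mass:
p_vielen_schoen = p_len[2] * p_pos1["Vielen"] * p_pos2["schoen"]
print(p_vielen_schoen)  # 1/9, far from the ground-truth probability 0
```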


This means that the naive non-autoregressive decoding will not work very well in practice, as its independence assumption does not hold for most sequence to sequence tasks.

Orchestrated Non-Autoregressive Model

We introduce a latent variable $Z$ which is conditionally dependent on the input sequence $X$.

\[\begin{align} P(Y | X; \theta) &= \sum_{z \in Z}^{} P(Y, Z | X; \theta) \\ &= \sum_{z \in Z}^{} P(Z | X; \theta) P(Y | X, Z; \theta) \\ &= \sum_{z \in Z}^{} P(Z | X_{1:T^{\prime}}; \theta) P(T | X_{1:T^{\prime}}, Z; \theta) \prod_{t=1}^{T} P(Y_t | X_{1:T^{\prime}}, Z; \theta) \\ \end{align}\]

During non-autoregressive decoding, if we could forgo the marginalization over the latent variable $Z$, we would first sample $z$ from the distribution $P(Z | X_{1:T^{\prime}} = x_{1:T^{\prime}}; \theta)$, then sample $y_t$ from $P(Y_t | X_{1:T^{\prime}} = x_{1:T^{\prime}}, Z = z; \theta)$ for each token in the output sequence. This decoding process can of course be parallelized.


Does the conditional independence assumption with the latent variable $Z$ still have the same problem as we have in the naive non-autoregressive model? We will first focus on the example introduced in the “Autoregressive Model VS Naive Non-Autoregressive Model” section above.


Suppose $Z \in \{0, 1, 2\}$, and because the dataset is very small, the non-autoregressive model has unambiguously learned

\[P(Z = 0 | X_{1:T^{\prime}} = x_{1:T^{\prime}}; \theta) = \frac{1}{3}\] \[P(Z = 1 | X_{1:T^{\prime}} = x_{1:T^{\prime}}; \theta) = \frac{1}{3}\] \[P(Z = 2 | X_{1:T^{\prime}} = x_{1:T^{\prime}}; \theta) = \frac{1}{3}\] \[\begin{align} P(Y = \text{Danke}_1 | X = \text{Thank}_1 \text{you}_2, Z = 0; \theta) &= 1 \\ \end{align}\] \[\begin{align} P(Y = \text{Danke}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2, Z = 1; \theta) &= 1 \\ \end{align}\] \[\begin{align} P(Y = \text{Vielen}_1 \text{Dank}_2 | X = \text{Thank}_1 \text{you}_2, Z = 2; \theta) &= 1 \\ \end{align}\]

Then, we must have

\[\begin{align} P(Y = \text{Danke}_1 \text{Danke}_2 | X = \text{Thank}_1 \text{you}_2, Z; \theta) &= 0 \\ \end{align}\] \[\begin{align} P(Y = \text{Vielen}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2, Z; \theta) &= 0 \\ \end{align}\]

no matter what the value of $Z$ is.


This property looks very good: no matter what value of $Z$ we sample, we will always get a good translation for “Thank you”. This allows us to remove the marginalization over $Z$. When the dataset becomes large and the model becomes more complex, the dimensionality of $Z$ becomes very large and the distribution of $Z$ can be complex, so marginalizing over $Z$ becomes intractable. Thus the latent variable $Z$ is critical for parallelizing the decoding process.


More generally, during the non-autoregressive model optimization, we would like to maximize $P(Y | X; \theta)$. However, we see that directly optimizing $P(Y | X; \theta)$ is intractable.

\[\begin{align} \log P(Y | X; \theta) &= \log \bigg( \sum_{z \in Z}^{} P(Z | X_{1:T^{\prime}}; \theta) P(T | X_{1:T^{\prime}}, Z; \theta) \prod_{t=1}^{T} P(Y_t | X_{1:T^{\prime}}, Z; \theta) \bigg) \\ \end{align}\]

We introduce a proposal distribution $Q(Z | X, Y)$ that approximates the posterior $P(Z | X, Y; \theta)$. Following variational inference, we have the following inequality.

\[\begin{align} \log P(Y | X; \theta) \geq \mathbb{E}_{z \sim Q(Z | X, Y)} [\log P(Y , Z| X; \theta)] - \mathbb{E}_{z \sim Q(Z| X, Y)} [\log Q(Z| X, Y)] \end{align}\]
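For completeness, this bound follows from Jensen’s inequality applied to the marginal likelihood:

\[\begin{align} \log P(Y | X; \theta) &= \log \sum_{z \in Z}^{} Q(z | X, Y) \frac{P(Y, z | X; \theta)}{Q(z | X, Y)} \\ &\geq \sum_{z \in Z}^{} Q(z | X, Y) \log \frac{P(Y, z | X; \theta)}{Q(z | X, Y)} \\ &= \mathbb{E}_{z \sim Q(Z | X, Y)} [\log P(Y, Z | X; \theta)] - \mathbb{E}_{z \sim Q(Z | X, Y)} [\log Q(Z | X, Y)] \\ \end{align}\]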

Note that $\mathbb{E}_{z \sim Q(Z | X, Y)} [\log P(Y , Z| X; \theta)] - \mathbb{E}_{z \sim Q(Z | X, Y)} [\log Q(Z | X, Y)]$ is just the evidence lower bound (ELBO) in variational inference. We could have selected the distribution $Q$ to be $Q(Z)$, $Q(Z | X)$, or $Q(Z | Y)$, but we choose $Q$ to be $Q(Z | X, Y)$ on purpose. The reason will become apparent as we go through the following optimization process.


If $Q(Z | X, Y)$ is deterministic with respect to $Z$ and $Q(Z = z | X, Y) = 1$, the ELBO becomes

\[\begin{align} \mathbb{E}_{z \sim Q(Z | X, Y)} [\log P(Y , Z| X; \theta)] - \mathbb{E}_{z \sim Q(Z| X, Y)} [\log Q(Z| X, Y)] &= \log P(Y , Z = z | X; \theta) - 0 \\ &= \log P(Y , Z = z | X; \theta) \\ \end{align}\]

Then directly optimizing the ELBO becomes tractable. If $Q(Z | X, Y)$ is not deterministic with respect to $Z$, optimizing the ELBO is much more cumbersome and can be intractable.


So given input sequence $X$ and output sequence $Y$, we get $z$ from the deterministic $Q(Z | X, Y)$. Then we are going to optimize $\log P(Y , Z = z | X; \theta)$ using supervised learning with data $(x, y, z)$.

\[\begin{align} \log P(Y , Z = z | X; \theta) &= \log P(Z = z | X; \theta) + \log P(Y | X, Z = z; \theta) \\ &= \log P(Z = z | X_{1:T^{\prime}}; \theta) + \log P(T | X_{1:T^{\prime}}, Z = z; \theta) + \sum_{t=1}^{T} \log P(Y_t | X_{1:T^{\prime}}, Z = z; \theta) \\ \end{align}\] \[\begin{align} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \argmax_\theta \log P(Y , Z = z | X; \theta) &= \argmax_\theta \bigg( \log P(Z = z | X; \theta) + \log P(Y | X, Z = z; \theta) \bigg)\\ &= \argmax_\theta \bigg( \log P(Z = z | X_{1:T^{\prime}}; \theta) + \log P(T | X_{1:T^{\prime}}, Z = z; \theta) + \sum_{t=1}^{T} \log P(Y_t | X_{1:T^{\prime}}, Z = z; \theta) \bigg)\\ \end{align}\]

Here the reason why we chose $Q$ to be $Q(Z | X, Y)$ becomes apparent. Because $Q(Z | X, Y)$ is deterministic, we simply denote it as a function $Z = Q(X, Y)$.

\[z_1 = Q(X = \text{Thank}_1 \text{you}_2, Y = \text{Danke}_1 \text{schön}_2)\] \[z_2 = Q(X = \text{Thank}_1 \text{you}_2, Y = \text{Vielen}_1 \text{Dank}_2)\]

We want $z_1 \neq z_2$.


Then the non-autoregressive model can be optimized with the goals $P(Y = \text{Danke}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2, Z = z_1; \theta) = 1$ and $P(Y = \text{Vielen}_1 \text{Dank}_2 | X = \text{Thank}_1 \text{you}_2, Z = z_2; \theta) = 1$, so that it becomes feasible for the resulting model to satisfy $P(Y = \text{Danke}_1 \text{Danke}_2 | X = \text{Thank}_1 \text{you}_2; \theta) = 0$ and $P(Y = \text{Vielen}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2; \theta) = 0$.


Otherwise, if $Q(Z | X)$ were proposed instead of $Q(Z | X, Y)$, we would have $z_1 = z_2$. Then, after the non-autoregressive model is optimized with the goals $P(Y = \text{Danke}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2, Z = z_1; \theta) = 1$ and $P(Y = \text{Vielen}_1 \text{Dank}_2 | X = \text{Thank}_1 \text{you}_2, Z = z_2; \theta) = 1$, both $P(Y = \text{Danke}_1 \text{Danke}_2 | X = \text{Thank}_1 \text{you}_2; \theta)$ and $P(Y = \text{Vielen}_1 \text{schön}_2 | X = \text{Thank}_1 \text{you}_2; \theta)$ must deviate from $0$ significantly.

Orchestrated Non-Autoregressive Summary

To train a non-autoregressive model for sequence to sequence tasks, we have to propose a deterministic function $Z = Q(X, Y)$.
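As a purely mechanical illustration (not the choice used in practice), any deterministic function of $(X, Y)$ qualifies. Here is a hypothetical `toy_q` that hashes the output sequence into a small discrete latent space:

```python
import zlib

# A hypothetical deterministic Q(X, Y), for mechanical illustration only:
# it hashes the output sequence into a small discrete latent space, so
# different valid outputs for the same input tend to receive different z
# (collisions are possible with few latent values; practical systems use
# an alignment-based Q instead).
def toy_q(x_tokens, y_tokens, num_latents=16):
    key = " ".join(y_tokens).encode("utf-8")
    return zlib.crc32(key) % num_latents

z1 = toy_q(["Thank", "you"], ["Danke", "schoen"])
z2 = toy_q(["Thank", "you"], ["Vielen", "Dank"])
# Determinism is the only property we rely on here: the same (X, Y)
# always maps to the same z.
assert z1 == toy_q(["Thank", "you"], ["Danke", "schoen"])
```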


During training, given $(X, Y, Q(X, Y))$, the non-autoregressive model has to learn to predict $Z$ given $X$, i.e., $P(Z | X; \theta)$, and to predict $Y$ given $X$ and $Z$, i.e., $P(Y | X, Z; \theta)$, simultaneously. Concretely,

\[\begin{align} \argmax_\theta \log P(Y , Z = Q(X, Y) | X; \theta) &= \argmax_\theta \bigg( \log P(Z = Q(X, Y) | X; \theta) + \log P(Y | X, Z = Q(X, Y); \theta) \bigg)\\ \end{align}\]
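A minimal sketch of this training objective, assuming a hypothetical tabular model (the length term is omitted for brevity):

```python
import numpy as np

# A sketch of the training objective with a hypothetical tabular model:
# p_z_given_x[z] stands for P(Z = z | X) and p_y_given_xz[z][t][y] for
# P(Y_t = y | X, Z = z); the length term is omitted for brevity.
def neg_log_joint(p_z_given_x, p_y_given_xz, z, y_ids):
    """-log P(Y, Z = z | X) = -log P(Z = z | X) - sum_t log P(Y_t | X, Z = z)."""
    loss = -np.log(p_z_given_x[z])
    for t, y in enumerate(y_ids):
        loss -= np.log(p_y_given_xz[z][t][y])
    return loss

# With z = Q(x, y) supervising both heads, a perfectly fit token model
# leaves only the latent term: -log P(Z = 0 | X) = -log 0.5.
p_z = np.array([0.5, 0.5])
p_y = np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]])
print(neg_log_joint(p_z, p_y, z=0, y_ids=[0, 1]))  # ~0.6931
```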

During inference, given $X$, we can no longer use $Q(X, Y)$ because $Y$ remains to be predicted. However, the non-autoregressive model has learned both $P(Z | X; \theta)$ and $P(Y | X, Z; \theta)$. We first predict $Z$ by sampling from $P(Z | X; \theta)$. Once we have $(X, Z)$, we can sample the output sequence from $P(Y | X, Z; \theta)$. Because the token variables $Y_i$ in $Y = \{Y_1, Y_2, \cdots, Y_T\}$ are conditionally independent, the decoding process can be parallelized. Concretely, for non-autoregressive greedy decoding,

\[\begin{gather} z = \argmax_{Z} P(Z | X = x) \\ y_1 = \argmax_{Y_1} P(Y_1 | X = x, Z = z) \\ y_2 = \argmax_{Y_2} P(Y_2 | X = x, Z = z) \\ \vdots \\ y_T = \argmax_{Y_T} P(Y_T | X = x, Z = z) \\ \end{gather}\]
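The greedy decoding steps above can be sketched as follows. The probability tables, latent values, and vocabulary indices are toy assumptions mirroring the “Thank you” example, and the length model is omitted for brevity:

```python
import numpy as np

# Toy probability tables for the "Thank you" example, with z in {0, 1, 2}
# and vocabulary indices 0..3 = [Danke, schoen, Vielen, Dank]; all values
# are assumptions for illustration, and the length model is omitted.
p_z = np.array([0.2, 0.5, 0.3])                        # P(Z | X)
p_y_given_z = np.array([
    [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]],  # z = 0 -> "Danke"
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],      # z = 1 -> "Danke schoen"
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],      # z = 2 -> "Vielen Dank"
])

def nar_greedy_decode(p_z, p_y_given_z):
    """Argmax over the latent z first, then an independent argmax per
    output position; the per-position argmaxes can run in parallel."""
    z = int(np.argmax(p_z))
    y = np.argmax(p_y_given_z[z], axis=-1).tolist()
    return z, y

z, y = nar_greedy_decode(p_z, p_y_given_z)
print(z, y)  # 1 [0, 1] -> "Danke schoen"
```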

Remaining Questions

One of the remaining questions is what the deterministic function $Z = Q(X, Y)$ should be. It could be arbitrary, but the choice of $Q$ affects whether the model can learn the task well. In “Non-Autoregressive Neural Machine Translation”, for example, it is a hard alignment between the input sequence and the output sequence, produced by IBM Model 2 or any other pre-trained model that aligns input and output sequences.

References