Non-Autoregressive Model and Non-Autoregressive Decoding for Sequence to Sequence Tasks
Introduction
The application of the autoregressive model and autoregressive decoding to sequence to sequence tasks has been discussed in my article “Autoregressive Model and Autoregressive Decoding for Sequence to Sequence Tasks”. The autoregressive model and autoregressive decoding are very natural to human intelligence. However, their major drawback is that the autoregressive decoding process cannot be parallelized, which limits their application in latency-constrained systems.
In this blog post, I would like to discuss whether it is feasible to apply the non-autoregressive model and non-autoregressive decoding to sequence to sequence tasks, with reasonable conditional independence assumptions, so that the decoding process is tractable and can be parallelized.
Naive Non-Autoregressive Model
Given a sequence of input variables $x = (x_1, x_2, \ldots, x_n)$ and a sequence of output variables $y = (y_1, y_2, \ldots, y_m)$, the naive non-autoregressive model assumes that the output variables are conditionally independent of each other given the input,

$$
P(y_1, y_2, \ldots, y_m \mid x_1, x_2, \ldots, x_n) = P(m \mid x_1, x_2, \ldots, x_n) \prod_{i=1}^{m} P(y_i \mid x_1, x_2, \ldots, x_n)
$$

where $n$ is the length of the input sequence and $m$ is the length of the output sequence.

This means, given some context, we know the output sequence length, and for each token in the output sequence, we know what it is without even having to think about what its preceding and following tokens are. During inference, the generation of each of the tokens in the output sequence can be parallelized. The time complexity of the decoding process is $O(1)$ with respect to the number of sequential decoding steps, instead of $O(m)$ as in autoregressive decoding.
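To make the parallel decoding concrete, below is a minimal Python sketch of naive non-autoregressive greedy decoding. The per-position distributions are made-up toy numbers, and the greedy argmax is only one possible decoding rule; the point is that no output position has to wait for any other.

```python
# A minimal sketch of naive non-autoregressive greedy decoding.
# The per-position distributions P(y_i | x) below are hypothetical toy numbers,
# only meant to show that every output position can be decoded independently.

def naive_non_autoregressive_decode(per_position_distributions):
    # Each element of per_position_distributions is a dict mapping a candidate
    # token to P(y_i | x). Since the positions are conditionally independent
    # given x, we can take the argmax at every position at the same time;
    # there are no sequential dependencies between positions.
    return [max(dist, key=dist.get) for dist in per_position_distributions]

if __name__ == "__main__":
    # Hypothetical P(y_1 | x) and P(y_2 | x) for some input x.
    p_y1 = {"Danke": 0.7, "Vielen": 0.3}
    p_y2 = {"schön": 0.6, "Dank": 0.4}
    print(naive_non_autoregressive_decode([p_y1, p_y2]))  # ['Danke', 'schön']
```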
However, we immediately find this problematic, because the conditional independence assumption we made above for the variables in the output sequence is not valid in many sequence to sequence tasks. Therefore, the model accuracy will likely be poor.
Autoregressive Model vs. Naive Non-Autoregressive Model
Let’s take a look at an example of translating the English “Thank you” to German. The corresponding translation in German could be “Danke”, “Danke schön”, or “Vielen Dank”. That is to say, we have a dataset consisting of (“Thank you”, “Danke”), (“Thank you”, “Danke schön”), and (“Thank you”, “Vielen Dank”). We will apply the autoregressive model with autoregressive decoding, and the non-autoregressive model with non-autoregressive decoding, to this sequence to sequence task.
After the autoregressive model is trained with the dataset, we must have

$$
P(y_1 = \text{"Danke"} \mid x = \text{"Thank you"}) = \frac{2}{3}, \quad
P(y_1 = \text{"Vielen"} \mid x = \text{"Thank you"}) = \frac{1}{3}
$$

$$
P(y_2 = \text{"schön"} \mid y_1 = \text{"Danke"}, x = \text{"Thank you"}) = \frac{1}{2}, \quad
P(y_2 = \langle\text{EOS}\rangle \mid y_1 = \text{"Danke"}, x = \text{"Thank you"}) = \frac{1}{2}
$$

$$
P(y_2 = \text{"Dank"} \mid y_1 = \text{"Vielen"}, x = \text{"Thank you"}) = 1
$$

where $\langle\text{EOS}\rangle$ denotes the end-of-sequence token that terminates the output sequence.

In autoregressive decoding, the chain rule is then applied to get the joint conditional probabilities,

$$
P(\text{"Danke"} \mid \text{"Thank you"}) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3}, \quad
P(\text{"Danke schön"} \mid \text{"Thank you"}) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3}, \quad
P(\text{"Vielen Dank"} \mid \text{"Thank you"}) = \frac{1}{3} \times 1 = \frac{1}{3}
$$

If the reader is not familiar with the math of the autoregressive model and the autoregressive decoding, please read my article “Autoregressive Model and Autoregressive Decoding for Sequence to Sequence Tasks”.

All the joint conditional probabilities make sense: each of the three valid translations receives probability $\frac{1}{3}$, and no invalid token sequence receives a positive probability.
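As a sanity check, the chain-rule computation above can be reproduced in a few lines of Python. The conditional probability tables below are just the empirical counts from the toy dataset, with a hypothetical end-of-sequence token `<eos>` marking the end of the length-one translation.

```python
# Empirical conditionals of the toy autoregressive model trained on
# ("Thank you", "Danke"), ("Thank you", "Danke schön"), ("Thank you", "Vielen Dank").
p_y1 = {"Danke": 2 / 3, "Vielen": 1 / 3}                # P(y_1 | x)
p_y2_given_y1 = {
    "Danke": {"<eos>": 1 / 2, "schön": 1 / 2},          # P(y_2 | y_1="Danke", x)
    "Vielen": {"Dank": 1.0},                            # P(y_2 | y_1="Vielen", x)
}

def chain_rule_probability(y1, y2):
    # P(y_1, y_2 | x) = P(y_1 | x) * P(y_2 | y_1, x)
    return p_y1[y1] * p_y2_given_y1[y1][y2]

for y1, y2 in [("Danke", "<eos>"), ("Danke", "schön"), ("Vielen", "Dank")]:
    print(y1, y2, chain_rule_probability(y1, y2))  # each valid translation gets 1/3
```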
After the naive non-autoregressive model is trained with the dataset, we must have

$$
P(y_1 = \text{"Danke"} \mid x = \text{"Thank you"}) = \frac{2}{3}, \quad
P(y_1 = \text{"Vielen"} \mid x = \text{"Thank you"}) = \frac{1}{3}
$$

$$
P(y_2 = \text{"schön"} \mid x = \text{"Thank you"}) = \frac{1}{3}, \quad
P(y_2 = \text{"Dank"} \mid x = \text{"Thank you"}) = \frac{1}{3}, \quad
P(y_2 = \langle\text{EOS}\rangle \mid x = \text{"Thank you"}) = \frac{1}{3}
$$

In the naive non-autoregressive decoding, the conditional independence is applied to get the joint conditional probabilities,

$$
P(y_1, y_2 \mid x) = P(y_1 \mid x) \, P(y_2 \mid x)
$$

We found that

$$
P(\text{"Danke schön"} \mid \text{"Thank you"}) = \frac{2}{3} \times \frac{1}{3} = \frac{2}{9}, \quad
P(\text{"Danke Dank"} \mid \text{"Thank you"}) = \frac{2}{3} \times \frac{1}{3} = \frac{2}{9}
$$

$$
P(\text{"Vielen Dank"} \mid \text{"Thank you"}) = \frac{1}{3} \times \frac{1}{3} = \frac{1}{9}, \quad
P(\text{"Vielen schön"} \mid \text{"Thank you"}) = \frac{1}{3} \times \frac{1}{3} = \frac{1}{9}
$$

The invalid translation “Danke Dank” is assigned the same probability as the valid translation “Danke schön”, and the invalid translation “Vielen schön” is assigned the same probability as the valid translation “Vielen Dank”. This means that the naive non-autoregressive decoding will not work very well in practice, as its conditional independence assumption does not make sense for most sequence to sequence tasks.
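The failure mode is equally easy to reproduce numerically. The sketch below enumerates all position-wise combinations under the conditional independence assumption and shows that invalid outputs such as “Danke Dank” receive as much probability mass as valid ones.

```python
from itertools import product

# Position-wise conditionals of the naive non-autoregressive model,
# estimated from the same toy dataset as before.
p_y1 = {"Danke": 2 / 3, "Vielen": 1 / 3}                 # P(y_1 | x)
p_y2 = {"schön": 1 / 3, "Dank": 1 / 3, "<eos>": 1 / 3}   # P(y_2 | x)

valid_translations = {("Danke", "<eos>"), ("Danke", "schön"), ("Vielen", "Dank")}

# Under conditional independence, P(y_1, y_2 | x) = P(y_1 | x) * P(y_2 | x).
for y1, y2 in product(p_y1, p_y2):
    joint = p_y1[y1] * p_y2[y2]
    label = "valid" if (y1, y2) in valid_translations else "invalid"
    print(f"{y1} {y2}: {joint:.3f} ({label})")
# "Danke Dank" (invalid) gets 2/9, the same as "Danke schön" (valid).
```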
Orchestrated Non-Autoregressive Model
We introduce a latent variable $z$, and assume that the variables in the output sequence are conditionally independent of each other given the latent variable $z$ and the input sequence $x$,

$$
P(y_1, y_2, \ldots, y_m \mid x) = \sum_{z} P(z \mid x) \prod_{i=1}^{m} P(y_i \mid z, x)
$$

During the non-autoregressive decoding, if we could forget about the marginalization of the latent variable $z$, and instead first predict a single most likely latent value $\hat{z} = \operatorname{argmax}_{z} P(z \mid x)$ from the input sequence, the generation of all the tokens in the output sequence conditioned on $\hat{z}$ and $x$ could still be parallelized.

Does the conditional independence assumption with the latent variable $z$ make more sense than the naive one? Let’s check with the previous translation example.

Suppose the latent variable $z$ takes values in $\{1, 2, 3\}$ and, for the input $x$ = “Thank you”, the trained model satisfies

$$
P(z = 1 \mid x) = P(z = 2 \mid x) = P(z = 3 \mid x) = \frac{1}{3}
$$

$$
P(y_1 = \text{"Danke"} \mid z = 1, x) = 1, \quad P(y_2 = \langle\text{EOS}\rangle \mid z = 1, x) = 1
$$

$$
P(y_1 = \text{"Danke"} \mid z = 2, x) = 1, \quad P(y_2 = \text{"schön"} \mid z = 2, x) = 1
$$

$$
P(y_1 = \text{"Vielen"} \mid z = 3, x) = 1, \quad P(y_2 = \text{"Dank"} \mid z = 3, x) = 1
$$

Then, we must have that the parallel, conditionally independent generation produces one of the valid translations “Danke”, “Danke schön”, or “Vielen Dank” with joint conditional probability $1$, no matter what the value of $z$ is.

This property looks very good. This means no matter what value of $z$ we condition on, the tokens generated in parallel can never be mixed across different valid translations, and invalid translations such as “Danke Dank” receive zero probability.
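Continuing the toy example in Python, the sketch below conditions on each value of the hypothetical latent variable $z$ and decodes all positions in parallel; every choice of $z$ yields a complete, valid translation.

```python
# Toy conditionals P(y_i | z, x) for a hypothetical latent variable z in {1, 2, 3}.
# Given z, every output position has a single, unambiguous token.
p_tokens_given_z = {
    1: [{"Danke": 1.0}, {"<eos>": 1.0}],
    2: [{"Danke": 1.0}, {"schön": 1.0}],
    3: [{"Vielen": 1.0}, {"Dank": 1.0}],
}

def decode_given_latent(z):
    # All positions are decoded in parallel; none depends on another.
    return [max(dist, key=dist.get) for dist in p_tokens_given_z[z]]

for z in (1, 2, 3):
    print(z, decode_given_latent(z))
# Every latent value yields a valid translation; no "Danke Dank"-style mixing.
```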
More generally, during the non-autoregressive model optimization, we would like to maximize the log likelihood of the output sequence given the input sequence,

$$
\log P(y \mid x) = \log \sum_{z} P(z \mid x) \prod_{i=1}^{m} P(y_i \mid z, x)
$$

However, the summation over all possible values of the latent variable $z$ inside the logarithm is usually intractable.

We introduce a proposal distribution $q(z \mid x, y)$ and apply Jensen’s inequality to obtain the evidence lower bound (ELBO),

$$
\log P(y \mid x)
= \log \sum_{z} q(z \mid x, y) \frac{P(z \mid x) \prod_{i=1}^{m} P(y_i \mid z, x)}{q(z \mid x, y)}
\geq \sum_{z} q(z \mid x, y) \log \frac{P(z \mid x) \prod_{i=1}^{m} P(y_i \mid z, x)}{q(z \mid x, y)}
$$
Note that the ELBO can be rearranged as

$$
\mathbb{E}_{z \sim q(z \mid x, y)} \left[ \sum_{i=1}^{m} \log P(y_i \mid z, x) \right] - D_{\mathrm{KL}}\big(q(z \mid x, y) \,\|\, P(z \mid x)\big)
$$

so maximizing the ELBO encourages the model to reconstruct the output tokens from $z$ and $x$, while keeping the proposal distribution close to $P(z \mid x)$.
If the proposal distribution $q(z \mid x, y)$ is chosen to be deterministic, i.e., it puts all of its probability mass on a single value

$$
z = f(x, y)
$$

for some deterministic function $f$, then directly optimizing the ELBO becomes tractable. If $q(z \mid x, y)$ is such a deterministic distribution, the expectation over $q$ collapses to a single term, and the ELBO simplifies to

$$
\log P\big(z = f(x, y) \mid x\big) + \sum_{i=1}^{m} \log P\big(y_i \mid z = f(x, y), x\big)
$$

So given an input sequence $x$ and its target output sequence $y$ from the dataset, the latent variable is simply computed as $z = f(x, y)$, and no sampling or marginalization over $z$ is needed during the optimization.
Here, the reason why we chose to keep the term $\log P(z = f(x, y) \mid x)$ in the objective is that at inference time the output sequence $y$ is unknown, so the latent variable has to be predicted from the input sequence $x$ alone. We want $P(z \mid x)$ to assign high probability to the value $z = f(x, y)$ that the deterministic proposal produces from the training pair, and we want $P(y_i \mid z, x)$ to assign high probability to the correct output tokens. Then the non-autoregressive model could be optimized with the goals of maximizing the latent prediction term $\log P(z = f(x, y) \mid x)$ and the token prediction term $\sum_{i=1}^{m} \log P(y_i \mid z = f(x, y), x)$ jointly. Otherwise, say if we dropped the latent prediction term and only optimized the token prediction term, the model would have no way of producing a good latent variable from the input sequence alone during inference, and the non-autoregressive decoding would not work.
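To make the resulting training objective concrete, here is a minimal PyTorch-style sketch of the loss under the deterministic proposal. The tensor shapes, the function name, and the assumption that $z$ is a single categorical variable are illustrative simplifications, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def non_autoregressive_training_loss(token_logits, latent_logits, y_tokens, z_target):
    """Negative (simplified) ELBO under a deterministic proposal q(z | x, y) = delta(z - f(x, y)).

    token_logits:  (m, vocab_size) logits for P(y_i | z, x), one row per output position.
    latent_logits: (num_latent_values,) logits for P(z | x).
    y_tokens:      (m,) target output token ids.
    z_target:      () long tensor holding z = f(x, y) from the deterministic proposal.
    """
    # Token prediction term: -sum_i log P(y_i | z, x).
    token_loss = F.cross_entropy(token_logits, y_tokens, reduction="sum")
    # Latent prediction term: -log P(z = f(x, y) | x).
    latent_loss = F.cross_entropy(latent_logits.unsqueeze(0), z_target.unsqueeze(0))
    # Minimizing this sum maximizes the simplified ELBO.
    return token_loss + latent_loss
```

The two terms correspond exactly to the goals above: predicting the latent variable from the input sequence, and reconstructing the output tokens in parallel given the latent variable.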
Orchestrated Non-Autoregressive Summary
To train a non-autoregressive model for sequence to sequence tasks, we have to propose a deterministic function $f$ that maps an input sequence $x$ and its output sequence $y$ to a latent variable $z = f(x, y)$, such that given $z$ and $x$, the variables in the output sequence are (approximately) conditionally independent of each other.

During training, given $x$ and $y$ from the dataset, we compute $z = f(x, y)$ and optimize the model by maximizing $\log P(z \mid x) + \sum_{i=1}^{m} \log P(y_i \mid z, x)$.

During inference, given $x$, we first predict the latent variable $\hat{z} = \operatorname{argmax}_{z} P(z \mid x)$, and then generate all the tokens in the output sequence in parallel by taking $\hat{y}_i = \operatorname{argmax}_{y_i} P(y_i \mid \hat{z}, x)$ for each position $i$.
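A minimal sketch of this two-step inference procedure, again in PyTorch-style Python. The predict_token_logits callable is a hypothetical placeholder for whatever decoder produces the per-position logits given $\hat{z}$ and $x$.

```python
import torch

@torch.no_grad()
def non_autoregressive_inference(latent_logits, predict_token_logits):
    """latent_logits: (num_latent_values,) logits for P(z | x).
    predict_token_logits: hypothetical callable mapping the chosen latent value
                          to (m, vocab_size) logits for P(y_i | z, x)."""
    # Step 1: skip the marginalization over z and take the single most likely
    #         latent value given the input sequence.
    z_hat = torch.argmax(latent_logits, dim=-1)
    # Step 2: given z_hat and x, all output positions are conditionally
    #         independent, so one parallel argmax yields the whole sequence.
    token_logits = predict_token_logits(z_hat)  # (m, vocab_size)
    return z_hat, torch.argmax(token_logits, dim=-1)
```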
Remaining Questions
One of the remaining questions is what the deterministic function $f$ should be in practice, so that given $z = f(x, y)$ and $x$, the conditional independence assumption for the variables in the output sequence holds reasonably well for the sequence to sequence task of interest.
References
Non-Autoregressive Model and Non-Autoregressive Decoding for Sequence to Sequence Tasks
https://leimao.github.io/blog/Non-Autoregressive-Model-Non-Autoregressive-Decoding/