### Introduction

Recently I came across layer normalization in the Transformer model for machine translation, where it is used throughout the model. I decided to check how it works and compare it with the batch normalization we normally use in computer vision models.

### Mathematical Definition

Consider a minibatch of $m$ inputs, $B = \{x_1, x_2, \dots, x_m\}$, where each sample $x_i$ contains $K$ elements, i.e. the flattened $x_i$ has length $K$. Layer normalization transforms the inputs using learned parameters $\gamma$ and $\beta$, and the outputs can be expressed as $B' = \{y_1, y_2, \dots, y_m\}$, where $y_i = {\text{LN}}_{\gamma, \beta} (x_i)$.

More concretely, we first calculate the mean and the variance of each sample in the minibatch. For a sample $x_i$ whose flattened form is $\{x_{i,1}, x_{i,2}, \dots, x_{i,K}\}$, we have its mean $\mu_i$ and variance $\sigma_i^2$.

\[\mu_i = \frac{1}{K} \sum_{k=1}^{K} x_{i,k} \\ \sigma_i^2 = \frac{1}{K} \sum_{k=1}^{K} (x_{i,k} - \mu_i)^2\]

Then we normalize each sample such that the elements in the sample have zero mean and unit variance. $\epsilon$ is for numerical stability in case the denominator becomes zero by chance.

\[\hat{x}_{i,k} = \frac{x_{i,k}-\mu_i}{\sqrt{\sigma_i^2 + \epsilon}}\]

Finally, there is a scaling and shifting step. $\gamma$ and $\beta$ are learnable parameters.

\[y_i = \gamma \hat{x}_{i} + \beta \equiv {\text{LN}}_{\gamma, \beta} (x_i)\]

We can see from the math above that layer normalization has nothing to do with other samples in the batch.
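The three steps above (per-sample statistics, normalization, scale and shift) can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function name `layer_norm`, the default `eps=1e-5`, and the choice to make `gamma` and `beta` vectors of length $K$ are my own assumptions (scalars would also work through broadcasting).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer-normalize a batch of flattened samples of shape [m, K]."""
    # Per-sample statistics: reduce over the K elements of each sample (axis 1).
    mu = x.mean(axis=1, keepdims=True)   # shape [m, 1]
    var = x.var(axis=1, keepdims=True)   # biased variance (divides by K), shape [m, 1]
    # Normalize to zero mean and unit variance; eps guards against division by zero.
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Scale and shift with the learnable parameters.
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # m = 4 samples, K = 8 elements
gamma = np.ones(8)    # assumed initial values: identity scale
beta = np.zeros(8)    # assumed initial values: zero shift
y = layer_norm(x, gamma, beta)

print(y.mean(axis=1))  # each entry is close to 0
print(y.var(axis=1))   # each entry is close to 1
```

With `gamma` set to ones and `beta` to zeros, every row of `y` has approximately zero mean and unit variance, regardless of the other rows.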

### Layer Normalization vs Batch Normalization

I previously wrote a simple blog post on batch normalization. As in that post, I am not going to discuss the advantages of layer normalization over batch normalization, or how to choose between normalization techniques.

One of the major differences in practice is that layer normalization does not have to keep a “running mean” and “running variance”, since the batch plays no role in the computation.
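To make the batch independence concrete, here is a small NumPy sketch that normalizes the same sample inside two different batches. The helper names `layer_norm` and `batch_norm_train` are hypothetical, the scale and shift parameters are left out for brevity, and the batch normalization shown is only the training-mode computation with batch statistics, not a full implementation with running averages.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the elements of each sample (axis 1).
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def batch_norm_train(x, eps=1e-5):
    # Statistics over the batch for each feature (axis 0), as used during training.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(4, 8))
batch_b = batch_a.copy()
# Keep the first sample, but replace its neighbors with shifted data.
batch_b[1:] = rng.normal(loc=5.0, size=(3, 8))

# The first sample's layer-normalized output is identical in both batches.
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
# Its batch-normalized output changes, because the batch statistics changed.
bn_same = np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])

print(ln_same, bn_same)
```

Because the layer normalization output of a sample never depends on its neighbors, the same computation can be used at training and inference time, which is why no running statistics are needed.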

If layer normalization is applied to the outputs of a convolution layer, the math has to be modified slightly, since it does not make sense to group the elements from distinct channels together and compute a single mean and variance. Instead, each channel is considered an “independent” sample, and the normalization is done for that specific channel only, within the sample.

Assume the input tensor has shape $[m, H, W, C]$. For each channel $c \in \{1, 2, \cdots, C\}$, we have

\[\mu_{i,c} = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} x_{i,j,k,c} \\ \sigma_{i,c}^2 = \frac{1}{HW} \sum_{j=1}^{H} \sum_{k=1}^{W} (x_{i,j,k,c} - \mu_{i,c})^2\]

\[\hat{x}_{i,j,k,c} = \frac{x_{i,j,k,c}-\mu_{i,c}}{\sqrt{\sigma_{i,c}^2 + \epsilon}}\]

Specifically for each channel, we have learnable parameters $\gamma_c$ and $\beta_c$, such that

\[y_{i,:,:,c} = \gamma_c \hat{x}_{i,:,:,c} + \beta_c \equiv {\text{LN}}_{\gamma_c, \beta_c} (x_{i,:,:,c})\]

### Layer Normalization vs Instance Normalization?

Instance normalization only exists for 3D or higher-dimensional tensor inputs, since it requires the tensor to have a batch dimension and each sample in the batch to have layers (channels). If the samples in the batch have only 1 channel (a dummy channel), instance normalization on the batch is exactly the same as layer normalization on the batch with this single dummy channel removed. Batch normalization and layer normalization also work for 2D tensors of shape $[m, K]$, which have a batch dimension but no layers (channels).

Surprisingly (or not?), instance normalization for a 3D or 4D tensor is exactly the same as layer normalization for convolution outputs as described above: each sample in the batch is an instance, we are layer normalizing samples which happen to have multiple channels, and we ignore the batch during normalization. So we could say that instance normalization is a natural extension of layer normalization to convolutions, or that it is just a new name for an old concept.
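The per-channel normalization for a $[m, H, W, C]$ tensor, and the single-dummy-channel claim, can be checked with a short NumPy sketch. The names `channelwise_norm` and `layer_norm_flat` are hypothetical, and the code follows this post's per-channel convention (statistics over $H$ and $W$ only). Library layers that go by the name "layer normalization" may reduce over different axes by default, so treat this as an illustration of the formulas here rather than of any particular API.

```python
import numpy as np

def channelwise_norm(x, gamma, beta, eps=1e-5):
    """Normalize x of shape [m, H, W, C] per sample and per channel."""
    # Reduce over the spatial axes only, so each (sample, channel) pair
    # gets its own mean and variance.
    mu = x.mean(axis=(1, 2), keepdims=True)   # shape [m, 1, 1, C]
    var = x.var(axis=(1, 2), keepdims=True)   # shape [m, 1, 1, C]
    x_hat = (x - mu) / np.sqrt(var + eps)
    # gamma and beta have shape [C] and broadcast over the channel axis.
    return gamma * x_hat + beta

def layer_norm_flat(x, eps=1e-5):
    """Plain layer normalization on flattened samples of shape [m, K]."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
m, H, W = 2, 3, 4

# Single dummy channel: per-channel normalization should match layer
# normalization applied to the samples with the channel axis removed.
x1 = rng.normal(size=(m, H, W, 1))
y_channel = channelwise_norm(x1, np.ones(1), np.zeros(1))
y_flat = layer_norm_flat(x1.reshape(m, H * W)).reshape(m, H, W, 1)
single_channel_match = np.allclose(y_channel, y_flat)

# Multiple channels: every (sample, channel) slice is normalized on its own.
x3 = rng.normal(size=(m, H, W, 3))
y3 = channelwise_norm(x3, np.ones(3), np.zeros(3))

print(single_channel_match)
print(y3.mean(axis=(1, 2)))  # entries close to 0, shape [m, C]
```

The only difference between the two functions is which axes the statistics are reduced over, which is the whole distinction being discussed in this section.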