# PyTorch Leaf Tensor

## Introduction

The PyTorch leaf tensor is a concept that can confuse users who are not familiar with PyTorch's automatic differentiation engine, `torch.autograd`.

In this blog post, I would like to quickly discuss the PyTorch leaf tensor concept from the perspective of mathematics without going into too much implementation detail.

## PyTorch Leaf Tensor

Depending on whether a PyTorch tensor requires gradient and whether it is explicitly created by the user, there are four categories of PyTorch tensors. These two properties determine whether the tensor is a leaf tensor and whether a gradient will be populated for it.

| Requires Grad | User Created | Is Leaf | Grad Populated |
|---|---|---|---|
| true | true | true | true |
| true | false | false | false |
| false | true | true | false |
| false | false | true | false |

Here, “Requires Grad” is the `requires_grad` attribute of a `torch.Tensor`, indicating whether it is a variable (true) or a constant (false). “User Created” being true means that a `torch.Tensor` is not the result of an operation, so its `grad_fn` attribute is `None`. “Is Leaf” being true means that a `torch.Tensor` is a leaf node in a `torch.autograd` directed acyclic graph (DAG), which consists only of a root (tensor) node, many leaf (tensor) nodes, and many intermediate (backward function call) nodes. “Grad Populated” being true means that the gradient with respect to a `torch.Tensor` will be saved in the tensor object (for optimization), so that its `grad` attribute will not be `None` after a backward pass.
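The four rows of the table can be checked directly with a small script. This is a minimal sketch rather than the script used later in this post; the tensor names (`leaf_variable`, `non_leaf_variable`, `leaf_constant`, `derived_constant`) are hypothetical and chosen only to mirror the rows.

```python
import warnings

import torch

# Row 1: requires grad, created by the user.
leaf_variable = torch.tensor([1.0, 2.0], requires_grad=True)
# Row 2: requires grad, result of an operation.
non_leaf_variable = leaf_variable * 2.0
# Row 3: does not require grad, created by the user.
leaf_constant = torch.tensor([3.0, 4.0])
# Row 4: does not require grad, result of an operation.
derived_constant = leaf_constant * 2.0

loss = (non_leaf_variable * derived_constant).sum()
loss.backward()

tensors = {
    "leaf_variable": leaf_variable,
    "non_leaf_variable": non_leaf_variable,
    "leaf_constant": leaf_constant,
    "derived_constant": derived_constant,
}

summary = {}
with warnings.catch_warnings():
    # Reading .grad on a non-leaf tensor emits a UserWarning; silence it here.
    warnings.simplefilter("ignore")
    for name, tensor in tensors.items():
        # (Requires Grad, Is Leaf, Grad Populated)
        summary[name] = (tensor.requires_grad, tensor.is_leaf, tensor.grad is not None)

for name, row in summary.items():
    print(name, row)
```

Note that “User Created” cannot be read off a tensor directly, which is why it appears only in the comments: `derived_constant` is the result of an operation, yet its `grad_fn` is still `None` because nothing upstream requires grad.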

## Example

In addition to the examples from the PyTorch documentation, which are rather confusing, we have a more concrete example here illustrating the role of leaf nodes in `torch.autograd`.

1 | import torch |

1 | $ python3 leaf_tensor.py |

In some scenarios, the user would expect the variable tensor `variable_tensor_cuda` to have `grad` after the backward pass so that it can be optimized during neural network training. However, we can see that `variable_tensor_cuda.grad` is `None`, whereas the `variable_tensor_cpu` tensor has `grad`. This means that `variable_tensor_cpu` is actually the variable being optimized. Once the optimization step is performed after the backward pass, the value of `variable_tensor_cuda` will no longer match that of `variable_tensor_cpu` until the next forward pass is performed.
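The behavior described above can be reproduced with a short sketch. This is not the original `leaf_tensor.py`; it is a hypothetical stand-in that reuses the two tensor names from the prose. As an assumption for machines without a GPU, a dtype cast replaces the device transfer, since both are differentiable copies that produce a non-leaf tensor.

```python
import torch

# The tensor the user creates: a leaf that requires grad.
variable_tensor_cpu = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

if torch.cuda.is_available():
    # Moving a tensor that requires grad to another device is an
    # operation tracked by autograd, so the result is a non-leaf tensor.
    variable_tensor_cuda = variable_tensor_cpu.to("cuda")
else:
    # Assumption for CPU-only machines: a dtype cast stands in for the
    # device transfer. The name is kept only to match the prose.
    variable_tensor_cuda = variable_tensor_cpu.to(torch.float64)

loss = (variable_tensor_cuda ** 2).sum()
loss.backward()

print(variable_tensor_cpu.is_leaf)   # the leaf
print(variable_tensor_cuda.is_leaf)  # the non-leaf copy
print(variable_tensor_cpu.grad)      # gradient lands on the leaf

# The optimizer can only update the leaf tensor.
optimizer = torch.optim.SGD([variable_tensor_cpu], lr=0.1)
optimizer.step()

# The copy made during the forward pass still holds the old values.
stale_copy = variable_tensor_cuda.detach().to("cpu", torch.float32)
values_match = torch.allclose(stale_copy, variable_tensor_cpu.detach())
print(values_match)
```

Passing `variable_tensor_cuda` to the optimizer instead would not help, because optimizers only accept leaf tensors; creating the tensor directly on the target device with `requires_grad=True` makes it the leaf.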

In fact, PyTorch emits a warning when the user tries to access the `.grad` attribute of a non-leaf tensor, because by default that attribute is not populated and stays `None`.

1 | leaf.py:6: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:480.) |
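The warning can be observed programmatically with the standard `warnings` module. The snippet below is a sketch with hypothetical tensor names (`leaf`, `non_leaf`); the exact warning text depends on the installed PyTorch version, so it is only printed, not compared.

```python
import warnings

import torch

leaf = torch.tensor([1.0, 2.0], requires_grad=True)
non_leaf = leaf * 2.0
non_leaf.sum().backward()

with warnings.catch_warnings(record=True) as caught:
    # Record every warning raised inside this block.
    warnings.simplefilter("always")
    grad_value = non_leaf.grad  # reading .grad on a non-leaf tensor

for warning in caught:
    # Print the category and the start of the message.
    print(warning.category.__name__, str(warning.message)[:70])

# Checking is_leaf first avoids reading .grad on a non-leaf tensor at all.
safe_grad = non_leaf.grad if non_leaf.is_leaf else None
```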

We can also visualize the DAG using the third-party library `torchviz`, which can be installed using the following commands.

1 | $ sudo apt update |

The `torch.autograd` DAG is built as the Python script is executed. `torchviz` can visualize the DAG from a root tensor, which is the `loss` tensor in our example.

1 | import torch |

Notice that the DAG visualized using `torchviz` does not display leaf nodes that do not require grad.

The blue box in the DAG diagram, although it has no tensor name, is the leaf tensor `variable_tensor_cpu` in our program.
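Without `torchviz`, the same graph can be inspected by following `grad_fn.next_functions` from the root tensor. The sketch below uses a hypothetical two-tensor example (`weight`, `constant`) and a hypothetical helper `walk`; it is not the program from this post. A leaf that requires grad appears as an `AccumulateGrad` node, while a leaf that does not require grad appears only as a `None` entry, which is consistent with it not being drawn.

```python
import torch

weight = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf, requires grad
constant = torch.tensor([3.0, 4.0])                    # leaf, no grad
loss = (weight * constant).sum()                       # root of the DAG


def walk(fn, depth=0, visited=None):
    """Collect (depth, node name) pairs by following next_functions."""
    if visited is None:
        visited = []
    if fn is None:
        # Inputs that do not require grad have no backward node.
        visited.append((depth, "None (input without grad)"))
        return visited
    visited.append((depth, type(fn).__name__))
    for next_fn, _input_index in fn.next_functions:
        walk(next_fn, depth + 1, visited)
    return visited


nodes = walk(loss.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```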

## FAQ

### Why is grad not populated for a tensor that requires grad but is not a leaf node?

Conventionally, only leaf tensors, which are usually the model parameters to be trained, deserve grad. Non-leaf tensors, such as intermediate activation tensors, do not. Why would we need to keep a grad for the activation tensors? Even if we kept the grad in an activation tensor and applied it to the activation values during optimization, those values would be overwritten in the next forward pass. So populating grad for non-leaf tensors is usually a waste of memory and computation.

However, in some “rare” use cases, the user does need the grad for non-leaf tensors, and PyTorch provides the API `torch.Tensor.retain_grad()` for that. But usually this does not make sense and indicates a problematic implementation.
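For completeness, here is a minimal sketch of `retain_grad()` on a non-leaf tensor. The names `weight` and `activation` are hypothetical. The call has to happen before the backward pass.

```python
import torch

weight = torch.tensor([1.0, 2.0], requires_grad=True)  # leaf tensor
activation = weight * 3.0                              # non-leaf tensor
activation.retain_grad()                               # ask autograd to keep its grad

loss = (activation ** 2).sum()
loss.backward()

# d(loss)/d(activation) = 2 * activation
print(activation.grad)
# d(loss)/d(weight) = 3 * d(loss)/d(activation)
print(weight.grad)
```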

## References

PyTorch Leaf Tensor