PyTorch Leaf Tensor
Introduction
The PyTorch leaf tensor is a concept that sometimes confuses users who are not familiar with PyTorch's automatic differentiation engine, torch.autograd.
In this blog post, I would like to quickly discuss the PyTorch leaf tensor concept from the perspective of mathematics without going into too much implementation detail.
PyTorch Leaf Tensor
Depending on whether a PyTorch tensor requires gradient and whether it is explicitly created by the user, there are four categories of PyTorch tensors. These two properties determine two attributes of every tensor: whether it is a leaf tensor and whether its gradient will be populated during the backward pass.
| Requires Grad | User Created | Is Leaf | Grad Populated |
|---|---|---|---|
| true | true | true | true |
| true | false | false | false |
| false | true | true | false |
| false | false | true | false |
Here:

- “Requires Grad” is the requires_grad attribute of a torch.Tensor, indicating whether the tensor is a constant or a variable.
- “User Created” is true when a torch.Tensor is not the result of an operation, so the grad_fn attribute of the torch.Tensor is None.
- “Is Leaf” is true when a torch.Tensor is a leaf node in the torch.autograd directed acyclic graph (DAG), which consists of a single root (tensor) node, many leaf (tensor) nodes, and many intermediate (backward function call) nodes.
- “Grad Populated” is true when the gradient with respect to a torch.Tensor is saved in the tensor object (for optimization), so the grad attribute of the torch.Tensor is no longer None after a backward pass.
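The four categories can be verified directly from the is_leaf and grad attributes. The following is a minimal sketch; the tensor names and values are arbitrary choices for illustration.

```python
import torch

a = torch.randn(3, requires_grad=True)  # requires grad, user created: leaf, grad populated
b = a * 2                               # requires grad, not user created: non-leaf, grad not populated
c = torch.randn(3)                      # no grad, user created: leaf, grad not populated
d = c * 2                               # no grad, not user created: still a leaf by convention, grad not populated

b.sum().backward()

print(a.is_leaf, b.is_leaf, c.is_leaf, d.is_leaf)  # True False True True
print(a.grad)                                      # populated
print(b.grad)                                      # None (with a warning)
```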
Example
In addition to the examples from the PyTorch documentation, which are rather confusing, here is a more concrete example illustrating the role of leaf tensors in torch.autograd.
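The following is a minimal sketch of such a script, saved as leaf_tensor.py. The tensor names variable_tensor_cpu and variable_tensor_cuda match the discussion below; the constant tensor, the shapes, the values, and the exact loss computation are assumptions made purely for illustration.

```python
import torch

# A tensor created directly by the user that requires grad:
# a leaf tensor whose grad will be populated during the backward pass.
variable_tensor_cpu = torch.ones(4, requires_grad=True)

# Moving the tensor to CUDA is an operation, so the resulting tensor
# is NOT a leaf tensor even though it still requires grad.
variable_tensor_cuda = variable_tensor_cpu.to("cuda")

# A user-created tensor that does not require grad:
# still a leaf tensor, but its grad will never be populated.
constant_tensor_cuda = torch.full((4,), 2.0, device="cuda")

# A scalar loss computed from the two CUDA tensors.
loss = torch.sum((variable_tensor_cuda - constant_tensor_cuda) ** 2)
loss.backward()

print(f"variable_tensor_cpu.is_leaf: {variable_tensor_cpu.is_leaf}")
print(f"variable_tensor_cuda.is_leaf: {variable_tensor_cuda.is_leaf}")
print(f"variable_tensor_cpu.grad: {variable_tensor_cpu.grad}")
print(f"variable_tensor_cuda.grad: {variable_tensor_cuda.grad}")
```

Running the script prints the is_leaf and grad attributes of both tensors after the backward pass: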
```shell
$ python3 leaf_tensor.py
```
In some scenarios, the user might expect the variable tensor variable_tensor_cuda to have grad after the backward pass so that it can be optimized during neural network training. However, variable_tensor_cuda.grad is None, whereas variable_tensor_cpu does have grad. This means that variable_tensor_cpu is actually the variable to be optimized. After an optimization step is performed following the backward pass, the value of variable_tensor_cuda will not match variable_tensor_cpu until the next forward pass is performed.
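To make this concrete, here is a minimal sketch of an optimization step on the leaf tensor; the choice of SGD and the learning rate are arbitrary assumptions.

```python
import torch

variable_tensor_cpu = torch.ones(4, requires_grad=True)
variable_tensor_cuda = variable_tensor_cpu.to("cuda")
loss = variable_tensor_cuda.sum()
loss.backward()

# Only the leaf tensor has grad, so only the leaf tensor can be optimized.
optimizer = torch.optim.SGD([variable_tensor_cpu], lr=0.1)
optimizer.step()

print(variable_tensor_cpu)   # updated by the optimizer
print(variable_tensor_cuda)  # still holds the stale, pre-update values

# variable_tensor_cuda only picks up the new values
# when the forward pass (the transfer to CUDA) is run again.
variable_tensor_cuda = variable_tensor_cpu.to("cuda")
print(variable_tensor_cuda)  # now matches variable_tensor_cpu
```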
In fact, PyTorch emits a warning when the user tries to access the .grad attribute of a non-leaf tensor, which is not populated by default.
```
leaf.py:6: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:480.)
```
We could also visualize the DAG using the third-party library torchviz, which can be installed using the following commands.
```shell
$ sudo apt update
```
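On an Ubuntu system, a typical way to complete the installation (the exact commands here are an assumption) is to install the Graphviz system package that torchviz uses for rendering, followed by the torchviz Python package itself:

```shell
$ sudo apt install --yes graphviz
$ pip install torchviz
```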
The torch.autograd DAG is built on the fly as the Python script executes. torchviz can visualize the DAG starting from a root tensor, which is the loss tensor in our example, as shown in the sketch below.
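The following is a minimal sketch of such a visualization script. It reuses the tensor setup from the earlier example; make_dot is the torchviz function that builds the graph, while the output file name dag is an arbitrary choice.

```python
import torch
from torchviz import make_dot

# Same setup as in the earlier example.
variable_tensor_cpu = torch.ones(4, requires_grad=True)
variable_tensor_cuda = variable_tensor_cpu.to("cuda")
constant_tensor_cuda = torch.full((4,), 2.0, device="cuda")
loss = torch.sum((variable_tensor_cuda - constant_tensor_cuda) ** 2)

# make_dot walks the autograd graph backwards from the root tensor (loss)
# and returns a graphviz Digraph.
dot = make_dot(loss)
dot.format = "png"
dot.render("dag")  # writes dag.png to the current directory
```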
Notice that the DAG visualized using torchviz does not display leaf nodes that do not require grad. The blue box in the DAG diagram, although it carries no tensor name, is the leaf tensor variable_tensor_cpu in our program.
FAQ
Why is grad not populated for a tensor that requires grad but is not a leaf node?
Conventionally, only leaf tensors, usually the model parameters to be trained, deserve grad. Non-leaf tensors, such as the intermediate activation tensors, do not. Why would we need to keep grad for the activation tensors? Even if we kept the grad in an activation tensor and applied it to the activation values during optimization, those values would be overwritten in the next forward pass. Populating grad for non-leaf tensors is therefore usually a waste of memory and computation.
However, in some “rare” use cases the user does need the grad of a non-leaf tensor, and PyTorch provides the API torch.Tensor.retain_grad() for that. Relying on it usually does not make sense, though, and is often an indication of a problematic implementation.
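For completeness, here is a minimal sketch of how retain_grad() works; the tensor names and values are arbitrary.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2            # non-leaf: grad is not populated by default
y.retain_grad()      # explicitly opt in to populating y.grad
loss = y.sum()
loss.backward()

print(x.grad)  # populated because x is a leaf tensor
print(y.grad)  # populated only because of retain_grad()
```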