PyTorch Leaf Tensor

Introduction

The PyTorch leaf tensor is a concept that sometimes confuses users who are not familiar with PyTorch’s automatic differentiation engine, torch.autograd.

In this blog post, I would like to briefly discuss the PyTorch leaf tensor concept from a mathematical perspective without going into too much implementation detail.

PyTorch Leaf Tensor

Depending on whether a PyTorch tensor requires gradient and whether it is explicitly created by the user, there are four categories of PyTorch tensors. These two properties determine two attributes of each tensor: whether it is a leaf tensor and whether its gradient will be populated during the backward pass.

Requires Grad | User Created | Is Leaf | Grad Populated
--------------|--------------|---------|---------------
true          | true         | true    | true
true          | false        | false   | false
false         | true         | true    | false
false         | false        | true    | false

Here, “Requires Grad” refers to the requires_grad attribute of a torch.Tensor, indicating whether the tensor is a constant or a variable. “User Created” being true means that a torch.Tensor is not the result of an operation, so its grad_fn attribute is None. “Is Leaf” being true means that a torch.Tensor is a leaf node in a torch.autograd directed acyclic graph (DAG), which consists of a root (tensor) node, many leaf (tensor) nodes, and many intermediate (backward function call) nodes. “Grad Populated” being true means that the gradient with respect to a torch.Tensor will be saved in the tensor object (for optimization), so the grad attribute of the torch.Tensor will not be None after a backward pass.
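
To make the table concrete, here is a minimal sketch that constructs one tensor per category and checks its attributes after a backward pass. The tensor names a, b, c, and d are ours, purely for illustration.

import torch

# Requires grad, user created: leaf, grad populated after backward.
a = torch.tensor([1., 2.], requires_grad=True)
# Requires grad, created by an operation: non-leaf, grad not populated.
b = a * 2.
# Does not require grad, user created: leaf, grad not populated.
c = torch.tensor([3., 4.], requires_grad=False)
# Does not require grad, created by an operation: still a leaf by
# convention, grad not populated.
d = c * 2.

loss = torch.sum(b) + torch.sum(d)
loss.backward()

print(a.is_leaf, a.grad is not None)  # True True
print(b.is_leaf, b.grad is not None)  # False False (with a UserWarning)
print(c.is_leaf, c.grad is not None)  # True False
print(d.is_leaf, d.grad is not None)  # True False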

Example

In addition to the examples from the PyTorch documentation, which are rather confusing, here is a more concrete example illustrating the role of leaf nodes in torch.autograd.

leaf_tensor.py
import torch


def print_tensor_attributes(tensor: torch.Tensor) -> None:

    print(f"requires_grad: {tensor.requires_grad}, "
          f"grad_fn: {tensor.grad_fn is not None}, "
          f"is_leaf: {tensor.is_leaf}, "
          f"grad: {tensor.grad is not None}")


if __name__ == "__main__":

    cuda_device = torch.device("cuda:0")
    variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
    variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
    constant_tensor_cuda = torch.tensor([6., 4.],
                                        requires_grad=False,
                                        device=cuda_device)
    loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)
    print("Before Backward")
    print("variable_tensor_cpu")
    print_tensor_attributes(tensor=variable_tensor_cpu)
    print("variable_tensor_cuda")
    print_tensor_attributes(tensor=variable_tensor_cuda)
    print("constant_tensor_cuda")
    print_tensor_attributes(tensor=constant_tensor_cuda)
    print("loss")
    print_tensor_attributes(tensor=loss)
    loss.backward()
    print("-" * 65)
    print("After Backward")
    print("variable_tensor_cpu")
    print_tensor_attributes(tensor=variable_tensor_cpu)
    print("variable_tensor_cuda")
    print_tensor_attributes(tensor=variable_tensor_cuda)
    print("constant_tensor_cuda")
    print_tensor_attributes(tensor=constant_tensor_cuda)
    print("loss")
    print_tensor_attributes(tensor=loss)
$ python3 leaf_tensor.py 
Before Backward
variable_tensor_cpu
requires_grad: True, grad_fn: False, is_leaf: True, grad: False
variable_tensor_cuda
requires_grad: True, grad_fn: True, is_leaf: False, grad: False
constant_tensor_cuda
requires_grad: False, grad_fn: False, is_leaf: True, grad: False
loss
requires_grad: True, grad_fn: True, is_leaf: False, grad: False
-----------------------------------------------------------------
After Backward
variable_tensor_cpu
requires_grad: True, grad_fn: False, is_leaf: True, grad: True
variable_tensor_cuda
requires_grad: True, grad_fn: True, is_leaf: False, grad: False
constant_tensor_cuda
requires_grad: False, grad_fn: False, is_leaf: True, grad: False
loss
requires_grad: True, grad_fn: True, is_leaf: False, grad: False

In some scenarios, the user might expect the variable tensor variable_tensor_cuda to have grad after the backward pass so that it can be optimized during neural network training. However, we can see that variable_tensor_cuda.grad is None, whereas the variable_tensor_cpu tensor does have grad. This means that variable_tensor_cpu is actually the variable being optimized. After the optimizer updates variable_tensor_cpu following the backward pass, the value of variable_tensor_cuda will not match variable_tensor_cpu until the next forward pass is performed.
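
To make this concrete, here is a minimal sketch that performs one gradient descent update on variable_tensor_cpu and shows that variable_tensor_cuda keeps its stale values until it is recomputed. The optimizer choice and learning rate are our own assumptions for illustration, not part of the program above.

import torch

cuda_device = torch.device("cuda:0")
variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
constant_tensor_cuda = torch.tensor([6., 4.], device=cuda_device)
loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)
loss.backward()

# Only the leaf tensor variable_tensor_cpu received the gradient,
# so it is the tensor that gets handed to the optimizer.
optimizer = torch.optim.SGD([variable_tensor_cpu], lr=0.1)
optimizer.step()

# variable_tensor_cpu has been updated, but variable_tensor_cuda still
# holds the values copied during the previous forward pass.
print(variable_tensor_cpu)   # tensor([2.8000, 3.2000], requires_grad=True)
print(variable_tensor_cuda)  # tensor([2., 3.], device='cuda:0', grad_fn=...)

# The CUDA copy only catches up when the next forward pass recreates it.
variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
print(variable_tensor_cuda)  # tensor([2.8000, 3.2000], device='cuda:0', grad_fn=...)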

In fact, there is a warning when the user tries to access the .grad attribute of a non-leaf tensor, which by default will not have its .grad populated.

leaf.py:6: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at /opt/pytorch/pytorch/build/aten/src/ATen/core/TensorBody.h:480.)
print(f"requires_grad: {tensor.requires_grad}, "

We can also visualize the DAG using the third-party library torchviz, which can be installed using the following commands.

$ sudo apt update
$ sudo apt install -y graphviz
$ pip install torchviz

The torch.autograd DAG is built as the Python script is executed. torchviz can visualize the DAG from a root tensor which is the loss tensor in our example.

leaf_tensor_dag.py
import torch
from torchviz import make_dot


def print_tensor_attributes(tensor: torch.Tensor) -> None:

    print(f"requires_grad: {tensor.requires_grad}, "
          f"grad_fn: {tensor.grad_fn is not None}, "
          f"is_leaf: {tensor.is_leaf}, "
          f"grad: {tensor.grad is not None}")


if __name__ == "__main__":

    cuda_device = torch.device("cuda:0")
    variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
    variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
    constant_tensor_cuda = torch.tensor([6., 4.],
                                        requires_grad=False,
                                        device=cuda_device)
    loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)
    make_dot(loss).render("dag", format="svg")

Notice that the DAG visualized using torchviz will not display the leaf node that does not require grad.

PyTorch Autograd DAG

The blue box in the DAG diagram, although it has no tensor name, is the leaf tensor variable_tensor_cpu in our program.
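
As a side note, if we want the leaf node to carry its name in the rendered graph, torchviz’s make_dot accepts a params dictionary mapping names to tensors that require grad. A minimal sketch, assuming the same variables as in leaf_tensor_dag.py and a hypothetical output file name dag_named:

from torchviz import make_dot

# Label the leaf tensor that requires grad with its variable name in the DAG.
make_dot(loss, params={"variable_tensor_cpu": variable_tensor_cpu}).render(
    "dag_named", format="svg")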

FAQ

Why is grad not populated for a tensor that requires grad but is not a leaf node?

Conventionally, only leaf tensors, usually the model parameters to be trained, deserve grad. Non-leaf tensors, such as the intermediate activation tensors, do not. Why would we need to keep a grad for the activation tensors? Even if we kept the grad in an activation tensor and applied it to the activation tensor values during optimization, those values would be overwritten in the next forward pass. So populating grad for non-leaf tensors is usually a waste of memory and computation.

However, in some “rare” use cases, the user does need the grad of a non-leaf tensor, and PyTorch provides the API torch.Tensor.retain_grad() for that. Usually, though, needing it is an indication of a problematic implementation.
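
As a minimal sketch of that API, assuming the same tensors as in leaf_tensor.py, calling retain_grad() on the non-leaf tensor variable_tensor_cuda before the backward pass makes its grad attribute be populated as well.

import torch

cuda_device = torch.device("cuda:0")
variable_tensor_cpu = torch.tensor([2., 3.], requires_grad=True)
variable_tensor_cuda = variable_tensor_cpu.to(cuda_device)
constant_tensor_cuda = torch.tensor([6., 4.], device=cuda_device)

# Ask autograd to also populate grad for this non-leaf tensor.
variable_tensor_cuda.retain_grad()

loss = torch.sum((constant_tensor_cuda - variable_tensor_cuda)**2)
loss.backward()

# Both the leaf tensor and the retained non-leaf tensor now have grad.
print(variable_tensor_cpu.grad)   # tensor([-8., -2.])
print(variable_tensor_cuda.grad)  # tensor([-8., -2.], device='cuda:0')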

Author

Lei Mao

Posted on

06-19-2023

Updated on

06-19-2023
