为什么我们需要在PyTorch中调用zero_grad()?

问题:为什么我们需要在PyTorch中调用zero_grad()?

zero_grad()训练期间需要调用该方法。但是文档不是很有帮助

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

为什么我们需要调用此方法?

The method zero_grad() needs to be called during training. But the documentation is not very helpful

|  zero_grad(self)
|      Sets gradients of all model parameters to zero.

Why do we need to call this method?


回答 0

在中PyTorch,我们需要在开始进行反向传播之前将梯度设置为零,因为PyTorch在随后的反向传递中累积梯度。在训练RNN时这很方便。因此,默认操作是在每次调用时累积(即求和)梯度loss.backward()

因此,理想情况下,当您开始训练循环时,应该zero out the gradients正确进行参数更新。否则,梯度将指向预期方向以外的其他方向,即朝向最小值(或最大化,如果达到最大化目标)。

这是一个简单的示例:

import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables 
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()

或者,如果您要进行香草梯度下降,则:

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables 
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data

注意:当张量上调用时,会发生梯度的累积(即求和)。.backward()loss

In PyTorch, we need to set the gradients to zero before starting to do backpropragation because PyTorch accumulates the gradients on subsequent backward passes. This is convenient while training RNNs. So, the default action is to accumulate (i.e. sum) the gradients on every loss.backward() call.

Because of this, when you start your training loop, ideally you should zero out the gradients so that you do the parameter update correctly. Else the gradient would point in some other direction than the intended direction towards the minimum (or maximum, in case of maximization objectives).

Here is a simple example:

import torch
from torch.autograd import Variable
import torch.optim as optim

def linear_model(x, W, b):
    return torch.matmul(x, W) + b

data, targets = ...

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

optimizer = optim.Adam([W, b])

for sample, target in zip(data, targets):
    # clear out the gradients of all Variables 
    # in this optimizer (i.e. W, b)
    optimizer.zero_grad()
    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()
    optimizer.step()

Alternatively, if you’re doing a vanilla gradient descent, then:

W = Variable(torch.randn(4, 3), requires_grad=True)
b = Variable(torch.randn(3), requires_grad=True)

for sample, target in zip(data, targets):
    # clear out the gradients of Variables 
    # (i.e. W, b)
    W.grad.data.zero_()
    b.grad.data.zero_()

    output = linear_model(sample, W, b)
    loss = (output - target) ** 2
    loss.backward()

    W -= learning_rate * W.grad.data
    b -= learning_rate * b.grad.data

Note: The accumulation (i.e. sum) of gradients happen when .backward() is called on the loss tensor.


回答 1

如果您使用渐变方法来减少错误(或损失),zero_grad()将重新启动循环而不会损失上一步

如果您不使用zero_grad(),那么损失将减少而不是按需增加

例如,如果您使用zero_grad(),则会发现以下输出:

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

如果不使用zero_grad(),则会发现以下输出:

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5

zero_grad() is restart looping without losses from last step if you use the gradient method for decreasing the error (or losses)

if you don’t use zero_grad() the loss will be decrease not increase as require

for example if you use zero_grad() you will find following output :

model training loss is 1.5
model training loss is 1.4
model training loss is 1.3
model training loss is 1.2

if you don’t use zero_grad() you will find following output :

model training loss is 1.4
model training loss is 1.9
model training loss is 2
model training loss is 2.8
model training loss is 3.5