Why zero-ing gradients isn’t enough?

3 minute read

Background

While zeroing gradients .zero_grad() is a common practice, there’s a more efficient way to handle gradients in PyTorch: using set_to_none = True while calling .zero_grad() method. This post explores why this option is important and how it can optimize your training process.

The Basic Training Loop

optimizer.zero_grad()

for epoch_num in range(epochs):
  for step_idx, batch in enumerate(train_dataloader):
      loss = model(batch) # 1. forward pass to compute loss
      loss.backward()  # 2. backward pass to compute gradients w.r.t loss
      optimizer.step() # 3. update the weights based on gradients
      optimizer.zero_grad() # 4. zero the gradients for next iteration

In the forward pass, activations are computed and stored as the input moves through the layers of the model.
In the backward pass, gradients are computed for each layer’s weights based on the stored activations. Later, the gradient values are stored in the gradient buffers.
Step 3 (optimizer.step()) updates the model weights with gradient values and Step 4 ( optimizer.zero_grad()) would zero the values in gradient buffers.

A Detour to Gradient Accumulation

accumulation_steps = 4  # Number of steps to accumulate gradients
optimizer.zero_grad()  # Zero gradients at the start

for epoch_num in range(epochs):
    for step_idx, batch in enumerate(train_dataloader):
        loss = model(batch)  # 1. Forward pass
        (loss / accumulation_steps).backward()  # 2. Backward pass

        if (step_idx + 1) % accumulation_steps == 0:
            optimizer.step()  # 3. Update weights
            optimizer.zero_grad()  # 4. Zero gradients

Instead of updating the model weights in every step, we update the model weights after every 4 steps. Until then, we accumulate the gradients of each parameter. By accumulation, we keep adding the gradient values to the current values in gradient buffer.

For instance, in Step 2, the current step’s gradient is added with the accumulated gradient value (which is only Step-1 here). By the end of 4 steps, we would end up with the sum of gradient values over last 4 steps (step-1-grad + step-2-grad + step-3-grad + step-4-grad).

In a regular setup, i.e. without gradient accumulation, we perform a redundant addition operation in loss.backward() step. Specifically, we add the current gradient value with 0. optimizer.zero_grad() # zero the gradient value in gradient buffer

for epoch_num in range(epochs):
  for step_idx, batch in enumerate(train_dataloader):
      loss = model(batch) 
      loss.backward() #Compute gradient & add it with 0 (current gradient buffer)
      optimizer.step()
      optimizer.zero_grad() # zero the gradient value in gradient buffer

The Problem with Simple Gradient Zero-ing

The optimizer.zero_grad() call is typically used to reset gradients between iterations. While the function removes the current gradient values and sets them to 0, it does not free up all the memory associated with them. This approach has some inefficiencies

Gradient buffers: After zeroing, the gradient buffers remain allocated for the next iteration, so this won’t free up significant memory.
Redundant operations: The subsequent backward pass after computing the gradients will add them with the current buffer value (i.e 0). Adding something with 0 is redundant and waste of computer resources.

Alternative Approach: Releasing Gradient Buffers

The question arises: what if we release the gradient buffers until they are needed? This approach could potentially offer several benefits:

More free memory: By deallocating gradient buffers when they’re not in use, we free up memory that can be used for other purposes, potentially allowing for larger batch sizes or more complex models.
Fewer redundant operations: The subsequent backward pass can use assignment instead of addition to store gradients, reducing the number of unnecessary arithmetic operations.
More efficient memory usage: While we would perform more memory allocation and deallocation operations, this could lead to more efficient overall memory usage.

Comparing Operations

Let’s compare the operations involved in the traditional approach versus releasing gradient buffers:

1. Traditional approach (adding with zero):

Transfer gradient value tensor to SRAM
Transfer zero gradient tensor to SRAM
Perform addition operation
Update memory with result

2. set_to_none=True ( Memory de-allocation) approach:

Assign new gradient value directly to memory

The set_to_none=True approach involves fewer data transfer operations, which are often the bottleneck in GPU computations. Even for a GPU with 2TB/s bandwidth, transferring gradients for very large models (e.g., 405B parameters) can be significant.

Conclusion

While zeroing gradients is a common practice in deep learning workflows, using set_to_none=True is generally more efficient. By understanding the nuances of gradient handling, memory management, and the specific requirements of your model and hardware, you can optimize your training process for better performance and resource utilization.

The choice between zeroing gradients and releasing gradient buffers depends on factors such as:

Model size
Available hardware resources
Training batch size
Frequency of gradient updates

For smaller models or when memory isn’t a constraint, the traditional approach of zeroing gradients might be sufficient. However, for large-scale models or memory-constrained environments, considering alternative gradient handling techniques like releasing buffers could lead to significant improvements in training efficiency.

Share on

Twitter Facebook LinkedIn

Murali Manohar

Why zero-ing gradients isn’t enough?

Background

The Basic Training Loop

A Detour to Gradient Accumulation

The Problem with Simple Gradient Zero-ing

Alternative Approach: Releasing Gradient Buffers

Comparing Operations

Conclusion

Share on

Leave a comment

You may also enjoy

Lightweight Guide to understanding GRPO and RL principles

Bridging the Three Gulfs of Agentic Development (and how they shape evals)

Let Agents do the talking: A Scalable Way to Evaluate Multi-Turn Chatbots

CUDA Study Log 4: Optimizing Constrained Decoding with Triton Kernel