Badreddine Chaguer

Senior Data Scientist / Co-founder

Tips for Optimizing Neural Network Training (Including Hands-On Implementation)

October 14, 2024

Introduction

I have rarely found excitement in the implementation of neural networks, which includes tasks such as defining layers and writing the forward pass. For many machine learning engineers, these tasks are often perceived as monotonous.

The true challenge and enjoyment for me lie in optimizing the network. This is the phase where a decent model is transformed into a highly efficient and finely tuned system, capable of handling large datasets, training more rapidly, and achieving superior results.

Optimization is a craft that necessitates precision, a deep understanding of hardware and software, and a focus on performance improvement. Consequently, I have never regarded the ability to train machine learning models as a core skill. Rather, it has always been about comprehending the underlying science of the model and employing the appropriate techniques to achieve the most efficient outcomes.

In this article, I will explore 15 different strategies for optimizing neural network training. These strategies will cover aspects such as selecting the right optimizers and effectively managing memory and hardware resources. Each technique is accompanied by code examples to assist you in implementing these optimizations in your own projects.

Some of these techniques are quite fundamental and straightforward, such as:

  • Utilizing efficient optimizers like AdamW and Adam
  • Leveraging hardware accelerators, including GPUs and TPUs
  • Maximizing the batch size

Therefore, we will not cover these topics in detail here.

Let us begin!

1. Max workers and pin memory

Max workers

Setting the num_workers parameter in the PyTorch DataLoader is an effective method to enhance the speed of data loading during training.

The num_workers parameter specifies the number of subprocesses utilized for parallel data loading. By increasing the number of workers, you can often achieve a significant reduction in data loading time, particularly when working with large datasets.

This approach helps to minimize instances where the GPU is idle, waiting for data, thereby facilitating faster model training.

However, the optimal value for num_workers may vary based on your specific machine configuration, including factors such as the number of CPU cores and available RAM. Therefore, it is advisable to experiment with different settings to identify the most effective configuration for your setup.
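One quick way to experiment is to time a single pass over a DataLoader for several candidate worker counts. The sketch below is illustrative only: it uses a synthetic TensorDataset as a stand-in, so on a real, I/O-bound dataset the differences between settings will typically be more pronounced.

import torch
from torch.utils.data import DataLoader, TensorDataset
from time import time

# Synthetic stand-in dataset (replace with your own dataset)
dataset = TensorDataset(torch.randn(60_000, 1, 28, 28),
                        torch.randint(0, 10, (60_000,)))

for workers in [0, 2, 4, 8]:
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=workers)
    start = time()
    for _ in loader:  # iterate once without training to isolate data loading cost
        pass
    print(f"num_workers={workers}: {time() - start:.2f} s")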

Max workers implementation


To illustrate the effectiveness of this approach, we will implement a simple feedforward neural network using the MNIST dataset.

We will start by importing the necessary libraries and specifying the device as follows:

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from time import time

# Set device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Next, we will define the architecture of our feedforward neural network:

class SimpleFeedForwardNN(nn.Module):
    def __init__(self):
        super(SimpleFeedForwardNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28 * 28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Next, we will create the MNIST dataset object:

# Define data transformations: convert to tensor and normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load MNIST dataset
train_dataset = datasets.MNIST(root='./data',
                               train=True, 
                               download=True, 
                               transform=transform)

test_dataset = datasets.MNIST(root='./data',
                              train=False,
                              download=True,
                              transform=transform)

Since we will be comparing the performance of a standard DataLoader with that of a DataLoader utilizing multiple workers, let us define a training function that we can reuse for both experiments:

def train(model,
          device,
          train_loader,
          optimizer,
          criterion,
          epoch):
          
    model.train()

    loss_value = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()

        output = model(data)

        loss = criterion(output, target)

        loss.backward()

        loss_value += loss.item()

        optimizer.step()

    print(f'Train Epoch: {epoch} \t Loss: {loss_value:.6f}')

The train function above represents a typical PyTorch training loop for a neural network model. Here’s a breakdown of its parameters:

  • model: The neural network model to be trained.
  • device: The device (CPU or GPU) where the tensors (data) will be moved for computation.
  • train_loader: The DataLoader that supplies batches of training data.
  • optimizer: The optimization algorithm (e.g., Adam, SGD) employed to update the model parameters.
  • criterion: The loss function (e.g., CrossEntropyLoss) used to calculate the error between the predicted output and the target.
  • epoch: The current training epoch, which represents one complete iteration over the entire dataset.

Next, we will follow the standard training procedure, which involves training the model for one epoch while processing data in batches. This process includes performing forward propagation to generate predictions, calculating the loss, executing backpropagation to compute gradients, and updating the model's parameters using the optimizer. At the end of the epoch, the total loss will be printed.

Standard Data Loader

To begin, let us first experiment with a standard DataLoader object.

We will define the model, the loss function, and the optimizer as follows:

model = SimpleFeedForwardNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Next, we will create our standard DataLoader object and invoke the train method defined earlier:

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

start = time()

num_epochs = 5
for epoch in range(1, num_epochs + 1):
    train(model, device, train_loader, optimizer, criterion, epoch)

print(f"Total time = {time() - start}")

Running this, the model takes approximately 43 seconds to train for five epochs.

Data Loader with Max Workers

Next, we will experiment with another DataLoader by specifying the num_workers parameter.

To achieve this, we will again define the model, the loss function, and the optimizer as follows:

model = SimpleFeedForwardNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

Next, we will create the DataLoader object and invoke the train method defined earlier:

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=8)

start = time()

num_epochs = 5

for epoch in range(1, num_epochs + 1):
    train(model, device, train_loader, optimizer, criterion, epoch)
print(f"Total time = {time() - start}")

With num_workers=8, the model exhibits a similar loss trajectory but now takes approximately 10 seconds to train for five epochs, representing a substantial reduction in training time.

By utilizing multiple workers, the DataLoader can fetch data batches asynchronously, significantly enhancing training speed, particularly when dealing with large datasets or when data preprocessing is involved.

Pin Memory

To understand memory pinning, let us first consider the standard model training loop in PyTorch, as demonstrated below:

for epoch in range(epochs):

    for batch_idx, (data, target) in enumerate(train_loader):

        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()

        output = model(data)

        loss = criterion(output, target)

        loss.backward()

        optimizer.step()

In the code above:

  • The data.to(device) and target.to(device) calls transfer the data from the CPU to the GPU.
  • Once the data transfer is complete, all subsequent operations (the forward pass, loss computation, backpropagation, and the optimizer step) are executed on the GPU.

This results in a scenario where the GPU is active while the CPU is idle, and vice versa, as illustrated below:

To optimize this process, we can implement the following strategy:

While the model is training on the first mini-batch, the CPU can simultaneously transfer the second mini-batch to the GPU.

This approach ensures that the GPU does not have to wait for the next mini-batch of data after completing the processing of the current mini-batch.

Consequently, the resource utilization chart would resemble the following:

While the CPU may remain idle, this process ensures that the GPU, which is our primary accelerator for model training, always has data to process.

Formally, this technique is referred to as memory pinning. It enhances the speed of data transfer from the CPU to the GPU by facilitating an asynchronous training workflow.

This allows us to prepare the next training mini-batch concurrently while the model is being trained on the current mini-batch.

Enabling this feature in PyTorch is quite straightforward.

By simply setting pin_memory=True in the DataLoader, we can ensure that the data is transferred directly to the GPU from pinned memory, which is memory that is page-locked. This transfer method is faster than moving data from standard CPU memory.

Pin Memory Implementation

To implement this, set pin_memory=True when defining the DataLoader object, as illustrated below:

train_loader = DataLoader(train_dataset,
                          batch_size=64,
                          shuffle=True,
                          pin_memory=True,
                          num_workers=8)

Next, we will return to the train() method we defined earlier. During the data transfer step in the training process, specify non_blocking=True, as shown below:

for epoch in range(epochs):

    for batch_idx, (data, target) in enumerate(train_loader):

        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)

        optimizer.zero_grad()

        output = model(data)

        loss = criterion(output, target)

        loss.backward()

        optimizer.step()

Now, let us measure the training time again:

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, pin_memory=True, num_workers=8)

start = time()

num_epochs = 5

for epoch in range(1, num_epochs + 1):
    train(model, device, train_loader, optimizer, criterion, epoch)
print(f"Total time = {time() - start}")

Once again, the model exhibits a similar loss trajectory as before, but in this instance, it takes even less time than when we set num_workers=8 alone.

However, it is important to be cautious when using memory pinning. If multiple tensors, or particularly large tensors, are allocated to pinned memory, this can consume a significant portion of your RAM.

As a result, the overall memory available for other operations may be adversely affected.

Therefore, whenever I employ memory pinning, I make it a practice to profile my model training procedure to monitor memory consumption.
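As a rough example, one way to keep an eye on this is to log host RAM and GPU memory around the training loop. The snippet below is a minimal sketch rather than a full profiler, and it assumes the psutil package is installed.

import psutil
import torch

def log_memory(tag=""):
    # Host RAM currently in use (includes pinned memory)
    ram_gb = psutil.virtual_memory().used / 1e9
    print(f"{tag} host RAM used: {ram_gb:.2f} GB")
    if torch.cuda.is_available():
        # Memory occupied by tensors on the GPU
        alloc_gb = torch.cuda.memory_allocated() / 1e9
        print(f"{tag} GPU memory allocated: {alloc_gb:.2f} GB")

log_memory("before training")
# ... run the training loop here ...
log_memory("after training")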

Additionally, I have noticed that when the dataset is relatively small, or when the tensors are small, memory pinning has a negligible effect. This is because the data transfer time from the CPU to the GPU is minimal in such cases.

Overall, these two straightforward settings, num_workers and pin_memory, can significantly enhance the speed of your training procedure. They ensure that your model is continuously supplied with data while maximizing the utilization of your GPU.

Now, let us proceed to...

2. Bayesian optimization

Introduction

The most common approach to determining the optimal set of hyperparameters is through repeated experimentation.

This process involves defining a range of hyperparameter configurations, primarily through random sampling, and running them through the model. Ultimately, we select the configuration that yields the best performance metric.

While this approach is viable for training small-scale machine learning models, it becomes impractical when dealing with larger models.

To illustrate, consider a scenario where it takes 1.5 hours to train a model and you have configured 20 different hyperparameter settings to evaluate.

That amounts to 30 hours, which is more than a full day of training.

Therefore, understanding and applying optimized hyperparameter tuning strategies is crucial for developing large machine learning models.

One such highly effective approach for tuning hyperparameters is Bayesian optimization.

The concept behind Bayesian optimization is that, while iterating through various hyperparameter configurations, the algorithm continually updates its beliefs about the distribution of hyperparameters based on the observed performance of the model corresponding to each configuration.

This enables the algorithm to make informed decisions when selecting the next set of hyperparameters, gradually converging toward an optimal configuration.

To illustrate this further, consider the following results obtained using grid search:

The hyperparameter configurations with low accuracy are highlighted in red, while those with high accuracy are indicated in green.

After the 21 model runs (comprising 9 green and 12 red results) shown above, the grid search lacks the ability to make informed predictions about which hyperparameters to evaluate next.

In other words, all trials function independently.

However, when we examine the evaluation results in the figure below, it becomes apparent that it makes more sense to focus our hyperparameter search in the vicinity of the green region.

This concept lies at the heart of Bayesian optimization.

In simple terms, Bayesian optimization operates as follows:

  1. Build a probability model of the objective function, conditioned on the hyperparameters. This objective function can be either loss or accuracy.
  2. Utilize the probability model to make informed decisions regarding the most promising hyperparameters.

In this way, it leverages past results to create a probabilistic model that correlates hyperparameters with the likelihood of achieving a desired score on the objective function.
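As a concrete illustration, the sketch below uses the Optuna library, one of several tools built on this idea; its default TPE sampler is a sequential model-based method in the spirit of Bayesian optimization. This choice is an assumption on my part rather than the only option, and the search ranges, epoch count, and number of trials are placeholders. The sketch reuses the SimpleFeedForwardNN model, the MNIST datasets, and the train() function defined earlier.

import optuna

def objective(trial):
    # 1. Sample candidate hyperparameters from the current probability model
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])

    # Reuse the model, dataset, and train() function defined earlier
    model = SimpleFeedForwardNN().to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, num_workers=8, pin_memory=True)

    for epoch in range(1, 3):  # short runs keep the search cheap
        train(model, device, loader, optimizer, criterion, epoch)

    # 2. Report a score for this configuration: accuracy on the test set
    test_loader = DataLoader(test_dataset, batch_size=256)
    model.eval()
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            correct += (model(data).argmax(dim=1) == target).sum().item()
    return correct / len(test_dataset)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("Best hyperparameters:", study.best_params)

After the study finishes, study.best_params holds the best configuration found, and each completed trial has updated the sampler's beliefs about which regions of the search space are worth exploring next.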
