Whether your goal is to use deep learning yourself or just to understand the hype, you need both a sound understanding of the theory behind it as well as hands-on experience. This article offers both: We build and train a neural network in PyTorch and go into the theory of neural networks.
Deep learning is still the hot thing out there. Last year, the most important conference on deep learning sold out in less than 12 minutes. Every company, we are told, should now have an AI strategy — maybe even every country. If you encounter this kind of hype or the promise of the next game changer, you are right to be suspicious. Indeed, machine learning and artificial intelligence have been misrepresented, or misunderstood, by mainstream media, public intellectuals, and marketeers alike. Yet, there is substance behind the buzzwords: Deep learning has been applied successfully to many areas of machine learning and pattern recognition. Today, deep neural networks deliver state-of-the-art results for visual recognition tasks such as classification, segmentation and object detection. Moreover, deep learning has proved to be highly successful in reinforcement learning, speech recognition, and natural language processing.
This will be the first entry in a series of articles on the fundamentals and some of the most important applications of deep learning. I will try to keep the articles concise, but detailed. You will get runnable code and actionable advice out of it. Yet, this will not be just some dump of code ready to copy and paste. I will go into the theory of neural networks (the machine learning models at the basis of deep learning). Yes, there will be maths, but don’t worry, it won’t be as hard as you might think. If you know a little bit of calculus and linear algebra, you should be fine.
After reading these articles you will have a good understanding of how deep learning actually works. We will look beyond the hype, so that you can evaluate whether deep learning methods are suitable for your research, your projects, or your organization. There will be no robot brains, no terminators, no anthropomorphism, no bullshit. Just the facts, the maths, and the code.
This article will introduce the idea of a neural network as a universal function approximator and the algorithm we use to train such a network: gradient-based optimization combined with backpropagation.
The Multi-Layer Perceptron
Modern neural networks stand in the tradition of biologically inspired mathematical models that were designed to resemble neurons in human or animal brains. This line of research goes back to as early as the 1940s. The perceptron, proposed by Frank Rosenblatt in the 1950s, was the first of these models that learned its parameters entirely from training data. The perceptron is able to distinguish between two categories, or classes, by defining a separating hyperplane. It can solve a two-class classification problem, like distinguishing cats from dogs, if the two classes are linearly separable. That means the perceptron is based on the assumption that all examples of one class, as described by their feature vectors, can be separated from all examples of the other class by drawing a straight line (or, for higher-dimensional feature vectors, a hyperplane).
More mathematically: The perceptron assigns the classes y \in \{-1, +1\} using the hyperplane defined by the function f(\boldsymbol{x}, \boldsymbol{w}) = \boldsymbol{w}^T \boldsymbol{x}, where \boldsymbol{x} is the input vector (or feature vector) and \boldsymbol{w} is a vector of weights defining the mapping that is learned from the training examples. That means the decision rule is
\hat{y} = \mathrm{sign}(\boldsymbol{w}^T \boldsymbol{x}) = \mathrm{sign}\left(\sum_i w_i x_i\right)

We can also represent the perceptron by a computational graph. The nodes represent computational operations and the edges show how they are connected. It is common to interpret neural networks as computational graphs. In fact, this is where we can see their network-like structure. For a perceptron, the computational graph is quite simple:
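Before we move on, here is a minimal sketch of this decision rule in PyTorch. The weights and the input are made-up values chosen purely for illustration, not a trained model:

import torch

# made-up weight vector and feature vector, just to illustrate the decision rule
w = torch.tensor([0.4, -1.2, 0.7])
x = torch.tensor([1.0, 0.5, 2.0])

# \hat{y} = sign(w^T x); torch.sign returns -1, 0 or +1
y_hat = torch.sign(w @ x)
print(y_hat)  # tensor(1.), i.e. the perceptron assigns class +1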
Since the perceptron defines a linear function, it is limited to classification problems that are linearly separable. This makes it unsuited for most real-world use cases. However, we can extend the basic structure of the perceptron by stacking multiple perceptrons on top of each other and using the output of one perceptron as the input of another. The resulting model is called the multi-layer perceptron (MLP). The MLP can learn arbitrary, nonlinear functions by learning a chain of simpler functions. Just applying linear perceptron-style functions over and over again, however, will still result in a linear function. Thus, we have to introduce a non-linearity, or activation function, that is applied after every linear function. Now, our model can learn complex functions by stacking basic building blocks — or layers. Every layer applies a nonlinear mapping to the input vector \boldsymbol{x} to produce an output vector \boldsymbol{h} in the following way:
\boldsymbol{h} = \varphi(\boldsymbol{W}\boldsymbol{x})

Here, \boldsymbol{W} is a matrix that defines the connections between input nodes and output nodes, called the weight matrix, and \varphi is a nonlinear activation function. The weight matrix is a parameter of the layer that is learned during the training of the network. Drawn as a graph, such a layer looks like this:
Notice that every node of the input layer is connected with every node of the output layer. Therefore, this type of layer is also called a fully-connected layer. As noted above, we have to introduce non-linearity, otherwise we limit our network to learning only linear relationships. That is why we apply the nonlinear activation function \varphi. Today, we mostly use the rectified linear unit (ReLU) as activation function, defined as \varphi(x) = \max(0, x).
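As a small, self-contained sketch (the layer sizes below are arbitrary and only serve as an illustration), this is what a single fully-connected layer followed by a ReLU looks like in PyTorch. Note that nn.Linear also learns a bias vector in addition to the weight matrix:

import torch
import torch.nn as nn
import torch.nn.functional as F

# a fully-connected layer mapping a 4-dimensional input to a 3-dimensional output
layer = nn.Linear(4, 3)

x = torch.randn(4)      # random input vector
h = F.relu(layer(x))    # h = ReLU(Wx + b)
print(h)                # every entry is >= 0 because of the ReLU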
The computational graph of a multi-layer perceptron is composed of a chain of these layers, each consisting of nodes that act in parallel. You can see a graph of an MLP below. The leftmost layer represents the input, which is fed into the second layer; the second layer applies its weights w_2 and its activation function. Then, the output of the second layer is routed forward to the third, and so on. The information flows through the network until it reaches the final layer, the output layer. What the final layer does depends on our use case. For instance, it might give us the most likely class label of a given input image — is it a cat, a dog, an airplane or a bicycle?
The layers that are neither input nor output are called hidden layers. The power of neural networks lies in the hidden layers. They enable the network to learn a representation of the data and capture nonlinear dependencies. In fact, the universal approximation theorem proves that a multilayer perceptron with a suitable activation function and only a single hidden layer is able to approximate any continuous function with arbitrary precision. However, the theorem only proves that such a network exists. It does not tell us how to obtain it. The theorem makes no statements about network topology or training procedure. In practice, we have to design neural network architectures for specific applications, and training requires tuning of different hyperparameters.
In PyTorch, we define our neural networks as subclasses of torch.nn.Module. We add layers to our architecture as instance variables of our network class. There are quite a few different kinds of layers available in PyTorch. The Linear layer is the fully-connected building block of multi-layer perceptrons. We also have to define how the input is passed through the layers and transformed by our network. We do so by overriding the method forward. So, let's implement our first simple neural network.
import torch.nn as nn
import torch.nn.functional as F


class SimpleNet(nn.Module):
    def __init__(self, input_size: int, output_size: int):
        # Here we define the layers of the network.
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, output_size)

    def forward(self, x):
        # Forward defines how the network transforms the input tensor x.
        # The ReLU activation function is applied after each hidden layer;
        # the output layer returns raw scores (logits) without an activation.
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
Gradient Descent Algorithms in Deep Learning
The most common algorithms to train neural networks are based on gradient descent, which was first proposed by Auguste Cauchy in 1847. Gradient descent seeks to optimize — more precisely, minimize — a function by taking steps in the direction of the negative gradient.
In order to train our model using gradient descent, we define an objective function L(\boldsymbol{\theta}) that our model shall optimize by adjusting its parameters \boldsymbol{\theta}. In the context of machine learning and deep learning, the objective function is commonly called the loss function or cost function. The loss function is chosen depending on the task, such that by minimizing the loss, the network learns something about the problem to solve. During the optimization procedure, the loss function provides feedback and indicates how the parameters have to be adjusted. If our network is far off the desired solution, it will receive a large loss value.
The lower the loss, the better the model performs on the training data. Eventually we reach a local minimum and the parameters stop changing. For example, if we train a network as an image classifier and it makes many classification errors by assigning wrong classes to a large number of images, we provide negative feedback in the form of a high loss. Conversely, if the network classifies most of the images correctly, the loss value will be small, since it is doing quite well on the given task.
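To make this feedback concrete, here is a small sketch using PyTorch's cross-entropy loss (the same loss we use for training below) on made-up scores for a single example whose true class is class 2:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
labels = torch.tensor([2])                           # the true class of a single example

confident_right = torch.tensor([[0.1, 0.2, 4.0]])    # the network strongly favors class 2
confident_wrong = torch.tensor([[4.0, 0.2, 0.1]])    # the network strongly favors class 0

print(criterion(confident_right, labels))  # small loss (about 0.04)
print(criterion(confident_wrong, labels))  # large loss (about 3.9)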
Gradient descent minimizes the loss function L(\boldsymbol{\theta}) in an iterative manner by updating the parameters \boldsymbol{\theta} following the negative gradient of the loss function w.r.t. \boldsymbol{\theta}:
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t)

Here, \eta is the learning rate, defining the magnitude or step size of every update, and t denotes the current iteration.
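As a minimal illustration of this update rule (a toy example, not part of the training code below), here is plain gradient descent on the one-dimensional function L(\theta) = (\theta - 3)^2, whose minimum lies at \theta = 3:

# gradient descent on L(theta) = (theta - 3)^2, with gradient dL/dtheta = 2 * (theta - 3)
theta = 0.0   # initial parameter value
eta = 0.1     # learning rate

for t in range(100):
    grad = 2.0 * (theta - 3.0)    # gradient of the loss at the current theta
    theta = theta - eta * grad    # theta_{t+1} = theta_t - eta * grad

print(theta)  # very close to 3.0, the minimum of L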
For convex problems, gradient descent is guaranteed to find a global minimum. Yet in deep learning we deal with non-convex functions. That means that gradient descent converges to one of generally many local minima. However, since it turns out that these local minima typically generalize well, gradient descent algorithms are still by far the most popular and successful optimization algorithms used in deep learning.
In practice, you will never compute the gradient over the whole dataset before adjusting the parameters. It is just too inefficient, especially for large datasets. Instead, we commonly use mini-batch stochastic gradient descent: We compute the loss and the gradient over a subset sampled from the training data. Many people call this procedure simply stochastic gradient descent (SGD).
The commonly used deep learning libraries like PyTorch or TensorFlow have built-in support for mini-batch sampling and the surrounding training procedure.
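In PyTorch, for instance, a DataLoader takes care of sampling mini-batches from a dataset. The synthetic data below is made up purely to show the mechanics:

import torch
from torch.utils.data import TensorDataset, DataLoader

# a small synthetic dataset: 1000 examples with 20 features each and a label in {0, ..., 9}
features = torch.randn(1000, 20)
labels = torch.randint(0, 10, (1000,))

dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch_features, batch_labels in loader:
    # every iteration yields one mini-batch sampled from the training data
    print(batch_features.shape, batch_labels.shape)  # torch.Size([32, 20]) torch.Size([32])
    break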
Extensions to Gradient Descent
Many extensions to (mini-batch) SGD have been proposed and are used in practice. Discussing them in detail is beyond the scope of this article. I will focus on two important concepts that are part of several algorithms: momentum and the adaptation of the learning rate to individual parameters. For a comprehensive overview of different SGD-style algorithms, see the blog post by Sebastian Ruder.
During the optimization procedure, SGD might oscillate around the main direction towards a local minimum in parameter space, zigzagging across it rather than moving straight ahead. This zigzagging behavior slows down the convergence of the optimization.

We can mitigate this problem by introducing momentum. Instead of updating the parameters solely based on the current gradient, we use a velocity term \boldsymbol{v}_t that combines the current gradient with a weighted sum of past gradients:
\boldsymbol{v}_t = \gamma \boldsymbol{v}_{t - 1} + \eta \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_t)
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \boldsymbol{v}_t.
The hyperparameter \gamma controls the weight of the old value \boldsymbol{v}_{t - 1} relative to the gradient \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_t). The smaller the value of \gamma, the more sensitive the training procedure is to the current gradient. Intuitively, we can compare the momentum term to physical momentum: Like a ball rolling down the slope of a hill, the parameters in SGD gain momentum on their way down the loss surface. Since the current gradient is combined with a decaying sum of previous gradients, the influence of any individual gradient is diminished. As a result, the oscillations are dampened and the convergence of the training procedure is almost always faster.
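Translated into code, the update equations look roughly like this. This is a sketch of the formulas above applied to the toy loss L(\theta) = (\theta - 3)^2, not PyTorch's internal implementation; in practice you would simply pass a momentum argument to torch.optim.SGD, as we do in the training code below:

# SGD with momentum for a single parameter theta on the toy loss L(theta) = (theta - 3)^2
theta = 0.0
v = 0.0        # velocity term v_t
eta = 0.1      # learning rate
gamma = 0.9    # momentum coefficient

for t in range(100):
    grad = 2.0 * (theta - 3.0)    # gradient of the loss at the current theta
    v = gamma * v + eta * grad    # v_t = gamma * v_{t-1} + eta * grad
    theta = theta - v             # theta_{t+1} = theta_t - v_t

print(theta)  # close to 3.0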
In deep learning, the learning rate is one of the most important hyperparameters, yet finding a good value requires time-consuming search. Several gradient-based optimization algorithms, such as Adagrad, Adadelta and Adam, try to mitigate this problem by adapting the learning rate to individual parameters. The intuition behind these methods is that some features are activated very frequently, and thus receive gradients of large magnitude, while other features might be activated very infrequently and receive small gradients. If every parameter is updated using the same learning rate, parameters corresponding to frequently seen features get more attention during the training procedure, yet infrequently activated features might be very informative for a well-generalizing model. Therefore, the idea of these algorithms is to adapt the learning rate so that relatively larger updates are performed for infrequently updated parameters.
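As a rough sketch of this idea, here is an Adagrad-style update for a single parameter on the same toy loss as before (Adadelta and Adam refine this scheme with decaying averages and momentum; in practice you would use torch.optim.Adagrad or torch.optim.Adam directly):

import math

# Adagrad-style update on the toy loss L(theta) = (theta - 3)^2
theta = 0.0
eta = 0.5                # base learning rate
eps = 1e-8               # small constant for numerical stability
grad_squared_sum = 0.0   # accumulated squared gradients for this parameter

for t in range(100):
    grad = 2.0 * (theta - 3.0)
    grad_squared_sum += grad ** 2
    # a parameter that has accumulated large gradients gets a smaller effective learning rate
    theta = theta - eta / (math.sqrt(grad_squared_sum) + eps) * grad

print(theta)  # close to 3.0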
Backpropagation
We have seen how we can optimize neural networks with gradient-based algorithms such as mini-batch SGD or Adam. In order to be able to apply these methods, we have to calculate the gradient with respect to the parameters. That means we need an algorithm to compute gradients efficiently. In the context of neural networks, the algorithm of choice is backpropagation.
As noted earlier, neural networks are complex functions that consist of a computational graph of simpler functions. For a given input \boldsymbol{x}, a feedforward network computes an output \boldsymbol{\hat{y}} by routing information forward through the nodes and layers of the computational graph. This is called forward pass or forward propagation. During training, we then obtain a loss value L(\boldsymbol{x}, \boldsymbol{\theta}). Then, using the backpropagation algorithm, we compute the gradient of the loss \nabla_{\boldsymbol{\theta}} L(\boldsymbol{x}, \boldsymbol{\theta}) w.r.t. the parameters by passing information backward through the network. Since the loss indicates how well (or, rather, how badly) the model performs on the training data, backpropagation is often described intuitively as passing the error back into the network. More precisely, we obtain the gradient of the loss with respect to every parameter \theta_i \in \boldsymbol{\theta} by recursively applying the chain rule of calculus.
Consider two functions f and g, with y = g(x) and z = f(y) = f(g(x)). According to the well-known chain rule of calculus, the following property holds:
\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}
Using this general rule, we can compute gradients in a neural network using backpropagation. Assume a node in the computational graph of a neural network that computes an output \hat{y} = f(x, w) as a function f of x and w, where x might denote the input and w a parameter to be learned. If f is differentiable, we can compute \frac{\partial \hat{y}}{\partial x} and \frac{\partial \hat{y}}{\partial w}. Let the gradient of the loss w.r.t. \hat{y} be \frac{\partial L}{\partial \hat{y}}; then the gradients w.r.t. w and x are
\frac{\partial L}{\partial x } = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial x}
\frac{\partial L}{\partial w } = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}
Recursively applying this schema, starting with the last layer of the network, we can compute gradients for each parameter and subsequently optimize them using the gradient-based methods described above.
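Here is a tiny worked example of this recursion for a single node \hat{y} = f(x, w) = w \cdot x with a squared-error loss L = (\hat{y} - y)^2, using made-up numbers. The hand-computed chain-rule gradients match what PyTorch's automatic differentiation (discussed in a moment) returns:

import torch

x = torch.tensor(2.0, requires_grad=True)   # input
w = torch.tensor(0.5, requires_grad=True)   # parameter
y = torch.tensor(3.0)                       # target value, made up for illustration

y_hat = w * x                               # forward pass: y_hat = f(x, w) = w * x
loss = (y_hat - y) ** 2                     # L = (y_hat - y)^2

loss.backward()                             # backward pass: autograd applies the chain rule

# hand-computed chain rule:
# dL/dy_hat = 2 * (y_hat - y), dy_hat/dw = x, dy_hat/dx = w
dL_dy_hat = 2 * (w.item() * x.item() - y.item())
print(w.grad.item(), dL_dy_hat * x.item())  # -8.0 -8.0
print(x.grad.item(), dL_dy_hat * w.item())  # -2.0 -2.0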
Just by computing local gradients and applying the chain rule at each node, we can solve the problem of computing the gradient of the loss w.r.t every parameter of a neural network. This proves to be computationally efficient due to the dynamic programming structure of the algorithm. More than that: It means that every differentiable function can be incorporated into a neural network. We will use this fact later to include trained global pooling layers in neural networks.
In PyTorch, this backward pass and the calculation of the gradients are done automatically. We do not have to worry about it as long as we are using pre-defined layers or define our layers in terms of tensor operations that support automatic differentiation. We create an optimizer object that implements a certain optimization algorithm and pass it the parameters to optimize. During training, we perform backpropagation after each forward pass by calling backward() on the loss and then run an optimization step using the obtained gradients.
import torch

# define the loss to be used for training
criterion = torch.nn.CrossEntropyLoss()

# optimization procedure for training the network
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# one epoch is one iteration over the training dataset
num_epochs = 10
for epoch in range(num_epochs):
    # get batches of training data and labels
    for data in data_loader:
        input_data, labels = data

        # set the gradients of all parameters to zero
        optimizer.zero_grad()

        # forward pass
        output = net(input_data)

        # calculate the loss and compute the gradients using backpropagation
        loss = criterion(output, labels)
        loss.backward()

        # perform an optimization step based on the gradients
        optimizer.step()
Putting it all together
You have a basic understanding of neural network layers, loss functions, gradient-based optimization and backpropagation. Now you know the most important concepts to train your own neural networks! Below, we put it all together in PyTorch. The code is available on GitHub.
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class Net(nn.Module):
    def __init__(self, output_dim):
        super().__init__()
        # convolutional feature extractor followed by global average pooling
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3),
            nn.ReLU(),
            nn.Conv2d(64, 128, 3),
            nn.ReLU(),
            nn.Conv2d(128, 128, 3),
            nn.ReLU(),
            nn.Conv2d(128, 64, 3),
            nn.AdaptiveAvgPool2d(1)
        )
        # fully-connected classifier head
        self.fc = nn.Sequential(
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Dropout(),
            nn.Linear(32, output_dim)
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.shape[0], 64)
        res = self.fc(x)
        return res


def get_data():
    transform = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.1307,), (0.3081,))
    ])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=16,
                                              shuffle=True, num_workers=2)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=16,
                                             shuffle=True, num_workers=2)
    return trainloader, testloader


def train(trainloader, epochs=10):
    net = Net(10)
    net = net.to(device)
    print("Start training.")
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(net.parameters(), lr=0.001)
    for ep in range(epochs):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            # forward, backward, and optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 1000 == 999:
                print(f'Ep: {ep + 1} | {i + 1}. Loss: {running_loss / 1000:.3f}')
                running_loss = 0.0
    return net


def test(net, testloader):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in testloader:
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'Accuracy on the test set: {accuracy}%')
    return accuracy


def main():
    trainloader, testloader = get_data()
    net = train(trainloader, epochs=10)
    test(net, testloader)


if __name__ == '__main__':
    main()
You probably recognize the concepts that we talked about above. One exception is the convolutional layer. For now, it is enough to know that this is a special kind of network layer that is especially well suited to learning representations for image recognition. However, convolutional neural networks are such an important and rich topic that they deserve a blog post of their own.
Okay, let’s run the code.
Start training.
Epoch: 1 | Loss: 1.872.
Epoch: 2 | Loss: 1.604.
Epoch: 3 | Loss: 1.451.
Epoch: 4 | Loss: 1.347.
Epoch: 5 | Loss: 1.273.
Epoch: 6 | Loss: 1.218.
Epoch: 7 | Loss: 1.169.
Epoch: 8 | Loss: 1.132.
Epoch: 9 | Loss: 1.098.
Epoch: 10 | Loss: 1.070.
Accuracy on the test set: 61.2%
Okay, 61% is not too bad for a 10-class problem! We can improve the performance of our model by modifying the architecture or by training for a longer time.
Feel free to change the code and use it as a starting point for your own experiments. You might change the network architecture or the optimization procedure to get better performance. You can change the data set and see how well the network does. You could even exchange our modest neural network for one of the powerful pretrained models like ResNet or InceptionNet that come with torchvision.