Backpropagation stands as one of the key algorithms powering learning in neural networks. This comprehensive guide will give you a solid understanding of backpropagation, its significance in neural networks, and its operational mechanics.
What is Backpropagation?
Backpropagation (backward propagation) is the second phase of neural network training, focused on parameter adjustment. Here, we iteratively adjust the model’s parameters by propagating errors backward from the output layer to the input layer. In other words, we update the weights and biases of the neural network in reverse order, starting from the output layer and moving toward the input layer, with the goal of minimizing the loss value. Hence, it is referred to as “the backward propagation of errors”.
How does Backpropagation work?
Let’s assume we have built a simple feedforward (MLP) neural network to classify a given 4×4 (16-pixel) black-and-white image as either the digit 0 or the digit 1. This neural network consists of an input layer with 16 neurons, two hidden layers each with 6 neurons and ReLU activation functions, and an output layer with 2 neurons and the softmax activation function.
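As a rough sketch of what this forward pass could look like in code (the weights below are random placeholders, not trained values, and the image is fake), one possible pure-Python version is:

```python
import math
import random

random.seed(0)

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    exps = [math.exp(x - max(v)) for x in v]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def layer(inputs, weights, biases):
    # weights: one row of input weights per output neuron
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init(n_in, n_out):
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

# the 16 -> 6 -> 6 -> 2 architecture from the example
w1, b1 = init(16, 6)
w2, b2 = init(6, 6)
w3, b3 = init(6, 2)

image = [random.choice([0.0, 1.0]) for _ in range(16)]  # a fake 4x4 image
h1 = relu(layer(image, w1, b1))
h2 = relu(layer(h1, w2, b2))
out = softmax(layer(h2, w3, b3))
print(out)  # two probabilities summing to 1
```

Each value that appears later in this walkthrough (the hidden activations h and the two output probabilities) corresponds to one of the intermediate lists in this sketch.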
Assume that in the forward propagation, this deep learning model gave us the results below for the above training data.
Here the average MSE loss is 0.4768. Clearly, our deep learning model has produced a significant number of incorrect predictions. To improve accuracy, we must adjust the network’s weights and biases to reduce this cost value.
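To make the loss concrete, here is how an average MSE could be computed over a small batch. The prediction and target numbers below are made up for illustration; the article’s actual model produces an average loss of 0.4768.

```python
# Mean squared error for one (prediction, target) pair, then
# averaged over a (made-up) batch of three samples.
def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

predictions = [[0.26, 0.74], [0.61, 0.39], [0.45, 0.55]]
targets     = [[1.00, 0.00], [0.00, 1.00], [1.00, 0.00]]

losses = [mse(p, t) for p, t in zip(predictions, targets)]
average_loss = sum(losses) / len(losses)
print(average_loss)  # about 0.4074 for these made-up numbers
```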
Just for now let’s think only about the first data example, its output, and cost value.
But the desired output for that image is number 0:
So, how can we modify our neural network parameters to give the desired output?
First, let’s consider the first neuron of the output layer. It should output 1.00 to indicate that the given image belongs to the digit 0. But in our result, it gave 0.26. So how can we increase 0.26 to 1.00?
Firstly, let’s see how this 0.26 value was calculated.
You can see that the value 0.26 is the result of the neural network’s inner computation, involving the multiplication of inputs by weights, the addition of biases, and the application of activation functions.
So there are three components we can use to increase the value from 0.26 to 1.00: the weight values, the bias values, and the outputs of the hidden-layer neurons (h). We can adjust them individually or collectively to attain the desired outcome.
If you’re unfamiliar with mathematical computations and how neural networks make predictions, you can refer to our comprehensive guide on Artificial Neural Networks (ANN) for a better understanding.
Now you may think we can easily get the desired result by increasing the weight and bias values by any amount. But here is an important thing to remember: we do not change parameters by large amounts (e.g. w = 0.01 → 10), because large jumps make it hard to train the neural network on other data examples and make the model unstable. So it is good to keep parameter values small, typically between 0 and 1, and to adjust them in small steps.
Although we can get the desired output for one data sample by just increasing or decreasing only weights or biases of the output layer neurons, when making adjustments to neural network parameters that affect all the data, we must also consider and adjust the parameters within the hidden layers.
For that, we have to adjust the output of the hidden layer neurons by going backward from the output layer to the hidden layer:
In this way, we can change every corresponding weight and bias value to get the right answer. In this example, we have 144 weights and 14 biases (158 parameters in total) that can be adjusted to push the output from 0.26 toward 1.00.
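The 158 figure follows directly from the architecture. A quick way to check it:

```python
# Parameter count for the 16 -> 6 -> 6 -> 2 network from the example.
layers = [16, 6, 6, 2]

# every neuron in a layer has one weight per neuron in the previous layer
weights = sum(n_in * n_out for n_in, n_out in zip(layers, layers[1:]))
# every non-input neuron has one bias
biases = sum(layers[1:])

print(weights, biases, weights + biases)  # 144 14 158
```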
One important fact to remember is that you can make the biggest impact through the most influential neurons in the hidden layer. You can identify such neurons by their activations. Since we are using the ReLU activation function in the hidden layers, a neuron outputs either 0 or a positive number. In this example, some hidden-layer neurons connected to the output neuron are dead (inactive), with an output value of 0.
This is reminiscent of the expression “neurons that fire together wire together”.
Neurons with an activation of 0 have little or no impact on the output neuron and its result. On the other hand, we can reach the desired output faster by increasing the weights and biases of the active neurons, and by lowering the weights and biases feeding the inactive neurons we can also raise the value of the output neuron.
Above, we only considered changing the model parameters to get the desired output for one output neuron. There’s another output neuron that we must address.
So, while we increase all the corresponding weights and biases to raise the value of one output neuron from 0.26 to 1.00, we also have to decrease the corresponding weight and bias values to lower the value of the other output neuron.
Basically, we adjust all parameters while considering the desired outcomes of both output neurons. To put it simply, for each parameter we combine the adjustments suggested by each output neuron.
In this way, we can compute an average adjustment for every parameter in the neural network. As a result, we obtain the following values for each parameter:
It’s important to note that the adjustments above are for just one training data point. To obtain the final parameter updates, we need to calculate the adjustments for every training data point and then combine them.
This way you can get the optimal parameters that minimize the loss function.
But if you look closely at the above procedure, you can see it is like a daydream. Even in this small model there are 158 parameters, and there is an infinite number of ways to adjust them. What I’ve shown is a simple way to adjust parameters, but it’s not a practical one. The above method is like going on a trip without a map or compass: you’re moving, but you’re not quite sure where you are or the most efficient path to your destination. It only shows how we walk, whereas in reality we need a clear understanding of our current position and the most efficient path to reach our goal.
To address these problems, we use a technique called Gradient Descent.
What is Gradient Descent?
Gradient Descent is like a map or compass to our backpropagation process. Before we dive into the Gradient Descent, let’s figure out what is Gradient.
Calculus concepts are fundamental to backpropagation. For a better understanding of backpropagation, read our guide on Calculus.
Gradient
The gradient is a vector that contains the partial derivatives of a multivariable function (e.g. f(x, y) = x² + y²) with respect to each of its input variables. Each component of this vector represents the rate of change of the function with respect to one variable while the other variables are kept fixed. In simple terms, the gradient tells us how each variable affects the function’s increase or decrease.
The gradient vector points in the direction of the steepest increase of the function at the given point. So, moving in the direction of the gradient by adjusting the input variables accordingly will result in the fastest increase in the function’s value compared to any other direction you could choose from that point. (you will observe the most significant increase in the function’s value).
For example, consider the function f(x, y) = x² + y². Its gradient is ∇f(x, y) = (∂f/∂x, ∂f/∂y) = (2x, 2y). At the point (x, y) = (1, 1.5), the gradient is (2, 3).
This tells us that a small change in x and y in the direction of the gradient vector (2, 3) will increase f(x, y) more than moving in any other direction. In other words, if you nudge x and y along the direction (2, 3), the function increases faster than if you move them in any other direction.
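We can sanity-check this analytic gradient numerically with a finite-difference estimate at the same point:

```python
# Analytic gradient of f(x, y) = x**2 + y**2, checked against a
# numerical (finite-difference) estimate at the point (1, 1.5).
def f(x, y):
    return x ** 2 + y ** 2

def gradient(x, y):
    return (2 * x, 2 * y)  # (df/dx, df/dy)

def numerical_gradient(x, y, eps=1e-6):
    dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return (dx, dy)

print(gradient(1.0, 1.5))            # (2.0, 3.0)
print(numerical_gradient(1.0, 1.5))  # approximately (2.0, 3.0)
```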
For further understanding look at the 3D graph of the f(x,y) function with x and y points.
We can move x and y in many directions, but the function increases fastest only when we move them in the direction of the gradient.
Remember, 2 and 3 are not step sizes that we add to x and y. They are the components of a vector that shows the direction in which we should move x and y by small amounts. To calculate the gradient of a composite function, we use a calculus concept called the chain rule.
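As a minimal sketch of the chain rule (with a toy composition chosen for illustration, not taken from the article’s network), consider h(x) = f(g(x)) with f(u) = u² and g(x) = 3x + 1:

```python
# Chain rule on h(x) = f(g(x)): dh/dx = (df/du) * (dg/dx).
def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def h(x):
    return f(g(x))

def h_prime(x):
    df_du = 2 * g(x)      # derivative of f at u = g(x)
    dg_dx = 3             # derivative of g
    return df_du * dg_dx  # chain rule

# cross-check against a finite-difference estimate
eps = 1e-6
x = 0.5
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(h_prime(x), numeric)  # both approximately 15.0
```

Backpropagation applies exactly this idea layer by layer: the derivative of the loss with respect to an early-layer weight is a product of per-layer derivatives.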
When creating an AI model, we use a key feature in deep learning frameworks known as Autograd (automatic differentiation) to automatically compute gradients. If you’re unfamiliar with Autograd, check our comprehensive guide on AutoGrad.
Gradient Descent
We know that the gradient gives the direction of the steepest increase of a function. But in our case we don’t want to increase our function (the loss function), we want to decrease it. For that, we move the parameters in the direction opposite to the gradient, which is the direction of the steepest decrease. The algorithm of repeatedly stepping in this opposite direction is called gradient descent.
For example, for the function above, the descent direction is the negative gradient: −∇f(x, y) = (−2x, −2y), which at the point (1, 1.5) is (−2, −3).
Therefore, we can use gradient descent to find the lowest/minimum point of the function by moving our parameters (x and y) toward it.
Remember that ” the gradient descent only gives us the directions that we should move our variables(parameters) from the current point to get the quick decrease of the loss function.”
You can see that the minimum of the two-parameter function shown above is not zero; the shape (including any non-linearity) of a function can place its minimum at a non-zero value.
Now we can calculate the gradient of the loss function with respect to our neural network parameters (weights and biases). It gives the direction in which small adjustments to the weights and biases would increase the loss the most. Since we need to reduce the loss, we step in the opposite direction of the gradient (gradient descent), which tells us how to move our weights and biases to reduce the loss function.
Here Loss function value represents all data samples, So this adjustment affects all the training data.
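Putting the pieces together, here is a minimal backpropagation sketch. It is an illustrative toy (a single sigmoid neuron with two inputs and a squared-error loss, not the article’s 16-6-6-2 network), but it shows the full cycle: forward pass, chain-rule gradients, and a gradient-descent update.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [0.5, 0.8]    # inputs (made-up values)
w = [0.1, -0.2]   # weights
b = 0.0           # bias
target = 1.0
lr = 0.5          # learning rate

for _ in range(200):
    # forward pass
    z = w[0] * x[0] + w[1] * x[1] + b
    y = sigmoid(z)
    loss = (y - target) ** 2

    # backward pass (chain rule)
    dloss_dy = 2 * (y - target)
    dy_dz = y * (1 - y)               # sigmoid derivative
    dloss_dz = dloss_dy * dy_dz
    grad_w = [dloss_dz * x[0], dloss_dz * x[1]]
    grad_b = dloss_dz

    # gradient descent update: step opposite to the gradient
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b = b - lr * grad_b

print(loss)  # small after training; the output y has moved toward 1.0
```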
In a neural network, we have to deal with more than just 2 variables/parameters (the example problem has 158 variables). This complexity makes it a multi-variable or multi-parameter function, making it multi-dimensional.
We can’t draw multidimensional functions, but for better understanding, I will show you what they look like from a side:
As I said before, Gradient descent represents the direction where we should move our parameters from the current position/point to decrease the loss function highly. So our minimum point of the function depends on our current position/point(initialize point of the weights and biases).
Also,
You can see there are many possible minimum points with respect to the initial or current position in the function. The most minimum point across the entire function is called the Global Minimum, while the others are referred to as Local Minima.
So the minimum point we reach depends on the initial positions, i.e. the initial parameter values. Finding a local minimum is easy, but finding the global minimum is rare and hard.
Now let’s talk about how we adjust our model parameters. You should understand that gradient and gradient descent only provide the direction in which we should move our parameter value by small amounts. So, we move and fine-tune our parameters by very small steps in that direction.
Here, the learning rate decides the magnitude of those small steps.
For example, using the f(x, y) function above with a learning rate η, one small step updates the variables as:

x_new = x − η · (∂f/∂x) = x − η · 2x

Likewise:

y_new = y − η · (∂f/∂y) = y − η · 2y
After each step, we recalculate the gradient and continue moving in that direction until we reach the minimum point.
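The loop described above can be sketched directly. Starting from the example point (1, 1.5), repeated small steps against the gradient of f(x, y) = x² + y² drive the point toward the minimum at (0, 0):

```python
# Gradient descent on f(x, y) = x**2 + y**2, starting from (1.0, 1.5).
x, y = 1.0, 1.5
lr = 0.1  # learning rate: the size of each small step

for _ in range(100):
    grad_x, grad_y = 2 * x, 2 * y  # recalculate the gradient
    x -= lr * grad_x               # move opposite to the gradient
    y -= lr * grad_y

print(x, y)  # both extremely close to 0
```

Try a larger learning rate such as 1.1 and the point overshoots and diverges, which is why the steps must stay small.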
In each iteration, the new parameter values are updated as follows: θ_new = θ − η · ∇L(θ), where θ is a parameter, η is the learning rate, and ∇L(θ) is the gradient of the loss with respect to that parameter.
Stochastic Gradient Descent(SGD)
In normal (batch) gradient descent, we calculate the gradient of the average loss over all training data with respect to the neural network parameters. This is a slow and time-consuming process for large datasets.
So instead of using the entire dataset to compute the gradient (of the average loss over all the training data), we use a single randomly chosen data point for each update. We compute the gradient for that one data point and update the parameters, then move on to another random data point, compute its gradient, and update again. We repeat these steps across the dataset. This method is called Stochastic Gradient Descent (SGD).
This method helps us to update our model parameters quickly compared to normal gradient descent.
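As a small sketch of SGD (on a made-up one-parameter model y = w · x rather than the article’s network), each update uses exactly one randomly chosen sample:

```python
import random

random.seed(0)
# toy dataset generated with the true value w = 2
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]

w = 0.0
lr = 0.1

for _ in range(100):
    x, target = random.choice(data)  # a single random data point
    pred = w * x
    grad = 2 * (pred - target) * x   # d/dw of (w*x - target)**2
    w -= lr * grad                   # update from this one sample

print(w)  # close to 2.0
```

The estimate is noisier per step than batch gradient descent, but each step is far cheaper, which is exactly the trade-off described above.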
Nowadays, numerous gradient-based optimization algorithms are available for backpropagation. Learn more about them in our Optimization Algorithms guide.
Backpropagation plays a pivotal role in training neural networks, allowing neural networks to learn from errors and optimize their parameters. In this guide, we’ve covered the basic concepts of backpropagation and explained its workings in an easy-to-understand manner. Whether you’re new to the topic or an experienced individual, you’ll gain a clear and improved understanding of backpropagation through this guide.