
Loss Functions – ALL You Need To Know

The loss function is a critical component of an AI model, playing a pivotal role in the AI training process. In this article, I will give you a solid understanding of the loss function, different types of loss functions, how they work and their applications, and how to implement them in PyTorch.

What is a Loss Function?

A loss function, also known as a cost function or objective function, is a mathematical function that calculates the difference between the AI model’s predicted outputs and the desired outputs. The value calculated by the loss function is referred to as the “loss” or “loss value”. If the model’s predictions are highly inaccurate, the loss function outputs a higher number; if the predictions are reasonably accurate, it outputs a lower number. Because the loss measures the size of the error, it is typically a non-negative value.

Optimization algorithms use the loss value to find the best possible values for a model’s parameters, aiming to reduce the loss value as much as possible. Hence, the primary goal during the training process is to decrease the loss value.


Types of Loss Functions

There are numerous loss functions available for our AI models. Here are a few significant and commonly used ones that you should know:

Throughout this article, I use a combination of Python and PyTorch for the code examples. You can follow our installation guide to set up your PC properly!

Mean Squared Error (MSE)

Mean Squared Error (MSE), also known as the L2 loss, is one of the most popular loss functions in the AI world and a standard choice for regression tasks.

The MSE loss function computes the squared difference between each predicted value and its corresponding true value, sums these squared differences across all data points (first, every example is fed to the model to obtain its prediction), and then divides the sum by the number of data points/examples (n) to get the mean.

Mathematically, the Mean Squared Error is represented as:

MSE = (1/n) * Σ (y_i - ŷ_i)²

where n is the number of data points, y_i is the true value, and ŷ_i is the predicted value for the i-th example.
If you’re unfamiliar with these algebra concepts, check our comprehensive guide on Norms for a better understanding.

The squaring operation ensures that all errors are positive and penalizes larger errors more heavily. When the model’s predictions are far off from the true values, the loss value increases significantly, which encourages the model to adjust its parameters to bring the predictions closer to the true values.

However, the MSE loss function has one disadvantage: it is very sensitive to outliers. When one or a few predictions are far from their true values, the squaring produces a very large loss value. Because this single value represents all predictions, outliers can dominate the training signal and hurt the fit for the remaining data points.

MSE can be implemented in Python like this:

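Here is a minimal NumPy sketch of this calculation; the y_true and y_pred values below are purely illustrative:

    import numpy as np

    # Illustrative true values and model predictions
    y_true = np.array([2.0, 3.5, 5.0, 7.0])
    y_pred = np.array([2.5, 3.0, 4.0, 8.0])

    # Mean Squared Error: average of the squared differences
    mse = np.mean((y_true - y_pred) ** 2)
    print(f"{mse:.4f}")

Output:

    0.6250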

Here is a standard implementation in PyTorch using the MSELoss() function from the torch.nn library.

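A minimal sketch using the same illustrative values as above:

    import torch
    import torch.nn as nn

    y_pred = torch.tensor([2.5, 3.0, 4.0, 8.0])
    y_true = torch.tensor([2.0, 3.5, 5.0, 7.0])

    loss_fn = nn.MSELoss()          # mean squared error, averaged over elements
    loss = loss_fn(y_pred, y_true)  # argument order is (input, target)
    print(f"{loss.item():.4f}")

Output:

    0.6250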

Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) loss function, also known as the L1 loss, is another commonly used loss function in regression tasks. It measures the average of the absolute differences between the true and predicted values. Because the differences are taken as absolute values, the direction (sign) of each error is not considered.

Mathematically, the Mean Absolute Error can be represented as:

MAE = (1/n) * Σ |y_i - ŷ_i|

Because the Mean Absolute Error is much more resistant to outliers than MSE, it is a good choice when the data is prone to many outliers.

MAE can be implemented in Python like this:

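A minimal NumPy sketch, again with illustrative values:

    import numpy as np

    y_true = np.array([2.0, 3.5, 5.0, 7.0])
    y_pred = np.array([2.5, 3.0, 4.0, 8.0])

    # Mean Absolute Error: average of the absolute differences
    mae = np.mean(np.abs(y_true - y_pred))
    print(f"{mae:.4f}")

Output:

    0.7500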

Here is a standard implementation in PyTorch using the L1Loss() function from the torch.nn library.

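A minimal sketch with the same illustrative values:

    import torch
    import torch.nn as nn

    y_pred = torch.tensor([2.5, 3.0, 4.0, 8.0])
    y_true = torch.tensor([2.0, 3.5, 5.0, 7.0])

    loss_fn = nn.L1Loss()           # mean absolute error
    loss = loss_fn(y_pred, y_true)
    print(f"{loss.item():.4f}")

Output:

    0.7500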

Binary Cross Entropy (BCE)

Binary Cross Entropy (BCE), also known as Log Loss, is a popular loss function used in binary classification problems, where there are only two mutually exclusive classes and the goal is to predict the probability of one class (the positive class) against the other (the negative class).

The BCE loss measures the difference between the predicted probability and the true binary label (0 or 1) for each data point.

Mathematically, the Binary Cross Entropy (BCE) loss function is defined as follows:

BCE = -(1/n) * Σ [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ]

where y_i is the true binary label (0 or 1) and ŷ_i is the predicted probability of the positive class.

Here, one class is defined as the positive class (1): if a data point belongs to it, the model should produce a value of 1 or close to 1. The other class is the negative class (0): if a data point belongs to it, the model should produce a value of 0 or close to 0.

When the data belongs to the positive class (1):

  • If the model’s predicted value is close to 1 – the BCE loss is low (approaches 0) and there is little to no penalty.
  • If the model’s predicted value is close to 0 – the BCE loss becomes very large and the model is heavily penalized by the -log(y_pred) term.

When the data belongs to the negative class (0):

  • If the model’s predicted value is close to 0 – the BCE loss is low (approaches 0) and there is little to no penalty.
  • If the model’s predicted value is close to 1 – the BCE loss becomes very large and the model is heavily penalized by the -log(1 - y_pred) term.

The Binary Cross Entropy is most commonly used with the sigmoid activation function in the final layer of the neural network to ensure that the predicted probabilities fall within the range of 0 and 1.

BCE can be implemented in Python like this:

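A minimal NumPy sketch; the labels and predicted probabilities are illustrative:

    import numpy as np

    # True binary labels and predicted probabilities for the positive class
    y_true = np.array([1.0, 0.0, 1.0, 0.0])
    y_pred = np.array([0.9, 0.1, 0.8, 0.3])

    # Binary Cross Entropy, averaged over all examples
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    print(f"{bce:.4f}")

Output:

    0.1976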

Here is a standard implementation in PyTorch using the BCELoss() function from the torch.nn library.

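A minimal sketch with the same illustrative values:

    import torch
    import torch.nn as nn

    y_pred = torch.tensor([0.9, 0.1, 0.8, 0.3])   # probabilities, e.g. from a sigmoid
    y_true = torch.tensor([1.0, 0.0, 1.0, 0.0])

    loss_fn = nn.BCELoss()          # expects probabilities in the range [0, 1]
    loss = loss_fn(y_pred, y_true)
    print(f"{loss.item():.4f}")

Output:

    0.1976

In practice, nn.BCEWithLogitsLoss is often preferred over nn.BCELoss because it applies the sigmoid internally and is more numerically stable.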

Categorical Cross Entropy (CCE)

Categorical Cross Entropy (CCE) is a loss function commonly used in neural networks for multiclass classification problems, where there are more than two classes.

This loss function calculates the difference between the predicted class probabilities and the one-hot encoded target labels for each data point. In other words, the predicted probabilities represent how strongly the model believes each data point belongs to each class, and one-hot encoding is used to identify the true class.

For example, if we have three classes A, B, and C, we can encode A as [0,0,1], B as [0,1,0], and C as [1,0,0] using one-hot encoding. For each data example, the model then outputs a probability vector such as [0.1, 0.7, 0.2]. The probabilities over all classes must sum to 1 (0.1 + 0.7 + 0.2 = 1).

For each data point, CCE calculates the element-wise product of the true values and log(predicted values) and sums the result over all classes. The negative sign in front makes the loss positive (the log of a probability is negative), so minimizing the loss during training corresponds to maximizing the probability assigned to the correct class.

Mathematically, the Categorical Cross Entropy (CCE) loss function is defined as follows:

CCE = -(1/n) * Σ_i Σ_c y_(i,c) * log(ŷ_(i,c))

where the inner sum runs over all classes c, y_(i,c) is the one-hot encoded true label, and ŷ_(i,c) is the predicted probability of class c for example i.

The loss value is smaller when the predicted probabilities match with the one-hot encoded target labels. The loss value is much larger when the predicted probabilities differ significantly from the one-hot encoded target labels.

For instance, imagine the true class for a data example is class A ([0,0,1]). If the model’s predicted probabilities for that example are [0.2, 0.1, 0.7], the CCE loss is low (approaches 0) and there is little to no penalty. If the predicted probabilities are [0.6, 0.3, 0.1], the CCE loss becomes very large and the model is heavily penalized.

Remember that the final loss value is calculated by summing the losses of all data examples and then dividing by the number of data examples to get the average.

Categorical Cross Entropy is commonly used with a softmax activation function in the final layer of the neural network, which ensures that the predicted probabilities sum to 1 and that each output node produces a value between 0 and 1, representing valid class probabilities.

Also, it is important to have the same number of output neurons in the output layer as the number of classes.

CCE can be implemented in Python like this:

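A minimal NumPy sketch with two illustrative examples and three classes:

    import numpy as np

    # One-hot encoded true labels: class A = [0,0,1], class B = [0,1,0]
    y_true = np.array([[0.0, 0.0, 1.0],
                       [0.0, 1.0, 0.0]])
    # Predicted class probabilities (each row sums to 1)
    y_pred = np.array([[0.2, 0.1, 0.7],
                       [0.3, 0.6, 0.1]])

    # Categorical Cross Entropy, averaged over all examples
    cce = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
    print(f"{cce:.4f}")

Output:

    0.4338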

Here is a standard implementation in PyTorch using the CrossEntropyLoss() function from the torch.nn library. Note that CrossEntropyLoss expects raw, unnormalized scores (logits) rather than probabilities, because it applies log-softmax internally.

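A minimal sketch; the logits and class indices below are illustrative:

    import torch
    import torch.nn as nn

    # Raw, unnormalized scores (logits) for 2 examples and 3 classes
    logits = torch.tensor([[0.2, 0.1, 1.5],
                           [0.3, 1.2, 0.1]])
    # True classes as indices: 2 = class A, 1 = class B
    targets = torch.tensor([2, 1])

    # CrossEntropyLoss applies log-softmax internally, so it takes logits
    loss_fn = nn.CrossEntropyLoss()
    loss = loss_fn(logits, targets)
    print(f"{loss.item():.4f}")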

The results of the two implementations can differ: the manual implementation works directly on probabilities, while CrossEntropyLoss computes the softmax and the logarithm internally, which also handles very small probabilities in a more numerically stable way.

Hinge Loss

Hinge Loss, also known as Multi-class SVM Loss, is a loss function commonly used in support vector machines (SVMs) and some neural networks for binary classification and ranking tasks. It encourages the model to correctly classify examples while maximizing the margin between positive and negative examples.

Mathematically, the Hinge Loss function is defined as follows:

Hinge Loss = max(0, 1 - y_true * y_pred), averaged over all data examples

Here, we define one class as the positive class (+1) and the other class as the negative class (-1), so the true label y_true for each data example is either +1 or -1. The model outputs a real-valued score y_pred, and the sign of that score indicates the predicted class.

  • If the model’s prediction for a data example is correct and beyond the margin (y_true * y_pred ≥ 1), the loss value is 0. For example, if the true class for a data example is 1 and the model prediction is also 1, then max(0, 1 - 1 * 1) = 0.
  • If the model’s prediction violates the margin, the loss grows with the size of the violation. For example, if the true class for a data example is 1 and the model prediction is 0.2, then max(0, 1 - 1 * 0.2) = 0.8.

Imagine these two situations:

  • If the prediction value for true label +1 is -1, the loss value is calculated as max(0, 1 - (1 * -1)) = max(0, 1 + 1) = max(0, 2) = 2
  • If the prediction value for true label -1 is +1, the loss value is calculated as max(0, 1 - (-1 * 1)) = max(0, 1 + 1) = max(0, 2) = 2

You can see that both situations produce the same loss value. The loss function only measures how badly the margin is violated; it does not care in which direction the error occurs. During the optimization process, the model parameters are updated to minimize this loss.

By maximizing the margin between positive and negative examples, the Hinge Loss motivates the model to correctly classify examples. The model is motivated to increase the margin by penalizing incorrect predictions proportionally to the margin violation, which results in better class separation and improved generalization. To train the model using Hinge Loss, the parameters (weights and biases) of the model are updated in a way that minimizes the total Hinge Loss over the entire training dataset. This is typically done using optimization algorithms like Stochastic Gradient Descent (SGD) or its variants.

Hinge Loss can be implemented in Python like this:

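A minimal NumPy sketch; the labels in {+1, -1} and the raw model scores are illustrative:

    import numpy as np

    y_true = np.array([1.0, 1.0, -1.0, -1.0])   # true labels in {+1, -1}
    y_pred = np.array([0.8, -0.5, -2.0, 0.3])   # raw model scores

    # Hinge loss: max(0, 1 - y_true * y_pred), averaged over all examples
    hinge = np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
    print(f"{hinge:.4f}")

Output:

    0.7500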

Here is an implementation in PyTorch. The torch.nn library provides the HingeEmbeddingLoss() function, although its formulation differs slightly from the classification hinge loss described above.

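A sketch with the same illustrative values, showing both the direct formula and the built-in function:

    import torch
    import torch.nn as nn

    y_true = torch.tensor([1.0, 1.0, -1.0, -1.0])   # labels in {+1, -1}
    y_pred = torch.tensor([0.8, -0.5, -2.0, 0.3])   # raw model scores

    # Direct implementation of max(0, 1 - y_true * y_pred), averaged
    hinge = torch.clamp(1.0 - y_true * y_pred, min=0.0).mean()
    print(f"{hinge.item():.4f}")

    # Built-in HingeEmbeddingLoss: treats the input as a distance and applies
    # the margin only to examples whose target is -1, so its value differs
    # from the classification hinge loss above
    loss_fn = nn.HingeEmbeddingLoss(margin=1.0)
    loss = loss_fn(y_pred, y_true)
    print(f"{loss.item():.4f}")

Output:

    0.7500
    1.0000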

margin = 1.0 – This parameter controls the margin value in the Hinge Loss function. The margin represents the minimum distance required between the predicted value and the true label for a prediction to be considered correct. In other words, for a prediction to count as fully correct, the margin between the predicted value and the true label should be at least 1. If the margin is less than 1, the model is penalized based on the magnitude of the margin violation: Hinge Loss = max(0, margin - y_true * y_pred).

Huber Loss

Huber Loss, also known as Smooth L1 Loss, is a loss function used primarily in regression tasks. It combines the Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions, making it less sensitive to outliers than MSE while remaining smoothly differentiable around the minimum, unlike MAE.

In the Huber loss function, two kinds of mathematical operations (MSE-like and MAE-like) are used to calculate the loss value. Which one is applied depends on how large the difference between the predicted value and the true value is compared to a hyperparameter called delta, which you can tune.

If the absolute difference between the predicted value and the true value is less than or equal to delta (the hyperparameter value), a quadratic, MSE-like term is used to calculate the loss. Otherwise, a linear, MAE-like term is used.

Mathematically, the Huber Loss function is defined as follows:

Huber(y, ŷ) = 0.5 * (y - ŷ)²                       if |y - ŷ| ≤ delta
Huber(y, ŷ) = delta * (|y - ŷ| - 0.5 * delta)      otherwise

As I said before, delta controls the threshold for switching between the quadratic (MSE-like) and linear (MAE-like) loss terms. Smaller values of delta make the loss less sensitive to outliers, so depending on the chosen value, the Huber loss can be considerably more robust to outliers than the MSE loss. You can therefore use the Huber loss when the data is prone to outliers.

Huber Loss can be implemented in Python like this:

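A minimal NumPy sketch with delta = 1.0; the last prediction is deliberately far off to act as an outlier, and all values are illustrative:

    import numpy as np

    def huber_loss(y_true, y_pred, delta=1.0):
        error = y_true - y_pred
        is_small = np.abs(error) <= delta
        squared = 0.5 * error ** 2                      # quadratic (MSE-like) term
        linear = delta * (np.abs(error) - 0.5 * delta)  # linear (MAE-like) term
        return np.mean(np.where(is_small, squared, linear))

    y_true = np.array([2.0, 3.5, 5.0, 7.0])
    y_pred = np.array([2.5, 3.0, 4.0, 10.0])
    print(f"{huber_loss(y_true, y_pred, delta=1.0):.4f}")

Output:

    0.8125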

Here is a standard implementation in PyTorch using the SmoothL1Loss() function from the torch.nn library.

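A minimal sketch with the same illustrative values; with its default beta = 1.0, SmoothL1Loss matches the Huber loss with delta = 1.0:

    import torch
    import torch.nn as nn

    y_pred = torch.tensor([2.5, 3.0, 4.0, 10.0])
    y_true = torch.tensor([2.0, 3.5, 5.0, 7.0])

    loss_fn = nn.SmoothL1Loss()     # default beta = 1.0
    loss = loss_fn(y_pred, y_true)
    print(f"{loss.item():.4f}")

Output:

    0.8125

Newer PyTorch versions also provide nn.HuberLoss, which takes the delta value directly as a parameter.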

How to Choose the Right Loss Function

Choosing the right loss function is very important for making your AI model work its best. The choice of loss function mainly depends on the nature of the problem and the characteristics of the data.

  • Nature of the Problem: Different problems need different loss functions. If you’re doing stuff like figuring out numbers (regression), you can go for Mean Squared Error or Huber Loss. If you’re deciding between two options (binary classification), you might want Binary Cross Entropy or Hinge Loss. And if you’re picking from more than two options (multi-class classification), Categorical Cross Entropy is useful.
  • Characteristics of the data: Take a look at your data. Is it uneven, or does it contain some strange values (outliers)? That matters because some loss functions are very sensitive to those values, especially outliers. If your data contains outliers, consider using Huber loss or another robust loss function.

Remember, sometimes you’ll have to try different loss functions to see which one works best for your special job and data.

This article clarified the idea of loss functions and taught us about some of the most important loss functions, their internal workings, and how to use them in PyTorch and Python. The complex world of Loss Functions has now become much simpler to understand, whether you’re just getting started or trying to improve your AI skills.
