
Probability in AI – Everything You Should Know

Probability is a major branch of mathematics with many applications in the real world and in AI. In this article, I will give you a solid understanding of probability, its practical applications in AI, and how you can effectively employ it through hands-on examples.

What Is Probability?

Probability is the way we measure uncertainty. It answers the question, “How likely is something to happen?”. We represent the probability of an event as a number between 0 and 1, where 0 means the event is impossible (no chance of occurring), and 1 means the event is certain to occur (definitely going to happen). All other values between 0 and 1 represent varying degrees of likelihood.

For example, if you flip a fair coin once, there are two possible outcomes: heads or tails. Each outcome has an equal chance of happening; therefore, the probability for each outcome is 0.5.

Here are a few key ideas you should know:

  • Sample Space: This is the set of all possible outcomes. For example, when rolling a fair six-sided die, the sample space consists of the numbers 1, 2, 3, 4, 5, and 6, so S = {1, 2, 3, 4, 5, 6}.
  • Event: An event is a subset of the sample space. It refers to a specific outcome or a combination of outcomes that we are interested in. For the above example, an event could be obtaining an odd number (1, 3, or 5) or rolling a number greater than 4 (5 or 6).
  • Probability as a Ratio: Probability is often expressed as the ratio of the number of favorable outcomes (outcomes that satisfy the event) to the total number of possible outcomes in the sample space. For example, if we roll a fair six-sided die and want to find the probability of getting a 3, there is only one favorable outcome (the number 3), and the total number of outcomes in the sample space is six. Thus, the probability is 1/6 (see the short sketch after this list).
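
To make this concrete, here is a minimal Python sketch (the helper name event_probability is purely illustrative, not from the original text) that computes an event's probability as the ratio of favorable outcomes to the size of the sample space:

from fractions import Fraction

def event_probability(event, sample_space):
    # Probability as a ratio: favorable outcomes / total outcomes,
    # assuming every outcome in the sample space is equally likely.
    favorable = [outcome for outcome in sample_space if outcome in event]
    return Fraction(len(favorable), len(sample_space))

sample_space = {1, 2, 3, 4, 5, 6}                  # fair six-sided die
print(event_probability({3}, sample_space))        # 1/6 (rolling a 3)
print(event_probability({1, 3, 5}, sample_space))  # 1/2 (rolling an odd number)
print(event_probability({5, 6}, sample_space))     # 1/3 (rolling a number greater than 4)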

We have defined what probability means. Now, let's dive deeper into some key concepts of probability:

Joint Probability

Joint probability is the probability of two events occurring together. It is denoted by P(A and B) or P(A ∩ B). When all outcomes are equally likely, the joint probability is calculated by dividing the number of outcomes where both events A and B occur by the total number of outcomes.

For example:
Let’s say we toss two coins and want to find the probability of getting heads on both. The sample space consists of four possible outcomes: {HH, HT, TH, TT}. To find the joint probability of getting two heads (HH), we count the number of favorable outcomes (1) and divide it by the total number of outcomes (4), giving us a joint probability of 1/4.
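
As a quick check, the following sketch enumerates the two-coin sample space and counts the outcomes in which both coins show heads (the variable names are just for illustration):

from fractions import Fraction
from itertools import product

# Enumerate the sample space for tossing two coins: {HH, HT, TH, TT}
sample_space = list(product("HT", repeat=2))

# Joint event: heads on the first coin AND heads on the second coin
both_heads = [outcome for outcome in sample_space if outcome == ("H", "H")]

print(Fraction(len(both_heads), len(sample_space)))  # 1/4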

Conditional Probability

Conditional probability is the probability of event A occurring given that event B has already occurred. It is denoted by P(A|B), where A and B represent two events. The conditional probability is calculated by dividing the joint probability of A and B by the probability of event B.

P(A|B) = P(A and B) / P(B)

For example:
Let’s consider drawing two cards from a standard deck without replacement. We want to find the probability of drawing a king on the second draw given that the first card drawn was a queen. The probability of drawing a queen on the first draw is 4/52 (as there are 4 queens in a deck of 52 cards). After removing the queen, the deck is left with 51 cards. Since there are 4 kings in the remaining cards, the probability of drawing a king on the second draw, given that the first card was a queen, is 4/51. Therefore, the conditional probability is 4/51.

Another example:
Let’s say we have a bag with 5 red marbles and 3 blue marbles. If we randomly select a marble from the bag and it is red, what is the probability that the next marble we select will also be red? The conditional probability of selecting a red marble given that the first marble was red is 4/7 since there are now 4 red marbles and 7 total marbles remaining in the bag.
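
To check this empirically, here is a rough simulation of the marble example, assuming draws without replacement; with enough trials the estimate should settle near 4/7 ≈ 0.571:

import random

def draw_two_marbles():
    # Bag with 5 red and 3 blue marbles, drawn without replacement
    bag = ["red"] * 5 + ["blue"] * 3
    random.shuffle(bag)
    return bag[0], bag[1]

trials = 100_000
first_red = 0
both_red = 0
for _ in range(trials):
    first, second = draw_two_marbles()
    if first == "red":
        first_red += 1
        if second == "red":
            both_red += 1

# P(second red | first red) = P(both red) / P(first red)
print(both_red / first_red)  # should be close to 4/7 ≈ 0.571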

Bayes' Theorem

Bayes' theorem is based on conditional probability. It describes the relationship between conditional probabilities and can be used to update the probability of an event based on new evidence or information.

P(A|B) = (P(B|A) * P(A)) / P(B)

For example:
Let’s consider a medical scenario where we have the following information:

– The probability of a person having a certain disease is 0.2 (P(D) = 0.2).
– The probability of a positive test result for the disease, given that a person has the disease, is 0.9 (P(T|D) = 0.9).
– The probability of a positive test result for the disease, given that a person does not have the disease, is 0.1 (P(T|not D) = 0.1).

We want to calculate the probability that a person actually has the disease given a positive test result P(D|T). Using Bayes’ theorem, we can calculate this as:

P(D|T) = (P(T|D) * P(D)) / P(T)

For that, we need the probability of a positive test result P(T). P(T) can be calculated using the law of total probability:

P(T) = P(T|D) * P(D) + P(T|not D) * P(not D)

In this case, P(not D) is the complement of P(D) (i.e., P(not D) = 1 – P(D)). Substituting the given values, we have:

P(T) = (0.9 * 0.2) + (0.1 * (1 – 0.2)) = 0.18 + 0.08 = 0.26

Now, we can substitute the values of P(T|D), P(D), and P(T) back into Bayes’ theorem to calculate P(D|T):

P(D|T) = (0.9 * 0.2) / 0.26 ≈ 0.6923

Therefore, there is approximately a 69.23% probability that a person actually has the disease given a positive test result.
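
The same calculation can be scripted. Below is a small sketch (the function name bayes_posterior is mine, purely illustrative) that plugs the numbers from this scenario into Bayes' theorem:

def bayes_posterior(prior, likelihood, false_positive_rate):
    # P(D|T) = P(T|D) * P(D) / P(T), with P(T) from the law of total probability
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# P(D) = 0.2, P(T|D) = 0.9, P(T|not D) = 0.1
print(bayes_posterior(prior=0.2, likelihood=0.9, false_positive_rate=0.1))  # ≈ 0.6923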

Independent Events

Two events A and B are considered independent if the occurrence or non-occurrence of one event does not affect the probability of the other event. In other words, the probability of event A happening remains the same regardless of whether event B occurs or not.

For example:
Let’s say we toss a coin and roll a die. The probability of getting heads on the coin and rolling a 6 on the die is the product of the individual probabilities: 1/2 * 1/6 = 1/12. Since the outcome of the coin toss does not affect the outcome of the die roll, we can say that these events are independent.

Another example:
Let’s consider rolling a fair six-sided die twice. The probability of getting a 4 on the first roll is 1/6. Even if we roll the die a second time, the probability of getting a 4 again remains 1/6. The outcome of the first roll does not impact the outcome of the second roll, making the events independent.
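
Here is a short sketch that verifies the coin-and-die example by enumeration: for independent events, the joint probability equals the product of the individual probabilities.

from fractions import Fraction
from itertools import product

# Sample space for tossing a coin and rolling a die
sample_space = list(product(["H", "T"], [1, 2, 3, 4, 5, 6]))

p_heads = Fraction(sum(1 for coin, _ in sample_space if coin == "H"), len(sample_space))
p_six = Fraction(sum(1 for _, die in sample_space if die == 6), len(sample_space))
p_both = Fraction(sum(1 for coin, die in sample_space if coin == "H" and die == 6),
                  len(sample_space))

print(p_both)                     # 1/12
print(p_both == p_heads * p_six)  # True: the events are independent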

Remember, probability helps us make informed decisions by predicting the likelihood of certain outcomes based on available information about past occurrences.

Probability theory serves as the foundation of many fields, including statistics, science, economics, and more. By understanding and utilizing probability, we can make informed decisions, analyze risks, and predict outcomes in a wide range of situations.

Why do we need Probability in AI?

Probability plays a crucial role in artificial intelligence (AI). Much like logic, it is used for reasoning and decision-making. Probability provides tools such as Bayesian decision theory for making decisions from incomplete or noisy data. In other words, it helps machines learn from experience by assigning probabilities to different outcomes and adjusting their behavior accordingly (using Bayes' theorem, we can update our beliefs as new evidence becomes available). This allows AI systems to make optimal decisions under uncertainty.

Natural language processing tasks like language modeling, machine translation, and sentiment analysis use probability to predict the likelihood of certain words or phrases occurring in a given context and to generate coherent, probabilistically grounded responses.

In reinforcement learning (RL), an AI agent learns to interact with an environment to maximize rewards. However, since the environment may be stochastic or unpredictable, there is always some degree of uncertainty involved in choosing which action to take next. To address this, RL algorithms use techniques like Monte Carlo methods and Markov decision processes (MDPs), which rely heavily on probability distributions, allowing them to estimate expected rewards more accurately even in uncertain environments.

Beyond these examples, probability also plays a critical role across other fields within artificial intelligence, including computer vision (CV) and robotics. CV requires identifying objects within images despite variations due to lighting conditions, and robotics needs accurate prediction capabilities so robots don't collide with obstacles while navigating unfamiliar terrain.

Let's see how well these probability concepts hold up in practice.

Tossing A Coin

If we toss a coin, we can get either heads or tails, so the sample space is {heads, tails}. If our coin is fair, both outcomes are equally likely, so P(heads) = 0.5 and P(tails) = 0.5. Moreover, if we plan to toss the coin n times, the expected fraction of heads should exactly match the expected fraction of tails. In other words, if we toss the coin 100 times, we expect heads 50 times and tails the remaining 50 times (number of heads = number of tails).

Of course, although theory says 50-50 for heads and tails, if you conduct this experiment many times with n = 1000000 tosses each, you might never see a trial where the number of heads exactly equals the number of tails. Let's see this situation using PyTorch:

Throughout this article, I use a combination of Python and PyTorch for programming. You can follow our installation guide to set up your PC properly!

To simulate the tosses of a fair coin, we can use any random number generator, such as Python's random.random(), which generates random numbers between 0 and 1.

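A minimal sketch of such a simulation, assuming we treat values below 0.5 as heads, might look like this:

import random

tosses = 100
heads = sum(1 for _ in range(tosses) if random.random() < 0.5)  # values below 0.5 count as heads
tails = tosses - heads
print(heads, tails)  # e.g. 43 57; the exact split varies from run to run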

In one such run of 100 tosses, heads appeared 43 times and tails appeared 57 times.

In PyTorch, we can simulate this using the Multinomial distribution. By setting its first argument to the number of draws (number of tosses) and the second to a list of probabilities associated with each of the possible outcomes, we can get the number of times each possible outcome appeared.

To simulate 100 tosses of a fair coin, we assign a probability vector [0.5, 0.5], interpreting index 0 as heads and index 1 as tails. The function returns a vector with a length equal to the number of possible outcomes (here, 2), where the first component tells us the number of occurrences of heads and the second component tells us the number of occurrences of tails.

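A sketch of this step might look like the following (fair_probs is just an illustrative name for the probability vector):

import torch
from torch.distributions.multinomial import Multinomial

fair_probs = torch.tensor([0.5, 0.5])   # index 0 = heads, index 1 = tails
counts = Multinomial(100, fair_probs).sample()
print(counts)  # e.g. tensor([51., 49.]); heads count first, then tails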

Each time you run this sampling process, you will receive a new random value that may differ from the previous outcome.

Dividing the counts by the number of tosses gives us the frequency of each outcome in our data.

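Continuing the sketch above, we can convert the counts into relative frequencies:

print(counts / 100)  # e.g. tensor([0.5100, 0.4900]): the relative frequencies of heads and tails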

Note that the sum of all frequency values is equal to 1 (0.51 + 0.49 = 1).

Here, even though our simulated coin is fair (we set the probabilities [0.5, 0.5] ourselves), the counts of heads and tails may not be identical. That is because we only drew a finite number of samples. If we did not implement the simulation ourselves and only saw the outcome, how would we know whether the coin was slightly unfair or the deviation from 1/2 was just an artifact of the small sample size? Let's see what happens when we simulate 10000 tosses:

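Again as a rough sketch, reusing fair_probs and Multinomial from above:

counts = Multinomial(10000, fair_probs).sample()
print(counts / 10000)  # both frequencies should now be much closer to 0.5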

In general, for averages of repeated events (like coin tosses), as the number of repetitions grows, our estimates are guaranteed to converge to the true underlying probabilities. This guarantee is known as the law of large numbers, and the central limit theorem tells us that in many situations, as the sample size n grows, the estimation error should shrink at a rate of about 1/√n. Let's get some more intuition by studying how our estimate evolves as we grow the number of tosses from 1 to 10000.
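
One way to produce such a plot, sketched here with matplotlib and the fair_probs tensor from the earlier snippets, is to sample one toss at a time and track the running frequency estimates:

import matplotlib.pyplot as plt

counts = Multinomial(1, fair_probs).sample((10000,))           # 10000 individual tosses
cum_counts = counts.cumsum(dim=0)                              # running totals of heads and tails
estimates = cum_counts / cum_counts.sum(dim=1, keepdim=True)   # running frequency estimates

plt.plot(estimates[:, 0], label="P(heads)")
plt.plot(estimates[:, 1], label="P(tails)")
plt.axhline(y=0.5, color="black", linestyle="dashed")
plt.xlabel("Number of tosses")
plt.ylabel("Estimated probability")
plt.legend()
plt.show()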

[Plot: Probability of getting Head or Tail, showing the estimated frequencies of heads and tails converging toward 0.5 as the number of tosses grows]

You can see that as the number of tosses increases, the estimated frequency of each outcome gets closer to 0.5.

Expectations

When making decisions, it's not enough to just consider the probabilities of individual events. We often need to combine these probabilities into useful summary values. In particular, when dealing with random variables, we often want to know the average value we can expect. This is called the 'expectation'.

Let's take the example of investments. We might be interested in the expected return, which is the average return we can expect by considering all possible outcomes and their probabilities. For instance, if there's a 50% chance of failure (0× return), a 40% chance of a 2× return, and a 10% chance of a 10× return, we calculate the expected return by multiplying each return by its probability and summing them up. This yields the expectation 0.5 · 0 + 0.4 · 2 + 0.1 · 10 = 1.8.
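
Here is a tiny sketch of that calculation, using the outcome values and probabilities from the example above:

returns = [0.0, 2.0, 10.0]   # failure, 2x return, 10x return
probs = [0.5, 0.4, 0.1]

expected_return = sum(p * r for p, r in zip(probs, returns))
print(expected_return)  # 1.8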

In general, the expectation of a random variable X is calculated by summing over all its possible values and multiplying each value by its probability. This can be expressed as:

E[X] = E_{x∼P}[x] = Σ_x x · P(X = x)

Similarly, for continuous probabilities, we use integration to calculate the expectation:

E[X] = ∫ x dp(x)

Sometimes, we’re interested in the expected value of a function of the random variable. We can calculate these expectations by multiplying the function value by the probability of each corresponding value and summing them up for discrete probabilities:

E_{x∼P}[f(x)] = Σ_x f(x) · P(x)

or by integrating the function multiplied by the probability density for continuous probabilities:

E_{x∼P}[f(x)] = ∫ f(x) p(x) dx

The expected value tells us the average amount we can expect to win (1.8 in the example above). However, it does not provide information about the variability or risk involved. When it comes to investments, we also want to measure that risk, so we need a quantity that captures how much the actual outcomes deviate from the expected value. A first attempt is to take the expectation of the difference between actual and expected values, subtracting the expected value from each outcome and averaging:

E[X - E[X]] = (0.5 * (0 - 1.8)) + (0.4 * (2 - 1.8)) + (0.1 * (10 - 1.8)) = -0.9 + 0.08 + 0.82 = 0

As we can see, taking the expectation of the difference results in zero. This is because the positive and negative deviations from the expected value cancel each other out, giving the impression that no risk is involved. So we cannot use the expectation of the difference between the actual and expected values as a risk measure: the expectation of a difference is the difference of the expectations, which here equals zero:

E[X - E[X]] = E[X] - E[E[X]] = 0

Variance

To properly assess risk, we need to consider the variability or dispersion of the outcomes. One common measure of dispersion is the variance. It is calculated by taking the expected value of the squared differences between the random variable and its expectation:

Var[X] = E[(X - E[X])²] = E[X²] - E[X]²

The square root of the variance is called the standard deviation; it is advantageous because it is expressed in the same units as the original random variable. The variance of a function of a random variable is defined analogously, as the expected value of the squared difference between the function value and its expectation:

Var_{x∼P}[f(x)] = E_{x∼P}[f²(x)] - E_{x∼P}[f(x)]²

Now we can compute the variance of the investment. It is given by 0.5 × 0² + 0.4 × 2² + 0.1 × 10² - 1.8² = 8.36. By any measure, this is a risky investment.
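
Continuing the investment sketch from above, the variance can be computed as E[X²] - E[X]²:

expected_square = sum(p * r ** 2 for p, r in zip(probs, returns))
variance = expected_square - expected_return ** 2
print(variance)  # ≈ 8.36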

Remember, mean and variance are often denoted by µ and σ².

Just as we introduced expectations and variances for scalar random variables, we can extend these concepts to vector-valued ones. Computing expectations for vectors is straightforward, since we can apply them elementwise. For instance:

μ ≝ E_{x∼P}[x] represents the expectation vector, with coordinates μ_i = E_{x∼P}[x_i]

Covariances

Covariance indicates how changes in one variable are associated with changes in another. Calculating covariances is more involved than calculating expectations: for a vector-valued random variable, we take the expectation of the outer product of the difference between the random variable and its mean.

Σ ≝ Cov_{x∼P}[x] = E_{x∼P}[(x - μ)(x - μ)ᵀ]

This matrix Σ is known as the covariance matrix. This provides valuable information about the relationships between variables. To grasp its impact, let’s consider a vector v that has the same size as x. In this scenario, we can express the effect of the covariance matrix as follows:

vᵀΣv = E_{x∼P}[vᵀ(x - μ)(x - μ)ᵀv] = Var_{x∼P}[vᵀx]

Hence, Σ allows us to compute the variance for any linear function of x through simple matrix multiplication. The off-diagonal elements of Σ indicate the correlation between coordinates: a value of 0 signifies no correlation, while a larger positive value implies a stronger positive correlation.
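
As a quick numerical sanity check, here is a sketch (the sample data and the direction vector v are made up for illustration) that uses torch.cov to estimate Σ from random samples and compares vᵀΣv with the empirical variance of vᵀx:

import torch

torch.manual_seed(0)
# 10000 samples of a 3-dimensional random vector with correlated coordinates
samples = torch.randn(10000, 3) @ torch.tensor([[1.0, 0.5, 0.0],
                                                [0.0, 1.0, 0.3],
                                                [0.0, 0.0, 1.0]])

sigma = torch.cov(samples.T)        # covariance matrix (variables as rows)
v = torch.tensor([1.0, -2.0, 0.5])  # an arbitrary direction

print(v @ sigma @ v)                # variance predicted by v^T Σ v
print(torch.var(samples @ v))       # empirical variance of v^T x, should be close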

In summary, the expectation gives us the average value we can expect, and the variance measures the deviation from this average. These concepts can be applied to both scalar and vector-valued random variables, where expectations are calculated elementwise for vectors. The covariance matrix describes the relationships and correlations between different coordinates of the random variables.

Throughout this article, we have covered basic concepts in probability such as joint probability, conditional probability, and Bayes' theorem, and saw how these concepts play out in practice using simple coin-tossing examples with PyTorch. In addition, we learned about expectation, variance, and covariance, which are essential for decision-making in AI models. In simple terms, all of these probability tools help us navigate uncertainty with confidence and provide the information needed to make accurate decisions, increasing the effectiveness and accuracy of AI models. So make sure you understand these concepts properly.
