If you have been following my articles in the blog [1,2], you might have picked the idea that I'm fairly interested in probability, even though my degrees (both MSc and PhD) are in Computer Sciences.
Probability and statistics are a part of mathematics that always fascinated me. However, during my career, I wasn't lucky enough to have enough courses in this area, only an introductory course in probability and statistics (mostly frequentist) and another introductory course in statistical modeling (which I didn't fully grasp at the time). After my college years, especially during my Ph.D. and my work as a researcher in the fields of machine learning and artificial intelligence, I have had more hands-on experience with different applications of probability in what I work with.
One particular area of probability I happen to find exciting is Bayesian probability and Bayesian modeling and inference, which, in layman's terms, is the idea of using evidence to update our hypothesis. In this article, we will go through an overview of Bayesian modeling, how we can use it as a tool to analyze our data, what the difference is from other data modeling approaches like machine learning and traditional statistics data modeling, and how we can benefit from probabilistic programming to model Bayesian inference.
So, What is Bayesian Modeling Exactly?
The key idea of Bayesian modeling is to start with some hypothesis and our prior knowledge and update it with the evidence we see from the data, our evidence, to obtain the final distribution that explains the dataset we have.
The idea of updating our beliefs, and how important this is, is illustrated in the classic Monty Hall problem [3]. In this problem, you are given the choice of changing the door after the host of the game shows you one of the doors without a goat. If you change the door you picked, you can double your chances of getting the car.
Bayesian inference is helpful in developing a solution tailored to the available data. It starts with a piece of prior knowledge and updates that hypothesis based on the evidence presented. It can combine the subjective domain knowledge with the objective data to build a model that explains our data. If it's combined with probabilistic programming, we can do the inference with code instead of math.
Formal Definition of Bayesian Inference
In formal terms, Bayesian inference uses Bayes' Theorem to compute a posterior distribution from a prior distribution and the update of this prior distribution with evidence:
Where:
H is the hypothesis whose probability may be affected by the evidence (or data) E.
P(H) is the prior probability of the hypothesis before any evidence E is observed.
E is the evidence, i.e., the data observed from our dataset.
P(H|E) is the posterior probability, i.e., the probability of the hypothesis H after the evidence E is observed. This is what we are trying to model.
P(E|H) is the likelihood, i.e., the probability of observing the evidence E given the hypothesis H. It's a function of E with a fixed H and indicates the compatibility of the evidence with the given hypothesis.
P(E) (which is > 0) is the marginal likelihood, or "model evidence". This is across all possible hypotheses (since it's not dependent on H). As we explained in a previous article [2], this is the most challenging thing to model, and this is why we use probabilistic programming to simulate it.
Bayesian Modeling vs. Classic Machine Learning
Classic machine learning is primarily interested in predicting the value given an input. This is especially the case for supervised machine learning, where the most important thing is usually that the predicted value, either a number or a class, is the one corresponding with the observed input. It isn't interested in making inferences on the data but rather in the model being "right" on the outcome. On the contrary, Bayesian modeling has the objective of making inferences over the data we are studying.
And what is the difference between prediction and inference? A good example is the case of the Titanic Dataset [4]. For the case of inference, we are interested in analyzing the effect of age, passenger class, and gender on surviving the Titanic disaster, but for the case of prediction, we only use those features to predict whether the passenger lived or died. As we discussed in a previous article [1], there are two primary forms of machine learning models: generative and discriminative. Generally speaking, we can use either of them to do machine learning, but for doing inference, the generative approach, which gives you a probability of a passenger surviving the Titanic shipwreck, would be better suited for doing inference than a discriminative approach, which only gives you whether the passenger survived or not. That said, some discriminative models, like logistic regression, can be more or less exploited to make inferences, while others, like SVM, are more difficult.
Since machine learning is interested in the prediction rather than the inference part, we can use either white box or black box models to do it. And here lies another key difference with Bayesian modeling, where the only way to do it is with a white box approach, which gives a better understanding of what the model is learning and why.
Finally, machine learning is a purely data-driven approach; thus, it doesn't provide any way to incorporate pre-existing knowledge about the problem.
Bayesian Modeling vs. Frequentist Statistics
In the more classic approach for hypothesis testing using frequentist statistics, we have a hypothesis we want to test using classic statistical inference. We use a model that we know more about, the null hypothesis, to train and explain our data, and we try to find some critical value by evaluating the p-value from a computed test statistic.
In contrast to Bayesian modeling, in this type of statistical inference, we usually have some implicit assumption over the data (e.g., that it's normally distributed), and we are bound to a set of pre-packaged tests, or we have to derive a new estimator, which is not an easy feat. Moreover, there has been quite a controversy around the idea of p-values since, by itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis [5].
Personally, I have never found hypothesis testing very intuitive, and I have always had problems understanding or justifying my experiments based on the sole idea that the p-value is "small" enough [6].
Probabilistic Programming with PyMC
One of the main challenges when working with Bayesian inference is the marginal distribution of the probability of the evidence, i.e., P(E), which can become intractable, as explained in more detail in the article about Variational Autoencoders [2]. Thus, we require some way to simulate this, i.e., we need a way to sample data based on this.
In this scenario, probabilistic programming becomes important in the whole process of Bayesian inference. Probabilistic programming allows flexible specification of Bayesian statistical models in code instead of having to deal with them mathematically. It does so with the help of Markov Chain Monte Carlo (or MCMC) algorithms to draw samples from different distributions.
The most popular framework in Python for probabilistic programming is PyMC [7]. It provides an intuitive syntax similar to the one used by statisticians when describing models. It uses state-of-the-art sampling algorithms that work well on high-dimensional and complex posterior distributions, allowing many complex models to fit without specialized knowledge about fitting algorithms.
Under the hood, it leverages PyTensor [8] (a fork of the Theano project) to transparently transcode models to C and compile them to machine code, thereby boosting performance. PyTensor is a library that allows expressions to be defined using generalized vector data structures called tensors. These are tightly integrated with the popular NumPy "ndarray" [9] data structure and similarly allow for broadcasting and advanced indexing, just as NumPy arrays do.
Using PyMC to Model the Probability of Survivorship from the Titanic
To get a more concrete idea of how Bayesian inference works using PyMC, we will try to build a distribution that explains the Titanic's survivorship [4]. The catch is that we will use a sample of only 50 of the 891 passengers.
This is a simple model that will serve only as a driver for you to get a clearer picture of what Bayesian inference does under the hood. If you want to explore more complex examples, I recommend you take a look at the PyMC examples documentation [10], which has a much more extensive collection of examples you can check. Before starting, follow the instructions to install PyMC [11].
We start by loading the Titanic data, from which we are only interested in modeling the survivorship probability using Bayesian statistics. Thus, we discard all the other features and sample 50 passengers from it. Based on this, we are going to model the real survivorship distribution.
import arviz as az
import pandas as pd
titanic = pd.read_csv("../data/titanic.csv", index_col=0)[['Survived']]
titanic_sample = titanic.sample(n=50)
az.plot_dist(titanic_sample);
Now, based on a first view of the data distribution, we need to start from a prior distribution. There are different distributions, but since we are modeling a probability value, i.e. the probability of survivorship in the Titanic, a good start can be a Beta distribution defined in the interval [0, 1]. Now that we have our prior and our evidence, we can model our posterior distribution, and for that, we use the sampling provided by PyMC:
import pymc as pm
with pm.Model() as survivorship_model:
prior = pm.Beta("survival_rate", alpha=3, beta=5)
likelihood = pm.Binomial(
"survivors", p=prior, n=len(titanic_sample), observed=titanic.sum()
)
posterior = pm.sample()
And voilà, as simple as that, we have a model for our posterior distribution, i.e., the probability of survivorship in the Titanic, based on our prior knowledge and the evidence provided.
Now, there are a couple of things to clear here. The first question is, why do we model the prior as a Beta distribution? As I said before, since we are modeling a probability here, we can benefit from a prior that gives values in the range [0, 1] and nothing more. Moreover, by checking the distribution plot, we see there's around a 40% survivorship rate. Thus, we select the parameters of the Beta distribution, with the alpha parameter being a little smaller than the beta parameter. We can check the prior distribution by sampling from it and plotting it:
import matplotlib.pyplot as plt
with survivorship_model:
prior_samples = pm.sample_prior_predictive()
ax = az.plot_dist(prior_samples.prior["survival_rate"])
ax.set(
xlabel="Prior belief of survival rate",
ylabel="Relative Plausibility",
title="Prior Check",
)
plt.tick_params(left=False, labelleft=False)
We see the prior distribution having a slight mode around 40%, which is similar to the observed distribution of the data.
If we want to model other features of the Titanic dataset, we would probably be better off with another type of distribution. For example, if we were trying to model the age of the passengers, a normal (also called Gaussian) distribution would probably be our best bet.
Another question you might have is, why is the likelihood modeled as a Binomial distribution? The main reason is that it matches the generative process: we have 50 passengers of the Titanic that either survived or not, these are binary trials (0s and 1s), and those trials succeed with the prior probability (called "survival_rate"), that we're trying to infer. This describes precisely a process that fits a Binomial distribution. The survival rate, i.e., the p parameter of the Binomial distribution, is an estimate of the actual survival rate of the Titanic.
Now that we have our posterior distribution, we can generate samples from it, and we can compare them with our prior distribution and the real probability of survival of the Titanic passengers. First, we sample from the posterior distribution:
with survivorship_model:
posterior_predictives = pm.sample_posterior_predictive(posterior)
posterior.extend(posterior_predictives)
Finally, we plot the survivorship rate from the posterior distribution, and we compare it with the prior distribution and with the real probability of survivorship from the Titanic data:
ax = az.plot_dist(
posterior.posterior_predictive["survivors"],
label="Posterior Predictive",
)
ax.axvline(
titanic.mean().iloc[0] * 50,
ls="--",
lw=3,
color="orange",
label="Real data",
)
az.plot_dist(
posterior.prior_predictive["survivors"],
label="Prior Predictive",
color="orange",
hist_kwargs={"alpha": 0.75},
ax=ax,
)
ax.set(
xlabel="Posterior prediction of # of survivors",
ylabel="Relative Plausibility",
title="Posterior prediction of survivors out of 100 passengers",
)
ax.legend()
plt.tick_params(left=False, labelleft=False)
The plot above shows the number of passengers out of 50 samples who would have survived the Titanic. We compare it with the actual probability of survival of 50 passengers given by the percentage of survivorship from the whole Titanic dataset. As we can see, even though the prior distribution (in yellow) wasn't too bad, the posterior distribution, which was obtained by adding the evidence, is even more concentrated around the real survivorship rate. This is the power of updating our previous beliefs with evidence.
Final Thoughts
In this article, I gave a quick overview of Bayesian modeling and the power of Probabilistic Programming to perform Bayesian inference. However, this article is by no means a thorough explanation or a comprehensive coverage of the vast world of Bayesian statistics and Probabilistic Programming. The documentation of PyMC [7] can help you go much deeper into this fascinating world of Bayesian statistics.
Bayesian modeling is an excellent tool for cases where we do not only care about the prediction but also for understanding more about the nature of our data. As we saw, we model a distribution for the probability of survivorship from less than 6% of the total data in the Titanic dataset, and this is where Bayesian probability excels since starting from prior knowledge and updating it based on evidence can help us better model for cases where data is scarce, largely unbalanced or plainly uncertain. It's a tool worth exploring and having available in our everyday toolbox for machine learning and data science.
References
[1] Cardellino, C. (2024). "Generative vs. Discriminative Models in Machine Learning". https://www.transcendent-ai.com/post/generative-vs-discriminative-models-in-machine-learning
[2] Cardellino, C. (2024). "Variational Autoencoders: Intuitions and Math". https://www.transcendent-ai.com/post/variational-autoencoders-intuitions-and-math
[3] AsapSCIENCE. (2016). The Monty Hall Problem - Explained. https://www.youtube.com/watch?v=9vRUxbzJZ9Y
[4] Will Cukierski. (2012). Titanic - Machine Learning from Disaster. Kaggle. https://kaggle.com/competitions/titanic
[5] Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129-133.
[6] Veritasium. (2016). Is Most Published Research Wrong?https://www.youtube.com/watch?v=42QuXLucH3Q
[7] PyMC. Probabilistic programming library for Python. https://www.pymc.io/welcome.html
[8] PyTensor Documentation. https://pytensor.readthedocs.io/en/latest/
[9] Numpy Documentation. Numpy ndarray. https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray
[10] PyMC Example Gallery. https://www.pymc.io/projects/examples/en/latest/gallery.html
[11] PyMC Installation Instructions. https://www.pymc.io/projects/docs/en/latest/installation.html
Commentaires