6 Variational inference: Scaling to large datasets
This chapter covers
- Turning Bayesian inference into an optimization problem
- The Kullback–Leibler (KL) divergence
- The reparameterization trick
- The pros and cons of variational inference compared to MCMC
We learned about Markov chain Monte Carlo (MCMC) as a powerful tool for performing Bayesian inference when a conjugate prior is unavailable. But this power comes at a cost: for complex models or big datasets, MCMC can be prohibitively slow, because it must thoroughly explore the landscape of the posterior distribution. This computational cost can lead to hours or days of waiting, making MCMC impractical in many real-world settings.
Variational inference (VI) offers a different workaround for nonconjugate priors. Instead of drawing samples until we (hopefully) have enough representative information, VI reframes inference (that is, finding the posterior distribution) as an optimization problem: we pick a family of probability distributions that are easy to work with, and find the member of that family that’s closest to the true posterior.
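To make this concrete, here is a minimal sketch of the idea (a toy illustration, not this book's code). We pretend the posterior is intractable but that we can evaluate its unnormalized log density; here it is assumed to be a Gaussian with mean 2 and standard deviation 0.5 so we can check the answer. We choose the family of Gaussians as our easy-to-work-with family and use SciPy to find the member closest to the posterior by maximizing the evidence lower bound (ELBO), a quantity introduced later in this chapter:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Unnormalized log posterior that we pretend is intractable.
# (Assumption for this toy example: it is really N(2, 0.5^2).)
def log_posterior(theta):
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

# Fixed standard-normal draws; samples from q(theta) = N(mu, sigma^2)
# are written as theta = mu + sigma * eps (the reparameterization trick).
eps = rng.standard_normal(1000)

def negative_elbo(params):
    mu, log_sigma = params          # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps
    # ELBO = E_q[log p(theta, data)] + entropy of q;
    # the entropy of N(mu, sigma^2) is 0.5*log(2*pi*e) + log(sigma).
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_sigma
    return -(log_posterior(theta).mean() + entropy)

result = minimize(negative_elbo, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the true posterior's mean 2 and std 0.5
```

The optimizer recovers approximately mu = 2 and sigma = 0.5, the best Gaussian approximation to this posterior, without ever drawing a single MCMC sample. The same recipe scales to models where the posterior is genuinely unknown.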
The key advantage is speed. VI can scale to datasets of millions of points and models with thousands of parameters without becoming prohibitively time-consuming. It's particularly popular in large-scale machine learning pipelines, where models are trained on big data, need to be updated regularly, and must serve predictions to large populations in real time.