6 Variational inference: Scaling to large datasets
This chapter covers
- Turning Bayesian inference into an optimization problem
- The Kullback–Leibler (KL) divergence
- The reparameterization trick
- The pros and cons of variational inference compared to MCMC
We learned about Markov chain Monte Carlo (MCMC) as a powerful tool for performing Bayesian inference when a conjugate prior is unavailable. But this power comes at a cost: for complex models or big datasets, MCMC can be prohibitively slow, because it must thoroughly explore the landscape of the posterior distribution. This computational cost can lead to hours or days of waiting, making MCMC impractical in many real-world settings.
Variational inference (VI) offers a different workaround for nonconjugate priors. Instead of drawing samples until we (hopefully) have enough representative information, VI reframes inference (that is, finding the posterior distribution) as an optimization problem: we pick a family of probability distributions that are easy to work with, and find the member of that family that’s closest to the true posterior.
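To make this concrete, here is a minimal sketch of the idea (a toy illustration, not this book's code). We pretend the posterior is intractable but that we can evaluate its unnormalized log density; here it is assumed to be a Gaussian with mean 2 and standard deviation 0.5 so we can check the answer. We choose the family of Gaussians as our easy-to-work-with family and use SciPy to find the member closest to the posterior by maximizing the evidence lower bound (ELBO), a quantity introduced later in this chapter:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Unnormalized log posterior that we pretend is intractable.
# (Assumption for this toy example: it is really N(2, 0.5^2).)
def log_posterior(theta):
    return -0.5 * ((theta - 2.0) / 0.5) ** 2

# Fixed standard-normal draws; samples from q(theta) = N(mu, sigma^2)
# are written as theta = mu + sigma * eps (the reparameterization trick).
eps = rng.standard_normal(1000)

def negative_elbo(params):
    mu, log_sigma = params          # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps
    # ELBO = E_q[log p(theta, data)] + entropy of q;
    # the entropy of N(mu, sigma^2) is 0.5*log(2*pi*e) + log(sigma).
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_sigma
    return -(log_posterior(theta).mean() + entropy)

result = minimize(negative_elbo, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # close to the true posterior's mean 2 and std 0.5
```

The optimizer recovers approximately mu = 2 and sigma = 0.5, the best Gaussian approximation to this posterior, without ever drawing a single MCMC sample. The same recipe scales to models where the posterior is genuinely unknown.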
The key advantage is speed. VI can scale to datasets of millions of points and models with thousands of parameters without becoming prohibitively time-consuming. It's particularly popular in large-scale machine learning pipelines, where models are trained on big data, need to be updated regularly, and must serve predictions to large populations in real time.