chapter six

6 Bayesian tools for machine learning

This chapter covers

Unsupervised machine learning models
Bayes’ theorem, conditional probability, entropy, cross-entropy, and conditional entropy
Maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation of model parameters
Evidence maximization
Kullback-Leibler divergence (KLD)
Gaussian mixture models (GMM) and maximum likelihood estimation of GMM parameters

The Bayesian approach to statistics models the world by explicitly representing uncertainty together with the prevailing beliefs and knowledge about the system. This is in contrast to the frequentist paradigm, where probability is measured strictly by observing a phenomenon repeatedly and recording the fraction of trials in which an event occurs. Machine learning, and unsupervised machine learning in particular, is much closer to the Bayesian paradigm of statistics, which is the subject of this chapter.
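To make the contrast concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a method taken from this chapter: the coin-flip data is made up, and the Beta(2, 2) prior is one hypothetical way of encoding a prevailing belief that the coin is roughly fair. The frequentist estimate uses only the observed fraction, while the Bayesian estimate combines the prior belief with the same observations.

```python
# Hypothetical observations (made up for illustration): 1 = heads, 0 = tails.
flips = [1, 0, 1, 1, 0, 1, 1, 1]
heads = sum(flips)
n = len(flips)

# Frequentist estimate: the fraction of observations in which the event occurred.
p_frequentist = heads / n

# Bayesian estimate: start from a prior belief and update it with the data.
# Assumption: a Beta(2, 2) prior, which mildly favors a fair coin. For coin
# flips, the updated belief is again a Beta distribution whose parameters
# are the prior parameters plus the observed counts.
prior_a, prior_b = 2.0, 2.0
post_a = prior_a + heads          # prior pseudo-heads + observed heads
post_b = prior_b + (n - heads)    # prior pseudo-tails + observed tails
p_bayesian = post_a / (post_a + post_b)  # mean of the updated belief

print(f"frequentist estimate:    {p_frequentist:.3f}")
print(f"Bayesian posterior mean: {p_bayesian:.3f}")
```

With only eight observations, the prior pulls the Bayesian estimate toward 0.5. As more data arrives, the observed counts dominate the prior and the two estimates move closer together.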

In chapter 1, we primarily discussed supervised machine learning, where the training data is labeled: each input value is accompanied by a manually created desired output value. Labeling training inputs is labor-intensive and often the biggest pain point in building a machine learning–based system. This has led to considerable recent interest in unsupervised machine learning, where we build a model from unlabeled training data. How is this done?

6.1 Conditional probability and Bayes’ theorem

6.1.1 Joint and marginal probability revisited

6.1.2 Conditional probability

6.1.3 Bayes’ theorem

6.2 Entropy

6.2.1 Geometrical intuition for entropy

6.2.2 Entropy of Gaussians

6.3 Cross-entropy

6.4 KL divergence

6.4.1 KLD between Gaussians

6.5 Conditional entropy

6.5.1 Chain rule of conditional entropy

6.6 Model parameter estimation

6.6.1 Likelihood, evidence, and posterior and prior probabilities

6.6.2 Maximum likelihood parameter estimation (MLE)

6.6.3 Maximum a posteriori (MAP) parameter estimation and regularization

6.7 Latent variables and evidence maximization

6.8 Maximum likelihood parameter estimation for Gaussians

6.8.1 Python PyTorch code for maximum likelihood estimation