concept multinomial distribution in category Keras

This is an excerpt from Manning's book Probabilistic Deep Learning: With Python, Keras and TensorFlow Probability MEAP V06.
In the MNIST task you want to discriminate between 10 classes (0, 1, …, 9). Therefore you set up an NN with ten output nodes, each providing the probability that the input corresponds to the respective class. These ten probabilities define the ten parameters of the CPD in the MNIST classification model. The model of the classification CPD is called the multinomial distribution, which is an extension of the Bernoulli distribution to more than two classes. In the case of the MNIST classification task with ten classes, the multinomial CPD can be expressed as follows:
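The equation itself is not reproduced in this excerpt. For a single classification (one draw), the multinomial CPD presumably takes the categorical form, where p_k(x) denotes the k-th output of the network for input x:

```latex
% Multinomial (categorical, one trial) CPD for the ten MNIST classes;
% p_k(x) is the k-th output node (softmax probability) for input x.
P(Y = k \mid x) = p_k(x), \qquad k \in \{0, 1, \ldots, 9\},
\qquad \sum_{k=0}^{9} p_k(x) = 1
```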
The number of parameters defining a distribution is often an indicator of its flexibility. The Poisson distribution, for example, has only one parameter (often called the rate), while the ZIP distribution has two parameters (the rate and the mixing proportion); in chapter 5, you saw that you could achieve a better model for the camper data by using the ZIP distribution instead of the Poisson distribution as the conditional probability distribution (CPD). By this criterion, the multinomial distribution is especially flexible because it has as many parameters as possible values (or actually one parameter fewer, because the probabilities need to sum to one). In the MNIST example, you used an image as input to predict a multinomial CPD for the categorical outcome. The predicted multinomial CPD has ten (or, more correctly, nine) parameters, giving the probabilities of the ten possible classes (see figure 6.1).
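This is easy to see in code. The following sketch (not from the book; the logits are made up for illustration) shows how a softmax maps ten raw network outputs to ten probabilities that sum to one, which is why only nine of the ten parameters are free:

```python
import numpy as np

def softmax(z):
    """Map raw network outputs (logits) to probabilities."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits, as a network's last layer might produce for one image
logits = np.array([1.2, -0.3, 0.5, 2.1, 0.0, -1.0, 0.7, 0.1, 1.5, -0.5])
probs = softmax(logits)

print(probs.sum())     # prints 1.0: the ten probabilities sum to one,
                       # so the tenth is determined by the other nine
print(probs.argmax())  # index of the class with highest probability
```

In a Keras model, the same effect comes from giving the final `Dense(10)` layer a `softmax` activation.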
Figure 6.1 Multinomial distribution with ten classes:
Indeed, a digit-classification CNN using the multinomial distribution became one of the first and most heavily used real-world applications of DL models. In 1998, Yann LeCun, who was then working at AT&T Bell Laboratories, implemented a CNN for ZIP code recognition, known as LeNet-5.