concept histogram in category data

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook.
The bin-based plot we just described is called a histogram. Histograms are easy to display in Matplotlib using the plt.hist method. The method takes as input the sequence of values to be binned, as well as an optional bins parameter. That parameter specifies the number of bins used to group the data. Thus, calling plt.hist(frequency_array, bins=77) will split our data across 77 bins, each covering a width of .01 units. Alternatively, we can pass in bins='auto', and Matplotlib will select an appropriate bin-width using a widely accepted optimization technique (the details of which are beyond the scope of this book). Let's plot a histogram while optimizing bin-width by calling plt.hist(frequency_array, bins='auto'). NOTE: Within the code below, we also include an edgecolor='black' parameter. This helps us visually distinguish the boundaries between bins by coloring the bin-edges in black.
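The two ways of setting bins described above can be compared side by side. The sketch below is illustrative only: frequency_array is not defined in this excerpt, so a synthetic stand-in is generated here (500 head-frequencies from a simulated biased coin, an assumption), and the non-interactive Agg backend is selected so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Assumption: a synthetic stand-in for frequency_array. Each value is the
# frequency of heads across 1,000 flips of a coin biased 70% toward heads.
rng = np.random.default_rng(0)
frequency_array = rng.binomial(1000, 0.7, size=500) / 1000

# An integer requests exactly that many equal-width bins.
counts_77, edges_77, _ = plt.hist(frequency_array, bins=77, edgecolor='black')
plt.close()

# The string 'auto' lets the number of bins be chosen from the data.
counts_auto, edges_auto, _ = plt.hist(frequency_array, bins='auto',
                                      edgecolor='black')
plt.close()

print(f"bins=77     -> {len(counts_77)} bins")
print(f"bins='auto' -> {len(counts_auto)} bins")
```

Note that an integer bin count is passed as a number, not as a quoted string; only named strategies such as 'auto' are given as strings.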
Listing 3.23. Using argmax to return a histogram’s peak
output_bin_coverage(counts.argmax())
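As a rough illustration of how argmax locates the peak, the sketch below bins a synthetic stand-in for frequency_array with np.histogram, which Matplotlib's hist relies on to bin its data. The output_bin_coverage function from the listing is not shown in this excerpt, so describe_bin is a hypothetical replacement written only for this example.

```python
import numpy as np

# Assumption: synthetic stand-in data, since frequency_array is defined elsewhere.
rng = np.random.default_rng(0)
frequency_array = rng.binomial(1000, 0.7, size=500) / 1000

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, bin_edges = np.histogram(frequency_array, bins='auto')


def describe_bin(index):
    """Hypothetical helper: return the start, end, and count of one bin."""
    return bin_edges[index], bin_edges[index + 1], counts[index]


# argmax returns the index of the largest count, i.e. the histogram's peak.
peak_index = counts.argmax()
start, end, count = describe_bin(peak_index)
print(f"Peak bin {peak_index} spans {start:.3f} to {end:.3f} "
      f"and holds {count} of {counts.sum()} values")
```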
Listing 3.24. Plotting a histogram’s relative likelihoods
likelihoods, bin_edges, _ = plt.hist(frequency_array, bins='auto', edgecolor='black', density=True)
plt.xlabel('Binned Frequency')
plt.ylabel('Relative Likelihood')
plt.show()

Figure 3.4. A histogram of 500 binned frequencies plotted against their associated relative likelihoods. The area of the histogram sums to 1.0. That area can be computed by summing over the rectangular areas of each bin.
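To make the effect of density=True concrete, the following sketch reproduces the relative likelihoods by hand: each count is divided by the total number of values times the bin width. It uses np.histogram on synthetic stand-in data (an assumption, since frequency_array is defined elsewhere) and assumes equal-width bins, which is what bins='auto' produces.

```python
import numpy as np

# Assumption: synthetic stand-in data, since frequency_array is defined elsewhere.
rng = np.random.default_rng(0)
frequency_array = rng.binomial(1000, 0.7, size=500) / 1000

# Raw counts and density-normalized heights over the same bins.
counts, bin_edges = np.histogram(frequency_array, bins='auto')
likelihoods, _ = np.histogram(frequency_array, bins='auto', density=True)

# All bins share one width when the bin count is chosen automatically.
bin_width = bin_edges[1] - bin_edges[0]

# Relative likelihood = count / (total number of values * bin width).
manual_likelihoods = counts / (counts.sum() * bin_width)

print(np.allclose(likelihoods, manual_likelihoods))
```

The comparison uses np.allclose rather than exact equality because the two computations can differ by floating-point rounding.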
In our new histogram, the counts have been replaced by relative likelihoods, which are stored within the likelihoods array. As mentioned previously, relative likelihood is a term applied to the y-values of a plot whose area sums to 1.0. Of course, the area beneath our histogram now sums to 1.0. We can compute that area by summing the area of each bin. The rectangular area of each bin is equal to its vertical likelihood-value multiplied by bin_width. Hence, the area beneath the histogram is equal to the summed likelihoods multiplied by bin_width. We can calculate the summed likelihoods by calling likelihoods.sum(). Consequently, the area equals likelihoods.sum() * bin_width, which equals 1.0.
Listing 3.25. Computing the total area under a histogram
assert likelihoods.sum() * bin_width == 1.0
The histogram’s total area sums to 1.0. Thus, the area beneath the histogram’s peak is now a probability. As previously discussed, this is the probability of a randomly sampled frequency falling within the 0.694 - 0.699 interval range. We can compute that probability by calculating the area of the bin positioned at likelihoods.argmax().
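The area calculations described above can be sketched as follows. This is a minimal example on synthetic stand-in data (an assumption, since frequency_array is defined elsewhere), so the peak interval it finds will not necessarily match the 0.694 - 0.699 range quoted in the text. It compares floating-point areas with np.isclose, since an exact == check can be sensitive to rounding.

```python
import numpy as np

# Assumption: synthetic stand-in data, since frequency_array is defined elsewhere.
rng = np.random.default_rng(0)
frequency_array = rng.binomial(1000, 0.7, size=500) / 1000

likelihoods, bin_edges = np.histogram(frequency_array, bins='auto', density=True)
bin_widths = np.diff(bin_edges)  # width of every bin (all equal here)

# Total area: sum of each bin's height times its width.
total_area = (likelihoods * bin_widths).sum()

# Probability of landing in the peak bin: the area of that single rectangle.
peak_index = likelihoods.argmax()
peak_probability = likelihoods[peak_index] * bin_widths[peak_index]

# Cross-check: the same probability is the fraction of values in the peak bin.
counts, _ = np.histogram(frequency_array, bins=bin_edges)
peak_fraction = counts[peak_index] / counts.sum()

print(f"Total area: {total_area:.6f}")
print(f"Peak bin {bin_edges[peak_index]:.3f} to {bin_edges[peak_index + 1]:.3f} "
      f"has probability {peak_probability:.3f}")
```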