concept histogram in category data

appears as: Histograms, histogram, histogram, The histogram, A histogram
Data Science Bookcamp: Five Python Projects MEAP V04 livebook

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook.

The binned-based plot we just described is called a histogram. Histograms are easy to display in Matplotlib using the plt.hist method. The method takes as input the sequence of values to be binned, as well an optional bins parameter. That parameter specifies the number of bins used to group the data. Thus, calling plt.hist(frequency_array, bins='77') will split our data across 77 bins, each covering a width of .01 units. Also, we can optionally pass in bin=auto, and Matplotlib will select an appropriate bin-width using a widely accepted optimization technique (the details of which are beyond the scope of this book). Lets plot a histogram while optimizing bin-width by calling plt.hist(frequency_array, bins='auto'). NOTE: Within the code below, we also include an edgecolor='black' parameter. This helps us visually distinguish the boundaries between bins by coloring the bin-edges in black.

Listing 3.23. Using argmax to return a histogram’s peak
output_bin_coverage(counts.argmax())
Listing 3.24. Plotting a histogram’s relative likelihoods
likelihoods, bin_edges, _ = plt.hist(frequency_array, bins='auto', edgecolor='black', density=True)
plt.xlabel('Binned Frequency')
plt.ylabel('Relative Likelihood')
plt.show()
Figure 3.4. A histogram of 500 binned frequencies plotted against their associated relative likelihoods. The area of the histogram sums to 1.0. That area can be computed by summing over the rectangular areas of each bin.
fig3 4

In our new histogram, the counts have been replaced by relative likelihoods, which are stored within the likelihoods array. As mentioned previously, relative likelihood is a term applied to the y-values of a plot whose area sums to 1.0. Of course, the area beneath our histogram now sums to 1.0. We can compute that area by summing the area of each bin. The rectangular area of each bin is equal to its vertical likelihood-value multiplied by bin_width. Hence, the area beneath the histogram is equal to the summed likelihoods multiplied by bin_width. We can calculate the summed likelihoods by calling likelihoods.sum(). Consequently, the area equals likelihoods.sum() * bin_width, which equals 1.0.

Listing 3.25. Computing the total area under a histogram
assert likelihoods.sum() * bin_width == 1.0

The histogram’s total area sums to 1.0. Thus, the area beneath the histogram’s peak is now a probability. As previously discussed, this is the probability of a randomly sampled frequency falling with the 0.694 - 0.699 interval range. We can compute that probability by calculating the area of the bin positioned at likelihoods.argmax().

sitemap

Unable to load book!

The book could not be loaded.

(try again in a couple of minutes)

manning.com homepage