chapter three

3 Grokking Deeper: Where did the data come from?

“Probability is the intersection of the most rigorous mathematics and the messiest of life”

Nassim Taleb, Lebanese-American statistician

In the previous chapter, we ended with a look at how the data within the sample is distributed. Understanding that distribution is key to choosing the correct model for a problem. Because we are interested in models that work well not just on the sample data but on any other data from the same population, we now leap beyond the confines of the sample and into the realm of the population from which the sample was drawn.

This chapter uses the descriptive statistics we generated in the previous chapter to explore and approximate the original population of the data. By approximating the populations our data came from, we acquire the tools to build a much better model for our diabetes clinic problem: a model that is 31% more accurate than the perceptron one.
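To make the idea of approximating a population from a sample concrete, here is a minimal sketch. It assumes a purely synthetic, normally distributed population (the values 120 and 15 are made up for illustration and are not the diabetes clinic data), draws a small sample from it, and compares the sample's descriptive statistics with those of the full population.

```python
import random
import statistics

# Hypothetical population: synthetic measurements, NOT the book's dataset.
rng = random.Random(0)
population = [rng.gauss(120.0, 15.0) for _ in range(100_000)]

# In practice we only ever see a sample drawn from the population.
sample = rng.sample(population, 500)

# Descriptive statistics of the sample act as estimates of the population's.
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)  # sample standard deviation (n - 1)

# Here we can check the estimates only because we built the population ourselves.
population_mean = statistics.mean(population)
population_std = statistics.pstdev(population)

print(f"mean: sample {sample_mean:.1f}, population {population_mean:.1f}")
print(f"std:  sample {sample_std:.1f}, population {population_std:.1f}")
```

With a sample of a few hundred points, the two sets of numbers should land close to each other. That closeness is what lets us reason about the population while only holding the sample.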

This chapter delves into probability theory, which can make it look a bit more mathy than the chapters before it. But as we established before, mathy does not mean inaccessible. Remember to be patient, take your time, and keep a piece of paper and a pencil beside you to doodle around with the equations and examples. In the end, you'll be well rewarded by the results of our new model!

3.1 Probability and Distributions

3.1.1 Random Variables, Distributions, and their Properties

3.1.2 How to read math?

3.1.3 Expectation, Variance, and Estimations

3.2 Conditional Probability

3.2.1 The Bayes Rule

3.2.2 Independent Random Variables

3.3 Applying the Naive Bayes Model with scikit-learn