chapter two

2 Meet your data!

This chapter covers

Understanding data and what it represents
Understanding the relationship between a sample and a population
Common measures for describing data

In statistics, we usually work with structured data, typically in tabular form, which analysis tools such as Pandas and Excel are built to handle. Unstructured data, such as documents, audio, or images, is a different matter. In an image, each pixel acts as a variable (or three, if you count the R, G, and B values), so the sheer number of variables makes manual statistical analysis impractical. Such data is best left to perceptual models, such as deep learning models and large language models. Statistics plays a significant role inside these models, but their internal workings are largely inaccessible to direct analysis. We can still use statistics to verify their outputs, though.
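To get a feel for "the sheer number of variables," here is a rough back-of-the-envelope sketch in Python. The image dimensions are an assumption chosen only for illustration; any other size works the same way.

```python
# Hypothetical image size, chosen only for illustration.
width, height = 1920, 1080
channels = 3  # R, G, and B values per pixel

# One variable per pixel...
pixels = width * height
# ...or three per pixel if each color channel counts separately.
values = pixels * channels

print(f"Pixels (one variable each): {pixels:,}")
print(f"Channel values (three per pixel): {values:,}")
```

Even a single image of this assumed size carries millions of values, which is why inspecting such data column by column in a spreadsheet is not realistic.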

In the examples in this chapter, we will often focus on a single set of data for a single variable, such as temperatures or exam scores. When working with real-world data, however, you will often have tabular data that includes multiple variables in the form of columns, each representing different attributes or measurements. We will see an example of this at the end of the chapter.
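As a small preview, the difference between a single variable and tabular data might be sketched like this in plain Python. All names and values below are made up for illustration; in practice a library such as Pandas would typically hold the table.

```python
# Single variable: one list of made-up exam scores.
exam_scores = [72, 85, 90, 66, 78]

# Tabular data: each dict is a row, and each key is a column (variable).
# The column names and values are invented for illustration.
rows = [
    {"student": "A", "exam_score": 72, "hours_studied": 5},
    {"student": "B", "exam_score": 85, "hours_studied": 9},
    {"student": "C", "exam_score": 90, "hours_studied": 12},
    {"student": "D", "exam_score": 66, "hours_studied": 3},
    {"student": "E", "exam_score": 78, "hours_studied": 7},
]

# The variables available in the table are its column names.
variables = list(rows[0].keys())

# Pulling out one column gives a single-variable data set again.
column = [row["exam_score"] for row in rows]

print(variables)
print(column)
```

Extracting one column from the table recovers exactly the kind of single-variable list that most of this chapter's examples use.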

Before we get into computing measures with data, let’s start with some critical qualitative questions, such as “What is data anyway?”

What is data?

Family photo experiment

Where does data come from?

Samples and populations

What is a population?

What is a sample?

Sampling bias

Basic number theory

Natural numbers

Integers

Real numbers

Mean, median, percentile, and mode

Mean

Median

Percentile

Mode

Variance and standard deviation

Variance

Standard deviation

Real-world example

Summary

Citations