chapter two

2 Exploring tabular datasets

This chapter covers

Row and column characteristics in a tabular dataset
Possible pathologies and remedies for tabular datasets
Finding tabular data externally on the internet and internally in organizations
Exploring data to solve common problems in tabular data

Tabular data can describe practically anything, from low-level scientific research to consumer behavior on a website to the statistics in your fantasy sports league. In the end, though, the commonalities in tabular data outweigh the differences, and you can accomplish most data analysis tasks by applying standard approaches and tools, even without much domain expertise.

In this chapter, we’ll look at how to gather and prepare tabular datasets. We’ll also work through a practical exploratory data analysis that shows how to look at data from different viewpoints: by rows, by columns, in light of the relationships between features, and in terms of their overall distribution in the dataset. For that example, we’ll use a simple toy dataset, the Auto MPG Data Set, which is freely available from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/9/auto+mpg).
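As a preview of those four viewpoints, here is a minimal sketch in pandas. It is an illustration under stated assumptions, not the chapter's own listing. The helper names `load_auto_mpg` and `summarize` are hypothetical, as is the local file path. The column names follow the attribute list in the UCI documentation, with underscores added. The small `toy` frame holds made-up values so that the sketch runs without downloading anything.

```python
import pandas as pd

# Column names adapted from the UCI attribute list (underscores are our choice).
AUTO_MPG_COLUMNS = [
    "mpg", "cylinders", "displacement", "horsepower",
    "weight", "acceleration", "model_year", "origin", "car_name",
]


def load_auto_mpg(path: str) -> pd.DataFrame:
    """Hypothetical loader, assuming a local copy of the raw data file.

    Assumes the file is whitespace-delimited, has no header row,
    and marks missing values with "?".
    """
    return pd.read_csv(path, sep=r"\s+", names=AUTO_MPG_COLUMNS, na_values="?")


def summarize(df: pd.DataFrame) -> dict:
    """Collect quick views of a table from the four viewpoints."""
    numeric = df.select_dtypes(include="number")
    return {
        # By rows: number of examples and number of exact duplicate rows.
        "n_rows": len(df),
        "n_duplicate_rows": int(df.duplicated().sum()),
        # By columns: data type and count of missing values per column.
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        # Relationships between features: pairwise Pearson correlations.
        "correlations": numeric.corr(),
        # Overall distributions: count, mean, std, min, quartiles, max.
        "distributions": numeric.describe(),
    }


# Made-up values for illustration only; these are NOT rows from Auto MPG.
toy = pd.DataFrame({
    "mpg": [20.0, 30.0, 25.0, 20.0],
    "weight": [3500.0, 2000.0, None, 3500.0],
    "origin": [1, 3, 2, 1],
})

report = summarize(toy)
print(report["n_rows"], report["n_duplicate_rows"])
print(report["missing"])
print(report["correlations"])
```

With a local copy of the dataset, `summarize(load_auto_mpg("auto-mpg.data"))` would produce the same kind of report, provided the file matches the layout assumed above.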

2.1 Row and column characteristics

2.1.1 The ideal criteria for tabular rows

2.1.2 The ideal criteria for tabular columns

2.1.3 Representing rows and columns

2.2 Pathologies and remedies

2.2.1 Constant or quasi-constant columns

2.2.2 Duplicated and highly collinear features

2.2.3 Irrelevant features

2.2.4 Missing data

2.2.5 Rare categories

2.2.6 Errors in data

2.2.7 Leakage features

2.3 Finding external and internal data

2.3.1 Using pandas to access data stores

2.3.2 Internet data

2.3.3 Synthetic data

2.4 Exploratory data analysis

2.4.1 Loading the Auto MPG example dataset

2.4.2 Examining labels, values, distributions

2.4.3 Exploring bivariate and multivariate relationships

Summary