chapter two

2 Finding, collecting, and exploring tabular datasets

This chapter covers

Examining row and column characteristics in a tabular dataset
Pointing out possible pathologies and remedies for tabular datasets
Finding tabular data externally on the Internet and internally in organizations
Exploring data to solve common problems in tabular data

In chapter 1, we introduced the differences between deep learning and machine learning approaches to tabular data problems. In this chapter, we set models aside for a while and take a more hands-on, in-depth look at what tabular datasets are. We focus on how tabular data is structured and on the characteristics common to all tabular problems. This matters because the structure and characteristics of tabular data determine the methods and tools that you can use successfully in your projects.

Our first topic is what tabular datasets have in common, which is what makes it possible to write a book that discusses tabular data and models in general terms. The commonalities lie in how the data is structured, not in the varied content of the features, which is a matter of domain knowledge. Hence, we will look at how rows and columns organize a tabular dataset, what data types they contain, and what problems may arise when you get tabular data from the Web or from the data repositories in your organization.
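To make the idea of structure concrete, here is a minimal sketch of how rows, columns, and data types can be inspected with pandas. The toy table and its column names are hypothetical and serve only as an illustration; they are not taken from any dataset used later in the chapter.

```python
import pandas as pd

# Hypothetical toy table: each row is one example, each column is one feature.
df = pd.DataFrame(
    {
        "model": ["alpha", "beta", "gamma"],
        "cylinders": [4, 6, 8],
        "mpg": [31.5, 22.0, 15.2],
    }
)

# The shape gives the number of rows (examples) and columns (features).
n_rows, n_columns = df.shape

# Column data types describe the structure, independently of what the values mean.
numeric_columns = df.select_dtypes(include="number").columns.tolist()
other_columns = df.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows x {n_columns} columns")
print("numeric columns:", numeric_columns)
print("non-numeric columns:", other_columns)
```

The same three questions (how many rows, how many columns, and which data type each column has) apply to any tabular dataset, whatever its domain.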

2.1 Tabular dataset row and column characteristics

2.1.1 Discussing ideal criteria for tabular rows

2.1.2 Discussing ideal criteria for tabular columns

2.1.3 Representing rows and columns

2.2 Possible pathologies and remedies for tabular datasets

2.2.1 Avoiding constant or quasi-constant columns

2.2.2 Avoiding duplicated and highly collinear features

2.2.3 Avoiding irrelevant features

2.2.4 Handling missing data

2.2.5 Dealing with rare categories

2.2.6 Spotting errors in data

2.2.7 Excluding leakage features

2.3 Finding tabular data externally and internally

2.3.1 Leveraging pandas to access data storages

2.3.2 Acquiring data from the Internet

2.3.3 Generating synthetic data

2.4 Exploratory data analysis on tabular datasets

2.4.1 Loading the Auto MPG example dataset

2.4.2 Examining labels, values, distributions

2.4.3 Exploring bivariate and multivariate relationships

2.5 Summary