5 Data Analysis and Manipulation

This chapter covers

  • Importing and analyzing a dataset as a data frame
  • Converting nonnumerical variables to numerical values
  • Summarizing variables in a data frame
  • Dealing with missing and outlier values
  • Normalizing or scaling variables
  • Analyzing pairwise correlation between variables

In the previous chapter, we saw how to read data from different sources and export it in different formats, either for sharing or for later use.

In this chapter, we go one step further and analyze the data we import. We will see how to get to know our data and draw insights from it, and how to manipulate it to prepare it for modeling. All of these steps correspond to the Analysis part in Figure 1.1.
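
To make these preparation steps concrete before the project begins, here is a minimal sketch in Python using only the standard library. It is not the chapter's own code. The records, the column names (`age`, `income`, `housing`, `default`), and the helper functions are all hypothetical and invented for illustration, and a plain list of dictionaries stands in for a data frame. Median imputation and z-score scaling are just one possible choice for handling missing values and scaling.

```python
from statistics import mean, median, pstdev

# Hypothetical credit-application records (invented for illustration only).
records = [
    {"age": 25, "income": 30.0, "housing": "rent", "default": 1},
    {"age": 40, "income": 60.0, "housing": "own", "default": 0},
    {"age": 35, "income": None, "housing": "own", "default": 0},
    {"age": 50, "income": 90.0, "housing": "rent", "default": 0},
]

def encode_categories(values):
    """Map each distinct label to an integer code, in sorted label order."""
    codes = {label: i for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

def fill_missing(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Convert the nonnumerical column, fill the gap, scale, then correlate.
housing, housing_codes = encode_categories([r["housing"] for r in records])
income = fill_missing([r["income"] for r in records])
income_z = standardize(income)
default = [r["default"] for r in records]
corr = pearson(income, default)

print("housing codes:", housing_codes)
print("income (filled):", income)
print("income (z-scores):", [round(z, 3) for z in income_z])
print("corr(income, default):", round(corr, 3))
```

With a real dataset, a data-frame library would normally take care of these operations. The sketch only shows what each step computes.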

5.1 Project Description

One of the most challenging tasks for banks is assessing the creditworthiness of their customers. Banks and other financial institutions develop credit scoring models for this purpose. The first step in developing a credit scoring model is collecting the data. Among other reasons, banks keep historical data on credit applications for later analysis and model development. Credit application data contains two parts:

5.2 Import Files

5.3 Remove Duplicates

5.4 Combine Input and Output Data

5.5 Convert Nonnumerical Data

5.6 Summarize the Data

5.7 Missing Data

5.8 Outliers

5.9 Standardize or Scale Data

5.10 Correlation Between Input Features and Output

5.11 Summary