3 Fraud detection on tabular data using classical ML

This chapter covers

  • The advantage of machine learning over rules for detecting fraud
  • Analyzing an online transaction fraud dataset
  • Feature engineering and extraction
  • Using a random forest to detect fraudulent transactions
  • Using gradient-boosted trees for fraud detection
  • Deploying a random forest model with Flask and Docker

We are living in a flourishing era of machine learning. Have you noticed how rarely spam emails reach your inbox (as opposed to the spam folder)? How weather forecasts are mostly in the ballpark? How commute predictions from navigation apps are mostly accurate? All of these applications are powered by machine learning.

Over a decade ago, the phrase "software is eating the world" (https://a16z.com/2011/08/20/why-software-is-eating-the-world/) became popular, implying the penetration of software technology across all industries. A similar phenomenon is happening now: machine learning has become the new software, or "Software 2.0," a term coined by Andrej Karpathy (https://karpathy.medium.com/software-2-0-a64152b37c35).

3.1 Machine learning versus rules

3.2 Understanding online transaction fraud

3.3 Analyzing an online transaction fraud dataset

3.3.1 Loading and cleaning the fraud dataset

3.3.2 Analyzing input features and target output

3.3.3 Processing categorical features for ML

3.4 Building a random forest-based fraud solution

3.4.1 Splitting the dataset into train and test sets

3.4.2 Training a random forest model

3.4.3 Evaluating the trained random forest

3.4.4 Interpreting the random forest model

3.5 Building a gradient-boosted trees (GBT) fraud solution

3.5.1 Training a GBT model

3.5.2 Comparing GBT and random forest performance

3.5.3 Interpreting the GBT model

3.6 Engineering and analyzing new features from existing data

3.6.1 Deriving new string features and encoding them to numbers

3.6.2 Deriving new numerical features

3.6.3 Analyzing correlations between new and original features

3.7 Building a final random forest-based fraud detector

3.7.1 Splitting the dataset into train and test sets

3.7.2 Training the best random forest model

3.7.3 Comparing best model performance with previous models

3.7.4 Interpreting the final random forest model

3.7.5 Defining the optimal model threshold

3.8 Deploying the random forest model as a service

3.8.1 Saving the best-trained random forest model

3.8.2 Writing Flask code for model inference in a service

3.8.3 Using Docker to containerize the ML service code

3.8.4 Sending live requests to a dockerized model service

3.9 Summary