8 Detect fake insurance claims using gradient-boosted trees

This chapter covers

Analyzing a car insurance fraud dataset
Building gradient-boosted tree (GBT) models
Comparing GBT implementations: scikit-learn, XGBoost, LGBM, and CatBoost

There is a good chance that you have bought at least one insurance policy, be it health insurance, car insurance, life insurance, or something more specific. An insurance policy involves a buyer (you), a seller (an insurance company), and often an intermediary (for example, a hospital in the case of health insurance). Insurance fraud is an illegal act committed by any of these parties to obtain a payout or benefit they are not entitled to. While you are a genuine insurance buyer, there are plenty of fraudsters out there, making insurance fraud a multi-billion-dollar industry. According to a 2022 report by the Coalition Against Insurance Fraud (https://insurancefraud.org/wp-content/uploads/The-Impact-of-Insurance-Fraud-on-the-U.S.-Economy-Report-2022-8.26.2022.pdf), insurance fraud costs the US an estimated $308.6 billion annually. Insurance fraud takes many forms, as shown in figure 8.1.

Figure 8.1 Most common types of insurance fraud.

8.1 Exploring the car insurance fraud dataset

8.1.1 Loading and understanding the dataset

8.1.2 Cleaning the fraud dataset

8.1.3 Analyzing and processing different dataset features

8.2 Building a GBT model to detect insurance fraud

8.2.1 Preparing the training dataset

8.2.2 Training the scikit-learn GBT model

8.2.3 Evaluating the GBT model

8.2.4 Interpreting the trained GBT model

8.3 Deciding which GBT implementation to use

8.3.1 Using an XGBoost model instead of the scikit-learn GBT model

8.3.2 Using LGBM to detect car insurance fraud

8.3.3 Using CatBoost with processed categorical dataset features

8.3.4 Using CatBoost with raw categorical features

8.3.5 Summarizing results from different GBT implementations

8.4 Summary