9 Supervised machine learning with Random Forest and XGBoost

This chapter covers

Introducing supervised machine learning (ML) and how it relates to threat hunting
Applying supervised ML for threat hunting
The importance of training data sets in supervised ML
Acquiring and processing reliable training data sets
Practicing threat hunting with supervised ML
Evaluating and comparing supervised ML models
Comparing supervised and unsupervised ML

Chapter 8 introduced unsupervised ML and used a k-means clustering model to group similar data points. Investigating the events mapped to the small clusters led us to uncover malicious activity. In this chapter, we introduce supervised ML and compare it with unsupervised ML in the context of threat hunting. We also identify the prerequisites for operating supervised ML effectively, some of which translate into operational challenges that threat hunters should be aware of.

9.1 Hunting DNS tunneling

9.2 Supervised machine learning

9.2.1 Acquiring the training data set

9.2.2 Analyzing the data set

9.2.3 Extracting the features

9.2.4 Analyzing the features

9.2.5 Reducing features

9.3 Random Forest

9.3.1 Generating the Random Forest model

9.3.2 Testing the Random Forest model

9.3.3 Hunting with the Random Forest model

9.3.4 Downloading DNS events and extracting features

9.3.5 Engaging the model

9.3.6 Investigating events

9.4 XGBoost

9.4.1 Generating the XGBoost model

9.4.2 Testing the XGBoost model

9.4.3 Hunting with the XGBoost model

9.5 Exercises

9.6 Answers to exercises

Summary