chapter eight

8 Unsupervised machine learning with k-means

This chapter covers

Fundamental machine learning (ML) concepts
Applying unsupervised ML for threat hunting
Exploring, processing, and preparing data for ML
Selecting features
Encoding non-numeric fields using one-hot encoding
Identifying highly correlated features
Using k-means to uncover command-and-control communication in network traffic

So far, we have conducted threat-hunting expeditions based on explicit logic (such as detecting signs of beaconing by calculating the time difference between connections) and then developed searches (such as search commands for a data store) or code (such as Python code in Jupyter notebooks) to apply that logic to data. In this chapter, we do the reverse: we let the data inform us about anomalies by applying unsupervised ML constructs to it. Some of the anomalies uncovered this way could be malicious and are therefore of interest to threat hunters.

This is an advanced chapter in which we explore and process data, extract features, build unsupervised ML models using k-means (an algorithm introduced in this chapter), and interpret the outputs to uncover anomalies of interest in network connection events. The concepts in this chapter are essential building blocks for the more sophisticated ML models in the following chapters.
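To make the workflow described above more concrete before diving into the sections, here is a minimal sketch of its main steps: one-hot encoding a non-numeric field, scaling features, and clustering with k-means. It is an illustration under stated assumptions, not the chapter's actual code. It uses only the Python standard library, the connection events and their field names are made up, and the helper functions `one_hot`, `standardize`, and `kmeans` are hypothetical names written for this example.

```python
import math
import random

# Hypothetical connection events: (protocol, bytes_sent, duration_seconds).
# These values are invented for illustration only.
events = [
    ("tcp", 500, 1.0), ("tcp", 520, 1.2), ("tcp", 480, 0.9),
    ("udp", 90000, 30.0), ("udp", 91000, 31.0), ("udp", 89500, 29.5),
]

def one_hot(values):
    """Encode each categorical value as a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def standardize(rows):
    """Scale each column to zero mean and unit variance (constant columns become 0)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
            for col, m in zip(columns, means)]
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(row, means, stds)]
            for row in rows]

def kmeans(points, k, iterations=50, seed=0):
    """Basic k-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k data points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centroids.append(centroids[j])  # keep centroid of an empty cluster
        if new_centroids == centroids:  # stop once the centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Build the feature matrix: one-hot protocol columns plus the numeric fields.
encoded = one_hot([event[0] for event in events])
features = [enc + [float(event[1]), float(event[2])]
            for enc, event in zip(encoded, events)]
scaled = standardize(features)

labels, centroids = kmeans(scaled, k=2)
print(labels)  # one cluster label per event
```

The sketch scales features before clustering because k-means relies on distances, so an unscaled field with a large range (such as bytes sent) would otherwise dominate the result. In practice, a library implementation would normally replace these hand-written helpers.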

8.1 Beaconing with random jitter to a trusted destination

8.1.1 Getting comfortable with the data

8.1.2 Loading the data set

8.1.3 Exploring and processing the data set

8.1.4 Looking for empty fields

8.1.5 Looking for fields with a large number of unique values

8.1.6 Looking for highly correlated fields

8.1.7 Converting non-numerical fields to numerical

8.1.8 Calculating correlation

8.2 K-means clustering

8.2.1 How does k-means work?

8.2.2 Feature scaling

8.2.3 Determining the number of clusters, k

8.2.4 Applying k-means clustering

8.3 Analyzing clusters of interest

8.3.1 Cluster 2

8.3.2 Cluster 0

8.4 Silhouette analysis as an alternative to the elbow method

8.5 K-means with k = 6

8.5.1 Cluster 2