8 Unsupervised machine learning with k-means
This chapter covers
- Fundamental machine learning (ML) concepts
- Applying unsupervised ML for threat hunting
- Exploring, processing, and preparing data for ML
- Selecting features
- Encoding non-numeric fields using one-hot encoding
- Identifying highly correlated features
- Using k-means to uncover command-and-control communication in network traffic
So far, we have conducted threat-hunting expeditions based on explicit logic (such as detecting signs of beaconing by calculating the time difference between connections) and then developed searches (such as search commands for a data store) or code (such as Python code in Jupyter notebooks) to apply that logic to data. In this chapter, we do the reverse: we let the data inform us about anomalies by applying unsupervised ML constructs to it. Some of the anomalies uncovered this way could be malicious and are therefore of interest to threat hunters.
This is an advanced chapter in which we explore and process data, extract features, build unsupervised ML models, and interpret their outputs. The models use k-means, an algorithm introduced in this chapter, to uncover anomalies of interest in network connection events. The concepts covered here are essential building blocks for the more sophisticated ML models in the following chapters.