5 Improve phishing detection with machine learning

This chapter covers

Collecting and cleaning the phishing dataset for machine learning
Training, evaluating, and iterating over ML models
Running an ML-based phishing detection system

With the rapid advances in large language models (LLMs) in 2023, in the form of products like ChatGPT, it is easy to forget that not so long ago we relied on hand-written rules to solve language-related tasks such as sentiment analysis, as shown in figure 5.1.

Figure 5.1 Excerpts from a 2015 research paper, “Rule-Based Sentiment Analysis for Financial News.” The paper lists sets of rules for processing different sequences of positive and negative words in a sentence to decide the sentence's overall sentiment. The same task is now done with machine learning, without a single hand-written rule.

For most language-related tasks, we now have far more sophisticated machine learning (ML) models that learn from data (such as text available across the internet) rather than from a long list of rules written by humans. Machine learning has let us solve increasingly complex problems with relatively little manual effort. In this chapter, we demonstrate this advantage of ML over rules by building an ML-based phishing detection system that performs much better than rule-based systems.
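To make the contrast concrete, here is a minimal sketch of both approaches on a toy sentiment task. It is not the method from the paper in figure 5.1 or the system built later in this chapter: the word lists, the negation rule, the four training sentences, and the function names (`rule_based_sentiment`, `train_naive_bayes`, `predict`) are all invented for illustration, and the learned model is a small from-scratch naive Bayes classifier chosen only because it fits in a few lines.

```python
import math
from collections import Counter

# --- Rule-based approach: hand-written word lists plus one negation rule ---
# (Illustrative lists, not taken from any published rule set.)
POSITIVE = {"gain", "growth", "profit", "strong"}
NEGATIVE = {"loss", "decline", "weak", "fraud"}
NEGATORS = {"no", "not", "never"}

def rule_based_sentiment(sentence):
    """Score a sentence with fixed rules; a negator flips the next sentiment word."""
    score, flip = 0, False
    for word in sentence.lower().split():
        if word in NEGATORS:
            flip = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if polarity:
            score += -polarity if flip else polarity
            flip = False
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# --- ML approach: word statistics are learned from labeled examples ---
def train_naive_bayes(examples):
    """Collect per-class word counts from (text, label) pairs."""
    word_counts, class_counts = {}, Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the most probable label, using add-one (Laplace) smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        logp = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                logp += math.log((counts[word] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Toy training data, invented for this sketch.
TRAIN = [
    ("profit rose on strong growth", "positive"),
    ("shares gain after record profit", "positive"),
    ("company reports heavy loss", "negative"),
    ("weak sales cause decline", "negative"),
]
model = train_naive_bayes(TRAIN)

print(rule_based_sentiment("not strong"))     # handled by the negation rule
print(rule_based_sentiment("shares surge"))   # no rule matches these words
print(predict(model, "record profit"))        # learned from the examples
```

The rule-based function only knows the words someone typed into its lists, so extending it means writing more rules. The learned model is extended by adding labeled examples, which is the property this chapter relies on for phishing detection.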

5.1 Collecting and cleaning phishing data

5.1.1 Data collection

5.1.2 Data cleaning

5.2 Model selection, training, and evaluation for phishing detection

5.2.1 Using supervised ML models

5.2.2 Using unsupervised ML models

5.3 Running a simple ML model as Python executable

5.4 Summary