chapter three

3 Ligand-based Screening: Machine Learning

This chapter covers

Walking through the end-to-end process of a machine learning project in the context of cardiotoxicity prediction.
Acquiring, curating, and standardizing molecule datasets.
Training and evaluating a linear model, and saving it for later use.
Improving our model with regularization and non-linear transformations.
Tuning hyperparameters with grid search and randomized search.

In the last chapter, we learned about compound filters and similarity searching in the context of ligand-based virtual screening. In this chapter, we will look at one way that machine learning (ML) fits into our virtual screening pipeline. The key stages in our workflow are illustrated in figure 3.1.

3.1 Problem Understanding

3.1.1 Your Machine Learning Task

3.2 Data Acquisition, Exploration, & Curation

3.2.1 Loading and Exploring the hERG Blockers Dataset

3.2.2 Validating & Standardizing SMILES

3.2.3 Feature Generation & Exploration

3.3 Application of Linear Models

3.3.1 Learning from Data

3.3.2 Training our Linear Model

3.3.3 Evaluating our Model

3.4 Improving our Model

3.4.1 Regularization

3.4.2 Non-linear Transformation

3.4.3 Hyperparameter Tuning

3.4.4 Evaluating the Best Model

3.4.5 Saving and Applying our Model

3.5 Summary

3.6 Exercises

3.7 References