chapter four

4 Solubility Deep Dive with Linear Models

This chapter covers

Solubility and how to model it with linear regression.
The mechanics of how linear models are trained.
A tour of linear models accessible via Scikit-Learn.
How to evaluate a model’s regression performance and applicability domain.
What causes overfitting and how to mitigate it by analyzing a model’s bias-variance trade-off.

In chapter 2, we reviewed common compound filters, such as Lipinski’s Rule of Five, whose criteria act as a proxy for a compound’s drug-likeness. Solubility is another critical factor in the development of pharmaceutical compounds. The ability of a drug candidate to dissolve in water and in biological fluids such as gastric juices directly affects its bioavailability, its efficacy, and ultimately its success as a therapeutic agent. Poor solubility can reduce absorption, which may necessitate higher doses and potentially cause adverse effects, while good solubility can enhance a drug’s therapeutic profile.

4.1 Solubility with Linear Regression

4.1.1 Load the Data

4.1.2 Target Variable Distribution

4.1.3 Feature Computation & Correlation

4.1.4 Linear Regression

4.2 The Learning Algorithm

4.2.1 Linear Models

4.2.2 Ordinary Least Squares (OLS)

4.2.3 Gradient Descent

4.3 Touring Scikit-Learn Linear Models

4.3.1 Defining a Benchmark

4.3.2 Ridge Regression & Feature Selection

4.3.3 Robust Estimation with RANSAC

4.3.4 Support Vector Regression

4.4 Bias-Variance Decomposition

4.4.1 A Case Study in Polynomials

4.4.2 Learning Curves

4.4.3 Validation Curves

4.5 Summary

4.6 Exercises

4.7 References