chapter two

2 Ligand-based Screening: Filtering & Similarity Searching

This chapter covers

Virtual screening taxonomy with focus on ligand-based screening.
How to acquire, curate, visualize, and represent molecule datasets.
Filtering out compounds with undesirable properties and substructures.
Similarity searching to uncover antimalarial hit compounds.

Having discussed how drug discovery and machine learning intersect to unearth novel therapeutics, we are ready to focus on specific components of the drug discovery pipeline. We begin with virtual screening, the computational alternative to experimental high-throughput screening in a lab (table 2.1 compares the two methods). With advances in robotics and miniaturization, high-throughput facilities can generate large amounts of experimental data and test up to millions of compounds in a reasonable amount of time. High-throughput screening is cheap for simple tests but expensive for complex assays. We can, however, use data from sources such as high-throughput screens to train machine learning models, and with these models we can quickly and affordably scale up to virtually screening billions of compounds.

2.1 What is Virtual Screening?

2.1.1 Virtual Screening Taxonomy

2.1.2 Scenario: Hit Identification of Antimalarial Compounds

2.1.3 Strategy: Similarity Searching

2.2 Loading a Virtual Screening Library

2.2.1 Understanding the Dataset as a Structure Data File

2.2.2 Molecule Sanitization

2.2.3 Molecular Descriptors

2.3 Compound Filters

2.3.1 Property-based Filters

2.3.2 Structure-based Filters

2.4 Fingerprints: Representing Molecules as Numbers

2.4.1 Structural Keys

2.4.2 Hashed Fingerprints

2.4.3 Fingerprinting our Library

2.5 Similarity Searching

2.5.1 Defining “Similarity”

2.5.2 Searching against a Query

2.6 Summary

2.7 Exercises

2.8 References