Appendix B. Chemical Data Repositories

One contributor to AI’s surge in drug discovery is the massive amount of data being generated. However, the availability of data can still impose a fundamental limitation on the development of ML and deep learning models. The space of druglike molecules is immense, biological targets are numerous and varied, chemical and biological properties are high-dimensional, and potential applications extend beyond drug discovery to healthcare and biological chemistry. As a result, it is not uncommon to commit to a research direction for which there is a dearth of data.

This appendix covers contemporary, publicly accessible chemical data repositories worth keeping track of as you pursue the projects and interests that may come to light while working through this book. Most of the databases we will look at contain existing compounds, though there are also large databases (up to billions of entries) of virtual molecules that do not exist but could be synthesized. We do not intend to provide a comprehensive reference of benchmark datasets, as these are in flux, and datasets for specific tasks are best found in the primary literature.

B.1 Data Sources

ChEMBL & ChEBI

B.2 Notes on Usage

B.3 Garbage In, Garbage Out

B.4 References