- Problem Statement
- Introduction
- Methods and Tools
- EDA and Observations
- Models and Results
- Next Steps
- Conclusions
- References
A pharmaceutical company (currently in stealth mode) wants to reduce the costs of new drugs by at least 25% over the next 3 years. To that end, I want to predict the cardiotoxicity of candidate compounds in advance of synthesizing them, thereby avoiding wasting resources on unproductive development.
For patients, new medicines offer fewer side effects, fewer hospitalizations, improved quality of life, increased productivity, and importantly, extended lives. Developing medicines is a long, complex process. On average, it takes at least ten years for a new medicine to complete the journey from initial discovery to the marketplace, with clinical trials alone taking six to seven years on average. The average cost to research and develop each successful drug is estimated to be $2.6 billion.1 This number incorporates the cost of failures – of the thousands and sometimes millions of compounds that may be screened and assessed early in the R&D process, only a few of which will ultimately receive approval. The overall probability of clinical success (the likelihood that a drug entering clinical testing will eventually be approved) is estimated to be less than 12%.
One of the reasons drugs fail in the clinic is because they exhibit unwanted and often serious side effects. For example, the unintended and often promiscous inhibition of the cardiac human Ether-à-go-go related gene (hERG) potassium channel is a common cause for either delay or removal of therapeutic compounds from development and withdrawal of marketed drugs.2 The clinical manifestion is prolongation of the duration between QRS complex and T-wave measured by surface electrocardiogram (ECG)—hence Long QT Syndrome. Machine learning techniques may be useful for predicting the level of hERG activity of a compound before money is spent on its development into a drug.
The hERGCentral (www.hergcentral.org) is mainly based on experimental data obtained from a primary screen by patch clamp electrophysiology against more than 300,000 structurally diverse compounds. The system is aimed to display and combine three resources: primary electrophysiological data, literature, as well as online reports and chemical library collections. 3,4
- Pandas
- Tensorflow
- Rdkit – for handling chemical structures and information
- DeepChem – for implementing ML models on chemical datasets
- Google Cloud Compute – for much-needed resources
RDKit was used to calculate 13 structural features: molecular weight, logP, polar surface area, no. of H-bond acceptors, # of H-bond donors, # of heteroatoms, # of aromatic rings, total # of rings, # of rotatable bonds, # of stereocenters, formal charge, length of SMILES string, and # of Rule-of-5 violations.
The averages for 10 of the 13 calculated structural features are as follows:
From the above, it seems that hERG inhibitors, on average, have a greater number of H-bond acceptors and fewer H-bond donors, suggesting that the binding site of hERG may have H-bond-donating residues (e.g., serine, threonine, proline). However, the differences are quite small, so it important to avoid over-interpreting that result. The data also suggests that hERG inhibitors tend to have more rings and tend to be more flexible (see # of rotatable bonds), though both attributes could be byproducts of a higher average molecular weight (see below).
Conventional wisdom states that hERG inhibitors tend to 1) have higher molecular weights, 2) be more hydrophobic, and 3) contain basic nitrogen groups. Exploratory data analysis seems to corroborate those ideas (also see violin plot at row 2, column 2 of the image above):
Compare with the active/inactive distribution of the full dataset:
Distributons of PSA, cLogP, and molecular weight:
However, a common issue with compound libraries is that molecular weight is often correlated with PSA and/or cLogP. Such collinearity is, to a large degree, an unintended consequence of the synthetic methods that are commonly employed for making larger compounds (e.g., aromatic groups are readily available and relatively easy to append to small molecules). Thus, it is not always clear that high-molecular weight per se is a liability with respect to cardiotoxicity. This is one of the reasons why metrics that incorporate ligand efficiency are popular among medicial chemists.
The present library exhibits some correlations between MW and cLogP and PSA, though they are not quite as pronounced as in many other libraries:
Embeddings from SMILES strings (TBD), fingerprints, and graph convolutions are herein compared. The fingerprint embeddings were added to a single layer, 4096-neuron perceptron, whereas the graph convolutions were added to a single layer of 256 neurons.
An initial model using the calculated features and their interactions terms (105 features in all) gave a model that was no better than baseline, with a Matthews Correlation Coefficient (MCC) of 0.
The results showed that geometric information was needed to obtain a model that performed any better than baseline. The embeddings form graph convolutions performed slightly better than those from extended-connectivity fingerprints, as determined by the MCC, but the model still needs a lot of work.
Features | Validation MCC | Validation accuracy |
---|---|---|
No Structure - calculated variables only | 0.0 | 0.955 |
ECFP fingerprints5 | 0.456 | 0.962 |
Graph convolutions6 | 0.516 | 0.959 |
The performance of the model will vary in accordance with the way the training and validation data are prepared.
Performance of the best model from the graph convolution embeddings is as follows:
The orange bars to the left of the center line represent false negatives, and the blue bars to the right of the center line represent false positives.
A confusion matrix for the model is shown below:
From the confusion matrix, the follow metrics can be calculated:
Specificity = 0.993
FPR = 0.007
Recall/TPR = 0.7351
Precision = 0.832
- Run tokenized SMILES strings through MLP
- Combine embeddings from SMILES, fingerprints, and graph convolutions in transfer learning models
- Cluster compounds and evaluate performance on each cluster
- Use an explanation technique (e.g., LIME, SHAP, Grad-CAM) to update the conventional wisdom
- Explore 3-D representations
- Incorporate into a generative model
- Drug discovery is a massive multi-parameter optimization problem that lends itself to various machine learning techniques
- The representation of chemical compounds can greatly affect the performance of machine learning models - graphs are particularly attractive solutions
- The quality and quantity of (publicly available) data are still major sources of contention – GIGO is particularly applicable to drug discovery
- In silico drug discovery is very resource-intensive
1. "Biopharmaceutical Research & Development: The Process Behind New Medicines" (http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf)↩︎
2. "Human Ether-a-go-go-Related Gene (hERG) Blockers" (https://www.cambridgemedchemconsulting.com/resources/herg_activity.html)↩︎
3. "Investigation of miscellaneous hERG inhibition in large diverse compound collection using automated patch-clamp assay" (https://www.nature.com/articles/aps2015143)↩︎
4. "hERGCentral: a large database to store, retrieve, and analyze compound-human Ether-à-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development" (https://pubmed.ncbi.nlm.nih.gov/22149888/)↩︎
5. "Extended-Connectivity Fingerprints" (https://pubs.acs.org/doi/10.1021/ci100050t)↩︎
6. "Convolutional Networks on Graphs for Learning Molecular Fingerprints" (https://arxiv.org/abs/1509.09292)↩︎