Capstone: Predicting Drug Cardiotoxicity

Kehinde Ajayi, Ph.D.

A pharmaceutical company (currently in stealth mode) wants to reduce the costs of new drugs by at least 25% over the next 3 years. To that end, I want to predict the cardiotoxicity of candidate compounds in advance of synthesizing them, thereby avoiding wasting resources on unproductive development.

Introduction

For patients, new medicines offer fewer side effects, fewer hospitalizations, improved quality of life, increased productivity, and importantly, extended lives. Developing medicines is a long, complex process. On average, it takes at least ten years for a new medicine to complete the journey from initial discovery to the marketplace, with clinical trials alone taking six to seven years on average. The average cost to research and develop each successful drug is estimated to be $2.6 billion.1 This number incorporates the cost of failures – of the thousands and sometimes millions of compounds that may be screened and assessed early in the R&D process, only a few of which will ultimately receive approval. The overall probability of clinical success (the likelihood that a drug entering clinical testing will eventually be approved) is estimated to be less than 12%.

One of the reasons drugs fail in the clinic is because they exhibit unwanted and often serious side effects. For example, the unintended and often promiscous inhibition of the cardiac human Ether-à-go-go related gene (hERG) potassium channel is a common cause for either delay or removal of therapeutic compounds from development and withdrawal of marketed drugs.2 The clinical manifestion is prolongation of the duration between QRS complex and T-wave measured by surface electrocardiogram (ECG)—hence Long QT Syndrome. Machine learning techniques may be useful for predicting the level of hERG activity of a compound before money is spent on its development into a drug.

Dataset information

The hERGCentral (www.hergcentral.org) is mainly based on experimental data obtained from a primary screen by patch clamp electrophysiology against more than 300,000 structurally diverse compounds. The system is aimed to display and combine three resources: primary electrophysiological data, literature, as well as online reports and chemical library collections. 3,4

Methods and Tools

Pandas
Tensorflow
Rdkit – for handling chemical structures and information
DeepChem – for implementing ML models on chemical datasets
Google Cloud Compute – for much-needed resources

RDKit was used to calculate 13 structural features: molecular weight, logP, polar surface area, no. of H-bond acceptors, # of H-bond donors, # of heteroatoms, # of aromatic rings, total # of rings, # of rotatable bonds, # of stereocenters, formal charge, length of SMILES string, and # of Rule-of-5 violations.

EDA and Observations

The averages for 10 of the 13 calculated structural features are as follows:

From the above, it seems that hERG inhibitors, on average, have a greater number of H-bond acceptors and fewer H-bond donors, suggesting that the binding site of hERG may have H-bond-donating residues (e.g., serine, threonine, proline). However, the differences are quite small, so it important to avoid over-interpreting that result. The data also suggests that hERG inhibitors tend to have more rings and tend to be more flexible (see # of rotatable bonds), though both attributes could be byproducts of a higher average molecular weight (see below).

Conventional wisdom states that hERG inhibitors tend to 1) have higher molecular weights, 2) be more hydrophobic, and 3) contain basic nitrogen groups. Exploratory data analysis seems to corroborate those ideas (also see violin plot at row 2, column 2 of the image above):

Compare with the active/inactive distribution of the full dataset:

Distributons of PSA, cLogP, and molecular weight:

However, a common issue with compound libraries is that molecular weight is often correlated with PSA and/or cLogP. Such collinearity is, to a large degree, an unintended consequence of the synthetic methods that are commonly employed for making larger compounds (e.g., aromatic groups are readily available and relatively easy to append to small molecules). Thus, it is not always clear that high-molecular weight per se is a liability with respect to cardiotoxicity. This is one of the reasons why metrics that incorporate ligand efficiency are popular among medicial chemists.

The present library exhibits some correlations between MW and cLogP and PSA, though they are not quite as pronounced as in many other libraries:

Models and Results

Embeddings from SMILES strings (TBD), fingerprints, and graph convolutions are herein compared. The fingerprint embeddings were added to a single layer, 4096-neuron perceptron, whereas the graph convolutions were added to a single layer of 256 neurons.

An initial model using the calculated features and their interactions terms (105 features in all) gave a model that was no better than baseline, with a Matthews Correlation Coefficient (MCC) of 0.

The results showed that geometric information was needed to obtain a model that performed any better than baseline. The embeddings form graph convolutions performed slightly better than those from extended-connectivity fingerprints, as determined by the MCC, but the model still needs a lot of work.

Features	Validation MCC	Validation accuracy
No Structure - calculated variables only	0.0	0.955
ECFP fingerprints 5	0.456	0.962
Graph convolutions 6	0.516	0.959

The performance of the model will vary in accordance with the way the training and validation data are prepared.
Performance of the best model from the graph convolution embeddings is as follows:

The orange bars to the left of the center line represent false negatives, and the blue bars to the right of the center line represent false positives.

A confusion matrix for the model is shown below:

From the confusion matrix, the follow metrics can be calculated:

Specificity = 0.993
FPR = 0.007
Recall/TPR = 0.7351
Precision = 0.832

Next Steps

Run tokenized SMILES strings through MLP
Combine embeddings from SMILES, fingerprints, and graph convolutions in transfer learning models
Cluster compounds and evaluate performance on each cluster
Use an explanation technique (e.g., LIME, SHAP, Grad-CAM) to update the conventional wisdom
Explore 3-D representations
Incorporate into a generative model

Conclusions

Drug discovery is a massive multi-parameter optimization problem that lends itself to various machine learning techniques
The representation of chemical compounds can greatly affect the performance of machine learning models - graphs are particularly attractive solutions
The quality and quantity of (publicly available) data are still major sources of contention – GIGO is particularly applicable to drug discovery
In silico drug discovery is very resource-intensive

References

1. "Biopharmaceutical Research & Development: The Process Behind New Medicines" (http://phrma-docs.phrma.org/sites/default/files/pdf/rd_brochure_022307.pdf)↩︎

2. "Human Ether-a-go-go-Related Gene (hERG) Blockers" (https://www.cambridgemedchemconsulting.com/resources/herg_activity.html)↩︎

3. "Investigation of miscellaneous hERG inhibition in large diverse compound collection using automated patch-clamp assay" (https://www.nature.com/articles/aps2015143)↩︎

4. "hERGCentral: a large database to store, retrieve, and analyze compound-human Ether-à-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development" (https://pubmed.ncbi.nlm.nih.gov/22149888/)↩︎

5. "Extended-Connectivity Fingerprints" (https://pubs.acs.org/doi/10.1021/ci100050t)↩︎

6. "Convolutional Networks on Graphs for Learning Molecular Fingerprints" (https://arxiv.org/abs/1509.09292)↩︎

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
imgs		imgs
notebooks		notebooks
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Capstone: Predicting Drug Cardiotoxicity

Kehinde Ajayi, Ph.D.

Contents:

Problem Statement

Introduction

Dataset information

Methods and Tools

EDA and Observations

Models and Results

Next Steps

Conclusions

References

About

Releases

Packages

Languages

kajayi/Capstone

Folders and files

Latest commit

History

Repository files navigation

Capstone: Predicting Drug Cardiotoxicity

Kehinde Ajayi, Ph.D.

Contents:

Problem Statement

Introduction

Dataset information

Methods and Tools

EDA and Observations

Models and Results

Next Steps

Conclusions

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages