Chapter 6: Try to evaluate the similarity of compounds

What does it mean that compounds are similar？

Expressions that are somewhat shape is similar are not scientific. In the chemoinformatics, similarity or unsimilarity (distance) is used as quantitative metrics.

In this section, we will introduce two major metrics.

Descriptor

A parameter that represents the overall characteristics of a molecule numerically is called a descriptor. Many descriptors are proposed so far, such as molecular weight, polar surface area (PSA) and partition coefficient (logP). It is possible to evaluate a similarity between two molecules with these descriptors. Please note that descriptor represents whole molecular feature as a numeric value and it is not a local feature.

Note	There are cases where commercial software is needed to calculate some descriptors.

Fingerprint

A fingerprint is another feature, and is a binary representation of a partial structure of a molecule as a binary 0, 1, and it corresponds to the presence or absence of a partial structure and on (1) or off (0) of a bit, and represents a set of partial structures Represents the characteristics of the molecule. There are two types of fingerprints, fixed-length FP and variable-length FP. Formerly, MACSKey fixed-length FP (FP whose partial structure and index have been determined in advance) was used, but now ECFP 4 (It is common to use a variable-length FP called Morgan2).

As for the RDKit fingerprint, please read Developper of RDKit, Greg’s Slide for details.

Let’s do similarity evaluation using this ECFP 4 (Morgan 2) this time.

Difference between SMILES and fingerprint

SMILES is an ASCII string representation of the structure, and a fingerprint is a binary representation of the presence or absence of a substructure. The difference is that the former is one of the structural expressions , while the latter is one of the feature expressions . Since only the presence or absence of partial structures is expressed, information such as the relationship between partial structures (how connected by positional relationship) is lost, and the original structure is not restored.

Some people call it Bag-of-Fragments because it corresponds to Bag-of-Words often used in text-mining.

Let’s calculate similarity

Let’s evaluate the similarity of toluene and chlorobenzene as simple molecules.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Draw
from rdkit.Chem.Draw import IPythonConsole

Read molecule from SMILES.

mol1 = Chem.MolFromSmiles("Cc1ccccc1")
mol2 = Chem.MolFromSmiles("Clc1ccccc1")

Confirm it by visual observation.

Draw.MolsToGridImage([mol1, mol2])

Generate radius 2 morgan fingerprint which corresponds to ECFP4.

fp1 = AllChem.GetMorganFingerprint(mol1, 2)
fp2 = AllChem.GetMorganFingerprint(mol2, 2)

Tanimoto coefficient is used for similarity evaluation.

DataStructs.TanimotoSimilarity(fp1, fp2)
# 0.5384615384615384

Virtual screening

So far we have described how to evaluate the similarity of compounds. Using this index of similarity to select a specific group of compounds from a large number of compounds is called virtual screening.

For example, if a compound that is likely to be a drug is published in a patent or a paper, or a compound that is likely to be promising is found in our assay system, similar compounds in the compound library database of our company or the database of commercially available compounds are more promising I want to find out if there is something like that. Here, it is possible to purchase an analog of influenza drug which is known as a neuraminidase inhibitor Inavir link:Find out using ZINC.

The molecular weight of Inavir was about 350, and LogP was about -3. So we selected the fraction of the molecular weight 350-375 and LogP -1 from ZINC. This section is divided into 16 files, but download and use only the first set.

Note	We described how to download the data in chapter 4.

We can perform shell command on jupyter notebook by starting from !. The following is an example of downloading ZINC data set with wget command on jupyter notebook

!wget http://files.docking.org/2D/EA/EAED.smi

Read SMILES from file and make it a mol object, but skip the first line because it is a header. Also, the last character of each line is a newline character, so it is excluded as l [:-1]. Finally, find out how many compounds there are.

mols = []
with open("EAED.smi") as f:
    f.readline()
    for l in f:
        mol = Chem.MolFromSmiles(l[:-1])
        mols.append(mol)
print(len(mols))
# 195493

Next, prepare a function to check the degree of similarity with Inavir (LANIMAMIBIR).

laninamivir = Chem.MolFromSmiles("CO[C@H]([C@H](O)CO)[C@@H]1OC(=C[C@H](NC(=N)N)[C@H]1NC(=O)C)C(=O)O")
laninamivir_fp = AllChem.GetMorganFingerprint(laninamivir, 2)

def calc_laninamivir_similarity(mol):
    fp = AllChem.GetMorganFingerprint(mol, 2)
    sim = DataStructs.TanimotoSimilarity(laninamivir_fp, fp)
    return sim

Check it.

similar_mols =[]
for mol in mols:
    sim = calc_laninamivir_similarity(mol)
    if sim > 0.2:
        similar_mols.append((mol, sim))

Sort the results in descending order of similarity and retrieve only the first ten.

similar_mols.sort(key=lambda x: x[1], reverse=True)
mols = [l[0] for l in similar_mols[:10]]

Let’s draw them.

Draw.MolsToGridImage(mols, molsPerRow=5)

As you can see if the similarity is confirmed, about 200,000 compounds examined this time can only find a compound with a maximum similarity is 23%. However, ZINC contains 750 million entries, so there should be many more similar compounds in it.

Clustering

For example, when purchasing a commercial compound and creating a library, we want to have as much diversity as possible, so we organize similar compounds and select a representative of them so that only similar compounds are not biased. In this way, if you want to organize compounds by structural similarity, use a method called clustering.

Clustering of 5614 hits from Novrtis’s antimalarial assay

Import library for clustering and reading data.

from rdkit.ML.Cluster import Butina
mols = Chem.SDMolSupplier("ch06_nov_hts.sdf")

If RDKit can not read the molecule for some reason, it will generate None instead of a mol object. Since passing this None to the GetMorganFingerprintAsBitVect method results in an error, so we generate a fingerprint while excluding None.

fps = []
valid_mols = []

for mol in mols:
    if mol is not None:
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
        fps.append(fp)
        valid_mols.append(mol)

Generate a distance matrix (a lower triangular distance matrix) from the fingerprints.

distance_matrix = []
for i, fp in enumerate(fps):
    similarities = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i+1])
    distance_matrix.extend([1-sim for sim in similarities])

Cluster compounds using a distance matrix. The third argument is the distance threshold. In this example, clustering is performed on compounds with a distance of 0.2 or 80% or more.

clusters = Butina.ClusterData(distance_matrix, len(fps), 0.2, isDistData=True)

Check number of cluster.

len(clusters)
#2492

Visualize structures of first cluster.

mols_ =[valid_mols[i] for i in clusters[0]]
Draw.MolsToGridImage(mols_, molsPerRow=5)

In this case, clustering was performed using the library provided in RDKit, but some methods can be used with Scikit-learn. And in practice this method is often used.

Structure Based Drug Design(SBDD)

Here we evaluate the similarity of apixaban and rivaroxaban, which are marketed as anticoagulants.

apx = Chem.MolFromSmiles("COc1ccc(cc1)n2nc(C(=O)N)c3CCN(C(=O)c23)c4ccc(cc4)N5CCCCC5=O")
rvx = Chem.MolFromSmiles("Clc1ccc(s1)C(=O)NC[C@H]2CN(C(=O)O2)c3ccc(cc3)N4CCOCC4=O")

Draw.MolsToGridImage([apx, rvx], legends=["apixaban", "rivaroxaban"])

The structures are quite similar as you can see, but both of these two compounds are known to bind similarly to the same pocket of the serine protease FXa and to inhibit the function of the protein.

apx_fp = AllChem.GetMorganFingerprint(apx, 2)
rvx_fp = AllChem.GetMorganFingerprint(rvx, 2)

DataStructs.TanimotoSimilarity(apx_fp, rvx_fp)
# 0.40625

It’s about 40% similar. In fact, both apixaban and rivaroxaban have their complex crystal structures solved and were drawn using PyMOL.

NOTE: It does not explain how to use PyMOL because it exceeds the contents of this document, but if you are interested, Please refer to here.

As you can see from the figure, apixaban and rivaroxaban are beautifully overlapping in three dimensions. In particular, methoxyphenyl and chlorothiol are located in a site called S1 pocket and are said to have some kind of strong interaction. As the ligand binding sites (pockets) of proteins become clearer, it becomes easier for the medicinal chemist to develop a strategy for the next modification, and the success rate and progress rate of the project will increase.

An approach that optimizes the structure based on the shape of the protein determined by X-ray or cryo-electric testing is called Structure Based Drug Design (SBDD). Also, if you know the pocket, you can screen for compounds that physically bind to the pocket, which is called structure-based virtual screening (SBVS), and ligand-based virtual screening as you did in the previous chapter. It may be distinguished from ligand-based virtual screenig(LBVS).

History of Xa inhibitors and the importance of quantum chemistry calculation

Although the contents of the chemoinformatics in this book are far apart, it is very useful in molecular design to trace the history of FXa inhibitors and to understand what improvements have been made through generations. In addition, since the interpretation of the S1 pocket interaction is very difficult visually and in classical mechanics, it can be interpreted only by quantum chemical calculation such as Fragment Molecular Orbital Method (FMO), so it is a mistake that quantum chemical calculation becomes essential in future molecular design I think.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ch06_similarity.asciidoc

ch06_similarity.asciidoc

Chapter 6: Try to evaluate the similarity of compounds

What does it mean that compounds are similar？

Descriptor

Fingerprint

Let’s calculate similarity

Virtual screening

Clustering

Structure Based Drug Design(SBDD)

Files

ch06_similarity.asciidoc

Latest commit

History

ch06_similarity.asciidoc

File metadata and controls

Chapter 6: Try to evaluate the similarity of compounds

What does it mean that compounds are similar？

Descriptor

Fingerprint

Let’s calculate similarity

Virtual screening

Clustering

Structure Based Drug Design(SBDD)