Chapter 5: Handling Structural Information with RDKit

In this chapter we will learn the basics of reading molecules with RDKit.

What is SMILES?

Simplified molecular input line entry system(SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings. More detials are described in SMILES Tutorial. For example c1ccccc1 means that there are six aromatic carbon atoms and has a loop structure which is connected with start and end, you know it means benzene.

Let’s draw chemical strcutre with SMILES :)

We could understand SMILES can represent molecules, so let’s read SMILES and draw molecule. At first import Chem class from RDKit to do that. And the function in the second line named 'IPythonConsole' is read for drawing molecules on Notebook. The majority of the basic molecular functionality is found in module rdkit.Chem

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

RDKit has MolFromSmiles method which reads SMILES. RDKit mol object can be constructed from SMILES with the function like below.

mol = Chem.MolFromSmiles("c1ccccc1")

Next we draws molecular structure. It is very simple, just evaluate mol object.

mol

Molecular structure will be drawn like following figure.

Both methods connect atoms with bonds(2D Structure) and SMILES can represent same molecule. 2D structure is easy to understand for us and SMILES is not. But SMILES can define molecule as ASCII strings so SMILES can store molecule in low data volume.

How to handle multiple molecules at once？

There are several ways to store multiple molecules in a file but SDF format file is common.

What’s sdf format?

There is MOL format which was developed by MDL. SDF format is an extension of this MOL format. In particular multiple compounds are delimited by lines consisting of four dollar signs (). A feature of the SDF format is its ability to include associated data.

Huge differnce between MOL format and SMILES format is that MOL format can store 3d geometry information of molecule so MOL format can describe not only 2D but also stereo chemistry.

Download sdf file from ChEMBL

Refer to chapter 4, down load Topoisomerase II inhibitor data(CHEMBL669726) from ChEMBL as sdf file format.

NOTE

Specially, open the link page and input 'CHEMBL66926' to search form then search results will be appeared. Then click compounds tab, select all and down load as SDF. File download will start and get file as compressed gzip format. Extract the file with guzip command or using an appropriate soft then rename the file to ch05_compounds.sdf.

Handling sdf with RDKit

SDMolSupplier method is used as sdf file reader of RDKit. Please note that we use mols variable instead of mol because we handle multiple molecules. There isn’t a rule for variables naming but you should use variables name which is easy to understand in order to reduce the unnecessary mistakes.

mols = Chem.SDMolSupplier("ch05_compounds.sdf")

Check how many coumpounds are read. len method is used to count number.

len(mols)

Total 34 molecules are read.

Draw moleculear structures

You can draw molecule one by one with for loop but it is redundant. RDKit has method which can draw multiple molecules at once, so try to use the function named MolsToGridImage method. For your information the function has molsPerRow option which can change number of molecules per row.

Draw.MolsToGridImage(mols)

(bonus)

Following code shows draw molecule one by one with loop for your information.

from IPython.core.display import display
for mol in mols:
    display(mol)

Let’s try to do hetero shuffling

At the leard optimization satage of drug discovery, it often happens that researchers would like to improve molecular properties without changing molecular shape. In this case medicinal chemists of chage atoms such as carbon, nitrogen, sulphur and oxygen which in aromatic rings and it generats good profile molecules sometime. The approach which exchange aromatic atoms (except hydrogen) is called heteroshuffling.

The heteroshuffling strategy is expected to improve physchem properties keeping potency, improve potency and claim avoidance.

Pfizer’s Sildenafil and GSK’s Vardinafil are well-known examples where small structural differences can affect selectivity and pharmacokinetics.

The two structures are very similar except that the arrangement of the nitrogen atoms in the central ring structure is different. Both two molecules inhibt same target protein but their biological activities and pharmacokenetics are different.

Following code shows how to generate image described above. Please note that the code is not just using Draw.MolsToGridImage but align to core structure and add legends option to draw molecular’s name.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
from rdkit.Chem import rdFMCS
from rdkit.Chem import TemplateAlign
IPythonConsole.ipython_useSVG = True
rdDepictor.SetPreferCoordGen(True)

sildenafil = Chem.MolFromSmiles('CCCC1=NN(C)C2=C1NC(=NC2=O)C1=C(OCC)C=CC(=C1)S(=O)(=O)N1CCN(C)CC1')
vardenafil = Chem.MolFromSmiles('CCCC1=NC(C)=C2N1NC(=NC2=O)C1=C(OCC)C=CC(=C1)S(=O)(=O)N1CCN(CC)CC1')
rdDepictor.Compute2DCoords(sildenafil)
rdDepictor.Compute2DCoords(vardenafil)
res = rdFMCS.FindMCS([sildenafil, vardenafil], completeRingsOnly=True, atomCompare=rdFMCS.AtomCompare.CompareAny)
MCS = Chem.MolFromSmarts(res.smartsString)
rdDepictor.Compute2DCoords(MCS)

TemplateAlign.AlignMolToTemplate2D(sildenafil, MCS)
TemplateAlign.AlignMolToTemplate2D(vardenafil, MCS)
Draw.MolsToGridImage([sildenafil, vardenafil], legends=['sildenafil', 'vardenafil'])

HeteroShuffle class is defined to generate hetero shuffled molecules. To generate the objects, it is needed to input the molecule which would like to do hetero shuffle and core structure to shuffle. The target atoms are aromatic atoms in the core and atoms which has no substituent. The function named make_connector generates reaction objects to construct molecules from shuffled core and substituents. The function named re_construct_mol reconstruct molecules with the reaction objects.

To generate possible combinations of atoms, the code pass candidates of atomic numbers (C, S, N, O) and number of atoms which constructs target ring. Invalid molecule will be removed after possible combinations is generated.

class HeteroShuffle():

    def __init__(self, mol, query):
        self.mol = mol
        self.query = query
        self.subs = Chem.ReplaceCore(self.mol, self.query)
        self.core = Chem.ReplaceSidechains(self.mol, self.query)
        self.target_atomic_nums = [6, 7, 8, 16]


    def make_connectors(self):
        n = len(Chem.MolToSmiles(self.subs).split('.'))
        map_no = n+1
        self.rxn_dict = {}
        for i in range(n):
            self.rxn_dict[i+1] = AllChem.ReactionFromSmarts('[{0}*][*:{1}].[{0}*][*:{2}]>>[*:{1}][*:{2}]'.format(i+1, map_no, map_no+1))
        return self.rxn_dict

    def re_construct_mol(self, core):
        '''
        re construct mols from given substructures and core
        '''
        keys = self.rxn_dict.keys()
        ps = [[core]]
        for key in keys:
            ps = self.rxn_dict[key].RunReactants([ps[0][0], self.subs])
        mol = ps[0][0]
        try:
            smi = Chem.MolToSmiles(mol)
            mol = Chem.MolFromSmiles(smi)
            Chem.SanitizeMol(mol)
            return mol
        except:
            return None

    def get_target_atoms(self):
        '''
        get target atoms for replace
        target atoms means atoms which don't have anyatom(*) in neighbors
        '''
        atoms = []
        for atom in self.core.GetAromaticAtoms():
            neighbors = [a.GetSymbol() for a in atom.GetNeighbors()]
            if '*' not in neighbors and atom.GetSymbol() !='*':
                atoms.append(atom)
        print(len(atoms))
        return atoms

    def generate_mols(self):
        atoms = self.get_target_atoms()
        idxs = [atom.GetIdx() for atom in atoms]
        combinations = itertools.product(self.target_atomic_nums, repeat=len(idxs))
        smiles_set = set()
        self.make_connectors()
        for combination in combinations:
            target = copy.deepcopy(self.core)
            #print(Chem.MolToSmiles(target))
            for i, idx in enumerate(idxs):
                target.GetAtomWithIdx(idx).SetAtomicNum(combination[i])
            smi = Chem.MolToSmiles(target)
            #smi = smi.replace('sH','s').replace('oH','o').replace('cH3','c')
            #print('rep '+smi)
            target = Chem.MolFromSmiles(smi)
            if target != None:
                n_attachment = len([atom for atom in target.GetAtoms() if atom.GetAtomicNum() == 0])
                n_aromatic_atoms = len(list(target.GetAromaticAtoms()))
                if target.GetNumAtoms() - n_attachment == n_aromatic_atoms:
                    try:
                        mol = self.re_construct_mol(target)
                        if checkmol(mol):
                            smiles_set.add(Chem.MolToSmiles(mol))
                    except:
                        pass
        mols = [Chem.MolFromSmiles(smi) for smi in smiles_set]
        return mols

The checkmol function which is used to avoid molecule such as c1coooo1 is defied as aromatic. I defined molecule which is allowd contain O, S is only five menbered hetero aromatic rings.

def checkmol(mol):
    arom_atoms = mol.GetAromaticAtoms()
    symbols = [atom.GetSymbol() for atom in arom_atoms if not atom.IsInRingSize(5)]
    if symbols == []:
        return True
    elif 'O' in symbols or 'S' in symbols:
        return False
    else:
        return True

Use the function.

# Gefitinib
mol1 = Chem.MolFromSmiles('COC1=C(C=C2C(=C1)N=CN=C2NC3=CC(=C(C=C3)F)Cl)OCCCN4CCOCC4')
core1 = Chem.MolFromSmiles('c1ccc2c(c1)cncn2')
#  Oxaprozin
mol2 = Chem.MolFromSmiles('OC(=O)CCC1=NC(=C(O1)C1=CC=CC=C1)C1=CC=CC=C1')
core2 =  Chem.MolFromSmiles('c1cnco1')

Original molecule.

ht=HeteroSuffle(mol1, core1)
res=ht.generate_mols()
print(len(res))
Draw.MolsToGridImage(res, molsPerRow=5)

The image is part of the results Gefitinib as input. The molecules which is different from original molecule are generated. And quinazoline part is changed because I set quinazoline as core.

ht=HeteroSuffle(mol2, core2)
res=ht.generate_mols()
print(len(res))
Draw.MolsToGridImage(res, molsPerRow=5)

This is the result of Oxaprozin is used as input. This molecul has ink:https://en.wikipedia.org/wiki/Oxazole[oxazole] which is five menbered ring as core. There are several hetero aromatic rings that conatin oxygen, sulphur such as thiophen furan.

What’s on your mind? Two examples were shown. The first one is a case of aromatic rings are quinazoline and benzene. Qunazoline is the ring which is fused ring of benzene and pyrimidine. The candidates atoms for six membered aromatic rings will be carbon and nitrogen atoms. (Of cource if we consider for pyririum ion, oxygen will be candidate of atoms but these charged substructure is not common for drug discovery. So we ommited the atom.) Oxaprozin has an oxazole rings. The candidates of atoms for five membered aromatic rings will be carbon, nitrogen, sulphur and oxygen. The second one is introduced as an example of five membered hetero aromatic rings. HeteroShuffled molecules are generated in the both case.

Describes about hetero shuffling more

In the article J. Med. Chem. 2012, 55, 11, 5151-5164 analyzed the effect of nitrogen shuffling for PIM-1 kinase inhibitor project with Fragment Molecular Orbital method which is a method of quantum chemistry. And another article J. Chem. Inf. Model. 2019, 59, 1, 149-158 described mechanism of the stackibng between Asp-Arg salt bridge and hetero rings with quantum chemistry calclation. The approach seems to be good indicator for substituents design.

Also, an example of hetero shuffling for improving the bio availability is ink:https://dx.doi.org/10.1021/jm101027s[J. Med. Chem. 2011, 54, 8, 3076-3080]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ch05_rdkit.asciidoc

ch05_rdkit.asciidoc

Chapter 5: Handling Structural Information with RDKit

What is SMILES?

Let’s draw chemical strcutre with SMILES :)

How to handle multiple molecules at once？

Download sdf file from ChEMBL

Handling sdf with RDKit

Draw moleculear structures

(bonus)

Let’s try to do hetero shuffling

Files

ch05_rdkit.asciidoc

Latest commit

History

ch05_rdkit.asciidoc

File metadata and controls

Chapter 5: Handling Structural Information with RDKit

What is SMILES?

Let’s draw chemical strcutre with SMILES :)

How to handle multiple molecules at once？

Download sdf file from ChEMBL

Handling sdf with RDKit

Draw moleculear structures

(bonus)

Let’s try to do hetero shuffling