This project introduces a novel single-step retrosynthesis approach based on chemical compound substructures and fingerprint descriptors.

hasic-haris/one_step_retrosynth_ai

Single-step Retrosynthesis Prediction based on the Identification of Potential Disconnection Sites using Molecular Substructure Fingerprints

Article Link: https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c01100

Authors: Haris Hasic and Takashi Ishida (Ishida Laboratory @ Tokyo Institute of Technology)

Note to Users

This repository is still under construction: it needs a few more updates in terms of memory and execution efficiency, as well as user-friendliness. The authors are currently working on multi-processing support and on migrating to PyTorch models, since TensorFlow 1.12 is deprecated. The main functionalities, however, are already available as scripts, as described later in this file. Please note that there is still a high possibility of encountering bugs and that some parts may not work well. Once everything that is planned is fully integrated, this note will be removed.

Running the Code

The code can be used by following these 5 steps:

1. Environment Setup

First, a Python environment needs to be set up. This can be done with conda by running:

conda env create -f environment.yml
conda activate one_step_retrosynth_ai

If you encounter errors or conflicts for any reason, you can manually reconstruct the environment. The following base libraries were used for this project:

  • python: 3.6.10
  • tensorflow-gpu: 1.12.0
  • rdkit: 2020.03.3.0
  • numpy: 1.16.0 (NOTE: This version is preferred to avoid TensorFlow warnings.)
  • pandas: 1.1.3

Additional libraries necessary for the code to be fully functional are:

  • cairosvg: 2.4.2
  • imbalanced-learn: 0.7.0
  • matplotlib: 3.3.1
  • scikit-learn: 0.23.2
  • tqdm: 4.50.2

Everything else will be installed automatically as dependencies of the base libraries.
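For manual reconstruction, the version list above can be assembled into an environment file. The following is a hedged sketch of what such a file might look like; the channel choices are assumptions, and the repository's own environment.yml remains authoritative:

```yaml
# Hypothetical environment.yml reconstructed from the version list above.
# Channel choices (rdkit, conda-forge) are assumptions, not taken from the repo.
name: one_step_retrosynth_ai
channels:
  - rdkit
  - conda-forge
  - defaults
dependencies:
  - python=3.6.10
  - tensorflow-gpu=1.12.0
  - rdkit=2020.03.3.0
  - numpy=1.16.0
  - pandas=1.1.3
  - cairosvg=2.4.2
  - imbalanced-learn=0.7.0
  - matplotlib=3.3.1
  - scikit-learn=0.23.2
  - tqdm=4.50.2
```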

2. Configuration

The general configuration of each step is stored in the config.json file, which consists of four main sections:

  1. dataset_config - This section contains parameters related to the initial dataset processing. Before the first run, change the output_folder path to a folder with enough disk space for the generated output files. The required output disk space is less than approximately 100 GB.
  2. descriptor_config - This section contains parameters related to the generation of all molecular fingerprint descriptors. No changes are needed for an initial run.
  3. model_config - This section contains parameters related to the model architecture. No changes are needed for an initial run, since the logs will be generated in the project folder.
  4. evaluation_config - This section contains parameters related to the final evaluation of the method. Before the first run, change the final_evaluation_dataset path to the generated evaluation dataset. The default value is the combination of the output_folder parameter value and the string final_evaluation_dataset.pkl.
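The descriptions above can be sketched in code. The snippet below is a hypothetical illustration (the nested key names output_folder and final_evaluation_dataset are taken from the text, but the actual structure of config.json may differ): it sets the output folder and derives the default evaluation dataset path from it.

```python
# Hypothetical sketch: adjust the two paths described above and write config.json.
# The nesting of keys is an assumption based on the section descriptions.
import json
import os

config = {
    "dataset_config": {"output_folder": "/data/retrosynth_output"},
    "descriptor_config": {},
    "model_config": {},
    "evaluation_config": {"final_evaluation_dataset": ""},
}

# Default evaluation dataset path: output_folder + "final_evaluation_dataset.pkl".
out_folder = config["dataset_config"]["output_folder"]
config["evaluation_config"]["final_evaluation_dataset"] = os.path.join(
    out_folder, "final_evaluation_dataset.pkl"
)

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

print(config["evaluation_config"]["final_evaluation_dataset"])
```

After editing, the same config.json file is passed to all three scripts described in the following steps.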

3. Dataset Preparation

WARNING: This part of the code currently runs on CPU only and requires considerable resources to reproduce quickly. The main bottlenecks are RAM (up to ~80 GB) and output disk space (up to ~100 GB), caused by the large number of 1024-bit fingerprints being handled. If such hardware is not available, feel free to modify the dataset_construction.py functions that filter the non-reactive fingerprints, which is the most resource-intensive part of the code. With such modifications, the code can run on a computer with limited resources. A multiprocessing version of this code will be added later; it is currently not a high priority for the authors.
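One common way to reduce the peak RAM described above is to process the fingerprints in fixed-size chunks instead of holding them all in memory at once. The sketch below is a hypothetical illustration of that idea (the function name and the filtering criterion are assumptions, not the repository's actual dataset_construction.py logic):

```python
# Hypothetical sketch of chunk-wise 1024-bit fingerprint filtering to cap RAM use.
# The filter criterion here (drop all-zero fingerprints) is only a placeholder.
import numpy as np

def filter_fingerprints_in_chunks(fingerprints, keep_mask_fn, chunk_size=10000):
    """Yield only the fingerprint rows selected by keep_mask_fn, chunk by chunk."""
    for start in range(0, len(fingerprints), chunk_size):
        chunk = np.asarray(fingerprints[start:start + chunk_size], dtype=np.uint8)
        mask = keep_mask_fn(chunk)  # boolean array, one entry per fingerprint row
        yield chunk[mask]

# Example: random 1024-bit fingerprints, with one all-zero row forced in.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(25000, 1024), dtype=np.uint8)
data[0, :] = 0  # this row gets filtered out
kept = np.concatenate(
    list(filter_fingerprints_in_chunks(data, lambda c: c.any(axis=1)))
)
print(kept.shape)
```

Because only one chunk is materialized at a time (and the kept rows can be appended to a file instead of concatenated), peak memory stays bounded by the chunk size rather than the full dataset.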

The starting dataset is now included in the repository, and the pre-processed version can be generated by running the following command:

python -m scripts.prepare_dataset config.json

The process consists of 5 steps, and the final dataset is saved in the output_folder specified in the configuration.

4. Model Training

The described models can be trained and assessed by running the following command:

python -m scripts.train_model config.json

All the hyper-parameters are specified in the model_config section of the configuration.

5. Running the Full Pipeline

The full single-step retrosynthesis pipeline can be assessed by running the following command:

python -m scripts.run_evaluation config.json

All the hyper-parameters are specified in the evaluation_config section of the configuration.

Contact

For any questions or inquiries, please feel free to contact the authors.
