Single-step Retrosynthesis Prediction based on the Identification of Potential Disconnection Sites using Molecular Substructure Fingerprints
Article Link: https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c01100
Authors: Haris Hasic and Takashi Ishida (Ishida Laboratory @ Tokyo Institute of Technology)
This repository is still under construction: it requires a few more updates for memory and execution efficiency and, of course, user-friendliness. The authors are currently working on multi-processing support and on migrating the models to PyTorch, since TensorFlow 1.12 is deprecated. The main functionalities, however, are available as scripts, which are described later in this file. Note that there is still a high possibility of encountering bugs and that some parts may not work well. Once everything that is planned is fully integrated, this note will be removed.
The code can be used by following the next 5 steps:
First, a Python environment needs to be set up. This can be done easily with conda by running:
```
conda env create -f environment.yml
conda activate one_step_retrosynth_ai
```
If you encounter errors or conflicts for whatever reason, you can manually reconstruct the environment. The following base libraries were used for the realization of the project:
- python: 3.6.10
- tensorflow-gpu: 1.12.0
- rdkit: 2020.03.3.0
- numpy: 1.16.0 (NOTE: This version is preferred to avoid TensorFlow warnings.)
- pandas: 1.1.3
Additional libraries that are necessary for the code to be fully functional are:
- cairosvg: 2.4.2
- imbalanced-learn: 0.7.0
- matplotlib: 3.3.1
- scikit-learn: 0.23.2
- tqdm: 4.50.2
Everything else will be installed automatically as requirements for the base libraries.
The general configuration of each step is stored in the `config.json` file, which consists of four main sections:
- `dataset_config` - This section contains parameters related to the initial dataset processing. To run the code for the first time, please change the `output_folder` path to a folder with enough disk space where the output files can be generated. The needed output disk space should be less than approx. 100 GB.
- `descriptor_config` - This section contains parameters related to the generation of all molecular fingerprint descriptors. No changes are needed to run the code for the first time.
- `model_config` - This section contains parameters related to the model architecture. No changes are needed to run the code for the first time, since the logs will be generated in the project folder.
- `evaluation_config` - This section contains parameters related to the final evaluation of the method. To run the code for the first time, please change the `final_evaluation_dataset` path to the generated evaluation dataset. The default value is the combination of the `output_folder` parameter value and the `final_evaluation_dataset.pkl` string.
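As an illustration, the overall shape of `config.json` might look like the sketch below. Only the section names and the `output_folder` and `final_evaluation_dataset` keys are taken from the description above; the paths are placeholders, and the actual file contains many more keys per section:

```json
{
    "dataset_config": {
        "output_folder": "/path/with/enough/disk/space/"
    },
    "descriptor_config": {},
    "model_config": {},
    "evaluation_config": {
        "final_evaluation_dataset": "/path/with/enough/disk/space/final_evaluation_dataset.pkl"
    }
}
```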
WARNING: This part of the code currently runs on CPU only, and it requires a decent amount of resources to reproduce quickly. The main bottlenecks are RAM (up to ~80 GB) and output disk space (up to ~100 GB), due to the large number of 1024-bit fingerprints being handled. If you do not have that kind of hardware available, please feel free to modify the `dataset_construction.py` functions that deal with filtering the non-reactive fingerprints, which is the most resource-intensive part of the code. With such modifications, the code can also be run on computers with limited resources. A multiprocessing version of this code will be added later, since it is currently not a high priority for the authors.
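The memory pressure comes from holding many 1024-bit fingerprints in RAM at once. One generic way to reduce it, not part of this repository but a sketch using only NumPy, is to store each fingerprint as packed bits rather than one byte per bit:

```python
import numpy as np

# A 1024-bit fingerprint stored naively as one uint8 per bit occupies 1024 bytes.
fp = np.random.randint(0, 2, size=1024).astype(np.uint8)

# Packing 8 bits into each byte shrinks it to 128 bytes (an 8x reduction).
packed = np.packbits(fp)
print(packed.nbytes)  # 128

# The original bit vector can be recovered losslessly when needed.
restored = np.unpackbits(packed)[:1024]
assert np.array_equal(fp, restored)
```

Applied across millions of fingerprints, this kind of bit packing can bring the working set down by an order of magnitude at the cost of an unpack step before use.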
The starting dataset is now included in the repository, and the pre-processed version can be generated by running the following command:
```
python -m scripts.prepare_dataset config.json
```
The process consists of 5 steps, and the final dataset is saved in the `output_folder` specified in the configuration.
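Once the preparation finishes, the generated `.pkl` files in the `output_folder` can be inspected with pandas. The file name and columns below are placeholders, since the actual names are repository-specific:

```python
import pandas as pd

# Hypothetical stand-in for one of the generated .pkl files; the real files
# live in the configured output_folder and have repository-specific columns.
demo = pd.DataFrame({"reaction_smiles": ["CCO>>CC=O"], "reaction_class": [1]})
demo.to_pickle("demo_dataset.pkl")

# Any of the generated files can be inspected the same way:
dataset = pd.read_pickle("demo_dataset.pkl")
print(dataset.shape)  # (1, 2)
```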
The described models can be trained and assessed by running the following command:
```
python -m scripts.train_model config.json
```
All the hyper-parameters are specified in the `model_config` section of the configuration.
The full single-step retrosynthesis pipeline can be assessed by running the following command:
```
python -m scripts.run_evaluation config.json
```
All the hyper-parameters are specified in the `evaluation_config` section of the configuration.
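Single-step retrosynthesis pipelines are commonly scored with top-k accuracy, i.e. how often the recorded answer appears among the k highest-ranked suggestions. The repository's own evaluation metric may differ; the following is only a generic sketch in plain Python:

```python
def top_k_accuracy(ranked_predictions, true_answers, k):
    """Fraction of cases where the true answer is among the top-k predictions."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, true_answers)
    )
    return hits / len(true_answers)

# Toy example: 2 of the 3 targets are recovered within the top-2 suggestions.
preds = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
truth = ["B", "Z", "Q"]
print(top_k_accuracy(preds, truth, k=2))  # 0.6666666666666666
```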
For any questions and inquiries please feel free to contact the authors.