Single-step Retrosynthesis Prediction based on the Identification of Potential Disconnection Sites using Molecular Substructure Fingerprints
Article Link: https://pubs.acs.org/doi/full/10.1021/acs.jcim.0c01100
Authors: Haris Hasic and Takashi Ishida (Ishida Laboratory @ Tokyo Institute of Technology)
This repository is still under construction: it requires a few more updates for memory and execution efficiency and, of course, user-friendliness. The authors are currently working on multi-processing support and on migrating the models to PyTorch, since TensorFlow 1.12 is deprecated. The main functionalities, however, are available as scripts, which are described later in this file. Note that there is still a high possibility of encountering bugs and that some parts may not work well. Once everything that is planned is fully integrated, this note will be removed.
The code can be used by following the next 5 steps:
First, a Python environment needs to be set up. This can be done easily with conda by running:
```
conda env create -f environment.yml
conda activate one_step_retrosynth_ai
```
If you encounter errors or conflicts for whatever reason, you can manually reconstruct the environment. The following base libraries were used for the realization of the project:
- python: 3.6.10
- tensorflow-gpu: 1.12.0
- rdkit: 2020.03.3.0
- numpy: 1.16.0 (NOTE: This version is preferred to avoid TensorFlow warnings.)
- pandas: 1.1.3
Additional libraries that are necessary for the code to be fully functional are:
- cairosvg: 2.4.2
- imbalanced-learn: 0.7.0
- matplotlib: 3.3.1
- scikit-learn: 0.23.2
- tqdm: 4.50.2
Everything else will be installed automatically as requirements for the base libraries.
The general configuration of each step is stored in the `config.json` file, which consists of four main sections:
- `dataset_config` - This section contains parameters related to the initial dataset processing. To run the code for the first time, please change the `output_folder` path to a folder with enough disk space where the output files can be generated. The needed output disk space should be less than approx. 100 GB.
- `descriptor_config` - This section contains parameters related to the generation of all molecular fingerprint descriptors. No changes are needed to run the code for the first time.
- `model_config` - This section contains parameters related to the model architecture. No changes are needed to run the code for the first time, since the logs will be generated in the project folder.
- `evaluation_config` - This section contains parameters related to the final evaluation of the method. To run the code for the first time, please change the `final_evaluation_dataset` path to the generated evaluation dataset. The default value is the combination of the `output_folder` parameter value and the `final_evaluation_dataset.pkl` string.
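As an illustration, the overall shape of `config.json` might look like the sketch below. Only the section names and the `output_folder` and `final_evaluation_dataset` keys are taken from the description above; the paths are placeholders, and the actual file contains many more keys per section:

```json
{
    "dataset_config": {
        "output_folder": "/path/with/enough/disk/space/"
    },
    "descriptor_config": {},
    "model_config": {},
    "evaluation_config": {
        "final_evaluation_dataset": "/path/with/enough/disk/space/final_evaluation_dataset.pkl"
    }
}
```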
WARNING: This part of the code currently runs on CPU only, and it requires a decent amount of resources to reproduce quickly. The main bottlenecks are RAM (up to ~80 GB) and output disk space (up to ~100 GB), due to the large number of 1024-bit fingerprints being handled. If you do not have that kind of hardware available, please feel free to modify the `dataset_construction.py` functions that deal with filtering the non-reactive fingerprints, which is the most resource-intensive part of the code. With such modifications, the code can also be run on computers with limited resources. A multiprocessing version of this code will be added later, since it is currently not a high priority for the authors.
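The memory pressure comes from holding many 1024-bit fingerprints in RAM at once. One generic way to reduce it, not part of this repository but a sketch using only NumPy, is to store each fingerprint as packed bits rather than one byte per bit:

```python
import numpy as np

# A 1024-bit fingerprint stored naively as one uint8 per bit occupies 1024 bytes.
fp = np.random.randint(0, 2, size=1024).astype(np.uint8)

# Packing 8 bits into each byte shrinks it to 128 bytes (an 8x reduction).
packed = np.packbits(fp)
print(packed.nbytes)  # 128

# The original bit vector can be recovered losslessly when needed.
restored = np.unpackbits(packed)[:1024]
assert np.array_equal(fp, restored)
```

Applied across millions of fingerprints, this kind of bit packing can bring the working set down by an order of magnitude at the cost of an unpack step before use.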
The starting dataset is now included in the repository, and the pre-processed version can be generated by running the following command:
```
python -m scripts.prepare_dataset config.json
```
The process consists of 5 steps, and the final dataset is saved in the `output_folder` specified in the configuration.
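Once the preparation finishes, the generated `.pkl` files in the `output_folder` can be inspected with pandas. The file name and columns below are placeholders, since the actual names are repository-specific:

```python
import pandas as pd

# Hypothetical stand-in for one of the generated .pkl files; the real files
# live in the configured output_folder and have repository-specific columns.
demo = pd.DataFrame({"reaction_smiles": ["CCO>>CC=O"], "reaction_class": [1]})
demo.to_pickle("demo_dataset.pkl")

# Any of the generated files can be inspected the same way:
dataset = pd.read_pickle("demo_dataset.pkl")
print(dataset.shape)  # (1, 2)
```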
The described models can be trained and assessed by running the following command:
```
python -m scripts.train_model config.json
```
All the hyper-parameters are specified in the `model_config` section of the configuration.
The full single-step retrosynthesis pipeline can be assessed by running the following command:
```
python -m scripts.run_evaluation config.json
```
All the hyper-parameters are specified in the `evaluation_config` section of the configuration.
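Single-step retrosynthesis pipelines are commonly scored with top-k accuracy, i.e. how often the recorded answer appears among the k highest-ranked suggestions. The repository's own evaluation metric may differ; the following is only a generic sketch in plain Python:

```python
def top_k_accuracy(ranked_predictions, true_answers, k):
    """Fraction of cases where the true answer is among the top-k predictions."""
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, true_answers)
    )
    return hits / len(true_answers)

# Toy example: 2 of the 3 targets are recovered within the top-2 suggestions.
preds = [["A", "B", "C"], ["X", "Y", "Z"], ["P", "Q", "R"]]
truth = ["B", "Z", "Q"]
print(top_k_accuracy(preds, truth, k=2))  # 0.6666666666666666
```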
For any questions and inquiries please feel free to contact the authors.