EasyChem-ML

EasyChem-ML is a modular, structure-based machine learning tool for property, reactivity, and structure prediction in (organic) chemistry. It is designed to be easily adaptable to different problem sets.

Current Status

EasyChem-ML is developed and maintained as a PhD project by Marius Kühnemund. The project is neither finished nor complete, and frequent modifications cannot be ruled out ;) ... that's how it is in a PhD.

In addition, not all modules are publicly available, as some were or are being used in projects that have not been published yet.

The framework has already been used for different research projects:

Frederik Sandfort, Felix Strieth-Kalthoff, Marius Kühnemund, Christian Beecks, Frank Glorius, A Structure-Based Platform for Predicting Chemical Reactivity, Chem, Volume 6, Issue 6, 2020, Pages 1379-1390, ISSN 2451-9294, https://doi.org/10.1016/j.chempr.2020.02.017.
F. Strieth-Kalthoff, F. Sandfort, M. Kühnemund, F. R. Schäfer, H. Kuchen, F. Glorius, Angew. Chem. Int. Ed. 2022, 61, e202204647; Angew. Chem. 2022, 134, e202204647. https://onlinelibrary.wiley.com/doi/full/10.1002/anie.202204647
Pflüger, P. M., Kühnemund, M., Katzenburg, F., Kuchen, H., & Glorius, F. (2024). An evolutionary algorithm for interpretable molecular representations. Chem. https://doi.org/10.1016/j.chempr.2024.02.004

Install

To install EasyChem-ML, you need a working C++ compiler (for Cython) and a Rust compiler.
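Before running the install steps, it can help to confirm that both toolchains are visible on your PATH. The following is a minimal sketch using only the Python standard library; it is not part of EasyChem-ML. The executable names in `REQUIRED_TOOLS` are assumptions (for example, the MSVC compiler `cl` is usually only on PATH inside a developer command prompt), so adjust them for your platform.

```python
import shutil


def find_missing_tools(candidates):
    """Return the names of tool groups for which no executable is found on PATH."""
    missing = []
    for group_name, executables in candidates.items():
        # A group is satisfied if at least one of its executables is found.
        if not any(shutil.which(exe) for exe in executables):
            missing.append(group_name)
    return missing


# Assumed executable names (hypothetical defaults, not taken from EasyChem-ML).
REQUIRED_TOOLS = {
    "Rust compiler": ["rustc"],
    "Cargo": ["cargo"],
    "C++ compiler": ["g++", "clang++", "cl"],
}

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

A tool reported as missing may still be installed but not exported to PATH, so treat the result as a hint rather than a definitive check.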

Linux

# Install Rust-Compiler
curl https://sh.rustup.rs -sSf | bash -s -- -y

# Install Dependencies
pip install setuptools-rust

# Install EasyChemML
git clone https://github.com/mkueh/EasyChemML.git
cd EasyChemML
pip install ./

Windows

# Install Rust-Compiler
# https://www.rust-lang.org/learn/get-started

# Install Dependencies
pip install setuptools-rust

# Install EasyChemML
git clone https://github.com/mkueh/EasyChemML.git
cd EasyChemML
pip install ./

Short Example

# NOTE: imports are omitted in this snippet; the import paths are not shown in this README.
env = EasyProjectEnvironment('TestFolder')
job_factory = Job_Factory(env)
job_runner = LocalRunner(env)

# ----------------------------------- Dataloading --------------------------------------

dataLoader = {'dreher_dataset': XLSX('Examples/DATASETS/Dreher_and_Doyle_input_data.xlsx', sheet_name='FullCV_01')}
di = DataImporter(env)
bp = di.load_data_InNewBatchPartition(dataLoader, max_chunksize=100000)

# ----------------------------------- Preprocessing --------------------------------------

molRdkit_converter = MolRdkitConverter()
molRdkit_converter.convert(bp['dreher_dataset'], columns=['Ligand', 'Additive', 'Base', 'Aryl halide'], n_jobs=10)

mff_encoder = MFF()
mff_encoder.convert(datatable=bp['dreher_dataset'], columns=['Ligand', 'Additive', 'Base', 'Aryl halide'], fp_length=16, n_jobs=64)

# ----------------------------------- Splitting --------------------------------------

split_creator = Splitcreator()
splitter_boilingpoint = ShuffleSplitter(1, 42, test_size=0.1)
splitset_boilingpoint = split_creator.generate_split(bp['dreher_dataset'], splitter_boilingpoint)

dataset_boilingpoint = Dataset(bp['dreher_dataset'],
                               name='dreher_dataset',
                               feature_col=['Ligand', 'Additive', 'Base', 'Aryl halide'],
                               target_col=['Output'],
                               split=splitset_boilingpoint, env=env)

# ----------------------------------- Metric definition-----------------------------------

r2score = R2_Score()
mae = MeanAbsoluteError()
metricStack_r = MetricStack({'r2': r2score, 'mae': mae})

catboost_r = Config(
    CatBoost_r,
    {'verbose': 50,
     'thread_count': 64,
     'allow_writing_files': False,
     'iterations': 100,
     'depth': 4}
)

# ----------------------------------- Job definition-----------------------------------

job: ModelTrainEvalJob = job_factory.create_ModelTrainEvalJob(
    'dataset_boilingpoint_0',
    dataset_boilingpoint,
    catboost_r,
    metricStack_r,
    dataset_boilingpoint.get_Splitset().get_outer_split(0)
)


# ----------------------------------- Training ----------------------------------------
job_runner.run_Job(job)

print(f'Test metrics: {job.result_metric_TEST}')
print(f'Train metrics: {job.result_metric_TRAIN}')

job.trained_Model.save_model('model_reaxys.catb')

loaded_model = CatBoost_r()
loaded_model.load_model(path='model_reaxys.catb')

X_indices = list(range(len(bp['dreher_dataset'])))
job_predict = ModelPredictJob(job_name='Predict', trained_Model=loaded_model, X=bp['dreher_dataset'],
                              X_cols=['Ligand', 'Additive', 'Base', 'Aryl halide'])
job_runner.run_Job(job_predict)

for val in job_predict.predicted_vals:
    print(str(val))


# ----------------------------------- Cleanup ----------------------------------------

env.clean()
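For readers unfamiliar with the two metrics named in the example above (`r2` and `mae`), the following sketch shows how they are conventionally defined. It is a plain-Python illustration, not the code behind `R2_Score` or `MeanAbsoluteError`; the function names and the sample values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares).

    Undefined (division by zero) when all targets are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values, chosen only to illustrate the arithmetic.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

print(f"mae: {mean_absolute_error(y_true, y_pred):.3f}")  # 2.500
print(f"r2:  {r2_score(y_true, y_pred):.3f}")             # 0.948
```

Here the absolute errors are 2, 2, 3 and 3 (mean 2.5), the residual sum of squares is 26 and the total sum of squares is 500, which gives an R2 of 1 - 26/500 = 0.948.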

Examples

See the Examples folder for further examples.

Currently Implemented Modules

Dataloader

CSV
HDF5
XLSX

Splitter

AllTestSplitter (all data in the test set)
AllTrainSplitter (all data in the training set)
RangeSplitter (defines a test/train range)
ShuffleSplitter (splits randomly)
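To make the difference between a random split and a range split concrete, here is a small standalone sketch on row indices. It does not use or reproduce the EasyChem-ML splitter classes; the function names `shuffle_split` and `range_split` and their parameters are hypothetical, and the actual splitters may behave differently in detail.

```python
import random


def shuffle_split(n_samples, test_size=0.1, seed=42):
    """Shuffle the row indices with a fixed seed and cut off a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = max(1, round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]  # (train, test)


def range_split(n_samples, test_start, test_end):
    """Use the half-open index range [test_start, test_end) as the test set."""
    test = list(range(test_start, test_end))
    train = [i for i in range(n_samples) if i < test_start or i >= test_end]
    return train, test


train_idx, test_idx = shuffle_split(100, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))

train_idx_r, test_idx_r = range_split(10, 2, 5)
print(train_idx_r, test_idx_r)
```

A random split is the usual default for estimating in-distribution performance, while a range split is useful when the row order carries meaning, for example when a specific block of rows should be held out.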

Encoder

BertTokenizer
Fingerprints (all RDKit)
MFF
OnehotEncoder
EasyDescriptor (currently supports only Molmass)
EVO-MFP (evolutionary trained fingerprint)
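As an illustration of the simplest encoding idea in this list, the sketch below one-hot encodes a categorical column in plain Python. It is not the EasyChem-ML `OnehotEncoder`; the helper names and the category labels are hypothetical.

```python
def fit_categories(values):
    """Map each distinct category to a fixed column index (sorted for determinism)."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot(value, categories):
    """Return a vector with a single 1 at the index of the given category.

    Raises KeyError for a category that was not seen in fit_categories.
    """
    vector = [0] * len(categories)
    vector[categories[value]] = 1
    return vector


# Hypothetical labels standing in for a categorical column such as 'Base'.
column = ["base_B", "base_A", "base_C", "base_A"]
categories = fit_categories(column)
encoded = [one_hot(value, categories) for value in column]

for value, vector in zip(column, encoded):
    print(value, vector)
```

One-hot vectors carry no structural information about the molecule, which is why structure-based encoders such as fingerprints are typically preferred when generalization to unseen compounds matters.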

Model

CatBoost (classification and regression)
PyTorch models
- Bert-Transformer
RandomForest (classification and regression)

Jobrunner

LocalRunner (runs the jobs on the local machine)
