MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions

Paper

Pretrain

All commands are designed for a Slurm cluster. We use the Hugging Face Trainer to pretrain the model; more details can be found here.

Construct a local binary dataset. Loading training data from the cluster is too slow, so it is better to first convert the whole dataset to .bin files, as shown in datasets.

python utils.py \
   --output_dir ./datasets/ \
   --random_src \
   --src_seq_per_msa_l 5 \
   --src_seq_per_msa_u 10 \
   --total_seq_per_msa 25 \
   --local_file_path path_to_pretrained_dataset

Install the dependencies:
1. conda create -n msagen python=3.10
2. pip install -r requirements.txt
Start pretraining with bash run.sh
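
The sampling flags passed to utils.py above (--random_src, --src_seq_per_msa_l, --src_seq_per_msa_u, --total_seq_per_msa) suggest that each MSA is split into a variable number of source sequences and a set of target sequences. The sketch below illustrates one plausible reading of those flags. It is an assumption, not the repository's code: the function name split_msa and the exact sampling semantics are hypothetical.

```python
import random


def split_msa(msa, src_low=5, src_high=10, total=25, rng=None):
    """Split one MSA (a list of aligned sequences) into source and target sets.

    Assumed semantics (not confirmed by the repository):
    - `total` sequences are drawn from the MSA without replacement;
    - the number of source sequences is uniform in [src_low, src_high];
    - the remaining sampled sequences become the generation targets.
    """
    rng = rng or random.Random()
    if len(msa) < total:
        raise ValueError(f"MSA has {len(msa)} sequences, need at least {total}")
    sampled = rng.sample(msa, total)
    n_src = rng.randint(src_low, src_high)  # inclusive on both ends
    return sampled[:n_src], sampled[n_src:]


if __name__ == "__main__":
    toy_msa = [f"SEQ{i:02d}" for i in range(40)]  # placeholder sequences
    src, tgt = split_msa(toy_msa, rng=random.Random(0))
    print(len(src), "source sequences,", len(tgt), "target sequences")
```

With the defaults shown, every example has between 5 and 10 source sequences and the source and target sets always add up to 25 sequences.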

Inference

Download the checkpoints (we are currently trying to recover the model weights).
Run inference with bash scripts/inference.sh

Note: all inference code is in inference.py

Evaluation

DATASET	MSA	STRUCTURE
CASP15	https://zenodo.org/record/8126538	Google Drive

Alphafold2 Prediction

Please refer to the AlphaFold2 GitHub repository to learn how to set up AF2.
We provide a script that launches AlphaFold2 structure prediction: bash scripts/run_af2. You need to modify the MSA directory before running it.

LDDT

To download the lDDT evaluation tool, follow https://www.openstructure.org/
For usage, follow https://www.openstructure.org/docs/2.4/mol/alg/lddt/
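
For intuition about what the score measures, here is a minimal sketch of a simplified, global lDDT over C-alpha coordinates only. It is not the OpenStructure implementation and should not replace it for evaluation: it assumes both structures list the same residues in the same order, and it skips the all-atom handling and stereochemistry checks of the real tool. The function name ca_lddt is hypothetical.

```python
import math


def ca_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT on C-alpha atoms.

    ref, model: lists of (x, y, z) tuples, one per residue, in the same order.
    A reference distance shorter than `radius` counts as preserved at a given
    threshold if the model distance differs from it by less than that threshold.
    The score is the fraction of preserved distances averaged over thresholds.
    """
    if len(ref) != len(model):
        raise ValueError("reference and model must have the same number of residues")
    preserved = 0
    total = 0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only pairs inside the inclusion radius are scored
            diff = abs(d_ref - math.dist(model[i], model[j]))
            preserved += sum(diff < t for t in thresholds)
            total += len(thresholds)
    return preserved / total if total else 0.0


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    print(ca_lddt(reference, reference))
```

Because the score depends only on internal distances, it needs no superposition of the two structures.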

Ensemble

Run the following command directly to get a .json file with the final results.

python ensemble.py --predicted_pdb_root_dir ./af2/casp15/orphan/A1T3R1.5/
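
We do not describe here what ensemble.py computes internally. As an illustration of one common way to rank several AlphaFold2 predictions for the same target, the sketch below reads the per-residue pLDDT that AlphaFold2 stores in the B-factor column of its output PDB files and picks the file with the highest mean. The selection rule and the helper names (mean_plddt, pick_best, save_ranking) are assumptions and may differ from the repository's script.

```python
import json
from pathlib import Path


def mean_plddt(pdb_text):
    """Mean pLDDT over C-alpha atoms, read from the B-factor field (columns 61-66)."""
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    return sum(scores) / len(scores)


def pick_best(root_dir):
    """Score every .pdb file under root_dir and report the highest-scoring one."""
    scores = {
        str(path): mean_plddt(path.read_text())
        for path in sorted(Path(root_dir).rglob("*.pdb"))
    }
    if not scores:
        raise FileNotFoundError(f"no .pdb files found under {root_dir}")
    best = max(scores, key=scores.get)
    return {"best": best, "scores": scores}


def save_ranking(root_dir, out_path):
    """Write the ranking produced by pick_best to a JSON file."""
    result = pick_best(root_dir)
    Path(out_path).write_text(json.dumps(result, indent=2))
    return result
```

For example, save_ranking("./af2/casp15/orphan/A1T3R1.5/", "ranking.json") would score every PDB file under that directory; the output file name ranking.json is only a placeholder.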

📎 Citation

@misc{zhang2023enhancing,
      title={Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation}, 
      author={Le Zhang and Jiayang Chen and Tao Shen and Yu Li and Siqi Sun},
      year={2023},
      eprint={2306.01824},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}

📧 Contact

Please let us know if you have further questions or comments; reach out to [email protected]
