Questions about the chain/interface clustering files #307

Open
zqcai19 opened this issue Oct 11, 2024 · 3 comments

zqcai19 commented Oct 11, 2024

@lucidrains @amorehead Hi, thank you very much for your efforts in the reproduction of AlphaFold3. I have downloaded the preprocessed mmCIF files and chain/interface clustering files as described in the README and would like to use the clustered test set to evaluate AF3.

Based on my understanding, the JSON, CSV, and FASTA files should contain the chain IDs, cluster mappings, and sequences. However, I noticed inconsistencies between them and the RCSB PDB. For example, in filtered_all_chain_sequences.json:

  1. 8a14-assembly1: The file only records 2 chains, whereas RCSB shows that it has 6 chains.
  2. 8sza-assembly1: The file does not seem to include ligand information.
  3. The sequences in both cases appear to be cropped compared to the original sequences in RCSB.

Other entries have similar inconsistencies as well. Am I missing something here? How should I use the chain/interface clustering files to evaluate AF3?

Thank you in advance for your help!

amorehead (Contributor) commented

Hi, @zqcai19.

My first thought is that these differences may be the result of the PDB dataset's preprocessing scripts, as described in the AF3 paper. These preprocessing scripts will (in several cases) drop residues or chains that do not meet AF3's strict filtering criteria. For more details, I recommend reviewing the preprocessing scripts in scripts/, and let me know if you have any other questions.
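To see which chains survived filtering for a given entry, a minimal Python sketch like the one below may help. It is not part of the repository. It assumes filtered_all_chain_sequences.json maps each structure ID to a mapping of chain ID to sequence, possibly wrapped in a list of such mappings; that schema is an assumption, so check it against the actual file first. The helper names `normalize`, `summarize`, and `load_chain_sequences` are hypothetical.

```python
import json
from typing import Dict, List, Union

# ASSUMED schema (not confirmed by the repository):
# {structure_id: {chain_id: sequence}}, or a list of such dicts.
ChainMap = Dict[str, Dict[str, str]]


def normalize(raw: Union[ChainMap, List[ChainMap]]) -> ChainMap:
    """Flatten a dict, or a list of dicts, into one {structure_id: {chain_id: sequence}} map."""
    if isinstance(raw, dict):
        return raw
    merged: ChainMap = {}
    for entry in raw:
        for structure_id, chains in entry.items():
            merged.setdefault(structure_id, {}).update(chains)
    return merged


def summarize(chains_by_structure: ChainMap, structure_id: str) -> Dict[str, int]:
    """Return {chain_id: sequence_length} for one structure, or {} if it is absent."""
    chains = chains_by_structure.get(structure_id, {})
    return {chain_id: len(seq) for chain_id, seq in chains.items()}


def load_chain_sequences(path: str) -> ChainMap:
    """Load the JSON file and normalize it to the assumed schema."""
    with open(path) as f:
        return normalize(json.load(f))
```

For example, `summarize(load_chain_sequences("filtered_all_chain_sequences.json"), "8a14-assembly1")` would list the retained chain IDs and their lengths, which can then be compared by hand against the chain count and sequence lengths shown on RCSB.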

zqcai19 commented Oct 11, 2024

@amorehead Thank you for the quick response! I still have some doubts regarding the evaluation process. Should I use the filtered and cropped sequences from filtered_all_chain_sequences.json for inference? I couldn’t find any description in the AF3 paper or its Supplementary Information about cropping the sequences for the evaluation (only the training process was mentioned). Did I miss something?

amorehead (Contributor) commented

Hi, @zqcai19. The AF3 paper seems to implicitly suggest this filtering of the train, val, and test structures (particularly the test structures). I interpreted the paper this way in order to standardize all three dataset splits.
