A successful test of data downloading. #6

Open
Kaihui-Cheng opened this issue Sep 20, 2024 · 3 comments

Comments

@Kaihui-Cheng
Collaborator

  1. Make sure you have Git LFS installed:
    sudo apt-get install git-lfs
    # Initialize Git LFS
    git lfs install

  2. Navigate to your DATA_ROOT and clone the source without downloading the large files:
    GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/datasets/fudan-generative-vision/dynamicPDB.git dynamicPDB_raw

  3. Download the data for a specific protein_id, for example 1a62_A:
    cd dynamicPDB_raw
    git lfs pull --include="{protein_id}/*"

  4. Merge the split archive parts into one file, then extract the .tar.gz file:
    cat {protein_id}/{protein_id}.tar.gz.part* > {protein_id}/{protein_id}.tar.gz
    cd {DATA_ROOT}  # the directory that contains dynamicPDB_raw
    mkdir -p dynamicPDB  # no error if the directory already exists
    tar -xvzf dynamicPDB_raw/{protein_id}/{protein_id}.tar.gz -C dynamicPDB

OK! Now we have the simulation data for the chosen protein_id.
Note: Make sure you have enough storage space for the data. For 1a62_A, the extracted files take 33 GB and the compressed files take 24 GB.
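
For repeated downloads, the merge-and-extract step can be wrapped in a small helper. This is only a sketch: the function name `merge_and_extract` and its three arguments are hypothetical, and it assumes the layout used in the steps above (`<raw_dir>/<protein_id>/<protein_id>.tar.gz.part*`) and that `git lfs pull` has already fetched the parts.

```shell
# Hypothetical helper: join the split archive for one protein and unpack it.
# Arguments: protein_id, raw_dir (the dynamicPDB_raw clone), out_dir (extraction target).
merge_and_extract() {
    protein_id="$1"
    raw_dir="$2"
    out_dir="$3"
    archive="${raw_dir}/${protein_id}/${protein_id}.tar.gz"

    # Expand the part files; the shell sorts glob matches, so parts stay in order.
    set -- "${archive}".part*
    if [ ! -e "$1" ]; then
        echo "no archive parts found for ${protein_id} under ${raw_dir}" >&2
        return 1
    fi

    # Concatenate the parts, then extract into out_dir (created if missing).
    cat "$@" > "${archive}" || return 1
    mkdir -p "${out_dir}" || return 1
    tar -xzf "${archive}" -C "${out_dir}"
}
```

Run from DATA_ROOT, a call would look like `merge_and_extract 1a62_A dynamicPDB_raw dynamicPDB`. Checking free space first with `df -h .` may help, given the sizes noted above.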

@meatball1982

Dear Kaihui-Cheng,
01:
There are currently 10 PDB IDs available (1a62_A, ..., 1bq8_A).
Would you be so kind as to provide a list of all the PDB IDs (the 12.6k filtered proteins) in your dataset (IDs only)? Then we (most readers of your paper) could choose specific PDBs to download.
02:
The README says
"we have decided to provide the 100ns simulation data for all proteins for online download". Still, I see no instructions for downloading the 100ns data for all proteins. Could you help me with that?
Thank you so much. I am looking forward to your reply.
Best,
M

@zqcai19
Collaborator

zqcai19 commented Sep 29, 2024

@meatball1982 Hi! Thank you for your valuable suggestions.

  1. We are still working on uploading the complete dataset, as its size is significantly large. However, we can provide a list on ModelScope to record the currently available protein data. This list may make it easier for users to choose the specific PDBs they want to download.
  2. The instruction described above by @Kaihui-Cheng is for downloading the 100ns simulation data, which we are actively uploading. If you would like to download all currently available protein data at once, you can use the command git lfs pull (without specifying --include="{protein_id}/*") in step 3.
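
Until such a list is published, one way to see which proteins are currently in the repository is to read the Git LFS file listing from inside the `dynamicPDB_raw` clone. The sketch below is an assumption-laden illustration, not an official tool: the helper name `protein_ids_from_paths` is hypothetical, and it assumes every protein lives in its own top-level directory, as in `{protein_id}/{protein_id}.tar.gz.part*`.

```shell
# Hypothetical helper: reduce a list of repository paths (one per line on stdin)
# to the unique top-level directory names, which are assumed to be protein IDs.
protein_ids_from_paths() {
    grep '/' | cut -d/ -f1 | sort -u
}

# Intended use, run inside the dynamicPDB_raw clone:
#   git lfs ls-files --name-only | protein_ids_from_paths
```

Because the clone was made with `GIT_LFS_SKIP_SMUDGE=1`, the listing works before any large file has been downloaded.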

Please let us know if you have any other questions or suggestions.

@meatball1982

@zqcai19

  1. Thank you very much for your reply, and I truly appreciate your willingness to provide the original dynamic trajectories, as I know this can be very time-consuming.

  2. I am currently working on a project related to RMSF, and if possible, could you please share all of your PDB IDs (not just the ones you have uploaded)? I would be even more grateful if you could also provide the initial conformations and the RMSF values corresponding to all the PDBs you calculated. This data should not be very large, and uploading and downloading it shouldn't take too much time.

Thank you again for your help!
