Commit

Merge pull request #41 from MobleyLab/mobley
Remove duplicate from database, update DOIs and other info
davidlmobley authored Jun 16, 2017
2 parents 51e85da + 7a9e7ff commit 9d8a809
Showing 23 changed files with 735 additions and 651 deletions.
6 changes: 5 additions & 1 deletion README.md
@@ -165,6 +165,9 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
**The changes made in the Version 0.5 and 0.51 updates are described in our recent FreeSolv update/mini-review paper in the [Journal of Chemical and Engineering Data](http://pubs.acs.org/doi/abs/10.1021/acs.jced.7b00104)**.

## Changes not yet in a formal release:
- Remove redundant molecule mobley_4689084 (which duplicates mobley_352111, with the same experimental value and a calculated value within uncertainty)
- Add utility functionality to easily check for duplicates; rebuild database after removing above duplicate and checking for others
- Update reference for calculated values to refer to the J Chem Engr Data reference.

# Contributors

@@ -174,6 +177,7 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
- The many people who contributed to the SAMPL challenges over the years and our early studies on hydration free energies, prior to construction of this database.
- Guilherme Duarte Ramos Matos (UC Irvine)
- Daisy Y. Kyu (UC Irvine)
- Caitlin Bannan (UC Irvine)
- John D. Chodera (MSKCC)
- Michael R. Shirts (Colorado)
- Hannes H. Loeffler (STFC Daresbury)
@@ -191,4 +195,4 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
* (8) Mobley, D. L., Liu, S., Cerutti, D. S., Swope, W. C., & Rice, J. E. (2012). Alchemical prediction of hydration free energies for SAMPL. Journal of Computer-Aided Molecular Design, 26(5), 551–562. doi:10.1007/s10822-011-9528-8
* (9) Mobley, D. L., Wymer, K. L., Lim, N. M., Guthrie, J. P. (2014) "Blind prediction of solvation free energies from the SAMPL4 challenge", Journal of Computer-Aided Molecular Design, 28:135-150 (2014).
* (10) Mobley, D. L., and Guthrie, J. P., "FreeSolv: A database of experimental and calculated hydration free energies, with input files", Journal of Computer-Aided Molecular Design, 28(7):711-720 (2014)
-* (11) Duarte Ramos Matos, G. et al., "Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database", bioRxiv [10.1101/104281](https://doi.org/10.1101/104281)
+* (11) Duarte Ramos Matos, G. et al., "Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database", Journal of Chemical and Engineering Data 62(5):1559-1569 (2017) [10.1021/acs.jced.7b00104](https://doi.org/10.1021/acs.jced.7b00104)
Binary file modified amber.tar.gz
Binary file not shown.
Binary file modified charmm.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion database.json

Large diffs are not rendered by default.

Binary file modified database.pickle
Binary file not shown.
1,287 changes: 643 additions & 644 deletions database.txt

Large diffs are not rendered by default.

Binary file modified desmond.tar.gz
Binary file not shown.
Binary file modified gromacs.tar.gz
Binary file not shown.
Binary file modified gromacs_original.tar.gz
Binary file not shown.
Binary file modified gromacs_solvated.tar.gz
Binary file not shown.
1 change: 0 additions & 1 deletion groups.txt
@@ -292,7 +292,6 @@ mobley_4678740; m-bis(trifluoromethyl)benzene ; halogen derivative; aromatic
mobley_4683624; 4-propylphenol ; phenol or hydroxyhetarene; aromatic
mobley_4687447; 2-propoxyethanol ; primary alcohol; dialkyl ether
mobley_468867; heptachlor ; alkyl chloride; alkene
-mobley_4689084; 2-acetoxyethyl acetate ; carboxylic acid ester
mobley_4690963; 1,2-diethoxyethane ; dialkyl ether
mobley_4694328; octanal ; aldehyde
mobley_4699732; 1,2-dichloropropane ; alkyl chloride
2 changes: 1 addition & 1 deletion iupac_to_cid.json

Large diffs are not rendered by default.

Binary file modified iupac_to_cid.pickle
Binary file not shown.
Binary file modified lammps.tar.gz
Binary file not shown.
Binary file modified mol2files_gaff.tar.gz
Binary file not shown.
Binary file modified mol2files_sybyl.tar.gz
Binary file not shown.
1 change: 1 addition & 0 deletions scripts/README.md
@@ -7,6 +7,7 @@ This contains utility scripts and other tools relating to maintaining, building,
- `generate-tripos-mol2files.py`: The database did not originally contain a consistent set of mol2 files with SYBYL atom types; at one point, this was used to generate such a set, though this has been superseded by the set generated by `rebuild_freesolv.py` below
- `hComponents.py`: Script used to analyze GROMACS xvg files and extract components of the enthalpy change, in the early 2017 update to FreeSolv
- `make_v0.32.py`: Script editing database to update from 0.31 to 0.32.
- `make_v0.52.py`: Script editing database to update from 0.51 to 0.52.
- `make_supporting_files.py`: From database pickle file, makes json version, database.txt, groups.txt, and supporting smiles_to_cid and iupac_to_cid in json and pickle formats.
- `rebuild_freesolv.py`: Rebuilding the contents of the FreeSolv database from primary data (SMILES strings) for use repeating all of the GROMACS calculations for the early 2017 update to the database. Requires the Chodera Lab's `openmoltools` and the Mobley lab's `SolvationToolkit`, both of which are available from the `omnia` conda channel and on GitHub.
- `utils.py`: Shared utilities (very short at present)
4 changes: 2 additions & 2 deletions scripts/make_supporting_files.py
@@ -10,7 +10,7 @@

#Put it in a nice table for easy parsing. Use semicolons to separate fields, making sure each individual field doesn't contain any semicolons since this would break parsing.

-outtext = ["#Hydration free energy datbase v0.5, 1/19/17.\n"]
+outtext = ["#Hydration free energy datbase v0.52, 6/11/17.\n"]
outtext += ["#Semicolon-delimited text file with fields in the following format:\n"]
outtext += ["# compound id (and file prefix); SMILES; iupac name (or alternative if IUPAC is unavailable or not parseable by OEChem); experimental value (kcal/mol); experimental uncertainty (kcal/mol); Mobley group calculated value (GAFF) (kcal/mol); calculated uncertainty (kcal/mol); experimental reference (original or paper this value was taken from); calculated reference; text notes.\n"]

@@ -25,7 +25,7 @@
    if ';' in notes: #Make sure no semicolon in notes
        #Fix issue where I used a semicolon
        notes = notes.replace('not presently available;', 'not presently available, so')
-        if ';' in notes:
+        if ';' in notes:
            print("ERROR: For %s, note contains ;. The note is:" % cid, notes)
    if ';' in database[cid]['expt_reference']:
        print("ERROR: For %s, experimental reference contains ;. The reference is:" % cid, database[cid]['expt_reference'])
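
The header comment written by this script describes ten semicolon-separated fields per record, which is why a stray semicolon inside a note or reference would break parsing. A minimal sketch of a reader for one such line is shown here; the function name, the field identifiers, and the sample record are assumptions made for illustration and are not part of the repository.

```python
# Field names follow the header comment written by make_supporting_files.py;
# the Python identifiers themselves are hypothetical, chosen for this sketch.
FIELDS = (
    "compound_id", "smiles", "iupac_name",
    "expt_value", "expt_uncertainty",
    "calc_value", "calc_uncertainty",
    "expt_reference", "calc_reference", "notes",
)
NUMERIC = {"expt_value", "expt_uncertainty", "calc_value", "calc_uncertainty"}


def parse_database_line(line):
    """Split one semicolon-delimited record into a dict.

    Returns None for blank lines and '#' comment lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = [part.strip() for part in line.split(";")]
    # A semicolon inside any field shows up here as a wrong field count.
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    for name in NUMERIC:
        record[name] = float(record[name])
    return record


# Made-up record for illustration only; these values are not taken from FreeSolv.
example = ("mobley_0000000; CCO; ethanol; -5.00; 0.60; -4.50; 0.03; "
           "example expt ref; 10.1021/acs.jced.7b00104; example note")
record = parse_database_line(example)
print(record["compound_id"], record["expt_value"])
```

A record whose notes field still contained a semicolon would split into eleven parts and raise `ValueError`, which is the failure the script's checks are guarding against.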
35 changes: 35 additions & 0 deletions scripts/make_v0.52.py
@@ -0,0 +1,35 @@
#!/bin/env python

"""This will update the current v0.51 database to v0.52 to reflect the following changes:
- Update DOI for all calculated values to 2017 J Chem Eng Data paper associated with v0.51 (10.1021/acs.jced.7b00104)
- Remove duplicate compound mobley_4689084, which was a SAMPL1 compound that was already present in the earlier "504 molecule" set with the same experimental value and therefore duplicates mobley_352111. In SAMPL1 it was referred to as "ethylene glycol diacetate" and originally it was 1,2-diacetoxyethane in the Mobley and earlier Rizzo sets. Apparently when Guthrie and OpenEye were curating SAMPL1, they did not notice that this compound was already present in public datasets, and somehow I missed it when checking SMILES strings in the database for duplicates.
- Now takes advantage of new functionality added to utils.py to check database for duplicates prior to export by creating new SMILES strings for each from the database SMILES and cross-check.
"""

# Load database
import pickle
import utils
file = open('../database.pickle', 'rb')
database = pickle.load(file, encoding='latin1')
file.close()

# Remove mobley_4689084
database.pop('mobley_4689084')

# Update DOI for calculated values
for cid in database:
    database[cid]['calc_reference'] = '10.1021/acs.jced.7b00104'

# Check for duplicates
num_dupes, keypairs = utils.check_for_duplicates( database )
if num_dupes > 0:
    raise Exception("Error: %s duplicates found." % num_dupes)

# Write out database
file = open('../database.pickle', 'wb')
pickle.dump(database, file)
file.close()

# Update supporting files
import os
os.system('python make_supporting_files.py')
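
The two data edits in this script (dropping the duplicate entry and pointing every calculated-value reference at the new DOI) can be tried on an in-memory dictionary without touching `database.pickle`. The sketch below is only an illustration under that assumption: `apply_v052_changes` is a hypothetical helper, the toy entries use placeholder values, and unlike the script it works on a copy rather than editing in place.

```python
import copy

NEW_CALC_DOI = "10.1021/acs.jced.7b00104"  # DOI used by make_v0.52.py


def apply_v052_changes(database, duplicate_cid="mobley_4689084"):
    """Return a copy of `database` without the duplicate and with updated calc references.

    Hypothetical helper for illustration; the actual script edits the dict in place.
    """
    updated = copy.deepcopy(database)
    # pop() with a default avoids a KeyError if the entry was already removed
    updated.pop(duplicate_cid, None)
    for entry in updated.values():
        entry["calc_reference"] = NEW_CALC_DOI
    return updated


# Toy stand-in for the real database; SMILES and references are placeholders.
toy = {
    "mobley_352111": {"smiles": "placeholder-a", "calc_reference": "old"},
    "mobley_4689084": {"smiles": "placeholder-a", "calc_reference": "old"},
}
result = apply_v052_changes(toy)
print(sorted(result))
```

Working on a copy makes the change easy to inspect before anything is written back to disk, at the cost of holding two copies of the database in memory.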
46 changes: 46 additions & 0 deletions scripts/utils.py
@@ -5,6 +5,7 @@

#import cPickle as pickle
import pickle
from openeye.oechem import *

def read_database():
    """Read the database from a pickle file and return it"""
@@ -23,3 +24,48 @@ def convert_to_json( database_pickle, database_json):

    with open(database_json,"w", encoding='utf-8') as fs:
        json.dump(freeSolv,fs)

def check_for_duplicates( database_contents ):
    """Take contents of database and re-generate all SMILES, checking for duplicates.
    Parameters:
    ----------
    database_contents : dict
        dictionary of FreeSolv database, keyed by compound ID
    Returns:
    ----------
    num_dupes : int
        Number of duplicated compound pairs found
    keypairs : list
        List containing tuples of pairs corresponding to the compound IDs of the duplicates
    """

    # Pull compound IDs
    cids = [ item for item in database_contents ]

    # Generate new OEMols from SMILES
    oemols = []
    for cid in cids:
        mol = OEMol()
        OEParseSmiles(mol, database_contents[cid]['smiles'])
        oemols.append(mol)

    # Generate new SMILES from OEMols, thereby standardizing
    smiles = []
    for mol in oemols:
        smiles.append(OEMolToSmiles(mol))

    # Build duplicate info
    clean_smiles = []
    keypairs = []
    for idx,cid in enumerate(cids):
        smi = smiles[idx]
        if smi not in clean_smiles:
            clean_smiles.append(smi)
        else:
            dupe_idx = smiles.index(smi)
            keypairs.append( (cids[dupe_idx], cid) )

    return len(keypairs), keypairs
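
The pairing logic in `check_for_duplicates` does not depend on OpenEye itself, only on having some function that maps equivalent SMILES to the same string. The sketch below isolates that logic, using a dictionary lookup in place of the list searches; `find_duplicates` and `toy_canonicalize` are hypothetical names, and the toy canonicalizer is deliberately not chemically meaningful. In real use one would presumably pass a function that round-trips through `OEParseSmiles` and `OEMolToSmiles` as the code above does.

```python
def find_duplicates(database_contents, canonicalize):
    """Return (first_cid, later_cid) pairs whose canonicalized SMILES match."""
    first_seen = {}  # canonical SMILES -> first compound ID that produced it
    keypairs = []
    for cid, entry in database_contents.items():
        key = canonicalize(entry["smiles"])
        if key in first_seen:
            keypairs.append((first_seen[key], cid))
        else:
            first_seen[key] = cid
    return keypairs


def toy_canonicalize(smiles):
    # NOT chemically meaningful: sorting characters is only a stand-in so the
    # example runs without a cheminformatics toolkit.
    return "".join(sorted(smiles))


# Toy database with hypothetical compound IDs.
toy_db = {
    "cid_a": {"smiles": "CCO"},
    "cid_b": {"smiles": "CCC"},
    "cid_c": {"smiles": "OCC"},
}
pairs = find_duplicates(toy_db, toy_canonicalize)
print(pairs)
```

As in the original function, each later duplicate is paired with the first compound ID that produced the same standardized string; the dictionary simply avoids rescanning a list for every entry.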

Binary file modified sdffiles.tar.gz
Binary file not shown.
2 changes: 1 addition & 1 deletion smiles_to_cid.json

Large diffs are not rendered by default.

Binary file modified smiles_to_cid.pickle
Binary file not shown.
