Merge pull request #41 from MobleyLab/mobley

Remove duplicate from database, update DOIs and other info
MobleyLab · Jun 16, 2017 · 9d8a809
2 parents 51e85da + 7a9e7ff
commit 9d8a809
Showing 23 changed files with 735 additions and 651 deletions.
diff --git a/README.md b/README.md
@@ -165,6 +165,9 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
 **The changes made in the Version 0.5 and 0.51 updates are described in our recent FreeSolv update/mini-review paper in the [Journal of Chemical and Engineering Data](http://pubs.acs.org/doi/abs/10.1021/acs.jced.7b00104)**.
 
 ## Changes not yet in a formal release:
+- Remove redundant molecule mobley_4689084, which duplicates mobley_352111 (it had the same experimental value, and a calculated value within uncertainty)
+- Add utility functionality to easily check for duplicates; rebuild database after removing above duplicate and checking for others
+- Update the reference for calculated values to point to the Journal of Chemical and Engineering Data paper.
 
 # Contributors
 
@@ -174,6 +177,7 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
 - The many people who contributed to the SAMPL challenges over the years and our early studies on hydration free energies, prior to construction of this database.
 - Guilherme Duarte Ramos Matos (UC Irvine)
 - Daisy Y. Kyu (UC Irvine)
+- Caitlin Bannan (UC Irvine)
 - John D. Chodera (MSKCC)
 - Michael R. Shirts (Colorado)
 - Hannes H. Loeffler (STFC Daresbury)
@@ -191,4 +195,4 @@ Same as the above but initiates Zenodo DOIs. DOI http://dx.doi.org/10.5281/zenod
 * (8) Mobley, D. L., Liu, S., Cerutti, D. S., Swope, W. C., & Rice, J. E. (2012). Alchemical prediction of hydration free energies for SAMPL. Journal of Computer-Aided Molecular Design, 26(5), 551–562. doi:10.1007/s10822-011-9528-8
 * (9) Mobley, D. L., Wymer, K. L., Lim, N. M., Guthrie, J. P., "Blind prediction of solvation free energies from the SAMPL4 challenge", Journal of Computer-Aided Molecular Design, 28:135-150 (2014).
 * (10) Mobley, D. L., and Guthrie, J. P., "FreeSolv: A database of experimental and calculated hydration free energies, with input files", Journal of Computer-Aided Molecular Design, 28(7):711-720 (2014)
-* (11) Duarte Ramos Matos, G. et al., "Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database", bioRxiv [10.1101/104281](https://doi.org/10.1101/104281)
+* (11) Duarte Ramos Matos, G. et al., "Approaches for calculating solvation free energies and enthalpies demonstrated with an update of the FreeSolv database", Journal of Chemical and Engineering Data 62(5):1559-1569 (2017) [10.1021/acs.jced.7b00104](https://doi.org/10.1021/acs.jced.7b00104)
diff --git a/amber.tar.gz b/amber.tar.gz
diff --git a/charmm.tar.gz b/charmm.tar.gz
diff --git a/database.json b/database.json
diff --git a/database.pickle b/database.pickle
diff --git a/database.txt b/database.txt
diff --git a/desmond.tar.gz b/desmond.tar.gz
diff --git a/gromacs.tar.gz b/gromacs.tar.gz
diff --git a/gromacs_original.tar.gz b/gromacs_original.tar.gz
diff --git a/gromacs_solvated.tar.gz b/gromacs_solvated.tar.gz
diff --git a/groups.txt b/groups.txt
@@ -292,7 +292,6 @@ mobley_4678740; 	 m-bis(trifluoromethyl)benzene 	; halogen derivative; aromatic
 mobley_4683624; 	 4-propylphenol 	; phenol or hydroxyhetarene; aromatic
 mobley_4687447; 	 2-propoxyethanol 	; primary alcohol; dialkyl ether
 mobley_468867; 	 heptachlor 	; alkyl chloride; alkene
-mobley_4689084; 	 2-acetoxyethyl acetate 	; carboxylic acid ester
 mobley_4690963; 	 1,2-diethoxyethane 	; dialkyl ether
 mobley_4694328; 	 octanal 	; aldehyde
 mobley_4699732; 	 1,2-dichloropropane 	; alkyl chloride

diff --git a/iupac_to_cid.json b/iupac_to_cid.json
diff --git a/iupac_to_cid.pickle b/iupac_to_cid.pickle
diff --git a/lammps.tar.gz b/lammps.tar.gz
diff --git a/mol2files_gaff.tar.gz b/mol2files_gaff.tar.gz
diff --git a/mol2files_sybyl.tar.gz b/mol2files_sybyl.tar.gz
diff --git a/scripts/README.md b/scripts/README.md
@@ -7,6 +7,7 @@ This contains utility scripts and other tools relating to maintaining, building,
 - `generate-tripos-mol2files.py`: The database did not originally contain a consistent set of mol2 files with SYBYL atom types; at one point, this was used to generate such a set, though this has been superseded by the set generated by `rebuild_freesolv.py` below
 - `hComponents.py`: Script used to analyze GROMACS xvg files and extract components of the enthalpy change, in the early 2017 update to FreeSolv
 - `make_v0.32.py`: Script editing database to update from 0.31 to 0.32.
+- `make_v0.52.py`: Script editing database to update from 0.51 to 0.52.
 - `make_supporting_files.py`: From database pickle file, makes json version, database.txt, groups.txt, and supporting smiles_to_cid and iupac_to_cid in json and pickle formats. 
 - `rebuild_freesolv.py`: Rebuilding the contents of the FreeSolv database from primary data (SMILES strings) for use repeating all of the GROMACS calculations for the early 2017 update to the database. Requires the Chodera Lab's `openmoltools` and the Mobley lab's `SolvationToolkit`, both of which are available from the `omnia` conda channel and on GitHub.
 - `utils.py`: Shared utilities (very short at present) 
diff --git a/scripts/make_supporting_files.py b/scripts/make_supporting_files.py
@@ -10,7 +10,7 @@
 
 #Put it in a nice table for easy parsing. Use semicolons to separate fields, making sure each individual field doesn't contain any semicolons since this would break parsing.
 
-outtext = ["#Hydration free energy datbase v0.5, 1/19/17.\n"]
+outtext = ["#Hydration free energy database v0.52, 6/11/17.\n"]
 outtext += ["#Semicolon-delimited text file with fields in the following format:\n"]
 outtext += ["# compound id (and file prefix); SMILES; iupac name (or alternative if IUPAC is unavailable or not parseable by OEChem); experimental value (kcal/mol); experimental uncertainty (kcal/mol); Mobley group calculated value (GAFF) (kcal/mol); calculated uncertainty (kcal/mol); experimental reference (original or paper this value was taken from); calculated reference; text notes.\n"]
 
@@ -25,7 +25,7 @@
     if ';' in notes: #Make sure no semicolon in notes
         #Fix issue where I used a semicolon
         notes = notes.replace('not presently available;', 'not presently available, so')
-        if ';' in notes:        
+        if ';' in notes:
             print("ERROR: For %s, note contains ;. The note is:" % cid, notes)
     if ';' in database[cid]['expt_reference']:
         print("ERROR: For %s, experimental reference contains ;. The reference is:" % cid, database[cid]['expt_reference'])

diff --git a/scripts/make_v0.52.py b/scripts/make_v0.52.py
@@ -0,0 +1,35 @@
+#!/usr/bin/env python
+
+"""This will update the current v0.51 database to v0.52 to reflect the following changes:
+- Update DOI for all calculated values to 2017 J Chem Eng Data paper associated with v0.51 (10.1021/acs.jced.7b00104)
+- Remove duplicate compound mobley_4689084, which was a SAMPL1 compound that was already present in the earlier "504 molecule" set with the same experimental value and therefore duplicates mobley_352111. In SAMPL1 it was referred to as "ethylene glycol diacetate" and originally it was 1,2-diacetoxyethane in the Mobley and earlier Rizzo sets. Apparently when Guthrie and OpenEye were curating SAMPL1, they did not notice that this compound was already present in public datasets, and somehow I missed it when checking SMILES strings in the database for duplicates.
+- Use the new functionality in utils.py to check the database for duplicates prior to export, by regenerating a standardized SMILES string for each compound from its database SMILES and cross-checking them.
+"""
+
+# Load database
+import pickle
+import utils
+file = open('../database.pickle', 'rb')
+database = pickle.load(file, encoding='latin1')
+file.close()
+
+# Remove mobley_4689084
+database.pop('mobley_4689084')
+
+# Update DOI for calculated values
+for cid in database:
+    database[cid]['calc_reference'] = '10.1021/acs.jced.7b00104'
+
+# Check for duplicates
+num_dupes, keypairs = utils.check_for_duplicates( database )
+if num_dupes > 0:
+    raise Exception("Error: %s duplicates found: %s" % (num_dupes, keypairs))
+
+# Write out database
+file = open('../database.pickle', 'wb')
+pickle.dump(database, file)
+file.close()
+
+# Update supporting files
+import os
+os.system('python make_supporting_files.py')
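
The in-memory part of the update above (drop the duplicate, then repoint every calculated reference) can be sketched as a pure function, which makes it easy to check without touching `database.pickle`. This is only an illustration: `apply_v052_changes` and the toy entries are hypothetical and not part of the repository, and unlike the script it silently tolerates an already-removed entry instead of raising `KeyError`.

```python
import copy

# Hypothetical helper, not part of the repository: applies the in-memory
# v0.51 -> v0.52 edits to a copy of the database dictionary.
def apply_v052_changes(database,
                       duplicate_id="mobley_4689084",
                       doi="10.1021/acs.jced.7b00104"):
    updated = copy.deepcopy(database)   # leave the caller's dict untouched
    updated.pop(duplicate_id, None)     # tolerate an already-removed entry
    for entry in updated.values():
        entry["calc_reference"] = doi   # same DOI for every compound
    return updated

# Toy entries for illustration only; the values are placeholders,
# not real FreeSolv data.
toy_db = {
    "mobley_352111": {"calc_reference": "old"},
    "mobley_4689084": {"calc_reference": "old"},
}
new_db = apply_v052_changes(toy_db)
print(sorted(new_db))
```

Reading and writing the pickle, the duplicate check, and the call to `make_supporting_files.py` would stay in the script itself, around a function like this one.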
diff --git a/scripts/utils.py b/scripts/utils.py
@@ -5,6 +5,7 @@
 
 #import cPickle as pickle
 import pickle
+from openeye.oechem import OEMol, OEParseSmiles, OEMolToSmiles
 
 def read_database():
     """Read the database from a pickle file and return it"""
@@ -23,3 +24,48 @@ def convert_to_json( database_pickle, database_json):
 
     with open(database_json,"w", encoding='utf-8') as fs:
         json.dump(freeSolv,fs)
+
+def check_for_duplicates( database_contents ):
+    """Take contents of database and re-generate all SMILES, checking for duplicates.
+
+    Parameters
+    ----------
+    database_contents : dict
+        dictionary of FreeSolv database, keyed by compound ID
+
+    Returns
+    -------
+    num_dupes : int
+        Number of duplicated compound pairs found
+    keypairs : list
+        List containing tuples of pairs corresponding to the compound IDs of the duplicates
+    """
+
+    # Pull compound IDs
+    cids = list(database_contents)
+
+    # Generate new OEMols from SMILES
+    oemols = []
+    for cid in cids:
+        mol = OEMol()
+        OEParseSmiles(mol, database_contents[cid]['smiles'])
+        oemols.append(mol)
+
+    # Generate new SMILES from OEMols, thereby standardizing
+    smiles = []
+    for mol in oemols:
+        smiles.append(OEMolToSmiles(mol))
+
+    # Build duplicate info
+    clean_smiles = []
+    keypairs = []
+    for idx,cid in enumerate(cids):
+        smi = smiles[idx]
+        if smi not in clean_smiles:
+            clean_smiles.append(smi)
+        else:
+            dupe_idx = smiles.index(smi)
+            keypairs.append( (cids[dupe_idx], cid) )
+
+    return len(keypairs), keypairs
+
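
The duplicate bookkeeping in `check_for_duplicates` can also be written with a dictionary that remembers the first compound ID seen for each standardized SMILES, which avoids the repeated list scans. The sketch below is an assumption-laden illustration, not the repository's code: `find_duplicate_pairs` is a hypothetical name, and the `canonicalize` argument stands in for the OpenEye round trip (`OEParseSmiles` followed by `OEMolToSmiles`) so that the example runs without an OpenEye license. Here `str.strip` is used purely as a placeholder canonicalizer.

```python
# Hypothetical variant of the duplicate check; the canonicalizer is passed in
# so the logic can be exercised without OpenEye installed.
def find_duplicate_pairs(database_contents, canonicalize=lambda smi: smi):
    first_seen = {}   # standardized SMILES -> first compound ID that had it
    keypairs = []     # (earlier ID, later ID) for every repeated SMILES
    for cid, entry in database_contents.items():
        key = canonicalize(entry["smiles"])
        if key in first_seen:
            keypairs.append((first_seen[key], cid))
        else:
            first_seen[key] = cid
    return len(keypairs), keypairs

# Toy entries for illustration only; IDs and SMILES are placeholders.
toy_db = {
    "toy_1": {"smiles": "CCO"},
    "toy_2": {"smiles": "CCC"},
    "toy_3": {"smiles": " CCO "},   # same as toy_1 once whitespace is stripped
}
num_dupes, pairs = find_duplicate_pairs(toy_db, canonicalize=str.strip)
print(num_dupes, pairs)
```

As in the function above, each later duplicate is paired with the first compound that produced the same string, so the return value has the same shape as `(num_dupes, keypairs)`.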
diff --git a/sdffiles.tar.gz b/sdffiles.tar.gz
diff --git a/smiles_to_cid.json b/smiles_to_cid.json
diff --git a/smiles_to_cid.pickle b/smiles_to_cid.pickle