Merge development into master (#80)
* Various small bug fixes (#74)

* Do not delimit more than once per row

* Also removed InputFileNotFoundError

  * Let's just use FileNotFoundError

* Also raise exception for unreachable code

* Remove comments from test input files

* Increase threshold for calling ``ngrams``

* ``get_gram_chunks`` now calls ``ngrams`` if ``input`` has fewer than
  15 tokens (see the sketch below)

  * As opposed to less than 7 tokens
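
  A minimal sketch of the new gating logic. Only the 15-token threshold
  comes from this change; it assumes ``ngrams`` is NLTK's contiguous
  n-gram generator, and the function name, arguments and
  combination-based fallback are illustrative guesses::

      from itertools import combinations

      from nltk import ngrams  # contiguous n-gram generator


      def get_gram_chunks(input_tokens, n):
          """Return n-token chunks of input_tokens.

          Short inputs (now fewer than 15 tokens, up from 7) yield every
          contiguous n-gram; longer inputs fall back to a cheaper
          combination-based chunking to bound the search space.
          """
          if len(input_tokens) < 15:
              return list(ngrams(input_tokens, n))
          return list(combinations(input_tokens, n))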

* Clean up matching logic (#75)

* Empty commit

* To create WIP PR

* Made full and non-full output more similar

* Replaced ``Final_Refined_Terms_with_Resource_IDs`` in full format with
  ``Matched_Components``

* Modified tests accordingly

* Clean up ``pipeline.run``, and other things

* ``pipeline.run``

  * Removed a lot of unnecessary variables

  * Output columns now stored in clearly named variables, and all
    columns are populated at the same time

  * Removed reporting of matches when there are no matches

  * General clean-up

* Other things

  * Renamed ``punctuationTreatment`` to ``punctuation_treatment``

  * Modified tests accordingly

* Increased robustness of full term matching

* Now calls new function ``map_term``

* Cleaner

* Attempts suffix addition on synonyms as well, so more robust (see the
  sketch below)
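
  A hedged sketch of the flow this describes. The lookup-table layout,
  the ``consider_suffixes`` flag and the helper stub are assumptions;
  only the ordering (plain term first, then synonyms, with suffixed
  variants of both tried last) is taken from these changes::

      def _map_term_helper(term, lookup_table):
          """Stand-in for the real helper: exact lookup against the table."""
          return lookup_table["terms"].get(term)


      def map_term(term, lookup_table, consider_suffixes=False):
          """Map term to an ontology term in lookup_table."""
          # Try the term itself, then its synonyms.
          candidates = [term] + lookup_table["synonyms"].get(term, [])
          for candidate in candidates:
              match = _map_term_helper(candidate, lookup_table)
              if match:
                  return match
          # Suffixes are considered last, and are attempted on synonyms
          # too, which is what makes the matching more robust.
          if consider_suffixes:
              for candidate in candidates:
                  for suffix in lookup_table["suffixes"]:
                      match = _map_term_helper(candidate + " " + suffix,
                                               lookup_table)
                      if match:
                          return match
          return None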

* Empty commit

* To test modification of coveralls setting

* Increased robustness of component matching

* Component matching now relies on the more robust ``map_term``

* Removed duplicates in ``micro_status``

* Removed unnecessary variables

* Fixed bug in ``_map_term_helper``

* Modified tests accordingly

* Cleaned up ``genomeTrackerMaster.csv``

* Removed stale code

* Docstrings

* ``match_term``

* ``_match_term_helper``

* Modify matching logic

* We want to consider suffixes last

* Also modifies some code to make it adhere to PEP 8

* More transparent matching (#76)

* Less randomness, better reporting and clean code

* Less randomness
  * Replaced usages of set to remove duplicates from lists with usages
    of the ``OrderedDict.fromkeys`` method (example after this list)
    * Preserves original order of list
    * Did replace some lists with sets, when it did not have any
      effect on the order of anything important
      * For speed-up purposes
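
  For reference, the order-preserving deduplication idiom adopted here,
  using only the standard library::

      from collections import OrderedDict

      tokens = ["fluid", "cerebrospinal", "fluid", "sample"]

      # set() removes duplicates, but its iteration order is arbitrary,
      # so results can vary from run to run
      unordered = list(set(tokens))

      # OrderedDict.fromkeys keeps the first occurrence of each element,
      # preserving the original order of the list
      deduped = list(OrderedDict.fromkeys(tokens))
      print(deduped)  # ['fluid', 'cerebrospinal', 'sample']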

* Better reporting
  * Added chronological order to ``matched_components``
    * Report component matches in exact order they are made
  * Added chronological order to ``micro_status``
    * Report ``micro_status`` elements in exact order they are added
  * Report the exact tokens that treatments or match conditions apply to
    * e.g., 'Abbreviation-Acronym Treatment' is now
      'Abbreviation-Acronym Treatment: fluid'
    * e.g., 'A Direct Match' is now
      "{cerebrospinal fluid: ['A Direct Match']}"
  * All the above makes the matching process more transparent
  * Also sorted third-party final classifications

* Clean code
  * Removed unnecessary functions ``allPermutations`` and ``combi``
  * Updated stale comments

* Adjusted tests accordingly

* Removed some randomness from ``retainedPhrase``

* Usage of set to remove duplicates replaced with
  ``OrderedDict.fromkeys``

* Adjusted tests accordingly

* Add already-cached lookup and classification tables to package installation (#77)

* Empty commit to create WIP PR

* Add lookup and classification tables to vcs

* Modified ``--no-cache`` usage

* Only applies to online ontology resources now (see the sketch after
  this list)

  * No longer needed for ``lookup_table`` and (for now)
    ``classification_lookup_table``

    * Since they are now committed to vcs

  * Modified ``--help`` message for clarity

  * Modified tests to take advantage of this for speed-up
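
  A sketch of the resulting policy. ``fetch_ontology`` and the path
  handling are illustrative stand-ins rather than LexMapr's actual API;
  only the rule that ``--no-cache`` governs online ontology fetches,
  while the committed tables always load from the package, comes from
  this change::

      import json
      import os


      def fetch_ontology(iri):
          """Stand-in for the real online ontology fetch."""
          raise NotImplementedError


      def get_fetched_ontology(iri, cache_path, no_cache=False):
          """Fetch an online ontology resource, honouring --no-cache."""
          if not no_cache and os.path.exists(cache_path):
              with open(cache_path) as fp:  # reuse the cached copy
                  return json.load(fp)
          ontology = fetch_ontology(iri)  # --no-cache forces a refetch
          with open(cache_path, "w") as fp:
              json.dump(ontology, fp)
          return ontology


      def get_lookup_table(path):
          """Load a table committed to vcs; --no-cache does not apply."""
          with open(path) as fp:
              return json.load(fp)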

* Implement ``--profile`` command (#79)

* Moved cache logic to new ``pipeline_caching.py``

* Population of ``lookup_table``, ``ontology_lookup_table`` and
  ``classification_lookup_table`` moved to ``pipeline_caching.py``

  * New functions ``get_predefined_resources``,
    ``get_config_resources`` and ``get_classification_resources``
    respectively

  * Moved in following functions from ``pipeline_helpers.py``:

    * ``get_resource_dict``

    * ``create_lookup_table_skeleton``

    * ``add_predefined_resources_to_lookup_table``

    * ``get_resource_permutation_terms``

    * ``get_resource_bracketed_permutation_terms``

    * ``add_fetched_ontology_to_lookup_table``

  * Moved in ``add_classification_resources_to_lookup_table`` from
    ``pipeline_classification.py``

* Removed the last stale use of ``pkg_resources`` in favour of the path
  from ``lexmapr.definitions.ROOT``

* Improved in-code documentation in some places

* Modifies tests and imports accordingly

* Docstrings for new functions

* Stylistic improvements in ``bin/lexmapr``

* Change single quotes to double quotes

* Added blank lines to adhere to PEP 8

* Improve ``bin/lexmapr`` some more

* Remove unnecessary ``if-else`` conditions by cleaning up
  ``input_file`` and ``-v, --version`` arguments somewhat

* Renamed ``lexmapr.resources`` and ``lexmapr.cache``

* ``lexmapr.resources`` is now ``lexmapr.predefined_resources``

* ``lexmapr.cache`` is now ``lexmapr.resources``

* Once we started committing the lookup and classification tables to
  ``lexmapr.cache`` (and once we eventually commit profiles), it ceased
  to be a cache

* Renamed ``pipeline_caching.py`` to ``pipeline_resources.py``

* Updated comments accordingly

* Implement ``-p, --profile`` functionality

* User can specify profile they want to use via ``-p, --profile`` flag,
  which will contain a pre-defined set of command-line arguments and
  online ontology resources

  * Currently, only the ifsac profile is available

  * Profile arguments can be overridden by explicitly declaring other
    arguments (see the sketch after this list)

    * Except for ``--no-cache`` and ``--bucket``

      * Will figure this out in the future

* Two new functions carry most of the logic

  * ``lexmapr.pipeline_resources.get_profile_args``

  * ``lexmapr.pipeline_resources.get_profile_resources``
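
  A hedged sketch of the override behaviour in ``get_profile_args``. The
  profile file location and JSON layout are assumptions; only the
  precedence rule (explicit arguments beat profile defaults, except
  ``no_cache`` and ``bucket``) comes from this change::

      import json
      import os

      from lexmapr.definitions import ROOT  # package root path


      def get_profile_args(args):
          """Fill unspecified arguments with the profile's defaults."""
          # Hypothetical layout: each profile ships a JSON file of
          # pre-defined command-line argument values.
          profile_path = os.path.join(ROOT, "resources", "profiles",
                                      args.profile, "args.json")
          with open(profile_path) as fp:
              profile_defaults = json.load(fp)
          for arg, value in profile_defaults.items():
              # Explicitly declared arguments win, except no_cache and
              # bucket, which (for now) always come from the profile.
              if getattr(args, arg, None) is None or arg in ("no_cache",
                                                             "bucket"):
                  setattr(args, arg, value)
          return args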

* Other changes

  * Brought ``MANIFEST.in`` up to date
ivansg44 authored Oct 8, 2019
1 parent 3f65c62 commit e351d6a
Showing 73 changed files with 3,449 additions and 3,586 deletions.
6 changes: 2 additions & 4 deletions .gitignore
@@ -107,7 +107,5 @@ venv.bak/
 .idea
 
 # Cached resources
-lexmapr/cache/fetched_ontologies/
-lexmapr/cache/ontology_lookup_tables/
-lexmapr/cache/lookup_table.json
-lexmapr/cache/classification_lookup_table.json
+lexmapr/resources/fetched_ontologies/
+lexmapr/resources/ontology_lookup_tables/
4 changes: 3 additions & 1 deletion MANIFEST.in
@@ -1,2 +1,4 @@
 include lexmapr/resources/*
-include lexmapr/cache/*
+include lexmapr/resources/profiles/*
+include lexmapr/resources/profiles/ifsac/*
+include lexmapr/predefined_resources/*
47 changes: 22 additions & 25 deletions bin/lexmapr
@@ -12,11 +12,6 @@ logger = logging.getLogger("lexmapr")
 script_name = os.path.basename(os.path.realpath(sys.argv[0]))
 
 
-class InputFileNotFoundError(FileNotFoundError):
-    """Exception raised when input file does not exist."""
-    pass
-
-
 def valid_input_file(path):
     """Raises appropriate errors if input file is invalid.
@@ -31,30 +26,32 @@ def valid_input_file(path):
         raise argparse.ArgumentTypeError("Please supply a csv or tsv input file")
 
     if not os.path.exists(path):
-        raise InputFileNotFoundError(path + " not found")
+        raise FileNotFoundError(path + " not found")
 
     return path
 
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('input_file', help='Input csv or tsv file', nargs='?',
-                        type=valid_input_file)
-    parser.add_argument('-o', '--output', nargs='?', help='Output file')
-    parser.add_argument('-f', '--format', default='basic', help='Output format')
-    parser.add_argument('--version', action='store_true', dest='version',
-                        help='Prints version information', required=False)
-    parser.add_argument('-c', '--config',
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
+    parser.add_argument("input_file", help="Input csv or tsv file", type=valid_input_file)
+    parser.add_argument("-o", "--output", nargs="?", help="Output file")
+    parser.add_argument("-f", "--format", default="basic", help="Output format")
+    parser.add_argument("-c", "--config",
                         help="Path to JSON file containing the IRI of ontologies to fetch terms "
                              "from"),
-    parser.add_argument('-b', '--bucket', action='store_true',
+    parser.add_argument("-b", "--bucket", action="store_true",
                         help="Classify samples into pre-defined buckets")
-    parser.add_argument('--no-cache', action='store_true',
-                        help="Ignore or replace cached resources, if there are any.")
+    parser.add_argument("--no-cache", action="store_true",
+                        help="Ignore or replace online cached resources, if there are any.")
+    parser.add_argument("-v", "--version", action="version",
+                        version="%(prog)s " + lexmapr.__version__)
+    parser.add_argument("-p", "--profile", choices=["ifsac"],
+                        help="Pre-defined sets of command-line arguments for specialized purposes:"
+                             "\n\n"
+                             "* ifsac: \n"
+                             "    * maps samples to food and environmental resources\n"
+                             "    * classifies samples into ifsac labels\n"
+                             "    * outputs content to ``ifsac_output.tsv``")
     args = parser.parse_args()
-
-    if args.version:
-        print(script_name + ' ' + lexmapr.__version__)
-    elif not args.input_file:
-        parser.error('Please supply an input file')
-    else:
-        lexmapr.pipeline.run(args)
+
+    lexmapr.pipeline.run(args)