Merge development into master (#80)
* Various small bug fixes (#74)

* Do not delimit more than once per row

* Also removed InputFileNotFoundError

  * Let's just use FileNotFoundError

* Also raise exception for unreachable code

* Remove comments from test input files

* Increase threshold for calling ``ngrams``

* ``get_gram_chunks`` now calls ``ngrams`` if ``input`` has fewer than
  15 tokens (see the sketch below)

  * As opposed to less than 7 tokens
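
  A minimal sketch of the new gating logic. Only the 15-token threshold
  comes from this change; it assumes ``ngrams`` is NLTK's contiguous
  n-gram generator, and the function name, arguments and
  combination-based fallback are illustrative guesses::

      from itertools import combinations

      from nltk import ngrams  # contiguous n-gram generator


      def get_gram_chunks(input_tokens, n):
          """Return n-token chunks of input_tokens.

          Short inputs (now fewer than 15 tokens, up from 7) yield every
          contiguous n-gram; longer inputs fall back to a cheaper
          combination-based chunking to bound the search space.
          """
          if len(input_tokens) < 15:
              return list(ngrams(input_tokens, n))
          return list(combinations(input_tokens, n))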

* Clean up matching logic (#75)

* Empty commit

* To create WIP PR

* Made full and non-full output more similar

* Replaced ``Final_Refined_Terms_with_Resource_IDs`` in full format with
  ``Matched_Components``

* Modified tests accordingly

* Clean up ``pipeline.run``, and other things

* ``pipeline.run``

  * Removed a lot of unnecessary variables

  * Output columns now stored in clearly named variables, and all
    columns are populated at the same time

  * Removed reporting of matches when there are no matches

  * General clean-up

* Other things

  * Renamed ``punctuationTreatment`` to ``punctuation_treatment``

  * Modified tests accordingly

* Increased robustness of full term matching

* Now calls new function ``map_term``

* Cleaner

* Attempts suffix addition on synonyms as well, so more robust (see the
  sketch below)
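
  A hedged sketch of the flow this describes. The lookup-table layout,
  the ``consider_suffixes`` flag and the helper stub are assumptions;
  only the ordering (plain term first, then synonyms, with suffixed
  variants of both tried last) is taken from these changes::

      def _map_term_helper(term, lookup_table):
          """Stand-in for the real helper: exact lookup against the table."""
          return lookup_table["terms"].get(term)


      def map_term(term, lookup_table, consider_suffixes=False):
          """Map term to an ontology term in lookup_table."""
          # Try the term itself, then its synonyms.
          candidates = [term] + lookup_table["synonyms"].get(term, [])
          for candidate in candidates:
              match = _map_term_helper(candidate, lookup_table)
              if match:
                  return match
          # Suffixes are considered last, and are attempted on synonyms
          # too, which is what makes the matching more robust.
          if consider_suffixes:
              for candidate in candidates:
                  for suffix in lookup_table["suffixes"]:
                      match = _map_term_helper(candidate + " " + suffix,
                                               lookup_table)
                      if match:
                          return match
          return None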

* Empty commit

* To test modification of coveralls setting

* Increased robustness of component matching

* Component matching now relies on the more robust ``map_term``

* Removed duplicates in ``micro_status``

* Removed unnecessary variables

* Fixed bug in ``_map_term_helper``

* Modified tests accordingly

* Cleaned up ``genomeTrackerMaster.csv``

* Removed stale code

* Docstrings

* ``match_term``

* ``_match_term_helper``

* Modify matching logic

* We want to consider suffixes last

* Also modifies some code to make it adhere to PEP 8

* More transparent matching (#76)

* Less randomness, better reporting and clean code

* Less randomness
  * Replaced usages of set to remove duplicates from lists with usages
    of the ``OrderedDict.fromkeys`` method (example after this list)
    * Preserves original order of list
    * Did replace some lists with sets, when it did not have any
      effect on the order of anything important
      * For speed-up purposes
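
  For reference, the order-preserving deduplication idiom adopted here,
  using only the standard library::

      from collections import OrderedDict

      tokens = ["fluid", "cerebrospinal", "fluid", "sample"]

      # set() removes duplicates, but its iteration order is arbitrary,
      # so results can vary from run to run
      unordered = list(set(tokens))

      # OrderedDict.fromkeys keeps the first occurrence of each element,
      # preserving the original order of the list
      deduped = list(OrderedDict.fromkeys(tokens))
      print(deduped)  # ['fluid', 'cerebrospinal', 'sample']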

* Better reporting
  * Added chronological order to ``matched_components``
    * Report component matches in exact order they are made
  * Added chronological order to ``micro_status``
    * Report ``micro_status`` elements in exact order they are added
  * Report the exact tokens that treatments or match conditions apply to
    * e.g., 'Abbreviation-Acronym Treatment' is now
      'Abbreviation-Acronym Treatment: fluid'
    * e.g., 'A Direct Match' is now
      "{cerebrospinal fluid: ['A Direct Match']}"
  * All the above makes the matching process more transparent
  * Also sorted third-party final classifications

* Clean code
  * Removed unnecessary functions ``allPermutations`` and ``combi``
  * Updated stale comments

* Adjusted tests accordingly

* Removed some randomness from ``retainedPhrase``

* Usage of set to remove duplicates replaced with
  ``OrderedDict.fromkeys``

* Adjusted tests accordingly

* Add already-cached lookup and classification tables to package installation (#77)

* Empty commit to create WIP PR

* Add lookup and classification tables to vcs

* Modified ``--no-cache`` usage

* Only applies to online ontology resources now (see the sketch after
  this list)

  * No longer needed for ``lookup_table`` and (for now)
    ``classification_lookup_table``

    * Since they are now committed to vcs

  * Modified ``--help`` message for clarity

  * Modified tests to take advantage of this for speed-up
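
  A sketch of the resulting policy. ``fetch_ontology`` and the path
  handling are illustrative stand-ins rather than LexMapr's actual API;
  only the rule that ``--no-cache`` governs online ontology fetches,
  while the committed tables always load from the package, comes from
  this change::

      import json
      import os


      def fetch_ontology(iri):
          """Stand-in for the real online ontology fetch."""
          raise NotImplementedError


      def get_fetched_ontology(iri, cache_path, no_cache=False):
          """Fetch an online ontology resource, honouring --no-cache."""
          if not no_cache and os.path.exists(cache_path):
              with open(cache_path) as fp:  # reuse the cached copy
                  return json.load(fp)
          ontology = fetch_ontology(iri)  # --no-cache forces a refetch
          with open(cache_path, "w") as fp:
              json.dump(ontology, fp)
          return ontology


      def get_lookup_table(path):
          """Load a table committed to vcs; --no-cache does not apply."""
          with open(path) as fp:
              return json.load(fp)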

* Implement ``--profile`` command (#79)

* Moved cache logic to new ``pipeline_caching.py``

* Population of ``lookup_table``, ``ontology_lookup_table`` and
  ``classification_lookup_table`` moved to ``pipeline_caching.py``

  * New functions ``get_predefined_resources``,
    ``get_config_resources`` and ``get_classification_resources``
    respectively

  * Moved in following functions from ``pipeline_helpers.py``:

    * ``get_resource_dict``

    * ``create_lookup_table_skeleton``

    * ``add_predefined_resources_to_lookup_table``

    * ``get_resource_permutation_terms``

    * ``get_resource_bracketed_permutation_terms``

    * ``add_fetched_ontology_to_lookup_table``

  * Moved in ``add_classification_resources_to_lookup_table`` from
    ``pipeline_classification.py``

* Removed the last stale use of ``pkg_resources`` in favour of the path
  from ``lexmapr.definitions.ROOT``

* Improved in-code documentation in some places

* Modifies tests and imports accordingly

* Docstrings for new functions

* Stylistic improvements in ``bin/lexmapr``

* Change single quotes to double quotes

* Added blank lines to adhere to PEP 8

* Improve ``bin/lexmapr`` some more

* Remove unnecessary ``if-else`` conditions by cleaning up
  ``input_file`` and ``-v, --version`` arguments somewhat

* Renamed ``lexmapr.resources`` and ``lexmapr.cache``

* ``lexmapr.resources`` is now ``lexmapr.predefined_resources``

* ``lexmapr.cache`` is now ``lexmapr.resources``

* Once we started committing the lookup and classification tables to
  ``lexmapr.cache`` (and once we eventually commit profiles), it ceased
  to be a cache

* Renamed ``pipeline_caching.py`` to ``pipeline_resources.py``

* Updated comments accordingly

* Implement ``-p, --profile`` functionality

* User can specify profile they want to use via ``-p, --profile`` flag,
  which will contain a pre-defined set of command-line arguments and
  online ontology resources

  * Currently, only the ifsac profile is available

  * Profile arguments can be overridden by explicitly declaring other
    arguments (see the sketch after this list)

    * Except for ``--no-cache`` and ``--bucket``

      * Will figure this out in the future

* Two new functions carry most of the logic

  * ``lexmapr.pipeline_resources.get_profile_args``

  * ``lexmapr.pipeline_resources.get_profile_resources``
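
  A hedged sketch of the override behaviour in ``get_profile_args``. The
  profile file location and JSON layout are assumptions; only the
  precedence rule (explicit arguments beat profile defaults, except
  ``no_cache`` and ``bucket``) comes from this change::

      import json
      import os

      from lexmapr.definitions import ROOT  # package root path


      def get_profile_args(args):
          """Fill unspecified arguments with the profile's defaults."""
          # Hypothetical layout: each profile ships a JSON file of
          # pre-defined command-line argument values.
          profile_path = os.path.join(ROOT, "resources", "profiles",
                                      args.profile, "args.json")
          with open(profile_path) as fp:
              profile_defaults = json.load(fp)
          for arg, value in profile_defaults.items():
              # Explicitly declared arguments win, except no_cache and
              # bucket, which (for now) always come from the profile.
              if getattr(args, arg, None) is None or arg in ("no_cache",
                                                             "bucket"):
                  setattr(args, arg, value)
          return args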

* Other changes

  * Brought ``MANIFEST.in`` up to date
ivansg44 authored Oct 8, 2019
1 parent 3f65c62 commit e351d6a
Showing 73 changed files with 3,449 additions and 3,586 deletions.
6 changes: 2 additions & 4 deletions .gitignore
@@ -107,7 +107,5 @@ venv.bak/
 .idea
 
 # Cached resources
-lexmapr/cache/fetched_ontologies/
-lexmapr/cache/ontology_lookup_tables/
-lexmapr/cache/lookup_table.json
-lexmapr/cache/classification_lookup_table.json
+lexmapr/resources/fetched_ontologies/
+lexmapr/resources/ontology_lookup_tables/
4 changes: 3 additions & 1 deletion MANIFEST.in
@@ -1,2 +1,4 @@
 include lexmapr/resources/*
-include lexmapr/cache/*
+include lexmapr/resources/profiles/*
+include lexmapr/resources/profiles/ifsac/*
+include lexmapr/predefined_resources/*
47 changes: 22 additions & 25 deletions bin/lexmapr
@@ -12,11 +12,6 @@ logger = logging.getLogger("lexmapr")
 script_name = os.path.basename(os.path.realpath(sys.argv[0]))
 
 
-class InputFileNotFoundError(FileNotFoundError):
-    """Exception raised when input file does not exist."""
-    pass
-
-
 def valid_input_file(path):
     """Raises appropriate errors if input file is invalid.
@@ -31,30 +26,32 @@ def valid_input_file(path):
         raise argparse.ArgumentTypeError("Please supply a csv or tsv input file")
 
     if not os.path.exists(path):
-        raise InputFileNotFoundError(path + " not found")
+        raise FileNotFoundError(path + " not found")
 
     return path
 
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument('input_file', help='Input csv or tsv file', nargs='?',
-                        type=valid_input_file)
-    parser.add_argument('-o', '--output', nargs='?', help='Output file')
-    parser.add_argument('-f', '--format', default='basic', help='Output format')
-    parser.add_argument('--version', action='store_true', dest='version',
-                        help='Prints version information', required=False)
-    parser.add_argument('-c', '--config',
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
+    parser.add_argument("input_file", help="Input csv or tsv file", type=valid_input_file)
+    parser.add_argument("-o", "--output", nargs="?", help="Output file")
+    parser.add_argument("-f", "--format", default="basic", help="Output format")
+    parser.add_argument("-c", "--config",
                         help="Path to JSON file containing the IRI of ontologies to fetch terms "
                              "from"),
-    parser.add_argument('-b', '--bucket', action='store_true',
+    parser.add_argument("-b", "--bucket", action="store_true",
                         help="Classify samples into pre-defined buckets")
-    parser.add_argument('--no-cache', action='store_true',
-                        help="Ignore or replace cached resources, if there are any.")
+    parser.add_argument("--no-cache", action="store_true",
+                        help="Ignore or replace online cached resources, if there are any.")
+    parser.add_argument("-v", "--version", action="version",
+                        version="%(prog)s " + lexmapr.__version__)
+    parser.add_argument("-p", "--profile", choices=["ifsac"],
+                        help="Pre-defined sets of command-line arguments for specialized purposes:"
+                             "\n\n"
+                             "* ifsac: \n"
+                             "    * maps samples to food and environmental resources\n"
+                             "    * classifies samples into ifsac labels\n"
+                             "    * outputs content to ``ifsac_output.tsv``")
     args = parser.parse_args()
-
-    if args.version:
-        print(script_name + ' ' + lexmapr.__version__)
-    elif not args.input_file:
-        parser.error('Please supply an input file')
-    else:
-        lexmapr.pipeline.run(args)
+
+    lexmapr.pipeline.run(args)