Commit
* Various small bug fixes (#74)
  * Do not delimit more than once per row
  * Also removed InputFileNotFoundError
    * Let's just use FileNotFoundError
  * Also raise exception for unreachable code
  * Remove comments from test input files
  * Increase threshold for calling ``ngrams``
    * ``get_gram_chunks`` now calls ``ngrams`` if ``input`` has fewer than 15 tokens
      * As opposed to fewer than 7 tokens
* Clean up matching logic (#75)
  * Empty commit
    * To create WIP PR
  * Made full and non-full output more similar
    * Replaced ``Final_Refined_Terms_with_Resource_IDs`` in full format with ``Matched_Components``
    * Modified tests accordingly
  * Clean up ``pipeline.run``, and other things
    * ``pipeline.run``
      * Removed a lot of unnecessary variables
      * Output columns now stored in clearly named variables, and all columns are populated at the same time
      * Removed reporting of matches when there are no matches
      * General clean-up
    * Other things
      * Renamed ``punctuationTreatment`` to ``punctuation_treatment``
    * Modified tests accordingly
  * Increased robustness of full term matching
    * Now calls new function ``map_term``
      * Cleaner
      * Attempts suffix addition on synonyms as well, so more robust
  * Empty commit
    * To test modification of coveralls setting
  * Increased robustness of component matching
    * Component matching now relies on the more robust ``map_term``
    * Removed duplicates in ``micro_status``
    * Removed unnecessary variables
    * Fixed bug in ``_map_term_helper``
    * Modified tests accordingly
  * Cleaned up ``genomeTrackerMaster.csv``
  * Removed stale code
  * Docstrings
    * ``match_term``
    * ``_match_term_helper``
  * Modify matching logic
    * We want to consider suffixes last
    * Also modifies some code to make it adhere to PEP 8
* More transparent matching (#76)
  * Less randomness, better reporting and clean code
    * Less randomness
      * Replaced usages of set to remove duplicates from lists with usages of the ``OrderedDict.fromkeys`` method
        * Preserves original order of list
      * Did replace some lists with sets, when it did not have any effect on the order of anything important
        * For speed-up purposes
    * Better reporting
      * Added chronological order to ``matched_components``
        * Report component matches in exact order they are made
      * Added chronological order to ``micro_status``
        * Report ``micro_status`` elements in exact order they are added
      * Report the exact tokens that treatments or match conditions apply to
        * e.g., 'Abbreviation-Acronym Treatment' is now 'Abbreviation-Acronym Treatment: fluid'
        * e.g., 'A Direct Match' is now "{cerebrospinal fluid: ['A Direct Match']}"
      * All the above makes the matching process more transparent
      * Also sorted third-party final classifications
    * Clean code
      * Removed unnecessary functions ``allPermutations`` and ``combi``
      * Updated stale comments
    * Adjusted tests accordingly
  * Removed some randomness from ``retainedPhrase``
    * Usage of set to remove duplicates replaced with ``OrderedDict.fromkeys``
    * Adjusted tests accordingly
* Add already-cached lookup and classification tables to package installation (#77)
  * Empty commit to create WIP PR
  * Add lookup and classification tables to vcs
  * Modified ``--no-cache`` usage
    * Only applies to online ontology resources now
    * No longer needed for ``lookup_table`` and (for now) ``classification_lookup_table``
      * Since they are now committed to vcs
    * Modified ``--help`` message for clarity
  * Modified tests to take advantage of this for speed-up
* Implement ``--profile`` command (#79)
  * Moved cache logic to new ``pipeline_caching.py``
    * Population of ``lookup_table``, ``ontology_lookup_table`` and ``classification_lookup_table`` moved to ``pipeline_caching.py``
      * New functions ``get_predefined_resources``, ``get_config_resources`` and ``get_classification_resources`` respectively
    * Moved in following functions from ``pipeline_helpers.py``:
      * ``get_resource_dict``
      * ``create_lookup_table_skeleton``
      * ``add_predefined_resources_to_lookup_table``
      * ``get_resource_permutation_terms``
      * ``get_resource_bracketed_permutation_terms``
      * ``add_fetched_ontology_to_lookup_table``
    * Moved in ``add_classification_resources_to_lookup_table`` from ``pipeline_classification.py``
    * Removed last, stale use of ``pkg_resources`` as opposed to using path from ``lexmapr.definitions.ROOT``
    * Improved in-code documentation in some places
    * Modifies tests and imports accordingly
    * Docstrings for new functions
  * Stylistic improvements in ``bin/lexmapr``
    * Change single quotes to double quotes
    * Added new lines to make PEP 8 adherent
  * Improve ``bin/lexmapr`` some more
    * Remove unnecessary ``if-else`` conditions by cleaning up ``input_file`` and ``-v, --version`` arguments somewhat
  * Renamed ``lexmapr.resources``, ``lexmapr.cache``
    * ``lexmapr.resources`` is now ``lexmapr.predefined_resources``
    * ``lexmapr.cache`` is now ``lexmapr.resources``
      * Once we started committing the lookup and classification tables to ``lexmapr.cache`` (and when we eventually commit profiles), it ceased to be a cache
    * Renamed ``pipeline_caching.py`` to ``pipeline_resources.py``
    * Updated comments accordingly
  * Implement ``-p, --profile`` functionality
    * User can specify the profile they want to use via the ``-p, --profile`` flag, which will contain a pre-defined set of command-line arguments and online ontology resources
      * Currently, only ifsac profile available
    * Profile arguments can be overwritten by explicitly declaring other arguments
      * Except for ``--no-cache`` and ``--bucket``
        * Will figure this out in the future
    * Two new functions carry most of the logic
      * ``lexmapr.pipeline_resources.get_profile_args``
      * ``lexmapr.pipeline_resources.get_profile_resources``
  * Other changes
    * Brought ``MANIFEST.in`` up-to-date
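The switch from ``set`` to ``OrderedDict.fromkeys`` noted under #76 can be illustrated with a minimal sketch. It is not the LexMapr code: the helper name ``dedupe_preserving_order`` and the sample tokens are hypothetical, chosen only to show why the replacement removes run-to-run variation in ordering.

```python
from collections import OrderedDict


def dedupe_preserving_order(items):
    """Return items with duplicates removed, keeping first-seen order.

    Hypothetical helper for illustration; not part of LexMapr.
    """
    # OrderedDict.fromkeys keeps one key per distinct item, in the order
    # each item was first encountered.
    return list(OrderedDict.fromkeys(items))


# Hypothetical sample input containing repeated tokens.
tokens = ["fluid", "cerebrospinal", "fluid", "sample", "cerebrospinal"]

deduped = dedupe_preserving_order(tokens)
print(deduped)  # ['fluid', 'cerebrospinal', 'sample']

# A set removes the same duplicates, but its iteration order is not
# guaranteed to match the input order, so only membership is compared here.
assert set(tokens) == set(deduped)
```

Because the result depends only on the input order, output built from such lists stays stable between runs, which is presumably what makes the adjusted tests deterministic.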