Mega-Issue: Refactor Pipeline #308
-
This was completed in #278.
-
Great capture of a challenging part of this pipeline. I think lakes end up needing to be matched/crosswalked from four different types of sources: polygons, "Points" datasets, WQP sites, and "Other" datasets.

**For your problem 1:** I think we have a mix of things that currently work pretty well (the polygon process is decent, and the WQP crosswalks make sense), are overdone/laborious (e.g., creating a new "Points" xwalk for a dataset of ~15 lakes seems excessive, and like Julie is mentioning, these tiny crosswalks add up), or are ignored/missing (this is probably the case for many of the "Points" datasets and probably all of the "Other" datasets). Lastly, a single site with a known NHDHR ID shouldn't need a separate file with this mapping - hopefully we'd be able to support coding that directly in as a variable in the parser file.

**For your solution 1:** There seems to be a difference between ad hoc ids or information we'd be using to link the lake vs. actual IDs that the data contributor is using. If actual IDs are being used (LAGOS IDs, WBICs, and DOWs are common examples), I agree we'd be doing a better service by exporting these linkages in a crosswalk that is part of the dataset. As an aside, I'd suggest that the crosswalk include the GNIS name, the state, and the county of the lake (see #278 for an element of this). Sometimes a provider has a naming scheme that is unique but probably internal, and they don't have that many lakes, so exporting such a xwalk in a data release might be confusing and mostly empty. Lastly, some ways we're linking lakes may not need a persistent target/file with this matching (such as point in polygon, with the exception of WQP, which I think needs this matching because there can be so many unique sites within a lake). I'd like to think we can do this lower-cost matching on the fly when we're merging things in your problem 2 and avoid a persistent xwalk for those altogether?

**Revised suggestion for solution 2:** A larger crosswalk (which also includes state, county, and GNIS name) with the major contributors or semi-official IDs is exported in data releases. We have some decisions to make regarding what belongs in this export.

**For your problem 2:** Yes. I'd suggest breaking this problem out into two parts: one for the desired intermediate summary targets (what do those look like? at what stage(s) of the pipeline do they exist? how are they versioned and tracked?) and a second for the proposed overhaul of what `crosswalk_coop_dat` does. One way to handle the function is to modify the upstream data format to have each coop parser actually handle its own NHDHR matching, or have the function add a column or something to the output to specify which function or file the downstream target that does the matching should use (e.g., specifying a certain function name means the function will expect certain columns to exist, such as lat/lon, and specifying a crosswalk file means a simple join needs to be used; a rough sketch of this is below). Then the …
-
Lake-to-state xwalk is complete, see #278.

```r
xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
```
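For anyone wanting to poke at it, a quick sanity check might look like this; note that the `state` column name is a guess about the xwalk's schema, not confirmed:

```r
library(scipiper)
library(dplyr)

xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))

# count lakes per state (assumes a `state` column)
count(xwalk, state, sort = TRUE)
```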
-
Over the past few months, a number of small issues have cropped up while working in the pipeline; these have made debugging difficult and will ultimately result in code that is difficult to maintain.
This issue is a work in progress. It represents my current understanding of what we would like to refactor, based on a few meetings from 2022-01-04. Tagged assignees should edit this issue based on their current understanding of problems with the pipeline (or feel free to make corrections to my current understanding!). Each problem definition should include why a portion of the pipeline needs refactoring, notes on how to refactor (if you've given it some thought), and desired outcomes from the refactor (simplified code? an output table? etc.). The Lakes Team will then have a meeting to see if we are all on the same page and to discuss a timeline for completion. After that meeting, I will update/finalize this issue and start the refactoring process.
## Issues

### Refactor crosswalks from `1_crosswalk_fetch/in` that will be used in `7a_temp_coop_munge`

**Problem:** Using the current pipeline configuration, each unique dataset munged in `7a_temp_coop_munge` needs a crosswalk in `1_crosswalk_fetch/in`. This results in an ever-expanding number of arguments being passed into `crosswalk_coop_dat`. The ever-expanding list of arguments has two problems:

1. Every new dataset adds another crosswalk and another argument (e.g., add `site_id` and `Navico_ID` in nhdr crosswalk #257, Univ of Missouri's Bring in `UniversityofMissouri_*` data #269, and the citizen science Bull Shoals data set from PR Add Bull Shoals reservoir, MO #218).
2. While not a problem per se, it again adds repetition to `crosswalk_coop_dat` (i.e., does the citizen science data set from Bull Shoals really need its own crosswalk when it only contains one lake?).

Normalizing crosswalks across datasets may make it possible to cut down on arguments to `crosswalk_coop_dat`, minimize duplicate work, and produce a crosswalk that links cooperator data to `nhdhr` IDs, which is a known desired output.

**Key Outcome:** A single master crosswalk linking cooperator data to `nhdhr` id values that can be returned to cooperators as a data release.

**Potential Method:** `2_crosswalk_munge` creates multiple cooperator crosswalks: here and here. These individual cooperator crosswalks should be munged together into one crosswalk that can then be fed forward into `crosswalk_coop_dat`. This will generate a master cooperator crosswalk and also cut down on the amount of work that is being done in `crosswalk_coop_dat`. A rough sketch of that munge step follows.
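The sketch below assumes each individual crosswalk can be normalized to a shared schema; the file paths and column names (`site_id`, `site_id_nhdhr`) are placeholders, not the pipeline's actual names.

```r
library(dplyr)
library(purrr)

# Hypothetical munge: normalize each cooperator crosswalk to a shared schema,
# then stack them into a single master crosswalk target.
combine_coop_xwalks <- function(xwalk_files) {
  map_dfr(xwalk_files, function(f) {
    readRDS(f) %>%
      select(coop_id = site_id, nhdhr_id = site_id_nhdhr) %>%  # assumed columns
      mutate(source = basename(f))
  })
}

# master_xwalk <- combine_coop_xwalks(c(
#   '2_crosswalk_munge/out/navico_xwalk.rds',       # placeholder paths
#   '2_crosswalk_munge/out/missouri_xwalk.rds'
# ))
```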
### Understanding the results of `crosswalk_coop_dat` in `7a_temp_coop_munge` is difficult because there are no intermediate summary targets

**Problem:** Currently, understanding exactly what happens in `crosswalk_coop_dat` is difficult. Sometimes a data set contains new lakes, while other times a data set may be adding data for existing lakes. There are a few potential benefits to adding these intermediary summaries.

**Key Outcome:** Intermediate summary targets for `7a_temp_coop_munge/all_coop_dat_linked.feather`, summarized by `nhdhr` id.

**Potential Method:** `crosswalk_coop_dat` produces two tables internally that were critical in running down errors: `all_dat_coop_linked` (before removing `NA` values) and `dat_missing`. At a minimum, `dat_missing` should be tracked as a log file. It could also be useful to track a summary from `all_dat_coop_linked` that includes state ID, NHDHR ID, waterbody name, unique day count (as a surrogate for `n_profiles`), and record count; this summary should also be joined with the state crosswalk from Add a lakes to state xwalk #278. A sketch of both is below.
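What follows is only a sketch of what those two intermediate targets could look like; the column names (`site_id`, `waterbody_name`, `date`), the log file path, and the join key are assumptions about the data's shape, not the pipeline's actual structure.

```r
library(dplyr)
library(readr)

# Hypothetical log target: persist the rows crosswalk_coop_dat would drop
log_missing <- function(dat_missing,
                        out_file = '7a_temp_coop_munge/log/dat_missing.csv') {
  write_csv(dat_missing, out_file)
  out_file
}

# Hypothetical summary target built from all_dat_coop_linked and joined to
# the lake-to-state xwalk from #278
summarize_linked <- function(all_dat_coop_linked, state_xwalk) {
  all_dat_coop_linked %>%
    group_by(site_id, waterbody_name) %>%
    summarize(
      n_days    = n_distinct(date),   # unique day count, surrogate for n_profiles
      n_records = n(),
      .groups   = 'drop'
    ) %>%
    left_join(state_xwalk, by = 'site_id')  # adds state for each lake
}
```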
## Other Key Questions

## Key Areas of the Pipeline to Refactor