Mega-Issue: Refactor Pipeline #308
-
This was completed in #278.
-
Great capture of a challenging part of this pipeline. I think lakes end up needing to be matched/crosswalked from four different types of sources: polygons, "Points" datasets, WQP sites, and "Other" datasets.

**For your problem 1:** I think we have a mix of things that currently work pretty well (the polygon process is decent, and the WQP crosswalks make sense), are overdone/laborious (e.g., creating a new "Points" xwalk for a dataset of ~15 lakes seems excessive, and like Julie is mentioning, these tiny crosswalks add up), or are ignored/missing (this is probably the case for many of the "Points" datasets and probably all of the "Other" datasets). Lastly, a single site with a known NHDHR ID shouldn't need a separate file with this mapping - hopefully we'd be able to support coding that directly in as a variable in the parser file.

**For your solution 1:** There seems to be a difference between ad hoc ids or information we'd be using to link the lake vs. actual IDs that the data contributor is using. If actual IDs are being used (LAGOS IDs, WBICs, and DOWs are common examples), I agree we'd be doing a better service by exporting these linkages in a crosswalk that is part of the dataset. As an aside, I'd suggest that the crosswalk include the GNIS name, the state, and the county of the lake (see #278 for an element of this). Sometimes a provider has a naming scheme that is unique but probably internal, and they don't have that many lakes, so exporting such a xwalk in a data release might be confusing and mostly empty. Lastly, some ways we're linking lakes may not need a persistent target/file with this matching (such as point in polygon, with the exception of WQP, which I think needs this matching because there can be so many unique sites within a lake). I'd like to think we can do this lower-cost matching on the fly when we're merging things in your problem 2 and avoid a persistent xwalk for those altogether?

**Revised suggestion for solution 2:** A larger crosswalk (which also includes state, county, and GNIS name) with the major contributors or semi-official IDs is exported in data releases. We have some decisions to make regarding what belongs in this export.

**For your problem 2:** Yes. I'd suggest breaking this problem out into two parts: one for the desired intermediate summary targets (what do those look like? at what stage(s) of the pipeline do they exist? how are they versioned and tracked?) and a second for the proposed overhaul of what `crosswalk_coop_dat` does. One way to handle the function is to modify the upstream data format to have each coop parser actually handle its own NHDHR matching, or have the function add a column or something to the output to specify which function or file the downstream target that does the matching should use (e.g., specifying a certain function name means the function will expect certain columns to exist, such as lat/lon, and specifying a crosswalk file means a simple join needs to be used; a rough sketch of this is below). Then the …
-
Lake-to-state xwalk is complete, see #278.

```r
xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))
```
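For anyone wanting to poke at it, a quick sanity check might look like this; note that the `state` column name is a guess about the xwalk's schema, not confirmed:

```r
library(scipiper)
library(dplyr)

xwalk <- readRDS(sc_retrieve('2_crosswalk_munge/out/lake_to_state_xwalk.rds.ind'))

# count lakes per state (assumes a `state` column)
count(xwalk, state, sort = TRUE)
```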
-
Over the past few months, a number of small issues have cropped up while working in the pipeline; these have made debugging difficult and will ultimately result in code that is difficult to maintain.
This issue is a work in progress. It represents my current understanding of what we would like to refactor, based on a few meetings from 2022-01-04. Tagged assignees should edit this issue based on their current understanding of problems with the pipeline (or feel free to make corrections to my current understanding!). Each problem definition should include why a portion of the pipeline needs refactoring, notes on how to refactor (if you've given it some thought), and desired outcomes from the refactor (simplified code? an output table? etc.). The Lakes Team will then have a meeting to see if we are all on the same page and to discuss a timeline for completion. After that meeting, I will update/finalize this issue and start the refactoring process.
## Issues

### Refactor crosswalks from `1_crosswalk_fetch/in` that will be used in `7a_temp_coop_munge`

**Problem:** Using the current pipeline configuration, each unique dataset munged in `7a_temp_coop_munge` needs a crosswalk in `1_crosswalk_fetch/in`. This results in an ever-expanding number of arguments being passed into `crosswalk_coop_dat`. The ever-expanding list of arguments has two problems:

1. Every new dataset adds another crosswalk and another argument (e.g., add `site_id` and `Navico_ID` in nhdr crosswalk #257, Univ of Missouri's Bring in `UniversityofMissouri_*` data #269, and the citizen science Bull Shoals data set from PR Add Bull Shoals reservoir, MO #218).
2. While not a problem per se, it again adds repetition to `crosswalk_coop_dat` (i.e., does the citizen science data set from Bull Shoals really need its own crosswalk when it only contains one lake?).

Normalizing crosswalks across datasets may make it possible to cut down on arguments to `crosswalk_coop_dat`, minimize duplicate work, and produce a crosswalk that links cooperator data to `nhdhr` IDs, which is a known desired output.

**Key Outcome:** A single master crosswalk linking cooperator data to `nhdhr` id values that can be returned to cooperators as a data release.

**Potential Method:** `2_crosswalk_munge` creates multiple cooperator crosswalks: here and here. These individual cooperator crosswalks should be munged together into one crosswalk that can then be fed forward into `crosswalk_coop_dat`. This will generate a master cooperator crosswalk and also cut down on the amount of work that is being done in `crosswalk_coop_dat`. A rough sketch of that munge step follows.
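The sketch below assumes each individual crosswalk can be normalized to a shared schema; the file paths and column names (`site_id`, `site_id_nhdhr`) are placeholders, not the pipeline's actual names.

```r
library(dplyr)
library(purrr)

# Hypothetical munge: normalize each cooperator crosswalk to a shared schema,
# then stack them into a single master crosswalk target.
combine_coop_xwalks <- function(xwalk_files) {
  map_dfr(xwalk_files, function(f) {
    readRDS(f) %>%
      select(coop_id = site_id, nhdhr_id = site_id_nhdhr) %>%  # assumed columns
      mutate(source = basename(f))
  })
}

# master_xwalk <- combine_coop_xwalks(c(
#   '2_crosswalk_munge/out/navico_xwalk.rds',       # placeholder paths
#   '2_crosswalk_munge/out/missouri_xwalk.rds'
# ))
```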
### Understanding the results of `crosswalk_coop_dat` in `7a_temp_coop_munge` is difficult because there are no intermediate summary targets

**Problem:** Currently, understanding exactly what happens in `crosswalk_coop_dat` is difficult. Sometimes a data set contains new lakes, while other times a data set may be adding data for existing lakes. There are a few potential benefits to adding these intermediary summaries.

**Key Outcome:** Intermediate summary targets for `7a_temp_coop_munge/all_coop_dat_linked.feather`, summarized by `nhdhr` id.

**Potential Method:** `crosswalk_coop_dat` produces two tables internally that were critical in running down errors: `all_dat_coop_linked` (before removing `NA` values) and `dat_missing`. At a minimum, `dat_missing` should be tracked as a log file. It could also be useful to track a summary from `all_dat_coop_linked` that includes state ID, NHDHR ID, waterbody name, unique day count (as a surrogate for `n_profiles`), and record count; this summary should also be joined with the state crosswalk from Add a lakes to state xwalk #278. A sketch of both is below.
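What follows is only a sketch of what those two intermediate targets could look like; the column names (`site_id`, `waterbody_name`, `date`), the log file path, and the join key are assumptions about the data's shape, not the pipeline's actual structure.

```r
library(dplyr)
library(readr)

# Hypothetical log target: persist the rows crosswalk_coop_dat would drop
log_missing <- function(dat_missing,
                        out_file = '7a_temp_coop_munge/log/dat_missing.csv') {
  write_csv(dat_missing, out_file)
  out_file
}

# Hypothetical summary target built from all_dat_coop_linked and joined to
# the lake-to-state xwalk from #278
summarize_linked <- function(all_dat_coop_linked, state_xwalk) {
  all_dat_coop_linked %>%
    group_by(site_id, waterbody_name) %>%
    summarize(
      n_days    = n_distinct(date),   # unique day count, surrogate for n_profiles
      n_records = n(),
      .groups   = 'drop'
    ) %>%
    left_join(state_xwalk, by = 'site_id')  # adds state for each lake
}
```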
## Other Key Questions

## Key Areas of the Pipeline to Refactor