Review of `2wp-temp-observations` example pipeline #4

padilla410 · 2022-02-15T21:36:21Z

Related to Issue #3

The layout of this repo and many of the design decisions are similar to nawqa_wqp (reviewed in issue #4). It queries both WQP and NWIS, but this review only addresses the WQP pull. It's well organized and well commented. I was able to get most of the targets in the inventory section of 1_wqp_pull to build successfully, but I did not complete the data pull because I did not want to modify the files on Google Drive. After the inventory target task table completed, the pipeline asked to push to Google Drive despite my call to options(scipiper.dry_put = FALSE) (I also don't have access).

Workflow Overview

The WQP portion of the pipeline (1_wqp_pull) completes the following when querying WQP data:

Loads data configuration information that includes the following parameters:
- Defines the parameters of interest; for this pull, only characteristics in the water temperature family
- Sets the max number of sites in a pull to 1,000
- Sets the max pull size to 250,000 records
- Contains no information about date range or spatial extent in the cfg files; however, data is queried from all 50 states, some territories, and locations in Canada and Mexico

Creates an inventory

Uses the user-specified target wqp_pull_date to develop the initial partitions and create a data inventory by year group.

Pipeline output from the 2007 inventory target:

[ BUILD ] 2007_site_inventory                                     |  `2007_site_inventory`...
Retrieving whatWQPdata for the following time period 2007-01-01 : 2007-12-31
Retrieved 55178 rows of data in 125 seconds.

A peek at the head of the 2007 inventory target:

# A tibble: 5 x 9
  OrganizationIdentifier MonitoringLocat~ ResolvedMonitor~ StateName CountyName HUCEightDigitCo~ latitude longitude resultCount
  <chr>                  <chr>            <chr>            <chr>     <chr>      <chr>               <dbl>     <dbl>       <dbl>
1 USGS-AK                USGS-15011858    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.           7
2 USGS-AK                USGS-15011860    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.           2
3 USGS-AK                USGS-15011865    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          18
4 USGS-AK                USGS-15011870    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          25
5 USGS-AK                USGS-15011875    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          24

Pulls the data
- Partitions in this part of the pipeline are by year, station count (max: 1,000), and record count (max: 250,000)
- Note: because of the Google Drive issue, I was not able to get to the point in the pipeline that generates the wqp_pull_tasks.yml task table
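
Since I couldn't build the task table, here is a rough sketch of how I understand the partitioning rule: group sites within a year until either the site limit or the record limit would be exceeded. This is written in Python purely for illustration (the pipeline itself is R/scipiper), and `Partition` and `partition_inventory` are hypothetical names, not functions from the repo. The greedy grouping is my assumption about the approach, not a description of the actual code.

```python
from dataclasses import dataclass, field

MAX_SITES = 1_000      # max sites per pull, per the cfg described above
MAX_RECORDS = 250_000  # max records per pull, per the cfg described above


@dataclass
class Partition:
    year: int
    sites: list = field(default_factory=list)
    n_records: int = 0


def partition_inventory(inventory, max_sites=MAX_SITES, max_records=MAX_RECORDS):
    """Greedily group (year, site_id, result_count) rows into pull partitions.

    A new partition starts when the year changes or when adding the next
    site would exceed either limit. A single site with more records than
    max_records still gets a partition of its own, because this sketch
    does not split a site across pulls.
    """
    partitions = []
    current = None
    for year, site_id, result_count in sorted(inventory):
        needs_new = (
            current is None
            or current.year != year
            or len(current.sites) + 1 > max_sites
            or current.n_records + result_count > max_records
        )
        if needs_new:
            current = Partition(year=year)
            partitions.append(current)
        current.sites.append(site_id)
        current.n_records += result_count
    return partitions


# The five rows from the 2007 inventory peek above, with a tiny record
# limit so the splitting is visible.
rows = [
    (2007, "USGS-15011858", 7),
    (2007, "USGS-15011860", 2),
    (2007, "USGS-15011865", 18),
    (2007, "USGS-15011870", 25),
    (2007, "USGS-15011875", 24),
]
for p in partition_inventory(rows, max_records=30):
    print(p.year, len(p.sites), p.n_records)
```

With the real limits (1,000 sites, 250,000 records) all five of these sites would land in a single 2007 partition.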

Missing pieces, places to improve, and general comments:

I took a quick pass through the issues in this repo and Issue #42 contains a discussion on why the final data pulled does not match the inventory. I generally like checks such as this and I think it could be useful to build in a comparison that reports on any discrepancies between "what we expected to get" (the inventory) and "what we actually got" (pull). This could be a good "gut check" for problems.
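
To make the "gut check" idea concrete, here is a minimal sketch of the comparison I have in mind, again in Python only for illustration. `compare_inventory_to_pull` is a hypothetical helper, and the pulled counts in the example are invented to show the output shape; they are not results from this pipeline.

```python
def compare_inventory_to_pull(expected, actual):
    """Report sites whose pulled record count differs from the inventory.

    expected, actual: dicts mapping site id -> record count.
    Returns (site_id, expected_count, actual_count) tuples sorted by site id.
    A site missing from one side is reported with a count of 0 on that side.
    """
    discrepancies = []
    for site_id in sorted(set(expected) | set(actual)):
        n_expected = expected.get(site_id, 0)
        n_actual = actual.get(site_id, 0)
        if n_expected != n_actual:
            discrepancies.append((site_id, n_expected, n_actual))
    return discrepancies


# Inventory counts taken from the 2007 peek above.
inventory_counts = {"USGS-15011858": 7, "USGS-15011860": 2, "USGS-15011865": 18}
# Hypothetical pull results, made up for this example.
pulled_counts = {"USGS-15011858": 7, "USGS-15011860": 1}

for site_id, n_expected, n_actual in compare_inventory_to_pull(inventory_counts, pulled_counts):
    print(f"{site_id}: expected {n_expected}, got {n_actual}")
```

Something like this could run as a final target and emit a warning, rather than an error, when the list is not empty.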

Building each target includes a record count and the amount of time it took to build; I love this. Here is an example:

[ BUILD ] 2007_site_inventory                                     |  `2007_site_inventory`...
Retrieving whatWQPdata for the following time period 2007-01-01 : 2007-12-31
Retrieved 55178 rows of data in 125 seconds.
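
For anyone wanting the same kind of message in another pipeline, the pattern is roughly a wrapper that times the retrieval and reports the row count. This is a Python sketch under my own assumptions; `timed_retrieval` and its arguments are hypothetical and are not how the repo produces the message above.

```python
import time


def timed_retrieval(label, fetch):
    """Call fetch(), then report how many rows came back and how long it took.

    label: text describing the query (for example, a time period).
    fetch: zero-argument callable returning a sized collection of rows.
    Returns (rows, message) so the message can also be written to a log.
    """
    start = time.perf_counter()
    rows = fetch()
    elapsed = time.perf_counter() - start
    message = f"Retrieved {len(rows)} rows of data for {label} in {elapsed:.0f} seconds."
    print(message)
    return rows, message


# Stand-in for a real query; a list of three placeholder rows.
rows, message = timed_retrieval("2007-01-01 : 2007-12-31", lambda: ["r1", "r2", "r3"])
```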

Consistencies across both repos

- Data pulls start with an inventory and then pull the data
- These repos use a "shared cache" method for data storage (keeping scipiper ind files local and pushing the actual data to Google Drive)

padilla410 · 2022-02-15T21:36:48Z

I'll be talking to Sam tomorrow about picking a "magic number" for the size of the WQP query.

DOI-USGS locked and limited conversation to collaborators Feb 22, 2022

padilla410 converted this issue into discussion #9 Feb 22, 2022

lekoenig added this to the First draft targets pipeline milestone Apr 15, 2022
