This issue was moved to a discussion.

Review of 2wp-temp-observations example pipeline #4

Closed
padilla410 opened this issue Feb 15, 2022 · 1 comment

@padilla410
Contributor

Related to Issue #3

The layout of this repo and many of the design decisions are similar to nawqa_wqp (reviewed in issue #4). It queries both WQP and NWIS, but this review only addresses the WQP pull. The repo is well organized and well commented. I was able to get most of the targets in the inventory section of 1_wqp_pull to build successfully, but I did not complete the data pull because I did not want to modify the files on Google Drive. After the inventory target task table completed, the pipeline asked to push to Google Drive despite my call to options(scipiper.dry_put = FALSE) (I also don't have access).

Workflow Overview

The WQP portion of the pipeline (1_wqp_pull) completes the following when querying WQP data:

  • Loads data configuration information that includes the following parameters:

    • Defines parameters of interest - for this pull it's only characteristics in the water temperature family
    • Sets the max number of sites in a pull to 1,000
    • Sets the max pull size to 250,000 records
    • Includes no information about dates or spatial extent in the cfg files; however, data are queried from all 50 states, some territories, and locations in Canada and Mexico.
  • Creates an inventory

    • Uses the user-specified target wqp_pull_date to develop the initial partitions and create a data inventory by year group.

    Pipeline output from the 2007 inventory target:

    [ BUILD ] 2007_site_inventory                                     |  `2007_site_inventory`...
    Retrieving whatWQPdata for the following time period 2007-01-01 : 2007-12-31
    Retrieved 55178 rows of data in 125 seconds.

    A peek at the head of the 2007 inventory target:

    # A tibble: 5 x 9
      OrganizationIdentifier MonitoringLocat~ ResolvedMonitor~ StateName CountyName HUCEightDigitCo~ latitude longitude resultCount
      <chr>                  <chr>            <chr>            <chr>     <chr>      <chr>               <dbl>     <dbl>       <dbl>
    1 USGS-AK                USGS-15011858    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.           7
    2 USGS-AK                USGS-15011860    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.           2
    3 USGS-AK                USGS-15011865    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          18
    4 USGS-AK                USGS-15011870    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          25
    5 USGS-AK                USGS-15011875    Stream           Alaska    Ketchikan~ 19010105             55.4     -130.          24
  • Pulls the data

    • Partitions in this part of the pipeline are by year, station count (max: 1,000) and data (max: 250,000)
    • Note - because of the Google Drive issue, I was not able to get to the point of the pipeline that generates the wqp_pull_tasks.yml task table
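
To make the partitioning limits above concrete, here is a minimal sketch of one way to group inventoried sites so that no pull exceeds the site or record cap. It is written in Python purely for illustration; the pipeline itself is R, and I did not get far enough to see its actual partitioning code. The function name `partition_sites` and the greedy strategy are my assumptions, not the repo's implementation.

```python
MAX_SITES = 1_000       # max sites per pull, from the cfg described above
MAX_RECORDS = 250_000   # max records per pull, from the cfg described above


def partition_sites(inventory, max_sites=MAX_SITES, max_records=MAX_RECORDS):
    """Greedily group (site_id, result_count) pairs into pull partitions.

    Hypothetical helper: a partition is closed as soon as adding the next
    site would exceed either cap. A single site with more records than
    max_records still gets a partition of its own.
    """
    partitions = []
    current, current_records = [], 0
    for site_id, n_records in inventory:
        too_many_sites = len(current) >= max_sites
        too_many_records = bool(current) and current_records + n_records > max_records
        if too_many_sites or too_many_records:
            partitions.append(current)
            current, current_records = [], 0
        current.append(site_id)
        current_records += n_records
    if current:
        partitions.append(current)
    return partitions


# Site IDs and result counts taken from the 2007 inventory peek above;
# the caps are shrunk so the splitting is visible on five rows.
sample = [
    ("USGS-15011858", 7),
    ("USGS-15011860", 2),
    ("USGS-15011865", 18),
    ("USGS-15011870", 25),
    ("USGS-15011875", 24),
]
for group in partition_sites(sample, max_sites=10, max_records=30):
    print(group)
```

A greedy pass like this keeps the order of the inventory and is easy to reason about, at the cost of not producing the fewest possible partitions.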

Missing pieces, places to improve, and general comments:

  • I took a quick pass through the issues in this repo and Issue #42 contains a discussion on why the final data pulled does not match the inventory. I generally like checks such as this and I think it could be useful to build in a comparison that reports on any discrepancies between "what we expected to get" (the inventory) and "what we actually got" (pull). This could be a good "gut check" for problems.
  • Building each target includes a record count and the amount of time it took to build; I love this. Here is an example:
    [ BUILD ] 2007_site_inventory                                     |  `2007_site_inventory`...
    Retrieving whatWQPdata for the following time period 2007-01-01 : 2007-12-31
    Retrieved 55178 rows of data in 125 seconds.
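
The "expected vs. actual" check suggested in the first bullet could look something like the sketch below. It is Python for illustration only (the pipeline is R), the function name `compare_inventory_to_pull` is hypothetical, and the "pulled" counts in the usage example are made up to show the output shape; they are not real results.

```python
def compare_inventory_to_pull(expected, actual):
    """Return {site_id: (expected_count, actual_count)} for mismatched sites.

    Both arguments map a site ID to a record count. Sites missing from
    one side are treated as having a count of 0 on that side.
    """
    mismatches = {}
    for site_id in sorted(set(expected) | set(actual)):
        n_expected = expected.get(site_id, 0)
        n_actual = actual.get(site_id, 0)
        if n_expected != n_actual:
            mismatches[site_id] = (n_expected, n_actual)
    return mismatches


# Expected counts come from the 2007 inventory peek in this review;
# the pulled counts are invented for the example.
inventory_counts = {"USGS-15011858": 7, "USGS-15011860": 2, "USGS-15011865": 18}
pulled_counts = {"USGS-15011858": 7, "USGS-15011860": 1}

for site, (n_exp, n_act) in compare_inventory_to_pull(inventory_counts, pulled_counts).items():
    print(f"{site}: expected {n_exp}, got {n_act}")
```

Reporting the mismatches rather than failing the build seems like the right default here, since Issue #42 suggests some discrepancy is expected.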

Consistencies across both repos

  • Data pulls start with an inventory and then pull the data
  • These repos use a "shared cache" method for data storage (keeping scipiper ind files local and pushing the actual data to google drive)
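
For readers unfamiliar with the shared-cache idea, the sketch below shows the general pattern: a small indicator file that records a hash of the data file stays in the repo, while the data file itself lives elsewhere. This is a Python illustration of the concept only. It is not scipiper's API, and the `.ind` contents and helper names here (`write_indicator`, `is_current`) are my assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def write_indicator(data_path):
    """Write '<data_path>.ind' holding the data file's hash; return its path."""
    ind_path = Path(str(data_path) + ".ind")
    ind_path.write_text(f"hash: {file_md5(data_path)}\n")
    return ind_path


def is_current(data_path):
    """True if the indicator exists and its hash matches the data file."""
    ind_path = Path(str(data_path) + ".ind")
    if not ind_path.exists():
        return False
    recorded = ind_path.read_text().strip().removeprefix("hash: ")
    return recorded == file_md5(data_path)


# Usage: the indicator goes stale as soon as the data file changes.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "wqp_data.csv"   # hypothetical file name
    data.write_text("site,temp\nUSGS-15011858,4.2\n")
    write_indicator(data)
    print(is_current(data))   # indicator matches the data
    data.write_text("site,temp\nUSGS-15011858,5.0\n")
    print(is_current(data))   # data changed, indicator is stale
```

The small indicator file is cheap to version and share, which is what lets collaborators agree on the state of the data without each of them re-running the pull.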
@padilla410
Contributor Author

I'll be talking to Sam tomorrow about picking a "magic number" for the size of the WQP query.

@DOI-USGS DOI-USGS locked and limited conversation to collaborators Feb 22, 2022
@padilla410 padilla410 converted this issue into discussion #9 Feb 22, 2022
@lekoenig lekoenig added this to the First draft targets pipeline milestone Apr 15, 2022
