
Erroneous reshaping related to expected gradients and RGCN data #13

Closed
jdiaz4302 opened this issue Jun 8, 2022 · 5 comments

@jdiaz4302
Contributor

This issue is related to a mismatch between how RGCN input data is organized and how the expected gradients code reshapes the data for sampling. The issue occurs most prominently here: https://github.com/USGS-R/xai-workflows/blob/main/utils/xai_utils.py#L39, but it also appears in less modularized notebooks.

The recent discussion in river-dl (USGS-R/river-dl#202) is worth referencing for how RGCN input data is currently organized.

A current example input for the RGCN has shape [10010, 365, 16], representing [22 years * 455 segments, 365-day sequences, 16 features]. The reshape operation referenced above creates a [22, 455, 365, 16] shape, but that first dimension is actually organized in batches of segments, not years, so the result is badly scrambled, incorrect data.
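For concreteness, here is a minimal sketch of the mismatch, assuming (purely for illustration) that the flat first axis is segment-major, i.e. all years for segment 0 come first, then all years for segment 1, and so on; the toy dimensions and names below are not from the actual xai_utils code:

```python
import numpy as np

# Toy dimensions standing in for [22 years, 455 segments, 365 days, 16 features]
n_years, n_segs, seq_len, n_feats = 3, 4, 2, 1

# Hypothetical flat input whose first axis is segment-major:
# rows 0..2 hold the 3 "years" for segment 0, rows 3..5 for segment 1, etc.
flat = np.arange(n_segs * n_years).repeat(seq_len * n_feats).reshape(
    n_segs * n_years, seq_len, n_feats
)

# Current-style reshape assumes the first axis is year-major:
wrong = flat.reshape(n_years, n_segs, seq_len, n_feats)

# Reshape matching the assumed segment-major ordering, then move years first:
right = flat.reshape(n_segs, n_years, seq_len, n_feats).transpose(1, 0, 2, 3)

print(wrong[:, 0, 0, 0])  # rows from several different segments get mixed together
print(right[:, 0, 0, 0])  # the 3 year-rows of segment 0 only: [0 1 2]
```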

Here are some plots showing the average air temperature time series for one segment across the different years. The top plot uses the current reshaping, giving the impression that average air temperature at one segment is essentially random across years, while the bottom plot, with the proposed reshaping, shows the average air temperature at one segment being very predictable across years.

[Figure: RGCN_Input_Reshaping — one segment's average air temperature across years under the current (top) vs. proposed (bottom) reshaping]

This does tie back into the commented question of whether we want to be sampling random segments or random years, but either way we should correct the code/data structure so that whichever we decide on is clear and intentional. Right now the code states that it will use a random year's data for that segment as the sample/baseline value for that iteration, but in reality it is almost certainly using a random segment (and year?).
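As a hedged illustration of the two options, assuming the data has already been arranged as [years, segments, seq_len, features] (toy-sized array and placeholder names, not the project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x_set = np.zeros((22, 455, 10, 3))  # toy stand-in for [years, segments, seq_len, features]

# Option A: sample one random year, keeping the whole spatial domain together.
year_idx = rng.integers(x_set.shape[0])
baseline_year = x_set[year_idx]                    # shape [455, 10, 3]

# Option B: sample a random year independently for every segment, which can
# pair e.g. the coldest year at one reach with the hottest year at the next.
years_per_seg = rng.integers(x_set.shape[0], size=x_set.shape[1])
baseline_mixed = x_set[years_per_seg, np.arange(x_set.shape[1])]  # shape [455, 10, 3]
```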

@SimonTopp
Contributor

Good catch @jdiaz4302. A couple thoughts, with no strong opinion as to the "right way".

  • If we are sampling random years, but the spatial relationships of the segments remain the same, then there will never be any feature attribution to the static attributes like slope, width, etc., because there is never a difference between the target input (x in the EG calculation) and the baseline (see the sketch after this list).
  • With the above said, I do think that for any type of space/time aware model, a "training sample" should really be an entire sequence across the entire spatial domain, which leads me to say that we should be sampling years, not segments.
  • In terms of how this relates to "Reaches loose spatial relationships when shuffle=True during training in PyTorch" (USGS-R/river-dl#202), mistakes like this make me think it's worth having an input shape that's more intuitive than the current [years*segments, seq length, features].
  • Definitely need to double-check the reshape operation in some of my other work now too 👀.
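As a rough sketch of why static attributes drop out in that case (the expected-gradients attribution for a feature is scaled by the difference between the input and the baseline; the function and names below are illustrative, not the project's code):

```python
import numpy as np

def eg_attribution(x, baseline, grad_fn, rng, n_samples=50):
    """Rough Monte Carlo expected-gradients estimate for one input/baseline pair.

    grad_fn(point) should return the model's gradient w.r.t. its input at `point`.
    """
    total = np.zeros_like(x)
    for _ in range(n_samples):
        alpha = rng.uniform()                      # random point along the path
        point = baseline + alpha * (x - baseline)  # interpolate baseline -> input
        total += (x - baseline) * grad_fn(point)   # gradient scaled by the difference
    return total / n_samples

# If a static attribute (slope, width, ...) is identical in x and the baseline,
# its (x - baseline) term is zero, so its attribution is exactly zero regardless
# of what the gradient says.
```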

What's your gut feeling re: sampling years versus segments?

@jdiaz4302
Contributor Author

I agree with this point:

> With the above said, I do think that for any type of space/time aware model, a "training sample" should really be an entire sequence across the entire spatial domain, which leads me to say that we should be sampling years, not segments.

I also think this helps with sampling realistic baselines, because right now we could potentially sample the coldest year for segment 1 and the hottest year for segment 2 directly downstream. The baseline would then be some unrealistic situation where two adjacent segments appear to come from different climates. My initial avoidance of this was because I was developing on the data version from river-dl that had 22 years of 365-day sequences, and therefore only 22 samples that way; sampling individual years per segment allows many more unique combinations. Using smaller sequence lengths will help with that number of samples (and the XAI work says we don't need such long sequence lengths).

@SimonTopp
Contributor

> Using smaller sequence lengths will help with that number of samples (and the XAI work says we don't need such long sequence lengths)

I totally agree, with the caveat that the DRB is a rain-dominated system, so we don't need a 365-day sequence length there; that might be different in snow-dominated systems. The coarse hypertuning that I've done for sequence length resulted in the best performance with 180 days, but it was only marginally better than shorter sequence lengths of 90 and 60 (it could easily have been stochasticity in the runs). Training time does increase significantly with smaller sequence lengths, though, which is something to keep in mind.

@jdiaz4302
Contributor Author

Yeah, good points. Also, I suppose that because of how the calculation samples a random point along the difference between the input and baseline, we do effectively have more than just 22 samples (i.e., regarding this point):

> 22 years with 365-day sequences therefore only 22 samples

I'll rerun my notebook where I was fixing this sampling issue and maybe include those changes together.

@SimonTopp
Contributor

Update: The reshaping issue was due to an outdated prepped data frame. Re-doing it with a more recent output from the river-dl prep_all_data function produced the expected patterns with the existing reshape code. The discussion about maintaining space/time relationships in baseline sampling was very fruitful, though!
