
Erroneous reshaping related to expected gradients and RGCN data #13

Closed
jdiaz4302 opened this issue Jun 8, 2022 · 5 comments

@jdiaz4302
Contributor

This issue is related to a mismatch between how RGCN input data is organized and how the expected gradients code reshapes the data for sampling. The issue occurs most prominently here: https://github.com/USGS-R/xai-workflows/blob/main/utils/xai_utils.py#L39, but it also appears in less modularized notebooks.

The recent discussion in river-dl (USGS-R/river-dl#202) is worth referencing for how RGCN input data is currently organized.

A current example input for the RGCN has shape [10010, 365, 16], representing [22 years * 455 segments, 365-day sequences, 16 features]. The reshape operation referenced above creates a [22, 455, 365, 16] shape, but that first dimension is actually organized in batches of segments, not years, so the result is badly scrambled, incorrect data.
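For concreteness, here is a minimal sketch of the mismatch, assuming (purely for illustration) that the flat first axis is segment-major, i.e. all years for segment 0 come first, then all years for segment 1, and so on; the toy dimensions and names below are not from the actual xai_utils code:

```python
import numpy as np

# Toy dimensions standing in for [22 years, 455 segments, 365 days, 16 features]
n_years, n_segs, seq_len, n_feats = 3, 4, 2, 1

# Hypothetical flat input whose first axis is segment-major:
# rows 0..2 hold the 3 "years" for segment 0, rows 3..5 for segment 1, etc.
flat = np.arange(n_segs * n_years).repeat(seq_len * n_feats).reshape(
    n_segs * n_years, seq_len, n_feats
)

# Current-style reshape assumes the first axis is year-major:
wrong = flat.reshape(n_years, n_segs, seq_len, n_feats)

# Reshape matching the assumed segment-major ordering, then move years first:
right = flat.reshape(n_segs, n_years, seq_len, n_feats).transpose(1, 0, 2, 3)

print(wrong[:, 0, 0, 0])  # rows from several different segments get mixed together
print(right[:, 0, 0, 0])  # the 3 year-rows of segment 0 only: [0 1 2]
```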

Here are some plots showing the average air temperature time series for one segment across the different years. The top plot uses the current reshaping, giving the impression that average air temperature at one segment is essentially random across years, while the bottom plot, with the proposed reshaping, shows the average air temperature at one segment being very predictable across years.

[Figure: RGCN_Input_Reshaping — one segment's average air temperature across years under the current (top) vs. proposed (bottom) reshaping]

This does tie back into the commented question of whether we want to be sampling random segments or random years, but either way we should correct the code/data structure so that whichever we decide on is clear and intentional. Right now the code states that it will use a random year's data for that segment as the sample/baseline value for that iteration, but in reality it is almost certainly using a random segment (and year?).
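As a hedged illustration of the two options, assuming the data has already been arranged as [years, segments, seq_len, features] (toy-sized array and placeholder names, not the project's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x_set = np.zeros((22, 455, 10, 3))  # toy stand-in for [years, segments, seq_len, features]

# Option A: sample one random year, keeping the whole spatial domain together.
year_idx = rng.integers(x_set.shape[0])
baseline_year = x_set[year_idx]                    # shape [455, 10, 3]

# Option B: sample a random year independently for every segment, which can
# pair e.g. the coldest year at one reach with the hottest year at the next.
years_per_seg = rng.integers(x_set.shape[0], size=x_set.shape[1])
baseline_mixed = x_set[years_per_seg, np.arange(x_set.shape[1])]  # shape [455, 10, 3]
```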

@SimonTopp
Contributor

Good catch @jdiaz4302. A couple thoughts, with no strong opinion as to the "right way".

  • If we are sampling random years, but the spatial relationships of the segments remain the same, then there will never be any feature attribution to the static attributes like slope, width, etc., because there is never a difference between the target input (x in the EG calculation) and the baseline (see the sketch after this list).
  • With the above said, I do think that for any type of space/time aware model, a "training sample" should really be an entire sequence across the entire spatial domain, which leads me to say that we should be sampling years, not segments.
  • In terms of how this relates to "Reaches loose spatial relationships when shuffle=True during training in PyTorch" (USGS-R/river-dl#202), mistakes like this make me think it's worth having an input shape that's more intuitive than the current [years*segments, seq length, features].
  • Definitely need to double-check the reshape operation in some of my other work now too 👀.
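As a rough sketch of why static attributes drop out in that case (the expected-gradients attribution for a feature is scaled by the difference between the input and the baseline; the function and names below are illustrative, not the project's code):

```python
import numpy as np

def eg_attribution(x, baseline, grad_fn, rng, n_samples=50):
    """Rough Monte Carlo expected-gradients estimate for one input/baseline pair.

    grad_fn(point) should return the model's gradient w.r.t. its input at `point`.
    """
    total = np.zeros_like(x)
    for _ in range(n_samples):
        alpha = rng.uniform()                      # random point along the path
        point = baseline + alpha * (x - baseline)  # interpolate baseline -> input
        total += (x - baseline) * grad_fn(point)   # gradient scaled by the difference
    return total / n_samples

# If a static attribute (slope, width, ...) is identical in x and the baseline,
# its (x - baseline) term is zero, so its attribution is exactly zero regardless
# of what the gradient says.
```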

What's your gut feeling re: sampling years versus segments?

@jdiaz4302
Contributor Author

I agree with this point:

> With the above said, I do think that for any type of space/time aware model, a "training sample" should really be an entire sequence across the entire spatial domain, which leads me to say that we should be sampling years, not segments.

I also think this helps with sampling realistic baselines, because right now we could potentially sample the coldest year for segment 1 and the hottest year for segment 2 directly downstream. The baseline would then be some unrealistic situation where two adjacent segments appear to come from different climates. My initial avoidance of this was because I was developing on the data version from river-dl that had 22 years of 365-day sequences, and therefore only 22 samples that way; sampling individual years per segment allows many more unique combinations. Using smaller sequence lengths will help with that number of samples (and the XAI work says we don't need such long sequence lengths).

@SimonTopp
Contributor

> Using smaller sequence lengths will help with that number of samples (and the XAI work says we don't need such long sequence lengths)

I totally agree, with the caveat that the DRB is a rain-dominated system, so we don't need a 365-day sequence length there; that might be different in snow-dominated systems. The coarse hypertuning that I've done for sequence length resulted in the best performance with 180 days, but it was only marginally better than shorter sequence lengths of 90 and 60 (it could easily have been stochasticity in the runs). Training time does increase significantly with smaller sequence lengths, though, which is something to keep in mind.

@jdiaz4302
Contributor Author

Yeah, good points. Also, I suppose that because of how the calculation samples a random point along the difference between the input and baseline, we do effectively have more than just 22 samples (i.e., regarding this point):

> 22 years with 365-day sequences therefore only 22 samples

I'll rerun my notebook where I was fixing this sampling issue and maybe include those changes together.

@SimonTopp
Contributor

Update: The reshaping issue was due to an outdated prepped data frame. Re-doing it with a more recent output from the river-dl prep_all_data function produced the expected patterns with the existing reshape code. The discussion about maintaining space/time relationships in baseline sampling was very fruitful, though!
