Erroneous reshaping related to expected gradients and RGCN data #13
Comments
Good catch @jdiaz4302. A couple thoughts, with no strong opinion as to the "right way".
What's your gut feeling re: sampling years versus segments?
I agree with this point
I also think this helps with sampling realistic baselines, because right now we could potentially sample the coldest year for segment 1 and the hottest year for segment 2 directly downstream. Then the baseline is some unrealistic situation where two adjacent segments appear to be from different climates. My initial avoidance of this was because I was developing on the data version from
I totally agree, with the caveat that in the DRB, which is a rain-dominated system, we don't need a 365-day sequence length. That might be different in snow-dominated systems. The coarse hypertuning that I've done for sequence length resulted in the best performance with 180 days, but it was only marginally better than shorter sequence lengths of 90 and 60 (which could easily have been stochasticity in the runs). Training time does increase significantly with smaller sequence lengths, though, which is something to keep in mind.
Yeah, good points. Also, I suppose because of how the calculation samples a random point along the difference, we do have more than just 22 samples (i.e., regarding this point).
I'll rerun my notebook where I was fixing this sampling issue and maybe include those changes together.
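For context on the "random point along the difference" remark above: expected gradients interpolates between a sampled baseline and the input at a random `alpha`, so each baseline draw yields a different sample. A minimal sketch of just that interpolation step (function name hypothetical; the gradient computation is omitted):

```python
import numpy as np

rng = np.random.default_rng(42)

def expected_gradients_sample(x, baseline):
    # One interpolation sample: a random point on the straight line
    # between the baseline and the input, with alpha ~ Uniform(0, 1).
    alpha = rng.uniform()
    return baseline + alpha * (x - baseline)

x = np.ones(3)
b = np.zeros(3)
s = expected_gradients_sample(x, b)  # every entry lies in [0, 1]
```

Because `alpha` is re-drawn each iteration, even 22 candidate baseline years produce many distinct interpolation points.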
Update: the reshaping issue was due to an outdated prepped data frame; re-doing it with a more recent output from river-dl resolved it.
This issue is related to a mismatch between how RGCN input data is organized and how expected gradients reshapes the data for sampling. The issue most prominently occurs here https://github.com/USGS-R/xai-workflows/blob/main/utils/xai_utils.py#L39 but also occurs in less modularized notebooks.
The recent discussion in river-dl is worth referencing for how RGCN input data is currently organized: USGS-R/river-dl#202.

A current example input for the RGCN has shape `[10010, 365, 16]`, which represents `[22 years * 455 segments, 365-day sequence length, 16 features]`. The reshape operation referenced above creates a `[22, 455, 365, 16]` shape, but that first dimension is actually organized in batches of segments, not years, so this results in very scrambled, incorrect data.

Here are some plots attempting to show the average air temperature time series for one segment across the different years. The top plot uses the current reshaping, giving the impression that average air temperature at one segment is essentially random across years, while the bottom plot, with the proposed reshaping, shows the average air temperature at one segment being very predictable across years.
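The mismatch can be shown with a scaled-down NumPy sketch: 3 segments, 2 years, 4-day sequences, 1 feature, with rows grouped by segment as described above (values encode `segment * 10 + year` so misplacement is obvious; all sizes are toy stand-ins):

```python
import numpy as np

n_seg, n_yr, seq_len, n_feat = 3, 2, 4, 1

# Build data the way the prepped file is organized: grouped by segment,
# i.e. segment 0's years come first, then segment 1's, and so on.
rows = []
for seg in range(n_seg):
    for yr in range(n_yr):
        rows.append(np.full((seq_len, n_feat), seg * 10 + yr))
x = np.stack(rows)                    # shape (n_seg * n_yr, seq_len, n_feat)

# Incorrect: assumes the first axis is grouped by year.
bad = x.reshape(n_yr, n_seg, seq_len, n_feat)
print(bad[0, 1, 0, 0])                # 1 -> segment 0, year 1 (scrambled)

# Correct: the first axis is grouped by segment.
good = x.reshape(n_seg, n_yr, seq_len, n_feat)
print(good[1, 0, 0, 0])               # 10 -> segment 1, year 0, as intended

# If a (years, segments, ...) layout is needed downstream, transpose:
by_year = good.transpose(1, 0, 2, 3)  # shape (n_yr, n_seg, seq_len, n_feat)
```

The key point is that `reshape` only reinterprets the existing row order; with segment-major data, putting years first in the reshape silently pairs the wrong rows with the wrong (year, segment) labels.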
This does tie back into the commented question of
do we want to be sampling random segs or random years?
but nevertheless we should correct the code/data structure so that whichever we decide is clear and intentional. Right now, the code states that it will use a random year's data for that segment as the sample/baseline value for that iteration, but in reality it is almost certainly using a random segment (and year?).
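If we do settle on sampling random years, one way to keep adjacent segments climatologically consistent is to draw a single year index and take every segment's trace from that same year. A sketch assuming the corrected `[segments, years, seq_len, features]` layout (all array sizes and names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the corrected layout: [segments, years, seq_len, features]
n_seg, n_yr, seq_len, n_feat = 5, 22, 365, 2
x = rng.normal(size=(n_seg, n_yr, seq_len, n_feat))

# Draw ONE year and use it for every segment, so the baseline is a single,
# internally consistent climate year rather than a per-segment mix of years.
year = rng.integers(n_yr)
baseline = x[:, year]          # shape (n_seg, seq_len, n_feat)
```

Sampling a fresh `year` per iteration still gives a distribution of baselines, while avoiding the coldest-year/hottest-year adjacency problem described earlier in the thread.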