This repository has been archived by the owner on Jun 2, 2023. It is now read-only.

146 simplify training #148

Merged
merged 22 commits from 146-simplify-training into USGS-R:main on Jan 24, 2022

Conversation

Collaborator

@jsadler2 jsadler2 commented Dec 6, 2021

**Warning: This is a breaking PR**

This moves the definition of the TensorFlow model out of train.py. If agreed upon, this means you will no longer be able to pass a string as a model_type argument (e.g., 'rgcn') to the train_model function. You would instead pass a model object (river_dl.RGCN.RGCNModel).

Pros
The main pro is that train.py is much lighter and more flexible. This is mainly because we don't have to handle all of the possible model_type arguments passed as strings via if/else statements. For example, there are now no gw-specific pieces in train.py.

We also leave the definition of "pretraining" and "finetuning" out of train.py. This means you could have any number of training phases with different data/epochs/loss functions, etc. It also means you can define your model anywhere (e.g., my_awesome_model.py), import it into whatever file you are using to call train_model, instantiate it, and pass the object into train_model.

Tradeoff
The tradeoff is that the model has to be defined and compiled with its loss and optimizer somewhere else, so the burden falls more on the individual projects (e.g., in the Snakefile). I edited the Snakefile to show what that would look like there.
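To make this concrete, here is a minimal sketch of the proposed calling pattern (the train_model arguments, config values, and .npz keys shown are illustrative, not the exact Snakefile code in this PR):

```python
# Minimal sketch, assuming the RGCNModel signature used elsewhere in river_dl;
# the .npz keys and train_model arguments are illustrative.
import numpy as np
import tensorflow as tf
from river_dl.RGCN import RGCNModel
from river_dl.train import train_model

data = np.load("prepped.npz")  # output of the data-prep rule (illustrative path)

# The calling code (e.g., the Snakefile) now owns model definition and compilation...
model = RGCNModel(20, A=data["dist_matrix"], num_tasks=1)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss=tf.keras.losses.MeanSquaredError(),
)

# ...and train_model just takes the compiled model object instead of a model_type string.
train_model(
    model,
    x_trn=data["x_trn"],
    y_trn=data["y_obs_trn"],
    epochs=100,
    batch_size=2,
)
```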


Summary
This PR is intended to make train.py project-agnostic; whatever calls train.py (e.g., the Snakefile) then carries the burden of project-specific setup and model definitions.

closes #146, #118

Contributor

@SimonTopp SimonTopp left a comment

Overall I think this is much cleaner and makes some great progress towards having a more modular, generic workflow. The thing that stands out most to me is that since this is more modular, it puts the onus on the user to configure the Snakefile rather than just setting some arguments in the config.yml and hitting go. The only problem with that is that the Snakefile and config.yml change regularly depending on who last did a PR (i.e., they aren't really canonical in the repository). What do you think about adding some language to that effect in the readme, and then maybe adding a folder with some example Snakefile/config pairs for specific use cases?

# Pretrain the model on process based model
rule pre_train:
    input:
        "{outdir}/prepped.npz"
    output:
        directory("{outdir}/pretrained_weights/"),
        touch("{outdir}/pretrained_weights/pretrain.done")
Contributor

What happens here if you don't want to pre-train? This touch call just made it so you could set pretraining to zero but not break the pipeline. Are you thinking that that's a use-case scenario where folks should just write their own Snakefile that doesn't include pre-training? This relates to a larger discussion I've had with Janet where we've chatted about creating a handful of example Snakefiles (e.g. baseline run, running replicates, with and without pre-training, etc.) and explicitly stating that the config.yml and Snakefile in the repo aren't canonical and should only be used as reference. What do you think?

Collaborator Author

What I was thinking is that if someone didn't want to pretrain, they would just nix that from their Snakefile. So I think that what you are saying about having example Snakefile/config.yml files instead of canonical ones is exactly what I was thinking.

I'm thinking that those can maybe go in their own directory. I'll adjust that.

params:
    # getting the base path to put the training outputs in
    # I omit the last slash (hence '[:-1]') so the split works properly
    run_dir=lambda wildcards, output: os.path.split(output[0][:-1])[0],
    weight_dir=lambda wildcards, output: os.path.split(output[0][:-1])[0],
Contributor

Not new to this PR, but since the wildcard {outdir} gets defined in the inputs/outputs, can't you just pass wildcards.outdir to the function rather than creating the parameter?
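A hypothetical sketch of that suggestion (the rule name and body here are illustrative, not the PR's actual rule):

```
# Hypothetical sketch: use the wildcard directly instead of re-deriving the
# base path from the output with os.path.split(output[0][:-1]).
rule pre_train:
    input:
        "{outdir}/prepped.npz"
    output:
        directory("{outdir}/pretrained_weights/")
    run:
        # wildcards.outdir is already the base output path
        run_dir = wildcards.outdir
        weight_dir = output[0]
```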

hidden_size=config['hidden_size'], io_data=input[1],
partition=wildcards.partition, outfile=output[0],
num_tasks=len(config['y_vars_finetune']),
weight_dir = input[0] + '/'
Contributor

I think this should probably remain an option in config.yml (whether you want to use the final fine-tune weights or the early stopping weights). Seems like an easy thing to overlook in the Snakefile.

Collaborator

I agree that keeping this in the config file is a good idea

Collaborator Author

I think that's a good idea too. I'll make sure that's in one of the examples.

num_tasks=len(config['y_vars_finetune']),
weight_dir = input[0] + '/'
model.load_weights(weight_dir)
predict_from_io_data(model=model,
Contributor

So much cleaner just to pass a compiled model into here!!!!!

y_val_obs = np.concatenate(
    [io_data["y_obs_val"], io_data["GW_val_reshape"], air_val], axis=2
)
# Run the finetuning within the training engine on CPU for the GW loss function
Contributor

Am I missing something, or do you need to pass the use_cpu argument here?

Collaborator Author

I think I just missed this. Thanks


# Choose whether to use the final weights from the end of training ('trained_weights') or the weights from the best
# validation epoch ('best_val_weights')
pred_weights: 'best_val_weights'
Contributor

Again, maybe this isn't the best way to do it, but I think if you define early stopping in the config then it makes sense to be able to point to the early stopping vs finetune weights in the config.

Comment on lines 125 to 123
dates = np.reshape(dates, [dates.shape[0] * dates.shape[1], dates.shape[2]])
ids = np.reshape(ids, [ids.shape[0] * ids.shape[1], ids.shape[2]])
df_preds = pd.DataFrame(data_array, columns=col_names)
Contributor

Again, not specific to this PR, nor am I sure if it's more or less elegant, but in my adapted version of this I just use x.flatten() for dates, ids, and preds. It just makes it agnostic to the shape of the inputs.
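A quick illustration of the difference (toy arrays, not the repo's code):

```python
# Toy example: flatten() gives the same values as the explicit reshape for
# these single-column (n_seq, seq_len, 1) arrays, without hard-coding shapes.
import numpy as np

dates = np.arange(6).reshape(2, 3, 1)   # e.g., 2 reaches x 3 timesteps x 1

via_reshape = np.reshape(dates, [dates.shape[0] * dates.shape[1], dates.shape[2]])
via_flatten = dates.flatten()

assert np.array_equal(via_reshape.squeeze(), via_flatten)
```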

Collaborator Author

Oh. Wow. That's so much simpler! Great idea.


# Initialize our model within the training engine
engine = trainer(model, optimizer, loss_func, weights)
print(best_val_weight_dir)
Contributor

Might want to make this print statement a little more informative.

Collaborator Author

haha 😆. that was actually for debugging. glad you caught that.

Suggested change
print(best_val_weight_dir)

if weight_dir:
    model.save_weights(weight_dir)

# Save alternate weight file that saves the best validation weights
Contributor

Move this comment up to where you define the early stopping log directory.

Collaborator Author

jsadler2 commented Dec 7, 2021

Thanks for the review and thoughts, @SimonTopp. I think you hit the nail on the head here:

it puts the onus on the user to configure the Snakefile rather than just setting some arguments in the config.yml and hitting go.

This change definitely puts more of the onus on the modeler/user. And I think it's still an open question whether that is the direction we want to go.

For fun I just sketched out where along the "flexibility - helpfulness" spectrum I see this PR.

[photo attachment: hand-drawn sketch of the flexibility vs. helpfulness spectrum]

I think the upside of flexibility we gain is pretty nice. True, we are doing less of the work for a given user/modeler. That said, I think we sometimes need the ability to make custom models, and not having to figure out how to plug them into a rigid (but helpful once plugged in) code base can be pretty freeing. And knowing how to instantiate a model object is pretty powerful and actually not too hard.

The one thing that gives me pause is needing to know better how to manipulate the Snakefile. And maybe that's where your and Janet's idea comes in:

we've chatted about creating a handful of example Snakefiles (e.g. baseline run, running replicates, with and without pre-training, etc.) and explicitly stating that the config.yml and Snakefile in the repo aren't canonical and should only be used as reference.

This could be the happy medium where we provide a more flexible tool but also show people how to use it for their own application.

Snakefile_gw Outdated
out_dir = config['out_dir']
code_dir = config['code_dir']
pred_weights = config['pred_weights']
out_dir = config['out_dir'] + "_gw"
Collaborator

Is there a reason for adding the gw tag to the output directory here? I might rather keep the directory naming in the config file since I'm already reviewing that before every run.

Collaborator Author

Good point. Much better in the config file

Snakefile_gw Outdated
out_file=output[0],
reach_file= config['reach_attr_file'])
out_file=output[0])
#reach_file= config['reach_attr_file'])
Collaborator

Why is the reach file coming out? We use it in gw_utils to flag the reaches that are known to be in / downstream of reservoirs so we don't try to calculate the annual temperature signal properties on those reaches.

Collaborator Author

This and a lot of the changes I made were to make the testing of code changes self-contained. In the river_dl/tests/test_data/ directory we don't have a reach attributes file, but now that you say this, I think it makes sense to have one so that we can test that functionality.

Contributor

At the risk of sounding like a dumb dumb, river_dl/tests/ has always been somewhat of an enigma to me. In looking at it now it seems very helpful and like I should probably use it more ;).

Collaborator Author

Haha. My intent was for that directory to hold some test data and scripts so we'd have a standard way of testing our code. I probably haven't used it as much as I should either :).

Snakefile_gw Outdated
pred_weights = config['pred_weights']
out_dir = config['out_dir'] + "_gw"
#code_dir = config['code_dir']
#pred_weights = config['pred_weights']
Collaborator

flagging this so it can be adjusted to match the pred_weights in the main Snakefile (if that's edited to keep the option for training / best validation weights in the config file)

output:
    directory("{outdir}/trained_weights/"),
    directory("{outdir}/finetune_weights/"),
    directory("{outdir}/best_val_weights/"),
Collaborator

flagging this so it can be adjusted to match the pred_weights in the main Snakefile (if that's edited to keep the option for training / best validation weights in the config file)

Comment on lines +121 to +133
temp_air_index = np.where(io_data['x_vars'] == 'seg_tave_air')[0]
air_unscaled = io_data['x_trn'][:, :, temp_air_index] * io_data['x_std'][temp_air_index] + \
               io_data['x_mean'][temp_air_index]
y_trn_obs = np.concatenate(
    [io_data["y_obs_trn"], io_data["GW_trn_reshape"], air_unscaled], axis=2
)
air_val = io_data['x_val'][:, :, temp_air_index] * io_data['x_std'][temp_air_index] + io_data['x_mean'][
    temp_air_index]
y_val_obs = np.concatenate(
    [io_data["y_obs_val"], io_data["GW_val_reshape"], air_val], axis=2
Collaborator

should "io_data" be changed to data here? (per line 120)

Collaborator Author

Ah. Yes indeed. Good catch.

x_trn = data['x_pre_full'],
y_trn = data['y_pre_full'],
epochs = config['pt_epochs'],
batch_size = 2,
Collaborator

is this (batch_size=2) correct?

Collaborator Author

This is because I was using the data in river_dl/tests/test_data which only has two sites

x_trn = data['x_trn'],
y_trn = y_trn_obs,
epochs = config['pt_epochs'],
batch_size = 2,
Collaborator

is batch_size=2 correct? (same comment as on the Snakefile)

Collaborator Author

This is because I was using the data in river_dl/tests/test_data which only has two sites

@@ -1,18 +1,13 @@
# Input files
obs_file: "data_DRB/Obs_temp_flow_drb_full_no3558"
sntemp_file: "data_DRB/sntemp_inputs_outputs_drb_full_no3558"
dist_matrix_file: "data_DRB/distance_matrix_drb_full_no3558.npz"
Collaborator

Is the distance matrix taken out because this example config/Snakefile uses an LSTM rather than the RGCN?

Collaborator Author

Yes. But since we are doing different examples, I'll add this back in.

@@ -79,54 +80,89 @@ rule prep_io_data:
# """


model = LSTMModel(
Collaborator

Is anyone using this repo with an LSTM model? If not, would it make sense to have an RGCN as the example since that is what's being used?

Collaborator Author

I am using an LSTM for the DO project

@jsadler2
Collaborator Author

@janetrbarclay mentioned that it would be ideal (paraphrasing here) if we could add the new functionality without losing the existing functionality. I think that is wise, and I think there is a pretty straightforward way to do it. I will revise this PR accordingly.

Collaborator Author

jsadler2 commented Jan 4, 2022

@janetrbarclay and @SimonTopp - Thank you both for your review comments. I think the biggest change is just to shift our thinking about the Snakefile/config.yml files from "this is the way to use river-dl" to "this is an example of how one could use river-dl, but you will likely have to modify it for your own purposes." I will make some edits to this PR to address that and your comments.

@jsadler2 jsadler2 force-pushed the 146-simplify-training branch from 2628249 to 6157859 on January 20, 2022 21:38
@jsadler2
Collaborator Author

@janetrbarclay and @SimonTopp - this is ready for another look. I summarize the major changes as:

  1. the training and prediction functions now take a compiled TensorFlow model
  2. the Snakefiles and config files are now in their own directory (workflow_examples/); that directory also has a readme describing the different examples
  3. the asRunConfig function now takes the code directory as an argument (since we are no longer assuming the Snakefile one is using is located in the root river-dl directory)

Contributor

@SimonTopp SimonTopp left a comment

@jsadler2, this was a huge lift and super well done!!! I think it manages to make the repository more modular while maintaining its value on the "helpfulness" dimension. All my comments are pretty minor, but let me know if you want to talk about any of them. I think it's ready to go after a couple of small changes.

river_dl/postproc_utils.py (resolved)
Comment on lines -39 to -63
def load_model_from_weights(
    model_type, model_weights_dir, hidden_size, dist_matrix=None, num_tasks=1,
):
    """
    load a TF model from the model weights directory
    :param model_type: [str] model to use either 'rgcn', 'lstm', or 'gru'
    :param model_weights_dir: [str] directory to saved model weights
    :param hidden_size: [int] the number of hidden units in model
    :param dist_matrix: [np array] the distance matrix if using 'rgcn'
    :param num_tasks: [int] number of tasks (variables_to_log to be predicted)
    :return: TF model
    """
    if model_type == "rgcn":
        model = RGCNModel(hidden_size, A=dist_matrix, num_tasks=num_tasks)
    elif model_type.startswith("lstm"):
        model = LSTMModel(hidden_size, num_tasks=num_tasks)
    elif model_type == "gru":
        model = GRUModel(hidden_size, num_tasks=num_tasks)
    else:
        raise ValueError(
            f'model_type must be "lstm", "gru" or "rgcn", (not {model_type})'
        )

    model.load_weights(model_weights_dir)
    return model
Contributor

Don't have a strong opinion here, but do you think it'd keep the Snakefile cleaner if we changed this to something like compile_model and moved it to one of the utils files? That way in the Snakefile you could compile the model and optionally load weights in one line rather than a handful of lines. Probably only a marginal gain and potentially makes the workflow more opaque. Thoughts?
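For reference, a hypothetical sketch of what such a helper might look like (this is not an existing river_dl function; the name, arguments, and defaults are made up):

```python
# Hypothetical compile_model helper (not implemented in river_dl): compile a
# model and optionally load saved weights in one call from the Snakefile.
import tensorflow as tf

def compile_model(model, optimizer=None, loss=None, weights_dir=None):
    """Compile `model` and, if given, load weights from `weights_dir`."""
    model.compile(
        optimizer=optimizer or tf.keras.optimizers.Adam(learning_rate=0.01),
        loss=loss or tf.keras.losses.MeanSquaredError(),
    )
    if weights_dir:
        model.load_weights(weights_dir)
    return model
```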

Collaborator Author

I'd rather not have a compile_model function. I think that we'd be back to needing to maintain something that will be hard to actually maintain.

river_dl/tests/generate_test_data.py (resolved)
river_dl/train.py (outdated, resolved)
river_dl/train.py (outdated, resolved)
workflow_examples/Snakefile_rgcn.smk (resolved)
workflow_examples/Snakefile_rgcn.smk (outdated, resolved)
Comment on lines +23 to +34
train_start_date:
- '2003-09-15'
train_end_date:
- '2005-09-14'
val_start_date:
- '2005-09-14'
val_end_date:
- '2006-09-14'
test_start_date:
- '1980-10-01'
test_end_date:
- '1985-09-30'
Contributor

Somewhere (maybe in the readme), we should explicitly state the baseline run conditions we've all agreed upon across projects (partition years, segments)

Collaborator Author

Yeah. Good idea.

Comment on lines +2 to +5
obs_file: "../river_dl/tests/test_data/obs_temp_flow"
sntemp_file: "../river_dl/tests/test_data/test_data"
dist_matrix_file: "../river_dl/tests/test_data/test_dist_matrix.npz"
code_dir: ".."
Contributor

I think there are two ways we could think of these examples.

  1. They are loose guidelines and we state explicitly in the Readme what aspects of them will likely change for individual runs (dates, data files, input vars), or
  2. We make them as "out-of-the-box" as possible, meaning they have the most common input files and run conditions so users literally don't have to change anything.

Thoughts? Maybe some combination of the two is possible as well.

Collaborator Author

I much prefer 1 (loose guidelines).

workflow_examples/readme.md (resolved)
@jsadler2 jsadler2 merged commit dd3b84c into USGS-R:main Jan 24, 2022
@jsadler2 jsadler2 deleted the 146-simplify-training branch January 24, 2022 17:16

Successfully merging this pull request may close these issues.

Simplify training routine
3 participants