
Training waveglow model for 16kHz #215

Open
fatihkiralioglu opened this issue Jul 3, 2020 · 15 comments


@fatihkiralioglu

fatihkiralioglu commented Jul 3, 2020

Hi,
I'm trying to train 16kHz models for both WaveGlow and Tacotron 2.
For the 16kHz Tacotron model I used win_length=800 and hop_length=200, and it produced good results with the 22kHz pretrained WaveGlow model. To get better results, I want to train a 16kHz WaveGlow model.
I assume the same values, 800 and 200, should be used for WaveGlow training.
If I use these new parameters instead of 1024 and 256, can I still warm-start from the pretrained 22kHz WaveGlow model? I have reservations because the pretrained 22kHz WaveGlow model was trained with win_length=1024 and hop_length=256.
Thanks.
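As a quick sanity check on those numbers (my own arithmetic, not from the repo): 800/200 at 16kHz keeps roughly the same window and hop durations in milliseconds as 1024/256 at 22.05kHz, which may be why the 22k vocoder still copes with 16k mels.

```python
# Compare STFT settings by their durations in milliseconds rather than samples.
def stft_durations_ms(win_length, hop_length, sampling_rate):
    """Return (window_ms, hop_ms) for a given STFT setting."""
    return (1000.0 * win_length / sampling_rate,
            1000.0 * hop_length / sampling_rate)

print(stft_durations_ms(1024, 256, 22050))  # ~46.4 ms window, ~11.6 ms hop
print(stft_durations_ms(800, 200, 16000))   # 50.0 ms window, 12.5 ms hop
```

The durations are close but not identical, so the two settings are comparable, not interchangeable.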

@ashish-roopan

Someone please answer this question. I trained the model after loading the pretrained weights, but after 14k steps the audio is full of noise.

@mychiux413

I got the same issue.

  • I used waveglow_256channels_universal_v5.pt as the pretrained model.
  • My training data was LJSpeech + VCTK resampled to 16kHz, with silence trimmed.
  • The v5 model should have been trained with a mel spec of:
"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 8000.0
  • My mel spec was:
"sampling_rate": 16000,
"filter_length": 768,
"hop_length": 192,
"win_length": 768,
"mel_fmin": 0.0,
"mel_fmax": 8000.0
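For what it's worth, the 768/192 values look like the 22kHz settings scaled by the sampling-rate ratio (16000/22050 ≈ 0.726) and rounded to a multiple of 64. A hypothetical helper illustrating that relationship (my own guess, not from the repo):

```python
# Hypothetical helper: scale 22.05kHz STFT parameters to a new sampling rate,
# rounding to a convenient multiple (768 and 192 are both multiples of 64).
def scale_stft_params(filter_length, hop_length, sr_from, sr_to, multiple=64):
    ratio = sr_to / sr_from
    def round_to_multiple(x):
        return multiple * round(x * ratio / multiple)
    return round_to_multiple(filter_length), round_to_multiple(hop_length)

print(scale_stft_params(1024, 256, 22050, 16000))  # (768, 192)
```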

Before training, I used the v5 (22k pretrained) model to run inference on my mel specs, and the speech was still audible (even for male speakers' specs), though of course the pitch was shifted down since I chose a 16kHz output frame rate.

After fine-tuning from the pretrained model, the loss quickly dropped to about -5.0 within a few steps and hovered around -5.5 over my 25k steps, but all audio inferred from the 25k-step checkpoint was full of noise (almost no speech).

And of course, when I trained without the pretrained model, the loss dropped very slowly, and the inference results were also full of noise.

@mychiux413

Maybe we could try modifying the code as in #88, then try again.

@ashish-roopan

So after training from the pretrained model for 25k steps, you are still getting noisy output?
I faced the same issue; the output I got from inference with waveglow_256channels_universal_v5.pt was at least audible.
I also got a similar loss, around -6.

@ashish-roopan

#88 may work

@mychiux413

After applying #88, training 16kHz from the pretrained model is no longer possible, because WaveGlow.upsample depends on win_length/hop_length.
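To see why, here is a plain-arithmetic sketch (no torch; the kernel/stride values are my reading of the repo, where the upsampler is a ConvTranspose1d with kernel_size 1024 and stride 256, which #88 replaces with win_length/hop_length):

```python
# A ConvTranspose1d upsampler maps n_frames mel frames to
# (n_frames - 1) * stride + kernel_size samples (before any trimming).
def upsampled_length(n_frames, kernel_size, stride):
    return (n_frames - 1) * stride + kernel_size

n_frames = 80  # a 16000-sample segment at hop_length=200
print(n_frames * 200)                         # 16000 samples of audio to match
print(upsampled_length(n_frames, 1024, 256))  # 21248 with the 22k kernel
print(upsampled_length(n_frames, 800, 200))   # 16600 with win/hop = 800/200
```

Changing kernel_size/stride to 800/200 also changes the shape of the pretrained upsample weight tensor, so the checkpoint can no longer be loaded directly.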

@ashish-roopan

Yes, I faced the same issue, so I trained the model from scratch. After 100k steps, the audio quality is not improving much.
The generated audio has audible speech but some noise. Do you know how many steps are required to get results similar to the official model?

@ashish-roopan

Have you tried #99? Can we train 16kHz with the pretrained model using that code?

@HiiamCong

HiiamCong commented Aug 3, 2020

Hi, I currently have a problem with 16kHz WaveGlow training.
My Tacotron 2 model is fine (tested with the pretrained WaveGlow model). I'm trying to train WaveGlow from scratch.
I used the WaveGlow code on the master branch with the config.json below:

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 12,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_files.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 800,
        "hop_length": 200,
        "win_length": 800,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

I have trained for 236k steps and every output audio is silent. Hope you guys can shed some light :(
Output audio: https://drive.google.com/drive/folders/1hqVHOVoZISP3-BxvJG8n3MCfG6LGF0te?usp=sharing
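For what it's worth, the segment/grouping arithmetic in that config checks out. A quick sketch (my reading of the repo's channel bookkeeping, so treat the details as assumptions rather than a quote of the code):

```python
# Segment/grouping sanity checks for the config above.
segment_length, hop_length, n_group = 16000, 200, 8
assert segment_length % hop_length == 0  # 80 mel frames per segment
assert segment_length % n_group == 0     # audio folds evenly into groups of 8

# Early outputs: every n_early_every flows (after the first), n_early_size
# channels are emitted, so some channels must remain for the final flows.
def remaining_channels(n_group, n_flows, n_early_every, n_early_size):
    n = n_group
    for k in range(n_flows):
        if k % n_early_every == 0 and k > 0:
            n -= n_early_size
    return n

print(remaining_channels(8, 12, 4, 2))  # 4 channels remain after early outputs
```

Since the shapes are consistent, the silence is probably not a shape/divisibility problem in the config itself.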

@STASYA00

STASYA00 commented Sep 6, 2020

Did anyone manage to solve this issue? I'm also training on a 16kHz dataset. To check the model, I trained it on just 12 samples (1 batch) with different parameters, starting from the pretrained model. The first one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 800,
"hop_length": 200,
"win_length": 800,

"learning_rate": 1e-5

After 500 epochs the loss starts to increase, and all the inferences (500, 1000, ..., 5000) give only noise in the output.
The second one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,

"learning_rate": 1e-5

This gives audible speech after 500 epochs, but there's a lot of noise and it's too fast.

The questions are: why does the loss increase? Why does the quality stay the same on the training set, not improving even though the samples have been seen many times? And how can the noise be removed and the audio speed normalized?

@xDuck

xDuck commented Sep 29, 2020

Was anyone able to figure this out? I also tried training 16kHz from scratch and had the same experience as @mychiux413.

@adrianastan

You can find a model trained from scratch on 21 hours of multispeaker 16kHz data (544000 training steps) here: http://adrianastan.com/models/ . Not as good as the NVIDIA release, but it does the job.

The config is as follows:

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints_swara",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 8,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_SWARA.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },

    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

Perhaps you can warmstart your model from it.
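If your STFT settings differ from the checkpoint's, one option is a partial warm start that copies only the tensors whose shapes match and lets the rest (notably the upsample kernel) train from scratch. A minimal sketch of the filtering logic, with hypothetical parameter names; shapes are shown as plain tuples, whereas with PyTorch you would compare tensor.shape on the real state_dicts:

```python
# Keep only pretrained entries whose name exists in the new model with the
# same shape; everything else is left to train from random initialization.
def matching_params(pretrained, model):
    return {name: value for name, value in pretrained.items()
            if model.get(name) == value}

pretrained = {"upsample.weight": (80, 80, 1024),         # 22kHz kernel
              "WN.0.in_layers.0.weight": (512, 256, 3)}  # unaffected by hop
model      = {"upsample.weight": (80, 80, 800),          # 16kHz kernel
              "WN.0.in_layers.0.weight": (512, 256, 3)}
print(matching_params(pretrained, model))  # only the WN layer survives
```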

@xprilion

xprilion commented Apr 19, 2021

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

@naba89

naba89 commented Apr 21, 2021

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

Can you also share your config, please?

@Merlin-721

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

I get the following exception when loading the model:
No module named 'waveglow'
