
Training waveglow model for 16kHz #215

Open
fatihkiralioglu opened this issue Jul 3, 2020 · 15 comments


@fatihkiralioglu

fatihkiralioglu commented Jul 3, 2020

Hi,
I'm trying to train 16kHz models for both WaveGlow and Tacotron 2.
For the 16kHz Tacotron model I used win_length=800 and hop_length=200, and it produced good results with the 22kHz pretrained WaveGlow model. To get better results, I want to train a 16kHz WaveGlow model.
I assume the same values, 800 and 200, should be used for WaveGlow training.
If I use these new parameters instead of 1024 and 256, can I still warm-start from the pretrained 22kHz WaveGlow model? I have reservations because the pretrained 22kHz WaveGlow model was trained with win_length=1024 and hop_length=256.
Thanks.
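As a quick sanity check on those numbers (my own arithmetic, not from the repo): 800/200 at 16kHz keeps roughly the same window and hop durations in milliseconds as 1024/256 at 22.05kHz, which may be why the 22k vocoder still copes with 16k mels.

```python
# Compare STFT settings by their durations in milliseconds rather than samples.
def stft_durations_ms(win_length, hop_length, sampling_rate):
    """Return (window_ms, hop_ms) for a given STFT setting."""
    return (1000.0 * win_length / sampling_rate,
            1000.0 * hop_length / sampling_rate)

print(stft_durations_ms(1024, 256, 22050))  # ~46.4 ms window, ~11.6 ms hop
print(stft_durations_ms(800, 200, 16000))   # 50.0 ms window, 12.5 ms hop
```

The durations are close but not identical, so the two settings are comparable, not interchangeable.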

@ashish-roopan

Someone please answer this question. I trained the model after loading the pretrained weights, but after 14k steps the audio is full of noise.

@mychiux413

I got the same issue.

  • I used waveglow_256channels_universal_v5.pt as the pretrained model.
  • My training data was LJSpeech + VCTK resampled to 16kHz, with silence trimmed.
  • The v5 model should have been trained with a mel spec of:
"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"mel_fmin": 0.0,
"mel_fmax": 8000.0
  • My mel spec was:
"sampling_rate": 16000,
"filter_length": 768,
"hop_length": 192,
"win_length": 768,
"mel_fmin": 0.0,
"mel_fmax": 8000.0
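For what it's worth, the 768/192 values look like the 22kHz settings scaled by the sampling-rate ratio (16000/22050 ≈ 0.726) and rounded to a multiple of 64. A hypothetical helper illustrating that relationship (my own guess, not from the repo):

```python
# Hypothetical helper: scale 22.05kHz STFT parameters to a new sampling rate,
# rounding to a convenient multiple (768 and 192 are both multiples of 64).
def scale_stft_params(filter_length, hop_length, sr_from, sr_to, multiple=64):
    ratio = sr_to / sr_from
    def round_to_multiple(x):
        return multiple * round(x * ratio / multiple)
    return round_to_multiple(filter_length), round_to_multiple(hop_length)

print(scale_stft_params(1024, 256, 22050, 16000))  # (768, 192)
```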

Before training, I used the v5 (22k pretrained) model to run inference on my mel specs, and the speech was still audible (even for male speakers' specs), though of course the pitch was shifted down since I chose a 16kHz output frame rate.

After fine-tuning from the pretrained model, the loss quickly dropped to about -5.0 within a few steps and hovered around -5.5 over my 25k steps, but all audio inferred from the 25k-step checkpoint was full of noise (almost no speech).

And of course, when I trained without the pretrained model, the loss dropped very slowly, and the inference results were also full of noise.

@mychiux413

Maybe we could try modifying the code as in #88, then try again.

@ashish-roopan

So after training from the pretrained model for 25k steps, you are still getting noisy output?
I faced the same issue; the output I got from inference with waveglow_256channels_universal_v5.pt was at least audible.
I also got a similar loss, around -6.

@ashish-roopan

#88 may work

@mychiux413

After applying #88, training 16kHz from the pretrained model is no longer possible, because WaveGlow.upsample depends on win_length/hop_length.
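To see why, here is a plain-arithmetic sketch (no torch; the kernel/stride values are my reading of the repo, where the upsampler is a ConvTranspose1d with kernel_size 1024 and stride 256, which #88 replaces with win_length/hop_length):

```python
# A ConvTranspose1d upsampler maps n_frames mel frames to
# (n_frames - 1) * stride + kernel_size samples (before any trimming).
def upsampled_length(n_frames, kernel_size, stride):
    return (n_frames - 1) * stride + kernel_size

n_frames = 80  # a 16000-sample segment at hop_length=200
print(n_frames * 200)                         # 16000 samples of audio to match
print(upsampled_length(n_frames, 1024, 256))  # 21248 with the 22k kernel
print(upsampled_length(n_frames, 800, 200))   # 16600 with win/hop = 800/200
```

Changing kernel_size/stride to 800/200 also changes the shape of the pretrained upsample weight tensor, so the checkpoint can no longer be loaded directly.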

@ashish-roopan

Yes, I faced the same issue, so I trained the model from scratch. After 100k steps, the audio quality is not improving much.
The generated audio has audible speech but some noise. Do you know how many steps are required to get results similar to the official model?

@ashish-roopan

Have you tried #99? Can we train 16kHz with the pretrained model using that code?

@HiiamCong

HiiamCong commented Aug 3, 2020

Hi, I currently have a problem with 16kHz WaveGlow training.
My Tacotron 2 model is fine (tested with the pretrained WaveGlow model). I'm trying to train WaveGlow from scratch.
I used the WaveGlow code on the master branch with the config.json below:

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 12,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_files.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 800,
        "hop_length": 200,
        "win_length": 800,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

I have trained for 236k steps and every output audio is silent. Hope you guys can shed some light :(
Output audio: https://drive.google.com/drive/folders/1hqVHOVoZISP3-BxvJG8n3MCfG6LGF0te?usp=sharing
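For what it's worth, the segment/grouping arithmetic in that config checks out. A quick sketch (my reading of the repo's channel bookkeeping, so treat the details as assumptions rather than a quote of the code):

```python
# Segment/grouping sanity checks for the config above.
segment_length, hop_length, n_group = 16000, 200, 8
assert segment_length % hop_length == 0  # 80 mel frames per segment
assert segment_length % n_group == 0     # audio folds evenly into groups of 8

# Early outputs: every n_early_every flows (after the first), n_early_size
# channels are emitted, so some channels must remain for the final flows.
def remaining_channels(n_group, n_flows, n_early_every, n_early_size):
    n = n_group
    for k in range(n_flows):
        if k % n_early_every == 0 and k > 0:
            n -= n_early_size
    return n

print(remaining_channels(8, 12, 4, 2))  # 4 channels remain after early outputs
```

Since the shapes are consistent, the silence is probably not a shape/divisibility problem in the config itself.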

@STASYA00

STASYA00 commented Sep 6, 2020

Did anyone manage to solve this issue? I'm also training on a 16kHz dataset. To check the model, I trained it on just 12 samples (1 batch) with different parameters, starting from the pretrained model. The first one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 800,
"hop_length": 200,
"win_length": 800,

"learning_rate": 1e-5

After 500 epochs the loss starts to increase, and all the inferences (500, 1000, ..., 5000) give only noise in the output.
The second one:

"segment_length": 16000,
"sampling_rate": 16000,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,

"learning_rate": 1e-5

This gives audible speech after 500 epochs, but there's a lot of noise and it's too fast.

The questions are: why does the loss increase? Why does the quality stay the same on the training set, not improving even though the samples have been seen many times? And how can the noise be removed and the audio speed normalized?

@xDuck

xDuck commented Sep 29, 2020

Was anyone able to figure this out? I also tried training 16kHz from scratch and had the same experience as @mychiux413.

@adrianastan

You can find a model trained from scratch on 21 hours of multispeaker 16kHz data (544000 training steps) here: http://adrianastan.com/models/ . Not as good as the NVIDIA release, but it does the job.

The config is as follows:

{
    "train_config": {
        "fp16_run": true,
        "output_directory": "checkpoints_swara",
        "epochs": 100000,
        "learning_rate": 1e-4,
        "sigma": 1.0,
        "iters_per_checkpoint": 2000,
        "batch_size": 8,
        "seed": 1234,
        "checkpoint_path": "",
        "with_tensorboard": false
    },
    "data_config": {
        "training_files": "train_SWARA.txt",
        "segment_length": 16000,
        "sampling_rate": 16000,
        "filter_length": 1024,
        "hop_length": 256,
        "win_length": 1024,
        "mel_fmin": 0.0,
        "mel_fmax": 8000.0
    },
    "dist_config": {
        "dist_backend": "nccl",
        "dist_url": "tcp://localhost:54321"
    },

    "waveglow_config": {
        "n_mel_channels": 80,
        "n_flows": 12,
        "n_group": 8,
        "n_early_every": 4,
        "n_early_size": 2,
        "WN_config": {
            "n_layers": 8,
            "n_channels": 256,
            "kernel_size": 3
        }
    }
}

Perhaps you can warmstart your model from it.
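If your STFT settings differ from the checkpoint's, one option is a partial warm start that copies only the tensors whose shapes match and lets the rest (notably the upsample kernel) train from scratch. A minimal sketch of the filtering logic, with hypothetical parameter names; shapes are shown as plain tuples, whereas with PyTorch you would compare tensor.shape on the real state_dicts:

```python
# Keep only pretrained entries whose name exists in the new model with the
# same shape; everything else is left to train from random initialization.
def matching_params(pretrained, model):
    return {name: value for name, value in pretrained.items()
            if model.get(name) == value}

pretrained = {"upsample.weight": (80, 80, 1024),         # 22kHz kernel
              "WN.0.in_layers.0.weight": (512, 256, 3)}  # unaffected by hop
model      = {"upsample.weight": (80, 80, 800),          # 16kHz kernel
              "WN.0.in_layers.0.weight": (512, 256, 3)}
print(matching_params(pretrained, model))  # only the WN layer survives
```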

@xprilion

xprilion commented Apr 19, 2021

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

@naba89

naba89 commented Apr 21, 2021

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

Can you also share your config, please?

@Merlin-721

Trained one for 377.5k steps, unsure of how good/bad it is because for my use case it was okay-ish - https://drive.google.com/file/d/1dP4eMDPrZyqRo_gMz1VUDr2Bd_eRXoIa/view?usp=sharing

I get the following exception when loading the model:
No module named 'waveglow'
