Implementation of FlashSpeech. For all details, check out our paper accepted to ACM MM 2024: FlashSpeech: Efficient Zero-Shot Speech Synthesis.
- This project is a modified version of Amphion's NaturalSpeech2, since the original code relies on some internal Microsoft tools.
- Environment Setup:

  ```bash
  bash env.sh
  ```
- I have replaced Amphion's `accelerate` with `lightning` because I encountered similar issues (see the related issue). Training with `lightning` is faster.
- Modify `ns2dataset.py` based on your data.
- This version has been tested on the LibriTTS dataset. Ensure you have the following data prepared in advance:
  - Pitch
  - Code
  - Phoneme
  - Duration
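As a rough illustration of what "prepared in advance" might look like per utterance, here is a sketch of a feature dictionary with the four items above. The key names, dtypes, and shapes are assumptions for illustration only, not the exact format `ns2dataset.py` uses:

```python
import numpy as np

# Hypothetical per-utterance feature bundle; names and shapes are assumed,
# not taken from the repo. Adjust to match your own ns2dataset.py.
def make_example_item(n_frames=120, n_phones=30):
    rng = np.random.default_rng(0)
    return {
        "pitch": rng.random(n_frames).astype(np.float32),   # frame-level F0 contour
        "code": rng.integers(0, 1024, (n_frames, 8)),       # codec token indices (8 quantizers)
        "phone": rng.integers(0, 100, n_phones),            # phoneme IDs
        "duration": rng.integers(1, 10, n_phones),          # frames per phoneme
    }
```

The frame-level features (pitch, code) share one time axis, while the phoneme-level features (phone, duration) share another; duration maps between the two.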
- Run the Training Script:

  ```bash
  bash egs/tts/NaturalSpeech2/run_train.sh
  ```
Important Notes:
- Choose Configuration:
  - You can select either the `***_s1` or `***_s2` configuration files, depending on the training stage.
- Modify Model Codec:
  - In `models/tts/naturalspeech2/flashspeech.py`, update the codec to your own.
  - Adjust `self.latent_norm` to normalize the codec latent to unit standard deviation. (This step is crucial for training the consistency model.)
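Conceptually, the normalization step above amounts to dividing the codec latents by their global standard deviation so the consistency model sees roughly unit-scale inputs. The sketch below is an assumption-laden illustration of that idea, not the actual code in `flashspeech.py`:

```python
import numpy as np

# Hedged sketch: estimate the std of your codec's latents on real data and
# use it as the normalization constant (the role self.latent_norm plays).
# Your codec's actual statistics will differ from this synthetic example.
def estimate_latent_norm(latents):
    # latents: (batch, time, dim) array of codec latents
    return float(np.std(latents))

def normalize_latents(latents, latent_norm):
    return latents / latent_norm

rng = np.random.default_rng(0)
latents = 4.2 * rng.standard_normal((8, 100, 128))  # pretend codec output with std ~ 4.2
norm = estimate_latent_norm(latents)
unit = normalize_latents(latents, norm)             # std ~ 1 after scaling
```

In practice you would estimate this constant once over a representative sample of your codec's outputs and hard-code it, since the consistency-model noise schedule assumes a fixed data scale.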
- Stage 2 Setup:
  - In `models/tts/naturalspeech2/flashspeech_trainer_stage2.py`, set the initial weights obtained from Stage 1 training.
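One common way to seed a new training stage from an earlier checkpoint is sketched below. The checkpoint path and the `"state_dict"` key follow Lightning's usual checkpoint layout; the real `flashspeech_trainer_stage2.py` may wire this up differently:

```python
import os
import tempfile
import torch
from torch import nn

# Hypothetical helper: load Stage 1 weights into the Stage 2 model.
# strict=False tolerates heads that exist in only one of the two stages.
def load_stage1_weights(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    # Lightning checkpoints nest weights under "state_dict"; a raw
    # torch.save of a state dict is already the mapping itself.
    state_dict = ckpt.get("state_dict", ckpt)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected
```

The returned `missing`/`unexpected` lists are worth logging once: they show exactly which parameters Stage 2 did not inherit from Stage 1.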
- Stage 3 Development:
  - The code for Stage 3 is not yet released. However, you can refer to Stage 1's consistency training to implement it.
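For orientation, a minimal sketch of the consistency-training objective (in the spirit of Stage 1) is shown below as a possible starting point. The denoiser is a stand-in module, and the noise schedule and EMA update are simplified; the real model and training loop live in the Stage 1 code:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Sketch of a consistency-training step: the student matches the EMA
# teacher's output on the same sample at an adjacent (lower) noise level.
def consistency_training_loss(model, ema_model, x0, sigma_cur, sigma_next):
    noise = torch.randn_like(x0)
    x_noisy_next = x0 + sigma_next * noise   # sample at the higher noise level
    x_noisy_cur = x0 + sigma_cur * noise     # same noise draw, lower level
    pred = model(x_noisy_next, sigma_next)
    with torch.no_grad():                    # teacher target, no gradient
        target = ema_model(x_noisy_cur, sigma_cur)
    return F.mse_loss(pred, target)

class TinyDenoiser(nn.Module):
    # Placeholder network conditioned on the noise level; the real Stage 3
    # model would be the FlashSpeech latent denoiser.
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim + 1, dim)

    def forward(self, x, sigma):
        cond = torch.full_like(x[..., :1], sigma)
        return self.proj(torch.cat([x, cond], dim=-1))
```

After each optimizer step on the student, the teacher's parameters are updated as an exponential moving average of the student's; only the student receives gradients.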
TODO: further organize the project structure and complete the remaining code.
Special thanks to Amphion, from which our codebase is primarily borrowed.
Thank you for using FlashSpeech!