ASR project for the 'Deep Learning in Audio' course
In ctc_char_encoder.py I added beam search with and without an LM.
```shell
pip install https://github.com/kpu/kenlm/archive/master.zip
wget https://www.openslr.org/resources/11/3-gram.arpa.gz --no-check-certificate
gzip -d 3-gram.arpa.gz
```
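As an optional sanity check (not part of the repo scripts), the unpacked ARPA model can be loaded with the kenlm Python bindings installed above:

```python
import kenlm

# Load the unpacked LibriSpeech 3-gram ARPA model and score a sentence.
lm = kenlm.Model("3-gram.arpa")
print(lm.score("the quick brown fox", bos=True, eos=True))  # total log10 probability
```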
Download the model checkpoint from Google Drive: https://drive.google.com/file/d/1FfcDs004kl3bo8prP-TmvpES9igAayBl/view?usp=sharing
For test-clean:

```shell
python3 test.py \
   --config hw_asr/configs/test_clean.json \
   --resume model_best.pth \
   --batch-size 64 \
   --jobs 4 \
   --beam-size 100 \
   --output output_clean.json
```
For test-other:

```shell
python3 test.py \
   --config hw_asr/configs/test_other.json \
   --resume model_best.pth \
   --batch-size 64 \
   --jobs 4 \
   --beam-size 100 \
   --output output_other.json
```
The config depends on which model you want to train, but I trained the best model with this command:

```shell
python3 train.py -c hw_asr/configs/train_together.json
```
```shell
python3 test.py \
   -c hw_asr/configs/test_other.json \
   -r model_best.pth \
   -t test_data \
   -o test_result.json \
   -b 5
```
GPU: Tesla P100-PCIE
For my implementation of ASR, I chose the model from the paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. The model architecture in this project closely follows that architecture, though it is somewhat reduced due to resource limitations.
For the padding, kernel sizes, and strides of the convolutional layers, I looked up the official DeepSpeech2 documentation in order to build a strong model from the very first steps.
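These values matter because they determine how many time steps the CTC loss and decoder actually see. A small sketch of the standard output-length formula (the kernel/stride/padding numbers here are illustrative placeholders, not necessarily the config values):

```python
def conv_out_len(seq_len: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a convolution along the time axis."""
    return (seq_len + 2 * padding - kernel) // stride + 1

# Illustrative example: two time-axis convolutions with kernel 11 and strides 2 and 1,
# applied to a 1000-frame spectrogram.
t = 1000
t = conv_out_len(t, kernel=11, stride=2, padding=5)
t = conv_out_len(t, kernel=11, stride=1, padding=5)
print(t)  # number of time steps fed to the GRU stack / CTC loss
```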
- First, I train the model only on the LibriSpeech train-clean-100 and train-clean-360 subsets for 50 epochs and test its quality on the LibriSpeech test-clean set. [Since the audio is clean, I did not add any noise augmentations.]
- Then, I fine-tune the model on the LibriSpeech train-other-500 subset for 50 epochs and test its final quality on test-other. [At the fine-tuning step I add noise augmentations to enrich the training data and adapt the model to noisy audio.]
The hypothesis here is that there is a noticeable shift in the data distribution when we move from the clean dataset to 'other', so the first pipeline (pretraining on clean, then fine-tuning on other) would not work well. Because of that, the model needs to be trained on both datasets at once so that it sees varied examples. So the pipeline is:
- Train on both the clean and other train sets and evaluate on test-other straight away to track the real accuracy of the model during training.
First I implemented a model with 2 convolutional layers and 4 GRU layers. Optimizer: SGD; LR scheduler: OneCycleLR; no augmentations for the train-clean part. See the config for the clean-part training here. The CER at this step, training only on the clean part, was 0.39; I would rather not state the WER, because I was not computing it correctly at the time.
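For reference, a minimal sketch of the SGD + OneCycleLR wiring (a dummy model and placeholder hyperparameters, not the repo's trainer code):

```python
import torch
import torch.nn as nn

# Dummy stand-in for the acoustic model; the optimizer/scheduler wiring is the point here.
model = nn.GRU(input_size=128, hidden_size=256, num_layers=4, batch_first=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, epochs=50, steps_per_epoch=100,
)

for step in range(100):
    x = torch.randn(8, 50, 128)      # fake batch of spectrogram frames
    out, _ = model(x)
    loss = out.pow(2).mean()         # placeholder loss (the real objective is CTC)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                 # OneCycleLR is stepped once per batch
```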
I decided to carry on with training on the clean part and improve the CER and WER on it, so I added more GRU layers and augmentations. I took the architecture shown in the picture in part 1 [3 convolutional layers and 7 GRU layers]. First I trained for 50 epochs on the train-clean datasets with the following augmentations: for waveforms, Shift, Gain, and Gaussian Noise; for spectrograms, FrequencyMasking and TimeMasking. Afterwards, I trained the same model for 50 epochs on train-other with the same augmentations, but without the Gaussian Noise. Configs for training: first 50 epochs and second 50 epochs. Training logs: first 50 epochs and second 50 epochs.
[model numbers were first train_ds2/1021_175629, second train_ds2_other/1022_173704]
On test-clean:

Beamsearch | WER | CER |
---|---|---|
NO | 0.45 | 0.16 |

On test-other:

Beamsearch | WER | CER |
---|---|---|
NO | 0.63 | 0.27 |
Using the TimeInversion augmentation was a failure: the augmented audio became impossible to listen to, and the usefulness of this augmentation was questionable.
I decided to try training the model for 100 epochs on the whole dataset (clean + other) and see how well that goes. The model architecture was changed a bit, since I had to reduce the number of GRU layers to only 4 because of the enlarged dataset. This attempt failed as well [see the metrics below]. Logs and config.
On test-clean:

Beamsearch | WER | CER |
---|---|---|
NO | 0.99 | 0.71 |

On test-other:

Beamsearch | WER | CER |
---|---|---|
NO | 1.02 | 0.68 |
My hypothesis is that using the Gaussian noise augmentation on the "other" dataset was overkill. Also, the SGD optimizer might be less efficient than Adam in this case, but I wanted to try the CyclicLR scheduler in triangular2 mode, and it is compatible only with SGD.
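A small sketch of what that CyclicLR setup looks like in PyTorch; with the default cycle_momentum=True it requires an optimizer that exposes a momentum parameter, which is why it pairs with SGD rather than Adam (the values below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder module

# With the default cycle_momentum=True, CyclicLR needs an optimizer exposing
# `momentum`, so it is paired with SGD here rather than Adam.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,
    max_lr=1e-2,
    step_size_up=2000,     # iterations per half-cycle (placeholder)
    mode="triangular2",    # the cycle amplitude is halved after each full cycle
)
```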
You can see the code of my beam search without a language model in the /text_encoder/ctc_char_text_encoder.py file, yet I wouldn't say that the custom beam search did anything valuable here: the CER and WER metrics actually got slightly worse with beam search [example below].
Beamsearch | WER | CER |
---|---|---|
NO | 0.46 | 0.16 |
YES | 0.47 | 0.17 |
[weird, no improvement in the metrics]
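For context, a minimal sketch of a CTC prefix beam search without an LM; this is not the exact code from the repo, and the alphabet layout and blank index are assumptions:

```python
import math
from collections import defaultdict


def ctc_beam_search(log_probs, alphabet, blank=0, beam_size=100):
    """Prefix beam search over CTC log-probabilities.

    log_probs: [T, V] per-frame log-probabilities.
    alphabet:  list of characters aligned with the V axis (alphabet[blank] is the blank).
    Returns the most probable decoded string.
    """

    def logsumexp(*xs):
        xs = [x for x in xs if x != -math.inf]
        if not xs:
            return -math.inf
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # Each beam entry: prefix -> (log prob of paths ending in blank, ending in non-blank).
    beams = {(): (0.0, -math.inf)}

    for t in range(len(log_probs)):
        next_beams = defaultdict(lambda: (-math.inf, -math.inf))
        for prefix, (p_b, p_nb) in beams.items():
            for c in range(len(alphabet)):
                p = log_probs[t][c]
                if c == blank:
                    # Blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                elif prefix and prefix[-1] == c:
                    # Repeated char: only a path through blank extends the prefix...
                    new_prefix = prefix + (c,)
                    nb_b, nb_nb = next_beams[new_prefix]
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p))
                    # ...otherwise the repeat collapses into the same prefix.
                    sb_b, sb_nb = next_beams[prefix]
                    next_beams[prefix] = (sb_b, logsumexp(sb_nb, p_nb + p))
                else:
                    new_prefix = prefix + (c,)
                    nb_b, nb_nb = next_beams[new_prefix]
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p, p_nb + p))
        # Keep only the beam_size most probable prefixes.
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_size])

    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(alphabet[i] for i in best)
```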
Here I played with the hyperparameters a bit, though not much changed: I only adjusted the number of epochs, removed the Gaussian noise augmentation, and kept training on the clean + other train sets, but now one epoch has 1000 steps [might be overkill], and validation runs on test-other to see how good a score we can get. Since the training data is large, I reduced the number of GRU layers [leaving 4 GRU layers and 2 convolutional layers] and ended up with only 19M parameters, which might result in insufficient quality.
- Since one epoch took about 20 minutes due to the number of steps per epoch, I decided to stop training at epoch 33 to see how this change in hyperparameters affected the model accuracy. On test-clean:
Beamsearch with LM | WER | CER |
---|---|---|
NO | 0.27 | 0.08 |
YES | 0.16 | 0.06 |
On test-other:
Beamsearch with LM | WER | CER |
---|---|---|
NO | 0.45 | 0.19 |
YES | 0.46 | 0.19 |
For beam search with an LM I used the 3-gram.arpa LM from the LibriSpeech models and implemented a CTC decoder that can decode texts. Additionally, I added some hot words [hard words whose spelling few people would guess], which are passed to the decoder with a significant weight so that the decoder pays extra attention to these words.
In the prior experiment [No. 4] I ran beam search with the LM using the hyperparameters alpha = 0.5 and beta = 0.1; these may need some changes, so I experiment with them in the fifth, final experiment.
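As an illustration of how such an LM-weighted beam search with hot words can be assembled (this sketch uses pyctcdecode on top of the KenLM ARPA model; the character vocabulary, dummy log-probabilities, and hot-word list are placeholders, while alpha/beta match the values above):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Character vocabulary; must match the acoustic model's output order ("" is the CTC blank).
labels = list(" abcdefghijklmnopqrstuvwxyz'") + [""]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="3-gram.arpa",  # the LibriSpeech 3-gram LM downloaded above
    alpha=0.5,                       # LM weight
    beta=0.1,                        # word-insertion bonus
)

# Stand-in for the acoustic model's per-frame log-probabilities, shape [T, vocab].
x = np.random.randn(200, len(labels)).astype(np.float32)
log_probs = x - np.logaddexp.reduce(x, axis=1, keepdims=True)  # log-softmax over the vocab

text = decoder.decode(
    log_probs,
    beam_width=100,
    hotwords=["worcestershire", "colonel"],  # illustrative "hard to spell" hot words
    hotword_weight=10.0,                     # boost applied to hot-word hypotheses
)
print(text)
```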
You can download the model's weights here
2 convolutional layers + 5 GRU layers. In the paper Deep Speech 2: End-to-End Speech Recognition in English and Mandarin the authors report no radical gain from 3 convolutional layers and 7 GRU layers, and that one can stop at 2 conv layers + 4 GRU layers, but I decided to add one more GRU layer just in case (it did not slow down the training process too much).
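A rough sketch of such an architecture (the kernel/stride values follow common DeepSpeech2 reimplementations; hidden sizes and feature dimensions are illustrative, not the exact numbers from this repo's configs):

```python
import torch
import torch.nn as nn


class DeepSpeech2Like(nn.Module):
    """Reduced DeepSpeech2-style model: 2 Conv2d layers + 5 bidirectional GRU layers."""

    def __init__(self, n_feats: int = 128, n_tokens: int = 28, hidden: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),   # clipped ReLU, as in DS2
            nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        # Frequency dimension after the two convolutions.
        f = (n_feats + 2 * 20 - 41) // 2 + 1
        f = (f + 2 * 10 - 21) // 2 + 1
        self.rnn = nn.GRU(32 * f, hidden, num_layers=5,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_tokens)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: [batch, n_feats, time]
        x = self.conv(spectrogram.unsqueeze(1))          # [batch, 32, f', t']
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # [batch, t', 32 * f']
        x, _ = self.rnn(x)
        return self.fc(x)                                # [batch, t', n_tokens] for CTC


model = DeepSpeech2Like()
print(sum(p.numel() for p in model.parameters()))        # rough parameter count
```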
For waveforms I used Shift, Gain, and Gaussian Noise [links to their docs are given above], and for spectrograms I used TimeMasking and FreqMasking. Since I was training on clean + other, I decided that the probability of adding Gaussian Noise should not be 1 as in the other experiments, because it can make the data nearly unintelligible. You can listen to the audio and look at the spectrograms after augmentations here.
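A minimal sketch of such an augmentation stack, assuming torchaudio transforms for the spectrogram masks and plain tensor ops for the waveform part (the exact repo augmentation classes and probabilities are not reproduced here):

```python
import torch
import torchaudio

# Spectrogram-level masks (torchaudio built-ins).
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=35)


def augment_wave(wave: torch.Tensor, noise_p: float = 0.5) -> torch.Tensor:
    """Illustrative waveform augmentations: random gain, random shift, Gaussian noise.

    Gaussian noise is applied with probability noise_p < 1, so part of the already
    noisy 'other' data stays untouched.
    """
    gain_db = (torch.rand(1).item() * 12) - 6                          # gain in [-6, +6] dB
    wave = wave * (10 ** (gain_db / 20))
    shift = int((torch.rand(1).item() - 0.5) * 0.2 * wave.shape[-1])   # up to ±10% shift
    wave = torch.roll(wave, shifts=shift, dims=-1)
    if torch.rand(1).item() < noise_p:                                 # noise, only sometimes
        wave = wave + 0.002 * torch.randn_like(wave)
    return wave


# Usage: wave -> augment -> spectrogram -> masks.
wave = torch.randn(1, 16000)
spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000)(augment_wave(wave))
spec = time_mask(freq_mask(spec))
```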
On test-clean:
Beamsearch with LM | WER | CER |
---|---|---|
NO | 0.22 | 0.069 |
YES | 0.1996 | 0.074 |
On test-other:
Beamsearch with LM | WER | CER |
---|---|---|
NO | 0.43 | 0.17 |
YES | 0.42 | 0.198 |
Configs: for training, for test-clean, and for test-other. Training logs are here.
For the LM I tried different values of alpha and beta; I tried betas from 1e-3 to 0.1, but the best combination was still alpha = 0.5 and beta = 0.1.
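The tuning itself amounts to a simple grid search over alpha and beta; sketched below with pyctcdecode, where evaluate_wer, val_log_probs, and val_texts are hypothetical stand-ins for the repo's metric code and a held-out validation set:

```python
import itertools

from pyctcdecode import build_ctcdecoder

labels = list(" abcdefghijklmnopqrstuvwxyz'") + [""]  # "" is the CTC blank

best = None
# Beta range mirrors the values tried above (1e-3 to 0.1); alphas are illustrative.
for alpha, beta in itertools.product([0.3, 0.5, 0.7], [1e-3, 1e-2, 0.1]):
    decoder = build_ctcdecoder(labels, kenlm_model_path="3-gram.arpa",
                               alpha=alpha, beta=beta)
    # evaluate_wer, val_log_probs, val_texts are hypothetical placeholders.
    wer = evaluate_wer(decoder, val_log_probs, val_texts)
    if best is None or wer < best[0]:
        best = (wer, alpha, beta)

print(best)  # (best WER, best alpha, best beta)
```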