ASR
ASR (Automatic Speech Recognition) is a technology that converts spoken language into text. ASR is widely used in various applications such as voice assistants, dictation software, and transcription services. SpeeChain provides a collection of ASR recipes that allow users to train and evaluate ASR models on various datasets. The recipes include configurations for different backbones, such as transformer and conformer, and provide pretrained models for reproducibility. Users can also create their own ASR models by following the instructions in the recipes.
Table of Contents
- Available Backbones
- Pretrained Models for Reproducibility
- Training an ASR model
- Creating your own ASR model
Available Backbones (WER with/without LM)
Dataset | Subset | Configuration | Test Clean | Test Other |
---|---|---|---|---|
librispeech | train-clean-5 | 5-bpe5k_conformer-small_lr2e-3 | 15.2% / 35.8% | 35.8% / 42.3% |
librispeech | train-clean-100 | 100-bpe5k_transformer-wide_lr2e-3 | 8.40% / 21.92% | 5.50% / 15.56% |
librispeech | train-clean-100 | 100-bpe5k_conformer-small_lr2e-3 | 6.3% / 9.3% | 19.14% / 25.2% |
librispeech | train-clean-100 | 100-bpe5k_conformer-medium_lr2e-3 | 7.87% / 21.36% | 5.30% / 15.57% |
librispeech | train-clean-100 | 100-bpe5k_conformer-large_lr2e-3 | 7.30% / 20.24% | 5.33% / 15.15% |
librispeech | train-clean-460 | 460-bpe5k_transformer-large | % / % | % / % |
librispeech | train-clean-460 | 460-bpe5k_conformer-large | % / % | % / % |
librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |
librispeech | train-960 | 960-bpe5k_conformer-large | % / % | % / % |
libritts_librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |
Pretrained Models for Reproducibility
For reproducibility of our ASR model configuration files in ${SPEECHAIN_ROOT}/recipes/asr/, we provide the following pretrained models to ensure consistent performance:
- SentencePiece tokenizer models
  - Please download the tokenizer model and vocabulary to the location where your dataset is dumped. The default path is ${SPEECHAIN_ROOT}/datasets.
    Note: If your dataset is dumped outside SpeeChain, replace ${SPEECHAIN_ROOT}/datasets in the following commands with your own path.
  - LibriSpeech:
    - train-clean-100:
    - train-clean-460:
    - train-960:
- Transformer-based language models
  - Please download both the LM model and its configuration file. The default path is ${SPEECHAIN_ROOT}/recipes/lm.
    Note: If you want to store the model files outside SpeeChain, replace ${SPEECHAIN_ROOT}/recipes/lm in the following commands with your own path. Also change the lm_cfg_path and lm_model_path arguments in each ASR configuration file.
  - LibriSpeech:
    - train-clean-100:
    - train-960:
Training an ASR model
Before training an ASR model, ensure that your target datasets are dumped by the scripts in ${SPEECHAIN_ROOT}/datasets/{your-target-dataset}.
More details on how to dump a dataset can be found here.
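As a quick pre-flight check before training, the sketch below tests whether the dataset folder exists. It is only an illustration: the helper name dataset_dir_exists is hypothetical and not part of SpeeChain, and an existing folder does not prove that the dump scripts ran to completion.

```shell
# Hypothetical pre-flight check (not part of SpeeChain): succeed only if the
# dataset folder exists under the given toolkit root.
dataset_dir_exists() {
  # $1: toolkit root (e.g. the value of SPEECHAIN_ROOT), $2: dataset name
  [ -d "$1/datasets/$2" ]
}

# Falls back to the current directory if SPEECHAIN_ROOT is not set.
if dataset_dir_exists "${SPEECHAIN_ROOT:-.}" librispeech; then
  echo "dataset folder found"
else
  echo "dataset folder missing; dump the dataset first"
fi
```

The check only looks at the folder itself, so the contents produced by the dump scripts should still be inspected by hand.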
Use an existing dataset with a pre-tuned configuration
- Locate a .yaml configuration file in ${SPEECHAIN_ROOT}/recipes/asr. Suppose we want to train an ASR model with the configuration ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml.
- Train and evaluate the ASR model on your target training set:
cd ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960
bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)
1. Review the comments at the top of the configuration file to ensure that your computational resources fit the configuration before training the model. If your resources do not match the configuration, adjust it with --ngpu and --gpus to match your available GPU memory.
2. To save the experimental results outside the toolkit folder ${SPEECHAIN_ROOT}, specify your desired location by appending --train_result_path {your-target-path} to bash run.sh. In this example, bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --train_result_path /a/b/c will save results to /a/b/c/960-bpe5k_transformer-wide_ctc_perturb.
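The naming rule for the output folder can be sketched in shell. This is a minimal illustration rather than SpeeChain code: the helper name result_dir is hypothetical, and the only behavior it encodes is the one stated in the example above (results land in {your-target-path}/{exp_cfg}).

```shell
# Hypothetical helper (not part of SpeeChain): print the folder where results
# are expected when --train_result_path is given, following the example above.
result_dir() {
  # $1: value passed to --train_result_path
  # $2: value passed to --exp_cfg (without the .yaml suffix)
  # One trailing slash is stripped so /a/b/c/ and /a/b/c give the same answer.
  printf '%s/%s\n' "${1%/}" "$2"
}

result_dir /a/b/c 960-bpe5k_transformer-wide_ctc_perturb
# prints /a/b/c/960-bpe5k_transformer-wide_ctc_perturb
```

A helper like this can be handy in scripts that need to locate checkpoints or logs after run.sh finishes.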
Creating a new configuration for a non-existing dataset
- Dump your target dataset from the Internet following these instructions.
- Create a folder for your dumped dataset, ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}:
    mkdir -p ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/run.sh ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh
  Note:
  - Update the arguments dataset and subset (line no.16 & 17) in ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh:
- Copy a pre-tuned configuration file into your newly created folder. Suppose we want to use the configuration ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml:
    cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    mkdir ./data_cfg ./exp_cfg
    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml ./exp_cfg
  Note:
  - Update the dataset arguments at the beginning of your selected configuration:
- Train the ASR model on your target training set:
    cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    bash run.sh --test false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)
  Note:
  - The --test false argument skips the testing stage.
  - Ensure your computational resources match the configuration before training the model.
  - To save experimental results outside ${SPEECHAIN_ROOT}, specify your desired location by appending --train_result_path {your-target-path} to bash run.sh.
- Tune the inference hyperparameters on the corresponding validation set:
    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/data_cfg/test_dev-clean+other.yaml ./data_cfg
    mv ./data_cfg/test_dev-clean+other.yaml ./data_cfg/test_{your-valid-set-name}.yaml
    bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --data_cfg test_{your-valid-set-name}
  Note:
  - Update the dataset arguments in ./data_cfg/test_{your-valid-set-name}.yaml:
  - The --train false argument skips the training stage. The --data_cfg argument switches the data loading configuration from the training one in exp_cfg to the one for validation tuning.
  - To access experimental results saved outside ${SPEECHAIN_ROOT}, append --train_result_path {your-target-path} to bash run.sh.
- Evaluate the trained ASR model on the official test sets:
    bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --infer_cfg "{the-best-configuration-you-get-during-validation-tuning}"
  Note:
  - The --train false argument skips the training stage.
  - There are two ways to specify the optimal infer_cfg tuned on the validation set:
    1. Update infer_cfg in ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml.
    2. Provide a parsable string as the value for --infer_cfg in the terminal. For example, beam_size:16,ctc_weight:0.2 can be converted into a dictionary with two key-value items (beam_size=16 and ctc_weight=0.2). For more details about this syntax, refer to here.
  - To access experimental results saved outside ${SPEECHAIN_ROOT}, append --train_result_path {your-target-path} to bash run.sh.
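To make the --infer_cfg string format concrete, the sketch below splits a flat key:value,key:value string into one key=value line per item. It is an illustration only, not SpeeChain's parser: the function name parse_infer_cfg is hypothetical, and the toolkit's actual syntax may support forms (such as nested values) that this sketch does not handle.

```shell
# Hypothetical illustration (not SpeeChain's parser): split a flat
# "key:value,key:value" string into one key=value line per item.
parse_infer_cfg() {
  # Turn the commas into newlines, then split each item at its first colon.
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r item; do
    printf '%s=%s\n' "${item%%:*}" "${item#*:}"
  done
}

parse_infer_cfg "beam_size:16,ctc_weight:0.2"
# prints:
# beam_size=16
# ctc_weight=0.2
```

Quoting the whole string on the command line, as in the run.sh example above, keeps the shell from treating any part of it as a separate argument.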
Creating your own ASR model
The detailed instructions for creating your own ASR model using SpeeChain are coming soon.