# ASR
Automatic Speech Recognition (ASR) converts spoken language into text and is widely used in applications such as voice assistants, dictation software, and transcription services. SpeeChain provides a collection of ASR recipes for training and evaluating ASR models on various datasets. The recipes include configurations for different backbones, such as Transformer and Conformer, and come with pretrained models for reproducibility. Users can also create their own ASR models by following the instructions in the recipes.
👆Back to the recipe README.md
## Table of Contents
- Available Backbones
- Pretrained Models for Reproducibility
- Training an ASR model
- Creating your own ASR model
## Available Backbones (WER with/without LM)
| Dataset | Subset | Configuration | Test Clean | Test Other |
|---|---|---|---|---|
| librispeech | train-clean-5 | 5-bpe5k_conformer-small_lr2e-3 | 15.2% / 35.8% | 35.8% / 42.3% |
| librispeech | train-clean-100 | 100-bpe5k_transformer-wide_lr2e-3 | 8.40% / 21.92% | 5.50% / 15.56% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-small_lr2e-3 | 6.3% / 9.3% | 19.14% / 25.2% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-medium_lr2e-3 | 7.87% / 21.36% | 5.30% / 15.57% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-large_lr2e-3 | 7.30% / 20.24% | 5.33% / 15.15% |
| librispeech | train-clean-460 | 460-bpe5k_transformer-large | % / % | % / % |
| librispeech | train-clean-460 | 460-bpe5k_conformer-large | % / % | % / % |
| librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |
| librispeech | train-960 | 960-bpe5k_conformer-large | % / % | % / % |
| libritts_librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |
👆Back to the table of contents
## Pretrained Models for Reproducibility
For reproducibility of our ASR model configuration files in `${SPEECHAIN_ROOT}/recipes/asr/`, we provide the following pretrained models to ensure consistent performance:

- SentencePiece tokenizer models
  - Please download the tokenizer model and vocabulary to the place where your dataset is dumped. The default path is `${SPEECHAIN_ROOT}/datasets`.
    Note: If your dataset is dumped outside SpeeChain, please replace `${SPEECHAIN_ROOT}/datasets` in the following commands with your own path.
  - LibriSpeech:
    - train-clean-100:
    - train-clean-460:
    - train-960:
- Transformer-based language models
  - Please download both the LM model and its configuration file. The default path is `${SPEECHAIN_ROOT}/recipes/lm`.
    Note: If you want to store model files outside SpeeChain, please replace `${SPEECHAIN_ROOT}/recipes/lm` in the following commands with your own path. Also, change the `lm_cfg_path` and `lm_model_path` arguments in each ASR configuration file.
  - LibriSpeech:
    - train-clean-100:
    - train-960:
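The hypothetical sketch below only illustrates where these files are expected to land; the download addresses and file names are placeholders and must be replaced by the actual links for your chosen subset:

```bash
# Hypothetical sketch of where the pretrained files are expected to live.
# TOKENIZER_URL and LM_URL are placeholders, not real download addresses.
TOKENIZER_URL="https://example.com/tokenizer_model_and_vocab"   # placeholder
LM_URL="https://example.com/lm_model_and_config"                # placeholder

# Tokenizer model + vocabulary go next to the dumped dataset (default ${SPEECHAIN_ROOT}/datasets).
wget -P ${SPEECHAIN_ROOT}/datasets ${TOKENIZER_URL}

# LM model + configuration file go under the LM recipes (default ${SPEECHAIN_ROOT}/recipes/lm).
wget -P ${SPEECHAIN_ROOT}/recipes/lm ${LM_URL}
```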
👆Back to the table of contents
## Training an ASR model
Before training an ASR model, ensure that your target datasets are dumped by the scripts in `${SPEECHAIN_ROOT}/datasets/{your-target-dataset}`.
More details on how to dump a dataset can be found here.
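As a rough illustration only (the real entry script and its arguments are defined by the dataset folder itself, so the script name below is an assumption):

```bash
# Hypothetical sketch: dump LibriSpeech before training an ASR model on it.
# "data_dumping.sh" is an assumed script name; check
# ${SPEECHAIN_ROOT}/datasets/librispeech for the script it actually provides.
cd ${SPEECHAIN_ROOT}/datasets/librispeech
bash data_dumping.sh
```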
### Use an existing dataset with a pre-tuned configuration
- Locate a `.yaml` configuration file in `${SPEECHAIN_ROOT}/recipes/asr`. Suppose we want to train an ASR model with the configuration `${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml`.
- Train and evaluate the ASR model on your target training set:
  - `cd ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960`
  - `bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)`

  Note:
  1. Review the comments at the top of the configuration file to ensure that your computational resources fit the configuration before training the model. If your resources do not match the configuration, adjust `--ngpu` and `--gpus` to match your available GPU memory.
  2. To save the experimental results outside the toolkit folder `${SPEECHAIN_ROOT}`, specify your desired location by appending `--train_result_path {your-target-path}` to `bash run.sh`. In this example, `bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --train_result_path /a/b/c` will save the results to `/a/b/c/960-bpe5k_transformer-wide_ctc_perturb`. A complete invocation combining both options is sketched below.
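For example, a single command that applies both notes might look like this (the GPU count, GPU indices, and output path are placeholders to adapt to your machine):

```bash
# Train and evaluate with the pre-tuned LibriSpeech train-960 configuration.
# --ngpu 2 --gpus 0,1 and /a/b/c are placeholders; adapt them to your hardware and storage.
cd ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960
bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb \
    --ngpu 2 --gpus 0,1 \
    --train_result_path /a/b/c
```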
### Creating a new configuration for a non-existing dataset
- Dump your target dataset from the Internet following these instructions.
- Create a folder for your dumped dataset at `${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}`:
  - `mkdir ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}`
  - `cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/run.sh ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh`

  Note:
  - Update the arguments `dataset` and `subset` (line no. 16 & 17) in `${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh`, e.g. as sketched right after this step.
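A minimal sketch of that edit, assuming the two arguments are plain shell variable assignments (the exact line numbers and syntax may differ in your copy of run.sh, so verify the result by hand):

```bash
# Hypothetical sketch: point the copied run.sh at your own dataset and subset.
# Assumes the arguments look like simple "dataset=..." / "subset=..." assignments;
# open run.sh and edit manually if the format differs.
cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
sed -i 's/^dataset=.*/dataset={your-new-dataset}/' run.sh
sed -i 's/^subset=.*/subset={your-target-subset}/' run.sh
```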
- Copy a pre-tuned configuration file into your newly created folder. Suppose we want to use the configuration `${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml`:
  - `cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}`
  - `mkdir ./data_cfg ./exp_cfg`
  - `cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml ./exp_cfg`

  Note:
  - Update the dataset arguments at the beginning of your selected configuration, e.g. as sketched right after this step.
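One way to make that update, assuming the copied configuration refers to the source data only through the literal path components `librispeech` and `train-960` (inspect the file afterwards to confirm nothing else needs changing):

```bash
# Hypothetical sketch: rewrite the dataset-related arguments in the copied configuration.
# Assumes they appear as the literal strings "librispeech" and "train-960";
# open the .yaml file and verify the result before training.
cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
sed -i 's/librispeech/{your-new-dataset}/g; s/train-960/{your-target-subset}/g' \
    ./exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml
```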
- Train the ASR model on your target training set:
  - `cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}`
  - `bash run.sh --test false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)`

  Note:
  - `--test false` is used to skip the testing stage.
  - Ensure your computational resources match the configuration before training the model.
  - To save experimental results outside `${SPEECHAIN_ROOT}`, specify your desired location by appending `--train_result_path {your-target-path}` to `bash run.sh`.
- Tune the inference hyperparameters on the corresponding validation set:
  - `cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/data_cfg/test_dev-clean+other.yaml ./data_cfg`
  - `mv ./data_cfg/test_dev-clean+other.yaml ./data_cfg/test_{your-valid-set-name}.yaml`
  - `bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --data_cfg test_{your-valid-set-name}`

  Note:
  - Update the dataset arguments in `./data_cfg/test_{your-valid-set-name}.yaml`.
  - `--train false` is used to skip the training stage. `--data_cfg` switches the data loading configuration from the original one for training in `exp_cfg` to the one for validation tuning.
  - To access experimental results saved outside `${SPEECHAIN_ROOT}`, append `--train_result_path {your-target-path}` to `bash run.sh`.
  - One way to sweep the decoding hyperparameters on the validation set is sketched right after this step.
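The sketch below sweeps `beam_size` and `ctc_weight` through the `--infer_cfg` parsable-string syntax; it assumes run.sh accepts `--data_cfg` and `--infer_cfg` in the same call, and the grid values are illustrative rather than recommendations:

```bash
# Possible validation-tuning sweep over decoding hyperparameters.
# Assumes run.sh accepts --data_cfg and --infer_cfg together; grid values are illustrative.
cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
for beam in 4 8 16; do
  for ctc in 0.0 0.2 0.4; do
    bash run.sh --train false \
        --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb \
        --data_cfg test_{your-valid-set-name} \
        --infer_cfg "beam_size:${beam},ctc_weight:${ctc}"
  done
done
```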
- Evaluate the trained ASR model on the official test sets:
  - `bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --infer_cfg "{the-best-configuration-you-get-during-validation-tuning}"`

  Note:
  - `--train false` is used to skip the training stage.
  - There are two ways to specify the optimal `infer_cfg` tuned on the validation set:
    - Update `infer_cfg` in `${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml`.
    - Provide a parsable string as the value for `--infer_cfg` in the terminal. For example, `beam_size:16,ctc_weight:0.2` is converted into a dictionary with two key-value items (`beam_size=16` and `ctc_weight=0.2`). For more details about this syntax, refer to here. This option is illustrated right after this step.
  - To access experimental results saved outside `${SPEECHAIN_ROOT}`, append `--train_result_path {your-target-path}` to `bash run.sh`.
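For instance, passing the best configuration found during validation tuning as a parsable string might look like this (the values are illustrative only):

```bash
# Evaluate on the official test sets with the tuned decoding hyperparameters.
# "beam_size:16,ctc_weight:0.2" is an illustrative value, not a recommendation.
cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
bash run.sh --train false \
    --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb \
    --infer_cfg "beam_size:16,ctc_weight:0.2"
```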
👆Back to the table of contents
## Creating your own ASR model
The detailed instructions for creating your own ASR model using SpeeChain are coming soon.