ASR

ASR (Automatic Speech Recognition) is a technology that converts spoken language into text. ASR is widely used in various applications such as voice assistants, dictation software, and transcription services. SpeeChain provides a collection of ASR recipes that allow users to train and evaluate ASR models on various datasets. The recipes include configurations for different backbones, such as transformer and conformer, and provide pretrained models for reproducibility. Users can also create their own ASR models by following the instructions in the recipes.

👆Back to the recipe README.md

Table of Contents

  1. Available Backbones
  2. Pretrained Models for Reproducibility
  3. Training an ASR model
  4. Creating your own ASR model

Available Backbones (WER with/without LM)

| Dataset | Subset | Configuration | Test Clean | Test Other |
|---|---|---|---|---|
| librispeech | train-clean-5 | 5-bpe5k_conformer-small_lr2e-3 | 15.2% / 35.8% | 35.8% / 42.3% |
| librispeech | train-clean-100 | 100-bpe5k_transformer-wide_lr2e-3 | 8.40% / 21.92% | 5.50% / 15.56% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-small_lr2e-3 | 6.3% / 9.3% | 19.14% / 25.2% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-medium_lr2e-3 | 7.87% / 21.36% | 5.30% / 15.57% |
| librispeech | train-clean-100 | 100-bpe5k_conformer-large_lr2e-3 | 7.30% / 20.24% | 5.33% / 15.15% |
| librispeech | train-clean-460 | 460-bpe5k_transformer-large | % / % | % / % |
| librispeech | train-clean-460 | 460-bpe5k_conformer-large | % / % | % / % |
| librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |
| librispeech | train-960 | 960-bpe5k_conformer-large | % / % | % / % |
| libritts_librispeech | train-960 | 960-bpe5k_transformer-large | % / % | % / % |

👆Back to the table of contents

Pretrained Models for Reproducibility

For reproducibility of our ASR model configuration files in ${SPEECHAIN_ROOT}/recipes/asr/, we provide the following pretrained models to ensure consistent performance:

  1. SentencePiece tokenizer models

    • Please download the tokenizer model and vocabulary to the location where your dataset is dumped. The default path is ${SPEECHAIN_ROOT}/datasets.
      Note: If your dataset is dumped outside SpeeChain, replace ${SPEECHAIN_ROOT}/datasets in the following commands with your own path.

    • LibriSpeech:

      1. train-clean-100:

        # Download BPE model
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-clean-100/bpe5k/no-punc 
        
        # Download BPE vocabulary
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-clean-100/bpe5k/no-punc 
        
      2. train-clean-460:

        # Download BPE model
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-clean-460/bpe5k/no-punc 
        
        # Download BPE vocabulary
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-clean-460/bpe5k/no-punc 
        
      3. train-960:

        # Download BPE model
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-960/bpe5k/no-punc 
        
        # Download BPE vocabulary
        gdown -O ${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-960/bpe5k/no-punc 
        
  2. Transformer-based language models

    • Please download both the LM model and its configuration file. The default path is ${SPEECHAIN_ROOT}/recipes/lm.
      Note: If you want to store the model files outside SpeeChain, replace ${SPEECHAIN_ROOT}/recipes/lm in the following commands with your own path, and update the lm_cfg_path and lm_model_path arguments in each ASR configuration file accordingly.
    • LibriSpeech:

      1. train-clean-100:

        # Download LM model  
        gdown -O ${SPEECHAIN_ROOT}/recipes/lm/librispeech/lm_text/exp/100-bpe5k_transformer_gelu/models   
        # Download LM configuration  
        gdown -O ${SPEECHAIN_ROOT}/recipes/lm/librispeech/lm_text/exp/100-bpe5k_transformer_gelu  
        
      2. train-960:

        # Download LM model
        gdown -O ${SPEECHAIN_ROOT}/recipes/lm/librispeech/train-960_lm_text/exp/960-bpe5k_transformer_gelu/models 
        
        # Download LM configuration
        gdown -O ${SPEECHAIN_ROOT}/recipes/lm/librispeech/train-960_lm_text/exp/960-bpe5k_transformer_gelu  
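
After downloading, it can help to confirm that the files landed where the ASR configurations expect them. The following is a minimal sketch, assuming the default paths used in the commands above; check_download_dir is a hypothetical helper written for this illustration and is not part of SpeeChain. It only checks that each directory exists and is non-empty, because the exact file names depend on what was downloaded.

```shell
#!/usr/bin/env bash
# Sketch only: report whether an expected download directory exists and is non-empty.
# check_download_dir is a hypothetical helper, not a SpeeChain command.
check_download_dir() {
  local dir="$1"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "ok: $dir"
  else
    echo "missing or empty: $dir"
    return 1
  fi
}

# Default locations copied from the download commands above.
# Assumption: fall back to the current directory if SPEECHAIN_ROOT is unset.
SPEECHAIN_ROOT="${SPEECHAIN_ROOT:-$PWD}"
for dir in \
  "${SPEECHAIN_ROOT}/datasets/librispeech/data/sentencepiece/train-clean-100/bpe5k/no-punc" \
  "${SPEECHAIN_ROOT}/recipes/lm/librispeech/lm_text/exp/100-bpe5k_transformer_gelu/models"; do
  # Keep going after a failure so every directory is reported.
  check_download_dir "$dir" || true
done
```

If you store the files elsewhere, replace the two paths in the loop with your own locations.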
        

👆Back to the table of contents

Training an ASR model

Before training an ASR model, ensure that your target datasets are dumped by the scripts in ${SPEECHAIN_ROOT}/datasets/{your-target-dataset}. More details on how to dump a dataset can be found here.

Use an existing dataset with a pre-tuned configuration

  1. Locate a .yaml configuration file in ${SPEECHAIN_ROOT}/recipes/asr. Suppose we want to train an ASR model using the configuration ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml.

  2. Train and evaluate the ASR model on your target training set:

    cd ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960
    bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)

    Note:

    1. Review the comments at the top of the configuration file to check that your computational resources fit the configuration before training the model. If they do not, adjust the run with --ngpu and --gpus to fit your available GPU memory.
    2. To save the experimental results outside the toolkit folder ${SPEECHAIN_ROOT}, append --train_result_path {your-target-path} to bash run.sh. In this example, bash run.sh --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --train_result_path /a/b/c saves the results to /a/b/c/960-bpe5k_transformer-wide_ctc_perturb.
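
As a rough illustration of how the optional arguments combine, the sketch below assembles a full command line and prints the folder where the results are expected to appear. The GPU count and GPU indices (2 and 0,1) are placeholder assumptions, not recommended values; the command is only printed here, not executed.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the run.sh command line and the expected result folder.
exp_cfg="960-bpe5k_transformer-wide_ctc_perturb"
ngpu=2                      # assumed number of GPUs, adjust to your machine
gpus="0,1"                  # assumed GPU indices, comma-separated as in --gpus x,x
train_result_path="/a/b/c"  # example location outside ${SPEECHAIN_ROOT}

cmd="bash run.sh --exp_cfg ${exp_cfg} --ngpu ${ngpu} --gpus ${gpus} --train_result_path ${train_result_path}"
result_dir="${train_result_path}/${exp_cfg}"

echo "Command: ${cmd}"
echo "Results are expected under: ${result_dir}"
```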

Create a new configuration for a new dataset

  1. Dump your target dataset from the Internet following these instructions.

  2. Create a folder for your dumped dataset ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}:

    mkdir -p ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/run.sh ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh
    
    Note:

    • Update the arguments dataset and subset (lines 16 and 17) in ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/run.sh:
    dataset=librispeech -> 'your-new-dataset'
    subset='train-960' -> 'your-target-subset'
    
  3. Copy a pre-tuned configuration file into your newly created folder. Suppose we want to use the configuration ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml:

    cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    mkdir ./data_cfg ./exp_cfg
    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml ./exp_cfg
    
    Note:

    • Update the dataset arguments at the beginning of your selected configuration:
      # dataset-related
      dataset: librispeech -> 'your-new-dataset'
      train_set: train-960 -> 'your-target-subset'
      valid_set: dev -> 'valid-set-of-new-dataset'
      
      # tokenizer-related
      txt_format: asr
      vocab_set: train-960 -> 'your-target-subset'
      token_type: sentencepiece
      token_num: bpe5k
      
  4. Train the ASR model on your target training set:

    cd ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}
    bash run.sh --test false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb (--ngpu x --gpus x,x)
    
    Note:

    1. --test false is used to skip the testing stage.
    2. Ensure your computational resources match the configuration before training the model.
    3. To save experimental results outside ${SPEECHAIN_ROOT}, specify your desired location by appending --train_result_path {your-target-path} to bash run.sh.
  5. Tune the inference hyperparameters on the corresponding validation set:

    cp ${SPEECHAIN_ROOT}/recipes/asr/librispeech/train-960/data_cfg/test_dev-clean+other.yaml ./data_cfg
    mv ./data_cfg/test_dev-clean+other.yaml ./data_cfg/test_{your-valid-set-name}.yaml
    bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --data_cfg test_{your-valid-set-name}
    
    Note:

    1. Update the dataset arguments in ./data_cfg/test_{your-valid-set-name}.yaml:

      dataset: librispeech -> 'your-new-dataset'
      valid_dset: &valid_dset dev-clean -> &valid_dset 'valid-set-of-new-dataset'
      

    2. --train false is used to skip the training stage.

    3. --data_cfg switches the data loading configuration from the original one for training in exp_cfg to the one for validation tuning.
    4. To access experimental results saved outside ${SPEECHAIN_ROOT}, append --train_result_path {your-target-path} to bash run.sh.
  6. Evaluate the trained ASR model on the official test sets:

    bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --infer_cfg "{the-best-configuration-you-get-during-validation-tuning}"
    
    Note:

    1. --train false is used to skip the training stage.
    2. There are two ways to specify the optimal infer_cfg tuned on the validation set:
      1. Update infer_cfg in ${SPEECHAIN_ROOT}/recipes/asr/{your-new-dataset}/{your-target-subset}/exp_cfg/960-bpe5k_transformer-wide_ctc_perturb.yaml.
      2. Provide a parsable string as the value for --infer_cfg in the terminal. For example, beam_size:16,ctc_weight:0.2 can be converted into a dictionary with two key-value items (beam_size=16 and ctc_weight=0.2).
        For more details about this syntax, refer to here.
    3. To access experimental results saved outside ${SPEECHAIN_ROOT}, append --train_result_path {your-target-path} to bash run.sh.
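
To make the --infer_cfg string syntax from step 6 concrete, here is a minimal sketch that splits such a string into key-value pairs. It only mimics the syntax described above (comma-separated items, each written as key:value) and is not SpeeChain's own parser; parse_infer_cfg is a hypothetical helper for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: split an --infer_cfg style string into key=value lines.
# parse_infer_cfg is a hypothetical helper, not part of SpeeChain.
parse_infer_cfg() {
  local cfg="$1" pair
  local IFS=','            # items are separated by commas
  for pair in $cfg; do
    # Everything before the first ':' is the key, the rest is the value.
    echo "${pair%%:*}=${pair#*:}"
  done
}

infer_cfg="beam_size:16,ctc_weight:0.2"
parse_infer_cfg "$infer_cfg"
# prints:
# beam_size=16
# ctc_weight=0.2
```

On the command line, the same string would be passed quoted, as in step 6: bash run.sh --train false --exp_cfg 960-bpe5k_transformer-wide_ctc_perturb --infer_cfg "beam_size:16,ctc_weight:0.2".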

👆Back to the table of contents

Creating your own ASR model

The detailed instructions for creating your own ASR model using SpeeChain are coming soon.

👆Back to the table of contents