
Handbook

Our documentation is organized by user role. You can start exploring SpeeChain by reading the following sections in order or by jumping directly to the section that interests you. These sections also appear as README files in the corresponding subfolders of the toolkit repository.

👆Back to the home page

Table of Contents

  1. For those who just discovered SpeeChain
  2. How to dump a dataset to your machine
  3. How to prepare configuration files
  4. How to train and evaluate a model
  5. How to interpret the files generated in the exp folder
  6. For those who want to use SpeeChain for research
  7. SpeeChain file system
  8. How to customize my own data loading and batching strategy
  9. How to customize my own model
  10. How to customize my own learning rate scheduling strategy
  11. For those who want to contribute to SpeeChain
  12. Contribution specifications

For those who just discovered SpeeChain

In the SpeeChain toolkit, a basic research pipeline has 5 steps:

  1. Dump a dataset from the Internet to your disk.
  2. Prepare experimental configuration files.
  3. Train a model.
  4. Evaluate the trained model.
  5. Analyse the evaluation results.

The following subsections will explain how to execute the steps above one by one.

👆Back to the table of contents

How to dump a dataset to your machine

In our toolkit, the datasets are grouped by their data types. Each available dataset corresponds to a specific folder in ${SPEECHAIN_ROOT}/datasets.

SpeeChain follows an all-in-one dumping style: the bash script data_dumping.sh divides the dataset dumping procedure into individual steps, each executed by a specific script.

We provide an executable script named run.sh in each dataset folder under ${SPEECHAIN_ROOT}/datasets. Please refer to here before starting the dumping pipeline; an illustrative invocation is sketched below.
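As an illustration, dumping a dataset could look like the following (a sketch only; the folder path follows the dataset layout described later in this handbook, and the actual arguments of each run.sh should be checked with --help):

    cd ${SPEECHAIN_ROOT}/datasets/speech_text/librispeech
    bash run.sh --help    # check the supported arguments first
    bash run.sh           # launch the dumping pipeline with the default setting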

👆Back to the table of contents

How to prepare configuration files

Please refer to here for some configuration templates in those exp_cfg folders.

In order to avoid messy and unreadable configuration settings in the terminal, SpeeChain provides some advanced services that simplify configuration.
PS: The following contents are not urgent to learn. If you don't want to jump into the details right now, please go to the next step.

Flexible Path Parsing Services

In SpeeChain, the path arguments can be given in 3 ways:

  1. Absolute Path: You can indicate an absolute path by beginning the path with a slash '/', e.g., /x/xx/xxx/speechain/runner.py.

  2. General Relative Path: If your input path begins with . or .., it will be converted to the corresponding absolute path by our framework.
    Note: The relative path is resolved against the directory where you execute the script, not the directory where the executable script is placed!

  3. In-toolkit Relative Path:
    Path arguments can also be given as a relative location under the toolkit root, i.e., ${SPEECHAIN_ROOT}. The toolkit root ${SPEECHAIN_ROOT} is created by the bash script envir_preparation.sh.
    For example, speechain/runner.py will be parsed to ${SPEECHAIN_ROOT}/speechain/runner.py. If you would like to specify a place outside the toolkit root, directly give its absolute path with a slash / at the beginning to notify the framework, e.g., /x/xx/xxx/speechain/runner.py. An illustrative comparison of the three styles is given after this list.
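As an illustration, the same configuration file could be passed in all three styles as follows (the --config argument name and the file path are hypothetical, used only to demonstrate the path forms):

    # 1. absolute path
    ${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py --config /x/xx/xxx/config/exp_cfg/asr.yaml
    # 2. general relative path, resolved against the current working directory
    ${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py --config ./config/exp_cfg/asr.yaml
    # 3. in-toolkit relative path, resolved against ${SPEECHAIN_ROOT}
    ${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py --config config/exp_cfg/asr.yaml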

Convertible Arguments in the Terminal

Conventionally, it's hard to assign values to List and Dict arguments in the terminal. In SpeeChain, our framework provides a convenient way to convert strings entered in a specified format into the corresponding List or Dict variables.

  1. For List variables, your entered string should be surrounded by a pair of square brackets, and the elements inside should be separated by commas. The structure can be nested to initialize sub-Lists in the returned List variable.
    For example, the string [a,[1,2,[1.1,2.2,3.3],[h,i,j,k]],c,[d,e,[f,g,[h,i,j,k]]]] will be parsed to

    - 'a'
    - - 1
      - 2
      - - 1.1
        - 2.2
        - 3.3
      - - 'h'
        - 'i'
        - 'j'
        - 'k'
    - 'c'
    - - 'd'
      - 'e'
      - - 'f'
        - 'g'
        - - 'h'
          - 'i'
          - 'j'
          - 'k'
    

  2. For Dict variables, each key and its value should be separated by a colon. The value should be surrounded by a pair of braces if it's a sub-Dict. The structure can be nested to initialize sub-Dicts in the returned Dict variable.
    For example, the string a:{b:12.3,c:{d:123,e:{g:xyz}}},g:xyz will be parsed to

    a:
        b: 12.3
        c:
            d: 123
            e:
            g: xyz
    g: xyz
    

Moreover, a List string can also be nested inside a Dict string; for example, a:[1,2,3] will be parsed as

a:
- 1
- 2
- 3
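To make the conversion rules concrete, here is a minimal sketch of how such strings could be parsed into nested Python variables (an illustrative re-implementation for understanding the format, not SpeeChain's actual parsing code):

    def _cast(token: str):
        # Cast a raw token to int or float when possible; otherwise keep the string.
        for cast in (int, float):
            try:
                return cast(token)
            except ValueError:
                continue
        return token

    def _split_top(string: str, sep: str):
        # Split on `sep` only at the top nesting level of '[]' and '{}'.
        parts, depth, start = [], 0, 0
        for i, char in enumerate(string):
            if char in '[{':
                depth += 1
            elif char in ']}':
                depth -= 1
            elif char == sep and depth == 0:
                parts.append(string[start:i])
                start = i + 1
        parts.append(string[start:])
        return parts

    def parse(string: str):
        string = string.strip()
        # A square-bracketed string becomes a List; elements are parsed recursively.
        if string.startswith('[') and string.endswith(']'):
            return [parse(element) for element in _split_top(string[1:-1], ',')]
        # A braced string is the body of a sub-Dict.
        if string.startswith('{') and string.endswith('}'):
            string = string[1:-1]
        # A colon at the top level marks 'key:value' pairs separated by commas.
        if len(_split_top(string, ':')) > 1:
            result = {}
            for item in _split_top(string, ','):
                key, value = item.split(':', 1)
                result[key] = parse(value)
            return result
        return _cast(string)

    print(parse('a:{b:12.3,c:{d:123,e:{g:xyz}}},g:xyz'))
    # -> {'a': {'b': 12.3, 'c': {'d': 123, 'e': {'g': 'xyz'}}}, 'g': 'xyz'}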

Concise Configuration File

As the number of arguments increases, it becomes hard to give all of them one by one in the terminal. As a frequently-used file format for configuration, .yaml is popular in many well-known toolkits.

In SpeeChain, we wrap the conventional .yaml file and provide some advanced !-prefixed .yaml representers to further simplify its layout and improve readability (a combined example is given at the end of this list):

  1. !str allows you to cast a numerical value into a string by replacing key_name: 10 with key_name: !str 10. In this scenario, the value of key_name will be a string '10' instead of an integer 10.

  2. !list allows you to compress the configuration of a list into one line from

    key_name: 
    - - a
      - b
      - c
    - - d
      - e
      - f
    
    to
    key_name:
    - !list [a,b,c]
    - !list [d,e,f]
    
    Note:

    1. The elements should be separated by commas ',' and surrounded by a pair of square brackets '[]'.
    2. Nested structures like key_name: !list [!list [a,b,c],!list [d,e,f]] are not supported yet.
  3. !tuple allows you to create tuples in your configuration. The statement

    key_name: 
    - a
    - b
    - c
    
    can only give us a list, but sometimes we may need to create a tuple. Instead, we can use key_name: !tuple (a,b,c) to create a tuple.
    Note: The elements should be separated by commas ',' and surrounded by a pair of parentheses '()'.

  4. !ref allows you to reuse the values you have already created by replacing

    key_name1: abc/def/ghi/jkl
    key_name2: abc/def/ghi/jkl/mno
    key_name3: abc/def/ghi/jkl/mno/pqr
    

    with

    key_name1: abc/def/ghi/jkl
    key_name2: !ref <key_name1>/mno
    key_name3: !ref <key_name2>/pqr
    

    In this scenario, the value of key_name1 will be reused to create key_name2 which will be further reused to create key_name3.

    Note:

    1. Nested structures like

      key_name1: abc/def/ghi/jkl
      key_name2: !ref <key_name1>/mno
      key_name3: !list [!ref <key_name1>,!ref <key_name2>]
      
      are not supported yet.

    2. Different !ref representers must be used in order. The following usage is invalid because key_name3 references key_name2 before key_name2 is defined:

      key_name1: abc/def/ghi/jkl
      key_name3: !ref <key_name2>/pqr
      key_name2: !ref <key_name1>/mno
      

👆Back to the table of contents

Inference Configuration for Hyperparameter Adjustment

Model inference configuration is given by infer_cfg in the configuration file. infer_cfg can hold either a single inference configuration or multiple ones:

  1. If infer_cfg is not given, the default inference configuration will be used for model inference.

  2. If you only want to give one inference configuration, please give it as either a string or a Dict.

    1. String: The string indicates where the inference configuration file is placed. For example, infer_cfg: config/infer/asr/greedy_decoding.yaml means the configuration file ${SPEECHAIN_ROOT}/config/infer/asr/greedy_decoding.yaml will be used for model inference. In this example, the evaluation results will be saved to a folder named greedy_decoding.
      If there are many arguments you need to give in the configuration, we recommend giving them in a configuration file for concision.

    2. Dict: The Dict indicates the content of your inference configuration. For example,

      infer_cfg:
          beam_size: 1
          temperature: 1.0
      
      means that beam_size=1 and temperature=1.0 will be used for ASR decoding. In this example, the evaluation results will be saved to a folder named beam_size=1_temperature=1.0, whose name is determined by the keys and values in the given Dict.
      If there are not so many arguments in your configuration, we recommend giving them as a Dict to avoid cluttering your disk with configuration files.

  3. If you want to give multiple inference configurations in infer_cfg, please give them as either a List or a Dict.

    1. List: Each element in the List could be either a string or a Dict.
      • The string indicates the file path of a given inference configuration. For example,
        infer_cfg:
          - config/infer/asr/greedy_decoding.yaml
          - config/infer/asr/beam_size=16.yaml
        
        means that both greedy_decoding.yaml and beam_size=16.yaml in ${SPEECHAIN_ROOT}/config/infer/asr/ will be used for ASR decoding.
      • The Dict indicates the content of a given inference configuration. For example,
        infer_cfg:
          - beam_size: 1
            temperature: 1.0
          - beam_size: 16
            temperature: 1.0
        
        could be used and two folders beam_size=1_temperature=1.0 and beam_size=16_temperature=1.0 will be created to place their evaluation results.
      • Of course, strings and Dicts can be mixed in infer_cfg like
        infer_cfg:
          - config/infer/asr/greedy_decoding.yaml
          - beam_size: 16
            temperature: 1.0
        
    2. Dict: The Dict must contain two keys: shared_args and exclu_args.
      shared_args (short for 'shared arguments') is a Dict containing the arguments shared by all the configurations.
      exclu_args (short for 'exclusive arguments') is a List[Dict] where each element contains the exclusive arguments of one configuration.
      For example,
        infer_cfg:
          shared_args:
              beam_size: 16
          exclu_args:
              - temperature: 1.0
              - temperature: 1.5
      
      means that there will be two configurations used for model inference:
      beam_size: 16
      temperature: 1.0
      
      and
      beam_size: 16
      temperature: 1.5
      
      Their evaluation results will be saved to beam_size=16_temperature=1.0 and beam_size=16_temperature=1.5.
      If your configurations share most of their arguments and differ in only one or two, we recommend giving them in this way; a sketch of how this form expands is shown below.
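Conceptually, this form expands into one configuration per element of exclu_args, each merged with shared_args. A minimal sketch of the expansion logic (not the actual SpeeChain implementation):

    shared_args = {'beam_size': 16}
    exclu_args = [{'temperature': 1.0}, {'temperature': 1.5}]

    # Merge the shared arguments into each exclusive Dict to get the final configurations.
    infer_cfg_list = [{**shared_args, **exclu} for exclu in exclu_args]
    # -> [{'beam_size': 16, 'temperature': 1.0}, {'beam_size': 16, 'temperature': 1.5}]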

👆Back to the table of contents

How to train and evaluate a model

We provide two levels of executable bash scripts:

  1. The all-in-one executable run.sh in ${SPEECHAIN_ROOT}/recipes/. This script is task-independent and can be called from anywhere to run an experimental job.
    For more details, please go to ${SPEECHAIN_ROOT}/recipes and run bash run.sh --help for a description of the involved arguments.

  2. Low-level run.sh scripts designed for each sub-folder in ${SPEECHAIN_ROOT}/recipes/. These scripts run the experiments of a specific task.
    For more details, please go to the target sub-folder and run bash run.sh --help for a description of the involved arguments.

The execution hierarchy of the scripts is:

${SPEECHAIN_ROOT}/recipes/{task_name}/{dataset_name}/{subset_name}/run.sh
    --->${SPEECHAIN_ROOT}/recipes/run.sh
        --->${SPEECHAIN_ROOT}/speechain/runner.py
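For example, a typical job could be launched from a recipe sub-folder as follows (an illustrative sketch; the concrete arguments of each run.sh should be checked with --help):

    cd ${SPEECHAIN_ROOT}/recipes/{task_name}/{dataset_name}/{subset_name}
    bash run.sh --help    # check the supported arguments first
    bash run.sh           # launch the experiment with the default setting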

For the detailed instructions about how to launch the jobs for each model, please refer to here and click your target model.

By the way, you can also directly use the command ${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py in your terminal or your own bash script to run your experimental jobs. Before doing so, we recommend first running ${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py --help to familiarize yourself with the involved arguments.

👆Back to the table of contents

How to interpret the files generated in the exp folder

Please refer to ${SPEECHAIN_ROOT}/recipes/README.md for more details.

👆Back to the table of contents

For those who want to use SpeeChain for research

SpeeChain file system

Configuration Folder

This folder contains off-the-shelf configuration files that can be shared across different tasks, models, or datasets. Each type of configuration corresponds to a specific sub-folder where each category of configuration corresponds to a specific sub-sub-folder.

Folder architecture is shown below:

/config
    /feat       # Configuration for acoustic feature extraction
        /log_mel    # Configuration files for log-Mel spectrogram extraction
            /...
        /mfcc       # Configuration files for MFCC extraction
            /...
    /infer      # Configuration for model inference
        /asr        # Configuration files for ASR inference
            /...
        /tts        # Configuration files for TTS inference
            /...
For more details about the configuration files in ${SPEECHAIN_ROOT}/config/feat/, please refer to the docstring of ${SPEECHAIN_ROOT}/datasets/pyscripts/feat_extractor.py.

For more details about the configuration files in ${SPEECHAIN_ROOT}/config/infer/, please refer to the docstring of the corresponding inference function in ${SPEECHAIN_ROOT}/speechain/infer_func/.

👆Back to the table of contents

Dataset Folder

This folder contains off-the-shelf processing scripts to dump datasets into your machine. Each type of dataset corresponds to a specific sub-folder where each dataset corresponds to a specific sub-sub-folder.

Folder architecture is shown below:

/datasets
    /speech_text        # Datasets that are made up of speech and text data
        /librispeech        # Processing scripts for the LibriSpeech dataset
            /...
        /libritts           # Processing scripts for the LibriTTS dataset
            /...
        /ljspeech           # Processing scripts for the LJSpeech dataset
            /...
        /data_dumping.sh    # all-in-one speech-text dataset dumping script
For more details, please refer to the README.md of each type of dataset in ${SPEECHAIN_ROOT}/datasets/.

👆Back to the table of contents

Recipes Folder

This folder contains our recipes for all tasks on the available datasets. Each task corresponds to a specific sub-folder where each dataset corresponds to a specific sub-sub-folder. In each dataset folder, there may be some sub-folders corresponding to different model training settings, and a sub-folder /data_cfg contains all the data loading configuration files shared by those model sub-folders.

Folder architecture is shown below:

/recipes
    /asr                    # Recipes for the ASR task
        /librispeech            # Recipes for ASR models on the LibriSpeech dataset
            ...                     # different ASR settings for LibriSpeech
        /libritts               # Recipes for ASR models on the LibriTTS dataset
            ...                     # different ASR settings for LibriTTS
        /libritts+librispeech   # Recipes for ASR models on the 16kHz-downsampled LibriTTS and LibriSpeech datasets
            ...                     # different ASR settings for the 16kHz-downsampled LibriTTS and LibriSpeech
    /tts                    # Recipes for the TTS task
        /libritts               # Recipes for TTS models on the LibriTTS dataset
            ...                     # different TTS settings for LibriTTS
        /ljspeech               # Recipes for TTS models on the LJSpeech dataset
            ...
    /offline_tts2asr        # Recipes for the offline TTS-to-ASR chain
        /libritts_librispeech   # Recipes for TTS trained on LibriTTS and ASR trained on LibriSpeech
            ...                     # different TTS-to-ASR settings for LibriSpeech and LibriTTS
    /offline_asr2tts        # Recipes for the offline ASR-to-TTS chain
        /libritts                # Recipes for ASR and TTS trained on LibriTTS 
            ...                     # different ASR-to-TTS settings for LibriTTS
For more details, please refer to ${SPEECHAIN_ROOT}/recipes/README.md.

👆Back to the table of contents

Main Folder

The folder /speechain is the core part of our toolkit, where each sub-folder corresponds to a specific part of an experimental pipeline. In each sub-folder, there is a .py file named abs.py that declares the abstract class of the corresponding pipeline part. Based on the abstract class, many implementation classes are included in the same sub-folder with names like xxx.py.

/speechain
    # Sub-folders for all specific parts of an experimental pipeline
    /criterion
        ...
    /dataset
        ...
    /infer_func
        /beam_search.py     # Inference function of the beam searching. Mainly used for ASR models.
        /tts_decoding.py    # Inference function of the autoregressive TTS decoding.
        ...
    /iterator
        ...
    /model
        ...
    /module
        ...
    /optim_sche
        ...
    /tokenizer
        ...
    # General part of the pipeline
    /runner.py          # The entrance of the SpeeChain toolkit for both model training and testing.
    /monitor.py         # The training and testing monitors. Used to record and regulate the training and testing process.
    /snapshooter.py     # The figure snapshooter. Used to transform the input snapshotting materials into visible figures.

Read more about /speechain/criterion
Read more about /speechain/dataset
Read more about /speechain/iterator
Read more about /speechain/model
Read more about /speechain/module
Read more about /speechain/optim_sche
Read more about /speechain/tokenizer

👆Back to the table of contents

How to customize my own data loading and batching strategy

For how to customize your own data loading strategy, please refer to the API document of /speechain/dataset.

For how to customize your own data batching, please refer to the API document of /speechain/iterator.

👆Back to the table of contents

How to customize my own model

For how to customize your own model, please refer to the API document of /speechain/model.

If the existing Module implementations in /speechain/module cannot satisfy your needs, you can refer to the API document of /speechain/module for instructions on how to customize your own modules.

For models involving text tokenization, such as ASR and TTS, if the existing Tokenizer implementations cannot satisfy your needs, you can refer to the API document of /speechain/tokenizer for instructions on how to customize your own tokenizers.

👆Back to the table of contents

How to customize my own learning rate scheduling strategy

For how to customize your own optimization and learning rate scheduling strategy, please refer to the API document of /speechain/optim_sche.

👆Back to the table of contents

For those who want to contribute to SpeeChain

Contribution specifications

We have some specifications to help you standardize your contribution:

  1. Documentation: We would appreciate it a lot if you could provide enough documentation for your contribution.

    • We recommend using Google-style function docstrings. If you are using PyCharm, you can set the docstring style in File→Settings→Tools→Python Integrated Tools→Docstrings→Docstring format.

      As for argument explanations in the docstring, we recommend writing the argument type after the colon and giving its description below with one level of indentation, as follows.

          Args:
              d_model: int
                  The dimension of the input feature sequences.
      
      If the argument type is torch.Tensor or numpy.array, please replace the type with its shape as follows.
          Args:
              emb_feat: (batch_size, seq_len, d_model)
                  Embedded input feature sequences
      

    • For in-line comments, we recommend starting a new line for each comment (it's better not to append a long comment after code on the same line). The code is best divided into several blocks by role, with an in-line comment right above each block, as follows.

        # member registration
        self.d_model = d_model
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.layernorm_first = layernorm_first
      

    Note that you can format the docstrings using docformatter with the following command at the root directory of SpeeChain:

    docformatter --in-place -s google -r speechain --black
    

  2. Naming: We have several recommendations for class names and variable names.

    • For class names, we recommend the CamelCase style. Names are best given in the form of "what it is made up of" + "what it is".

      For example, SpeechTextDataset means a dataset class that returns speech-text paired data during training. Conv2dPrenet means a prenet module that is made up of Conv2d layers.

    • For long variable names, please use abbreviations. We recommend the following two frequently-used strategies:

      • Tail-Truncating: delete the letters from the tail and only retain the part before the second vowel. For example, 'convolution' -> 'conv', 'previous' -> 'prev'.
      • Vowel-Omitting: directly delete all vowels and some trivial consonants behind each vowel. For example, 'transformer' -> 'trfm', 'source' -> 'src', 'target' -> 'tgt'.
    • For temporary variables only used to register data for a short period, please add an underscore at the beginning of the name to notify other users. For example, _tmp_feat_dim denotes a temporary variable holding the intermediate value of the feature dimension. A sketch combining these conventions is given below.
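Putting these specifications together, a contributed module could look like the following sketch (illustrative code, not an actual SpeeChain class):

    import torch

    class Conv2dPrenet(torch.nn.Module):
        """A prenet made up of Conv2d layers ('what it is made up of' + 'what it is')."""

        def __init__(self, d_model: int, num_layers: int):
            """
            Args:
                d_model: int
                    The dimension of the input feature sequences.
                num_layers: int
                    The number of Conv2d layers in the prenet.
            """
            super().__init__()
            # member registration
            self.d_model = d_model
            self.num_layers = num_layers
            # one channel-preserving Conv2d per layer
            self.conv = torch.nn.ModuleList(
                [torch.nn.Conv2d(1, 1, kernel_size=3, padding=1) for _ in range(num_layers)]
            )

        def forward(self, emb_feat: torch.Tensor) -> torch.Tensor:
            """
            Args:
                emb_feat: (batch_size, seq_len, d_model)
                    Embedded input feature sequences.

            Returns:
                (batch_size, seq_len, d_model)
                    The convolved feature sequences.
            """
            # add a singleton channel dimension for the Conv2d stack
            _tmp_feat = emb_feat.unsqueeze(1)
            for conv in self.conv:
                _tmp_feat = torch.relu(conv(_tmp_feat))
            # remove the singleton channel dimension before returning
            return _tmp_feat.squeeze(1)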

👆Back to the table of contents