# Handbook
Our documentation is organized by different roles in this toolkit. You can start exploring SpeeChain by reading the following sections or by jumping directly to the section that interests you. These sections also appear as README files in some subfolders of the toolkit repository.
## Table of Contents

- For those who just discovered SpeeChain
  - How to dump a dataset to your machine
  - How to prepare a configuration file
  - How to train and evaluate a model
  - How to interpret the files generated in the exp folder
- For those who want to use SpeeChain for research
  - SpeeChain file system
  - How to customize my own data loading and batching strategy
  - How to customize my own model
  - How to customize my own learning rate scheduling strategy
- For those who want to contribute to SpeeChain
  - Contribution specifications
## For those who just discovered SpeeChain

In the SpeeChain toolkit, a basic research pipeline has 5 steps:
- Dump a dataset from the Internet to your disk.
- Prepare experimental configuration files.
- Train a model.
- Evaluate the trained model.
- Analyse the evaluation results.
The following subsections will explain how to execute the steps above one by one.
[Back to the table of contents](#table-of-contents)
### How to dump a dataset to your machine

In our toolkit, the datasets are grouped by their data types.
Each available dataset corresponds to a specific folder in `${SPEECHAIN_ROOT}/datasets`.
SpeeChain follows an all-in-one dumping style driven by a bash script named `data_dumping.sh`, in which the dumping procedure is divided into individual steps and each step is executed by a specific script.
We provide an executable script named `run.sh` in each dataset folder under `${SPEECHAIN_ROOT}/datasets`.
Please refer to here before starting the dumping pipeline.
[Back to the table of contents](#table-of-contents)
### How to prepare a configuration file

Please refer to here for the configuration templates in the `exp_cfg` folders.
To avoid messy and unreadable configuration settings in the terminal, SpeeChain provides several services that simplify configuration.

PS: The following content is not urgent to learn. If you don't want to dive into the details right now, please go directly to the next step.
#### Flexible Path Parsing Services

In SpeeChain, path arguments can be given in 3 ways:

- **Absolute Path**: A path that begins with a slash `/` is treated as an absolute path, e.g., `/x/xx/xxx/speechain/runner.py`.
- **General Relative Path**: If your input path begins with `.` or `..`, it will be converted to the corresponding absolute path by our framework.
  **Note**: The relative path is resolved against the directory where you execute the script, not the directory where the executable script is placed!
- **In-toolkit Relative Path**: The path can be given as a location relative to the toolkit root `${SPEECHAIN_ROOT}`, which is created by the bash script `envir_preparation.sh`.
  For example, `speechain/runner.py` will be parsed to `${SPEECHAIN_ROOT}/speechain/runner.py`. If you would like to specify a place outside the toolkit root, give its absolute path with a slash `/` at the beginning, e.g., `/x/xx/xxx/speechain/runner.py`.
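
As a quick illustration, the hypothetical configuration snippet below gives the same script path in the three ways described above. The key names are placeholders for illustration only, not real SpeeChain arguments:

```yaml
# Hypothetical snippet; key names are placeholders.
absolute_path: /x/xx/xxx/speechain/runner.py   # used as-is
relative_path: ../speechain/runner.py          # resolved against the directory where you run the script
in_toolkit_path: speechain/runner.py           # resolved against ${SPEECHAIN_ROOT}
```
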
#### Convertible Arguments in the Terminal

It is usually inconvenient to assign the values of List and Dict arguments in the terminal. SpeeChain therefore provides a convenient way to convert strings entered in a specified format into the corresponding List or Dict variables.

- For List variables, the entered string should be surrounded by a pair of square brackets and the elements inside the brackets should be separated by commas. The structure can be nested to initialize sub-Lists in the returned List variable.
  For example, the string `[a,[1,2,[1.1,2.2,3.3],[h,i,j,k]],c,[d,e,[f,g,[h,i,j,k]]]]` will be parsed to a nested List (see the sketch after this list).
- For Dict variables, a key and its value should be separated by a colon, and the value should be surrounded by a pair of braces if it is a sub-Dict. The structure can be nested to initialize sub-Dicts in the returned Dict variable.
  For example, the string `a:{b:12.3,c:{d:123,e:{g:xyz}}},g:xyz` will be parsed to a nested Dict (see the sketch after this list).

Moreover, a List string can be nested inside a Dict string, e.g., `a:[1,2,3]` will be parsed to a Dict whose value is a List (also shown below).
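
The sketch below shows the parsed structures of the three example strings above, written as YAML for readability; the framework actually returns the corresponding Python List/Dict objects, and whether elements such as `a` or `1` end up as strings or numbers depends on the framework's own parsing rules.

```yaml
# [a,[1,2,[1.1,2.2,3.3],[h,i,j,k]],c,[d,e,[f,g,[h,i,j,k]]]] is parsed to:
- a
- [1, 2, [1.1, 2.2, 3.3], [h, i, j, k]]
- c
- [d, e, [f, g, [h, i, j, k]]]
---
# a:{b:12.3,c:{d:123,e:{g:xyz}}},g:xyz is parsed to:
a:
  b: 12.3
  c:
    d: 123
    e:
      g: xyz
g: xyz
---
# a:[1,2,3] is parsed to:
a: [1, 2, 3]
```
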
#### Concise Configuration File

As the number of arguments increases, it becomes hard to give all of them one by one in the terminal. As a frequently-used format for configuration, `.yaml` files are popular in many well-known toolkits.
In SpeeChain, we wrap the conventional `.yaml` file and provide some advanced `!`-prefixed `.yaml` representers to further simplify its layout and improve its readability:

- `!str` allows you to cast a numerical value into a string by replacing `key_name: 10` with `key_name: !str 10`. In this scenario, the value of `key_name` will be the string `'10'` instead of the integer `10`.
- `!list` allows you to compress the configuration of a list into one line, turning a multi-line block-style list into a single line such as `key_name: !list [a,b,c]` (see the sketch after this list).
  **Note:**
  - The elements should be separated by commas `,` and surrounded by a pair of square brackets `[]`.
  - Nested structures like `key_name: !list [!list [a,b,c],!list [d,e,f]]` are not supported yet.
- `!tuple` allows you to create tuples in your configuration. A plain statement like `key_name: [a,b,c]` can only give us a list, but sometimes we need a tuple. Instead, we can use `key_name: !tuple (a,b,c)` to create a tuple.
  **Note:** The elements should be separated by commas `,` and surrounded by a pair of parentheses `()`.
- `!ref` allows you to reuse values you have already created, so that a later key can refer to the value of an earlier key instead of repeating it (see the sketch after this list). In this scenario, the value of `key_name1` is reused to create `key_name2`, which is further reused to create `key_name3`.
  **Note:**
  - Nested `!ref` structures are not supported yet.
  - Different `!ref` representers must be used in order; a reference is invalid if `key_name3` is used before `key_name2` has been defined.
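
A minimal sketch of these representers in one hypothetical configuration file is shown below. The `!str`, `!list`, and `!tuple` lines follow the exact forms given above; for `!ref`, only the idea of reusing earlier values is described here, so the `<...>` reference syntax in the sketch is an assumption (borrowed from the SpeechBrain-style convention) and all key names and values are placeholders. Please check the real configuration templates for the exact form.

```yaml
# Hypothetical snippet; all key names and values are placeholders.

# !str: the value becomes the string '10' instead of the integer 10
str_key: !str 10

# !list: the block-style list
#   list_key:
#     - a
#     - b
#     - c
# compressed into a single line
list_key: !list [a,b,c]

# !tuple: parentheses give a tuple, whereas [a,b,c] would only give a list
tuple_key: !tuple (a,b,c)

# !ref: reuse values defined above (reference syntax is an assumption, see the note)
key_name1: placeholder_value
key_name2: !ref <key_name1>/suffix1
key_name3: !ref <key_name2>/suffix2
```
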
[Back to the table of contents](#table-of-contents)
#### Inference Configuration for Hyperparameter Adjustment

The model inference configuration is given by `infer_cfg` in the configuration file.
`infer_cfg` can hold either one inference configuration or multiple configurations:

- If `infer_cfg` is not given, the default inference configuration will be used for model inference.
- If you only want to give one inference configuration, please give it as either a string or a Dict.
  - **String:** The string indicates where the inference configuration file is placed. For example, `infer_cfg: config/infer/asr/greedy_decoding.yaml` means the configuration file `${SPEECHAIN_ROOT}/config/infer/asr/greedy_decoding.yaml` will be used for model inference. In this example, the evaluation results will be saved to a folder named `greedy_decoding`.
    If the configuration contains many arguments, we recommend giving them in a configuration file for conciseness.
  - **Dict:** The Dict directly contains your inference configuration. For example, a Dict with `beam_size: 1` and `temperature: 1.0` (see the sketch after this list) means that `beam_size=1` and `temperature=1.0` will be used for ASR decoding. In this example, the evaluation results will be saved to a folder named `beam_size=1_temperature=1.0`, whose name is decided by the keys and values of the given Dict.
    If the configuration does not contain many arguments, we recommend giving it as a Dict to avoid messy configuration files on your disk.
- If you want to give multiple inference configurations in `infer_cfg`, please give them as either a List or a Dict.
  - **List:** Each element in the List can be either a string or a Dict.
    - A string element indicates the file path of an inference configuration. For example, listing both `greedy_decoding.yaml` and `beam_size=16.yaml` under `${SPEECHAIN_ROOT}/config/infer/asr/` means that both files will be used for ASR decoding.
    - A Dict element directly contains an inference configuration. For example, two Dicts with `beam_size=1, temperature=1.0` and `beam_size=16, temperature=1.0` could be given, and two folders `beam_size=1_temperature=1.0` and `beam_size=16_temperature=1.0` will be created to hold their evaluation results.
    - Of course, strings and Dicts can be mixed in the List (see the sketch after this list).
  - **Dict:** There must be two keys in the Dict: `shared_args` and `exclu_args`.
    - `shared_args` (short for 'shared arguments') is a Dict containing the arguments shared by all the configurations.
    - `exclu_args` (short for 'exclusive arguments') is a List[Dict] where each element contains the exclusive arguments of one configuration.
    - For example, giving `beam_size: 16` in `shared_args` and `temperature: 1.0` and `temperature: 1.5` as the two elements of `exclu_args` (see the sketch after this list) yields two configurations for model inference, and their evaluation results will be saved to `beam_size=16_temperature=1.0` and `beam_size=16_temperature=1.5`.
      If your configurations don't contain many arguments and you only want to change one or two of them per configuration, we recommend giving your configurations in this way.
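
The sketches below, assembled only from the file names and argument names mentioned above, illustrate the alternative ways of writing `infer_cfg`. The `---` separators just keep the alternatives apart; in a real configuration file you would use only one of them:

```yaml
# 1. One configuration, given as a string (path to a configuration file)
infer_cfg: config/infer/asr/greedy_decoding.yaml
---
# 2. One configuration, given as a Dict
#    -> results saved to the folder 'beam_size=1_temperature=1.0'
infer_cfg:
  beam_size: 1
  temperature: 1.0
---
# 3. Multiple configurations, given as a List (strings and Dicts can be mixed)
infer_cfg:
  - config/infer/asr/greedy_decoding.yaml
  - config/infer/asr/beam_size=16.yaml
  - beam_size: 1
    temperature: 1.0
---
# 4. Multiple configurations, given as a Dict with shared_args & exclu_args
#    -> results saved to 'beam_size=16_temperature=1.0' and 'beam_size=16_temperature=1.5'
infer_cfg:
  shared_args:
    beam_size: 16
  exclu_args:
    - temperature: 1.0
    - temperature: 1.5
```
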
[Back to the table of contents](#table-of-contents)
### How to train and evaluate a model

We provide two levels of executable bash scripts:

- The all-in-one executable `run.sh` in `${SPEECHAIN_ROOT}/recipes/`. This script is task-independent and can be called from anywhere to run an experimental job.
  For more details, please go to `${SPEECHAIN_ROOT}/recipes` and run `bash run.sh --help` to see the involved arguments.
- The low-level `run.sh` designed for each sub-folder in `${SPEECHAIN_ROOT}/recipes/`. These scripts are used to run the experiments of a specific task.
  For more details, please go to the target sub-folder and run `bash run.sh --help` to see the involved arguments.

The execution hierarchy of the scripts is:
`${SPEECHAIN_ROOT}/recipes/{task_name}/{dataset_name}/{subset_name}/run.sh` ---> `${SPEECHAIN_ROOT}/recipes/run.sh` ---> `${SPEECHAIN_ROOT}/speechain/runner.py`

For detailed instructions on how to launch the jobs of each model, please refer to here and click your target model.
By the way, you can also directly use the command `${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py` in your terminal or your own bash script to run experimental jobs.
Before doing so, we recommend you first run `${SPEECHAIN_PYTHON} ${SPEECHAIN_ROOT}/speechain/runner.py --help` to familiarize yourself with the involved arguments.
[Back to the table of contents](#table-of-contents)
### How to interpret the files generated in the exp folder

Please refer to `${SPEECHAIN_ROOT}/recipes/README.md` for more details.
[Back to the table of contents](#table-of-contents)
## For those who want to use SpeeChain for research

### SpeeChain file system

#### Configuration Folder
This folder contains off-the-shelf configuration files that can be shared across different tasks, models, or datasets. Each type of configuration corresponds to a specific sub-folder where each category of configuration corresponds to a specific sub-sub-folder.
Folder architecture is shown below:
```
/config
    /feat            # Configuration for acoustic feature extraction
        /log_mel     # Configuration files for log-Mel spectrogram extraction
            /...
        /mfcc        # Configuration files for MFCC extraction
            /...
    /infer           # Configuration for model inference
        /asr         # Configuration files for ASR inference
            /...
        /tts         # Configuration files for TTS inference
            /...
```
For more details about the configuration files in `${SPEECHAIN_ROOT}/config/feat/`, please refer to the docstring of `${SPEECHAIN_ROOT}/datasets/pyscripts/feat_extractor.py`.
For more details about the configuration files in `${SPEECHAIN_ROOT}/config/infer/`, please refer to the docstring of the corresponding inference function in `${SPEECHAIN_ROOT}/speechain/infer_func/`.
[Back to the table of contents](#table-of-contents)
#### Dataset Folder

This folder contains off-the-shelf processing scripts to dump datasets to your machine. Each type of dataset corresponds to a specific sub-folder, and each dataset corresponds to a specific sub-sub-folder.
Folder architecture is shown below:
```
/datasets
    /speech_text             # Datasets that are made up of speech and text data
        /librispeech         # Processing scripts for the LibriSpeech dataset
            /...
        /libritts            # Processing scripts for the LibriTTS dataset
            /...
        /ljspeech            # Processing scripts for the LJSpeech dataset
            /...
        /data_dumping.sh     # all-in-one speech-text dataset dumping script
```
[Back to the table of contents](#table-of-contents)
#### Recipes Folder

This folder contains our recipes for all tasks on the available datasets.
Each task corresponds to a specific sub-folder, and each dataset corresponds to a specific sub-sub-folder.
Inside a dataset folder, there may be several sub-folders corresponding to different model training settings, and a sub-folder named `data_cfg` contains all the data loading configuration files shared by the model sub-folders.
Folder architecture is shown below:
```
/recipes
    /asr                         # Recipes for the ASR task
        /librispeech             # Recipes for ASR models on the LibriSpeech dataset
            ...                  # different ASR settings for LibriSpeech
        /libritts                # Recipes for ASR models on the LibriTTS dataset
            ...                  # different ASR settings for LibriTTS
        /libritts+librispeech    # Recipes for ASR models on the 16khz-downsampled LibriTTS and LibriSpeech datasets
            ...                  # different ASR settings for 16khz-downsampled LibriTTS and LibriSpeech
    /tts                         # Recipes for the TTS task
        /libritts                # Recipes for TTS models on the LibriTTS dataset
            ...                  # different TTS settings for LibriTTS
        /ljspeech                # Recipes for TTS models on the LJSpeech dataset
            ...
    /offline_tts2asr             # Recipes for the offline TTS-to-ASR chain
        /libritts_librispeech    # Recipes for TTS trained on LibriTTS and ASR trained on LibriSpeech
            ...                  # different TTS-to-ASR settings for LibriSpeech and LibriTTS
    /offline_asr2tts             # Recipes for the offline ASR-to-TTS chain
        /libritts                # Recipes for ASR and TTS trained on LibriTTS
            ...                  # different ASR-to-TTS settings for LibriTTS
```
[Back to the table of contents](#table-of-contents)
#### Main Folder

The folder `/speechain` is the core of our toolkit, where each sub-folder corresponds to a specific part of an experimental pipeline.
In each sub-folder, there is a file named `abs.py` that declares the abstract class of the corresponding pipeline part.
Based on the abstract class, the implementation classes are placed in the same sub-folder in files named like `xxx.py`.
```
/speechain
    # Sub-folders for all specific parts of an experimental pipeline
    /criterion
        ...
    /dataset
        ...
    /infer_func
        /beam_search.py      # Inference function of the beam searching. Mainly used for ASR models.
        /tts_decoding.py     # Inference function of the autoregressive TTS decoding.
        ...
    /iterator
        ...
    /model
        ...
    /module
        ...
    /optim_sche
        ...
    /tokenizer
        ...
    # General part of the pipeline
    /run.py                  # The entrance of SpeeChain toolkit for both model training and testing.
    /monitor.py              # The training and testing monitors. Used to record and regulate the training and testing process.
    /snapshooter.py          # The figure snapshooter. Used to transform the input snapshotting materials into the visible figures.
```
- Read more about `/speechain/criterion`
- Read more about `/speechain/dataset`
- Read more about `/speechain/iterator`
- Read more about `/speechain/model`
- Read more about `/speechain/module`
- Read more about `/speechain/optim_sche`
- Read more about `/speechain/tokenizer`
[Back to the table of contents](#table-of-contents)
### How to customize my own data loading and batching strategy

To customize your own data loading strategy, please refer to the API document of `/speechain/dataset`.
To customize your own data batching strategy, please refer to the API document of `/speechain/iterator`.
[Back to the table of contents](#table-of-contents)
### How to customize my own model

To customize your own model, please refer to the API document of `/speechain/model`.
If the existing Module implementations in `/speechain/module` cannot satisfy your needs, you can refer to the API document of `/speechain/module` for instructions on how to customize your own modules.
For models involving text tokenization, such as ASR and TTS, if the existing Tokenizer implementations cannot satisfy your needs, you can refer to the API document of `/speechain/tokenizer` for instructions on how to customize your own tokenizers.
[Back to the table of contents](#table-of-contents)
### How to customize my own learning rate scheduling strategy

To customize your own optimization and learning rate scheduling strategy, please refer to the API document of `/speechain/optim_sche`.
[Back to the table of contents](#table-of-contents)
## For those who want to contribute to SpeeChain

### Contribution specifications

We have some specifications to help you standardize your contribution:

- **Documentation:** We would appreciate it a lot if you could provide enough documentation for your contribution.
  - We recommend using Google-style function docstrings. If you are using PyCharm, you can set the docstring style in File → Settings → Tools → Python Integrated Tools → Docstrings → Docstring format.
    For the argument explanations in the docstring, we recommend writing the argument type after the colon and giving its description below with an indent. If the argument type is `torch.Tensor` or `numpy.array`, please replace the type with its shape.
  - For in-line comments, we recommend starting a new line every time you want to comment (it's better not to append a long comment after the code). The code is best divided into several blocks by role, with an in-line comment right above each block.
    Note that you can format the docstrings using docformatter from the root directory of SpeeChain.
- **Naming:** We have several recommendations for class names and variable names.
  - For class names, we recommend the CamelCase style. Names are best given in the form of "what it is made up of" + "what it is".
    For example, `SpeechTextDataset` means a dataset class that returns speech-text paired data during training, and `Conv2dPrenet` means a prenet module that is made up of Conv2d layers.
  - For long variable names, please use abbreviations. We recommend the following 2 frequently-used strategies:
    - Tail-truncating: delete the letters from the tail and only retain the part before the second vowel, e.g., 'convolution' -> 'conv', 'previous' -> 'prev'.
    - Vowel-omitting: delete all vowels and some trivial consonants behind each vowel, e.g., 'transformer' -> 'trfm', 'source' -> 'src', 'target' -> 'tgt'.
  - For temporary variables that are only used to register data for a short period, please add an underscore at the beginning of the name to notify other users. For example, `_tmp_feat_dim` means a temporary variable used to register the intermediate value of the feature dimension.