Datasets (Directory)
The datasets folder contains all the datasets available in this toolkit.
Each dataset corresponds to a sub-folder and follows a uniform file system.
You can easily dump your target dataset to your machine by following the instructions below.
If you want to contribute a new dataset, we would appreciate it if you followed our file system and metadata formats.
Table of Contents
- File System
- Metadata Format
- How to Dump a Dataset on your Machine
- How to Extract Speaker Embeddings with My Own Model
- How to Contribute a New Dataset
File System
/datasets
/pyscripts # fixed .py scripts provided by the toolkit
data_dumping.sh # the shared .sh script across all the speech-text datasets. It contains the complete pipeline of data dumping.
meta_generator.py # the abstract .py script used by each dataset to decide their own meta generation logic.
meta_post_processor.py # the abstract .py script used by each dataset to decide their own meta post-processing logic.
/{dataset_name} # root folder of each dataset
/data # main folder of each dataset (the folder name 'data' is shared across all the datasets)
/wav # waveform folder (the folder name 'wav' is shared across all the datasets)
/{subset_name} # the name of a subset in this dataset
/wav_{comp_file_ext} # (optional) the folder that contains the compressed package files of all the waveform data
idx2wav_{comp_file_ext} # (optional) the file that contains the pairs of index and the absolute address of waveforms in the compressed package files
idx2wav # the file that contains the pairs of index and the absolute address of waveform files
idx2wav_len # the file that contains the pairs of index and waveform length
idx2{txt_format}_text # the file that contains the pairs of index and transcript text. (txt_format is the format of processed text which is used to distinguish different idx2text files)
... # (optional) some other metadata files available in the dataset (such as idx2spk, idx2spk_feat, idx2gen, ...)
/wav{sample_rate} # (optional) downsampled waveform folder (sample_rate is the sampling rate of waveforms after downsampling)
... # same structure as '/wav'
/{feat_config} # (optional) acoustic feature folder by a given configuration (feat_config is the name of the configuration file)
/{subset_name} # the name of a subset in this dataset
/feat_{comp_file_ext} # (optional) the folder that contains the compressed package files of all the acoustic feature data
idx2feat_{comp_file_ext} # (optional) the file that contains the pairs of index and the absolute address of acoustic features in the compressed package files
idx2feat # the file that contains the pairs of index and the absolute address of acoustic feature files
idx2feat_len # the file that contains the pairs of index and feature length
idx2{txt_format}_text # the file that contains the pairs of index and transcript text. (txt_format is the format of processed text which is used to distinguish different idx2text files)
... # (optional) some metadata files available in the dataset (such as idx2spk, idx2gen, ...)
/{token_type} # token folder of a specific type (token_type is the name of the specified token type)
/{src_subset} # the source subset used for generating the token vocabulary (src_subset is the name of the used source subset)
/{token_config} # the configuration for generating the vocabulary list.
/{txt_format} # the format of text data used to generate the token vocabulary
idx2text # the file that contains the pairs of index and transcript text after tokenization
idx2text_len # the file that contains the pairs of index and transcript text length after tokenization
vocab # token vocabulary
model # (optional) tokenizer model. Used for sentencepiece tokenizers.
run.sh # the dataset-specific .sh script that controls how data dumping is performed for the corresponding dataset
data_download.sh # the dataset-specific .sh script that downloads the corresponding dataset from the internet
meta_generator.py # the dataset-specific .py script that generates the metadata files of the dataset (idx2wav, idx2spk, ...)
meta_post_processor.py # (optional) the dataset-specific .py script that post-processes the metadata files (needed by some datasets like LibriSpeech)
👆Back to the table of contents
Metadata Format
Metadata files are plain-text files without a file extension that are used to access data instances during training. Their formats are uniform across all datasets, which enables automatic configuration of the data loading part of our toolkit. The format of each metadata file is shown below.
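Because every metadata file shares this two-column layout (an index and a value separated by the first blank), it can be parsed with a few lines of Python. The snippet below is only an illustrative sketch; load_metadata is not a function provided by the toolkit.

```python
def load_metadata(path: str) -> dict:
    """Parse an idx2* metadata file into a {index: value} dict.

    Each line holds an index and a value separated by a blank. We split on the
    first whitespace only, because values such as transcripts contain spaces.
    """
    metadata = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            idx, value = line.split(maxsplit=1)
            metadata[idx] = value
    return metadata

# e.g. idx2wav = load_metadata("datasets/librispeech/data/wav/train-clean-100/idx2wav")
```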
idx2wav
In idx2wav, each line corresponds to a pair of file index and the absolute address of a raw waveform file. The index and the address are separated by a blank.
For example,
103-1240-0000 ${SPEECHAIN_ROOT}/datasets/librispeech/data/wav/train-clean-100/103/1240/103-1240-0000.flac
103-1240-0001 ${SPEECHAIN_ROOT}/datasets/librispeech/data/wav/train-clean-100/103/1240/103-1240-0001.flac
103-1240-0002 ${SPEECHAIN_ROOT}/datasets/librispeech/data/wav/train-clean-100/103/1240/103-1240-0002.flac
103-1240-0003 ${SPEECHAIN_ROOT}/datasets/librispeech/data/wav/train-clean-100/103/1240/103-1240-0003.flac
103-1240-0004 ${SPEECHAIN_ROOT}/datasets/librispeech/data/wav/train-clean-100/103/1240/103-1240-0004.flac
All audio file formats supported by soundfile.read() (such as .flac, .wav, ...) are OK in idx2wav.
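Since any format readable by soundfile works, an entry of idx2wav can be inspected directly. A minimal sketch, reusing the hypothetical load_metadata helper from the sketch above:

```python
import soundfile as sf

idx2wav = load_metadata("idx2wav")      # hypothetical helper from the sketch above
wav, sample_rate = sf.read(idx2wav["103-1240-0000"])
print(wav.shape, sample_rate)           # e.g. (225360,) 16000 for this LibriSpeech utterance
```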
👆Back to the table of contents
idx2wav_len
In idx2wav_len, each line corresponds to a pair of file index and file length, separated by a blank. The file length is the number of sampling points in the waveform.
For example,
103-1240-0000 225360
103-1240-0001 255120
103-1240-0002 223120
103-1240-0003 235360
103-1240-0004 200240
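If you ever need to rebuild this file yourself, the lengths can be read from the file headers without decoding the audio, e.g. via soundfile.info(). Again only an illustrative sketch that assumes the hypothetical load_metadata helper shown earlier:

```python
import soundfile as sf

idx2wav = load_metadata("idx2wav")          # hypothetical helper from the Metadata Format sketch
with open("idx2wav_len", "w", encoding="utf-8") as f:
    for idx, wav_path in idx2wav.items():
        n_samples = sf.info(wav_path).frames   # number of sampling points
        f.write(f"{idx} {n_samples}\n")
```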
👆Back to the table of contents
idx2feat
In idx2feat, each line corresponds to a pair of file index and the absolute address of an acoustic feature file. The index and the address are separated by a blank.
For example,
103-1240-0000 ${SPEECHAIN_ROOT}/datasets/librispeech/data/{feat_config}/train-clean-100/103-1240-0000.npz
103-1240-0001 ${SPEECHAIN_ROOT}/datasets/librispeech/data/{feat_config}/train-clean-100/103-1240-0001.npz
103-1240-0002 ${SPEECHAIN_ROOT}/datasets/librispeech/data/{feat_config}/train-clean-100/103-1240-0002.npz
103-1240-0003 ${SPEECHAIN_ROOT}/datasets/librispeech/data/{feat_config}/train-clean-100/103-1240-0003.npz
103-1240-0004 ${SPEECHAIN_ROOT}/datasets/librispeech/data/{feat_config}/train-clean-100/103-1240-0004.npz
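Assuming the acoustic features are stored as .npz archives as in the example above, they can be inspected with NumPy. The key names inside each archive depend on the feature extraction configuration, so the sketch below only lists them:

```python
import numpy as np

feat_archive = np.load("103-1240-0000.npz")
print(feat_archive.files)               # keys stored in the archive
for key in feat_archive.files:
    print(key, feat_archive[key].shape)  # e.g. a feature matrix of [n_frames, feat_dim]
```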
👆Back to the table of contents
idx2feat_len
In idx2feat_len, each line corresponds to a pair of file index and feature length, separated by a blank. The feature length is the number of time frames in the extracted acoustic features (e.g. log-Mel spectrogram or MFCC). The format is the same as idx2wav_len, with the number of sampling points replaced by the number of frames.
👆Back to the table of contents
idx2text
In idx2text, each line corresponds to a pair of file index and transcript text string, separated by a blank. idx2text is renamed to idx2{txt_format}_text to indicate the text processing format used to generate the file. Our toolkit provides several text processing formats that produce transcript strings in different styles.
For example,
103-1240-0000 chapter one missus rachel lynde is surprised missus rachel lynde lived just where the avonlea main road dipped down into a little hollow fringed with alders and ladies eardrops and traversed by a brook
103-1240-0001 that had its source away back in the woods of the old cuthbert place it was reputed to be an intricate headlong brook in its earlier course through those woods with dark secrets of pool and cascade but by the time it reached lynde's hollow it was a quiet well conducted little stream
103-1240-0002 for not even a brook could run past missus rachel lynde's door without due regard for decency and decorum it probably was conscious that missus rachel was sitting at her window keeping a sharp eye on everything that passed from brooks and children up
103-1240-0003 and that if she noticed anything odd or out of place she would never rest until she had ferreted out the whys and wherefores thereof there are plenty of people in avonlea and out of it who can attend closely to their neighbor's business by dint of neglecting their own
103-1240-0004 but missus rachel lynde was one of those capable creatures who can manage their own concerns and those of other folks into the bargain she was a notable housewife her work was always done and well done she ran the sewing circle
👆Back to the table of contents
idx2spk
In idx2spk, each line corresponds to a pair of file index and speaker ID, separated by a blank. For example, the LibriSpeech utterance 103-1240-0000 would be paired with the speaker ID 103.
👆Back to the table of contents
idx2spk_feat
In idx2spk_feat, each line corresponds to a pair of file index and the absolute address of a speaker embedding file. The index and the address are separated by a blank.
For example,
1034_121119_000001_000001 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000001_000001.npy
1034_121119_000002_000001 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000002_000001.npy
1034_121119_000010_000004 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000010_000004.npy
1034_121119_000010_000006 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000010_000006.npy
1034_121119_000012_000000 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000012_000000.npy
1034_121119_000014_000000 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000014_000000.npy
1034_121119_000018_000000 ${SPEECHAIN_ROOT}/datasets/libritts/data/wav/train-clean-100/xvector/1034_121119_000018_000000.npy
👆Back to the table of contents
spk_list
In spk_list, each line contains the string ID of one speaker in the corresponding subset of the dataset. This file is mainly used for training multi-speaker TTS models. For example, for LibriSpeech train-clean-100 each line would hold a single speaker ID such as 103.
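A spk_list file can be derived from the idx2spk file of the same subset. A minimal sketch, again using the hypothetical load_metadata helper from the Metadata Format section:

```python
idx2spk = load_metadata("idx2spk")      # hypothetical helper from the Metadata Format sketch
with open("spk_list", "w", encoding="utf-8") as f:
    for spk in sorted(set(idx2spk.values())):
        f.write(f"{spk}\n")             # one speaker ID per line
```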
👆Back to the table of contents
idx2gen
In idx2gen, each line corresponds to a pair of file index and gender, separated by a blank. For example, the LibriSpeech utterance 103-1240-0000 would be paired with the gender of speaker 103.
👆Back to the table of contents
How to Dump a Dataset on your Machine
To dump an existing dataset:
1. Go to the folder of your target dataset, ${SPEECHAIN_ROOT}/datasets/{dataset_name} (e.g. to dump LibriTTS, go to ${SPEECHAIN_ROOT}/datasets/libritts).
2. Run bash run.sh --help to familiarize yourself with the available arguments.
3. Run bash run.sh to dump your target dataset (add arguments if needed).
Note:
1. If you already have the decompressed dataset on your disk, attach the argument --src_path {the-path-of-your-existing-dataset} to the command bash run.sh in step 3 above. Please make sure that src_path is an absolute path starting with a slash '/' and that its content is exactly the same as the version downloaded from the internet (see the help message of --src_path in each run.sh).
2. If you want to save the dumped data and metadata files outside the toolkit folder (${SPEECHAIN_ROOT}), attach the argument --tgt_path {the-path-you-want-to-save-files} to the command bash run.sh in step 3 above. Please make sure that tgt_path is an absolute path starting with a slash '/'.
👆Back to the table of contents
How to Extract Speaker Embeddings with My Own Model
If you want to extract speaker embeddings with your own pretrained model, please:
1. Do not give the argument --spk_emb_model when running the command bash run.sh.
2. Write your own extraction script. You can use the metadata files idx2wav and idx2wav_len to read and organize the audio files. Please save all the speaker embedding vectors to a specific folder in the same directory as idx2wav and provide a metadata file named idx2spk_feat for data reference (a sketch is given after the notes below).
Note:
1. For the file format of idx2spk_feat, please refer to the idx2spk_feat section above.
2. Please keep the same data indices as idx2wav in your idx2spk_feat.
3. Each speaker embedding vector should have the shape [1, spk_feat_dim].
4. Speaker embedding vectors can be saved in two ways:
   1. save each vector to an individual .npy file;
   2. save all vectors to a single .npz file in which the key of each vector is exactly its index in idx2spk_feat.
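Below is a minimal sketch of such an extraction script. The extract_embedding function is a placeholder for whatever pretrained model you use, and load_metadata is the hypothetical helper from the Metadata Format section; only the input/output conventions (idx2wav in, per-utterance .npy files plus an idx2spk_feat file out) follow the description above.

```python
import os
import numpy as np
import soundfile as sf

def extract_embedding(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Placeholder: run your own pretrained speaker embedding model here.
    It must return a vector of shape [1, spk_feat_dim]."""
    raise NotImplementedError

idx2wav = load_metadata("idx2wav")              # hypothetical helper from the Metadata Format sketch
out_dir = "my_spk_feat"                         # a folder in the same directory as idx2wav
os.makedirs(out_dir, exist_ok=True)

with open("idx2spk_feat", "w", encoding="utf-8") as f:
    for idx, wav_path in idx2wav.items():
        wav, sr = sf.read(wav_path)
        spk_feat = extract_embedding(wav, sr)   # shape: [1, spk_feat_dim]
        npy_path = os.path.abspath(os.path.join(out_dir, f"{idx}.npy"))
        np.save(npy_path, spk_feat)             # option 1: one .npy file per utterance
        f.write(f"{idx} {npy_path}\n")          # keep the same index as idx2wav
```

For option 2, the vectors can instead be collected into a dictionary keyed by the idx2spk_feat indices and written once with numpy.savez.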
👆Back to the table of contents
How to Contribute a New Dataset
If the dataset you want to use in your experiments is not included here, you can build the dumping pipeline for your target dataset by following the instructions below:
- Go to ${SPEECHAIN_ROOT}/datasets/.
- Run bash data_dumping.sh --help to familiarize yourself with the involved arguments.
- Make a new folder in ${SPEECHAIN_ROOT}/datasets/ named after your target dataset.
- Make a new data_download.sh in ${SPEECHAIN_ROOT}/datasets/{dataset_name} to download your target dataset from the internet. Please download the dataset into ${SPEECHAIN_ROOT}/datasets/{dataset_name}/data/wav.
- Make a new meta_generator.py in ${SPEECHAIN_ROOT}/datasets/{dataset_name} to extract the metadata files of your target dataset (an illustrative sketch of the expected output is given after this list). Please refer to ${SPEECHAIN_ROOT}/datasets/meta_generator.py for instructions on how to override the metadata generation pipeline.
- If needed, make a new meta_post_processor.py in ${SPEECHAIN_ROOT}/datasets/{dataset_name} to post-process the extracted metadata files of all the subsets (e.g. combine train-clean-100 and train-clean-360 of LibriSpeech into train-clean-460). Please refer to ${SPEECHAIN_ROOT}/datasets/meta_post_processor.py for instructions on how to override the metadata post-processing pipeline.
- Make a new run.sh in ${SPEECHAIN_ROOT}/datasets/{dataset_name} to control the dumping pipeline of your target dataset. You can refer to the run.sh of the existing dataset folders as a template.
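The actual class interface is defined by the abstract ${SPEECHAIN_ROOT}/datasets/meta_generator.py, so the sketch below does not reproduce it; it only illustrates, with hypothetical helper names, the kind of per-subset output your script has to produce (idx2wav, idx2spk, ...):

```python
import os

def dump_subset_metadata(subset_dir: str, utterances: dict) -> None:
    """Write idx2wav and idx2spk for one subset.

    `utterances` is assumed to map each file index to a (wav_path, speaker_id)
    tuple, gathered however your dataset organizes its files.
    """
    with open(os.path.join(subset_dir, "idx2wav"), "w", encoding="utf-8") as f_wav, \
         open(os.path.join(subset_dir, "idx2spk"), "w", encoding="utf-8") as f_spk:
        for idx in sorted(utterances):
            wav_path, spk = utterances[idx]
            f_wav.write(f"{idx} {os.path.abspath(wav_path)}\n")
            f_spk.write(f"{idx} {spk}\n")
```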
Note:
Please keep the same script names (i.e. data_download.sh, meta_generator.py, and meta_post_processor.py) for compatibility with data_dumping.sh.