Iterator
Iterator is the base class in charge of grouping data instances into batches for training or testing models. Each Iterator object has a built-in speechain.dataset.Dataset object as a member variable. The Iterator object does not directly access the data instances in the built-in Dataset object; instead, it maintains a batching view of the indices of the data instances used for model training or testing.
The iterators are divided into 3 groups: train, valid, and test. In each group, 2 or more iterator objects can be constructed so that a single batch can contain multiple data-label pairs.
Table of Contents
- Configuration File Format
- Iterator Library
- API Document
- How to Construct Multiple Dataloaders for a Batch
Configuration File Format
The configuration of Iterator is given in data_cfg. The configuration format is shown below.
```
train:
    {iterator_name}:
        type: {file_name}.{class_name}
        conf:
            # Built-in Dataset Configuration
            dataset_type: {file_name}.{class_name}
            dataset_conf:
                ...
            # General Iterator Configuration
            batches_per_epoch:
            data_selection:
            is_descending:
            shuffle:
            data_len:
            group_info:
            # Customized Iterator Configuration
            ...
    ...
valid:
    {iterator_name}:
        type: {file_name}.{class_name}
        conf:
            dataset_type: {file_name}.{class_name}
            dataset_conf:
                ...
            ...
    ...
test:
    {test_set_name}:
        {iterator_name}:
            type: {file_name}.{class_name}
            conf:
                dataset_type: {file_name}.{class_name}
                dataset_conf:
                    ...
                ...
        ...
```
- The combination of your first-level keys must be one of `train` & `valid` & `test` (for training and testing), `train` & `valid` (for training only), or `test` (for testing only).
- The second-level keys are iterator names used to distinguish the loaded data of each iterator. There is no restriction on the iterator names, so you can name them as you prefer.
  Under the name of each iterator, there are two third-level keys whose names are fixed:
  - `type`: The value of this key acts as the query string to pick up your target Iterator subclass in `SPEECHAIN_ROOT/speechain/iterator/`. Your given query should be in the form of `{file_name}.{class_name}`, where `file_name` specifies your target .py file in `SPEECHAIN_ROOT/speechain/iterator/` and `class_name` indicates your target Iterator subclass in `SPEECHAIN_ROOT/speechain/iterator/{file_name}.py`.
    For example, `block.BlockIterator` means the subclass `BlockIterator` in `SPEECHAIN_ROOT/speechain/iterator/block.py`.
  - `conf`: The value of this key indicates the configuration of your iterator. The configuration is made up of the following 4 fourth-level keys:
    - `dataset_type`: The value of this key acts as the query string to pick up your target built-in Dataset subclass in `SPEECHAIN_ROOT/speechain/dataset/`. Your given query should be in the form of `{file_name}.{class_name}`, where `file_name` specifies your target .py file in `SPEECHAIN_ROOT/speechain/dataset/` and `class_name` indicates your target Dataset subclass in `SPEECHAIN_ROOT/speechain/dataset/{file_name}.py`.
      For example, `speech_text.SpeechTextDataset` means the subclass `SpeechTextDataset` in `SPEECHAIN_ROOT/speechain/dataset/speech_text.py`.
    - `dataset_conf`: The value of this key contains all the configuration used to initialize the built-in Dataset object. Please refer to the Dataset API document for more details.
    - General Iterator Configuration: These arguments initialize the general part shared by all iterator subclasses. There are 6 general arguments that can be set manually in data_cfg (please refer to speechain.iterator.abs.Iterator.__init__ for more details):
      - `batches_per_epoch`
      - `data_selection`
      - `is_descending`
      - `shuffle`
      - `data_len`
      - `group_info`
    - Customized Iterator Configuration: The arguments of the customized configuration are used by each Iterator subclass to generate the batching view. Please refer to your target Iterator subclass for more details.
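As a concrete illustration of the format above, a minimal data_cfg for training could look like the one below. The iterator name `asr_train`, the file paths, the `main_data` dataset argument, and the `batch_len` value are all hypothetical placeholders chosen for this sketch, not values prescribed by the toolkit:

```
train:
    asr_train:                  # iterator name, chosen freely by the user
        type: block.BlockIterator
        conf:
            dataset_type: speech_text.SpeechTextDataset
            dataset_conf:
                main_data:      # hypothetical dataset arguments
                    feat: /path/to/idx2wav
                    text: /path/to/idx2text
            # general iterator arguments
            shuffle: true
            is_descending: true
            data_len: /path/to/idx2wav_len
            # customized iterator argument of BlockIterator (hypothetical)
            batch_len: 1000000
valid:
    asr_valid:
        type: block.BlockIterator
        conf:
            dataset_type: speech_text.SpeechTextDataset
            dataset_conf:
                main_data:
                    feat: /path/to/idx2wav
                    text: /path/to/idx2text
```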
👆Back to the table of contents
Iterator Library
```
/speechain
    /iterator
        /abs.py      # Abstract class of Iterator. Base of all Iterator implementations.
        /block.py    # Iterator implementation of the block strategy (variable numbers of utterances per batch). Mainly used for ASR and TTS training.
        /piece.py    # Iterator implementation of the piece strategy (a fixed number of utterances per batch). Mainly used for ASR and TTS evaluation.
```
👆Back to the table of contents
API Document
speechain.iterator.abs.Iterator
Non-overridable backbone functions:
1. __init__
2. __len__
3. get_batch_indices
4. get_group_info
5. build_loader
Overridable interface functions:
1. batches_generate_fn
👆Back to the table of contents
speechain.iterator.abs.Iterator
The initialization of the built-in Dataset object is done automatically during the initialization of the iterator.
At the beginning of each epoch, the iterator generates a `torch.utils.data.DataLoader` object to fetch the batches of data instances from the disk.
Each iterator subclass should override a static hook function `batches_generate_fn()` to generate the batching view of the data instances in the built-in Dataset object based on its own data batching strategy.
__init__(self, dataset_type, dataset_conf, batches_per_epoch, data_len, group_info, data_selection, is_descending, shuffle, seed, ngpu, num_workers, pin_memory, distributed, **iter_conf)
- Description:
  The general initialization function shared by all the Iterator classes. Dataset initialization is automatically done here with the given dataset_type and dataset_conf.
- Arguments:
  - dataset_type: str
    Query string to pick up the target Dataset subclass in `SPEECHAIN_ROOT/speechain/dataset/`.
  - dataset_conf: Dict
    Dataset configuration for the automatic initialization of the built-in Dataset object.
  - batches_per_epoch: int = None
    The number of batches in each epoch. This number can be either smaller or larger than the real batch number. If not given (None), all batches will be used in each epoch.
  - data_len: str or List[str] = None
    The path of the data length file. Multiple data length files can be given in a list, but they must contain non-overlapping data instances.
  - group_info: Dict[str, str or List[str]]
    The dictionary of paths of the idx2data files used for group-wise visualization of evaluation results.
  - is_descending: bool = True
    Whether the batches are sorted in descending order by length (True) or in ascending order (False). This argument is effective only when data_len is given.
  - shuffle: bool = True
    Whether the batches are shuffled at the beginning of each epoch.
  - seed: int = 0
    Random seed for iterator initialization. This argument is automatically given by the experiment environment configuration. The seed is used to
    1. shuffle the batches before giving them to the Dataloader of each epoch;
    2. initialize all the workers of the Dataloader for reproducibility.
  - ngpu: int = 1
    The number of GPUs used to train or test models. This argument is automatically given by the experiment environment configuration. The GPU number is used to ensure that each GPU process in DDP mode receives batches with the same number of data instances.
  - num_workers: int = 1
    Number of workers for the Dataloader. This argument is automatically given by the experiment environment configuration.
  - pin_memory: bool = False
    Whether pin_memory is activated in the Dataloader. This argument is automatically given by the experiment environment configuration.
  - distributed: bool = False
    Whether DDP is used to distribute the model. This argument is automatically given by the experiment environment configuration.
  - **iter_conf: Dict
    Iterator configuration for customized batch generation.
__len__(self)
- Description:
  Get the number of batches the iterator will load in each epoch.
- Return:
  If batches_per_epoch is given, its value will be returned; otherwise, the total number of batches in the built-in Dataset object will be returned.
get_batch_indices(self)
- Description:
  This function returns the current batching view of the iterator object.
- Return: List[List[str]]
  The batching view generated by the customized hook interface `batches_generate_fn()`. Each element of the returned batching view list is a sub-list of data indices, where each index corresponds to a data instance in the built-in Dataset object.
get_group_info(self)
- Description:
  This function returns the group information of the data instances in the built-in Dataset object. The returned metadata is mainly used for group-wise visualization of testing results.
- Return: Dict
  If the metadata information is not initialized in the built-in Dataset object, None will be returned. Otherwise, the meta_info member of the built-in Dataset object, which is a dictionary, will be returned.
build_loader(self, epoch, start_step)
- Description:
  This function generates a `torch.utils.data.DataLoader` object to load the batches of data instances for the current epoch.
  If `batches_per_epoch` is not given, all the batches in `self.batches` will be used to generate the Dataloader; if `batches_per_epoch` is given, a batch clip containing `batches_per_epoch` batches will be used to generate the Dataloader.
  `batches_per_epoch` can be either larger or smaller than the total number of batches. For a smaller `batches_per_epoch`, a part of `self.batches` will be used as the batch clip; for a larger `batches_per_epoch`, `self.batches` will be supplemented by a part of itself to form the batch clip.
- Arguments:
  - epoch: int = 1
    The number of the current epoch. Used as part of the random seed to shuffle the batches.
  - start_step: int = 0
    The starting point of the dataloader for the current epoch. Mainly used to resume a model testing job from a checkpoint.
- Return: torch.utils.data.DataLoader
  A DataLoader built on the batch clip of the current epoch. If `batches_per_epoch` is not given, the batch clip is `self.batches`.
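The batch-clip selection described above can be sketched in plain Python. This is an illustrative re-implementation of the idea, not the toolkit's actual code; the function name `make_batch_clip` is made up, and the sketch assumes `batches_per_epoch` is less than twice the number of real batches for brevity:

```python
import random

def make_batch_clip(batches, batches_per_epoch, epoch=1, seed=0):
    """Illustrative sketch of build_loader's batch-clip selection
    (hypothetical helper, not the speechain implementation)."""
    rng = random.Random(seed + epoch)  # epoch is part of the random seed
    batches = list(batches)
    if batches_per_epoch is None or batches_per_epoch == len(batches):
        # no clipping: use all batches
        clip = batches
    elif batches_per_epoch < len(batches):
        # smaller batches_per_epoch: use only a part of the batches
        clip = batches[:batches_per_epoch]
    else:
        # larger batches_per_epoch: supplement the batches with a part of themselves
        clip = batches + batches[: batches_per_epoch - len(batches)]
    rng.shuffle(clip)  # shuffle the clip for the new epoch
    return clip
```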
batches_generate_fn(self, data_index, data_len, batch_size)
- Description:
  This hook function generates the batching view based on your customized batch generation strategy.
  Your overridden function should return the batches of instance indices as a List[List[str]], where each sub-list corresponds to a batch of data instances and each element in a sub-list is the index of a data instance.
  In this original hook implementation, all the data instances in the built-in Dataset object are grouped into batches with exactly the same number of instances. `data_len` is not used in this hook function, but it is used to sort all the instances in the general initialization function of the iterator. The sorted data instances ensure that the instances in a single batch have similar lengths.
- Arguments:
  - data_index: List[str]
    The list of indices of all the data instances available to generate the batching view.
  - data_len: Dict[str, int]
    The dictionary that indicates the data length of each available data instance in data_index.
  - batch_size: int = None
    How many data instances a batch should have. If not given, it will be the number of GPUs (ngpu) to ensure that model validation or testing is done one data instance at a time on each GPU process.
    Note: `batch_size` is implicitly given by `**iter_conf` in `__init__()` to this static hook function, so your implementation doesn't need to keep this argument, and you can declare your own arguments.
- Return: List[List[str]]
  A list of batches generated by your batching strategy. This List[List[str]] is called the batching view of the iterator object. Each batch in the returned list is a sub-list whose elements are the indices of the data instances in the corresponding batch.
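The default fixed-size strategy just described can be sketched in plain Python. This is an illustrative re-implementation mirroring the documented hook, not the toolkit's source, and it is written as a module-level function standing in for the static hook:

```python
def batches_generate_fn(data_index, data_len=None, batch_size=1):
    """Sketch of the default fixed-size batching strategy: group the
    (already length-sorted) data indices into batches of exactly
    batch_size instances (illustrative, not the speechain source)."""
    # data_len is unused here; sorting by length happens during iterator init
    return [
        data_index[i: i + batch_size]
        for i in range(0, len(data_index), batch_size)
    ]
```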
👆Back to the table of contents
How to Construct Multiple Dataloaders for a Batch
Multiple Dataloaders can be easily constructed by giving the configurations of multiple iterators. Each iterator creates an independent Dataloader that contributes a data-label pair to the batch.
An example for semi-supervised ASR training is shown below. There are two iterators in the train group: sup and unsup (the iterator names are given by users based on their preferences).
These two iterators are of the same type, and their built-in datasets are also of the same type.
```
train:
    sup:
        type: block.BlockIterator
        conf:
            dataset_type: speech_text.SpeechTextDataset
            dataset_conf:
                ...
            ...
    unsup:
        type: block.BlockIterator
        conf:
            dataset_type: speech_text.SpeechTextDataset
            dataset_conf:
                ...
            ...
```
If multiple Dataloaders are used to load data, each Dataloader contributes a sub-Dict to the batch Dict train_batch, as shown below.
The name of each sub-Dict is the name the user gave to the corresponding iterator.
```
train_batch:
    sup:
        feat: torch.Tensor
        feat_len: torch.Tensor
        text: torch.Tensor
        text_len: torch.Tensor
    unsup:
        feat: torch.Tensor
        feat_len: torch.Tensor
        text: torch.Tensor
        text_len: torch.Tensor
```
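On the model side, such a multi-dataloader batch is just a nested dictionary keyed by iterator name, so the training step can loop over the sub-Dicts. The sketch below is hypothetical, with plain Python lists standing in for torch.Tensor values and a made-up helper name:

```python
# Hypothetical train_batch with plain lists standing in for torch.Tensor
train_batch = {
    "sup":   {"feat": [[0.1, 0.2]], "feat_len": [2], "text": [[5, 6]], "text_len": [2]},
    "unsup": {"feat": [[0.3, 0.4]], "feat_len": [2], "text": [[7, 8]], "text_len": [2]},
}

def split_batch(train_batch):
    """Collect the data-label pair contributed by each iterator name
    (illustrative helper, not part of the toolkit)."""
    return {name: (sub["feat"], sub["text"]) for name, sub in train_batch.items()}
```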
If you have only one iterator, as in the configuration below, your train_batch will not have any sub-Dicts but only the data-label pair from that iterator.
In this case, you don't need to give a name tag to the iterator.
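A single-iterator configuration without the name tag could then look like the following. This is a sketch inferred from the multi-iterator format above, with the `{iterator_name}` level dropped; the exact layout may differ in the toolkit:

```
train:
    type: block.BlockIterator
    conf:
        dataset_type: speech_text.SpeechTextDataset
        dataset_conf:
            ...
```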