Dataset (Class)

Dataset is the base class in charge of reading data instances from the disk into memory and packaging them into batches for model training or testing.

A Dataset object receives the indices of the selected data instances from the Dataloader object created by the high-level Iterator object. The output batches of packaged data instances may not be fully processed; some post-processing steps are done later in the Model object.

👆Back to the handbook page

Table of Contents

  1. Configuration File Format
  2. Dataset Library
  3. API Document
  4. How to Mix Multiple Data Sources in my Dataset Object
  5. How to Perform Data Selection in my Dataset Object

Configuration File Format

dataset_conf:
    main_data:
        {data_name1}: {data_file_path1}
        {data_name2}: {data_file_path2}
        ...
    data_selection:
        - ...
    # Customized Arguments passed to the hook dataset_init_fn()
    ...
  • The first-level key should be dataset_conf to match the configuration format expected by Iterator.

  • The second-level keys are made up of three parts:

    1. main_data:
      The main body dictionary of the data instances you want to load for training or testing your models.
      Each key-value item corresponds to a data variable in a data instance where the key is the variable name and the value is the file path of the data to be extracted.
      For more details about how to give data_name, please refer to the API document of your target Dataset subclass.

    2. data_selection:
      This argument defines how to select the target data instances in the given dataset.
      For more details about how to configure this argument, please refer to the API document below.

    3. Arguments that are passed to the hook dataset_init_fn() for customized Dataset initialization. For more details about the customized arguments, please refer to the API document of your target Dataset subclass.
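
For reference, a filled-in configuration might look like the snippet below. The data names feat and text and the file paths are hypothetical; the data names actually accepted depend on your target Dataset subclass.

dataset_conf:
    main_data:
        feat: ./datasets/speech/librispeech/data/wav/train_clean_100/feat.scp
        text: ./datasets/speech/librispeech/data/wav/train_clean_100/text
    data_selection:
        - [order, 0.5]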

👆Back to the table of contents

Dataset Library

/speechain
    /dataset
        /abs.py             # Abstract class of Dataset. Base of all Dataset implementations.
        /speech_text.py     # Dataset implementation of speech-text datasets. Mainly used for ASR and TTS models.
👆Back to the table of contents

API Document

speechain.dataset.abs.Dataset

Non-overridable backbone functions:
1. __init__
2. __getitem__
3. data_selection
4. get_data_index
5. remove_data_by_index
6. collate_fn

Overridable interface functions:
1. dataset_init_fn
2. extract_main_data_fn
3. collate_main_data_fn

👆Back to the table of contents

speechain.dataset.abs.Dataset

This base class inherits torch.utils.data.Dataset and provides three hook functions: dataset_init_fn(), extract_main_data_fn(), and collate_main_data_fn(). If you want to make your own Dataset implementation, please follow the instructions to override those hooks.
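
As a rough sketch of what a subclass could look like (the data name text, the argument lowercase, and the file-reading logic below are hypothetical illustrations, not part of the toolkit):

from speechain.dataset.abs import Dataset

class MyTextDataset(Dataset):
    def dataset_init_fn(self, lowercase: bool = False, **dataset_conf):
        # customized initialization driven by the extra keys in dataset_conf
        self.lowercase = lowercase

    def extract_main_data_fn(self, main_data):
        # read one transcript from the disk; main_data['text'] is assumed to be a file path
        with open(main_data['text'], 'r') as f:
            text = f.read().strip()
        return {'text': text.lower() if self.lowercase else text}

    def collate_main_data_fn(self, batch_dict):
        # strings stay as a list; numeric data would be stacked into torch.Tensor here
        return {'text': batch_dict['text']}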

__init__(self, main_data, data_selection, **dataset_conf)

  • Description:
    This initialization function reads the main body of the data instances into its memory. The main body is used to extract individual data instances from the disk to form a batch during model training or testing.
    The hook dataset_init_fn() is executed here after reading the main body files.
  • Arguments:

    • main_data: Dict[str, str or List[str]]
      The main body dictionary of the data instances used in this Dataset object. In each key-value item, the key is the name of the data variable and the value is the absolute path of the target idx2data files. The value can be given as a single path string or a list of multiple path strings.
    • data_selection: List[str or List[str]] = None
      The strategies used during the iterator initialization to select a subset of the data instances. Multiple strategies can be specified in a list (a concrete example follows this argument list). Each data selection strategy must be either a bi-list (non-meta strategy) or a tri-list (meta strategy).

      1. non-meta strategy:
        The rule-based selection strategies that don't involve metadata. These strategies should be given as a bi-list, i.e., ['selection mode', 'selection number'].
        'selection mode' indicates the way to select data instances while 'selection number' indicates how many data instances are selected.
        Currently, available non-meta selection modes include:

        1. 'order': Select the data instances from the beginning of the dataset.
        2. 'rev_order': Select the data instances from the end of the dataset.
        3. 'random': Randomly select the data instances from the dataset.
          Note: You should keep the same random seeds for all the GPU processes in the DDP mode to ensure that the selected data instances are the same in each process. In this case, please set the 'same_proc_seed' argument to True in your configuration given to speechain.runner.py.
      2. meta strategy:
        The selection strategies that involve metadata. These strategies should be given as a tri-list, i.e., ['selection mode', 'selection threshold', 'metadata path'].
        'selection mode' indicates the way to select data instances, 'selection threshold' indicates the metadata threshold used to select data instances, and 'metadata path' indicates where the metadata used for selection is located.
        Currently, available meta selection modes include:

        1. 'min': Select the data instances with smaller metadata values.
        2. 'max': Select the data instances with larger metadata values.
        3. 'middle': Remove the data instances with the largest and smallest metadata values.
    • **dataset_conf:
      The configuration arguments for customized Dataset initialization.
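
For example, a data_selection value combining one non-meta strategy and one meta strategy could look like this in a .yaml file (the metadata file path is hypothetical):

data_selection:
    - [random, 0.2]                              # non-meta: randomly keep 20% of the data
    - [min, !str 10.0, ./metadata/idx2wav_len]   # meta: keep the instances whose metadata is below 10.0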

👆Back to the API list

__getitem__(self, index)

  • Description:
    This function is the implementation of the one in the parent class torch.utils.data.Dataset. It is called by the Dataloader object one data instance at a time. Each time, it receives an index and returns the corresponding data instance.
    The hook extract_main_data_fn() is executed here to extract the main body of the selected data instance from the disk.
  • Arguments:
    • index: str
      The index of the selected data instance given by the Dataloader object.
  • Return: Dict[str, Any]
    A dictionary containing a data instance.
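
Although the Dataloader normally drives this function, it behaves like a string-keyed lookup. A hypothetical manual call (the index value is made up):

instance = dataset['103-1240-0000']   # returns a Dict[str, Any] holding one data instance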

👆Back to the API list

data_selection(self, selection_mode, selection_num, meta_info)

  • Description:
    This function executes data selection according to the given strategy arguments. A new batching view of the selected data instances will be returned.
  • Arguments:

    • selection_mode: str
      The mode indicating how the data instances are selected. Selection modes are grouped by different types of data selection strategies.
    • non-meta strategy:
      The rule-based selection strategies that don't involve metadata. Currently, available non-meta selection modes include:
      1. 'order': Select the data instances from the beginning of the dataset.
      2. 'rev_order': Select the data instances from the end of the dataset.
      3. 'random': Randomly select the data instances from the dataset.
        Note: You should keep the same random seeds for all the GPU processes in the DDP mode to ensure that the selected data instances are the same in each process. In this case, please set the 'same_proc_seed' argument to True in your configuration given to speechain.runner.py.
    • meta strategy:
      The selection strategies that involve metadata. Currently, available meta selection modes include:
      1. 'min': Select the data instances with smaller metadata values.
      2. 'max': Select the data instances with larger metadata values.
      3. 'middle': Remove the data instances with the largest and smallest metadata values.
    • selection_num: float or int or str
      The usage of this argument differs depending on its data type.
    • float type:
      Float value represents the relative number of data instances to be selected. If selection_num is given as a float number, it must be between 0 and 1.
    • int type:
      Integer value represents the absolute number of data instances to be selected. If selection_num is given as an integer number, it must be negative (its absolute value will be taken).
    • str type:
      String value represents the metadata threshold used to select the data instances. Only 'min' and 'max' modes support string selection_num.
      Note: You can use the !-prefixed representer !str to convert a float or integer number to a string in your .yaml file.
    • meta_info: str = None
      The path where the metadata information used for selection is placed. Only the meta strategies 'min', 'max', and 'middle' need this argument.
  • Return: List[str]
    A list of indices of the selected data instances.
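
A few hypothetical calls, one per selection_num data type (the metadata path is made up for illustration):

kept = dataset.data_selection('order', 0.5)      # the first 50% of the data instances
kept = dataset.data_selection('random', -1000)   # 1000 randomly-picked data instances
kept = dataset.data_selection('min', '10.0', meta_info='./metadata/idx2wav_len')
# each call returns List[str]: the indices of the selected data instances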

👆Back to the API list

get_data_index(self)

  • Description:
    This function lets users obtain the data indices of this Dataset object without directly accessing its members, which keeps the coupling between modules low.
  • Return: List[str]
    The list of the indices of all data instances in this dataset.

👆Back to the API list

remove_data_by_index(self, index)

  • Description:
    This function removes the data instance corresponding to the given index from this Dataset object. It's mainly used to resolve index mismatches between the data instances and the high-level Iterator object.
  • Arguments:
    • index: str
      The index of the data instance to be removed.
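
As an illustration of how get_data_index() and remove_data_by_index() cooperate, an Iterator-level alignment of two Dataset objects might look roughly like the sketch below (not the toolkit's actual logic):

# keep only the indices that exist in both datasets
src_indices = set(src_dataset.get_data_index())
tgt_indices = set(tgt_dataset.get_data_index())
for idx in src_indices - tgt_indices:
    src_dataset.remove_data_by_index(idx)
for idx in tgt_indices - src_indices:
    tgt_dataset.remove_data_by_index(idx)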

👆Back to the API list

dataset_init_fn(self, **dataset_conf)

  • Description:
    This hook function initializes the customized part of your Dataset implementation. Overriding this hook is not mandatory; the default one in the base class does nothing. If your Dataset subclass has some customized part, please override this hook function and put your logic here.
  • Arguments:
    • **dataset_conf:
      The configuration arguments for customized Dataset initialization received from __init__().
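
For example, an extra key in the dataset_conf section of your .yaml file arrives in this hook as a keyword argument. In the hypothetical override below, sample_rate is such a key:

def dataset_init_fn(self, sample_rate: int = 16000, **dataset_conf):
    # 'sample_rate' would be read from the dataset_conf section of the configuration
    self.sample_rate = sample_rate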

👆Back to the API list

extract_main_data_fn(self, main_data)

  • Description:
    This hook function extracts the selected data instance from the disk into memory. The default hook in the base class does nothing and directly returns main_data. If you want to implement your own data instance extraction, please override this hook function and put your logic here.
  • Arguments:
    • main_data: Dict[str, str]
      The dictionary containing necessary information for extracting the data instance from the disk to the memory. For example, the audio file path for the waveform data and the feature file path for the speaker embedding.
  • Return: Dict[str, Any]
    The dictionary containing the extracted data instance.
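
A waveform-loading sketch, assuming a data name feat that holds an audio file path and using the soundfile package (both assumptions are illustrative, not guaranteed by the base class):

import soundfile as sf
import torch

def extract_main_data_fn(self, main_data):
    # read the waveform from the disk; main_data['feat'] is assumed to be an audio file path
    wav, sample_rate = sf.read(main_data['feat'])
    return {'feat': torch.from_numpy(wav)}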

👆Back to the API list

collate_fn(self, batch)

  • Description:
    This function is used as the value of the collate_fn argument when initializing the Dataloader object at the beginning of each epoch.
    If you have your own batch collating strategy, we recommend overriding the hook collate_main_data_fn() rather than this function. This function returns the processed batch data in the form of a dictionary.
  • Arguments:
    • batch: List[Dict[str, Any]]
      The list of data instance dictionaries extracted by extract_main_data_fn().
  • Return: Dict[str, Any]
    The batch dictionary that will be passed to the model.

👆Back to the API list

collate_main_data_fn(self, batch_dict)

  • Description:
    This hook function decides how to preprocess the dictionary of an extracted batch of data instances before passing it to the model. The default hook in the base class packages all non-string elements of the batch into a torch.Tensor; therefore, those elements must have the same shape. The string elements remain in a list.
    If you have your own batch collating strategy, please override this hook function and put your logic here.
  • Arguments:
    • batch_dict: Dict[str, List]
      The reshaped dictionary of the extracted batch. In each key-value item, the key is the name of the data variable that will be passed to the model and the value is the list of unorganized data from all the elements in the batch.
  • Return: Dict[str, torch.Tensor or List]
    The dictionary containing the collated batch of data instances.
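
Since the default hook stacks only equal-shaped tensors, a common reason to override it is variable-length data. A zero-padding sketch, assuming a hypothetical feat key whose value is a list of 1-D torch.Tensor elements:

import torch

def collate_main_data_fn(self, batch_dict):
    feats = batch_dict['feat']   # a list of 1-D tensors with different lengths
    feat_len = torch.LongTensor([len(f) for f in feats])
    # zero-pad every element to the longest one in the batch
    feat = torch.nn.utils.rnn.pad_sequence(feats, batch_first=True)
    return {'feat': feat, 'feat_len': feat_len}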

👆Back to the API list

👆Back to the table of contents

How to Mix Multiple Data Sources in my Dataset Object

If you want to initialize your iterator with multiple datasets and have your dataloader pick up batches from the mixed dataset, you can simply give a list of file paths to the src_data and tgt_label arguments of the built-in dataset of your iterator, as in the example below.

data_root: ./datasets/speech/librispeech/data/wav
train:
    type: block.BlockIterator
    conf:
        dataset_type: speech.speech_text.SpeechTextDataset
        dataset_conf:
            src_data:
                - !ref <data_root>/train_clean_100/feat.scp
                - !ref <data_root>/train_clean_360/feat.scp
                - !ref <data_root>/train_other_500/feat.scp
            tgt_label:
                - !ref <data_root>/train_clean_100/text
                - !ref <data_root>/train_clean_360/text
                - !ref <data_root>/train_other_500/text
        ...

👆Back to the table of contents

How to Perform Data Selection in my Dataset Object

If you only need to load a part of the data samples from the built-in dataset, you can use the arguments selection_mode and selection_num. selection_mode specifies the selection method and selection_num specifies the number of selected samples. selection_num can be given as a positive float number (the ratio of the dataset to keep) or a negative integer number (the absolute number of samples). In the example below, the first 50% of LibriSpeech-train_clean_100 will be selected.

data_root: ./datasets/speech/librispeech/data/wav
train:
    type: block.BlockIterator
    conf:
        dataset_type: speech.speech_text.SpeechTextDataset
        dataset_conf:
            src_data: !ref <data_root>/train_clean_100/feat.scp
            tgt_label: !ref <data_root>/train_clean_100/text
        selection_mode: order
        selection_num: 0.5
        ...

The negative integer number means the absolute number of the selected samples. In the example below, 1000 data samples of LibriSpeech-train_clean_100 will be randomly selected.

data_root: ./datasets/speech/librispeech/data/wav
train:
    type: block.BlockIterator
    conf:
        dataset_type: speech.speech_text.SpeechTextDataset
        dataset_conf:
            src_data: !ref <data_root>/train_clean_100/feat.scp
            tgt_label: !ref <data_root>/train_clean_100/text
        selection_mode: random
        selection_num: -1000
        ...

Moreover, data selection and dataset mixing can be used in a single iterator, but they will be done sequentially. In the example below, the train_clean_100, train_clean_360, and train_other_500 subsets of the LibriSpeech corpus will first be mixed into one large dataset, and then the last 50% of that large dataset will be selected.

data_root: ./datasets/speech/librispeech/data/wav
train:
    type: block.BlockIterator
    conf:
        dataset_type: speech.speech_text.SpeechTextDataset
        dataset_conf:
            src_data:
                - !ref <data_root>/train_clean_100/feat.scp
                - !ref <data_root>/train_clean_360/feat.scp
                - !ref <data_root>/train_other_500/feat.scp
            tgt_label:
                - !ref <data_root>/train_clean_100/text
                - !ref <data_root>/train_clean_360/text
                - !ref <data_root>/train_other_500/text
        selection_mode: rev_order
        selection_num: 0.5
        ...

👆Back to the table of contents