OptimScheduler

OptimScheduler is the base class of all optimscheduler objects in this toolkit. It combines the roles of traditional optimizers and schedulers: its main job is optimizing the target model parameters and scheduling the learning rate during training.
Each OptimScheduler object has one built-in member optimizer (torch.optim.Optimizer), which is initialized automatically by the optim_type and optim_conf given in your configuration.

👆Back to the handbook page

Table of Contents

  1. Configuration File Format
  2. OptimScheduler Library
  3. API Document
  4. How to Construct Multiple Optimizers on Multiple Losses
  5. How to Simulate Large Batch Training with Limited GPUs
  6. How to Perform Fine-tuning

Configuration File Format

The configuration of OptimScheduler is given in the optim_sches tag of train_cfg. The configuration format is shown below.

optim_sches:
    type: {file_name}.{class_name}
    conf:
        optim_type: {class_name}
        optim_conf:
            ...
        # general optimscheduler configuration
        optim_loss:
        updated_modules:
        step_per_update:
        # customized optimscheduler configuration
        ...

  • The first-level key must be optim_sches to notify the framework of the optimscheduler configuration.

    1. type is a second-level key that indicates your optimscheduler type. The value of this key is used as the query to pick up your target OptimScheduler subclass for initialization.
      Your given query should be in the form of file_name.class_name to indicate the place of your target subclass.
      For example, noam.NoamLr means the class NoamLr in ./speechain/optim_sche/noam.py.

    2. conf is a second-level key that indicates your optimscheduler configuration. The value of this key is a Dict whose items are as follows:

      1. optim_type is a query that indicates the type of the built-in torch.optim.Optimizer in this optimscheduler.
        Your given query should be in the form of class_name to indicate your target subclass in torch.optim.
        For example, Adam means the class torch.optim.Adam (see the sketch after this list for how the query is resolved).

      2. optim_conf contains all the configuration used to initialize the built-in optimizer.
        For more details, please refer to the PyTorch document of your target torch.optim.Optimizer subclass.

      3. optimscheduler general configuration is shared by all OptimScheduler subclasses.
        1. optim_loss
        2. updated_modules
        3. step_per_update

      4. optimscheduler customized configuration is used to initialize the customized part of each optimscheduler subclass. This part defines the scheduling strategy to adjust the learning rates during training.
        Please refer to the docstrings of your target OptimScheduler subclass for more details.
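
To make the optim_type and optim_conf items above concrete, below is a minimal sketch (assumed logic, not the toolkit's exact code) of how such a query can be resolved into the built-in optimizer:

import torch
import torch.nn as nn

# hypothetical values taken from the conf part of an optim_sches entry
optim_type = 'Adam'                                        # query into torch.optim
optim_conf = dict(lr=1e-3, betas=(0.9, 0.98), eps=1e-9)

model = nn.Linear(8, 8)                                    # stand-in for the real model
optim_class = getattr(torch.optim, optim_type)             # 'Adam' -> torch.optim.Adam
optimizer = optim_class(model.parameters(), **optim_conf)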

👆Back to the table of contents

OptimScheduler Library

/speechain
    /optim_sche
        /abs.py     # Abstract class of OptimScheduler. Base of all OptimScheduler implementations.
        /noam.py    # OptimScheduler implementation of the Noam scheduler. Mainly used for Transformer training.

👆Back to the table of contents

API Document

Non-overridable backbone functions:
1. speechain.optim_sche.abs.OptimScheduler.__init__
2. speechain.optim_sche.abs.OptimScheduler.step
3. speechain.optim_sche.abs.OptimScheduler.get_lr
4. speechain.optim_sche.abs.OptimScheduler.state_dict
5. speechain.optim_sche.abs.OptimScheduler.load_state_dict
6. speechain.optim_sche.abs.OptimScheduler.__repr__

Overridable interface functions:
1. speechain.optim_sche.abs.OptimScheduler.sche_init
2. speechain.optim_sche.abs.OptimScheduler.update_lr
3. speechain.optim_sche.abs.OptimScheduler.extra_repr_fn

👆Back to the table of contents

speechain.optim_sche.abs.OptimScheduler.__init__(self, optim_type, optim_conf, model, distributed, optim_loss, updated_modules, step_per_update, use_amp, accum_grad, ft_factor, grad_clip, grad_norm_type, **sche_conf)

  • Description:
    This initialization function initializes the general part shared by all OptimScheduler subclasses. At the end of this function, an interface function sche_init() is called to initialize the customized part of each OptimScheduler subclass.
  • Arguments:

    Arguments received from exp_cfg:

    • model: speechain.model.abs.Model
      The pointer to the model whose parameters will be optimized by the built-in torch.optim.Optimizer.
    • distributed: bool = False
      Whether the model to be optimized is distributed to multiple GPUs.
      If True, gradient accumulation will be done asynchronously in the DDP mode to speed up training.
    • use_amp: bool = True
      Whether the Automatic Mixed Precision (AMP) technique is used during back-propagation.
      If True, a built-in torch.cuda.amp.GradScaler will be initialized to scale the losses and gradients during back-propagation and parameter optimization.
    • accum_grad: int = 1
      The number of steps to accumulate gradients before optimization. The larger this argument is, the larger your virtual batches will be.
    • ft_factor: float = 1.0
      The finetuning factor used to scale down the learning rates during training.

Arguments received from train_cfg:

  • optim_type: str
    The optimizer query used to pick up the target Optimizer subclass from torch.optim.
  • optim_conf: Dict
    The configuration used to initialize the built-in torch.optim.Optimizer.
  • optim_loss: str = None
    The name of the target loss used in this OptimScheduler object to calculate the gradients. If not given, the loss named loss will be used for optimization.
  • updated_modules: str or List[str]
    This argument allows you to update only a part of the parameters of the built-in model pointer. updated_modules indicates the names of your target modules (first-level modules in the nested module tree) of the built-in model pointer (see the sketch after this argument list).
    Its value can be either a string (a single target module) or a list (multiple target modules).
    If not given, the entire model will be updated.
  • step_per_update: int = 1
    The optimization interval for the built-in optimizer. It means that the parameter optimization will be done once every step_per_update steps.
  • **sche_conf:
    The arguments used to initialize the customized part of this OptimScheduler.
    Mainly used to decide the learning rate scheduling strategy.
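
As a rough illustration of how updated_modules could be turned into the parameter list handed to the built-in optimizer, here is a hedged sketch (assumed logic and a hypothetical helper name, not the actual implementation):

from typing import List, Union

import torch.nn as nn

def select_params(model: nn.Module, updated_modules: Union[str, List[str]] = None):
    # None means the entire model is updated
    if updated_modules is None:
        return list(model.parameters())
    names = [updated_modules] if isinstance(updated_modules, str) else updated_modules
    # only the parameters under the named first-level modules are handed to the optimizer
    return [p for name in names for p in getattr(model, name).parameters()]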

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.step(self, losses, time_func, optim_name, step_num)

  • Description:
    This function optimizes the target parameters of the built-in model pointer with the input training losses.
  • Arguments:
    • losses: Dict[str, torch.Tensor]
      The training loss Dict received from the criterion_forward() of the built-in model pointer.
    • time_func:
      The context function used to record the consumed time during gradient back-propagation and parameter optimization.
    • optim_name: str
      The name of the OptimScheduler object. This argument is used to identify the recorded consumed time information.
    • step_num: int
      The number of the current training step. This argument is used to update the learning rate for the current step by self.update_lr().
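
For orientation, here is a hypothetical usage sketch (not the toolkit's actual runner code) showing how a training loop might call step() at every iteration; data_loader, model, and optim_sche are assumed to already exist:

from contextlib import contextmanager

@contextmanager
def time_func(name):
    # placeholder timing context; the real runner records the consumed time here
    yield

for step_num, batch in enumerate(data_loader, start=1):
    losses = model(batch)   # assumed to return the training loss Dict[str, torch.Tensor]
    optim_sche.step(losses=losses, time_func=time_func,
                    optim_name='main_optim', step_num=step_num)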

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.get_lr(self)

  • Description:
    This function returns the current learning rate of the built-in torch.optim.Optimizer member.
  • Return: float
    The current learning rate obtained from self.optimizer.param_groups.
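
A minimal sketch of what this lookup amounts to (the helper name get_current_lr is hypothetical; it mirrors reading self.optimizer.param_groups inside the class):

import torch
import torch.nn as nn

def get_current_lr(optimizer: torch.optim.Optimizer) -> float:
    # assuming all parameter groups share the same scheduled learning rate
    return optimizer.param_groups[0]['lr']

optimizer = torch.optim.Adam(nn.Linear(8, 8).parameters(), lr=1e-3)
print(get_current_lr(optimizer))   # 0.001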

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.state_dict(self)

  • Description:
    This function returns the current status of the OptimScheduler object for checkpoint storage.
  • Return: Dict
    The status Dict containing the current status of the built-in torch.optim.Optimizer and the built-in torch.cuda.amp.GradScaler (if used).
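
A hedged sketch of what such a status Dict could look like (hypothetical helper and key names, not the exact implementation):

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler

def pack_state(optimizer: torch.optim.Optimizer, scaler: GradScaler = None) -> dict:
    # mirrors what state_dict() packages for checkpoint storage
    checkpoint = dict(optim=optimizer.state_dict())
    if scaler is not None:
        # the GradScaler state is stored only when AMP is used
        checkpoint['scaler'] = scaler.state_dict()
    return checkpoint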

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.load_state_dict(self, state_dict)

  • Description:
    This function loads the existing checkpoint information into the OptimScheduler object as the starting status.
  • Arguments:
    • state_dict: Dict
      The status information loaded from the existing checkpoint.

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.__repr__(self)

  • Description:
    This function returns the description string of the OptimScheduler object. There is a general description part shared by all the OptimScheduler subclasses.
    In this function, an interface hook function extra_repr_fn() will be called to generate the specific description part of each OptimScheduler subclass.
  • Return: str
    The description string for the OptimScheduler object.

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.sche_init(self, **sche_conf)

  • Description:
    This abstract interface function is the customized initialization function which decides how the learning rate is scheduled as the training goes.
    Overriding this interface is mandatory.
  • Arguments:
    • **sche_conf:
      The arguments used to initialize the customized part of this OptimScheduler. For more details about the learning rate scheduling strategy, please refer to the docstring of sche_init() of your target OptimScheduler subclass.

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.update_lr(self, real_step)

  • Description:
    This abstract interface function generates the learning rate by the input step number.
  • Arguments:
    • real_step: int
      The number of the real step for parameter optimization. Due to the existence of self.accum_grad, parameter optimization may not be done at each training step. The real step number here means the training steps where parameter optimization is done.
  • Return: float
    The learning rate used for parameter optimization in the current training step.
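
To make the division of labor between sche_init() and update_lr() concrete, here is a hedged sketch of a Noam-style subclass implementing the well-known warmup strategy from the Transformer paper. The class name, argument names, and attribute names are illustrative and not necessarily those of the toolkit's NoamLr:

from speechain.optim_sche.abs import OptimScheduler

class MyNoamLr(OptimScheduler):
    def sche_init(self, d_model: int = 512, warmup_steps: int = 4000):
        # store the scheduling hyperparameters received through **sche_conf
        self.d_model = d_model
        self.warmup_steps = warmup_steps

    def update_lr(self, real_step: int) -> float:
        # rise linearly during warmup, then decay with the inverse square root of the step
        return (self.d_model ** -0.5) * min(real_step ** -0.5,
                                            real_step * self.warmup_steps ** -1.5)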

👆Back to the API list

speechain.optim_sche.abs.OptimScheduler.extra_repr_fn(self)

  • Description:
    This interface hook function returns the specific part of the description string of the OptimScheduler object. The original implementation in the base class returns an empty string.
    In principle, this interface hook function should be overridden by each OptimScheduler subclass, but no error will be raised if you do not override it in your implementation.
  • Return: str
    The specific part of the description string of the OptimScheduler object.

👆Back to the API list

👆Back to the table of contents

How to Construct Multiple Optimizers on Multiple Losses

The cooperation of multiple optimizers is handled by three arguments: optim_loss, updated_modules, and step_per_update.

  1. optim_loss specifies the training loss used to calculate the gradients for the optimizer.

  2. updated_modules specifies the target modules in your model whose parameters you would like the optimizer to update.

  3. step_per_update specifies the updating frequency of the optimizer (i.e., the parameter optimization is done once every step_per_update steps).

In the example below, there are two optimschedulers for optimizing the parameters of an Encoder-Decoder model. encoder_optim optimizes the encoder part using the training loss called encoder_loss while decoder_optim optimizes the decoder part using the training loss called decoder_loss. The encoder optimization is done once every 2 steps while the decoder optimization is done once every step.

optim_sches:
    encoder_optim:
        type: noam.NoamLr
        conf:
            optim_type: Adam
            optim_conf:
                ...
            optim_loss: encoder_loss
            updated_modules: encoder
            step_per_update: 2

    decoder_optim:
        type: noam.NoamLr
        conf:
            optim_type: Adam
            optim_conf:
                ...
            optim_loss: decoder_loss
            updated_modules: decoder

👆Back to the table of contents

How to Simulate Large Batch Training with Limited GPUs

We provide a method called gradient accumulation (implemented by the argument accum_grad in exp_cfg) to train your model with large batches that exceed the memory of your GPUs. The basic idea is to accumulate the gradients calculated on several small batches and update the model with the accumulated gradients to mimic a large batch. So, the effective batch size becomes accum_grad * batch_size.

The pseudo-code of gradient accumulation is like this:

for step in range(1, max_step + 1):
    loss = model(batch)        # forward pass on one small batch (placeholder)
    loss /= accum_grad         # scale the loss so the accumulated gradients match a large batch
    loss.backward()            # accumulate the gradients without updating the parameters
    if step % accum_grad == 0:
        # real_step = (step - 1) // accum_grad + 1
        optimizer.step()       # update the parameters with the accumulated gradients
        optimizer.zero_grad()  # reset the gradients for the next accumulation round

Let me show you an intuitive example. Suppose we want to calculate the mean value of 1, 2, ..., 9, 10 but we cannot directly divide the sum by 10 because our calculator is not powerful enough. Instead, we can calculate the mean value of two sub-groups: 1, 2, ..., 5 and 6, 7, ..., 10. We get two sub-mean values: 3 and 8. The overall mean value can be calculated by taking the mean value of these two sub-mean values: (3 + 8) / 2 = 5.5.
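
The same arithmetic can be checked in a couple of lines, which mirrors what loss /= accum_grad does to the gradients:

group_a, group_b = [1, 2, 3, 4, 5], [6, 7, 8, 9, 10]
sub_means = [sum(group_a) / len(group_a), sum(group_b) / len(group_b)]   # [3.0, 8.0]
print(sum(sub_means) / len(sub_means))                                   # 5.5, the mean of 1..10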

Unfortunately, gradient accumulation is not identical to large batch training. Since small batches are used to calculate the gradients of each step, some calculations of large batch training cannot be simulated (e.g. BatchNorm and FeatureNormalization). Therefore, the performance of the model trained by gradient accumulation may be slightly different from the one trained by the actual large batches.

👆Back to the table of contents

How to Perform Fine-tuning

In the normal setting, we need to scale down the learning rates by a factor of 10 to 100 for fine-tuning a pretrained model. In this toolkit, the learning rates can be easily scaled down by the input argument ft_factor in exp_cfg without changing the scheduling configuration of your optimscheduler. It's no longer necessary for you to redesign the scheduler configuration for fine-tuning!
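
As a rough illustration of what this scaling amounts to (assumed logic based on the argument description, not the exact implementation), the learning rate produced by the scheduling strategy is simply multiplied by ft_factor before it is written into the optimizer:

import torch
import torch.nn as nn

optimizer = torch.optim.Adam(nn.Linear(8, 8).parameters(), lr=1e-3)   # stand-in optimizer

scheduled_lr = 1e-3   # whatever update_lr() returns for the current step
ft_factor = 0.1       # given in exp_cfg; 0.1 ~ 0.01 matches the usual 10x ~ 100x scale-down
for param_group in optimizer.param_groups:
    param_group['lr'] = ft_factor * scheduled_lr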

👆Back to the table of contents