# OptimScheduler
*OptimScheduler* is the base class of all OptimScheduler objects, which combine the roles of traditional optimizers and schedulers. Its main job is to optimize the parameters of the target model and to schedule the learning rate during training. In this toolkit, the traditional optimizer and scheduler are therefore merged into a single class: *OptimScheduler*. Each OptimScheduler object has one built-in member optimizer (`torch.optim.Optimizer`), which is initialized automatically by the `optim_type` and `optim_conf` given in your configuration.
## Table of Contents
- Configuration File Format
- OptimScheduler Library
- API Document
- How to Construct Multiple Optimizers on Multiple Losses
- How to Simulate Large Batch Training with Limited GPUs
- How to Perform Fine-tuning
## Configuration File Format

The configuration of OptimScheduler is given in the `optim_sches` tag of `train_cfg`. The configuration format is shown below.
```yaml
optim_sches:
    type: {file_name}.{class_name}
    conf:
        optim_type: {class_name}
        optim_conf:
            ...
        # general optimscheduler configuration
        optim_loss:
        updated_modules:
        step_per_update:
        # customized optimscheduler configuration
        ...
```
- The first-level key must be `optim_sches` to notify the framework of the optimscheduler configuration.
- `type` is a second-level key that indicates your optimscheduler type. The value of this key is used as the query to pick up your target OptimScheduler subclass for initialization. Your given query should be in the form of `file_name.class_name` to indicate the place of your target subclass. For example, `noam.NoamLr` means the class `NoamLr` in `./speechain/optim_sche/noam.py`.
- `conf` is a second-level key that indicates your optimscheduler configuration. The value of this key is a *Dict* whose configuration is as follows:
  - `optim_type` is a query that indicates the type of the built-in `torch.optim.Optimizer` in this optimscheduler. Your given query should be in the form of `class_name` to indicate your target subclass in `torch.optim`. For example, `Adam` means the class `torch.optim.Adam`.
  - `optim_conf` contains all the configuration used to initialize the built-in optimizer. For more details, please refer to the PyTorch documentation of your target `torch.optim.Optimizer` subclass.
  - The general optimscheduler configuration is shared by all OptimScheduler subclasses:
    1. `optim_loss`
    2. `updated_modules`
    3. `step_per_update`
  - The customized optimscheduler configuration is used to initialize the customized part of each optimscheduler subclass. This part defines the scheduling strategy that adjusts the learning rate during training. Please refer to the docstring of your target OptimScheduler subclass for more details.
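As a point of reference, a filled-in configuration might look like the following. The optimizer hyperparameter values are purely illustrative, and the customized part is left elided because its keys depend on the chosen OptimScheduler subclass (see its docstring):

```yaml
optim_sches:
    type: noam.NoamLr
    conf:
        optim_type: Adam
        optim_conf:
            betas: [0.9, 0.98]
            eps: 1.0e-9
        # general optimscheduler configuration
        optim_loss: loss
        step_per_update: 1
        # customized optimscheduler configuration (see the docstring of noam.NoamLr)
        ...
```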
👆Back to the table of contents
## OptimScheduler Library

```
/speechain
    /optim_sche
        /abs.py     # Abstract class of OptimScheduler. Base of all OptimScheduler implementations.
        /noam.py    # OptimScheduler implementation of the Noam scheduler. Mainly used for Transformer training.
```
👆Back to the table of contents
## API Document

Non-overridable backbone functions:
1. `speechain.optim_sche.abs.OptimScheduler.__init__`
2. `speechain.optim_sche.abs.OptimScheduler.step`
3. `speechain.optim_sche.abs.OptimScheduler.get_lr`
4. `speechain.optim_sche.abs.OptimScheduler.state_dict`
5. `speechain.optim_sche.abs.OptimScheduler.load_state_dict`
6. `speechain.optim_sche.abs.OptimScheduler.__repr__`

Overridable interface functions:
1. `speechain.optim_sche.abs.OptimScheduler.sche_init`
2. `speechain.optim_sche.abs.OptimScheduler.update_lr`
3. `speechain.optim_sche.abs.OptimScheduler.extra_repr_fn`
👆Back to the table of contents
### `speechain.optim_sche.abs.OptimScheduler.__init__(self, optim_type, optim_conf, model, distributed, optim_loss, updated_modules, step_per_update, use_amp, accum_grad, ft_factor, grad_clip, grad_norm_type, **sche_conf)`

- **Description:**
  This initialization function initializes the general part shared by all OptimScheduler subclasses. At the end of this function, the interface function `sche_init()` is called to initialize the customized part of each OptimScheduler subclass.
- **Arguments:**
  - Arguments received from `exp_cfg`:
    - **model:** *speechain.model.abs.Model*
      The pointer to the model whose parameters will be optimized by the built-in `torch.optim.Optimizer`.
    - **distributed:** *bool = False*
      Whether the model to be optimized is distributed to multiple GPUs. If True, gradient accumulation will be done asynchronously in the DDP mode to speed up training.
    - **use_amp:** *bool = True*
      Whether the Automatic Mixed Precision (AMP) technique is used during back-propagation. If True, a built-in `torch.cuda.amp.GradScaler` will be initialized to calculate the gradients and optimize the parameters.
    - **accum_grad:** *int = 1*
      The number of steps over which gradients are accumulated before optimization. The larger this argument is, the larger your virtual batches will be.
    - **ft_factor:** *float = 1.0*
      The fine-tuning factor used to scale down the learning rates during training.
  - Arguments received from `train_cfg`:
    - **optim_type:** *str*
      The optimizer query used to pick up the target Optimizer subclass from `torch.optim`.
    - **optim_conf:** *Dict*
      The configuration used to initialize the built-in `torch.optim.Optimizer`.
    - **optim_loss:** *str = None*
      The name of the target loss used in this OptimScheduler object to calculate the gradients. If not given, the loss named `loss` will be used for optimization.
    - **updated_modules:** *str or List[str]*
      This argument allows you to update only a part of the parameters of the built-in model pointer. `updated_modules` indicates the names of your target modules (first-level modules in the nested module tree) of the built-in model pointer. Its value can be either a string (a single target module) or a list (multiple target modules). If not given, the entire model will be updated.
    - **step_per_update:** *int = 1*
      The optimization interval of the built-in optimizer. Parameter optimization will be done once every `step_per_update` steps.
    - **\*\*sche_conf:**
      The arguments used to initialize the customized part of this OptimScheduler. Mainly used to decide the learning rate scheduling strategy.
### `speechain.optim_sche.abs.OptimScheduler.step(losses, time_func, optim_name, step_num)`

- **Description:**
  This function optimizes the target parameters of the built-in model pointer with the input training losses.
- **Arguments:**
  - **losses:** *Dict[str, torch.Tensor]*
    The training loss Dict received from the `criterion_forward()` of the built-in model pointer.
  - **time_func:**
    The context function used to record the time consumed by gradient back-propagation and parameter optimization.
  - **optim_name:** *str*
    The name of the OptimScheduler object. This argument is used to identify the recorded consumed-time information.
  - **step_num:** *int*
    The number of the current training step. This argument is used to update the learning rate for the current step by `self.update_lr()`.
### `speechain.optim_sche.abs.OptimScheduler.get_lr(self)`

- **Description:**
  This function returns the current learning rate of the built-in `torch.optim.Optimizer` member.
- **Return:** *float*
  The value of the learning rate obtained from `self.optimizer.param_groups`.
### `speechain.optim_sche.abs.OptimScheduler.state_dict(self)`

- **Description:**
  This function returns the current status of the OptimScheduler object for checkpoint storage.
- **Return:** *Dict*
  The status Dict containing the current status of the built-in `torch.optim.Optimizer` and the built-in `torch.cuda.amp.GradScaler` (if there is one).
### `speechain.optim_sche.abs.OptimScheduler.load_state_dict(self, state_dict)`

- **Description:**
  This function loads the existing checkpoint information into the OptimScheduler object as its starting status.
- **Arguments:**
  - **state_dict:** *Dict*
    The status information loaded from the existing checkpoint.
### `speechain.optim_sche.abs.OptimScheduler.__repr__(self)`

- **Description:**
  This function returns the description string of the OptimScheduler object. There is a general description part shared by all OptimScheduler subclasses. In this function, the interface hook function `extra_repr_fn()` is called to generate the subclass-specific part of the description.
- **Return:** *str*
  The description string of the OptimScheduler object.
### `speechain.optim_sche.abs.OptimScheduler.sche_init(**sche_conf)`

- **Description:**
  This abstract interface function is the customized initialization function, which decides how the learning rate is scheduled as training goes on. This interface must be overridden.
- **Arguments:**
  - **\*\*sche_conf:**
    The arguments used to initialize the customized part of this OptimScheduler. For more details about the learning rate scheduling strategy, please refer to the docstring of `sche_init()` of your target OptimScheduler subclass.
### `speechain.optim_sche.abs.OptimScheduler.update_lr(self, real_step)`

- **Description:**
  This abstract interface function generates the learning rate from the input step number.
- **Arguments:**
  - **real_step:** *int*
    The number of the real step for parameter optimization. Because of `self.accum_grad`, parameter optimization may not be done at every training step; the real step number counts only the training steps where parameter optimization is actually done.
- **Return:** *float*
  The learning rate used for parameter optimization at the current training step.
### `speechain.optim_sche.abs.OptimScheduler.extra_repr_fn(self)`

- **Description:**
  This interface hook function returns the subclass-specific part of the description string of the OptimScheduler object. The original implementation in the base class returns an empty string. In principle, this interface hook function should be overridden by each OptimScheduler subclass, but no error is raised if you do not override it in your implementation.
- **Return:** *str*
  The subclass-specific part of the description string of the OptimScheduler object.
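To show how the three overridable interfaces fit together, here is a minimal sketch of a custom subclass. The class name, file location, scheduling formula, and argument names (`warmup_steps`, `peak_lr`) are illustrative assumptions rather than part of the toolkit; the shipped `noam.NoamLr` may be implemented differently.

```python
from speechain.optim_sche.abs import OptimScheduler


class WarmupLr(OptimScheduler):
    """Hypothetical scheduler: linear warm-up followed by inverse-square-root decay."""

    def sche_init(self, warmup_steps: int = 4000, peak_lr: float = 1e-3):
        # customized initialization: store the scheduling hyperparameters given in `conf`
        self.warmup_steps = warmup_steps
        self.peak_lr = peak_lr

    def update_lr(self, real_step: int) -> float:
        # generate the learning rate for the given real (optimization) step
        if real_step <= self.warmup_steps:
            return self.peak_lr * real_step / self.warmup_steps
        return self.peak_lr * (self.warmup_steps / real_step) ** 0.5

    def extra_repr_fn(self) -> str:
        # subclass-specific part of the description string returned by __repr__()
        return f"warmup_steps={self.warmup_steps}, peak_lr={self.peak_lr}"
```

Saved as, for example, `./speechain/optim_sche/warmup.py`, such a subclass would be selected with `type: warmup.WarmupLr` following the `file_name.class_name` convention described above.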
👆Back to the table of contents
## How to Construct Multiple Optimizers on Multiple Losses

The cooperation of multiple optimizers is handled by three arguments: `optim_loss`, `updated_modules`, and `step_per_update`.
- `optim_loss` is the training loss used to calculate the gradients for the optimizer.
- `updated_modules` indicates the modules of the model whose parameters you would like the optimizer to update.
- `step_per_update` is the updating frequency of the optimizer (i.e., parameter optimization is done once every `step_per_update` steps).

In the example below, there are two optimschedulers that optimize the parameters of an encoder-decoder model. `encoder_optim` optimizes the encoder part using the training loss named `encoder_loss`, while `decoder_optim` optimizes the decoder part using the training loss named `decoder_loss`. The encoder optimization is done once every 2 steps, while the decoder optimization is done once every step.
```yaml
optim_sches:
    encoder_optim:
        type: noam.NoamLr
        conf:
            optim_type: Adam
            optim_conf:
                ...
            optim_loss: encoder_loss
            updated_modules: encoder
            step_per_update: 2

    decoder_optim:
        type: noam.NoamLr
        conf:
            optim_type: Adam
            optim_conf:
                ...
            optim_loss: decoder_loss
            updated_modules: decoder
```
👆Back to the table of contents
How to Simulate Large Batch Training with Limited GPUs
We provide a method called gradient accumulation (implemented by the argument accum_grad
in exp_cfg) to train your model with large batches that are beyond the memory of your GPUs.
The basic idea is to accumulate the gradients calculated in several small batches and update the model with the accumulated gradients to mimic a large batch.
So, the actual batch size becomes accum_grad * batch_size
.
The pseudo-code of gradient accumulation is like this:
```python
for step in range(1, max_step + 1):
    loss = compute_loss(batch)      # placeholder for the forward pass on one small batch
    loss /= accum_grad              # scale so the accumulated gradient matches one large batch
    loss.backward()                 # gradients are accumulated across the small batches
    if step % accum_grad == 0:
        # real_step = (step - 1) // accum_grad + 1
        optimizer.step()            # optimize with the accumulated gradients
        optimizer.zero_grad()       # reset the gradients for the next virtual batch
```
Unfortunately, gradient accumulation is not identical to large batch training. Since small batches are used to calculate the gradients at each step, some computations of large batch training cannot be simulated (e.g. BatchNorm and FeatureNormalization). Therefore, the performance of a model trained with gradient accumulation may differ slightly from that of a model trained with actual large batches.
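As a sanity check of the idea (independent of the toolkit), the following toy PyTorch snippet shows that, for a plain linear model with a mean-reduced loss, accumulating gradients over 4 micro-batches of size 8 reproduces the gradient of a single batch of size 32. The model, data, and sizes are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
data, target = torch.randn(32, 16), torch.randn(32, 1)

# (a) one large batch of 32 samples
model.zero_grad()
torch.nn.functional.mse_loss(model(data), target).backward()
grad_large = model.weight.grad.clone()

# (b) gradient accumulation over 4 micro-batches of 8 samples each
accum_grad = 4
model.zero_grad()
for chunk, tgt in zip(data.chunk(accum_grad), target.chunk(accum_grad)):
    # divide by accum_grad so the summed gradients equal the large-batch mean
    loss = torch.nn.functional.mse_loss(model(chunk), tgt) / accum_grad
    loss.backward()
grad_accum = model.weight.grad.clone()

print(torch.allclose(grad_large, grad_accum, atol=1e-6))  # expected: True
```

For layers whose statistics depend on the whole batch (e.g. BatchNorm), this equality no longer holds, which is exactly the caveat mentioned above.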
👆Back to the table of contents
## How to Perform Fine-tuning

In the usual setting, the learning rates need to be scaled down by a factor of 10 to 100 when fine-tuning a pretrained model. In this toolkit, the learning rates can be easily scaled down by the argument `ft_factor` in `exp_cfg` without changing the scheduling configuration of your optimscheduler. You no longer need to redesign the scheduler configuration for fine-tuning!
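For instance, to run fine-tuning with learning rates scaled down to one tenth of their scheduled values, it should be enough to add something like the following to `exp_cfg` (the value 0.1 and the exact placement of the key are illustrative; see the `ft_factor` description in `__init__()` above):

```yaml
# excerpt of exp_cfg; all other keys are omitted
ft_factor: 0.1    # every scheduled learning rate is scaled down by this factor
```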