noam
Author: Heli Qi
Affiliation: NAIST
Date: 2022.07
Noamlr
Bases: OptimScheduler
An OptimScheduler whose schedule consists of an LR warmup stage followed by an LR decay stage. During warmup, the learning rate increases linearly to its peak; during decay, it decreases proportionally to the inverse square root of the step number.
This OptimScheduler is mainly used for Transformer-based models.
Source code in speechain/optim_sche/noam.py
sche_init(d_model=None, warmup_steps=4000)
The learning rate calculation differs depending on whether d_model is given.
If d_model is given, the learning rate is: (d_model ** -0.5) * min(real_step ** -0.5, real_step * warmup_steps ** -1.5). This is the original schedule proposed in 'Attention Is All You Need'.
If d_model is not given, the learning rate is: (optimizer.lr * warmup_steps ** 0.5) * min(real_step ** -0.5, real_step * warmup_steps ** -1.5). This variant guarantees that the learning rate reaches its maximum (optimizer.lr) right after all the warmup steps are finished.
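For reference, here is a minimal standalone sketch of the two formulas above. The helper name `noam_lr` and the `base_lr` argument (standing in for `optimizer.lr`) are assumptions for illustration, not part of the speechain API:

```python
# Minimal sketch of the two Noam-style schedules described above.
# `noam_lr` and `base_lr` are illustrative names, not the library's API.
def noam_lr(real_step, warmup_steps=4000, d_model=None, base_lr=1.0):
    if d_model is not None:
        # Original formula from "Attention Is All You Need"
        return (d_model ** -0.5) * min(real_step ** -0.5,
                                       real_step * warmup_steps ** -1.5)
    # Variant whose peak equals base_lr exactly when real_step == warmup_steps
    return (base_lr * warmup_steps ** 0.5) * min(real_step ** -0.5,
                                                 real_step * warmup_steps ** -1.5)

# The second variant peaks at base_lr right after warmup finishes:
assert abs(noam_lr(4000, warmup_steps=4000, base_lr=1e-3) - 1e-3) < 1e-9
```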
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `d_model` | int | The dimension of the hidden vectors of your Transformer model. | `None` |
| `warmup_steps` | int | The number of warmup steps. | `4000` |

Returns:

| Type | Description |
|---|---|
| | A list of names of your customized member variables. |
Source code in speechain/optim_sche/noam.py
update_lr(real_step, epoch_num)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `real_step` | int | The number of the current training step. Will be different from self.step_num when self.accum_grad is larger than 1 (see the sketch below). | required |
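The difference between `real_step` and `self.step_num` comes from gradient accumulation. The following hedged sketch illustrates the bookkeeping; the variable names mirror the documented attributes, but the loop itself is an assumption, not speechain's actual training loop:

```python
# Illustration only: with accum_grad batches per parameter update, only every
# accum_grad-th batch triggers an optimizer step, so the LR schedule is indexed
# by the number of updates (real_step) rather than the raw batch count (step_num).
accum_grad = 4
step_num = 0    # raw batch counter, analogous to self.step_num
real_step = 0   # completed parameter updates, what update_lr() receives
for _ in range(12):                 # 12 training batches
    step_num += 1
    if step_num % accum_grad == 0:
        real_step += 1              # one update per accum_grad batches
print(step_num, real_step)          # -> 12 3
```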