
noam

Author: Heli Qi
Affiliation: NAIST
Date: 2022.07

Noamlr

Bases: OptimScheduler

The OptimScheduler whose schedule consists of a learning-rate warmup stage and a decay stage. In the warmup stage, the learning rate increases linearly to its peak. In the decay stage, the learning rate decreases proportionally to the inverse square root of the step number.

This OptimScheduler is mainly used for Transformer-based models.
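
For concreteness, the schedule can be sketched as a standalone function (a minimal illustration only; noam_lr and peak_lr are hypothetical names, with peak_lr standing in for the optimizer's base learning rate, and are not part of the SpeeChain API):

```python
# Minimal sketch of the Noam schedule described above (illustration only).
def noam_lr(real_step: int, warmup_steps: int = 4000,
            d_model: int = None, peak_lr: float = 1e-3) -> float:
    """Return the learning rate at a given training step."""
    # Mirrors Noamlr.sche_init: d_model ** -0.5 if d_model is given, otherwise
    # peak_lr * warmup_steps ** 0.5 so that the LR peaks exactly at peak_lr.
    init_lr = d_model ** -0.5 if d_model is not None else peak_lr * warmup_steps ** 0.5
    # Linear warmup (second term of min) followed by inverse-square-root decay (first term).
    return init_lr * min(real_step ** -0.5, real_step * warmup_steps ** -1.5)

print(noam_lr(2000))   # halfway through warmup -> ~0.0005
print(noam_lr(4000))   # end of warmup (peak)   -> ~0.001
print(noam_lr(16000))  # 4x warmup_steps        -> ~0.0005
```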

Source code in speechain/optim_sche/noam.py
class Noamlr(OptimScheduler):
    """The OptimScheduler where the scheduling contains a LR warmup stage and a LR decay
    stage. In the warmup stage, the learning rate increases linearly to the peak. In the
    decay stage, the learning rate decreases in the level of square root.

    This OptimScheduler is mainly used for Transformer-based models.
    """

    def sche_init(self, d_model: int = None, warmup_steps: int = 4000):
        """The learning rate calculation is different depending on whether d_model is
        given or not.

        If d_model is given, the learning rate would be:
            (d_model ** -0.5) * min(step ** -0.5, real_step * warmup_steps ** -1.5)
        This calculation method is the original method proposed in 'Attention is all you need'.

        If d_model is not given, the learning rate would be:
            (optimizer.lr * warmup_steps ** 0.5) * min(real_step ** -0.5, step * warmup_steps ** -1.5)
        This calculation method makes sure that the learning rate reaches the maximum (optimizer.lr) right after
        all the warmup steps are finished.

        Args:
            d_model: int
                The dimension of the hidden vectors of your Transformer model.
            warmup_steps: int
                The number of warming up steps.

        Returns:
            A list of names of your customized member variables.
        """
        # para recording
        self.d_model = d_model
        self.init_lr = (
            d_model**-0.5 if d_model is not None else self.get_lr() * warmup_steps**0.5
        )
        self.warmup_steps = warmup_steps

    def update_lr(self, real_step: int, epoch_num: int) -> float:
        """Calculate the learning rate of the current training step.

        Args:
            real_step: int
                The number of the current training step.
                Will be different from self.step_num when self.accum_grad is larger than 1.
            epoch_num: int
                The number of the current training epoch (not used by this scheduler).

        Returns:
            The learning rate of the current training step.
        """
        # the learning rate of the current step for the optimizer
        return self.init_lr * min(
            real_step**-0.5, real_step * (self.warmup_steps**-1.5)
        )

    def extra_repr_fn(self) -> str:
        return f"d_model={self.d_model}, " f"warmup_steps={self.warmup_steps}"

sche_init(d_model=None, warmup_steps=4000)

The learning rate calculation differs depending on whether d_model is given or not.

If d_model is given, the learning rate is: (d_model ** -0.5) * min(real_step ** -0.5, real_step * warmup_steps ** -1.5). This is the original schedule proposed in 'Attention Is All You Need'.

If d_model is not given, the learning rate is: (optimizer.lr * warmup_steps ** 0.5) * min(real_step ** -0.5, real_step * warmup_steps ** -1.5). This variant ensures that the learning rate reaches its maximum (optimizer.lr) right after all the warmup steps are finished.
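
As a quick sanity check of the second variant (a minimal sketch with illustrative values; base_lr stands in for optimizer.lr):

```python
# At real_step == warmup_steps, both arguments of min() equal warmup_steps ** -0.5,
# so the warmup_steps ** 0.5 factor cancels and the LR equals the optimizer's base LR.
base_lr, warmup_steps = 1e-3, 4000   # illustrative values
init_lr = base_lr * warmup_steps ** 0.5
lr_at_peak = init_lr * min(warmup_steps ** -0.5, warmup_steps * warmup_steps ** -1.5)
assert abs(lr_at_peak - base_lr) < 1e-12   # peak LR == optimizer.lr
```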

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| d_model | int | The dimension of the hidden vectors of your Transformer model. | None |
| warmup_steps | int | The number of warming up steps. | 4000 |

Returns:

    A list of names of your customized member variables.

Source code in speechain/optim_sche/noam.py
def sche_init(self, d_model: int = None, warmup_steps: int = 4000):
    """The learning rate calculation is different depending on whether d_model is
    given or not.

    If d_model is given, the learning rate would be:
        (d_model ** -0.5) * min(step ** -0.5, real_step * warmup_steps ** -1.5)
    This calculation method is the original method proposed in 'Attention is all you need'.

    If d_model is not given, the learning rate would be:
        (optimizer.lr * warmup_steps ** 0.5) * min(real_step ** -0.5, step * warmup_steps ** -1.5)
    This calculation method makes sure that the learning rate reaches the maximum (optimizer.lr) right after
    all the warmup steps are finished.

    Args:
        d_model: int
            The dimension of the hidden vectors of your Transformer model.
        warmup_steps: int
            The number of warming up steps.

    Returns:
        A list of names of your customized member variables.
    """
    # para recording
    self.d_model = d_model
    self.init_lr = (
        d_model**-0.5 if d_model is not None else self.get_lr() * warmup_steps**0.5
    )
    self.warmup_steps = warmup_steps

update_lr(real_step, epoch_num)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| real_step | int | The number of the current training step. Will be different from self.step_num when self.accum_grad is larger than 1. | required |
| epoch_num | int | The number of the current training epoch (not used by this scheduler). | required |

Returns:

    The learning rate of the current training step.
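
After the warmup stage, the returned value decays as real_step ** -0.5, so the learning rate halves whenever the step count quadruples. A small self-contained check under assumed values (init_lr here is just an illustrative constant, not taken from a real optimizer):

```python
# Past warmup, min() always selects real_step ** -0.5, so lr = init_lr / sqrt(real_step).
warmup_steps, init_lr = 4000, 0.0632   # illustrative values

def lr(real_step: int) -> float:
    return init_lr * min(real_step ** -0.5, real_step * warmup_steps ** -1.5)

print(lr(8000) / lr(32000))   # -> ~2.0: the LR halves when the step count quadruples
```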
Source code in speechain/optim_sche/noam.py
def update_lr(self, real_step: int, epoch_num: int) -> float:
    """Calculate the learning rate of the current training step.

    Args:
        real_step: int
            The number of the current training step.
            Will be different from self.step_num when self.accum_grad is larger than 1.
        epoch_num: int
            The number of the current training epoch (not used by this scheduler).

    Returns:
        The learning rate of the current training step.
    """
    # the learning rate of the current step for the optimizer
    return self.init_lr * min(
        real_step**-0.5, real_step * (self.warmup_steps**-1.5)
    )