pos_enc
Origin: Sashi Novitasari
Modification: Heli Qi
Affiliation: NAIST
Date: 2022.07
PositionalEncoding
Bases: Module
Pre-compute position encodings (PE).
In the forward pass, this module adds the positional encodings to the embedded feature vectors to make the Transformer aware of the positional information of the sequences.
Source code in speechain/module/transformer/pos_enc.py
forward(emb_feat)
The embedded features are processed as follows (a minimal sketch of this pipeline is given after the tables below):
Embedded feature
-> LayerNorm(Embedded feature)
-> LayerNorm(Embedded feature) * sqrt(d_model)
-> LayerNorm(Embedded feature) * sqrt(d_model) + Positional Encoding * learnable scalar
-> Dropout(LayerNorm(Embedded feature) * sqrt(d_model) + Positional Encoding * learnable scalar)
Parameters:

Name | Type | Description | Default |
---|---|---|---|
emb_feat | Tensor | (batch_size, seq_len, d_model) Embedded input feature sequences | required |
Returns:

Type | Description |
---|---|
Tensor | Embedded input feature sequences with positional encoding |
Source code in speechain/module/transformer/pos_enc.py
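Below is a minimal sketch of the forward pipeline above, assuming a pre-computed positional-encoding tensor of shape (1, max_len, d_model). The helper name add_pos_enc and its keyword arguments are illustrative only, not the actual speechain API:

```python
import math
from typing import Optional

import torch


def add_pos_enc(
    emb_feat: torch.Tensor,                        # (batch_size, seq_len, d_model)
    posenc: torch.Tensor,                          # (1, max_len, d_model), pre-computed table
    alpha: float = 1.0,                            # learnable scalar in the real module
    emb_scale: bool = False,
    layernorm: Optional[torch.nn.LayerNorm] = None,
    dropout: Optional[torch.nn.Dropout] = None,
) -> torch.Tensor:
    d_model = emb_feat.size(-1)
    if layernorm is not None:
        emb_feat = layernorm(emb_feat)             # LayerNorm(Embedded feature)
    if emb_scale:
        emb_feat = emb_feat * math.sqrt(d_model)   # * sqrt(d_model)
    # + Positional Encoding * scalar, trimmed to the actual sequence length
    emb_feat = emb_feat + alpha * posenc[:, : emb_feat.size(1)]
    return dropout(emb_feat) if dropout is not None else emb_feat
```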
module_init(posenc_type='mix', d_model=512, emb_scale=False, emb_layernorm=False, posenc_scale=False, init_alpha=1.0, max_len=5000, dropout=0.0)
Positional Encoding with maximum length max_len.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
posenc_type | str | The type of positional encoding (must be either 'mix' or 'sep'). For the 'mix' type, sin is applied to the even dimensions and cos is applied to the odd dimensions: PE(pos, 2i) = sin(pos / 10000^{2i / d_model}), PE(pos, 2i + 1) = cos(pos / 10000^{2i / d_model}), i ∈ {0, ..., d_model / 2 - 1}. Reference: 'Attention Is All You Need' https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf For the 'sep' type, sin is applied to the first half of the dimensions and cos is applied to the second half: PE(pos, i) = sin(pos / 10000^{2i / d_model}), i ∈ {0, ..., d_model / 2 - 1}; PE(pos, i) = cos(pos / 10000^{2i / d_model}), i ∈ {d_model / 2, ..., d_model - 1}. Reference: 'Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition' https://ieeexplore.ieee.org/abstract/document/8462506/ (see the sketch after this table) | 'mix' |
d_model | int | The dimension of the hidden feature vectors of the Transformer layers. | 512 |
emb_scale | bool | Controls whether the embedding vectors are scaled up by sqrt(d_model) before the positional encoding is added. Reference: Section 3.4 of 'Attention Is All You Need' https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf In most cases, we don't recommend turning this on, especially when you don't have a large training set (e.g. LibriSpeech-train_clean_100), because it may make your model hard to converge. Consider it only when you want to emphasize the embedded features over the positional encodings. | False |
emb_layernorm | bool | Controls whether the embedding vectors are normalized by LayerNorm before the positional encoding is added. | False |
posenc_scale | bool | Controls whether the positional encodings are scaled up by a trainable scalar before being added to the embedded features. Reference: 'Neural Speech Synthesis with Transformer Network' https://ojs.aaai.org/index.php/AAAI/article/view/4642/4520 | False |
init_alpha | float | The initial value of the alpha scalar used for positional encoding scaling. Only effective when posenc_scale is True. | 1.0 |
max_len | int | The maximum length of the input feature sequences. | 5000 |
dropout | float | The dropout rate of the Dropout layer applied after the positional encoding is added to the input. | 0.0 |
Source code in speechain/module/transformer/pos_enc.py
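Both layouts can be pre-computed directly from the equations in the posenc_type row above. The sketch below is a hypothetical re-implementation rather than the speechain source; in particular, it assumes the 'sep' layout reuses the same frequency terms 10000^{2i / d_model}, i ∈ {0, ..., d_model / 2 - 1}, for its cosine half, which is how Speech-Transformer-style encodings are commonly built:

```python
import torch


def make_posenc(posenc_type: str = 'mix', d_model: int = 512, max_len: int = 5000) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    # 10000^{2i / d_model} for i in {0, ..., d_model / 2 - 1}
    div = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    posenc = torch.zeros(max_len, d_model)
    if posenc_type == 'mix':
        posenc[:, 0::2] = torch.sin(pos / div)   # PE(pos, 2i)     = sin(pos / 10000^{2i / d_model})
        posenc[:, 1::2] = torch.cos(pos / div)   # PE(pos, 2i + 1) = cos(pos / 10000^{2i / d_model})
    elif posenc_type == 'sep':
        posenc[:, : d_model // 2] = torch.sin(pos / div)   # sin on the first half of the dimensions
        posenc[:, d_model // 2 :] = torch.cos(pos / div)   # cos on the second half of the dimensions
    else:
        raise ValueError("posenc_type must be either 'mix' or 'sep'")
    return posenc.unsqueeze(0)                   # (1, max_len, d_model), broadcasts over the batch
```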
reset_parameters()
Make sure that the trainable positional-encoding scaling scalar (alpha) is not influenced by different model initialization methods.
update_posenc(max_len)
Parameters:

Name | Type | Description | Default |
---|---|---|---|
max_len | int | The maximum length of the input feature sequences. | required |
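A hypothetical end-to-end usage, assuming the module can be constructed directly with the module_init arguments (the actual speechain Module construction interface may wrap these differently):

```python
import torch
from speechain.module.transformer.pos_enc import PositionalEncoding

# Construction arguments mirror module_init(); the exact constructor signature is an assumption.
pos_enc = PositionalEncoding(posenc_type='sep', d_model=256, max_len=3000, dropout=0.1)

emb_feat = torch.randn(8, 120, 256)   # (batch_size, seq_len, d_model)
output = pos_enc(emb_feat)            # embedded features with positional encoding added

# Enlarge the pre-computed table when longer sequences show up, e.g. at inference time.
pos_enc.update_posenc(max_len=6000)
```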