decoder
Origin: Sashi Novitasari · Modification: Heli Qi · Affiliation: NAIST · Date: 2022.07
Conv1dEv
Bases: Module
A 1D convolutional layer with support for different padding modes.
Attributes:
Name | Type | Description
---|---|---
cutoff | bool | Indicates whether the output should be cut off for the 'same' padding mode.
causal_padding | int | Additional padding required for the 'causal' padding mode.
dilation | int | The dilation rate of the convolutional layer.
conv_lyr | Conv1d | The 1D convolutional layer.
Source code in speechain/module/prenet/conv1d.py
__init__(in_channels, out_channels, kernel_size, stride=1, dilation=1, padding_mode='same', bias=True, use_weight_norm=False, groups=1)
Initializes the Conv1dEv module with the specified parameters.
Parameters:
Name | Type | Description | Default
---|---|---|---
in_channels | int | Number of channels in the input feature. | required
out_channels | int | Number of channels produced by the convolution. | required
kernel_size | int | Size of the convolutional kernel. | required
stride | int | Stride of the convolution. Defaults to 1. | 1
dilation | int | The dilation rate of the kernel. Defaults to 1. | 1
padding_mode | str | Padding mode. Supported values are 'valid', 'full', 'same' and 'causal'. Defaults to 'same'. | 'same'
bias | bool | If True, adds a learnable bias to the output. Defaults to True. | True
Raises:
Type | Description
---|---
ValueError | If an unsupported padding mode is specified.
Source code in speechain/module/prenet/conv1d.py
forward(feat)
Performs a forward pass through the convolutional layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
feat | Tensor | The input feature tensor. Shape: (batch, feat_dim, feat_maxlen). | required
Returns:
Type | Description
---|---
torch.Tensor | The output tensor. Shape: (batch, out_channels, output_len).
Source code in speechain/module/prenet/conv1d.py
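The padding modes mainly differ in how much padding is added before the underlying Conv1d is applied. Below is a minimal sketch of the 'same' and 'causal' behaviours using plain torch.nn.Conv1d; the helper functions and tensor shapes are illustrative assumptions, not the Conv1dEv implementation itself.

```python
import torch

# Illustrative padding helpers (not the Conv1dEv internals):
def same_pad(kernel_size: int, dilation: int = 1) -> int:
    # symmetric padding that keeps output_len == feat_maxlen for stride=1 and odd kernels
    return (kernel_size - 1) * dilation // 2

def causal_pad(kernel_size: int, dilation: int = 1) -> int:
    # padding so that no future frame contributes to the current output frame
    return (kernel_size - 1) * dilation

feat = torch.randn(8, 80, 200)  # (batch, feat_dim, feat_maxlen)

# 'same' padding: output length equals input length
conv_same = torch.nn.Conv1d(80, 256, kernel_size=5, padding=same_pad(5))
out_same = conv_same(feat)      # (8, 256, 200)

# 'causal' padding: pad by (kernel_size - 1) * dilation, then cut off the extra frames on the right
conv_causal = torch.nn.Conv1d(80, 256, kernel_size=5, padding=causal_pad(5))
out_causal = conv_causal(feat)[..., :feat.size(-1)]  # (8, 256, 200), frame t only sees frames <= t
```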
MultiHeadedAttention
Bases: Module
A Multi-Head Attention layer has:
· Query linear layer
· Key linear layer
· Value linear layer
· Softmax layer
· Attention Dropout layer
· Output linear layer
Implementation modified from OpenNMT-py. https://github.com/OpenNMT/OpenNMT-py
Source code in speechain/module/transformer/attention.py
forward(k, v, q, mask=None)
Computes multi-headed attention.
Parameters:
Name | Type | Description | Default
---|---|---|---
k | Tensor | Keys [B, M, D], with M being the sentence length. | required
v | Tensor | Values [B, M, D]. | required
q | Tensor | Queries [B, M, D]. | required
mask | Tensor | Optional mask [B, 1, M]. | None
Returns:
Source code in speechain/module/transformer/attention.py
module_init(num_heads, d_model, dropout=0.1, scale_dp_by_head=False)
Create a multi-headed attention layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
num_heads | int | The number of heads. | required
d_model | int | Model size (must be divisible by num_heads). | required
dropout | float | The dropout rate of the Dropout layer after the softmax operation. | 0.1
Source code in speechain/module/transformer/attention.py
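To make the role of the listed sublayers concrete, here is an illustrative re-implementation of multi-headed scaled dot-product attention in plain torch. It mirrors the documented shapes ([B, M, D] inputs, an optional [B, 1, M] mask) but is a sketch, not the speechain code.

```python
import math
import torch

def multi_head_attention_sketch(k, v, q, num_heads, mask=None, dropout_p=0.1):
    """Illustrative version of the sublayers listed above (not the library implementation)."""
    B, M, D = q.shape
    d_head = D // num_heads
    w_q, w_k, w_v, w_out = (torch.nn.Linear(D, D) for _ in range(4))  # query/key/value/output linear layers

    def split_heads(x):  # (B, M, D) -> (B, num_heads, M, d_head)
        return x.view(B, -1, num_heads, d_head).transpose(1, 2)

    Q, K, V = split_heads(w_q(q)), split_heads(w_k(k)), split_heads(w_v(v))
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)              # (B, H, M, M)
    if mask is not None:                                              # mask: (B, 1, M) -> broadcast over heads/queries
        scores = scores.masked_fill(~mask.unsqueeze(1), float("-inf"))
    att = torch.softmax(scores, dim=-1)                               # softmax layer
    att = torch.nn.functional.dropout(att, p=dropout_p)               # attention dropout layer
    out = (att @ V).transpose(1, 2).reshape(B, M, D)
    return w_out(out)                                                 # output linear layer

x = torch.randn(2, 6, 512)
mask = torch.ones(2, 1, 6, dtype=torch.bool)
y = multi_head_attention_sketch(x, x, x, num_heads=4, mask=mask)      # (2, 6, 512)
```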
PositionalEncoding
Bases: Module
Pre-compute position encodings (PE).
In the forward pass, this module adds the positional encodings to the embedded feature vectors so that the Transformer is aware of the positional information of the sequences.
Source code in speechain/module/transformer/pos_enc.py
forward(emb_feat)
Applies positional encoding to the embedded feature through the following pipeline:
Embedded feature
-> LayerNorm(Embedded feature)
-> LayerNorm(Embedded feature) * sqrt(d_model)
-> LayerNorm(Embedded feature) * sqrt(d_model) + Positional Encoding * learnable scalar
-> Dropout(LayerNorm(Embedded feature) * sqrt(d_model) + Positional Encoding * learnable scalar)
Parameters:
Name | Type | Description | Default
---|---|---|---
emb_feat | Tensor | (batch_size, seq_len, d_model) Embedded input feature sequences | required
Returns:
Type | Description
---|---
Tensor | Embedded input feature sequences with positional encoding
Source code in speechain/module/transformer/pos_enc.py
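The pipeline above can be spelled out step by step. The sketch below uses plain torch operations; which steps are active depends on the emb_layernorm, emb_scale, posenc_scale and dropout arguments documented under module_init, and the names (posenc, alpha) are illustrative.

```python
import math
import torch

d_model, max_len = 512, 5000
layernorm = torch.nn.LayerNorm(d_model)
dropout = torch.nn.Dropout(p=0.0)
alpha = torch.nn.Parameter(torch.tensor(1.0))    # learnable scalar (cf. posenc_scale / init_alpha)
posenc = torch.zeros(1, max_len, d_model)        # pre-computed table, see the sketch after module_init

def posenc_forward(emb_feat: torch.Tensor) -> torch.Tensor:
    # emb_feat: (batch_size, seq_len, d_model)
    x = layernorm(emb_feat)                        # only if emb_layernorm=True
    x = x * math.sqrt(d_model)                     # only if emb_scale=True
    x = x + posenc[:, : emb_feat.size(1)] * alpha  # add positional encoding * learnable scalar
    return dropout(x)

out = posenc_forward(torch.randn(4, 100, d_model))  # (4, 100, 512)
```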
module_init(posenc_type='mix', d_model=512, emb_scale=False, emb_layernorm=False, posenc_scale=False, init_alpha=1.0, max_len=5000, dropout=0.0)
Positional Encoding with maximum length max_len.
Parameters:
Name | Type | Description | Default
---|---|---|---
posenc_type | str | The type of positional encoding (must be either 'mix' or 'sep'). For the 'mix' type, sin is applied to the even dimensions and cos is applied to the odd dimensions: PE(pos, 2i) = sin(pos / 10000^{2i / d_model}) and PE(pos, 2i + 1) = cos(pos / 10000^{2i / d_model}), i ∈ {0, ..., d_model / 2 - 1}. Reference: 'Attention Is All You Need' https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf For the 'sep' type, sin is applied to the first half of the dimensions and cos is applied to the second half: PE(pos, i) = sin(pos / 10000^{2i / d_model}) for i ∈ {0, ..., d_model / 2 - 1} and PE(pos, i) = cos(pos / 10000^{2i / d_model}) for i ∈ {d_model / 2, ..., d_model - 1}. Reference: 'Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition' https://ieeexplore.ieee.org/abstract/document/8462506/ | 'mix'
d_model | int | The dimension of the hidden feature vectors of the Transformer layers. | 512
emb_scale | bool | Controls whether the embedding vectors are scaled up by sqrt(d_model) before the positional encoding is added. Reference: Section 3.4 in 'Attention Is All You Need' https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf In most cases, we don't recommend turning it on, especially when you don't have a large training set (e.g. LibriSpeech-train_clean_100), because it may make your model hard to converge. Consider it only when you want to emphasize the embedded features over the positional encodings. | False
emb_layernorm | bool | Controls whether the embedding vectors are normalized by LayerNorm before the positional encoding is added. | False
posenc_scale | bool | Controls whether the positional encodings are scaled up by a trainable scalar before being added to the embedded features. Reference: 'Neural Speech Synthesis with Transformer Network' https://ojs.aaai.org/index.php/AAAI/article/view/4642/4520 | False
init_alpha | float | The initial value of the alpha used for positional encoding scaling. Only effective when posenc_scale is True. | 1.0
max_len | int | The maximum length of the input feature sequences. | 5000
dropout | float | The dropout rate of the Dropout layer applied after adding the positional encoding to the input. | 0.0
Source code in speechain/module/transformer/pos_enc.py
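The 'mix' and 'sep' equations in the posenc_type row can be turned into a small table-building sketch. This follows the formulas as documented (even/odd interleaving for 'mix', half/half split for 'sep') and is not necessarily identical to the code in pos_enc.py.

```python
import torch

def make_posenc(posenc_type: str = "mix", d_model: int = 512, max_len: int = 5000) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 2i, i in {0, ..., d_model/2 - 1}
    div = torch.pow(10000.0, two_i / d_model)                       # 10000^{2i / d_model}
    pe = torch.zeros(max_len, d_model)
    if posenc_type == "mix":      # interleaved: sin on even dims, cos on odd dims
        pe[:, 0::2] = torch.sin(pos / div)
        pe[:, 1::2] = torch.cos(pos / div)
    elif posenc_type == "sep":    # split: sin on the first half, cos on the second half
        pe[:, : d_model // 2] = torch.sin(pos / div)
        pe[:, d_model // 2 :] = torch.cos(pos / div)
    return pe.unsqueeze(0)        # (1, max_len, d_model), broadcastable over the batch

pe = make_posenc("sep", d_model=8, max_len=10)
```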
reset_parameters()
Make sure that the scalar value is not influenced by different model initialization methods.
update_posenc(max_len)
Parameters:
Name | Type | Description | Default
---|---|---|---
max_len | int | | required
Source code in speechain/module/transformer/pos_enc.py
PositionwiseFeedForward
Bases: Module
Position-wise feed-forward layer. Projects the output vectors of the multi-head attention layer to fdfwd_dim and then back to d_model.
Source code in speechain/module/transformer/feed_forward.py
forward(x)
Parameters:
Name | Type | Description | Default
---|---|---|---
x | Tensor | (batch, seq_maxlen, d_model) | required
Returns:
Source code in speechain/module/transformer/feed_forward.py
module_init(d_model=512, fdfwd_dim=2048, fdfwd_type='linear', fdfwd_activation='ReLU', fdfwd_args={}, dropout=0.1)
Initializes the position-wise feed-forward layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
d_model | int | The dimension of the hidden feature vector in each Transformer layer. | 512
fdfwd_dim | int | The out_features of the first linear feed-forward layer and the in_features of the second linear feed-forward layer. | 2048
fdfwd_type | str | The type of the feed-forward layer. 'linear' means a Linear layer, while 'conv' means a Conv1d layer. | 'linear'
fdfwd_activation | str | The name of the activation function of the feed-forward layers. Should be the name of a function in 'torch.nn'. | 'ReLU'
fdfwd_kernel | int | The kernel size of the Conv1d feed-forward layer. This argument is not effective if fdfwd_type == 'linear'. | required
dropout | float | The dropout rate for the Dropout layer after the first linear feed-forward layer. | 0.1
Source code in speechain/module/transformer/feed_forward.py
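Both feed-forward types project d_model → fdfwd_dim → d_model; 'linear' does it with Linear layers, 'conv' with Conv1d layers over the time axis. A hedged sketch follows (the exact layer arrangement and the fdfwd_kernel handling in feed_forward.py may differ).

```python
import torch

d_model, fdfwd_dim, fdfwd_kernel = 512, 2048, 3
x = torch.randn(4, 100, d_model)  # (batch, seq_maxlen, d_model)

# fdfwd_type='linear'
linear_ffn = torch.nn.Sequential(
    torch.nn.Linear(d_model, fdfwd_dim),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.1),
    torch.nn.Linear(fdfwd_dim, d_model),
)
y_linear = linear_ffn(x)  # (4, 100, 512)

# fdfwd_type='conv': Conv1d expects (batch, channels, time), so transpose around the convolutions
conv_in = torch.nn.Conv1d(d_model, fdfwd_dim, kernel_size=fdfwd_kernel, padding=fdfwd_kernel // 2)
conv_out = torch.nn.Conv1d(fdfwd_dim, d_model, kernel_size=fdfwd_kernel, padding=fdfwd_kernel // 2)
y_conv = conv_out(torch.relu(conv_in(x.transpose(1, 2)))).transpose(1, 2)  # (4, 100, 512)
```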
TransformerDecoder
Bases: Module
Source code in speechain/module/transformer/decoder.py
forward(tgt, src, tgt_mask, src_mask, return_att=False, return_hidden=False)
Transformer decoder forward pass.
Parameters:
Name | Type | Description | Default
---|---|---|---
tgt | Tensor | (batch, tgt_maxlen, d_model) embedded targets | required
src | Tensor | (batch, src_maxlen, d_model) source representations | required
tgt_mask | Tensor | (batch, 1, tgt_maxlen) to mask out target paddings. Note that a subsequent mask is applied here. | required
src_mask | Tensor | (batch, 1, src_maxlen) to mask out source paddings | required
return_att | bool | | False
return_hidden | bool | | False
Returns:
The output of the Transformer decoder. The outputs of each Transformer decoder layer will be returned as a List, and the attention matrices (self and enc-dec) of each Transformer decoder layer will also be returned as a List.
Source code in speechain/module/transformer/decoder.py
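A shape-level sketch of the documented forward interface. Here `decoder` stands in for an already-constructed TransformerDecoder instance (its construction is handled by the speechain Module machinery and is not shown); the tensors only illustrate the expected shapes.

```python
import torch

batch, tgt_maxlen, src_maxlen, d_model = 4, 20, 120, 512
tgt = torch.randn(batch, tgt_maxlen, d_model)                  # embedded targets
src = torch.randn(batch, src_maxlen, d_model)                  # encoder (source) representations
tgt_mask = torch.ones(batch, 1, tgt_maxlen, dtype=torch.bool)  # padding mask; the subsequent (causal)
                                                               # mask is applied inside forward
src_mask = torch.ones(batch, 1, src_maxlen, dtype=torch.bool)

# out = decoder(tgt, src, tgt_mask, src_mask, return_att=True, return_hidden=True)
# With the flags enabled, the call also returns per-layer hidden states and attention matrices as lists.
```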
module_init(posenc_type='mix', posenc_maxlen=5000, posenc_dropout=0.1, posenc_scale=False, posenc_init_alpha=1.0, emb_layernorm=False, emb_scale=True, d_model=512, num_heads=4, num_layers=8, scale_dp_by_head=False, fdfwd_dim=2048, fdfwd_activation='ReLU', fdfwd_dropout=0.1, att_dropout=0.1, res_dropout=0.1, layernorm_first=True)
Parameters:
Name | Type | Description | Default
---|---|---|---
posenc_type | str | Specifies the positional encoding type to use in the Transformer blocks. | 'mix'
posenc_maxlen | int | Maximum length when calculating the positional encoding. Usually, the default value of this argument is enough. | 5000
posenc_dropout | float | The dropout rate for the Dropout layer after adding the positional encoding to the input. | 0.1
posenc_scale | bool | Controls whether the positional encodings are scaled up by a trainable scalar before being added to the embedded features. Reference: 'Neural Speech Synthesis with Transformer Network' https://ojs.aaai.org/index.php/AAAI/article/view/4642/4520 | False
posenc_init_alpha | float | The initial value of the alpha used for positional encoding scaling. Only effective when posenc_scale is True. | 1.0
emb_layernorm | bool | Controls whether the embedding vectors are normalized by LayerNorm before the positional encoding is added. | False
emb_scale | bool | Controls whether the embedding vectors are scaled up by sqrt(d_model) before the positional encoding is added. | True
d_model | int | The dimension of the hidden feature vector in each Transformer layer. | 512
num_heads | int | The number of attention heads in each Transformer layer. | 4
num_layers | int | The number of Transformer layers. | 8
att_dropout | float | The dropout rate for the Dropout layer after calculating the attention weights in each Transformer layer. | 0.1
fdfwd_dim | int | The out_features of the first linear feed-forward layer and the in_features of the second linear feed-forward layer in each Transformer layer. | 2048
fdfwd_activation | str | The name of the activation function of the feed-forward layers. Should be the name of a function in 'torch.nn'. | 'ReLU'
fdfwd_dropout | float | The dropout rate for the Dropout layer after the first linear feed-forward layer in each Transformer layer. | 0.1
res_dropout | float | The dropout rate for the Dropout layer before adding the output of each Transformer layer to its input. | 0.1
layernorm_first | bool | Controls whether the LayerNorm layer appears at the beginning or at the end of each Transformer layer. True means the LayerNorm layer appears at the beginning; False means it appears at the end. | True
Source code in speechain/module/transformer/decoder.py
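For reference, the documented defaults above can be collected into a keyword-argument dictionary; how these values reach module_init (directly or via a configuration file) depends on the surrounding speechain machinery, so the final line is only indicative.

```python
# Keyword arguments matching the documented module_init signature and defaults.
decoder_conf = dict(
    posenc_type="mix", posenc_maxlen=5000, posenc_dropout=0.1,
    posenc_scale=False, posenc_init_alpha=1.0,
    emb_layernorm=False, emb_scale=True,
    d_model=512, num_heads=4, num_layers=8, scale_dp_by_head=False,
    fdfwd_dim=2048, fdfwd_activation="ReLU", fdfwd_dropout=0.1,
    att_dropout=0.1, res_dropout=0.1, layernorm_first=True,
)
# e.g. decoder.module_init(**decoder_conf)
```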
subsequent_mask(batch_size, maxlen)
staticmethod
Masks out subsequent positions (to prevent attending to future positions). Transformer helper function.
Parameters:
Name | Type | Description | Default
---|---|---|---
batch_size | | | required
maxlen | int | Size of the mask (2nd and 3rd dimensions). | required
Returns:
Source code in speechain/module/transformer/decoder.py
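A minimal sketch of what a subsequent (causal) mask of this shape typically looks like, built with torch.tril; the actual static method may differ in dtype or construction details.

```python
import torch

def subsequent_mask_sketch(batch_size: int, maxlen: int) -> torch.Tensor:
    # Lower-triangular mask: position i may only attend to positions <= i.
    mask = torch.tril(torch.ones(maxlen, maxlen, dtype=torch.bool))
    return mask.unsqueeze(0).expand(batch_size, -1, -1)  # (batch_size, maxlen, maxlen)

print(subsequent_mask_sketch(1, 4).int())
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.int32)
```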
TransformerDecoderLayer
Bases: Module
A single Transformer decoder layer has:
· a self multi-head attention sublayer
· a LayerNorm layer exclusively for the self-attention sublayer
· an encoder-decoder multi-head attention sublayer
· a LayerNorm layer exclusively for the encoder-decoder attention sublayer
· a position-wise feed-forward sublayer
· a LayerNorm layer exclusively for the feed-forward sublayer
· a residual dropout layer
Source code in speechain/module/transformer/decoder.py
forward(tgt, src, tgt_mask, src_mask)
Forward pass of a single Transformer decoder layer.
Parameters:
Name | Type | Description | Default
---|---|---|---
tgt | Tensor | (batch, tgt_maxlen, d_model) target inputs | required
src | Tensor | (batch, src_maxlen, d_model) source representations | required
tgt_mask | Tensor | (batch, tgt_maxlen, tgt_maxlen) target mask (so as not to condition on future steps) | required
src_mask | Tensor | (batch, 1, src_maxlen) source mask | required
Returns:
The output of this Transformer decoder layer and the attention matrices (self and enc-dec).
Source code in speechain/module/transformer/decoder.py
module_init(d_model=512, num_heads=8, scale_dp_by_head=False, att_dropout=0.1, fdfwd_dim=0, fdfwd_activation='ReLU', fdfwd_dropout=0.1, res_dropout=0.1, layernorm_first=True)
Represents a single Transformer decoder layer. It attends to the source representation and the previous decoder states.
Parameters:
Name | Type | Description | Default
---|---|---|---
d_model | int | The dimension of the hidden feature vector in each Transformer layer. | 512
num_heads | int | The number of attention heads in each Transformer layer. | 8
att_dropout | float | The dropout rate for the Dropout layer after calculating the attention weights in each Transformer layer. | 0.1
fdfwd_dim | int | The out_features of the first linear feed-forward layer and the in_features of the second linear feed-forward layer in each Transformer layer. | 0
fdfwd_activation | str | The name of the activation function of the feed-forward layers. Should be the name of a function in 'torch.nn'. | 'ReLU'
fdfwd_dropout | float | The dropout rate for the Dropout layer after the first linear feed-forward layer in each Transformer layer. | 0.1
res_dropout | float | The dropout rate for the Dropout layer before adding the output of each Transformer layer to its input. | 0.1
layernorm_first | bool | Whether LayerNorm is performed before feeding the input into the sublayers. If layernorm_first is True: output = input + Sublayer(LayerNorm(input)). If layernorm_first is False: output = LayerNorm(input + Sublayer(input)). | True
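The layernorm_first flag corresponds to the usual pre-norm vs. post-norm arrangement. A sketch for a single generic sublayer follows (the Linear module below merely stands in for the attention or feed-forward sublayers).

```python
import torch

d_model = 512
norm = torch.nn.LayerNorm(d_model)
res_dropout = torch.nn.Dropout(0.1)
sublayer = torch.nn.Linear(d_model, d_model)  # placeholder for self-attention, enc-dec attention or feed-forward

def decoder_sublayer(x: torch.Tensor, layernorm_first: bool = True) -> torch.Tensor:
    if layernorm_first:   # pre-norm:  output = input + Sublayer(LayerNorm(input))
        return x + res_dropout(sublayer(norm(x)))
    else:                 # post-norm: output = LayerNorm(input + Sublayer(input))
        return norm(x + res_dropout(sublayer(x)))

y = decoder_sublayer(torch.randn(2, 10, d_model), layernorm_first=True)  # (2, 10, 512)
```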