encoder
Origin: Sashi Novitasari
Modification: Heli Qi
Affiliation: NAIST
Date: 2022.07
TransformerEncoder
Bases: Module
The Transformer encoder for any sequence-to-sequence task.
Reference: Attention Is All You Need (https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
Our Transformer encoder implements the following properties:
- Different positional encoding types (mix or sep)
- Different positions of the LayerNorm layer (first or last)
- Time frame downsampling (pool or concat)
For the details, please refer to the docstrings of PositionalEncoding and TransformerEncoderLayer.
In our Transformer implementation, there are 4 places where Dropout layers are applied:

1. After adding the positional encoding to the embedded features.
2. After the softmax operation and before reweighting the values by the attention weights in the multi-head attention layer.
3. Between the two feedforward linear layers.
4. Before performing the residual connection in a Transformer layer.
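To make these four positions concrete, here is a minimal sketch of one pre-LN (`layernorm_first=True`) encoder step; all names are illustrative placeholders rather than the actual speechain attribute names:

```python
import torch.nn as nn

# Illustrative placements of the four Dropout layers described above.
posenc_drop = nn.Dropout(0.1)  # 1. after adding the positional encoding
fdfwd_drop = nn.Dropout(0.1)   # 3. between the two feedforward layers
res_drop = nn.Dropout(0.1)     # 4. before each residual connection

def encoder_step_sketch(emb, pos_enc, self_attn, norm1, norm2, ff1, act, ff2):
    x = posenc_drop(emb + pos_enc(emb))                     # (1)
    # self_attn() is assumed to apply its own Dropout to the
    # softmax-ed attention weights internally                 (2)
    x = x + res_drop(self_attn(norm1(x)))                   # (4)
    x = x + res_drop(ff2(fdfwd_drop(act(ff1(norm2(x))))))   # (3) + (4)
    return x
```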
Source code in speechain/module/transformer/encoder.py
forward(src, mask)
Pass the input (and mask) through each layer in turn. Applies a Transformer encoder to a sequence of embeddings x. The input mini-batch x needs to be sorted by src length. x and mask share the batch and time dimensions: x is (batch, time, dim) while mask is (batch, 1, time).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `src` | `Tensor` | (batch_size, src_maxlen, embed_size) embedded src inputs. | required |
| `mask` | `Tensor` | (batch_size, 1, src_maxlen) mask indicating the padding areas (zeros where padding). | required |
Returns:
| Type | Description |
|---|---|
| | The output of the Transformer encoder with its mask. |
| | The outputs of each Transformer encoder layer, returned as a List. |
| | The attention matrix of each Transformer encoder layer, also returned as a List. |
Source code in speechain/module/transformer/encoder.py
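A hypothetical call, assuming an already-built TransformerEncoder instance named `encoder` (how the instance is constructed depends on the speechain Module conventions; the exact return packing is described in the Returns table above):

```python
import torch

batch_size, src_maxlen, embed_size = 8, 100, 512

# Embedded source inputs and a padding mask with zeros at padded frames.
src = torch.randn(batch_size, src_maxlen, embed_size)
src_len = torch.randint(1, src_maxlen + 1, (batch_size,))                 # hypothetical valid lengths
mask = torch.arange(src_maxlen)[None, None, :] < src_len[:, None, None]   # (batch, 1, src_maxlen)

# `encoder` is assumed to be a configured TransformerEncoder instance.
results = encoder(src, mask)  # encoder output + mask, per-layer outputs, per-layer attention matrices
```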
module_init(posenc_type='mix', posenc_maxlen=5000, posenc_dropout=0.1, posenc_scale=False, posenc_init_alpha=1.0, emb_layernorm=False, emb_scale=False, d_model=512, num_heads=4, num_layers=8, scale_dp_by_head=False, att_dropout=0.1, fdfwd_dim=2048, fdfwd_type='linear', fdfwd_activation='ReLU', fdfwd_args={}, fdfwd_dropout=0.1, res_dropout=0.1, layernorm_first=True, uni_direction=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `posenc_type` | `str` | The positional encoding type used in the Transformer blocks ('mix' or 'sep'). | `'mix'` |
| `posenc_maxlen` | `int` | Maximal length used when calculating the positional encoding. The default value is usually sufficient. | `5000` |
| `posenc_dropout` | `float` | Dropout rate of the Dropout layer applied after adding the positional encoding to the input. | `0.1` |
| `posenc_scale` | `bool` | Whether the positional encodings are scaled by a trainable scalar before being added to the embedded features. Reference: 'Neural Speech Synthesis with Transformer Network' (https://ojs.aaai.org/index.php/AAAI/article/view/4642/4520). | `False` |
| `posenc_init_alpha` | `float` | Initial value of the alpha used for positional encoding scaling. Only effective when posenc_scale is True. | `1.0` |
| `emb_layernorm` | `bool` | Whether the embedding vectors are normalized by LayerNorm before the positional encoding is added. | `False` |
| `emb_scale` | `bool` | Whether the embedding vectors are scaled by sqrt(d_model) before the positional encoding is added. | `False` |
| `d_model` | `int` | Dimension of the hidden feature vectors in each Transformer layer. | `512` |
| `num_heads` | `int` | Number of attention heads in each Transformer layer. | `4` |
| `num_layers` | `int` | Number of Transformer layers. | `8` |
| `att_dropout` | `float` | Dropout rate of the Dropout layer applied to the attention weights in each Transformer layer. | `0.1` |
| `fdfwd_dim` | `int` | out_features of the first linear feedforward layer and in_features of the second linear feedforward layer in each Transformer layer. | `2048` |
| `fdfwd_type` | `str` | Type of the feedforward layer: 'linear' uses a Linear layer, 'conv' uses a Conv1d layer. | `'linear'` |
| `fdfwd_activation` | `str` | Name of the activation function of the feedforward layers. Must be the name of a function in torch.nn. | `'ReLU'` |
| `fdfwd_dropout` | `float` | Dropout rate of the Dropout layer applied after the first feedforward layer in each Transformer layer. | `0.1` |
| `res_dropout` | `float` | Dropout rate of the Dropout layer applied before the output of each Transformer layer is added to its input. | `0.1` |
| `uni_direction` | `bool` | Whether the encoder is unidirectional. If True, the attention matrix is masked to be lower-triangular. | `False` |
| `layernorm_first` | `bool` | Whether the LayerNorm layer appears at the beginning (True) or at the end (False) of each Transformer layer. If True, an additional LayerNorm is applied at the end of the Transformer encoder to perform the final normalization. | `True` |
Source code in speechain/module/transformer/encoder.py
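For orientation, a hypothetical set of module_init arguments might look as follows (values are illustrative; how they are supplied, e.g. via a YAML configuration or a direct call, depends on how speechain constructs its Module objects):

```python
# Hypothetical configuration matching the module_init signature above.
encoder_conf = dict(
    posenc_type='mix',       # or 'sep'
    posenc_dropout=0.1,
    d_model=512,
    num_heads=4,
    num_layers=8,
    att_dropout=0.1,
    fdfwd_dim=2048,
    fdfwd_type='linear',
    fdfwd_activation='ReLU',
    fdfwd_dropout=0.1,
    res_dropout=0.1,
    layernorm_first=True,    # pre-LN; adds a final LayerNorm after the last layer
    uni_direction=False,     # True masks attention to be lower-triangular
)
```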
subsequent_mask(batch_size, maxlen)
staticmethod
Transformer helper function that masks out subsequent positions to prevent attending to future positions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | | | required |
| `maxlen` | `int` | Size of the mask (2nd and 3rd dimensions). | required |
Returns:
Source code in speechain/module/transformer/encoder.py
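Conceptually, a subsequent mask is a lower-triangular matrix; the following is an illustrative reimplementation, not necessarily the exact code in the linked source:

```python
import torch

def subsequent_mask_sketch(batch_size: int, maxlen: int) -> torch.Tensor:
    # Position i may only attend to positions <= i.
    tril = torch.tril(torch.ones(maxlen, maxlen)).bool()  # (maxlen, maxlen)
    return tril.unsqueeze(0).expand(batch_size, -1, -1)   # (batch_size, maxlen, maxlen)

print(subsequent_mask_sketch(1, 4).int())
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.int32)
```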
TransformerEncoderLayer
Bases: Module
A single Transformer encoder layer has:
- a Multi-head attention sublayer
- a LayerNorm layer exclusively for the attention sublayer
- a position-wise feed-forward sublayer
- a LayerNorm layer exclusively for the feed-forward sublayer
- a residual dropout layer
Source code in speechain/module/transformer/encoder.py
forward(src, mask)
Forward pass for a single transformer encoder layer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `src` | `Tensor` | (batch, src_maxlen, d_model) source input for the encoder. | required |
| `mask` | `Tensor` | (batch, 1, src_maxlen) input mask. | required |
Returns:
| Type | Description |
|---|---|
| | The output of this Transformer encoder layer and the attention matrix. |
Source code in speechain/module/transformer/encoder.py
module_init(d_model=512, num_heads=8, scale_dp_by_head=False, att_dropout=0.1, fdfwd_dim=2048, fdfwd_type='linear', fdfwd_activation='ReLU', fdfwd_args={}, fdfwd_dropout=0.1, res_dropout=0.1, layernorm_first=True)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `d_model` | `int` | Dimension of the hidden feature vectors in each Transformer layer. | `512` |
| `num_heads` | `int` | Number of attention heads in each Transformer layer. | `8` |
| `att_dropout` | `float` | Dropout rate of the Dropout layer applied to the attention weights in each Transformer layer. | `0.1` |
| `fdfwd_dim` | `int` | out_features of the first linear feedforward layer and in_features of the second linear feedforward layer in each Transformer layer. | `2048` |
| `fdfwd_type` | `str` | Type of the feedforward layer: 'linear' uses a Linear layer, 'conv' uses a Conv1d layer. | `'linear'` |
| `fdfwd_activation` | `str` | Name of the activation function of the feedforward layers. Must be the name of a function in torch.nn. | `'ReLU'` |
| `fdfwd_kernel` | `int` | Kernel size of the Conv1d feedforward layer. Not effective if fdfwd_type == 'linear'. | required |
| `fdfwd_dropout` | `float` | Dropout rate of the Dropout layer applied after the first feedforward layer in each Transformer layer. | `0.1` |
| `res_dropout` | `float` | Dropout rate of the Dropout layer applied before the output of each Transformer layer is added to its input. | `0.1` |
| `layernorm_first` | `bool` | Whether LayerNorm is performed before feeding src into the sublayers. If True: output = input + Sublayer(LayerNorm(input)). If False: output = LayerNorm(input + Sublayer(input)). | `True` |
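The two layernorm_first variants can be summarized with a short sketch (illustrative pseudocode using a generic sublayer and the residual Dropout described above, not the actual speechain code):

```python
def pre_ln(x, sublayer, layer_norm, res_drop):
    # layernorm_first=True: normalize the input before the sublayer
    return x + res_drop(sublayer(layer_norm(x)))

def post_ln(x, sublayer, layer_norm, res_drop):
    # layernorm_first=False: normalize after the residual connection
    return layer_norm(x + res_drop(sublayer(x)))
```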