abs
Author: Heli Qi
Affiliation: NAIST
Date: 2022.07
Tokenizer
Bases: ABC
Tokenizer is the base class of all the Tokenizer objects in this toolkit. It transforms text data between strings and tensors on the fly.
For data storage and visualization, the text data should be in the form of strings, which is not friendly for model forward calculation. For model forward calculation, the text data is better kept in the form of vectors (`torch.Tensor` or `numpy.ndarray`).
Source code in speechain/tokenizer/abs.py
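As a quick orientation, the sketch below shows the intended round trip. `MyTokenizer` and the vocabulary path are hypothetical stand-ins for a concrete subclass and its resources, not names defined by this toolkit.

```python
# Round-trip sketch; MyTokenizer and the vocabulary path are hypothetical
# stand-ins for a concrete Tokenizer subclass and its resources.
tokenizer = MyTokenizer(token_path="datasets/vocab/vocab")

# String -> tensor for model forward calculation
tensor = tokenizer.text2tensor("hello world")   # torch.LongTensor

# Tensor -> string for data storage and visualization
text = tokenizer.tensor2text(tensor)
```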
__init__(token_path=None, copy_path=None, **tokenizer_conf)
This function registers some shared member variables for all Tokenizer subclasses:
1. `self.idx2token`: the mapping Dict from the token index to the token string.
2. `self.token2idx`: the mapping Dict from the token string to the token index.
3. `self.vocab_size`: the number of tokens in the given vocabulary.
4. `self.sos_eos_idx`: the index of the joint `<sos/eos>` token.
5. `self.ignore_idx`: the index of the blank token used for either CTC blank modeling or as the ignored token for encoder-decoder ASR & TTS models.
6. `self.unk_idx`: the index of the unknown token.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`token_path` | `str` | The path where the token vocabulary is placed. | `None` |
`copy_path` | `str` | The path where you want to paste the given token vocabulary as a backup. If not given, no backup will be saved. | `None` |
`**tokenizer_conf` | | The arguments used by `tokenizer_init_fn()` for your customized Tokenizer initialization. | `{}` |
Source code in speechain/tokenizer/abs.py
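This section does not document the vocabulary file format; as a minimal sketch, assuming a plain-text file with one token per line, the shared members could be derived as follows.

```python
# Sketch of how the shared members could be built; the one-token-per-line
# vocabulary format is an assumption made for illustration.
with open("vocab", mode="r", encoding="utf-8") as f:
    tokens = [line.rstrip("\n") for line in f]

idx2token = dict(enumerate(tokens))                            # index -> token
token2idx = {token: idx for idx, token in idx2token.items()}   # token -> index
vocab_size = len(idx2token)
```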
tensor2text(tensor)
This function decodes a text tensor into a human-friendly string.
The default implementation maps each token index in the input tensor to its token string via `self.idx2token`. If a token index equals `self.unk_idx`, an asterisk (*) is used to represent the unknown token in the string.
Overriding this interface is not mandatory. If your Tokenizer subclass uses a third-party package to decode the input tensor rather than the built-in `self.idx2token`, please override this function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`tensor` | `torch.LongTensor` | 1D integer torch.Tensor that contains the token indices of the sentence to be decoded. | required |

Returns:
Type | Description |
---|---|
`str` | The string of the decoded sentence. |
Source code in speechain/tokenizer/abs.py
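The default behavior described above could look like the following sketch; joining tokens with spaces is an assumption and would differ for, e.g., character-level vocabularies.

```python
import torch

# Sketch of the default decoding logic: map each index back to its token
# string and mark unknown tokens with an asterisk (*).
def tensor2text(self, tensor: torch.LongTensor) -> str:
    tokens = [
        "*" if idx == self.unk_idx else self.idx2token[idx]
        for idx in tensor.tolist()
    ]
    return " ".join(tokens)  # the space join is an assumption
```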
text2tensor(text, no_sos=False, no_eos=False, return_tensor=True)
abstractmethod
This function encodes a text string into a model-friendly tensor. Overriding this interface is mandatory. By default, this function attaches two joint `<sos/eos>` tokens at the beginning and the end of the tokenized sentence.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`text` | `str` | The input text string to be encoded. | required |
`no_sos` | `bool` | Whether to remove the `<sos/eos>` token at the beginning of the sentence. | `False` |
`no_eos` | `bool` | Whether to remove the `<sos/eos>` token at the end of the sentence. | `False` |
`return_tensor` | `bool` | Whether to return the tokenization results as a tensor. If False, a List will be returned. | `True` |
Returns:
Type | Description |
---|---|
`torch.LongTensor` or `List` | The tensor of the encoded sentence. |
Source code in speechain/tokenizer/abs.py
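A minimal sketch of a possible override is shown below, assuming character-level tokenization; real subclasses may instead delegate to a subword model.

```python
import torch

# Character-level sketch of the mandatory override; splitting the string
# into characters is an assumption made only for illustration.
def text2tensor(self, text, no_sos=False, no_eos=False, return_tensor=True):
    idx_list = [self.token2idx.get(ch, self.unk_idx) for ch in text]
    if not no_sos:
        # Attach the joint <sos/eos> token at the beginning
        idx_list = [self.sos_eos_idx] + idx_list
    if not no_eos:
        # Attach the joint <sos/eos> token at the end
        idx_list = idx_list + [self.sos_eos_idx]
    return torch.LongTensor(idx_list) if return_tensor else idx_list
```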
tokenizer_init_fn(token_path, copy_path=None, **tokenizer_conf)
This hook interface function initializes the customized part of a Tokenizer subclass, if there is one. Overriding this interface is not mandatory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`copy_path` | `str` | The path where you want to paste the given tokenizer model as a backup. If not given, no backup will be saved. | `None` |
`**tokenizer_conf` | | The arguments used by `tokenizer_init_fn()` for your customized Tokenizer initialization. For more details, please refer to the docstring of your target Tokenizer subclass. | `{}` |
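For a subclass that relies on a third-party package, this hook could be overridden along the lines of the sketch below; the SentencePiece usage and the `model_path` key of `**tokenizer_conf` are assumptions, not part of this base class.

```python
import shutil
import sentencepiece as spm

# Sketch of a possible override that loads a third-party SentencePiece model;
# 'model_path' is an assumed entry of **tokenizer_conf.
def tokenizer_init_fn(self, token_path, copy_path=None, model_path=None, **tokenizer_conf):
    # Load the model used later by text2tensor() and tensor2text()
    self.sp_model = spm.SentencePieceProcessor(model_file=model_path)
    if copy_path is not None:
        # Save a backup of the tokenizer model to the given path
        shutil.copy(model_path, copy_path)
```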