Tokenizer
Tokenizer is the base class of all the Tokenizer objects in this toolkit. It converts text data between strings and tensors on the fly.
For data storage and visualization, text data should be kept as strings, but strings are not suitable for the forward computation of a model.
For the forward computation, text data is better represented as vectors (torch.Tensor or numpy.ndarray).
Table of Contents
Tokenizer Library
/speechain
/tokenizer
/abs.py # Abstract class of Tokenizer. Base of all Tokenizer implementations.
/char.py # Tokenizer implementation of the character tokenizer.
/sp.py # Tokenizer implementation of the subword tokenizer by SentencePiece package.
/g2p.py # Tokenizer implementation of the phoneme tokenizer by G2P package.
API Document
Non-overridable backbone functions:
1. speechain.tokenizer.abs.Tokenizer.__init__
Overridable interface functions:
1. speechain.tokenizer.abs.Tokenizer.tokenizer_init_fn
2. speechain.tokenizer.abs.Tokenizer.tensor2text
3. speechain.tokenizer.abs.Tokenizer.text2tensor
speechain.tokenizer.abs.Tokenizer.__init__(self, token_vocab, **tokenizer_conf)
- Description:
This function registers some member variables shared by all Tokenizer subclasses:
  - self.idx2token: the mapping Dict from the token index to the token string.
  - self.token2idx: the mapping Dict from the token string to the token index.
  - self.vocab_size: the number of tokens in the given vocabulary.
  - self.sos_eos_idx: the index of the joint <sos/eos> token used as the beginning and end of a sentence.
  - self.ignore_idx: the index of the blank token, used either for CTC blank modeling or as the ignored token for encoder-decoder ASR & TTS models.
  - self.unk_idx: the index of the unknown token.
- Arguments:
  - token_vocab: str
    The path where the token vocabulary is placed.
  - **tokenizer_conf:
    The arguments used by tokenizer_init_fn() for your customized Tokenizer initialization.
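The member variables listed above can be illustrated with a small sketch. It is not the toolkit's implementation: the helper name build_vocab_maps, the in-memory token list, and the spellings of the special tokens (<blank>, <unk>, <sos/eos>) are assumptions made for illustration only.

```python
from typing import Dict, List


def build_vocab_maps(tokens: List[str]) -> Dict[str, object]:
    """Build the shared lookup tables from an ordered token list.

    Assumption: tokens[i] is the token whose index is i, as if a vocabulary
    file had been read line by line.
    """
    idx2token = dict(enumerate(tokens))
    token2idx = {token: idx for idx, token in idx2token.items()}
    return {
        "idx2token": idx2token,
        "token2idx": token2idx,
        "vocab_size": len(tokens),
        # The special token spellings below are assumptions for illustration.
        "sos_eos_idx": token2idx["<sos/eos>"],
        "ignore_idx": token2idx["<blank>"],
        "unk_idx": token2idx["<unk>"],
    }


# Hypothetical toy vocabulary; a real one would come from `token_vocab`.
toy_tokens = ["<blank>", "a", "b", "c", "<unk>", "<sos/eos>"]
maps = build_vocab_maps(toy_tokens)
print(maps["vocab_size"], maps["sos_eos_idx"], maps["unk_idx"])  # 6 5 4
```

Keeping both directions of the mapping makes encoding (token to index) and decoding (index to token) constant-time lookups.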
speechain.tokenizer.abs.Tokenizer.tokenizer_init_fn(self, **tokenizer_conf)
- Description:
This hook interface function initializes the customized part of a Tokenizer subclass, if there is one.
Overriding this interface is optional.
- Arguments:
  - **tokenizer_conf:
    The arguments used by tokenizer_init_fn() for your customized Tokenizer initialization.
    For more details, please refer to the docstring of your target Tokenizer subclass.
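The relation between __init__ and the tokenizer_init_fn hook might look roughly like the sketch below. SketchTokenizer is a simplified stand-in, not the real abstract class: it takes an in-memory token list instead of a vocabulary path, and the subclass and its lowercase option are hypothetical.

```python
class SketchTokenizer:
    """Simplified stand-in for the abstract base class (not the real code)."""

    def __init__(self, tokens, **tokenizer_conf):
        # Assumption: the real class reads a vocabulary file from a path;
        # this sketch accepts the token list directly.
        self.idx2token = dict(enumerate(tokens))
        self.token2idx = {t: i for i, t in self.idx2token.items()}
        self.vocab_size = len(self.idx2token)
        # Hand the remaining keyword arguments over to the hook.
        self.tokenizer_init_fn(**tokenizer_conf)

    def tokenizer_init_fn(self, **tokenizer_conf):
        # Default hook: nothing to customize.
        pass


class LowercaseCharTokenizer(SketchTokenizer):
    """Hypothetical subclass that registers one extra option."""

    def tokenizer_init_fn(self, lowercase=True):
        self.lowercase = lowercase


tok = LowercaseCharTokenizer(["<blank>", "a", "b"], lowercase=False)
print(tok.vocab_size, tok.lowercase)  # 3 False
```

With this layout a subclass never needs to touch __init__, so the shared member variables are always registered the same way.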
speechain.tokenizer.abs.Tokenizer.tensor2text(self, tensor)
- Description:
This function decodes a text tensor into a human-friendly string.
The default implementation transforms each token index in the input tensor into its token string by self.idx2token. If the token index is self.unk_idx, an asterisk (*) will be used to represent the unknown token in the string.
Overriding this interface is optional. If your Tokenizer subclass uses a third-party package to decode the input tensor rather than the built-in self.idx2token, please override this function.
- Arguments:
  - tensor: torch.LongTensor
    1D integer torch.Tensor that contains the token indices of the sentence to be decoded.
- Return:
The string of the decoded sentence.
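The default decoding behavior described above can be sketched in plain Python. The function name tensor2text_sketch is hypothetical, a list of integers stands in for the 1D torch.LongTensor (a real tensor could be converted with tensor.tolist() first), and joining the tokens without a separator is an assumption that suits character-level vocabularies.

```python
from typing import Dict, Sequence


def tensor2text_sketch(indices: Sequence[int],
                       idx2token: Dict[int, str],
                       unk_idx: int) -> str:
    """Map each token index to its string, showing unknown tokens as '*'."""
    pieces = []
    for idx in indices:
        pieces.append("*" if idx == unk_idx else idx2token[idx])
    # Assumption: tokens are concatenated without a separator.
    return "".join(pieces)


# Hypothetical toy vocabulary for illustration.
idx2token = {0: "<blank>", 1: "h", 2: "i", 3: "<unk>"}
decoded = tensor2text_sketch([1, 2, 3], idx2token, unk_idx=3)
print(decoded)  # hi*
```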
speechain.tokenizer.abs.Tokenizer.text2tensor(self, text)
- Description:
This function encodes a text string into a model-friendly tensor.
This interface must be overridden.
By default, this function attaches two <sos/eos> tokens, one at the beginning and one at the end of the returned token id sequence.
- Arguments:
  - text: str
    The input text string to be encoded.
  - no_sos: bool = False
    Whether to remove the <sos/eos> at the beginning of the token id sequence.
  - no_eos: bool = False
    Whether to remove the <sos/eos> at the end of the token id sequence.
- Return: torch.LongTensor
The tensor of the encoded sentence
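As an illustration of the no_sos and no_eos arguments, here is a minimal character-level sketch. It is not the toolkit's code: the function name text2tensor_sketch and the toy vocabulary are hypothetical, mapping out-of-vocabulary characters to the unknown index is an assumption, and a plain list of integers stands in for the returned torch.LongTensor.

```python
from typing import Dict, List


def text2tensor_sketch(text: str,
                       token2idx: Dict[str, int],
                       sos_eos_idx: int,
                       unk_idx: int,
                       no_sos: bool = False,
                       no_eos: bool = False) -> List[int]:
    """Encode a string character by character into token indices."""
    # Assumption: characters missing from the vocabulary map to unk_idx.
    ids = [token2idx.get(ch, unk_idx) for ch in text]
    if not no_sos:
        ids.insert(0, sos_eos_idx)
    if not no_eos:
        ids.append(sos_eos_idx)
    return ids


# Hypothetical toy vocabulary for illustration.
token2idx = {"<blank>": 0, "a": 1, "b": 2, "<unk>": 3, "<sos/eos>": 4}
full = text2tensor_sketch("abz", token2idx, sos_eos_idx=4, unk_idx=3)
no_sos = text2tensor_sketch("abz", token2idx, sos_eos_idx=4, unk_idx=3, no_sos=True)
print(full)    # [4, 1, 2, 3, 4]
print(no_sos)  # [1, 2, 3, 4]
```

In a real subclass the final list would be wrapped as torch.LongTensor(ids) before being returned.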