semantic_router.tokenizers
BaseTokenizer Objects
Abstract Tokenizer class
vocab_size
Returns the vocabulary size of the tokenizer
Returns:
int: Vocabulary size of the tokenizer
config
The tokenizer config
Returns:
dict: Dictionary of the tokenizer config
save
Saves the configuration of the tokenizer
Saves these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
path (str or pathlib.Path): Path to save the tokenizer to
load
Returns a semantic_router.tokenizers.BaseTokenizer object from saved configuration
Requires these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
path (str or pathlib.Path): Path to load the tokenizer from
Returns:
BaseTokenizer: Configured BaseTokenizer
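The save/load contract described above can be illustrated with a minimal sketch. ToyTokenizer and its lowercase option are hypothetical stand-ins, not library code, and the sketch assumes that the path argument names a directory in which tokenizer.json is written.

```python
import json
import tempfile
from pathlib import Path
from typing import Union


class ToyTokenizer:
    """Hypothetical stand-in for a BaseTokenizer subclass (not library code)."""

    def __init__(self, lowercase: bool = True) -> None:
        self.lowercase = lowercase

    @property
    def config(self) -> dict:
        # Everything needed to rebuild an equivalent tokenizer later.
        return {"type": type(self).__name__, "lowercase": self.lowercase}

    def save(self, path: Union[str, Path]) -> None:
        # Assumption: `path` is a directory; the config goes to <path>/tokenizer.json.
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / "tokenizer.json", "w", encoding="utf-8") as fp:
            json.dump(self.config, fp)

    @classmethod
    def load(cls, path: Union[str, Path]) -> "ToyTokenizer":
        # Read <path>/tokenizer.json and rebuild the tokenizer from it.
        with open(Path(path) / "tokenizer.json", encoding="utf-8") as fp:
            config = json.load(fp)
        return cls(lowercase=config["lowercase"])


# Round trip: save into a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    ToyTokenizer(lowercase=False).save(tmp)
    restored = ToyTokenizer.load(tmp)
    saved_files = sorted(p.name for p in Path(tmp).iterdir())
```

Keeping load as a classmethod means the caller does not need an existing instance to restore a tokenizer, which matches the description of load as returning a configured object.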
PretrainedTokenizer Objects
Wrapper for HuggingFace tokenizers, representing a pretrained tokenizer (e.g. bert-base-uncased).
Extends the semantic_router.tokenizers.BaseTokenizer class.
Arguments:
tokenizer (tokenizers.Tokenizer): Binding for HuggingFace Rust tokenizers
add_special_tokens (bool): Whether to accept special tokens from the tokenizer (e.g. [PAD])
pad (bool): Whether to pad the input to a consistent length (using [PAD] tokens)
model_ident (str): HuggingFace ID of the model (e.g. bert-base-uncased)
__init__
Constructor method
vocab_size
Returns the vocabulary size of the tokenizer
Returns:
int: Vocabulary size of the tokenizer
config
The tokenizer config
Returns:
dict: Dictionary of the tokenizer config
tokenize
Tokenizes a string or list of strings into a 2D numpy.ndarray of token ids
Arguments:
texts (str or list): Texts to be tokenized
pad (bool): Unused here (configured in the constructor)
Returns:
numpy.ndarray: 2D numpy array of token ids
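To show why padding yields a rectangular 2D result, here is a simplified sketch. It is not the library's implementation: toy_tokenize, the toy vocabulary, and PAD_ID are hypothetical, plain nested lists stand in for numpy.ndarray, and pad is a function argument here whereas the documented class takes it in the constructor.

```python
from typing import Dict, List, Union

PAD_ID = 0  # hypothetical id standing in for the [PAD] token


def toy_tokenize(
    texts: Union[str, List[str]], vocab: Dict[str, int], pad: bool = True
) -> List[List[int]]:
    """Map whitespace-separated words to ids and pad rows to equal length."""
    if isinstance(texts, str):
        texts = [texts]  # a single string becomes a batch of one
    rows = [[vocab[word] for word in text.lower().split()] for text in texts]
    if pad:
        # Right-pad every row to the length of the longest row.
        width = max((len(row) for row in rows), default=0)
        rows = [row + [PAD_ID] * (width - len(row)) for row in rows]
    return rows


vocab = {"hello": 1, "world": 2, "again": 3}  # toy vocabulary
batch = toy_tokenize(["hello world again", "hello"], vocab)
single = toy_tokenize("world", vocab)
```

With padding enabled every row has the same length, so the result can be stored as a 2D array. Without it, rows of different lengths would form a ragged structure.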
TokenizerFactory Objects
Tokenizer factory class
get
Get a configured semantic_router.tokenizers.BaseTokenizer
Arguments:
type_ (str): Tokenizer type to instantiate
**kwargs: Keyword arguments passed to the tokenizer constructor
Returns:
semantic_router.tokenizers.BaseTokenizer: Configured tokenizer
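The factory pattern behind get can be sketched as a lookup from a type name to a class, with the keyword arguments forwarded to the constructor. The registry keys and the two toy classes below are hypothetical; the type names the real factory accepts are not listed in this reference.

```python
from typing import Callable, Dict


class WhitespaceTokenizer:
    """Hypothetical tokenizer type used only for this illustration."""

    def __init__(self, lowercase: bool = True) -> None:
        self.lowercase = lowercase


class CharTokenizer:
    """Second hypothetical type, to show dispatch between classes."""

    def __init__(self, max_length: int = 16) -> None:
        self.max_length = max_length


class ToyTokenizerFactory:
    # Registry mapping a type name to the class that implements it.
    _registry: Dict[str, Callable[..., object]] = {
        "whitespace": WhitespaceTokenizer,
        "char": CharTokenizer,
    }

    @classmethod
    def get(cls, type_: str, **kwargs) -> object:
        try:
            tokenizer_cls = cls._registry[type_]
        except KeyError:
            raise ValueError(f"Unknown tokenizer type: {type_!r}") from None
        # Forward the remaining keyword arguments to the constructor.
        return tokenizer_cls(**kwargs)


tok = ToyTokenizerFactory.get("whitespace", lowercase=False)
```

Dispatching on a string keeps the tokenizer choice serializable, so the same type name could be stored in a saved config and used to rebuild the object later.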