BaseTokenizer Objects
vocab_size
int: Vocabulary size of tokenizer
config
dict: dictionary of tokenizer config
save
Saves these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
- path (str, pathlib.Path): Path to save the tokenizer to
load
Loads and returns a bm25_engine.tokenizer.BaseTokenizer object from a saved configuration.
Requires these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
- path (str, pathlib.Path): Path to load the tokenizer from
BaseTokenizer: Configured BaseTokenizer
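The save and load entries above describe a round trip through a single tokenizer.json file. A minimal sketch of that pattern follows, assuming the config is a JSON-serializable dict and that the path names a directory. ToyTokenizer, its config keys, and roundtrip_demo are hypothetical stand-ins for illustration, not the library's implementation:

```python
import json
import tempfile
from pathlib import Path
from typing import Union


class ToyTokenizer:
    """Hypothetical stand-in; not the library's BaseTokenizer."""

    def __init__(self, vocab: dict) -> None:
        self._vocab = dict(vocab)

    @property
    def vocab_size(self) -> int:
        return len(self._vocab)

    @property
    def config(self) -> dict:
        # Everything needed to rebuild this tokenizer.
        return {"vocab": self._vocab}

    def save(self, path: Union[str, Path]) -> None:
        # Assumption: `path` is a directory that receives tokenizer.json.
        directory = Path(path)
        directory.mkdir(parents=True, exist_ok=True)
        with open(directory / "tokenizer.json", "w", encoding="utf-8") as fp:
            json.dump(self.config, fp)

    @classmethod
    def load(cls, path: Union[str, Path]) -> "ToyTokenizer":
        # Read the saved config back and rebuild the tokenizer from it.
        with open(Path(path) / "tokenizer.json", encoding="utf-8") as fp:
            config = json.load(fp)
        return cls(**config)


def roundtrip_demo() -> ToyTokenizer:
    # Save into a temporary directory, then load a fresh instance from it.
    with tempfile.TemporaryDirectory() as tmp:
        ToyTokenizer({"hello": 0, "world": 1}).save(tmp)
        return ToyTokenizer.load(tmp)
```

Keeping the whole state in config means load only has to pass the saved dict back to the constructor.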
PretrainedTokenizer Objects
semantic_router.tokenizers.BaseTokenizer class.
Arguments:
- tokenizer (tokenizers.Tokenizer): Binding for HuggingFace Rust tokenizers
- add_special_tokens (bool): Whether to accept special tokens from the tokenizer (e.g. [PAD])
- pad (bool): Whether to pad the input to a consistent length (using [PAD] tokens)
- tokenizer0 (tokenizer1): HuggingFace ID of the model (i.e. tokenizer2)
__init__
vocab_size
int: Vocabulary size of tokenizer
config
dict: dictionary of tokenizer config
tokenize
Tokenizes the given texts into a numpy.ndarray of token ids.
Arguments:
- texts (str, list): Texts to be tokenized
- pad (bool): unused here (configured in the constructor)
numpy.ndarray: 2D numpy array representing token ids
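To illustrate why padding produces a rectangular 2D result, here is a small sketch in plain Python. The whitespace tokenizer, the vocabulary, and the choice of id 0 for [PAD] are all assumptions made for the example. Unlike the documented method, this toy function takes pad as a call argument rather than a constructor setting:

```python
from typing import Dict, List, Union

PAD_ID = 0  # assumption: id 0 is reserved for the [PAD] token


def toy_tokenize(
    texts: Union[str, List[str]],
    vocab: Dict[str, int],
    pad: bool = True,
) -> List[List[int]]:
    """Hypothetical whitespace tokenizer; words missing from vocab are skipped."""
    if isinstance(texts, str):
        texts = [texts]  # a single string is treated as a batch of one
    rows = [
        [vocab[word] for word in text.lower().split() if word in vocab]
        for text in texts
    ]
    if pad and rows:
        # Right-pad every row to the length of the longest row.
        width = max(len(row) for row in rows)
        rows = [row + [PAD_ID] * (width - len(row)) for row in rows]
    return rows


vocab = {"[PAD]": PAD_ID, "hello": 1, "big": 2, "world": 3}
batch = toy_tokenize(["hello world", "hello big world"], vocab)
# batch == [[1, 3, 0], [1, 2, 3]]
```

With padding enabled every row has the same length, so passing the result to numpy.asarray would give a 2D array of token ids.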
TokenizerFactory Objects
get
bm25_engine.tokenizer.BaseTokenizer
Arguments:
- type_ (str): Tokenizer type to instantiate
- **kwargs: kwargs to be passed to Tokenizer constructor
bm25_engine.tokenizer.BaseTokenizer: Tokenizer
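A factory method of this shape is commonly built on a mapping from type names to constructors. The sketch below shows that general pattern only. ToyBase, ToyWhitespace, ToyFactory, the "whitespace" type name, and the lowercase option are hypothetical, and the real TokenizerFactory may be organised differently:

```python
from typing import Callable, Dict


class ToyBase:
    """Hypothetical base class standing in for BaseTokenizer."""


class ToyWhitespace(ToyBase):
    def __init__(self, lowercase: bool = True) -> None:
        self.lowercase = lowercase


class ToyFactory:
    # Hypothetical registry: type name -> constructor.
    _registry: Dict[str, Callable[..., ToyBase]] = {"whitespace": ToyWhitespace}

    @classmethod
    def get(cls, type_: str, **kwargs) -> ToyBase:
        try:
            constructor = cls._registry[type_]
        except KeyError:
            raise ValueError(f"Unknown tokenizer type: {type_!r}") from None
        # kwargs are forwarded unchanged to the selected constructor.
        return constructor(**kwargs)


tok = ToyFactory.get("whitespace", lowercase=False)
```

Raising on an unknown type name, rather than returning None, surfaces a misspelled type at the call site.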
