BaseTokenizer Objects
vocab_size
Returns:
- int: Vocabulary size of tokenizer
config
Returns:
- dict: Dictionary of tokenizer config
save
Saves these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
- path (str or pathlib.Path): Path to save the tokenizer to
load
Returns a bm25_engine.tokenizer.BaseTokenizer object from saved configuration.
Requires these files:
- tokenizer.json: saved configuration of the tokenizer
Arguments:
- path (str or pathlib.Path): Path to load the tokenizer from
Returns:
- BaseTokenizer: Configured BaseTokenizer
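The save/load contract above can be illustrated with a minimal sketch. `ToyTokenizer` and `round_trip` are hypothetical names, not part of the library, and the sketch assumes that `path` is a directory containing `tokenizer.json` and that the file holds the tokenizer's `config` as JSON. The real implementation may differ on both points.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


class ToyTokenizer:
    """Hypothetical stand-in that mirrors the documented interface."""

    def __init__(self, vocab: dict[str, int]) -> None:
        self._vocab = dict(vocab)

    @property
    def vocab_size(self) -> int:
        return len(self._vocab)

    @property
    def config(self) -> dict:
        # Everything needed to rebuild this tokenizer later.
        return {"type": "toy", "vocab": self._vocab}

    def save(self, path: str | Path) -> None:
        # Assumption: `path` is a directory; the config goes to tokenizer.json.
        path = Path(path)
        path.mkdir(parents=True, exist_ok=True)
        with open(path / "tokenizer.json", "w", encoding="utf-8") as fp:
            json.dump(self.config, fp)

    @classmethod
    def load(cls, path: str | Path) -> "ToyTokenizer":
        # Rebuild the tokenizer from the saved configuration.
        with open(Path(path) / "tokenizer.json", encoding="utf-8") as fp:
            config = json.load(fp)
        return cls(config["vocab"])


def round_trip(tokenizer: ToyTokenizer) -> ToyTokenizer:
    """Save to a temporary directory and load the tokenizer back."""
    with tempfile.TemporaryDirectory() as tmp:
        tokenizer.save(tmp)
        return ToyTokenizer.load(tmp)
```

A tokenizer that survives `round_trip` with an identical `config` and `vocab_size` satisfies the contract described here.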
PretrainedTokenizer Objects
Inherits from the semantic_router.tokenizers.BaseTokenizer class.
Arguments:
- tokenizer (tokenizers.Tokenizer): Binding for HuggingFace Rust tokenizers
- add_special_tokens (bool): Whether to accept special tokens from the tokenizer (e.g. [PAD])
- pad (bool): Whether to pad the input to a consistent length (using [PAD] tokens)
- tokenizer0 (tokenizer1): HuggingFace ID of the model (i.e. tokenizer2)
__init__
vocab_size
Returns:
- int: Vocabulary size of tokenizer
config
Returns:
- dict: Dictionary of tokenizer config
tokenize
Tokenizes texts into a numpy.ndarray of token ids.
Arguments:
- texts (str, list): Texts to be tokenized
- pad (bool): unused here (configured in the constructor)
Returns:
- numpy.ndarray: 2D numpy array representing token ids
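A 2D array needs rows of equal length, which is why padding matters when texts tokenize to different numbers of ids. The sketch below shows the idea with plain Python lists to stay dependency-free. `pad_token_ids` is a hypothetical helper, and the pad id of 0 is an assumption, since the actual id of [PAD] depends on the tokenizer's vocabulary.

```python
from __future__ import annotations


def pad_token_ids(batch: list[list[int]], pad_id: int = 0) -> list[list[int]]:
    """Right-pad every row to the length of the longest row.

    Assumption: `pad_id` stands in for the id of the [PAD] token.
    """
    width = max((len(ids) for ids in batch), default=0)
    return [ids + [pad_id] * (width - len(ids)) for ids in batch]


# Two texts that produced 3 and 1 token ids respectively.
padded = pad_token_ids([[5, 6, 7], [8]])
```

After padding, every row has the same length, so the result can be converted to a rectangular array without ragged-shape errors.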
TokenizerFactory Objects
get
Returns a bm25_engine.tokenizer.BaseTokenizer instance of the requested type.
Arguments:
- type_ (str): Tokenizer type to instantiate
- **kwargs: kwargs to be passed to Tokenizer constructor
Returns:
- bm25_engine.tokenizer.BaseTokenizer: Tokenizer
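One common way to build such a factory is a registry that maps type strings to classes and forwards the keyword arguments to the constructor. The sketch below is illustrative only: `WhitespaceTokenizer`, `CharTokenizer`, `get_tokenizer`, and the type names "whitespace" and "char" are all hypothetical, and the type strings the library actually accepts are not listed in this reference.

```python
from __future__ import annotations


class WhitespaceTokenizer:
    """Hypothetical tokenizer that splits on whitespace."""

    def __init__(self, lowercase: bool = True) -> None:
        self.lowercase = lowercase

    def tokenize(self, text: str) -> list[str]:
        if self.lowercase:
            text = text.lower()
        return text.split()


class CharTokenizer:
    """Hypothetical tokenizer that splits into characters."""

    def tokenize(self, text: str) -> list[str]:
        return list(text)


# Registry of hypothetical type names; not the library's real type strings.
_REGISTRY = {"whitespace": WhitespaceTokenizer, "char": CharTokenizer}


def get_tokenizer(type_: str, **kwargs):
    """Instantiate the tokenizer registered under `type_`, forwarding kwargs."""
    try:
        cls = _REGISTRY[type_]
    except KeyError:
        raise ValueError(f"Unknown tokenizer type: {type_!r}") from None
    return cls(**kwargs)
```

Calling `get_tokenizer("whitespace", lowercase=False)` forwards `lowercase` to the constructor, mirroring how the kwargs are described as being passed to the Tokenizer constructor.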