BaseTokenizer Objects

class BaseTokenizer()

Abstract Tokenizer class

vocab_size

@property
def vocab_size() -> int

Returns the vocabulary size of the tokenizer

Returns:

int: Vocabulary size of tokenizer

config

@property
def config() -> dict

The tokenizer config

Returns:

dict: dictionary of tokenizer config

save

def save(path: str | Path) -> None

Saves the configuration of the tokenizer

Saves these files:

  • tokenizer.json: saved configuration of the tokenizer

Arguments:

  • path (str, :class:pathlib.Path“): Path to save the tokenizer to

load

@classmethod
def load(cls, path: str | Path) -> "BaseTokenizer"

Returns a :class:bm25_engine.tokenizer.BaseTokenizer object from saved configuration

Requires these files:

  • tokenizer.json: saved configuration of the tokenizer

Arguments:

  • path (str, :class:pathlib.Path“): Path to load the tokenizer from

Returns:

BaseTokenizer: Configured BaseTokenizer

PretrainedTokenizer Objects

class PretrainedTokenizer(BaseTokenizer)

Wrapper for HuggingFace tokenizers, representing a pretrained tokenizer (i.e. bert-base-uncased).

Extends the :class:semantic_router.tokenizers.BaseTokenizer class.

Arguments:

  • tokenizer (class:tokenizers.Tokenizer“): Binding for HuggingFace Rust tokenizers
  • add_special_tokens (bool): Whether to accept special tokens from the tokenizer (i.e. [PAD])
  • pad (bool): Whether to pad the input to a consistent length (using [PAD] tokens)
  • tokenizer0 (tokenizer1): HuggingFace ID of the model (i.e. tokenizer2)

__init__

def __init__(model_ident: str,
             custom_normalizer: Any = None,
             add_special_tokens: bool = False,
             pad: bool = True) -> None

Constructor method

vocab_size

@property
def vocab_size()

Returns the vocabulary size of the tokenizer

Returns:

int: Vocabulary size of tokenizer

config

@property
def config() -> dict

The tokenizer config

Returns:

dict: dictionary of tokenizer config

tokenize

def tokenize(texts: str | list[str], pad: bool = True) -> np.ndarray

Tokenizes a string or list of strings into a 2D :class:numpy.ndarray of token ids

Arguments:

  • texts (str, list): Texts to be tokenized
  • pad (bool): unused here (configured in the constructor)

Returns:

class:numpy.ndarray“: 2D numpy array representing token ids

TokenizerFactory Objects

class TokenizerFactory()

Tokenizer factory class

get

@staticmethod
def get(type_: str, **tokenizer_kwargs) -> BaseTokenizer

Get a configured :class:bm25_engine.tokenizer.BaseTokenizer

Arguments:

  • type_ (str): Tokenizer type to instantiate
  • \**kwargs: kwargs to be passed to Tokenizer constructor

Returns:

bm25_engine.tokenizer.BaseTokenizer: Tokenizer