BM25Encoder Objects

class BM25Encoder(SparseEncoder, FittableMixin, AsymmetricSparseMixin)

BM25Encoder, running a vectorized implementation of the ATIRE BM25 algorithm

Concept:

  • BM25 uses scoring between queries & corpus to retrieve the most relevant documents ∈ corpus
  • most vector databases (VDB) store embedded documents and score them versus received queries for retrieval
  • we need to break up the BM25 formula into encode_queries and encode_documents, with the latter to be stored in VDB
  • the dot product of encode_queries(q) and encode_documents([D_0, D_1, ...]) is the BM25 score of the documents [D_0, D_1, ...] for the given query q (see the sketch after the references below)
  • we train a BM25 encoder’s normalization parameters on a sufficiently large corpus to capture the target language distribution
  • these trained parameters allow us to balance the TF & IDF of queries & documents for retrieval (read more on how BM25 fixes issues with TF-IDF)

ATIRE Paper: https://www.cs.otago.ac.nz/research/student-publications/atire-opensource.pdf
Pinecone Implementation: https://github.com/pinecone-io/pinecone-text/blob/8399f9ff28c4652766c35165c0db9b0eff309077/pinecone_text/sparse/bm25_encoder.py
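A minimal sketch (not the library's internals) of the decomposition above: the query side carries the IDF component, the document side carries the length-normalized term frequency, and their dot product is the ATIRE BM25 score. The values of N, avgdl, and df are assumed to come from a fitted encoder.

import numpy as np

k1, b = 1.5, 0.75                  # common BM25 defaults
N, avgdl = 1000, 120.0             # corpus size & average document length (assumed)
df = np.array([12.0, 400.0, 3.0])  # documents containing each vocab term (assumed)

def encode_query(tf_q: np.ndarray) -> np.ndarray:
    # query side: term presence weighted by IDF = log(N / df)
    return (tf_q > 0) * np.log(N / df)

def encode_document(tf_d: np.ndarray) -> np.ndarray:
    # document side: term frequency normalized by itself & document length
    doc_len = tf_d.sum()
    return tf_d / (tf_d + k1 * (1 - b + b * doc_len / avgdl))

q = encode_query(np.array([1.0, 0.0, 1.0]))
d = encode_document(np.array([3.0, 5.0, 0.0]))
print(q @ d)  # BM25 score of document d for query q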

Arguments:

  • k1 (float): normalizer parameter that limits how much a single query term q_i ∈ q can affect the score for document D_n
  • b (float): normalizer parameter that balances the effect of a single document's length compared to the average document length
  • corpus_size (int): number of documents in the trained corpus
  • avg_doc_len (float): average document length in the trained corpus
  • doc_freq (numpy.ndarray): (1, tokenizer.vocab_size) shaped array, denoting how many documents contain each token t_i
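For illustration, a hypothetical construction with explicit normalizers (the k1 and b keyword arguments are assumed from the docstring above; the trained statistics corpus_size, avg_doc_len, and doc_freq are set by fit rather than passed by hand):

encoder = BM25Encoder(k1=1.5, b=0.75)  # constructor kwargs assumed; defaults usually suffice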

fit

def fit(routes: List[Route]) -> "BM25Encoder"

Trains the encoder weights on the provided routes.

Arguments:

  • routes (List[Route]): List of routes to train the encoder on.
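A minimal usage sketch, assuming the semantic-router Route API (Route(name=..., utterances=[...])) and the usual import paths:

from semantic_router import Route
from semantic_router.encoders import BM25Encoder

routes = [
    Route(name="weather", utterances=["what's the forecast?", "will it rain today?"]),
    Route(name="chitchat", utterances=["how are you?", "tell me a joke"]),
]
encoder = BM25Encoder()
encoder.fit(routes)  # learns corpus size, average document length & document frequencies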

encode_queries

def encode_queries(queries: list[str]) -> list[SparseEmbedding]

Returns BM25 scores for queries using precomputed corpus scores.

Arguments:

  • queries (list[str]): List of queries to encode

Returns:

list[SparseEmbedding]: BM25 scores for each query against the corpus
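For example (assuming a previously fitted encoder):

query_embeddings = encoder.encode_queries(["will it rain today?"])
# one SparseEmbedding per query, holding the query-side IDF weights,
# ready to be dot-multiplied with stored document embeddings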

encode_documents

def encode_documents(documents: list[str],
                     batch_size: int | None = None) -> list[SparseEmbedding]

Returns document term frequencies, normalized by themselves and by the average document length of the trained corpus.

(This is the right-hand side of the BM25 equation, which is matrix-multiplied with the query IDF component.)

LaTeX:

$$\frac{f(d_i, D)}{f(d_i, D) + k_1 \times \left(1 - b + b \times \frac{|D|}{avgdl}\right)}$$

where:

  • f(d_i, D) is the frequency of term d_i in document D
  • |D| is the document length
  • avgdl is the average document length in the trained corpus
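A worked instance of this normalization, with assumed values: term frequency f = 3 in a 100-token document, k1 = 1.5, b = 0.75, and a trained avgdl of 120:

f, k1, b, doc_len, avgdl = 3, 1.5, 0.75, 100, 120.0
weight = f / (f + k1 * (1 - b + b * doc_len / avgdl))
print(round(weight, 4))  # 0.6957, the stored weight for this term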

Arguments:

  • documents (list[str]): List of documents to encode
  • batch_size (int | None): Optional batch size for encoding

Returns:

list[SparseEmbedding]: Encoded documents
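Putting the two halves together (a sketch; in practice the sparse dot product is performed by the vector database):

doc_embeddings = encoder.encode_documents(
    ["heavy rain expected tonight", "the joke of the day"],
    batch_size=32,
)
query_embedding = encoder.encode_queries(["will it rain today?"])[0]
# the sparse dot product of query_embedding with doc_embeddings[n]
# is the BM25 score of document n for the query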

model

def model(docs: List[str]) -> list[SparseEmbedding]

Encode documents using BM25, applying different encodings for queries and for documents to be indexed.

Arguments:

  • docs: List of documents to encode
  • is_query: If True, use query encoding; otherwise use document encoding

Returns:

List of sparse embeddings