# semantic_router.encoders.bm25

## BM25Encoder Objects

BM25Encoder, running a vectorized version of the ATIRE BM25 algorithm.
Concept:

- BM25 uses scoring between queries & corpus to retrieve the most relevant documents ∈ corpus
- most vector databases (VDB) store embedded documents and score them against received queries for retrieval
- we need to break up the BM25 formula into `encode_queries` and `encode_documents`, with the latter stored in the VDB
- the dot product of `encode_queries(q)` and `encode_documents([D_0, D_1, ...])` is the BM25 score of the documents `[D_0, D_1, ...]` for the given query `q` (see the sketch after the references below)
- we train a BM25 encoder's normalization parameters on a sufficiently large corpus to capture the target language distribution
- these trained parameters allow us to balance TF & IDF of query & documents for retrieval (read more on how BM25 fixes issues with TF-IDF)
ATIRE Paper: https://www.cs.otago.ac.nz/research/student-publications/atire-opensource.pdf

Pinecone Implementation: https://github.com/pinecone-io/pinecone-text/blob/8399f9ff28c4652766c35165c0db9b0eff309077/pinecone_text/sparse/bm25_encoder.py
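To make the decomposition concrete, here is a minimal NumPy sketch, not the library's implementation: the toy corpus, vocabulary size, and parameter values are assumptions, and the helper names merely mirror the method names above. It shows that the dot product of the query-side IDF weights and the document-side normalized term frequencies reproduces the ATIRE BM25 score.

```python
import numpy as np

k1, b = 1.5, 0.75  # assumed normalizer values

# Toy term-frequency matrix: 3 documents x 5 vocabulary terms (assumed data).
tf = np.array(
    [
        [2, 0, 1, 0, 0],
        [0, 1, 0, 3, 0],
        [1, 1, 0, 0, 2],
    ],
    dtype=float,
)
doc_len = tf.sum(axis=1, keepdims=True)   # |D| for each document
avgdl = doc_len.mean()                    # average document length
n_docs = tf.shape[0]                      # corpus size
df = (tf > 0).sum(axis=0)                 # documents containing each term

def encode_documents(tf_row, dl):
    # Document side: TF normalized by document length (right-hand side of BM25).
    return (tf_row * (k1 + 1)) / (tf_row + k1 * (1 - b + b * dl / avgdl))

def encode_queries(query_tf):
    # Query side: ATIRE IDF weights, log(N / df).
    return query_tf * np.log(n_docs / df)

doc_vecs = np.vstack([encode_documents(tf[i], doc_len[i]) for i in range(n_docs)])
q = encode_queries(np.array([1, 0, 0, 1, 0], dtype=float))  # query hits terms 0 and 3

# Dot product of the two encodings = BM25 score of each document for the query.
print(doc_vecs @ q)
```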
**Arguments**:

- `k1` (`float`): normalizer parameter that limits how much a single query term `q_i ∈ q` can affect the score for document `D_n`
- `b` (`float`): normalizer parameter that balances the effect of a single document's length against the average document length
- `corpus_size` (`int`): number of documents in the trained corpus
- `_avg_doc_len` (`float`): float representing the average document length in the trained corpus
- `_documents_containing_word` (`numpy.ndarray`): (1, tokenizer.vocab_size) shaped array, denoting how many documents contain each word in the vocab
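To see why `k1` bounds a single term's influence: when a document has exactly average length, the normalizer `1 - b + b·|D|/avgdl` equals 1, so the term's contribution is `tf·(k1 + 1)/(tf + k1)`, which saturates at `k1 + 1`. A tiny illustration with an assumed `k1 = 1.5`:

```python
# Assumed k1; with |D| = avgdl the length normalizer is 1, so a term's
# contribution is tf * (k1 + 1) / (tf + k1) and can never exceed k1 + 1.
k1 = 1.5
for tf in (1, 2, 5, 20, 100):
    print(tf, round(tf * (k1 + 1) / (tf + k1), 3))
# 1 -> 1.0, 2 -> 1.429, 5 -> 1.923, 20 -> 2.326, 100 -> 2.463 (ceiling: k1 + 1 = 2.5)
```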
#### fit

Trains the encoder weights on the provided routes.

**Arguments**:

- `routes` (`List[Route]`): List of routes to train the encoder on.
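A short usage sketch; the route names and utterances are illustrative, and the import paths assume the package re-exports suggested by the module path above:

```python
from semantic_router import Route
from semantic_router.encoders import BM25Encoder

routes = [
    Route(name="weather", utterances=["what's the forecast", "will it rain today"]),
    Route(name="smalltalk", utterances=["how are you", "tell me a joke"]),
]

encoder = BM25Encoder()
encoder.fit(routes)  # learns corpus statistics such as avgdl and document frequencies
```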
#### encode_queries

Returns BM25 scores for queries using precomputed corpus scores.

**Arguments**:

- `queries` (`list`): List of queries to encode

**Returns**:

`list[SparseEmbedding]`: BM25 scores for each query against the corpus
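Continuing the sketch from `fit` above, a trained encoder can then score queries (the query text is illustrative):

```python
# Assumes `encoder` was fitted as in the `fit` sketch above.
query_embeddings = encoder.encode_queries(["will it rain today"])
# query_embeddings[0] is a SparseEmbedding of per-term query weights.
```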
#### encode_documents

Returns document term frequencies normalized by the document's own length and the average document length of the trained corpus
(this is the right-hand side of the BM25 equation, which gets matmul-ed with the query IDF component).

LaTeX:

$$\frac{f(d_i, D) \cdot (k_1 + 1)}{f(d_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{avgdl}\right)}$$

where:

- $f(d_i, D)$ is the frequency of term $d_i \in D$
- $|D|$ is the document length
- $avgdl$ is the average document length in the trained corpus
**Arguments**:

- `documents` (`list`): List of documents to encode

**Returns**:

`list[SparseEmbedding]`: Encoded documents (as either sparse arrays or dicts)
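The document side is typically computed once and stored in the vector DB (again assuming the fitted `encoder` from the `fit` sketch):

```python
# Assumes the fitted `encoder` from the `fit` sketch above.
doc_embeddings = encoder.encode_documents(["what's the forecast", "tell me a joke"])
# At query time the vector DB takes the dot product of these embeddings
# with the output of encode_queries to obtain the BM25 scores.
```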
#### model

Encodes documents using BM25, applying different encodings depending on whether the input is a query or a document to be indexed.

**Arguments**:

- `docs`: List of documents to encode
- `is_query`: If True, use query encoding, else use document encoding

**Returns**:

List of sparse embeddings
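A hedged sketch of both paths, assuming `model` accepts the documented `docs` and `is_query` parameters as keyword arguments:

```python
# Assumption: the argument names below match the documented parameters.
indexed = encoder.model(docs=["the cat sat on the mat"], is_query=False)
queried = encoder.model(docs=["where is the cat"], is_query=True)
```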