BM25Encoder Objects
- BM25 uses scoring between queries & corpus to retrieve the most relevant documents ∈ corpus
- most vector databases (VDB) store embedded documents and score them versus received queries for retrieval
- we need to break up the BM25 formula into `encode_queries` and `encode_documents`, with the latter to be stored in the VDB
- the dot product of `encode_queries(q)` and `encode_documents([D_0, D_1, ...])` is the BM25 score of the documents `[D_0, D_1, ...]` for the given query `q` (see the sketch below)
- we train a BM25 encoder's normalization parameters on a sufficiently large corpus to capture the target language distribution
- these trained parameters allow us to balance TF & IDF of queries & documents for retrieval (read more on how BM25 fixes issues with TF-IDF)
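The query/document split can be pictured with a self-contained toy implementation. Everything below (the whitespace tokenizer, variable names, and the pinecone-style normalization that drops the constant `(k1 + 1)` factor, which does not affect ranking) is an illustrative assumption, not the library's actual code:

```python
# Illustrative sketch only: a minimal BM25 split into query-side and
# document-side vectors whose dot product is the (rank-equivalent) BM25 score.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
doc_tokens = [doc.split() for doc in corpus]           # toy whitespace tokenizer
vocab = sorted({w for toks in doc_tokens for w in toks})
idx = {w: i for i, w in enumerate(vocab)}

k1, b = 1.5, 0.75                                      # standard BM25 parameters
N = len(corpus)                                        # corpus size
avgdl = np.mean([len(t) for t in doc_tokens])          # average document length
df = np.array([sum(w in t for t in doc_tokens) for w in vocab])  # doc frequency

def encode_document(tokens: list[str]) -> np.ndarray:
    """Document side: term frequency normalized by length and k1/b."""
    tf = np.zeros(len(vocab))
    for w in tokens:
        tf[idx[w]] += 1.0
    return tf / (tf + k1 * (1 - b + b * len(tokens) / avgdl))

def encode_query(tokens: list[str]) -> np.ndarray:
    """Query side: IDF weight for each (known) query term."""
    q = np.zeros(len(vocab))
    for w in set(tokens):
        if w in idx:
            q[idx[w]] = np.log(1 + (N - df[idx[w]] + 0.5) / (df[idx[w]] + 0.5))
    return q

doc_vectors = np.stack([encode_document(t) for t in doc_tokens])  # store in VDB
scores = doc_vectors @ encode_query("cat on a mat".split())
print(scores)  # BM25 relevance of each document for the query
```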
Attributes:
- `k1` (float): normalizer parameter that limits how much a single query term `q_i ∈ q` can affect the score for document `D_n`
- `b` (float): normalizer parameter that balances the effect of a single document's length compared to the average document length
- `_corpus_size` (int): number of documents in the trained corpus
- `_avg_doc_len` (float): average document length in the trained corpus
- `_documents_containing_word` (numpy.ndarray): (1, tokenizer.vocab_size) shaped array denoting how many documents contain each `token_id`
fit
Arguments:
- `routes` (List[Route]): List of routes to train the encoder on.
encode_queries
Arguments:
- `queries` (list): List of queries to encode

Returns:
- `list[SparseEmbedding]`: BM25 scores for each query against the corpus
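For orientation, a hedged usage sketch; the import paths and route definitions are assumptions based on this page rather than verified calls:

```python
# Hedged usage sketch: fit the encoder, then encode queries as sparse vectors.
from semantic_router import Route
from semantic_router.encoders import BM25Encoder

routes = [
    Route(name="weather", utterances=["how's the weather", "is it raining"]),
    Route(name="retrieval", utterances=["find papers on BM25", "search the docs"]),
]

encoder = BM25Encoder()
encoder.fit(routes)  # learns corpus statistics: corpus size, avgdl, doc freqs

sparse = encoder.encode_queries(["will it rain today"])
# sparse -> list[SparseEmbedding], one per query
```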
encode_documents
Returns documents encoded as normalized term frequencies, i.e. the document side of the BM25 formula:

TF(d_i, D) / (TF(d_i, D) + k1 * (1 - b + b * |D| / avgdl))

where:
- d_i ∈ D is a term in document D
- |D| is the document length
- avgdl is the average document length in the trained corpus
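A toy computation shows what the two normalizers do; the numbers are arbitrary and for illustration only:

```python
# Toy numbers only: the document-side BM25 weight defined above.
k1, b, avgdl = 1.5, 0.75, 10.0

def doc_weight(tf: float, doc_len: float) -> float:
    return tf / (tf + k1 * (1 - b + b * doc_len / avgdl))

# TF saturates: repeating a term has diminishing returns (bounded by 1)
print([round(doc_weight(tf, 10), 2) for tf in (1, 2, 5, 50)])  # [0.4, 0.57, 0.77, 0.97]
# Length normalization: the same TF counts for less in a longer document
print(round(doc_weight(3, 5), 2), round(doc_weight(3, 20), 2))  # 0.76 0.53
```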
Arguments:
- `documents` (list): List of documents to encode

Returns:
- `list[SparseEmbedding]`: Encoded documents (as either sparse or dict)
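A hedged sketch of the store-then-score flow described in the concept list; the imports mirror the fit example above, and the VDB upsert is a placeholder, not a real API:

```python
# Hedged sketch: encode documents once, store them, score via dot product later.
from semantic_router import Route
from semantic_router.encoders import BM25Encoder

encoder = BM25Encoder()
encoder.fit([Route(name="demo", utterances=["BM25 ranks documents"])])

docs = ["BM25 is a ranking function", "today is sunny and warm"]
sparse_docs = encoder.encode_documents(docs)  # one SparseEmbedding per document
# a VDB upsert would store sparse_docs here; at query time the VDB scores them
# against encode_queries(...) vectors via dot product
```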
model
Arguments:
- `docs`: List of documents to encode
- `is_query`: If True, use query encoding, else use document encoding
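The shared entry point can be pictured as a thin dispatcher over the two encodings; the sketch below illustrates that pattern with made-up helper names and is not the library's implementation:

```python
from typing import Callable, Dict, List

# Illustrative dispatcher: route docs to query-side or document-side encoding.
def make_model(encode_query_side: Callable[[str], Dict[str, str]],
               encode_document_side: Callable[[str], Dict[str, str]]):
    def model(docs: List[str], is_query: bool = False) -> List[Dict[str, str]]:
        encode = encode_query_side if is_query else encode_document_side
        return [encode(d) for d in docs]
    return model

# toy encoders just to make the sketch executable
model = make_model(lambda d: {"side": "query", "text": d},
                   lambda d: {"side": "document", "text": d})
print(model(["hello world"], is_query=True))   # query-side encoding
print(model(["hello world"]))                  # document-side encoding
```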

