> ## Documentation Index
> Fetch the complete documentation index at: https://docs.aurelio.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# semantic_router.encoders.bm25

## BM25Encoder Objects

```python theme={null}
class BM25Encoder(SparseEncoder, FittableMixin, AsymmetricSparseMixin)
```

BM25Encoder, running a vectorized version of ATIRE BM25 algorithm

Concept:

* BM25 uses scoring between queries & corpus to retrieve the most relevant documents ∈ corpus
* most vector databases (VDB) store embedded documents and score them versus received queries for retrieval
* we need to break up the BM25 formula into `encode_queries` and `encode_documents`, with the latter to be stored in VDB
* dot product of `encode_queries(q)` and `encode_documents([D_0, D_1, ...])` is the BM25 score of the documents `[D_0, D_1, ...]` for the given query `q`
* we train a BM25 encoder's normalization parameters on a sufficiently large corpus to capture target language distribution
* these trained parameter allow us to balance TF & IDF of query & documents for retrieval (read more on how BM25 fixes issues with TF-IDF)

ATIRE Paper: [https://www.cs.otago.ac.nz/research/student-publications/atire-opensource.pdf](https://www.cs.otago.ac.nz/research/student-publications/atire-opensource.pdf)
Pinecone Implementation: [https://github.com/pinecone-io/pinecone-text/blob/8399f9ff28c4652766c35165c0db9b0eff309077/pinecone\_text/sparse/bm25\_encoder.py](https://github.com/pinecone-io/pinecone-text/blob/8399f9ff28c4652766c35165c0db9b0eff309077/pinecone_text/sparse/bm25_encoder.py)

**Arguments**:

* `k1` (`float`): normalizer parameter that limits how much a single query term `q_i ∈ q` can affect score for document `D_n`
* `encode_documents`0 (`float`): normalizer parameter that balances the effect of a single document length compared to the average document length
* `encode_documents`2 (`encode_documents`3): number of documents in the trained corpus
* `encode_documents`4 (`encode_documents`5): float representing the average document length in the trained corpus
* `encode_documents`6 (`encode_documents`7numpy.ndarray`encode_documents`8): (1, tokenizer.vocab\_size) shaped array, denoting how many documents contain `encode_documents`9

#### fit

```python theme={null}
def fit(routes: List[Route]) -> "BM25Encoder"
```

Trains the encoder weights on the provided routes.

**Arguments**:

* `routes` (`List[Route]`): List of routes to train the encoder on.

#### encode\_queries

```python theme={null}
def encode_queries(queries: list[str]) -> list[SparseEmbedding]
```

Returns BM25 scores for queries using precomputed corpus scores.

**Arguments**:

* `queries` (`list`): List of queries to encode

**Returns**:

`list[SparseEmbedding]`: BM25 scores for each query against the corpus

#### encode\_documents

```python theme={null}
def encode_documents(documents: list[str],
                     batch_size: int | None = None) -> list[SparseEmbedding]
```

Returns document term frequency normed by itself & average trained corpus length

(This is the right-hand side of the BM25 equation, which gets matmul-ed with the query IDF component)

LaTeX: $\frac{f(d_i, D)}{f(d_i, D) + k_1 \times (1 - b + b \times \frac{|D|}{avgdl})}$
where:
f(d\_i, D) is frequency of term `d_i ∈ D`
|D| is the document length
avgdl is average document length in trained corpus

**Arguments**:

* `documents` (`list`): List of queries to encode

**Returns**:

`list[SparseEmbedding]`: Encoded queries (as either sparse or dict)

#### model

```python theme={null}
def model(docs: List[str]) -> list[SparseEmbedding]
```

Encode documents using BM25, with different encoding for queries vs documents to be indexed.

**Arguments**:

* `docs`: List of documents to encode
* `is_query`: If True, use query encoding, else use document encoding

**Returns**:

List of sparse embeddings
