Parameter | Type | Default | Description |
---|---|---|---|
input | Union[str, List[str]] | Required | Text or list of texts to embed |
input_type | str | Required | Either "queries" or "documents", depending on the use case |
model | str | "bm25" | Embedding model to use (currently only "bm25" is available) |
timeout | int | 30 | Maximum seconds to wait for API response |
retries | int | 3 | Number of retry attempts for failed requests |
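
The constraints in the parameter table can be checked on the client side before a request is sent. The following is a minimal sketch, not the library's actual client: `EmbedRequest` and its validation rules are hypothetical names introduced for illustration, and only the parameter names, types, and defaults come from the table.

```python
from dataclasses import dataclass
from typing import List, Union

# Allowed values taken from the parameter table.
VALID_INPUT_TYPES = {"queries", "documents"}
VALID_MODELS = {"bm25"}  # the table lists "bm25" as the only available model


@dataclass
class EmbedRequest:
    """Hypothetical container mirroring the parameter table (not the real client)."""

    input: Union[str, List[str]]
    input_type: str
    model: str = "bm25"
    timeout: int = 30   # seconds to wait for the API response
    retries: int = 3    # retry attempts for failed requests

    def __post_init__(self):
        # Normalize a single string to a one-element list.
        if isinstance(self.input, str):
            self.input = [self.input]
        if not self.input or not all(isinstance(t, str) for t in self.input):
            raise ValueError("input must be a string or a non-empty list of strings")
        if self.input_type not in VALID_INPUT_TYPES:
            raise ValueError(f"input_type must be one of {sorted(VALID_INPUT_TYPES)}")
        if self.model not in VALID_MODELS:
            raise ValueError(f"unsupported model: {self.model!r}")
        # Assumed sanity limits; the source does not state valid ranges.
        if self.timeout <= 0 or self.retries < 0:
            raise ValueError("timeout must be positive and retries non-negative")
```

For example, `EmbedRequest(input="what is bm25?", input_type="queries")` picks up the defaults for `model`, `timeout`, and `retries`, while a misspelled `input_type` such as `"query"` fails early with a `ValueError`.
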
The `input_type` parameter accepts two possible values:
Input Type | Use Case | Description |
---|---|---|
"documents" | Creating a searchable knowledge base | Optimizes embeddings for document representation in a vector database |
"queries" | Querying a knowledge base | Optimizes embeddings for query representation when searching against embedded documents |
The `indices` correspond to token positions in the vocabulary, while the `values` represent the importance of each token for the given text.
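
To make the index-value representation concrete, here is a minimal sketch of how a query embedding could be scored against a document embedding. It assumes each embedding is available as two parallel lists, one of indices and one of values. The function name `sparse_dot` and the sample numbers are illustrative and do not come from the API.

```python
from typing import List


def sparse_dot(
    q_indices: List[int],
    q_values: List[float],
    d_indices: List[int],
    d_values: List[float],
) -> float:
    """Dot product of two sparse vectors given as parallel index/value lists."""
    # Map each document token index to its weight for O(1) lookup.
    doc_weights = dict(zip(d_indices, d_values))
    # Only token indices present in both vectors contribute to the score.
    return sum(v * doc_weights.get(i, 0.0) for i, v in zip(q_indices, q_values))


# Illustrative values only: the query shares tokens 3 and 17 with the document.
score = sparse_dot([3, 17], [1.0, 1.0], [3, 8, 17], [0.5, 1.2, 2.0])
print(score)  # 2.5
```

Token 8 appears only in the document, so it adds nothing to the score. This is why sparse embeddings reward exact term overlap.
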
`EmbeddingUsage` provides token consumption metrics:
`EmbeddingDataObject`:
Characteristic | Sparse BM25 Embeddings | Dense Embeddings |
---|---|---|
Representation | Index-value pairs for non-zero elements | Fixed-dimension vectors of continuous values |
Storage Efficiency | High (only stores non-zero values) | Low (stores all dimensions) |
Term Matching | Excellent for exact term/keyword matching | May miss exact terminology |
Domain Adaptation | Strong for specialized vocabulary domains | May require fine-tuning for domains |
Interpretability | Higher (indices correspond to vocabulary terms) | Lower (dimensions not directly interpretable) |
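
The storage-efficiency row can be illustrated with a small sketch that expands a sparse vector into a fully materialized one and compares how many numbers each form stores. The vocabulary size of 20 and the sample values are assumptions chosen for readability. Learned dense embeddings are not produced this way, so the example speaks only to the storage trade-off, not to how dense models work.

```python
from typing import List


def to_dense(indices: List[int], values: List[float], vocab_size: int) -> List[float]:
    """Expand index/value pairs into a full vector with explicit zeros."""
    dense = [0.0] * vocab_size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Illustrative sparse vector with three non-zero entries.
indices = [3, 8, 17]
values = [0.5, 1.2, 2.0]
vocab_size = 20  # assumed toy vocabulary; real vocabularies are far larger

dense = to_dense(indices, values, vocab_size)

sparse_numbers_stored = len(indices) + len(values)  # one index plus one value per entry
dense_numbers_stored = len(dense)                   # every dimension, zeros included

print(sparse_numbers_stored, dense_numbers_stored)  # 6 20
```

The gap widens as the vocabulary grows, because the sparse form scales with the number of distinct tokens in the text rather than with the size of the vocabulary.
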