llama.cpp. Using llama.cpp also enables the use of quantized GGUF models, reducing the memory footprint of deployed models, allowing even 13-billion parameter models to run with hardware acceleration on an Apple M1 Pro chip.
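If you do not already have a GGUF file locally, one way to fetch one is via the Hugging Face Hub. The repository id and filename below are illustrative assumptions, not values prescribed by this guide:

```python
from huggingface_hub import hf_hub_download

# Download a 4-bit quantized GGUF build of Mistral-7B-Instruct.
# Repo id and filename are assumptions; substitute the GGUF you want to run.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(model_path)  # local cache path, later passed to llama_cpp.Llama(model_path=...)
```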
Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc., please refer to the abetlen/llama-cpp-python repository README.md.
We use a HuggingFaceEncoder with sentence-transformers/all-MiniLM-L6-v2 (the default) as an embedding model.
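As a minimal sketch, assuming HuggingFaceEncoder is importable from semantic_router.encoders, the encoder can be set up like this:

```python
from semantic_router.encoders import HuggingFaceEncoder

# sentence-transformers/all-MiniLM-L6-v2 is the default model, so no argument
# is strictly required; the name is spelled out here only for clarity.
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")
```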
llama.cpp LLM

We instantiate a llama-cpp-python llama_cpp.Llama LLM, and then pass it to the semantic_router.llms.LlamaCppLLM wrapper class (a sketch follows the parameter list below).
For llama_cpp.Llama, there are a few parameters you should pay attention to:

- n_gpu_layers: how many LLM layers to offload to the GPU (to offload the entire model, pass -1; for CPU-only execution, pass 0)
- n_ctx: the context size, which limits the number of tokens that can be passed to the LLM (this is bounded by the model's internal maximum context size, in this case 8000 tokens for Mistral-7B-Instruct)
- verbose: if False, silences output from llama.cpp
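Putting this together, the sketch below shows the instantiation described above. The GGUF path, the n_ctx value, and the LlamaCppLLM arguments (name, llm, max_tokens) are assumptions rather than verified values:

```python
from llama_cpp import Llama
from semantic_router.llms import LlamaCppLLM

# Load the quantized GGUF model with llama.cpp.
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # path to your GGUF file (assumed)
    n_gpu_layers=-1,  # offload all layers to the GPU; pass 0 for CPU-only execution
    n_ctx=2048,       # context size; must stay within the model's maximum (8000 tokens here)
    verbose=False,    # silence llama.cpp output
)

# Wrap the llama.cpp model so it can be used as a semantic-router LLM.
llm = LlamaCppLLM(name="Mistral-7B-Instruct", llm=_llm, max_tokens=None)
```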
For explanations of the other parameters, refer to the llama-cpp-python API Reference.