llama.cpp
Using llama.cpp also enables the use of quantized GGUF models, reducing the memory footprint of deployed models and allowing even 13-billion-parameter models to run with hardware acceleration on an Apple M1 Pro chip.
Full Example
Installing the library
Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc., please refer to the abetlen/llama-cpp-python repository README.md.
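The library itself can then be installed from PyPI; a minimal sketch, assuming the package exposes a local extra that pulls in llama-cpp-python and the local embedding dependencies (the extra name may differ between releases):

```python
# Notebook-style install; the "[local]" extra name is an assumption and may
# vary between semantic-router releases.
!pip install -qU "semantic-router[local]"
```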
Download the Mistral 7B Instruct 4-bit GGUF files
We will be using Mistral 7B Instruct, quantized as a 4-bit GGUF file, which offers a good balance between performance and the ability to deploy on consumer hardware.
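One way to fetch the quantized weights is via huggingface_hub; the repository and file names below point at a community Q4_K_M build of Mistral-7B-Instruct and are assumptions; substitute whichever 4-bit GGUF file you downloaded:

```python
from huggingface_hub import hf_hub_download

# Repo and filename are assumptions; any 4-bit GGUF build of Mistral 7B
# Instruct will work the same way.
model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
)
print(model_path)
```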
Initializing Dynamic Routes
Similar to dynamic routes in other examples, we will be initializing some dynamic routes that make use of LLMs for function calling.
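As a sketch of what such routes can look like (the get_time function, its utterances, and the schema helper below are illustrative; the exact schema helper and Route parameter names vary between semantic-router versions):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema  # helper name is an assumption


def get_time(timezone: str) -> str:
    """Return the current time in the given IANA timezone, e.g. 'Europe/Rome'."""
    return datetime.now(ZoneInfo(timezone)).strftime("%H:%M")


# Dynamic route: attaching a function schema tells the router to have the LLM
# extract get_time's arguments whenever this route is matched.
time_route = Route(
    name="get_time",
    utterances=[
        "what time is it in New York?",
        "tell me the current time in London",
    ],
    function_schema=get_schema(get_time),  # may be function_schemas=[...] in newer versions
)

# Static route with no function schema, for ordinary conversation.
chitchat = Route(
    name="chitchat",
    utterances=["how are you?", "lovely weather today", "let's have a chat"],
)

routes = [time_route, chitchat]
```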
Encoders
You can use alternative Encoders; however, in this example we want to showcase a fully local Semantic Router execution, so we are going to use a HuggingFaceEncoder with sentence-transformers/all-MiniLM-L6-v2 (the default) as the embedding model.
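A minimal sketch of instantiating the encoder (assuming HuggingFaceEncoder is exposed under semantic_router.encoders):

```python
from semantic_router.encoders import HuggingFaceEncoder

# Defaults to sentence-transformers/all-MiniLM-L6-v2 and runs locally.
encoder = HuggingFaceEncoder()
```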
llama.cpp LLM
From here, we can go ahead and instantiate our llama-cpp-python llama_cpp.Llama LLM and then pass it to the semantic_router.llms.LlamaCppLLM wrapper class (see the sketch at the end of this section).
For llama_cpp.Llama, there are a couple of parameters you should pay attention to:

- n_gpu_layers: how many LLM layers to offload to the GPU (if you want to offload the entire model, pass -1; for CPU execution, pass 0)
- n_ctx: the context size, which limits the number of tokens that can be passed to the LLM (this is bounded by the model's internal maximum context size, in this case 8000 tokens for Mistral-7B-Instruct)
- verbose: if False, silences output from llama.cpp
For explanations of other parameters, refer to the llama-cpp-python API Reference.
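A sketch of wiring everything together, reusing the model_path, routes, and encoder from the earlier steps; the LlamaCppLLM constructor arguments and the RouteLayer class name should be checked against your installed semantic-router version:

```python
from llama_cpp import Llama
from semantic_router import RouteLayer  # renamed SemanticRouter in newer releases
from semantic_router.llms.llamacpp import LlamaCppLLM

enable_gpu = True  # set to False to run entirely on CPU

# Instantiate the llama-cpp-python model from the GGUF file downloaded earlier.
_llm = Llama(
    model_path=model_path,                 # path to the 4-bit GGUF file
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads all layers, 0 keeps everything on CPU
    n_ctx=2048,                            # context window, within the model's 8000-token limit
    verbose=False,                         # silence llama.cpp output
)

# Wrap the model so the router can use it for function calling on dynamic routes.
llm = LlamaCppLLM(name="Mistral-7B-Instruct", llm=_llm, max_tokens=None)

# Combine the encoder, routes, and local LLM into a route layer.
rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)
```

With the layer in place, a static query returns a plain route choice, while a dynamic query has the local LLM extract the function arguments:

```python
rl("how's the weather today?")            # matches the static chitchat route
rl("what time is it in Rome right now?")  # matches get_time; the LLM fills in the timezone argument
```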