Local execution
There are many reasons users might choose to roll their own LLMs rather than use a third-party service. Whether it’s due to cost, privacy, or compliance, Semantic Router supports the use of “local” LLMs through llama.cpp.
Using llama.cpp also enables the use of quantized GGUF models, reducing the memory footprint of deployed models and allowing even 13-billion-parameter models to run with hardware acceleration on an Apple M1 Pro chip.
Full Example
Below is an example of using Semantic Router with Mistral-7B-Instruct, quantized to reduce its memory footprint.
Installing the library
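A typical installation looks like the following, assuming the library’s local extra (which pulls in llama-cpp-python and the local embedding dependencies); check the project README for the current extras names:

```bash
pip install -qU "semantic-router[local]"
```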
Note: if you require hardware acceleration via BLAS, CUDA, Metal, etc., please refer to the abetlen/llama-cpp-python repository README.md.
If you’re running on Apple silicon you can run the following to compile with Metal hardware acceleration:
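For example, something along these lines; the exact CMake flag depends on your llama-cpp-python version, so confirm it against the abetlen/llama-cpp-python README:

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir -qU "semantic-router[local]"
```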
Download the Mistral 7B Instruct 4-bit GGUF files
We will be using Mistral 7B Instruct, quantized as a 4-bit GGUF file, which offers a good balance between performance and the ability to deploy on consumer hardware.
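One way to fetch a 4-bit quantization is shown below; the TheBloke/Mistral-7B-Instruct-v0.2-GGUF repository and the Q4_0 filename are assumptions here, so substitute whichever GGUF file you prefer:

```bash
curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf?download=true" \
  -o ./mistral-7b-instruct-v0.2.Q4_0.gguf
ls mistral-7b-instruct-v0.2.Q4_0.gguf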
Initializing Dynamic Routes
Similar to dynamic routes in other examples, we will initialize some dynamic routes that make use of LLMs for function calling.
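A minimal sketch is shown below. It assumes the Route class and the get_schema helper from semantic_router; the get_time function, its utterances, and the chitchat route are illustrative, and depending on your installed version the route parameter may be function_schema or function_schemas (a list):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

from semantic_router import Route
from semantic_router.utils.function_call import get_schema


def get_time(timezone: str) -> str:
    """Return the current time in a given IANA timezone, e.g. 'America/New_York'."""
    now = datetime.now(ZoneInfo(timezone))
    return now.strftime("%H:%M")


# Build a schema from the function signature so the LLM can be asked
# to fill in its arguments.
time_schema = get_schema(get_time)

time_route = Route(
    name="get_time",
    utterances=[
        "what is the time in new york city?",
        "what is the time in london?",
        "I live in Rome, what time is it?",
    ],
    function_schema=time_schema,  # some versions expect function_schemas=[time_schema]
)

# A purely static route for comparison.
chitchat = Route(
    name="chitchat",
    utterances=[
        "how's the weather today?",
        "how are things going?",
        "lovely weather today",
    ],
)

routes = [time_route, chitchat]
```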
Encoders
You can use alternative Encoders; however, in this example we want to showcase a fully local Semantic Router execution, so we are going to use a HuggingFaceEncoder with sentence-transformers/all-MiniLM-L6-v2 (the default) as an embedding model.
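A sketch, assuming HuggingFaceEncoder is importable from semantic_router.encoders:

```python
from semantic_router.encoders import HuggingFaceEncoder

# Defaults to sentence-transformers/all-MiniLM-L6-v2, which runs locally.
encoder = HuggingFaceEncoder()
```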
llama.cpp LLM
From here, we can go ahead and instantiate our llama-cpp-python llama_cpp.Llama LLM, and then pass it to the semantic_router.llms.LlamaCppLLM wrapper class.
For llama_cpp.Llama, there are a few parameters you should pay attention to:
- n_gpu_layers: how many LLM layers to offload to the GPU (if you want to offload the entire model, pass -1; for CPU execution, pass 0)
- n_ctx: context size, which limits the number of tokens that can be passed to the LLM (this is bounded by the model’s internal maximum context size, in this case 8000 tokens for Mistral-7B-Instruct)
- verbose: if False, silences output from llama.cpp
For explanations of other parameters, refer to the llama-cpp-python API Reference.
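A sketch of the instantiation follows, assuming the GGUF file downloaded earlier and the names and parameters shown here; in newer releases the route layer class is named SemanticRouter rather than RouteLayer:

```python
from llama_cpp import Llama

from semantic_router import RouteLayer  # newer releases: SemanticRouter
from semantic_router.llms import LlamaCppLLM

enable_gpu = True  # set to False to force CPU-only execution

_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",
    n_gpu_layers=-1 if enable_gpu else 0,  # -1 offloads every layer to the GPU
    n_ctx=2048,  # context window; must not exceed the model's maximum
    verbose=False,  # silence llama.cpp's own logging
)

# Wrap the llama.cpp model so Semantic Router can use it for function calling.
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=None)

rl = RouteLayer(encoder=encoder, routes=routes, llm=llm)
```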
Let’s test our router with some queries:
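For example, a query that should match the static chitchat route defined above:

```python
rl("how's the weather today?")
```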
This should output:
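Something like the following (illustrative; the exact RouteChoice fields depend on your version):

```
RouteChoice(name='chitchat', function_call=None, similarity_score=None)
```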
Now let’s try a time-related query that will trigger our function calling:
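Using the illustrative get_time route from earlier:

```python
rl("what is the time in new york city?")
```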
This should output something like:
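Illustrative; the LLM extracts the timezone argument for get_time:

```
RouteChoice(name='get_time', function_call={'timezone': 'America/New_York'}, similarity_score=None)
```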
Let’s try more examples:
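For instance, again using the hypothetical route set above:

```python
rl("what is the time in Rome right now?")
```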
Output:
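Illustrative:

```
RouteChoice(name='get_time', function_call={'timezone': 'Europe/Rome'}, similarity_score=None)
```

And one more:

```python
rl("what is the time in Bangkok right now?")
```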
Output:
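Illustrative:

```
RouteChoice(name='get_time', function_call={'timezone': 'Asia/Bangkok'}, similarity_score=None)
```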
Cleanup
Once done, if you’d like to delete the downloaded model, you can do so with the following:
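For example, if you downloaded the file into the working directory as above:

```bash
rm ./mistral-7b-instruct-v0.2.Q4_0.gguf
```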