Llama.cpp
My Setup
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - "127.0.0.1:4010:80"
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
And the presets file it points to, /models/presets.ini:

version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on
Optimization Tips
No KV Offload
no-kv-offload works surprisingly well! In my experience it doesn't cost much performance (Qwen 2.5 Coder 3B still runs fast enough), and it frees up VRAM for more models.
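For a rough sense of what gets freed (a back-of-the-envelope estimate with illustrative numbers, not measurements from my setup): the KV cache holds about 2 × n_layers × n_kv_heads × head_dim × ctx_size elements, and q8_0 stores each element in a little over one byte. A GQA model with, say, 40 layers and 8 KV heads of dimension 128 at an 8192-token context therefore needs roughly 2 × 40 × 8 × 128 × 8192 ≈ 0.7 GB of cache, which adds up quickly on a 12 GB card once several models are resident.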
Don't mix K and V Quant Types
For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the two kernels might have overflowed some sort of cache?
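Concretely, the matched pairing I stick with is the one in the presets above:

cache-type-k = q8_0
cache-type-v = q8_0

whereas a mismatched pair like the sketch below is the kind of thing that slowed prompt processing down for me (q4_0 is just an illustrative second type, not necessarily the exact combination I tested):

cache-type-k = q8_0
cache-type-v = q4_0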
Don't use Flash Attention with Gemma 3
Once again, prompt processing speed suffers.
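If I were to add a Gemma 3 entry to the presets file, it would start out something like this (the path is a placeholder rather than a model from my setup, and I drop the quantized cache types since, as far as I know, a quantized V cache still requires flash attention):

[Gemma 3]
model = /models/gemma-3-12b-it-Q4_K_M.gguf
ctx-size = 8192
flash-attn = off
no-kv-offload = on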
There's no need to Offload Experts to the GPU
I am able to run Qwen 3 VL 30B A3B (30B params in total, ~3B active per token) rather well on my 12 GB VRAM GPU (16 tokens per second). The trick is to use cpu-moe, which keeps the MoE expert weights in system RAM while the attention and other shared layers stay on the GPU. You can also use n-cpu-moe to choose exactly how many layers' experts stay on the CPU, leaving the remaining layers' experts in VRAM. However, I found that even with all of the experts out of VRAM, it was still fast enough.
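For reference, the partially offloaded variant would look roughly like this in the presets file (assuming n-cpu-moe is accepted there the same way cpu-moe is; the section name and the value 24 are illustrative):

[Qwen 3 VL partial offload]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on
n-cpu-moe = 24

Here the experts of the first 24 layers stay in system RAM and the remaining layers' experts go to VRAM, which is the knob to turn if you have headroom left over.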