Llama.cpp

My Setup

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - "127.0.0.1:4010:80"
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
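
To spin this up and check that it's alive (a minimal sketch; the port comes from the compose file above, and /health and /v1/chat/completions are standard llama-server endpoints):

# Start the stack in the background
docker compose up -d

# The server only listens on loopback, port 4010
curl http://127.0.0.1:4010/health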

The preset file it points at (presets.ini, mounted read-only at /models/presets.ini):

version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on
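
With several presets loaded, each section shows up as its own model. A hedged sketch (I'm assuming the section name is what goes in the model field of a request; check the /v1/models output on your build to confirm):

# List the models the server currently knows about
curl http://127.0.0.1:4010/v1/models

# Target one preset by name (assumption: section names from presets.ini
# are the model IDs; verify against /v1/models)
curl http://127.0.0.1:4010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen 2.5 Coder", "messages": [{"role": "user", "content": "hello"}]}'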

Optimization Tips

No KV Offload

no-kv-offload works surprisingly well! In my experience it doesn't cost much performance (Qwen 2.5 Coder 3B is still fast enough), and it frees up VRAM for more models.
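
The same settings as a standalone llama-server run, outside the preset file (a sketch; --no-kv-offload also has the short form -nkvo):

# Keep the KV cache in system RAM instead of VRAM, quantized to q8_0
llama-server -m /models/Qwen2.5-Coder-3B-Q4_K_M.gguf \
  --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on \
  --no-kv-offload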

Don't mix K and V Quant Types

For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the kernel might be overflowing some sort of cache, though neither of us has verified that.
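
Concretely, I keep the two types identical (a sketch reusing the Coder model from above; note that a quantized V cache needs flash attention enabled):

# Fine: matching K and V quant types
llama-server -m /models/Qwen2.5-Coder-3B-Q4_K_M.gguf --flash-attn on -ctk q8_0 -ctv q8_0

# Slow for me: mismatched types
# llama-server -m /models/Qwen2.5-Coder-3B-Q4_K_M.gguf --flash-attn on -ctk q8_0 -ctv q4_0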

Don't use Flash Attention with Gemma 3

Once again, prompt processing speed suffers.
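
So a Gemma 3 preset gets flash-attn = off, and the KV cache stays unquantized (a quantized V cache would need flash attention). As a plain invocation, with a hypothetical Gemma 3 model path:

# Flash attention off for Gemma 3; default (unquantized) KV cache
llama-server -m /models/gemma-3-4b-it-Q4_K_M.gguf \
  --ctx-size 8192 \
  --flash-attn off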

There's no need to Offload Experts to the GPU

I am able to run Qwen 3 30B A3B (30B parameters in total, ~3B active per token) rather well on my 12GB VRAM GPU, at about 16 tokens per second. The trick is cpu-moe, or n-cpu-moe if you want to choose how many layers' experts stay on the CPU: the dense layers (roughly the always-active 3B) stay on the GPU, while the expert weights live in system RAM.
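
As a plain invocation (a sketch; the Qwen 3 model path is hypothetical, flag names as in current llama.cpp):

# Dense/attention tensors on the GPU, all MoE expert tensors in system RAM
llama-server -m /models/Qwen3-30B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --cpu-moe

# Or keep only the experts of the first 24 layers on the CPU,
# if some experts still fit in VRAM
llama-server -m /models/Qwen3-30B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 24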