Llama.cpp

My Setup

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - 127.0.0.1:4010:80
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
And the presets file that LLAMA_ARG_MODELS_PRESET points at, /models/presets.ini:

version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on
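
To use it, bring the stack up with docker compose up -d and talk to the OpenAI-compatible endpoint on the mapped port. A quick smoke test (I'm assuming here that the preset section name is what goes in the model field for routing):

curl http://127.0.0.1:4010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen 2.5 Coder", "messages": [{"role": "user", "content": "Say hi."}]}'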

Optimization Tips

No KV Offload

no-kv-offload works surprisingly well! In my experience it doesn't cost much performance (Qwen 2.5 Coder 3B still runs fast enough), and it frees up VRAM for more models.
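
For reference, the preset keys map straight onto llama-server's CLI flags, so the Coder entry above minus the compose wrapper is roughly:

llama-server -m /models/Qwen2.5-Coder-3B-Q4_K_M.gguf --ctx-size 8192 \
  --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --no-kv-offload
# compare VRAM use with and without --no-kv-offload:
nvidia-smi --query-gpu=memory.used --format=csv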

Don't mix K and V Quant Types

For me at least, mixing different K and V cache quant types made prompt processing much slower. An acquaintance's guess was that the two different dequantization kernels overflow some sort of cache, but neither of us is sure.
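
Concretely, the difference was between invocations like these (the model path is a placeholder; --cache-type-k and --cache-type-v also have the short forms -ctk and -ctv):

# matched cache types, what I settled on
llama-server -m /models/some-model.gguf --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0
# mismatched types, noticeably slower prompt processing for me
llama-server -m /models/some-model.gguf --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0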

Don't use Flash Attention with Gemma 3

Once again, prompt processing speed suffers.
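
In preset terms the fix is just flash-attn = off for that model; as a bare command it would be something like this (the Gemma 3 file name is a placeholder, it's not part of my stack above):

llama-server -m /models/some-gemma-3.gguf --ctx-size 8192 --flash-attn off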

There's no need to Offload Experts to the GPU

I am able to run Qwen 3 VL 30B A3B (30B parameters in total, about 3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is cpu-moe, which keeps the expert weights in system RAM and runs them on the CPU, while the attention and other dense tensors (roughly the 3B that's always active) stay on the GPU. You can also use n-cpu-moe to choose exactly how many layers' experts stay on the CPU, with the rest offloaded to VRAM. However, I found that even with all of the experts out of VRAM, it was still fast enough.
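
If you do have VRAM to spare, n-cpu-moe gives you a hybrid split. A sketch (the layer count of 24 is just an illustration, not something I tuned):

llama-server -m /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf \
  --mmproj /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf \
  --ctx-size 8192 --flash-attn on --n-cpu-moe 24
# the first 24 layers' experts stay on the CPU; the remaining layers' experts go to VRAM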