Llama.cpp
My Setup
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - 127.0.0.1:4010:80
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
And the presets file referenced by LLAMA_ARG_MODELS_PRESET (/models/presets.ini):

version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on
Optimization Tips
No KV Offload
no-kv-offload works surprisingly well! It keeps the KV cache in system RAM instead of VRAM, which frees up VRAM for more models, and in my experience it doesn't cost much performance (Qwen 2.5 Coder 3B still runs fast enough).
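As a minimal before/after sketch (the section names are just illustrative labels; the keys and model path are the same ones used in the presets above), the only difference is the one extra key:

[Coder, KV cache in VRAM]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on

[Coder, KV cache in system RAM]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on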
Don't mix K and V Quant Types
For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the two different kernels might be overflowing some sort of cache, but that's just speculation.
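To make that concrete: keeping the two cache types identical, as in the presets above, looks like

cache-type-k = q8_0
cache-type-v = q8_0

whereas a mixed pair like the one below (q8_0 for K with q4_0 for V, purely as an example) is the kind of combination that was slower for me:

cache-type-k = q8_0
cache-type-v = q4_0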
Don't use Flash Attention with Gemma 3
Once again, prompt processing speed suffers.
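As a sketch, a Gemma 3 preset with flash attention explicitly disabled might look like the following. The model path is hypothetical (I don't run this exact file), the other keys mirror the presets above, and I'm assuming flash-attn accepts off the same way it accepts on:

[Gemma 3 example]
model = /models/gemma-3-12b-it-Q4_K_M.gguf
ctx-size = 8192
flash-attn = off
no-kv-offload = on

I left out the q8_0 cache types here, since as far as I know quantizing the V cache requires flash attention to be enabled.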
There's no need to Offload Experts to the GPU
I am able to run Qwen 3 VL 30B A3B (30B params in total, ~3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use cpu-moe, which keeps the MoE expert weights in system RAM while the dense layers (roughly the always-active 3B) stay on the GPU. You can also use n-cpu-moe to choose how many layers' experts are kept on the CPU, leaving the experts of the remaining layers in VRAM; a sketch of that variant follows below. However, I found that even with all of the experts out of VRAM, it was still fast enough.
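If you do want to keep some experts in VRAM, here is a minimal sketch of the partial variant, assuming the presets file accepts the same n-cpu-moe key as the CLI flag. The value 36 is just an illustrative number to tune against your VRAM; the model paths and other keys are the same ones used in the presets above:

[Qwen 3 VL, partial expert offload example]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-cpu-moe = 36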