Llama.cpp

My Setup

The compose file:

services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - 127.0.0.1:4010:80
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
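
Once the container is up (docker compose up -d), I like to sanity-check that the server is reachable on the published port. A minimal sketch using only the Python standard library, assuming the usual llama-server /health endpoint behaves the same way in this multi-model setup:

# check_server.py - quick reachability check for the llama-server container.
# Assumes the compose file above is running and that the standard /health
# endpoint is still exposed when the model presets are in use.
import urllib.request

URL = "http://127.0.0.1:4010/health"

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        print(f"HTTP {resp.status}: {resp.read().decode('utf-8')}")
except Exception as exc:  # connection refused, timeout, non-200 status, ...
    print(f"server not ready: {exc}")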

The model presets (presets.ini):

version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on
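
To actually talk to one of these presets I go through the OpenAI-compatible chat completions endpoint that llama-server exposes. A rough sketch, standard library only; I'm assuming here that the "model" field is matched against the preset section name, so adjust it to whatever your server actually lists under /v1/models:

# chat_example.py - send one chat request to the llama-server container.
# Assumption: with the presets file above, the "model" field selects a preset
# by its section name ("Qwen 2.5 Coder" here); check what the server lists
# under /v1/models if this 404s or falls back to a default model.
import json
import urllib.request

payload = {
    "model": "Qwen 2.5 Coder",
    "messages": [
        {"role": "user", "content": "Write a Python one-liner that reverses a string."}
    ],
    "max_tokens": 128,
}

req = urllib.request.Request(
    "http://127.0.0.1:4010/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=120) as resp:
    reply = json.loads(resp.read())
    print(reply["choices"][0]["message"]["content"])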

Optimization Tips

No KV Offload

no-kv-offload works surprisingly well! It keeps the KV cache in system RAM instead of VRAM, doesn't cost much performance in my experience (Qwen 2.5 Coder 3B is still fast enough), and frees up VRAM for more models.
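
To get a feel for how much VRAM this actually frees up, here is a back-of-the-envelope sketch. The layer and head counts below are placeholder values, not the real Qwen 2.5 Coder 3B dimensions, so plug in the numbers from your own model's GGUF metadata:

# kv_cache_estimate.py - rough size of the KV cache that no-kv-offload keeps
# in system RAM instead of VRAM. The model dimensions are made-up placeholders;
# read the real ones from the GGUF metadata of whatever you are running.
N_LAYERS = 36            # placeholder: number of transformer layers
N_KV_HEADS = 4           # placeholder: KV heads (GQA), not attention heads
HEAD_DIM = 128           # placeholder: per-head dimension
CTX = 8192               # matches ctx-size in the presets above
BYTES_PER_ELEM = 1.0625  # q8_0 cache: 34 bytes per block of 32 elements

# K and V each store n_layers * n_kv_heads * head_dim values per token.
kv_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * CTX * BYTES_PER_ELEM
print(f"approx KV cache size: {kv_bytes / 2**20:.0f} MiB")

With LLAMA_ARG_MODELS_MAX=3 in the compose file, each loaded model carries its own cache, so whatever the real number works out to for your models, it adds up quickly.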

Don't mix K and V Quant Types

For me at least, mixing different K and V cache quant types (e.g. q8_0 for K with q4_0 for V) made prompt processing much slower; keeping both at the same type, as in the presets above, avoided it. An acquaintance suggested the two different kernels might have been overflowing some sort of cache, but that's just a guess.

Don't use Flash Attention with Gemma 3

Once again, prompt processing speed suffers.

There's no need to Offload Experts to the GPU

I am able to run Qwen 3 VL 30B A3B (30B params in total, roughly 3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use cpu-moe, which keeps the expert weights in system RAM (the CPU handles them) while the attention and other dense layers stay on the GPU. You can also use n-cpu-moe to choose exactly how many layers' experts should stay on the CPU, with the rest going to VRAM. However, I found that even with all of the expert layers out of VRAM, it was still fast enough.
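
The rough arithmetic behind why this works is sketched below. Every number is a ballpark guess for illustration (the real split depends on the exact GGUF), but it shows why pushing the experts into system RAM is what makes a 30B MoE usable on a 12GB card:

# moe_vram_estimate.py - back-of-the-envelope for why cpu-moe helps here.
# All numbers are guesses for illustration, not measured from the actual GGUF;
# check the tensor sizes of your own files to be sure.
TOTAL_PARAMS_B = 30.0     # total parameters, in billions
EXPERT_FRACTION = 0.9     # guess: share of the weights that sit in the MoE experts
BYTES_PER_PARAM = 0.6     # guess: roughly 4.8 bits per weight for a Q4_K_M-style quant

total_gb = TOTAL_PARAMS_B * BYTES_PER_PARAM
expert_gb = total_gb * EXPERT_FRACTION   # stays in system RAM with cpu-moe
gpu_gb = total_gb - expert_gb            # attention and other dense weights on the GPU

print(f"whole model:      ~{total_gb:.1f} GB  (will not fit in 12 GB of VRAM)")
print(f"experts in RAM:   ~{expert_gb:.1f} GB  (cpu-moe keeps these off the GPU)")
print(f"left for the GPU: ~{gpu_gb:.1f} GB  plus working buffers (and the mmproj)")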