Llama.cpp - Revision history

1G-N15: /* Try Disabling Mmap */

2026-03-29T11:30:29Z

Try Disabling Mmap

← Older revision		Revision as of 11:30, 29 March 2026
Line 73:		Line 73:
	=== Try Disabling Mmap ===		=== Try Disabling Mmap ===

	Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though.		Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though, especially for larger MoE models that spilled into system RAM.
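If you load models through a presets file like the one under My Setup, the same switch could probably be set per model instead of globally. A minimal sketch, assuming the presets file accepts the <code>--no-mmap</code> command-line flag as a key in the same way <code>no-kv-offload = on</code> mirrors <code>--no-kv-offload</code>. The section name and model path are placeholders, and I have not verified the key name:

```ini
; Hypothetical preset: disable mmap for one model only.
[Example MoE model]
; Placeholder path, not a real file.
model = /models/example-moe-model.gguf
ctx-size = 8192
; Assumed key, mirroring the --no-mmap flag; the documented
; environment-variable form is LLAMA_ARG_MMAP=off.
no-mmap = on
```

If the key is rejected, fall back to the environment variable, which applies to every model the server loads.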

1G-N15 at 11:30, 29 March 2026

2026-03-29T11:30:13Z

← Older revision		Revision as of 11:30, 29 March 2026
Line 70:		Line 70:

	I am able to run Qwen 3 VL 30B A3B (30B params in total, about 3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code>, which keeps the dense (non-expert) weights on the GPU while the expert weights stay in system RAM. You can also use <code>n-cpu-moe</code> to choose exactly how many layers of experts stay on the CPU, leaving the rest in VRAM. However, I found that even with all the layers of experts out of VRAM, it was still fast enough.		I am able to run Qwen 3 VL 30B A3B (30B params in total, about 3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code>, which keeps the dense (non-expert) weights on the GPU while the expert weights stay in system RAM. You can also use <code>n-cpu-moe</code> to choose exactly how many layers of experts stay on the CPU, leaving the rest in VRAM. However, I found that even with all the layers of experts out of VRAM, it was still fast enough.

			=== Try Disabling Mmap ===

			Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though.

1G-N15: /* There's no need to Offload Experts to the GPU */

2026-01-04T06:32:53Z

There's no need to Offload Experts to the GPU

← Older revision		Revision as of 06:32, 4 January 2026
Line 69:		Line 69:
	=== There's no need to Offload Experts to the GPU ===		=== There's no need to Offload Experts to the GPU ===

	I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code> or <code>n-cpu-moe</code> ~~(which allows you~~ to choose how many layers ~~you want to~~ stay in ~~the CPU)~~, ~~which keeps~~ the ~~dense~~ layers in the ~~GPU (the 3B) while only loading in experts when needed~~.		I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code>, which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed. You can also use <code>n-cpu-moe</code> to choose exactly how many layers of experts should stay in VRAM. However, I found that even with all the layers of experts out of the VRAM, it was still fast enough.

1G-N15: /* There's no need to Offload Experts to the GPU */

2026-01-04T06:31:18Z

There's no need to Offload Experts to the GPU

← Older revision		Revision as of 06:31, 4 January 2026
Line 69:		Line 69:
	=== There's no need to Offload Experts to the GPU ===		=== There's no need to Offload Experts to the GPU ===

	I am able to run Qwen 3 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code> or <code>n-cpu-moe</code> (which allows you to choose how many layers you want to stay in the CPU), which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed.		I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code> or <code>n-cpu-moe</code> (which allows you to choose how many layers you want to stay in the CPU), which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed.

1G-N15: /* Don't mix K and V Quant Types */

2026-01-04T06:31:06Z

Don't mix K and V Quant Types

← Older revision		Revision as of 06:31, 4 January 2026
Line 61:		Line 61:
	=== Don't mix K and V Quant Types ===		=== Don't mix K and V Quant Types ===

	For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the ~~kernel~~ might have overflowed some sort of cache?		For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the two kernels might have overflowed some sort of cache?

	=== Don't use Flash Attention with Gemma 3 ===		=== Don't use Flash Attention with Gemma 3 ===

1G-N15 at 06:30, 4 January 2026

2026-01-04T06:30:42Z

← Older revision		Revision as of 06:30, 4 January 2026
Line 1:		Line 1:
	== My Setup ==		== My Setup ==

	<~~code~~>services:		<nowiki>services:
	llama-server:		llama-server:
	image: ghcr.io/ggml-org/llama.cpp:server-cuda		image: ghcr.io/ggml-org/llama.cpp:server-cuda
Line 22:		Line 22:
	- driver: nvidia		- driver: nvidia
	count: all		count: all
	capabilities: [ gpu ]		capabilities: [ gpu ]</nowiki>
	</~~code~~>

	<~~code~~>version = 1		<nowiki>version = 1

	[Qwen 3 VL]		[Qwen 3 VL]
Line 52:		Line 51:
	flash-attn = on		flash-attn = on
	n-gpu-layers = 32		n-gpu-layers = 32
	no-kv-offload = on</~~code~~>		no-kv-offload = on</nowiki>

	== Optimization Tips ==		== Optimization Tips ==

1G-N15: Created page with "== My Setup == `services: llama-server: image: ghcr.io/ggml-org/llama.cpp:server-cuda restart: unless-stopped volumes: - ./models:/models:ro ports: - 127.0.0.1:4010:80 environment: - LLAMA_ARG_MODELS_PRESET=/models/presets.ini - LLAMA_ARG_MODELS_MAX=3 - LLAMA_ARG_HOST=0.0.0.0 - LLAMA_ARG_PORT=80 - LLAMA_ARG_FIT=off command: --keep 1024 deploy: resources: reservations: de..."`


2026-01-04T06:29:48Z
New page
== My Setup ==



<code>services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    restart: unless-stopped
    volumes:
      - ./models:/models:ro
    ports:
      - 127.0.0.1:4010:80
    environment:
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini
      - LLAMA_ARG_MODELS_MAX=3
      - LLAMA_ARG_HOST=0.0.0.0
      - LLAMA_ARG_PORT=80
      - LLAMA_ARG_FIT=off
    command: --keep 1024
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
</code>



<code>version = 1

[Qwen 3 VL]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
cpu-moe = on
no-kv-offload = on

[Qwen 2.5 Coder]
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
no-kv-offload = on

[Gemma 2 Blackout]
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
n-gpu-layers = 32
no-kv-offload = on</code>



== Optimization Tips ==



=== No KV Offload ===



<code>no-kv-offload</code> works surprisingly well! It keeps the KV cache in system RAM instead of VRAM. In my experience it doesn't cost much performance (Qwen 2.5 Coder 3B is still fast enough), and it frees up VRAM for more models.



=== Don't mix K and V Quant Types ===



For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the kernel might have overflowed some sort of cache? 



=== Don't use Flash Attention with Gemma 3 ===



Once again, prompt processing speed suffers.



=== There's no need to Offload Experts to the GPU ===



I am able to run Qwen 3 30B A3B (30B params in total, about 3B active per token) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use <code>cpu-moe</code>, which keeps the dense (non-expert) weights on the GPU while the expert weights stay in system RAM, or <code>n-cpu-moe</code>, which lets you choose how many layers' experts stay on the CPU.
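
A minimal sketch of a preset that uses <code>n-cpu-moe</code> instead of <code>cpu-moe</code>, assuming the flag takes a layer count N and keeps the expert weights of the first N layers in system RAM, as the llama.cpp help text describes it. The section name and the value 40 are placeholders rather than measured settings; the other keys are copied from the Qwen 3 VL preset under My Setup:

```ini
; Hypothetical variant of the Qwen 3 VL preset: partial expert offload.
[Qwen 3 VL partial]
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf
ctx-size = 8192
cache-type-k = q8_0
cache-type-v = q8_0
flash-attn = on
; Placeholder value: experts of the first 40 layers stay in system RAM,
; experts of the remaining layers are placed in VRAM.
n-cpu-moe = 40
no-kv-offload = on
```

Lowering the number moves more experts into VRAM, so the usual approach would be to step it down until VRAM is nearly full. Setting it to the model's full layer count should behave like <code>cpu-moe = on</code>.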