<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://1gn15.com/index.php?action=history&amp;feed=atom&amp;title=Llama.cpp</id>
	<title>Llama.cpp - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://1gn15.com/index.php?action=history&amp;feed=atom&amp;title=Llama.cpp"/>
	<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;action=history"/>
	<updated>2026-04-17T13:08:53Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.1</generator>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=172&amp;oldid=prev</id>
		<title>1G-N15: /* Try Disabling Mmap */</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=172&amp;oldid=prev"/>
		<updated>2026-03-29T11:30:29Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;Try Disabling Mmap&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:30, 29 March 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l73&quot;&gt;Line 73:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 73:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Try Disabling Mmap ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Try Disabling Mmap ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;, especially for larger MoE models that spilled into system RAM&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-171:rev-172:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=171&amp;oldid=prev</id>
		<title>1G-N15 at 11:30, 29 March 2026</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=171&amp;oldid=prev"/>
		<updated>2026-03-29T11:30:13Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 11:30, 29 March 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l70&quot;&gt;Line 70:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 70:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt;, which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed. You can also use &amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; to choose exactly how many layers of experts should stay in VRAM. However, I found that even with all the layers of experts out of the VRAM, it was still fast enough.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt;, which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed. You can also use &amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; to choose exactly how many layers of experts should stay in VRAM. However, I found that even with all the layers of experts out of the VRAM, it was still fast enough.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;=== Try Disabling Mmap ===&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;Are your models loading slowly from disk? Try disabling mmap (LLAMA_ARG_MMAP=off)! Depending on your situation, this may or may not speed up loading. It did for me, though.&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-126:rev-171:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=126&amp;oldid=prev</id>
		<title>1G-N15: /* There&#039;s no need to Offload Experts to the GPU */</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=126&amp;oldid=prev"/>
		<updated>2026-01-04T06:32:53Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;There&amp;#039;s no need to Offload Experts to the GPU&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:32, 4 January 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l69&quot;&gt;Line 69:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 69:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== There&amp;#039;s no need to Offload Experts to the GPU ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== There&amp;#039;s no need to Offload Experts to the GPU ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt; &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;or &lt;/del&gt;&amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;(which allows you &lt;/del&gt;to choose how many layers &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;you want to &lt;/del&gt;stay in &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;the CPU)&lt;/del&gt;, &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;which keeps &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;dense &lt;/del&gt;layers &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;in &lt;/del&gt;the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;GPU (the 3B) while only loading in experts when needed&lt;/del&gt;.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 VL 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;, which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed. You can also use &lt;/ins&gt;&amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; to choose &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;exactly &lt;/ins&gt;how many layers &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;of experts should &lt;/ins&gt;stay in &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;VRAM. However&lt;/ins&gt;, &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;I found that even with all &lt;/ins&gt;the layers &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;of experts out of &lt;/ins&gt;the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;VRAM, it was still fast enough&lt;/ins&gt;.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-125:rev-126:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=125&amp;oldid=prev</id>
		<title>1G-N15: /* There&#039;s no need to Offload Experts to the GPU */</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=125&amp;oldid=prev"/>
		<updated>2026-01-04T06:31:18Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;There&amp;#039;s no need to Offload Experts to the GPU&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:31, 4 January 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l69&quot;&gt;Line 69:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 69:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== There&amp;#039;s no need to Offload Experts to the GPU ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== There&amp;#039;s no need to Offload Experts to the GPU ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; (which allows you to choose how many layers you want to stay in the CPU), which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed.&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;I am able to run Qwen 3 &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;VL &lt;/ins&gt;30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; (which allows you to choose how many layers you want to stay in the CPU), which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed.&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-124:rev-125:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=124&amp;oldid=prev</id>
		<title>1G-N15: /* Don&#039;t mix K and V Quant Types */</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=124&amp;oldid=prev"/>
		<updated>2026-01-04T06:31:06Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;Don&amp;#039;t mix K and V Quant Types&lt;/span&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:31, 4 January 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l61&quot;&gt;Line 61:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 61:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Don&amp;#039;t mix K and V Quant Types ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Don&amp;#039;t mix K and V Quant Types ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the &lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;kernel &lt;/del&gt;might have overflowed some sort of cache?  &lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the &lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;two kernels &lt;/ins&gt;might have overflowed some sort of cache?&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Don&amp;#039;t use Flash Attention with Gemma 3 ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Don&amp;#039;t use Flash Attention with Gemma 3 ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-123:rev-124:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=123&amp;oldid=prev</id>
		<title>1G-N15 at 06:30, 4 January 2026</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=123&amp;oldid=prev"/>
		<updated>2026-01-04T06:30:42Z</updated>

		<summary type="html">&lt;p&gt;&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 06:30, 4 January 2026&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== My Setup ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== My Setup ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;code&lt;/del&gt;&amp;gt;services:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; &lt;/ins&gt;&amp;lt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;nowiki&lt;/ins&gt;&amp;gt;services:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;   llama-server:&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;   llama-server:&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;     image: ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;     image: ghcr.io/ggml-org/llama.cpp:server-cuda&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l22&quot;&gt;Line 22:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 22:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;             - driver: nvidia&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;             - driver: nvidia&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;               count: all&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;               count: all&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;               capabilities: [ gpu ]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;               capabilities: [ gpu ]&amp;lt;/&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;nowiki&lt;/ins&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;/&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;code&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-added&quot;&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&amp;lt;&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;code&lt;/del&gt;&amp;gt;version = 1&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt; &lt;/ins&gt;&amp;lt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;nowiki&lt;/ins&gt;&amp;gt;version = 1&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[Qwen 3 VL]&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;[Qwen 3 VL]&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l52&quot;&gt;Line 52:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 51:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;flash-attn = on&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;flash-attn = on&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;n-gpu-layers = 32&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;n-gpu-layers = 32&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;−&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;no-kv-offload = on&amp;lt;/&lt;del style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;code&lt;/del&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;no-kv-offload = on&amp;lt;/&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;nowiki&lt;/ins&gt;&amp;gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Optimization Tips ==&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;== Optimization Tips ==&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;

&lt;!-- diff cache key my_wiki:diff:1.41:old-122:rev-123:php=table --&gt;
&lt;/table&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
	<entry>
		<id>https://1gn15.com/index.php?title=Llama.cpp&amp;diff=122&amp;oldid=prev</id>
		<title>1G-N15: Created page with &quot;== My Setup ==  &lt;code&gt;services:   llama-server:     image: ghcr.io/ggml-org/llama.cpp:server-cuda     restart: unless-stopped     volumes:       - ./models:/models:ro     ports:       - 127.0.0.1:4010:80     environment:       - LLAMA_ARG_MODELS_PRESET=/models/presets.ini       - LLAMA_ARG_MODELS_MAX=3       - LLAMA_ARG_HOST=0.0.0.0       - LLAMA_ARG_PORT=80       - LLAMA_ARG_FIT=off     command: --keep 1024     deploy:       resources:         reservations:           de...&quot;</title>
		<link rel="alternate" type="text/html" href="https://1gn15.com/index.php?title=Llama.cpp&amp;diff=122&amp;oldid=prev"/>
		<updated>2026-01-04T06:29:48Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;== My Setup ==  &amp;lt;code&amp;gt;services:   llama-server:     image: ghcr.io/ggml-org/llama.cpp:server-cuda     restart: unless-stopped     volumes:       - ./models:/models:ro     ports:       - 127.0.0.1:4010:80     environment:       - LLAMA_ARG_MODELS_PRESET=/models/presets.ini       - LLAMA_ARG_MODELS_MAX=3       - LLAMA_ARG_HOST=0.0.0.0       - LLAMA_ARG_PORT=80       - LLAMA_ARG_FIT=off     command: --keep 1024     deploy:       resources:         reservations:           de...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;== My Setup ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;services:&lt;br /&gt;
  llama-server:&lt;br /&gt;
    image: ghcr.io/ggml-org/llama.cpp:server-cuda&lt;br /&gt;
    restart: unless-stopped&lt;br /&gt;
    volumes:&lt;br /&gt;
      - ./models:/models:ro&lt;br /&gt;
    ports:&lt;br /&gt;
      - 127.0.0.1:4010:80&lt;br /&gt;
    environment:&lt;br /&gt;
      - LLAMA_ARG_MODELS_PRESET=/models/presets.ini&lt;br /&gt;
      - LLAMA_ARG_MODELS_MAX=3&lt;br /&gt;
      - LLAMA_ARG_HOST=0.0.0.0&lt;br /&gt;
      - LLAMA_ARG_PORT=80&lt;br /&gt;
      - LLAMA_ARG_FIT=off&lt;br /&gt;
    command: --keep 1024&lt;br /&gt;
    deploy:&lt;br /&gt;
      resources:&lt;br /&gt;
        reservations:&lt;br /&gt;
          devices:&lt;br /&gt;
            - driver: nvidia&lt;br /&gt;
              count: all&lt;br /&gt;
              capabilities: [ gpu ]&lt;br /&gt;
&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;version = 1&lt;br /&gt;
&lt;br /&gt;
[Qwen 3 VL]&lt;br /&gt;
model = /models/Huihui-Qwen3-VL-30B-A3B-Instruct-abliterated.Q4_K_M.gguf&lt;br /&gt;
mmproj = /models/Qwen3-VL-30B-A3B-Instruct-Q4_K_M.mmproj-BF16.gguf&lt;br /&gt;
ctx-size = 8192&lt;br /&gt;
cache-type-k = q8_0&lt;br /&gt;
cache-type-v = q8_0&lt;br /&gt;
flash-attn = on&lt;br /&gt;
cpu-moe = on&lt;br /&gt;
no-kv-offload = on&lt;br /&gt;
&lt;br /&gt;
[Qwen 2.5 Coder]&lt;br /&gt;
model = /models/Qwen2.5-Coder-3B-Q4_K_M.gguf&lt;br /&gt;
ctx-size = 8192&lt;br /&gt;
cache-type-k = q8_0&lt;br /&gt;
cache-type-v = q8_0&lt;br /&gt;
flash-attn = on&lt;br /&gt;
no-kv-offload = on&lt;br /&gt;
&lt;br /&gt;
[Gemma 2 Blackout]&lt;br /&gt;
model = /models/G2-9B-Blackout-R1-Q4_K_M.gguf&lt;br /&gt;
ctx-size = 8192&lt;br /&gt;
cache-type-k = q8_0&lt;br /&gt;
cache-type-v = q8_0&lt;br /&gt;
flash-attn = on&lt;br /&gt;
n-gpu-layers = 32&lt;br /&gt;
no-kv-offload = on&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Optimization Tips ==&lt;br /&gt;
&lt;br /&gt;
=== No KV Offload ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;no-kv-offload&amp;lt;/code&amp;gt; works surprisingly well! Doesn&amp;#039;t come with much performance loss in my experience (Qwen 2.5 Coder 3B still works fast enough), and frees up VRAM for more models.&lt;br /&gt;
&lt;br /&gt;
=== Don&amp;#039;t mix K and V Quant Types ===&lt;br /&gt;
&lt;br /&gt;
For me at least, mixing different K and V quant types made prompt processing much slower. An acquaintance suggested the kernel might have overflowed some sort of cache? &lt;br /&gt;
&lt;br /&gt;
=== Don&amp;#039;t use Flash Attention with Gemma 3 ===&lt;br /&gt;
&lt;br /&gt;
Once again, prompt processing speed suffers.&lt;br /&gt;
&lt;br /&gt;
=== There&amp;#039;s no need to Offload Experts to the GPU ===&lt;br /&gt;
&lt;br /&gt;
I am able to run Qwen 3 30B A3B (30B params in total, 3B dense) rather well on my 12GB VRAM GPU (16 tokens per second). The trick is to use &amp;lt;code&amp;gt;cpu-moe&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;n-cpu-moe&amp;lt;/code&amp;gt; (which allows you to choose how many layers you want to stay in the CPU), which keeps the dense layers in the GPU (the 3B) while only loading in experts when needed.&lt;/div&gt;</summary>
		<author><name>1G-N15</name></author>
	</entry>
</feed>