Hybrid Inference: Memory & Layer Offloading

In hybrid setups (GPU VRAM + system RAM), the two main bottlenecks are:

Prefill (prompt processing): Slow when the KV Cache or attention layers land in RAM. The CPU must handle attention over thousands of tokens. Always keep the KV Cache in VRAM.
Generation (decode): Usually fast. Only token activations (kilobytes) cross the PCIe bus per step, so bandwidth is rarely a bottleneck here.

CPU Threading (`-t` and `-tb`)

In hybrid setups, if the CPU calculates its portion too slowly, the GPU sits idle, destroying your TPS. Do not set threads to your maximum logical core count (hyperthreading hurts matrix math).

Set the generation threads (-t) and prefill batch threads (-tb) strictly to your physical core count (or 1-2 cores fewer to leave room for the OS and agentic backend). For an 8-core CPU (like Ryzen 7 9700):

-t 7 -tb 7

(Alternatively, use -t 8 -tb 8 for maximum pure inference speed, but never exceed 8 to avoid logical thread contention).

Back-to-Front Offloading & Diagnostics

--n-cpu-moe offloads from the front layers, causing immediate bottlenecks when processing large prompts. For optimal Time-To-First-Token (TTFT), you want to offload the final layers to the CPU instead. Use llama-fit-params in the CLI to calculate the exact regex string for your VRAM limit, then use it as a static parameter:

-ot =blk\.58\.ffn_up|down|gate_exps=CPU,blk\.59...

Tip: To see how --fit plans to split your model without actually loading it, add --llama-fit-dry-run to your command.

Automatic Layer Fitting (`--fit`)

The --fit system replaces manual -ngl tuning. Before loading any model data, it runs a virtual memory simulation: it maps the model across GPU VRAM and system RAM, dynamically reducing context size if needed, then evicting model layers to CPU until the allocation fits. Physical loading only begins after the simulation succeeds.

Enable automatic layer fitting:

--fit on

Reserve space for a 128k context window before allocating model weights:

--fit-ctx 128000

Leave 512 MB of VRAM free to prevent OOM crashes and OS stutter:

--fit-target 512

If --fit fails on your specific fork, replace the fit flags with explicit settings and tune --n-cpu-moe manually:

-c 131072 --n-cpu-moe 20

Forcing Massive Compute Buffers (The `--fit-target` Trick)

If you need to fit a huge ubatch compute buffer (e.g., 2GB for -ub 2048) but the auto-calculator keeps causing OOMs, you can trick llama-fit-params into leaving exactly enough space. Run the tool and set --fit-target to your required buffer size + your KV cache size + system overhead (e.g., ~2300 MB):

llama-fit-params -m [model] --fit-ctx 32000 --fit-target 2300

This forces the tool to calculate an extreme Back-to-Front regex (-ot) that evicts enough layers to the CPU to guarantee your massive prefill buffer will never crash the GPU.

The `--n-cpu-moe` Trap (Front-to-Back Offloading)

Historically, users used --n-cpu-moe N to offload expert weights to RAM. Avoid this for high-performance prefill. The parameter --n-cpu-moe uses a naive "Front-to-Back" approach: it offloads the first N layers (e.g., layers 0 to 35) to the CPU. When a massive prompt enters the model, it immediately hits the slow system RAM, destroying Time-To-First-Token (TTFT) and capping prefill speeds to ~15-20 t/s on mobile CPUs.

The Modern Solution: Always use the Back-to-Front regex method via -ot combined with -ngl 99 (or -ngl 41). By keeping the first few layers (e.g., blk.0 to blk.3) strictly in fast GPU VRAM, initial prompt processing accelerates massively (500+ t/s on PCIe Gen4 achieved in my tests), only touching the CPU at the end of the pipeline.

RoPE Offload Bug Fix (`LLAMA_SET_ROPE`)

In hybrid MoE setups, RoPE (Rotary Positional Embeddings) calculations may be erroneously routed to the CPU, dropping prefill speed from ~180 t/s to ~10–15 t/s.

Set this before starting the server:

export LLAMA_SET_ROPE=1

Prefill & Batch Optimization

The default batch size of 512 tokens starves modern GPUs during prompt processing. Increase batch and micro-batch size for nearly double prefill throughput:

-b 2048 -ub 2048

For single-user or single-agent use, limit parallel sequences. This eliminates recurrent-state memory reservations, making large contexts (128k+) nearly free in RAM overhead:

-np 1

The Context Reservation Trap vs. Compute Buffers

Large micro-batch sizes (-ub 2048) require massive temporary VRAM matrices (Compute Buffers) during prompt processing — often 1.5GB to 2GB. When you start llama-server, the --ctx-size (-c) parameter immediately reserves VRAM for the entire context window upfront. If you request -c 32000 (which reserves ~175MB to ~1GB depending on quant) and your VRAM is almost full, the engine will not have enough space to allocate the Compute Buffer, resulting in a cudaMalloc failed: out of memory crash during prefill. The Fix: You must balance -c and -ub, or force extra VRAM headroom using llama-fit-params (see below).

Quantization: Selection Rules

K-Quants vs I-Quants in Hybrid Setups

When any layers run on the CPU, quantization format determines CPU speed significantly:

K-Quants (Q4_K_M, Q5_K_S) are AVX-friendly — CPUs decode them efficiently.
I-Quants (IQ4_XS, IQ3_S) require complex bit-fiddling and are not AVX-friendly. Some IQ4_XS files use IQ3_S for expert tensors, which is especially slow on CPU.

Use K-Quants for any hybrid setup. Reserve I-Quants only when the entire model fits 100% in VRAM.

MoE vs Dense Model Resilience

Dense models (e.g., Qwen 27B) handle aggressive quantization well. Q3_K_XL still performs strongly because the full active parameter count remains large.
MoE models (e.g., Qwen 35B-A3B with 3B active parameters) shatter at heavy quantization. Only ~3B parameters are active per token, so quantization errors punch through like holes in a thin filter.

Never go below Q4_K_M for small-active-parameter MoE models.

Turbo3 vs ISO3 in Custom Forks (TurboQuant/RotorQuant)

iso3 (Isotropic): Better perplexity, but relies on fused kernels. Breaks in hybrid setups — llama.cpp disables the fused Gated Delta Net and falls back to slow software paths (~100 t/s) or throws a segfault.
turbo3 / q4_0: Scales cleanly across PCIe lanes between CPU and GPU. Use these for any CPU+GPU split.

Server Startup Strategies

The "Constrained Hardware" Gold Standard (e.g., 6GB VRAM Mobile GPU)

When running a massive model (like a 35B MoE) on limited VRAM, dynamic --fit on commands can fail due to massive compute buffer allocations (ubatch). The most stable method is to pre-calculate the eviction using llama-fit-params and hardcode the Back-to-Front regex (-ot).

Here is a verified production script for an RTX A3000 (6GB) running a 35B MoE model with a 16k context window, achieving ~540 t/s prefill and ~31 t/s generation:

#!/bin/bash
export LLAMA_SET_ROPE=1

llama-server \
  --model "model/Qwen3.6-35B-A3B-Q4_K_M.gguf" \
  --port 8080 \
  -c 16000 \
  -b 2048 \
  -ub 2048 \
  -t 7 \
  -tb 7 \
  -fa 1 \
  -np 1 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -ngl 41 \
  -ot "blk\.4\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.5\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.6\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.7\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.8\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.9\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.10\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.11\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.12\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.13\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.14\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.15\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.16\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.17\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.18\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.19\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.20\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.21\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.22\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.23\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.24\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.25\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.26\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.27\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.28\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.29\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.30\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.31\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.32\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.33\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.34\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.35\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.36\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.37\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.38\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.39\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU,blk\.40\.ffn_(up|down|gate_up|gate)_(ch|)exps=CPU"

High-VRAM Baseline (12-24GB)

If your GPU can comfortably hold the model and buffers, dynamic fitting works perfectly:

llama-server \
  -m [model] \
  --fit on \
  --fit-ctx 128000 \
  --fit-target 512 \
  -b 2048 -ub 2048 \
  -np 1 \
  --chat-template-kwargs "{\"preserve_thinking\": true}" \
  --host 0.0.0.0 --port 8080

24/7 Server VRAM Management (Sleep Mode)

If running a persistent API server for background agents, do not leave the model permanently locked in VRAM. llama-server can automatically unload the model during idle periods, allowing the GPU to enter a low-power state. Use the sleep flag to define the idle timeout (in seconds):

--sleep-idle-seconds 300

This unloads the model after 5 minutes of inactivity, freeing VRAM for other tasks (like compiling code or running Docker containers).

Agentic Workflows

Reasoning Budget Control

Do not set a global --reasoning-budget at startup. Control thinking depth per API request instead.

Disable reasoning for fast, formatting-strict tasks (JSON extraction, simple formatting):

"reasoning_budget": 0

Should work with --chat-template-kwargs '{"enable_thinking": false}' or sending "chat_template_kwargs": {"enable_thinking": false} in the API request.

Enable reasoning for complex analytical tasks:

"reasoning_budget": 4096

If the model uses its entire max_tokens on internal thoughts without producing output, cap the thinking budget explicitly:

"reasoning_budget_tokens": 2000

Recommended budget by task type:

Task Type	Reasoning	Budget
Data Extraction / JSON Parsing	OFF	0 tokens
Simple Coding	Limited	1000–1500 tokens
System Architecture / Planning	ON	High / Unlimited
Fixing Broken Code / Debugging	ON	2000+ tokens

Externalizing the "Thinking" State

Relying entirely on a model's internal <|thought|> or preserve_thinking mechanics consumes thousands of tokens and drastically increases Time-To-First-Token (TTFT) for actionable output. The Optimization: For maximum speed and stability, disable internal thinking ("preserve_thinking": false or reasoning_budget: 0) and use a Python wrapper to create an external state machine:

Ask: Generate the raw code.
Validate: A separate, fast API call to check syntax.
Review: A separate API call simulating a code review.
Refine/Accept: Final output generation. This external loop approach prevents the model from getting lost in recursive logic loops within a single giant context window.

Preserve Thought Traces (Qwen)

Keeps past reasoning steps in the context history. Prevents the model from re-thinking earlier decisions in multi-turn sessions, reducing loops and improving tool-call consistency:

--chat-template-kwargs "{\"preserve_thinking\": true}"

Tool Calling: Precision Drift

Quantized models accumulate floating-point rounding errors as context grows across turns. Eventually, this precision drift causes the model to emit an incorrect structural token, breaking JSON tool calls or triggering a loop.

Mitigation: write tight, unambiguous tool routing in the system prompt. The fewer structural choices the model must make at each step, the less drift degrades JSON output.

System Prompt Scaffolding

Local models underperform commercial APIs in agentic tasks primarily because of missing scaffolding, not capability. Commercial systems inject detailed operating manuals into every request: explicit XML tool schemas, reasoning structures, and edge-case handling.

Treat the system prompt as an OS. Provide exhaustive tool schemas, exact call formats, and reasoning instructions.
Use specialized micro-agents (e.g., a Git-only agent, a file-search-only agent) instead of one large universal prompt. Keeps context slim and Time-To-First-Token (TTFT) low.

Local API Endpoints & Self-Healing

You are no longer restricted to cloud APIs for advanced agentic IDEs like Claude Code, Codex, or Roo Code. You can route them directly to your local llama-server instance. For maximum stability in agentic loops, consider placing an API middleware (like the Unsloth API endpoint) in front of llama.cpp to enable:

Self-healing tool calls: Automatically intercepts and fixes malformed JSON structural errors, reducing broken tool calls by 50%.
Code Execution: Enables the agent to run Bash and Python locally for verification.

Agentic Model Selection (The Qwen 3.6 Dilemma)

When building autonomous workflows, choose your model based on the agent's strategy:

For High Quality & Complex Logic: Use Qwen 3.6 27B (Dense). It is neck-and-neck with Sonnet 4.5 in coding benchmarks and requires less hand-holding. Tradeoff: Slower inference (~7-15 t/s).
For Speed & Iteration: Use Qwen 3.6 35B-A3B (MoE). It generates text incredibly fast but makes more logical mistakes. It excels in workflows where the agent is programmed to rapidly test, fail, and fix its own code in loops.

Prompt Caching (Long-Context RAG)

llama.cpp caches the full KV state of a shared prompt prefix. For repeated queries over a large knowledge base:

Send the knowledge base as a prefix together with the first question. Accept the initial heavy prefill.
For subsequent questions, send the same prefix with the new question appended.
The server detects the matching prefix, loads it from the KV cache in milliseconds, and evaluates only the new tokens.

This reduces subsequent query latency from minutes to ~100–200 ms on hybrid hardware.

Structured Output (JSON)

Use response_format in the API request to enforce strict JSON output. llama-server converts the schema into a GBNF grammar that physically blocks any token — including reasoning tags and markdown fences — that violates the schema:

{
  "response_format": {
    "type": "json_schema",
    "json_schema": { }
  },
  "temperature": 0.0
}

Pair with "temperature": 0.0 for deterministic, hallucination-free data extraction.

Automated Benchmarking (`llama-bench`)

Use llama-bench to find the physical bandwidth limits of your PCIe bus and RAM by testing extreme -b and -ub combinations:

./llama-bench -m [model] -t 7 -p 4096 -b 2048,1024 -ub 2048,1024,512 -ctk q4_0 -ctv q4_0 -ngl 41 -ot "..."

Monitor VRAM with nvtop during the benchmark. The goal is to find the maximum ubatch size your system can ingest without crashing, which will dictate your maximum prefill speed (pp column).

Critical `llama-bench` Quirks

When testing extreme configurations, llama-bench behaves differently than the server:

Missing Thread Parameter: llama-bench does not use -tb (threads-batch). You must use only -t for all thread allocations.
The KV Cache OOM Trap: By default, llama-bench creates the test KV Cache in heavy f16 precision. If you are testing VRAM limits, this will instantly crash the benchmark. You must explicitly pass your cache quantization flags (e.g., -ctk turbo3 -ctv turbo3) to simulate real server conditions.
Context Size: llama-bench defaults to reserving only the prompt size (-p), not the massive 32k+ contexts. A configuration that passes llama-bench might still OOM on llama-server if -c is set too high without proper --fit-target headroom.

Multi-GPU Hardware (AM4/AM5)

For dual-GPU AI workstations:

Avoid mATX and B550 boards. The second PCIe x16 slot routes through the chipset at x4 speed, choking memory offload bandwidth.
Target X570 boards (e.g., ASUS ROG Crosshair VIII Hero/Dark Hero, MSI MEG X570). These provide true CPU PCIe bifurcation at x8/x8, giving both GPUs direct high-speed access.
PSU minimum: 1000W Platinum. High-TDP primary GPU (350W+) plus secondary compute card (250W+) create intense power spikes.

Tesla P40 specific tuning

The Tesla P40 (24GB) is the budget king but requires strict handling:

Cooling: No active fan. Requires a 3D-printed shroud + server fan.
Power Limit: Limit to 170W-180W via nvidia-smi to prevent thermal throttling while retaining 95%+ performance.
Real-World Metrics: When properly configured with -fa 1 (Flash Attention) and q8_0 KV cache, a single P40 can achieve ~48 tok/s on a 30B MoE model while consuming only ~144W.

The AMD GPU Warning (CUDA vs ROCm)

While high-end AMD cards (like the RX 7900 XTX) boast massive memory bandwidth, their software ecosystem remains a severe bottleneck for AI developers:

Essential libraries like vLLM, flashattention2, and bitsandbytes currently lack stable ROCm support.
ExLlamaV2 is heavily CUDA-optimized, making RTX cards significantly faster (e.g., RTX 3090 beats 7900 XTX by ~90% in ExLlamaV2 generation).
Vulkan Workaround: If forced to use an AMD GPU in llama.cpp, testing the Vulkan backend (GGML_VULKAN=1) instead of ROCm can sometimes yield massive performance jumps (up to +48% in prompt eval).

Mixed Architectures & P2P Bug

If mixing GPU generations (e.g., RTX 3080 Ti + Tesla P40), NVIDIA drivers block direct Peer-to-Peer (P2P) transfers. Data must route through the CPU. Warning: Do not use --split-mode row in mixed or multi-AMD setups, as it currently triggers an infinite loop/freeze bug. Stick to the default --split-mode layer.

CUDA Build Troubleshooting

`nvcc` Not Found

Export the CUDA binary path:

export PATH=/usr/local/cuda/bin:$PATH

Export the CUDA library path:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Clear the broken CMake cache before retrying:

rm -rf build

Build Freezes

Limit concurrent compilation jobs. Building large CUDA files (e.g., fattn.cu) consumes enormous RAM per thread:

cmake --build build -j 8

C++ Syntax Errors (Double Extern)

If a custom fork fails with invalid use of 'extern', wrap the macro correctly:

extern "C" { GGML_API int function_name(...); }

Compiling for Mixed GPUs (Fat Binaries)

If using GPUs from different eras (e.g., Ampere sm_86 and Pascal sm_61), instruct CMake to build a "Fat Binary" so both run optimally on the same server:

cmake -B build -DGGML_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="61;86"

Flash Attention Compilation Fix

On some builds, llama.cpp silently disables Flash Attention for certain quants, ruining CUDA prefill speeds. Force it on during compilation:

cmake -B build -DGGML_CUDA=ON -DLLAMA_CUDA_FA_ALL_QUANTS=1

Vision / Multimodal

Vision encoders (*mmproj* files) do not need to match the LLM's quantization level. Download the unquantized (f16 or f32) encoder and place it in the same directory as the model. Encoders are typically under 1 GB, so full precision costs almost nothing in memory while maximizing image recognition accuracy.

Fixing "image input is not supported"

If you attempt to pass an image to a multimodal model (like Qwen-VL or LLaVA) and the server returns a 500 error stating image input is not supported - hint: if this is unexpected, you may need to provide the mmproj, it means the LLM engine is running without its "eyes".

The Fix: The vision encoder is a separate file from the main LLM. You must explicitly pass the downloaded mmproj file to the server using the --mmproj flag:

llama-server \
  --model "model/Qwen3.6-35B-A3B-Q4_K_M.gguf" \
  --mmproj "model/qwen3.6-vl-mmproj-f16.gguf" \
  --port 8080 \
  -c 16000 -b 2048 -ub 2048 -fa 1

The `mmproj` VRAM Penalty (Text-Only Mode)

Do not load the vision encoder (--mmproj) if your current agentic session does not strictly require image analysis. The text tokenizer is already built into the main .gguf file. Forcing the server to load the mmproj file for purely text-based coding tasks wastes ~1GB of precious VRAM and degrades overall generation speed.

VRAM Optimization Tips

iGPU display trick: Plug the monitor into the motherboard video output instead of the dedicated GPU. This offloads the desktop compositor to the integrated GPU, freeing 500 MB–1 GB of VRAM on the compute GPU.

KV Cache priority: The KV Cache must stay in VRAM. Once it overflows to system RAM, the CPU must process attention across thousands of tokens, causing catastrophic prefill slowdowns.

Dual-Server iGPU Strategy (Vulkan): Do not let your primary GPU waste VRAM and compute on embedding models for RAG. Compile a second llama-server with GGML_VULKAN=1, load a lightweight embedding model (e.g., nomic-embed), and run it on your CPU's integrated graphics on a different port.

VRAM Juggling (llama-swap): If running LLMs alongside ComfyUI or Whisper, use llama-swap as a reverse proxy. It automatically offloads the LLM to RAM when idle, freeing up the GPU for image generation, and swaps it back instantly when the agent requests text.

FP8 KV Cache: To effectively double your context window size on modern architectures, force the cache to 8-bit: --kv-cache-dtype fp8 (or -ctk q8_0).

Vulkan Setup & The Gemma Float16 Bug

When offloading embedding models (like Gemma 300M) to an Intel iGPU via Vulkan:

Dependency: CMake will fail finding Vulkan without the Google shader compiler. Run sudo apt install glslc before building GGML_VULKAN=1.
Isolation: Prevent the Vulkan server from touching your primary NVIDIA GPU by strictly isolating the device ID in the startup script:
```
GGML_VK_VISIBLE_DEVICES=0 ./llama-server --embedding ...
```
The Gemma NaN Bug: Embedding models do not persist a KV cache, but precision during the forward pass matters. Gemma architectures overflow when using f16 activations, producing NaN (Not a Number) or broken vectors. You must force the temporary attention buffers to 32-bit float: -ctk f32 -ctv f32.

The UMA Fallacy (0-100% GPU Swings)

If you move from Apple Silicon (Unified Memory Architecture) to a discrete PC GPU (NVIDIA/AMD) and experience massive stuttering (~1 t/s) with GPU utilization swinging wildly between 0% and 100%, you have breached your VRAM ceiling. Discrete GPUs cannot smoothly page memory across the PCIe bus during matrix math. The 0% utilization drops are the GPU starving, waiting for swapped data from system RAM. You must establish a hard ceiling (using -ot or lowering context) to stay under 95% physical VRAM limit.

Document Parsing for LLMs

Standard PDF parsers (like PyPDF2) often read left-to-right and destroy structural elements like tables. For LLM pipelines, it is highly recommended to use MarkItDown (built by the AutoGen Team at Microsoft). It is a lightweight Python utility designed to convert various files to Markdown for text analysis pipelines.

Markdown is highly token-efficient and preserves important document structures like headings, lists, tables, and links. Mainstream LLMs understand Markdown natively, making it the perfect extraction format. MarkItDown supports a massive variety of formats out of the box, including PDF, PowerPoint, Word, Excel, Images, HTML, CSV, ZIP, and EPubs.

Basic Usage: First, install the package with all optional dependencies:

pip install 'markitdown[all]'

Then, parse your document in Python:

from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("document.pdf")
print(result.text_content)

For more details, 3rd-party plugins (like markitdown-ocr), and advanced configurations using LLMs for image descriptions, visit the official repository: https://github.com/microsoft/markitdown

Appendix: Advanced Insights & Agentic Optimizations

1. The Quantization Quality Paradox (Q4_K_M vs Q8_0)

Recent community evaluations of Qwen 3.6 architectures (e.g., 27B and 35B-A3B) reveal a counterintuitive trend in quantization performance:

The Sweet Spot: Q4_K_M remains the superior practical choice for agentic workflows. It retains nearly identical function-calling (BFCL) and human-level reasoning capabilities compared to BF16.
The Q8_0 Degradation: In several benchmarks, Q8_0 GGUF variants performed slightly worse in logic tests (like HellaSwag) and were significantly slower due to higher memory bandwidth requirements.
Provider Differences: Be aware that different GGUF providers (e.g., Unsloth vs. LM Studio vs. Bartowski) use varying quantization methods for weights vs. attention layers. An "Unsloth" version might consume more VRAM but offer better reasoning consistency than a default "LM Studio" quant of the same name.

2. Jinja Templates & Reasoning Traces

A common issue in local deployments is "Thought Leaking," where internal reasoning tokens (e.g., <think>) are printed as plain text to the user instead of being hidden.

The Jinja Fix: This is typically a Jinja template issue rather than a model flaw. If your UI does not natively support reasoning tags, you must adjust the Jinja template to wrap or suppress these blocks during the final output phase.
KV Cache Stability: Forcing preserve_thinking: true via --chat-template-kwargs is critical for preventing Cache Invalidation. If the reasoning trace is stripped between turns, the engine must re-calculate the entire prompt, causing a massive latency spike in multi-turn agentic sessions.

3. KV Cache Quantization (Empirical Benchmarks)

Based on stress tests conducted on RTX A3000 (Mobile) hardware for MoE models:

Cache Type	Prefill (pp4096)	Decode (tg128)	Stability
q4_0	578.7 t/s	34.3 t/s	Rock Solid
turbo3	563.9 t/s	34.8 t/s	Good (Best VRAM savings)
iso3	271.6 t/s	34.5 t/s	Broken (Kills prefill speed)

Recommendation: Use q4_0 for maximum stability and prefill speed. Switch to turbo3 only if you are within 500MB of your VRAM ceiling to squeeze in more context. Avoid iso3 in hybrid setups as it breaks fused CUDA kernels and falls back to slow software paths.

4. Multi-Node Agentic Architecture

For complex "OpenClaw" setups, avoid running a single monolithic node. Adopt a "Cluster" strategy:

The Orchestrator: A laptop or low-power node handles system prompts, tool routing, and lightweight decision-making.
The Worker (Heavy Compute): A dedicated workstation (e.g., RTX 3090 / 2x 1080 Ti + P40) handles the massive 35B+ models and long-context code generation.
The Utility Node: Use an iGPU (Vulkan) for embedding models and Whisper (speech-to-text) to keep your primary VRAM free for LLM layers.

Experimental & Bleeding-Edge Features

Monitor the llama.cpp community for these emerging optimizations:

Dynamic Runtime Quantization: The --fit algorithm is evolving beyond simple layer routing to dynamically adjust quantization bit-depth on the fly, allowing monolithic 70B models to squeeze into 8GB of total system RAM by heavily compressing less critical layers.
The "Hot Expert" Cache (MoE): Experimental branches are moving away from static CPU offloading for MoE models. New algorithms track frequently used "hot" experts over the last N tokens and dynamically swap them from RAM into VRAM, evicting "cold" experts. This has shown up to ~45% speedups in token generation on massive models like Qwen 122B.
Multi Token Prediction (MTP): Instead of generating one token per layer pass, MTP guesses multiple tokens simultaneously (usually 3). Because source code is highly predictable, MTP achieves a massive acceptance rate in coding tasks, dramatically multiplying generation speed for agentic coding. Ensure your GGUF file explicitly includes MTP layers to utilize this.

Customize Variables

💡 Tips & Tricks