Local LLM & Legacy GPU Cheatsheet

Building from Source (Legacy GPU Support)

When pre-compiled binaries drop support for older GPUs (like the Pascal-based Tesla P40), you must compile the inference engine from source targeting your specific Compute Capability.

Build llama.cpp with Specific CUDA Version

Parameters:

compute_cap: Your GPU's Compute Capability without the decimal (e.g., 6.1 becomes 61).
cuda_path: Path to the specific CUDA toolkit you want to build against (useful when avoiding host/server version mismatches).

git clone [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=[compute_cap] -DCMAKE_CUDA_COMPILER=[cuda_path]/bin/nvcc
cmake --build build --config Release -j

Hardware Management

Multi-GPU Inference

Restrict which GPUs the AI engine can see. This is highly useful when building a budget cluster (e.g., dual Tesla P40s) to pool VRAM.

export CUDA_VISIBLE_DEVICES=[gpu_ids]
./build/bin/llama-server -m path/to/model.gguf -ngl 99

💡 Tip: Set -ngl 99 in llama.cpp to offload all possible layers to the GPUs specified in CUDA_VISIBLE_DEVICES.

Model Format Suffixes

Understanding file extensions and model suffixes is crucial for matching the right model to your specific hardware capabilities.

Suffix / Format	Best Hardware	Software Engine	Description
GGUF	Pascal/Volta, CPUs, Mac	Ollama, llama.cpp	The most flexible format. Excellent for older GPUs as it supports splitting models across GPU VRAM and System RAM. Uses INT8 math efficiently.
EXL2	RTX 30/40/50 Series	ExLlamaV2	Extremely fast, but relies heavily on modern Tensor Cores. Not recommended for older architectures with lower compute capabilities (e.g., Pascal P40).
AWQ / GPTQ	Modern NVIDIA GPUs	vLLM, AutoGPTQ	Standard 4-bit quantization. Great for throughput on newer datacenter cards (Ampere/Hopper).
MLX	Apple Silicon (M-Series)	MLX Framework	Apple's proprietary framework format optimized for Unified Memory. Incompatible with NVIDIA GPUs.
Safetensors	High-VRAM Datacenter	Transformers, vLLM	The raw, unquantized base weights (usually FP16). Very large file sizes; typically used to convert into GGUF/EXL2.

GGUF Quantization Levels

When downloading GGUF models, you will see different quantization suffixes. This dictates the balance between VRAM usage, speed, and output quality.

Quantization	VRAM Required	Quality Loss	Description
Q8_0	High	Almost None	8-bit quantization. The sweet spot for Tesla P40s because the P40 has native hardware support for 8-bit math (DP4A instructions).
Q5_K_M	Medium	Very Low	Excellent balance. Highly recommended if a model is just slightly too big to fit in your VRAM at Q8.
Q4_K_M	Low	Low	The standard community favorite. Compresses a model to about 25-30% of its original size while maintaining great reasoning.
Q2_K / IQ2	Very Low	High	Extreme compression. Only use this if you desperately want to run a massive model (like a 70B) on a single 24GB card.

Decoding GGUF Quantization Suffixes

GGUF models use a specific naming convention to tell you exactly how the model was compressed. Understanding this helps you balance VRAM usage against the AI's "smartness."

1. Anatomy of a Quantization Name

Component	Example	Meaning
Prefix	`Q`, `IQ`	`Q`: Standard Quantization. `IQ`: Importance Matrix (Non-linear, bleeding-edge compression).
Bit Rate	`8`, `5`, `4`	The target number of bits per weight (down from the original 16-bit FP16).
Legacy Math	`_0`, `_1`	`_0`: Symmetric quantization (fast, hardware-friendly). `_1`: Asymmetric quantization (includes an offset).
K-Quants	`_K`	Mixed Precision. It dynamically uses higher bits for critical "brain" layers and lower bits for less important layers.
Size/Scale	`_S`, `_M`, `_L`	`_S` (Small): Maximizes VRAM savings. `_M` (Medium): The "Sweet Spot" for quality vs. size. `_L` (Large): Maximizes reasoning quality.

2. Common Quantization Levels (From Best Quality to Smallest Size)

Suffix	VRAM Footprint	Quality / Perplexity	Notes & Best Use Case
`Q8_0`	Very Large	Near Flawless	Practically identical to the uncompressed model. Perfect for the Tesla P40, which has native 8-bit hardware math.
`Q6_K`	Large	Excellent	Very low quality loss. Great if `Q8_0` is just barely too big for your VRAM.
`Q5_K_M`	Medium	Very Good	The gold standard "Sweet Spot". Highly recommended for complex reasoning and coding tasks.
`Q4_K_M`	Small	Good	The community default. Shrinks the model to ~25-30% of its original size while keeping it very coherent.
`IQ3_M`	Very Small	Fair	Requires Importance Matrix (`IQ`). Noticeable degradation, but remarkably usable for 3-bit.
`IQ2_M`	Extreme	Poor to Fair	2-bit compression. Only use this if you are desperate to fit a massive model (like a 70B+) onto a single 24GB GPU.

Advanced Inference: vLLM

vLLM is a high-throughput and memory-efficient LLM inference engine. It uses PagedAttention to manage attention keys and values efficiently.

When to use vLLM?

High Concurrency: When multiple users are querying the model simultaneously.
Modern Hardware: Best suited for GPUs with Compute Capability 7.0+ (Volta) or higher.
Flash Attention: heavily utilizes Flash Attention hardware features (which the Tesla P40 lacks).

Run a Model via vLLM

python3 -m vllm.entrypoints.openai.api_server \
    --model unsloth/Qwen3.5-MoE-A17B \
    --quantization awq \
    --tensor-parallel-size 2

⚠️ Note for Legacy GPUs: While vLLM is the industry standard for modern production, llama.cpp remains the undisputed champion for squeezing performance out of older, non-Tensor Core GPUs like the P40.

NVIDIA CUDA Micro-Architecture & Compute Capability List

Use this table to find the target [compute_cap] for your specific hardware generation.

Architecture	Release Date	Compute Capability	GeForce	Quadro	Jetson
Fermi	2010	2.0, 2.1	GeForce 400/500 series	Quadro 600
Kepler	2012	3.0, 3.5, 3.7	GeForce 600/700 series	Quadro K600
Maxwell	2014	5.0, 5.2, 5.3	GeForce 900 series	Quadro K620	Jetson Nano
Pascal	2016	6.0, 6.1, 6.2	GeForce 10 series	Quadro P600, Tesla P40	Jetson TX2
Volta	2017	7.0, 7.2	Nvidia TITAN V	Quadro GV100	Jetson Xavier NX, AGX Xavier
Turing	2018	7.5	GeForce 20 series	NVIDIA T600, Quadro RTX 4000
Ampere	2020	8.0, 8.6, 8.7	GeForce 30 series	NVIDIA RTX A2000	Jetson Orin NX, AGX Orin
Ada Lovelace	2022	8.9	GeForce 40 series	NVIDIA RTX 2000
Hopper	2022	9.0		H100 (Data Center)
Blackwell	2024	10.x	GeForce 50 series

Local LLM & Legacy GPU Cheatsheet

Customize Variables

Building from Source (Legacy GPU Support)

Build llama.cpp with Specific CUDA Version

Hardware Management

Multi-GPU Inference

Model Format Suffixes

GGUF Quantization Levels

Decoding GGUF Quantization Suffixes

1. Anatomy of a Quantization Name

2. Common Quantization Levels (From Best Quality to Smallest Size)

Advanced Inference: vLLM

When to use vLLM?

Run a Model via vLLM

NVIDIA CUDA Micro-Architecture & Compute Capability List

Resources

Customize Variables

💡 Tips & Tricks

Building from Source (Legacy GPU Support)

Build llama.cpp with Specific CUDA Version

Hardware Management

Multi-GPU Inference

Model Format Suffixes

GGUF Quantization Levels

Decoding GGUF Quantization Suffixes

1. Anatomy of a Quantization Name

2. Common Quantization Levels (From Best Quality to Smallest Size)

Advanced Inference: vLLM

When to use vLLM?

Run a Model via vLLM

NVIDIA CUDA Micro-Architecture & Compute Capability List

Resources

Clear Variables

Print Options