Local LLM & Legacy GPU Cheatsheet
A comprehensive guide to compiling AI engines for older GPUs, understanding model formats, quantization, and navigating NVIDIA architectures.
Building from Source (Legacy GPU Support)
When pre-compiled binaries drop support for older GPUs (like the Pascal-based Tesla P40), you must compile the inference engine from source targeting your specific Compute Capability.
Build llama.cpp with Specific CUDA Version
Parameters:
compute_cap: Your GPU's Compute Capability without the decimal (e.g., 6.1 becomes61).cuda_path: Path to the specific CUDA toolkit you want to build against (useful when avoiding host/server version mismatches).
git clone [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=[compute_cap] -DCMAKE_CUDA_COMPILER=[cuda_path]/bin/nvcc
cmake --build build --config Release -j
Hardware Management
Multi-GPU Inference
Restrict which GPUs the AI engine can see. This is highly useful when building a budget cluster (e.g., dual Tesla P40s) to pool VRAM.
export CUDA_VISIBLE_DEVICES=[gpu_ids]
./build/bin/llama-server -m path/to/model.gguf -ngl 99
💡 Tip: Set
-ngl 99inllama.cppto offload all possible layers to the GPUs specified inCUDA_VISIBLE_DEVICES.
Model Format Suffixes
Understanding file extensions and model suffixes is crucial for matching the right model to your specific hardware capabilities.
| Suffix / Format | Best Hardware | Software Engine | Description |
|---|---|---|---|
| GGUF | Pascal/Volta, CPUs, Mac | Ollama, llama.cpp | The most flexible format. Excellent for older GPUs as it supports splitting models across GPU VRAM and System RAM. Uses INT8 math efficiently. |
| EXL2 | RTX 30/40/50 Series | ExLlamaV2 | Extremely fast, but relies heavily on modern Tensor Cores. Not recommended for older architectures with lower compute capabilities (e.g., Pascal P40). |
| AWQ / GPTQ | Modern NVIDIA GPUs | vLLM, AutoGPTQ | Standard 4-bit quantization. Great for throughput on newer datacenter cards (Ampere/Hopper). |
| MLX | Apple Silicon (M-Series) | MLX Framework | Apple's proprietary framework format optimized for Unified Memory. Incompatible with NVIDIA GPUs. |
| Safetensors | High-VRAM Datacenter | Transformers, vLLM | The raw, unquantized base weights (usually FP16). Very large file sizes; typically used to convert into GGUF/EXL2. |
GGUF Quantization Levels
When downloading GGUF models, you will see different quantization suffixes. This dictates the balance between VRAM usage, speed, and output quality.
| Quantization | VRAM Required | Quality Loss | Description |
|---|---|---|---|
| Q8_0 | High | Almost None | 8-bit quantization. The sweet spot for Tesla P40s because the P40 has native hardware support for 8-bit math (DP4A instructions). |
| Q5_K_M | Medium | Very Low | Excellent balance. Highly recommended if a model is just slightly too big to fit in your VRAM at Q8. |
| Q4_K_M | Low | Low | The standard community favorite. Compresses a model to about 25-30% of its original size while maintaining great reasoning. |
| Q2_K / IQ2 | Very Low | High | Extreme compression. Only use this if you desperately want to run a massive model (like a 70B) on a single 24GB card. |
Decoding GGUF Quantization Suffixes
GGUF models use a specific naming convention to tell you exactly how the model was compressed. Understanding this helps you balance VRAM usage against the AI's "smartness."
1. Anatomy of a Quantization Name
| Component | Example | Meaning |
|---|---|---|
| Prefix | Q, IQ | Q: Standard Quantization.IQ: Importance Matrix (Non-linear, bleeding-edge compression). |
| Bit Rate | 8, 5, 4 | The target number of bits per weight (down from the original 16-bit FP16). |
| Legacy Math | _0, _1 | _0: Symmetric quantization (fast, hardware-friendly)._1: Asymmetric quantization (includes an offset). |
| K-Quants | _K | Mixed Precision. It dynamically uses higher bits for critical "brain" layers and lower bits for less important layers. |
| Size/Scale | _S, _M, _L | _S (Small): Maximizes VRAM savings._M (Medium): The "Sweet Spot" for quality vs. size._L (Large): Maximizes reasoning quality. |
2. Common Quantization Levels (From Best Quality to Smallest Size)
| Suffix | VRAM Footprint | Quality / Perplexity | Notes & Best Use Case |
|---|---|---|---|
Q8_0 | Very Large | Near Flawless | Practically identical to the uncompressed model. Perfect for the Tesla P40, which has native 8-bit hardware math. |
Q6_K | Large | Excellent | Very low quality loss. Great if Q8_0 is just barely too big for your VRAM. |
Q5_K_M | Medium | Very Good | The gold standard "Sweet Spot". Highly recommended for complex reasoning and coding tasks. |
Q4_K_M | Small | Good | The community default. Shrinks the model to ~25-30% of its original size while keeping it very coherent. |
IQ3_M | Very Small | Fair | Requires Importance Matrix (IQ). Noticeable degradation, but remarkably usable for 3-bit. |
IQ2_M | Extreme | Poor to Fair | 2-bit compression. Only use this if you are desperate to fit a massive model (like a 70B+) onto a single 24GB GPU. |
Advanced Inference: vLLM
vLLM is a high-throughput and memory-efficient LLM inference engine. It uses PagedAttention to manage attention keys and values efficiently.
When to use vLLM?
- High Concurrency: When multiple users are querying the model simultaneously.
- Modern Hardware: Best suited for GPUs with Compute Capability 7.0+ (Volta) or higher.
- Flash Attention: heavily utilizes Flash Attention hardware features (which the Tesla P40 lacks).
Run a Model via vLLM
python3 -m vllm.entrypoints.openai.api_server \
--model unsloth/Qwen3.5-MoE-A17B \
--quantization awq \
--tensor-parallel-size 2
⚠️ Note for Legacy GPUs: While vLLM is the industry standard for modern production,
llama.cppremains the undisputed champion for squeezing performance out of older, non-Tensor Core GPUs like the P40.
NVIDIA CUDA Micro-Architecture & Compute Capability List
Use this table to find the target [compute_cap] for your specific hardware generation.
| Architecture | Release Date | Compute Capability | GeForce | Quadro | Jetson |
|---|---|---|---|---|---|
| Fermi | 2010 | 2.0, 2.1 | GeForce 400/500 series | Quadro 600 | |
| Kepler | 2012 | 3.0, 3.5, 3.7 | GeForce 600/700 series | Quadro K600 | |
| Maxwell | 2014 | 5.0, 5.2, 5.3 | GeForce 900 series | Quadro K620 | Jetson Nano |
| Pascal | 2016 | 6.0, 6.1, 6.2 | GeForce 10 series | Quadro P600, Tesla P40 | Jetson TX2 |
| Volta | 2017 | 7.0, 7.2 | Nvidia TITAN V | Quadro GV100 | Jetson Xavier NX, AGX Xavier |
| Turing | 2018 | 7.5 | GeForce 20 series | NVIDIA T600, Quadro RTX 4000 | |
| Ampere | 2020 | 8.0, 8.6, 8.7 | GeForce 30 series | NVIDIA RTX A2000 | Jetson Orin NX, AGX Orin |
| Ada Lovelace | 2022 | 8.9 | GeForce 40 series | NVIDIA RTX 2000 | |
| Hopper | 2022 | 9.0 | H100 (Data Center) | ||
| Blackwell | 2024 | 10.x | GeForce 50 series |