
Local LLM & Legacy GPU Cheatsheet

A comprehensive guide to compiling AI engines for older GPUs, understanding model formats, quantization, and navigating NVIDIA architectures.

#llm #nvidia #cuda #p40 #gguf #llamacpp #vllm

Building from Source (Legacy GPU Support)

When pre-compiled binaries drop support for older GPUs (like the Pascal-based Tesla P40), you must compile the inference engine from source targeting your specific Compute Capability.

Build llama.cpp with Specific CUDA Version

Parameters:

  • compute_cap: Your GPU's Compute Capability without the decimal (e.g., 6.1 becomes 61).
  • cuda_path: Path to the specific CUDA toolkit you want to build against (useful when avoiding host/server version mismatches).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=[compute_cap] -DCMAKE_CUDA_COMPILER=[cuda_path]/bin/nvcc
cmake --build build --config Release -j
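
The compute_cap substitution described above can be scripted. The following is a minimal sketch: cap_to_arch is a hypothetical helper (not part of llama.cpp), and the CUDA install path in the commented example is an assumption to replace with your own.

```shell
#!/bin/sh
# cap_to_arch: hypothetical helper, not part of llama.cpp.
# Converts a Compute Capability such as "6.1" into the CMake value "61".
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

compute_cap=$(cap_to_arch "6.1")   # Tesla P40 (Pascal)
echo "CMAKE_CUDA_ARCHITECTURES=${compute_cap}"

# Example configure step (cuda_path is an assumed install location):
#   cuda_path=/usr/local/cuda-12.4
#   cmake -B build -DGGML_CUDA=ON \
#       -DCMAKE_CUDA_ARCHITECTURES="${compute_cap}" \
#       -DCMAKE_CUDA_COMPILER="${cuda_path}/bin/nvcc"
```

On recent drivers, nvidia-smi --query-gpu=compute_cap --format=csv,noheader should report the value to feed into the helper; whether that query field is available depends on your driver version.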

Hardware Management

Multi-GPU Inference

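Use CUDA_VISIBLE_DEVICES to restrict which GPUs the inference engine can see. llama.cpp then splits the model across the visible GPUs, which lets a budget multi-GPU machine (e.g., dual Tesla P40s) pool its VRAM. [gpu_ids] is a comma-separated list of device indices (e.g., 0,1).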

export CUDA_VISIBLE_DEVICES=[gpu_ids]
./build/bin/llama-server -m path/to/model.gguf -ngl 99

💡 Tip: Set -ngl 99 in llama.cpp to offload all possible layers to the GPUs specified in CUDA_VISIBLE_DEVICES.
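
As a sketch of a dual-GPU launch, the script below builds an equal ratio for llama.cpp's --tensor-split flag from the GPU id list and prints the resulting command rather than running it. even_split is a hypothetical helper, and the two-card setup is an assumption. For identical cards the explicit ratio may be unnecessary; it is shown so it can be adjusted for mismatched cards.

```shell
#!/bin/sh
# even_split: hypothetical helper. Turns a GPU id list such as "0,1"
# into an equal ratio such as "1,1" for llama.cpp's --tensor-split flag.
even_split() {
    printf '%s\n' "$1" | sed 's/[^,][^,]*/1/g'
}

gpu_ids="0,1"   # assumed: two identical cards, e.g. dual Tesla P40s

# Print the launch command instead of running it, so it can be reviewed first.
echo "CUDA_VISIBLE_DEVICES=${gpu_ids} ./build/bin/llama-server -m path/to/model.gguf -ngl 99 --tensor-split $(even_split "$gpu_ids")"
```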


Model Format Suffixes

Understanding file extensions and model suffixes is crucial for matching the right model to your specific hardware capabilities.

| Suffix / Format | Best Hardware | Software Engine | Description |
|---|---|---|---|
| GGUF | Pascal/Volta, CPUs, Mac | Ollama, llama.cpp | The most flexible format. Excellent for older GPUs because llama.cpp can split a model between GPU VRAM and system RAM, and its CUDA kernels can use INT8 dot-product instructions (DP4A) on cards that have them, such as the P40. |
| EXL2 | RTX 30/40/50 Series | ExLlamaV2 | Extremely fast, but depends on fast FP16 throughput and modern GPU features. Not recommended for older architectures such as Pascal (e.g., the P40). |
| AWQ / GPTQ | Modern NVIDIA GPUs | vLLM, AutoGPTQ | Standard 4-bit quantization. Great for throughput on newer datacenter cards (Ampere/Hopper). |
| MLX | Apple Silicon (M-Series) | MLX Framework | Format for Apple's open-source MLX framework, optimized for Unified Memory. Designed for Apple Silicon rather than NVIDIA GPUs. |
| Safetensors | High-VRAM Datacenter | Transformers, vLLM | A tensor container format that most often holds the unquantized base weights (usually FP16 or BF16). Very large file sizes; typically the starting point for conversion to GGUF/EXL2. |

GGUF Quantization Levels

When downloading GGUF models, you will see different quantization suffixes. This dictates the balance between VRAM usage, speed, and output quality.

| Quantization | VRAM Required | Quality Loss | Description |
|---|---|---|---|
| Q8_0 | High | Almost None | 8-bit quantization. The sweet spot for Tesla P40s because the P40 has native hardware support for 8-bit math (DP4A instructions). |
| Q5_K_M | Medium | Very Low | Excellent balance. Highly recommended if a model is just slightly too big to fit in your VRAM at Q8. |
| Q4_K_M | Low | Low | The standard community favorite. Compresses a model to about 25-30% of its original size while maintaining great reasoning. |
| Q2_K / IQ2 | Very Low | High | Extreme compression. Only use this if you desperately want to run a massive model (like a 70B) on a single 24GB card. |
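
To make the VRAM trade-off concrete, the sketch below estimates the size of the weights alone as parameters × bits per weight ÷ 8. est_gb is a hypothetical helper; the 8.5 bits-per-weight figure for Q8_0 follows from its block layout (32 8-bit weights plus one FP16 scale), while the 2.7 figure for a 2-bit class quant is an assumed approximation. The KV cache and runtime buffers need additional VRAM on top of this.

```shell
#!/bin/sh
# est_gb: hypothetical helper. Estimates weight size in GB (10^9 bytes) from
#   $1 = parameter count in billions, $2 = bits per weight.
# Ignores the KV cache and runtime buffers, which need extra VRAM.
est_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Q8_0: 32 weights plus one FP16 scale per block = 8.5 bits per weight.
echo "7B  at Q8_0:     $(est_gb 7 8.5) GB"
echo "70B at Q8_0:     $(est_gb 70 8.5) GB"
# Assumed approximation for a 2-bit class quant; actual values vary by type.
echo "70B at ~2.7 bpw: $(est_gb 70 2.7) GB"
```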

Decoding GGUF Quantization Suffixes

GGUF models use a specific naming convention to tell you exactly how the model was compressed. Understanding this helps you balance VRAM usage against the AI's "smartness."

1. Anatomy of a Quantization Name

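| Component | Example | Meaning |
|---|---|---|
| Prefix | Q, IQ | Q: Standard quantization. IQ: "i-quants", a newer non-linear (codebook-based) scheme, usually produced with the help of an importance matrix. |
| Bit Rate | 8, 5, 4 | The nominal number of bits per weight (down from the original 16-bit FP16). The effective figure is slightly higher because scales are stored alongside the weights. |
| Legacy Math | _0, _1 | _0: Symmetric quantization with a scale only (fast, hardware-friendly). _1: Asymmetric quantization (scale plus an offset). |
| K-Quants | _K | Mixed precision. Selected tensors that matter most for quality are kept at a higher bit width and the rest at a lower one; the mix is fixed when the model is quantized, not chosen at runtime. |
| Size/Scale | _S, _M, _L | _S (Small): Maximizes VRAM savings. _M (Medium): The "Sweet Spot" for quality vs. size. _L (Large): Highest quality within that bit level. |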
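
The naming anatomy can be sketched as a small parser. parse_quant is a hypothetical helper written for illustration only; it handles the common patterns listed here (such as Q8_0, Q5_K_M, and IQ3_M) and does not attempt to validate every suffix that exists.

```shell
#!/bin/sh
# parse_quant: hypothetical helper that splits a GGUF quantization suffix
# into prefix, bit rate, scheme (K, 0, or 1), and size.
# A component that is absent from the name is printed as "-".
parse_quant() {
    name=$1
    case $name in
        IQ*) prefix=IQ; rest=${name#IQ} ;;
        Q*)  prefix=Q;  rest=${name#Q} ;;
        *)   echo "unrecognized: $name" >&2; return 1 ;;
    esac
    bits=${rest%%_*}        # digits before the first underscore
    tail=${rest#"$bits"}    # e.g. "_K_M", "_0", "_M", or empty
    scheme=-
    size=-
    case $tail in
        _K*) scheme=K; tail=${tail#_K} ;;
        _0*) scheme=0; tail=${tail#_0} ;;
        _1*) scheme=1; tail=${tail#_1} ;;
    esac
    case $tail in
        _*) size=${tail#_} ;;
    esac
    echo "prefix=$prefix bits=$bits scheme=$scheme size=$size"
}

for q in Q8_0 Q5_K_M IQ3_M; do
    parse_quant "$q"
done
```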

2. Common Quantization Levels (From Best Quality to Smallest Size)

| Suffix | VRAM Footprint | Quality / Perplexity | Notes & Best Use Case |
|---|---|---|---|
| Q8_0 | Very Large | Near Flawless | Practically identical to the uncompressed model. Perfect for the Tesla P40, which has native 8-bit hardware math (DP4A). |
| Q6_K | Large | Excellent | Very low quality loss. Great if Q8_0 is just barely too big for your VRAM. |
| Q5_K_M | Medium | Very Good | The gold standard "Sweet Spot". Highly recommended for complex reasoning and coding tasks. |
| Q4_K_M | Small | Good | The community default. Shrinks the model to ~25-30% of its original size while keeping it very coherent. |
| IQ3_M | Very Small | Fair | An i-quant, normally built with an importance matrix. Noticeable degradation, but remarkably usable for 3-bit. |
| IQ2_M | Extreme | Poor to Fair | 2-bit compression. Only use this if you are desperate to fit a massive model (like a 70B+) onto a single 24GB GPU. |

Advanced Inference: vLLM

vLLM is a high-throughput, memory-efficient LLM inference engine. Its PagedAttention mechanism stores the attention key/value (KV) cache in fixed-size blocks, which reduces memory fragmentation when serving many requests.

When to use vLLM?

  • High Concurrency: When multiple users are querying the model simultaneously.
  • Modern Hardware: Best suited for GPUs with Compute Capability 7.0 (Volta) or higher.
  • Flash Attention: vLLM leans heavily on FlashAttention kernels, which target newer tensor-core GPUs (the Tesla P40 has no Tensor Cores).

Run a Model via vLLM

python3 -m vllm.entrypoints.openai.api_server \
    --model unsloth/Qwen3.5-MoE-A17B \
    --quantization awq \
    --tensor-parallel-size 2

⚠️ Note for Legacy GPUs: While vLLM is a common choice for production serving on modern hardware, llama.cpp remains the better fit for squeezing performance out of older, non-Tensor Core GPUs like the P40.


NVIDIA CUDA Micro-Architecture & Compute Capability List

Use this table to find the target [compute_cap] for your specific hardware generation.

| Architecture | Release Date | Compute Capability | GeForce | Workstation / Data Center | Jetson |
|---|---|---|---|---|---|
| Fermi | 2010 | 2.0, 2.1 | GeForce 400/500 series | Quadro 600 | |
| Kepler | 2012 | 3.0, 3.5, 3.7 | GeForce 600/700 series | Quadro K600 | |
| Maxwell | 2014 | 5.0, 5.2, 5.3 | GeForce 900 series | Quadro K620 | Jetson Nano |
| Pascal | 2016 | 6.0, 6.1, 6.2 | GeForce 10 series | Quadro P600, Tesla P40 | Jetson TX2 |
| Volta | 2017 | 7.0, 7.2 | NVIDIA TITAN V | Quadro GV100 | Jetson Xavier NX, AGX Xavier |
| Turing | 2018 | 7.5 | GeForce 20 series | NVIDIA T600, Quadro RTX 4000 | |
| Ampere | 2020 | 8.0, 8.6, 8.7 | GeForce 30 series | NVIDIA RTX A2000 | Jetson Orin NX, AGX Orin |
| Ada Lovelace | 2022 | 8.9 | GeForce 40 series | NVIDIA RTX 2000 | |
| Hopper | 2022 | 9.0 | | H100 (Data Center) | |
| Blackwell | 2024 | 10.0, 12.0 | GeForce 50 series (12.0) | B200 (10.0) | |
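
For scripting, the table can be condensed into a lookup. arch_name is a hypothetical helper that maps a Compute Capability to the architecture names used here; it covers only these generations and treats both 10.x and 12.x as Blackwell.

```shell
#!/bin/sh
# arch_name: hypothetical helper mapping a Compute Capability
# (e.g. "6.1") to its architecture name.
# Specific values (7.5, 8.9) are listed before their wildcard siblings.
arch_name() {
    case $1 in
        2.*)       echo "Fermi" ;;
        3.*)       echo "Kepler" ;;
        5.*)       echo "Maxwell" ;;
        6.*)       echo "Pascal" ;;
        7.5)       echo "Turing" ;;
        7.*)       echo "Volta" ;;
        8.9)       echo "Ada Lovelace" ;;
        8.*)       echo "Ampere" ;;
        9.*)       echo "Hopper" ;;
        10.*|12.*) echo "Blackwell" ;;
        *)         echo "unknown" ;;
    esac
}

echo "6.1 -> $(arch_name 6.1)"
echo "8.9 -> $(arch_name 8.9)"
```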

Resources