AI

Advanced AI Arch & Networking Cheatsheet

Decoding MoE architectures (Qwen / Llama 4), hardware bottlenecks, MoE-specific llama.cpp optimization, and Ollama network binding.

#moe #llama4 #qwen #rwkv #networking #llamacpp #performance

Decoding Mixture-of-Experts (MoE) Naming

Modern giant models divide their neural networks into smaller "experts." The model only runs a subset of these experts per word, giving you the knowledge of a massive model at the speed of a smaller one.

Qwen Nomenclature (TotalB-ActiveB)

  • Example: Qwen3.5-122B-A10B
  • 122B (Total): The entire size of the model on your disk and in system RAM.
  • A10B (Active): The "Active Parameters." The model only uses 10 Billion parameters to generate a single token. Your GPU compute speed is based on this number.

Meta Llama 4 Nomenclature (ActiveB-ExpertsE)

  • Example: Llama-4-Scout-17B-16E
  • Scout / Maverick: Meta's tiers. Scout has a 10M token context window; Maverick has a 1M token context window.
  • 17B (Active): The Active Parameters used during inference.
  • 16E (Experts): The total number of experts the model is divided into.
  • Hidden Total: You have to calculate or look up the total size. (e.g., Scout 16E is ~109B Total Parameters; Maverick 128E is ~400B Total Parameters).

The MoE Hardware Reality: The PCIe Penalty

When running MoE models on budget hardware (like a Tesla P40) where the model is larger than your VRAM, you must split the model between GPU VRAM and System RAM.

  • The Bottleneck: When the AI's router selects an expert that is stored in your system RAM, the CPU must drag those gigabytes of weights across the PCIe bus to the GPU.
  • The Result: The GPU stalls and waits. Even if the active parameters are small (e.g., 10B), generation speed will drop to less than 1 token/second due to bandwidth limits.
  • The Fix: Never use FP8 or FP16 for MoEs if you lack VRAM. Always use aggressive quantization (like Q4_K_M) to shrink the Total Parameter size, allowing more experts to fit directly inside the GPU's fast VRAM.

Optimizing llama.cpp for MoE & Dual GPUs

When running MoEs across multiple GPUs (like dual Tesla P40s), you need specific flags to prevent the engine from poorly distributing the experts.

./build/bin/llama-server \
  -m models/Llama-4-Scout-17B-16E.Q4_K_M.gguf \
  -ngl 99 \
  --split-mode row \
  --numa distribute

Flag Explanations:

  • --split-mode row: Crucial for MoEs. Instead of splitting individual tensors across multiple GPUs (which forces the GPUs to constantly talk to each other), this keeps whole layers intact on specific GPUs. It drastically reduces VRAM fragmentation.
  • --numa distribute: If your server has dual CPUs (NUMA nodes), this ensures system RAM is allocated evenly, preventing memory bandwidth choking when fetching offloaded experts.
  • -ngl [number]: If the model doesn't fit entirely in VRAM, lower this from 99. llama.cpp automatically prioritizes keeping the heaviest compute layers (the experts) in VRAM and pushes the lighter, non-expert layers to your system RAM.

Serving Non-Standard Architectures (RWKV)

RWKV is a Linear RNN (Recurrent Neural Network), not a Transformer. It lacks self-attention, meaning its memory usage stays perfectly flat regardless of how large the context window gets.

Ollama natively supports RWKV via llama.cpp backend integration. You do not need the Heretic tool (which fails on non-Transformers) or specialized software.

Run RWKV directly:

ollama run mollysama/rwkv-7-g1e:2.9b

Network Configuration: Exposing Ollama to your LAN

To use tools like LM Studio, Open WebUI, or custom coding scripts on your laptop to control the models sitting on your Tesla P40 server, you must bind Ollama to your local network.

1. Bind to 0.0.0.0

By default, Ollama only listens to localhost.

  • Linux (Systemd): Run systemctl edit ollama and add: Environment="OLLAMA_HOST=0.0.0.0"
  • Windows: Add a system environment variable OLLAMA_HOST with the value 0.0.0.0.

2. Connect Remote Tools

In your client application (like LM Studio), configure the API connection:

  • Protocol: OpenAI Compatible
  • Base URL: http://[server_ip]:11434/v1

Ollama handles hot-swapping automatically. If you request a Qwen model and then an RWKV model over the API, Ollama will seamlessly unload and load the weights in your server's VRAM.