Advanced AI Arch & Networking Cheatsheet
Decoding MoE architectures (Qwen / Llama 4), hardware bottlenecks, MoE-specific llama.cpp optimization, and Ollama network binding.
Decoding Mixture-of-Experts (MoE) Naming
Modern giant models divide their neural networks into smaller "experts." The model only runs a subset of these experts per word, giving you the knowledge of a massive model at the speed of a smaller one.
Qwen Nomenclature (TotalB-ActiveB)
- Example:
Qwen3.5-122B-A10B 122B(Total): The entire size of the model on your disk and in system RAM.A10B(Active): The "Active Parameters." The model only uses 10 Billion parameters to generate a single token. Your GPU compute speed is based on this number.
Meta Llama 4 Nomenclature (ActiveB-ExpertsE)
- Example:
Llama-4-Scout-17B-16E Scout/Maverick: Meta's tiers. Scout has a 10M token context window; Maverick has a 1M token context window.17B(Active): The Active Parameters used during inference.16E(Experts): The total number of experts the model is divided into.- Hidden Total: You have to calculate or look up the total size. (e.g., Scout 16E is ~109B Total Parameters; Maverick 128E is ~400B Total Parameters).
The MoE Hardware Reality: The PCIe Penalty
When running MoE models on budget hardware (like a Tesla P40) where the model is larger than your VRAM, you must split the model between GPU VRAM and System RAM.
- The Bottleneck: When the AI's router selects an expert that is stored in your system RAM, the CPU must drag those gigabytes of weights across the PCIe bus to the GPU.
- The Result: The GPU stalls and waits. Even if the active parameters are small (e.g., 10B), generation speed will drop to less than 1 token/second due to bandwidth limits.
- The Fix: Never use
FP8orFP16for MoEs if you lack VRAM. Always use aggressive quantization (likeQ4_K_M) to shrink the Total Parameter size, allowing more experts to fit directly inside the GPU's fast VRAM.
Optimizing llama.cpp for MoE & Dual GPUs
When running MoEs across multiple GPUs (like dual Tesla P40s), you need specific flags to prevent the engine from poorly distributing the experts.
./build/bin/llama-server \
-m models/Llama-4-Scout-17B-16E.Q4_K_M.gguf \
-ngl 99 \
--split-mode row \
--numa distribute
Flag Explanations:
--split-mode row: Crucial for MoEs. Instead of splitting individual tensors across multiple GPUs (which forces the GPUs to constantly talk to each other), this keeps whole layers intact on specific GPUs. It drastically reduces VRAM fragmentation.--numa distribute: If your server has dual CPUs (NUMA nodes), this ensures system RAM is allocated evenly, preventing memory bandwidth choking when fetching offloaded experts.-ngl [number]: If the model doesn't fit entirely in VRAM, lower this from 99.llama.cppautomatically prioritizes keeping the heaviest compute layers (the experts) in VRAM and pushes the lighter, non-expert layers to your system RAM.
Serving Non-Standard Architectures (RWKV)
RWKV is a Linear RNN (Recurrent Neural Network), not a Transformer. It lacks self-attention, meaning its memory usage stays perfectly flat regardless of how large the context window gets.
Ollama natively supports RWKV via llama.cpp backend integration. You do not need the Heretic tool (which fails on non-Transformers) or specialized software.
Run RWKV directly:
ollama run mollysama/rwkv-7-g1e:2.9b
Network Configuration: Exposing Ollama to your LAN
To use tools like LM Studio, Open WebUI, or custom coding scripts on your laptop to control the models sitting on your Tesla P40 server, you must bind Ollama to your local network.
1. Bind to 0.0.0.0
By default, Ollama only listens to localhost.
- Linux (Systemd): Run
systemctl edit ollamaand add:Environment="OLLAMA_HOST=0.0.0.0" - Windows: Add a system environment variable
OLLAMA_HOSTwith the value0.0.0.0.
2. Connect Remote Tools
In your client application (like LM Studio), configure the API connection:
- Protocol: OpenAI Compatible
- Base URL:
http://[server_ip]:11434/v1
Ollama handles hot-swapping automatically. If you request a Qwen model and then an RWKV model over the API, Ollama will seamlessly unload and load the weights in your server's VRAM.