Ollama
Run, manage, and query large language models locally with the Ollama CLI and REST API.
Installation
Install Ollama on Linux or macOS.
curl -fsSL https://ollama.com/install.sh | sh
Start the server manually if needed.
ollama serve
💡 Tip: On Linux, Ollama installs as a `systemd` service. Use `systemctl status ollama` to check status.
Model Management
Pull a Model
ollama pull [model]
Pull a specific tag.
ollama pull llama3.2:3b
Pull a quantized variant.
ollama pull mistral:7b-instruct-q4_K_M
List Downloaded Models
ollama list
Show Model Details
ollama show [model]
Show only the Modelfile.
ollama show [model] --modelfile
Copy a Model
ollama cp [model] my-custom-name
Remove a Model
ollama rm [model]
Push a Model to Registry
ollama push username/[model]
Running Models
Interactive Chat (REPL)
ollama run [model]
Run with an initial prompt.
ollama run [model] "[prompt]"
💡 Tip: Inside the REPL, type `/help` to see commands like `/set`, `/save`, `/load`, `/bye`.
Pipe Input
echo "[prompt]" | ollama run [model]
Summarize a file piped into the model.
cat document.txt | ollama run [model] "Summarize this document."
Multiline Prompt
ollama run [model] <<'EOF'
Explain the following Rust error in plain English:
error[E0502]: cannot borrow `x` as mutable because it is also borrowed as immutable
EOF
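Run with Custom Parameters
Print timing and token statistics after each response.
ollama run [model] --verbose
Change a sampling parameter from inside the REPL.
/set parameter temperature 0.1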
REST API
The Ollama server exposes a REST API at http://localhost:11434. In the examples below, [host] stands for that base URL.
Generate (Single-Turn)
Parameters:
- `model` (Required): Model name.
- `prompt` (Required): The input prompt.
- `stream` (Optional): Stream tokens as they generate (default: `true`).
- `options` (Optional): Model parameters (temperature, top_p, etc.).
curl [host]/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "[model]",
"prompt": "[prompt]",
"stream": false
}'
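When `stream` is left at its default of `true`, the server answers with one JSON object per line instead of a single body. The following is a minimal Python sketch of reassembling such a stream, assuming each line carries a `response` text fragment and the last object has `done` set to `true`. The helper name `collect_stream` is hypothetical and not part of Ollama.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the text fragments of a newline-delimited JSON stream."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it could be passed to `collect_stream` directly.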
Generate with Options
curl [host]/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "[model]",
"prompt": "[prompt]",
"stream": false,
"options": {
"temperature": 0.1,
"top_p": 0.9,
"top_k": 40,
"num_predict": 512,
"seed": 42
}
}'
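The same request can be issued from a script. This is a rough Python equivalent of the curl call above using only the standard library. The function names `build_generate_request` and `generate` are illustrative, and the sketch assumes the server is reachable at the default address and returns the text in a `response` field.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default address from the REST API section


def build_generate_request(model, prompt, base_url=BASE_URL, **options):
    """Return a POST request for /api/generate with optional sampling options."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options  # e.g. temperature, top_p, seed
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, **options):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt, **options)
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["response"]
```

A call such as `generate("llama3.2:3b", "Why is the sky blue?", temperature=0.1, seed=42)` would mirror the options shown in the curl example.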
Chat (Multi-Turn)
curl [host]/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "[model]",
"stream": false,
"messages": [
{"role": "system", "content": "You are a concise Linux expert."},
{"role": "user", "content": "What does the 2>/dev/null trick do?"}
]
}'
Chat with History
curl [host]/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "[model]",
"stream": false,
"messages": [
{"role": "user", "content": "What is a closure?"},
{"role": "assistant", "content": "A closure is a function that captures variables from its enclosing scope..."},
{"role": "user", "content": "Show me an example in Rust."}
]
}'
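The chat endpoint is stateless, so the client has to resend the full message list on every turn. The sketch below shows one way to keep that history in Python. `ChatSession` and `post_chat` are hypothetical names, and the code assumes the non-streaming reply contains a `message` object with `role` and `content`.

```python
import json
import urllib.request


def post_chat(payload, base_url="http://localhost:11434"):
    """POST a payload to /api/chat and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


class ChatSession:
    """Keeps the running message list so each turn sees the earlier ones."""

    def __init__(self, model, system=None, send=post_chat):
        self.model = model
        self.send = send  # injectable so the class can be tried without a server
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, content):
        self.messages.append({"role": "user", "content": content})
        reply = self.send(
            {"model": self.model, "stream": False, "messages": self.messages}
        )
        message = reply["message"]
        # Store the assistant turn so the next request includes it.
        self.messages.append(
            {"role": message["role"], "content": message["content"]}
        )
        return message["content"]
```

Calling `ask("What is a closure?")` and then `ask("Show me an example in Rust.")` on one session reproduces the history pattern from the curl example.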
Embeddings
curl [host]/api/embed \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"input": "Rust ownership rules explained."
}'
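Embedding vectors are usually compared with cosine similarity. The following Python sketch ranks documents against a query, assuming the vectors have already been read from the `embeddings` list in the endpoint's reply (one vector per input). The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```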
List Local Models (API)
curl [host]/api/tags
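The reply is a JSON object with a `models` array. This small Python sketch condenses it into name and size pairs, assuming each entry has a `name` and a `size` in bytes. `summarize_models` is a hypothetical helper.

```python
def summarize_models(tags_response):
    """Return (name, size in GB) pairs sorted largest first."""
    rows = [
        (m["name"], round(m.get("size", 0) / 1e9, 2))
        for m in tags_response.get("models", [])
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```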
Show Model Info (API)
curl [host]/api/show \
-H "Content-Type: application/json" \
-d '{"name": "[model]"}'
Pull a Model (API)
curl [host]/api/pull \
-H "Content-Type: application/json" \
-d '{"name": "[model]", "stream": false}'
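If `stream` is left at its default rather than set to `false` as above, the pull endpoint emits a sequence of status objects while it downloads. The sketch below turns one such line into a progress percentage, assuming download events carry `total` and `completed` byte counts. `pull_progress` is an illustrative name.

```python
import json


def pull_progress(line):
    """Turn one streamed /api/pull line into (status, percent or None)."""
    event = json.loads(line)
    total = event.get("total")
    completed = event.get("completed", 0)
    # Events without a total (manifest, verification) have no percentage.
    percent = round(100 * completed / total, 1) if total else None
    return event.get("status", ""), percent
```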
Delete a Model (API)
curl -X DELETE [host]/api/delete \
-H "Content-Type: application/json" \
-d '{"name": "[model]"}'
Check Running Models
curl [host]/api/ps
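The reply can also show how much of each loaded model sits in GPU memory. This is a hedged Python sketch, assuming each entry in the `models` array reports `size` and `size_vram` in bytes. `gpu_share` is a hypothetical helper.

```python
def gpu_share(ps_response):
    """Map each loaded model name to the fraction of its memory held in VRAM."""
    shares = {}
    for entry in ps_response.get("models", []):
        size = entry.get("size", 0)
        vram = entry.get("size_vram", 0)
        shares[entry["name"]] = vram / size if size else 0.0
    return shares
```

A share of 1.0 would mean the model is fully offloaded to the GPU, and 0.0 would mean it runs on the CPU.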
Modelfile
Create a custom model with a Modelfile:
Basic Modelfile
FROM [model]
SYSTEM """
You are an expert Rust engineer. You write safe, idiomatic Rust.
You always explain why a design decision was made.
Never suggest code that uses unwrap() without justification.
"""
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 2048
Build and register the custom model.
ollama create my-rust-expert -f ./Modelfile
Test the custom model.
ollama run my-rust-expert "How should I handle errors in a CLI app?"
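When several variants of a custom model are needed, the Modelfile text can be generated instead of written by hand. The following Python sketch covers only the `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above. `render_modelfile` is a hypothetical helper, not an Ollama tool.

```python
def render_modelfile(base, system=None, **parameters):
    """Build Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}"]
    if system:
        # Triple quotes let the system prompt span multiple lines.
        lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"
```

Writing the result to `./Modelfile` and running `ollama create my-rust-expert -f ./Modelfile` would register it as before.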
Modelfile Parameters
| Parameter | Description | Example |
|---|---|---|
| temperature | Randomness (0 = deterministic) | 0.1 |
| top_p | Nucleus sampling | 0.9 |
| top_k | Top-K sampling | 40 |
| num_ctx | Context window size (tokens) | 8192 |
| num_predict | Max tokens to generate | 1024 |
| repeat_penalty | Penalise token repetition | 1.1 |
| seed | Fixed seed for reproducibility | 42 |
| stop | Stop sequences | `"< |
Modelfile with Template
Customise the prompt template for models that use special tokens:
FROM mistral
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""
PARAMETER temperature 0.7
Environment Variables
| Variable | Default | Description |
|---|---|---|
| OLLAMA_HOST | 127.0.0.1:11434 | Interface and port to listen on |
| OLLAMA_MODELS | ~/.ollama/models | Path to model storage directory |
| OLLAMA_NUM_PARALLEL | 1 | Max concurrent request handlers |
| OLLAMA_MAX_QUEUE | 512 | Max queued requests |
| OLLAMA_KEEP_ALIVE | 5m | How long to keep model in memory |
| OLLAMA_DEBUG | false | Enable debug logging |
Expose Ollama on all interfaces.
OLLAMA_HOST=0.0.0.0 ollama serve
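Client scripts often need to turn `OLLAMA_HOST` into a URL they can call. The sketch below is one way to do that in Python under simplifying assumptions: it falls back to the default from the table, adds port 11434 only when no scheme and no port are given, maps the listen address 0.0.0.0 to loopback, and does not handle IPv6 literals. `base_url_from_env` is a hypothetical helper.

```python
import os


def base_url_from_env(environ=os.environ):
    """Derive a client base URL from OLLAMA_HOST, falling back to the default."""
    host = environ.get("OLLAMA_HOST", "").strip() or "127.0.0.1:11434"
    scheme = "http"
    if "://" in host:
        scheme, host = host.split("://", 1)
    elif ":" not in host:
        host = f"{host}:11434"  # bare host name, assume the default port
    host = host.rstrip("/")
    if host.startswith("0.0.0.0"):
        # 0.0.0.0 is a listen address; a local client connects via loopback.
        host = host.replace("0.0.0.0", "127.0.0.1", 1)
    return f"{scheme}://{host}"
```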
Increase the default context window.
OLLAMA_CONTEXT_LENGTH=16384 ollama serve
GPU Configuration
Check whether a loaded model is running on the GPU. The PROCESSOR column reports the CPU/GPU split.
ollama ps
Force CPU-only mode on NVIDIA systems by passing an invalid GPU ID.
CUDA_VISIBLE_DEVICES=-1 ollama serve
Select a specific GPU on a multi-GPU system.
CUDA_VISIBLE_DEVICES=1 ollama serve
Popular Models
| Model | Pull Command | Best For |
|---|---|---|
| Llama 3.2 3B | ollama pull llama3.2:3b | Fast, general purpose |
| Llama 3.3 70B | ollama pull llama3.3 | High quality reasoning |
| Mistral 7B | ollama pull mistral | Instruction following |
| Phi-4 | ollama pull phi4 | Compact, strong reasoning |
| Gemma 3 | ollama pull gemma3 | Google, strong coding |
| Qwen2.5-Coder | ollama pull qwen2.5-coder | Code generation |
| DeepSeek-R1 | ollama pull deepseek-r1 | Deep reasoning |
| nomic-embed-text | ollama pull nomic-embed-text | Text embeddings |