Ollama

Run, manage, and query large language models locally with the Ollama CLI and REST API.

#ollama #llm #local-ai #ai #api #gpu #llama #mistral #phi

Installation

Install Ollama on Linux with the official install script. On macOS, the desktop app from ollama.com/download is the usual route.

curl -fsSL https://ollama.com/install.sh | sh

Start the server manually if needed.

ollama serve

💡 Tip: On Linux, Ollama installs as a systemd service. Use systemctl status ollama to check status.


Model Management

Pull a Model

ollama pull [model]

Pull a specific tag.

ollama pull llama3.2:3b

Pull a quantized variant.

ollama pull mistral:7b-instruct-q4_K_M

List Downloaded Models

ollama list

Show Model Details

ollama show [model]

Show only the Modelfile.

ollama show [model] --modelfile

Copy a Model

ollama cp [model] my-custom-name

Remove a Model

ollama rm [model]

Push a Model to Registry

ollama push username/[model]

Running Models

Interactive Chat (REPL)

ollama run [model]

Run with an initial prompt.

ollama run [model] "[prompt]"

💡 Tip: Inside the REPL, type /help to see commands like /set, /save, /load, /bye.

Pipe Input

echo "[prompt]" | ollama run [model]

Summarize a file piped into the model.

cat document.txt | ollama run [model] "Summarize this document."
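
The same pipe can be driven from a script. The following is a minimal Python sketch, assuming the ollama binary is on PATH; build_run_command and run_with_stdin are hypothetical helper names, not part of Ollama.

```python
import subprocess


def build_run_command(model, instruction=None):
    """Return the argv list for a one-shot `ollama run` call."""
    cmd = ["ollama", "run", model]
    if instruction:
        cmd.append(instruction)
    return cmd


def run_with_stdin(model, text, instruction=None, timeout=300):
    """Pipe `text` to `ollama run` on stdin and return the model's reply."""
    result = subprocess.run(
        build_run_command(model, instruction),
        input=text,           # plays the role of `cat document.txt |`
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

For example, run_with_stdin("llama3.2:3b", open("document.txt").read(), "Summarize this document.") should behave like the shell pipeline above.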

Multiline Prompt

ollama run [model] <<'EOF'
Explain the following Rust error in plain English:

error[E0502]: cannot borrow `x` as mutable because it is also borrowed as immutable
EOF

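Run with Verbose Output

Print timing and token-rate statistics after each response.

ollama run [model] --verbose

💡 Tip: Sampling parameters are not command-line flags. Set them inside the REPL with /set parameter (for example, /set parameter temperature 0.1) or in a Modelfile.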

REST API

The Ollama server exposes a REST API at http://localhost:11434 by default. In the examples below, [host] stands for this base URL.

Generate (Single-Turn)

Parameters:

  • model (Required): Model name.
  • prompt (Required): The input prompt.
  • stream (Optional): Stream tokens as they generate (default: true).
  • options (Optional): Model parameters (temperature, top_p, etc.).

curl [host]/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[model]",
    "prompt": "[prompt]",
    "stream": false
  }'
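
When "stream" is left at its default of true, the reply arrives as one JSON object per line rather than a single document. A minimal Python sketch of reassembling the text, assuming each line carries a "response" fragment and the last one has "done": true (the field names used in the Ollama API docs); collect_stream is a hypothetical helper name.

```python
import json


def collect_stream(lines):
    """Join the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue                      # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):             # final object: stop reading
            break
    return "".join(parts)
```

Any iterable of lines works as input, including the response object returned by urllib.request.urlopen, which yields one bytes line per iteration.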

Generate with Options

curl [host]/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[model]",
    "prompt": "[prompt]",
    "stream": false,
    "options": {
      "temperature": 0.1,
      "top_p":       0.9,
      "top_k":       40,
      "num_predict": 512,
      "seed":        42
    }
  }'
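
The same request can be sent with Python's standard library. This is a sketch under assumptions: build_generate_request and generate are hypothetical helper names, and the reply is assumed to carry the generated text in a "response" field, as the Ollama API docs describe for non-streaming calls.

```python
import json
import urllib.request


def build_generate_request(base_url, model, prompt, options=None):
    """Build (but do not send) a non-streaming POST to /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options      # e.g. {"temperature": 0.1, "seed": 42}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, model, prompt, options=None, timeout=120):
    """Send the request and return only the generated text."""
    req = build_generate_request(base_url, model, prompt, options)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything reaches the server.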

Chat (Multi-Turn)

curl [host]/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[model]",
    "stream": false,
    "messages": [
      {"role": "system",    "content": "You are a concise Linux expert."},
      {"role": "user",      "content": "What does the 2>/dev/null trick do?"}
    ]
  }'

Chat with History

curl [host]/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "[model]",
    "stream": false,
    "messages": [
      {"role": "user",      "content": "What is a closure?"},
      {"role": "assistant", "content": "A closure is a function that captures variables from its enclosing scope..."},
      {"role": "user",      "content": "Show me an example in Rust."}
    ]
  }'
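
The server keeps no conversation state, so the client resends the whole message list on every turn. A minimal Python sketch of that bookkeeping; ChatSession and post_chat are hypothetical names, and the reply is assumed to arrive under a "message" key with "role" and "content", per the Ollama API docs.

```python
import json
import urllib.request


def post_chat(base_url, payload, timeout=120):
    """POST a payload to /api/chat and return the decoded JSON reply."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


class ChatSession:
    """Accumulates the message history and resends it on each turn."""

    def __init__(self, model, base_url="http://localhost:11434",
                 system=None, send=post_chat):
        self.model = model
        self.base_url = base_url
        self.send = send                  # injectable for offline testing
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, content):
        self.messages.append({"role": "user", "content": content})
        payload = {"model": self.model, "stream": False,
                   "messages": list(self.messages)}
        reply = self.send(self.base_url, payload)["message"]
        # Store the assistant turn so the next request includes it.
        self.messages.append({"role": reply["role"], "content": reply["content"]})
        return reply["content"]
```

Calling ask twice on the same session reproduces the three-message history shown above, with the assistant turn filled in from the real reply.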

Embeddings

curl [host]/api/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": "Rust ownership rules explained."
  }'
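
Embedding vectors are usually compared with cosine similarity. A minimal Python sketch, assuming the vectors have already been read from the "embeddings" list that the Ollama API docs describe in the /api/embed reply; cosine_similarity and rank_by_similarity are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                        # treat a zero vector as unrelated
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

Embedding a query and a handful of documents with the same model, then calling rank_by_similarity, gives a small semantic search without extra dependencies.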

List Local Models (API)

curl [host]/api/tags

Show Model Info (API)

curl [host]/api/show \
  -H "Content-Type: application/json" \
  -d '{"model": "[model]"}'

Pull a Model (API)

curl [host]/api/pull \
  -H "Content-Type: application/json" \
  -d '{"model": "[model]", "stream": false}'
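
With streaming enabled, a pull reports progress as one JSON object per line. A minimal Python sketch of turning one such line into a percentage; it assumes the "status", "total", and "completed" fields shown in the Ollama API docs and an "error" field on failure, and pull_progress is a hypothetical helper name.

```python
import json


def pull_progress(line):
    """Return (status, percent or None) for one line of a streamed pull."""
    event = json.loads(line)
    if "error" in event:                  # assumed failure shape
        raise RuntimeError(event["error"])
    total = event.get("total")
    completed = event.get("completed", 0)
    # Only download steps carry byte counts; other steps report status alone.
    percent = round(100.0 * completed / total, 1) if total else None
    return event.get("status", ""), percent
```

Lines without byte counts, such as the manifest and verification steps, come back with a percent of None.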

Delete a Model (API)

curl -X DELETE [host]/api/delete \
  -H "Content-Type: application/json" \
  -d '{"model": "[model]"}'

Check Running Models

curl [host]/api/ps
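
The reply can also show how much of each loaded model sits in GPU memory. A minimal Python sketch, assuming each entry under "models" carries "name", "size", and "size_vram" as in the Ollama API docs; gpu_share and summarize_ps are hypothetical helper names.

```python
def gpu_share(entry):
    """Fraction of a loaded model resident in GPU memory (0.0 to 1.0)."""
    size = entry.get("size", 0)
    if not size:
        return 0.0
    return min(entry.get("size_vram", 0) / size, 1.0)


def summarize_ps(response):
    """Map each loaded model name to its GPU share."""
    return {m["name"]: gpu_share(m) for m in response.get("models", [])}
```

A share of 1.0 means the model is fully offloaded to the GPU, 0.0 means it runs on the CPU, and anything in between indicates a split.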

Modelfile

Create a custom model with a Modelfile:

Basic Modelfile

FROM [model]

SYSTEM """
You are an expert Rust engineer. You write safe, idiomatic Rust.
You always explain why a design decision was made.
Never suggest code that uses unwrap() without justification.
"""

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_predict 2048

Build and register the custom model.

ollama create my-rust-expert -f ./Modelfile

Test the custom model.

ollama run my-rust-expert "How should I handle errors in a CLI app?"
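
When several custom models share a layout, the Modelfile can be generated instead of hand-written. A minimal Python sketch that emits only the FROM, SYSTEM, and PARAMETER instructions used above; render_modelfile is a hypothetical helper name.

```python
def render_modelfile(base, system=None, parameters=None):
    """Render a Modelfile with FROM, optional SYSTEM, and PARAMETER lines."""
    lines = ["FROM " + base]
    if system:
        if '"""' in system:
            # A triple quote would end the SYSTEM block early.
            raise ValueError('system prompt must not contain """')
        lines += ["", 'SYSTEM """', system.strip(), '"""']
    if parameters:
        lines.append("")
        for name, value in parameters.items():
            lines.append("PARAMETER %s %s" % (name, value))
    return "\n".join(lines) + "\n"
```

Write the returned text to ./Modelfile and pass it to ollama create as shown above.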

Modelfile Parameters

Parameter       Description                     Example
temperature     Randomness (0 = deterministic)  0.1
top_p           Nucleus sampling                0.9
top_k           Top-K sampling                  40
num_ctx         Context window size (tokens)    8192
num_predict     Max tokens to generate          1024
repeat_penalty  Penalise token repetition       1.1
seed            Fixed seed for reproducibility  42
stop            Stop sequences                  `"<

Modelfile with Template

Customise the prompt template for models that use special tokens:

FROM mistral

TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""

PARAMETER temperature 0.7

Environment Variables

Variable             Default           Description
OLLAMA_HOST          127.0.0.1:11434   Interface and port to listen on
OLLAMA_MODELS        ~/.ollama/models  Path to model storage directory
OLLAMA_NUM_PARALLEL  1                 Max concurrent request handlers
OLLAMA_MAX_QUEUE     512               Max queued requests
OLLAMA_KEEP_ALIVE    5m                How long to keep model in memory
OLLAMA_DEBUG         false             Enable debug logging

Expose Ollama on all interfaces.

OLLAMA_HOST=0.0.0.0 ollama serve
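
Client scripts often read the same variable to find the server, but 0.0.0.0 is a bind address, not something to connect to. A minimal Python sketch of deriving a base URL from OLLAMA_HOST; client_base_url is a hypothetical helper and its rules are a simplification (a missing port becomes 11434, IPv6 literals and scheme-specific default ports are not handled).

```python
import os


def client_base_url(env=None):
    """Derive an HTTP base URL for clients from OLLAMA_HOST."""
    env = os.environ if env is None else env
    raw = (env.get("OLLAMA_HOST") or "127.0.0.1:11434").strip().rstrip("/")
    scheme, sep, rest = raw.partition("://")
    if not sep:                           # no scheme given, e.g. "host:port"
        scheme, rest = "http", raw
    host, colon, port = rest.rpartition(":")
    if not colon:                         # no port given, e.g. "0.0.0.0"
        host, port = rest, "11434"
    if host in ("", "0.0.0.0"):
        host = "127.0.0.1"                # connect via loopback instead
    return "%s://%s:%s" % (scheme, host, port)
```

The returned string can stand in for [host] in the REST examples.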

Increase the default context window.

OLLAMA_CONTEXT_LENGTH=16384 ollama serve

GPU Configuration

Check whether a loaded model is running on the GPU. The PROCESSOR column shows the CPU/GPU split.

ollama ps

Force CPU-only mode by passing an invalid GPU ID.

CUDA_VISIBLE_DEVICES=-1 ollama serve

Select a specific GPU on a multi-GPU system.

CUDA_VISIBLE_DEVICES=1 ollama serve

Popular Models

Model             Pull Command                  Best For
Llama 3.2 3B      ollama pull llama3.2:3b       Fast, general purpose
Llama 3.3 70B     ollama pull llama3.3          High quality reasoning
Mistral 7B        ollama pull mistral           Instruction following
Phi-4             ollama pull phi4              Compact, strong reasoning
Gemma 3           ollama pull gemma3            Google, strong coding
Qwen2.5-Coder     ollama pull qwen2.5-coder     Code generation
DeepSeek-R1       ollama pull deepseek-r1       Deep reasoning
nomic-embed-text  ollama pull nomic-embed-text  Text embeddings

Resources