Ollama Setup

Ollama lets you run LLMs on your own infrastructure, making it a good fit for air-gapped environments, strict data-sovereignty requirements, and workloads where per-token API costs matter.

Prerequisites

  • Docker or Kubernetes cluster
  • GPU infrastructure (recommended: NVIDIA A100, H100, or RTX 4090)
  • Sufficient RAM (minimum 32GB for 7B models, 128GB for 70B models)

Quick Start

Docker Deployment

# CPU-only (slower, for development); the ollama named volume persists pulled models
docker run -d -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama

# With NVIDIA GPU (recommended for production; requires the NVIDIA Container Toolkit on the host)
docker run -d -p 11434:11434 \
  --gpus all \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama
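
If the container started with GPU access, nvidia-smi should be visible inside it (the NVIDIA Container Toolkit mounts the driver utilities into the container):

# Confirm the container can see the GPU
docker exec ollama nvidia-smi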

Pull a Model

# Pull Llama 3.1 (70B parameters - best quality)
docker exec ollama ollama pull llama3.1:70b

# Or smaller models for lower resource requirements
docker exec ollama ollama pull llama3.1:8b # 8B parameters
docker exec ollama ollama pull mistral:7b # 7B parameters
docker exec ollama ollama pull mixtral:8x7b # Mixture of experts
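
Pulled models can be verified at any point:

# List downloaded models and their sizes
docker exec ollama ollama list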

Configure AxonFlow

export OLLAMA_ENDPOINT=http://localhost:11434
export OLLAMA_MODEL=llama3.1:70b

Or via YAML:

# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: ${OLLAMA_ENDPOINT:-http://localhost:11434}
      model: llama3.1:70b
      timeout: 120s # LLM inference can be slow
      priority: 5
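
Before starting AxonFlow, it is worth confirming that the configured endpoint is reachable and the model has actually been pulled; a quick check using the same environment variable:

# /api/tags lists all locally available models
curl -s ${OLLAMA_ENDPOINT:-http://localhost:11434}/api/tags | grep llama3.1:70b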

Kubernetes Deployment

Helm Chart

# values.yaml
replicaCount: 3

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 128Gi
  requests:
    nvidia.com/gpu: 1
    memory: 64Gi

persistence:
  enabled: true
  size: 100Gi

service:
  type: ClusterIP
  port: 11434

helm install ollama ollama/ollama -f values.yaml
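
With replicaCount: 3 and one GPU per pod, the cluster needs three schedulable GPUs. Once the chart is installed you can watch the rollout (the label selector below is an assumption and may differ between charts):

# Wait for all replicas to become Ready
kubectl get pods -w -l app.kubernetes.io/name=ollama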

Service Mesh Integration

For high availability with Istio:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ollama
spec:
  hosts:
    - ollama.internal
  http:
    - route:
        - destination:
            host: ollama
            port:
              number: 11434
      timeout: 120s
      retries:
        attempts: 3
        retryOn: 5xx,reset,connect-failure
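
A DestinationRule can complement the VirtualService by spreading long-running inference requests across replicas and ejecting pods that keep failing; a minimal sketch (the values are illustrative, not tuned):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ollama
spec:
  host: ollama
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s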

Model Selection

Model           Parameters       VRAM Required   Use Case
llama3.1:8b     8B               8GB             Fast responses, development
llama3.1:70b    70B              40GB            Production quality
mistral:7b      7B               8GB             Efficient, good quality
mixtral:8x7b    47B (8x7B MoE)   32GB            Best quality/speed tradeoff
codellama:34b   34B              24GB            Code generation

Air-Gapped Deployment

For environments without internet access:

1. Download Models on Connected Machine

# On a machine with internet access
ollama pull llama3.1:70b

# Archive the local model store (Ollama keeps model blobs and manifests under ~/.ollama/models;
# this archives every model in the store)
tar -cf llama3.1-70b.tar -C ~/.ollama models

2. Transfer to Air-Gapped Environment

# Copy via approved transfer mechanism
scp llama3.1-70b.tar airgapped-host:/models/

3. Import on Air-Gapped Machine

# Extract into the Ollama data directory on the air-gapped host, then restart Ollama
tar -xf /models/llama3.1-70b.tar -C ~/.ollama
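
After the import, the model should show up in the local registry:

# Verify the model is available offline
ollama list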

Performance Tuning

GPU Memory Optimization

# Reduce the context window (num_ctx) to lower memory usage, e.g. per request via the API options field
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:70b", "prompt": "Hello", "options": {"num_ctx": 4096}, "stream": false}'

# Use more aggressively quantized model tags for lower memory (check the Ollama library for available tags)
docker exec ollama ollama pull llama3.1:70b-instruct-q4_0 # 4-bit quantization
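
To make a reduced context window the default instead of a per-request option, you can bake it into a derived model with a Modelfile (the name llama3.1-70b-ctx4k is just an example):

# Modelfile
FROM llama3.1:70b
PARAMETER num_ctx 4096

# Build the derived model (with Docker, copy the Modelfile into the container first via docker cp)
ollama create llama3.1-70b-ctx4k -f Modelfile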

Concurrent Requests

# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: http://ollama:11434
      model: llama3.1:70b
      # Enable request queuing
      max_concurrent: 4
      queue_timeout: 30s
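
These settings only control queuing on the AxonFlow side. On the Ollama server, parallelism is governed by environment variables (OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS in recent versions); for example:

# Recreate the container with parallelism enabled (remove the old one first: docker rm -f ollama)
docker run -d -p 11434:11434 --gpus all \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama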

Health Monitoring

Ollama exposes health endpoints:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Test model inference
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Hello, world!",
  "stream": false
}'
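
The same /api/tags endpoint works as a liveness/readiness check when Ollama runs on Kubernetes; a pod-spec sketch (whether your Helm chart exposes these fields directly depends on the chart):

livenessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  periodSeconds: 10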

Prometheus Metrics

AxonFlow exposes Ollama metrics:

# prometheus.yml
scrape_configs:
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['orchestrator:9090']
    metrics_path: /metrics

Key metrics (an example alert rule is shown below):

  • llm_provider_requests_total{provider="ollama"}
  • llm_provider_latency_seconds{provider="ollama"}
  • llm_provider_errors_total{provider="ollama"}
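
These metrics can feed standard Prometheus alerting; for example, an error-rate rule built on the metrics above (the 5% threshold is illustrative):

# ollama-alerts.yaml
groups:
  - name: ollama
    rules:
      - alert: OllamaHighErrorRate
        expr: |
          sum(rate(llm_provider_errors_total{provider="ollama"}[5m]))
            /
          sum(rate(llm_provider_requests_total{provider="ollama"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of Ollama requests are failing"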

Troubleshooting

Out of Memory

  1. Use a smaller model or quantized version
  2. Reduce context size
  3. Add more GPU memory or use CPU offloading

Slow Responses

  1. Verify the GPU is being used: nvidia-smi
  2. Check the model is downloaded (ollama list) and loaded in memory (ollama ps)
  3. Consider a smaller or more heavily quantized model for faster inference

Connection Refused

  1. Verify Ollama is running: docker ps | grep ollama
  2. Check port binding: docker port ollama
  3. Verify firewall allows connection

Next Steps