Ollama Setup
Ollama enables self-hosted LLM deployment, giving you air-gapped operation, complete data sovereignty, and zero per-token costs.
Prerequisites
- Docker or Kubernetes cluster
- GPU infrastructure (recommended: NVIDIA A100, H100, or RTX 4090)
- Sufficient RAM (minimum 32GB for 7B models, 128GB for 70B models)
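If you plan to run on GPU, it is worth confirming up front that the driver is installed and that Docker can hand the GPU to a container. A minimal check, assuming the NVIDIA driver and the NVIDIA Container Toolkit are already installed (adjust the CUDA image tag to one available for your platform):

```bash
# GPU and driver are visible on the host
nvidia-smi

# Docker can pass the GPU through to containers
# (requires the NVIDIA Container Toolkit)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```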
Quick Start
Docker Deployment
```bash
# CPU-only (slower, for development)
docker run -d -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama

# With NVIDIA GPU (recommended for production)
docker run -d -p 11434:11434 \
  --gpus all \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama
```

The `ollama` named volume persists downloaded models across container restarts.
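Once the container is running, a quick sanity check that the API is reachable and the GPU was picked up:

```bash
# Returns the server version as JSON if the API is up
curl http://localhost:11434/api/version

# Startup logs indicate whether a GPU was detected
docker logs ollama
```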
Pull a Model
```bash
# Pull Llama 3.1 (70B parameters - best quality)
docker exec ollama ollama pull llama3.1:70b

# Or smaller models for lower resource requirements
docker exec ollama ollama pull llama3.1:8b    # 8B parameters
docker exec ollama ollama pull mistral:7b     # 7B parameters
docker exec ollama ollama pull mixtral:8x7b   # Mixture of experts
```
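After pulling, you can list what is installed and inspect a model's details:

```bash
# List the models available to this Ollama instance
docker exec ollama ollama list

# Show details (family, parameter count, quantization) for one model
docker exec ollama ollama show llama3.1:8b
```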
Configure AxonFlow
```bash
export OLLAMA_ENDPOINT=http://localhost:11434
export OLLAMA_MODEL=llama3.1:70b
```
Or via YAML:
```yaml
# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: ${OLLAMA_ENDPOINT:-http://localhost:11434}
      model: llama3.1:70b
      timeout: 120s  # LLM inference can be slow
    priority: 5
```
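Before starting AxonFlow, it can save time to confirm that the configured endpoint and model actually answer. A small smoke test reusing the environment variables above:

```bash
# One-shot, non-streaming generation against the configured model
curl -s "${OLLAMA_ENDPOINT:-http://localhost:11434}/api/generate" -d "{
  \"model\": \"${OLLAMA_MODEL:-llama3.1:70b}\",
  \"prompt\": \"Reply with the word: ready\",
  \"stream\": false
}"
```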
Kubernetes Deployment
Helm Chart
```yaml
# values.yaml
replicaCount: 3

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 128Gi
  requests:
    nvidia.com/gpu: 1
    memory: 64Gi

persistence:
  enabled: true
  size: 100Gi

service:
  type: ClusterIP
  port: 11434
```

```bash
helm install ollama ollama/ollama -f values.yaml
```
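After the release installs, verify that the pods schedule (each replica needs a GPU) and that the service answers. The label and service name below assume the `ollama` release name used above and may differ slightly by chart:

```bash
# Pods should reach Running once images are pulled and GPUs are allocated
kubectl get pods -l app.kubernetes.io/instance=ollama

# Port-forward the service and hit the API from your workstation
kubectl port-forward svc/ollama 11434:11434 &
curl http://localhost:11434/api/version
```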
Service Mesh Integration
For high availability with Istio:
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ollama
spec:
  hosts:
    - ollama.internal
  http:
    - route:
        - destination:
            host: ollama
            port:
              number: 11434
      timeout: 120s
      retries:
        attempts: 3
        retryOn: 5xx,reset,connect-failure
```
Model Selection
| Model | Parameters | VRAM Required | Use Case |
|---|---|---|---|
| `llama3.1:8b` | 8B | 8GB | Fast responses, development |
| `llama3.1:70b` | 70B | 40GB | Production quality |
| `mistral:7b` | 7B | 8GB | Efficient, good quality |
| `mixtral:8x7b` | 47B (8x7B MoE) | 32GB | Best quality/speed tradeoff |
| `codellama:34b` | 34B | 24GB | Code generation |
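The VRAM figures are approximate for the default quantizations; real usage varies with quantization and context length. To see what a loaded model actually consumes:

```bash
# Loaded models, their size, and whether they run on GPU or CPU
docker exec ollama ollama ps

# GPU memory currently in use
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```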
Air-Gapped Deployment
For environments without internet access:
1. Download Models on Connected Machine
```bash
# On a machine with internet access
ollama pull llama3.1:70b

# Package the local model store (blobs and manifests).
# Default location is ~/.ollama/models; adjust if OLLAMA_MODELS is set.
tar -C ~/.ollama -czf ollama-models.tar.gz models
```
2. Transfer to Air-Gapped Environment
```bash
# Copy via approved transfer mechanism
scp ollama-models.tar.gz airgapped-host:/models/
```
3. Import on Air-Gapped Machine
```bash
# Unpack into the Ollama model directory on the air-gapped host
mkdir -p ~/.ollama
tar -C ~/.ollama -xzf /models/ollama-models.tar.gz

# Confirm the model is visible (restart Ollama if it does not appear)
ollama list
```
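If Ollama runs in the Docker container from the Quick Start on the air-gapped host, unpack the archive into the container's model store instead (paths assume the official image and the `ollama` container name used earlier):

```bash
# Copy the archive into the running container and unpack it there
docker cp /models/ollama-models.tar.gz ollama:/root/.ollama/
docker exec ollama tar -C /root/.ollama -xzf /root/.ollama/ollama-models.tar.gz
docker exec ollama ollama list
```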
Performance Tuning
GPU Memory Optimization
```bash
# Limit the context window to reduce memory usage
# (num_ctx is a per-request option on the generate/chat API)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Hello, world!",
  "options": { "num_ctx": 4096 },
  "stream": false
}'

# Use a more aggressively quantized variant for lower memory
# (the default 70b tag is already 4-bit; check ollama.com/library/llama3.1/tags
# for the quantizations that are actually published)
docker exec ollama ollama pull llama3.1:70b-instruct-q2_K
```
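To make a smaller context window the default rather than a per-request option, it can be baked into a derived model with a Modelfile. A sketch (the derived name `llama3.1-70b-4k` is just an example):

```bash
# Build a derived model that pins num_ctx to 4096
cat > Modelfile <<'EOF'
FROM llama3.1:70b
PARAMETER num_ctx 4096
EOF

docker cp Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create llama3.1-70b-4k -f /tmp/Modelfile
```

Point `model:` in `axonflow.yaml` at the derived name to use it by default.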
Concurrent Requests
```yaml
# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: http://ollama:11434
      model: llama3.1:70b
      # Enable request queuing
      max_concurrent: 4
      queue_timeout: 30s
```
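The client-side limit above should be matched with Ollama's own server-side parallelism, which is controlled by environment variables on the server. A sketch for the Docker deployment:

```bash
# Allow 4 parallel requests per loaded model and cap the pending queue.
# Each parallel slot holds its own context, so this increases memory use.
docker run -d -p 11434:11434 \
  --gpus all \
  -v ollama:/root/.ollama \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_QUEUE=128 \
  --name ollama \
  ollama/ollama
```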
Health Monitoring
Ollama exposes health endpoints:
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Test model inference
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Hello, world!",
  "stream": false
}'
```
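For automation (a container health check, a probe wrapper, or a cron job), the same endpoint can be wrapped in a small script; the endpoint variable and timeout are illustrative:

```bash
#!/usr/bin/env bash
# Exits non-zero if the Ollama API does not answer within 5 seconds
set -euo pipefail

ENDPOINT="${OLLAMA_ENDPOINT:-http://localhost:11434}"

if curl -sf --max-time 5 "${ENDPOINT}/api/tags" > /dev/null; then
  echo "ollama healthy at ${ENDPOINT}"
else
  echo "ollama NOT reachable at ${ENDPOINT}" >&2
  exit 1
fi
```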
Prometheus Metrics
AxonFlow exposes Ollama metrics:
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['orchestrator:9090']
    metrics_path: /metrics
```
Key metrics:
- `llm_provider_requests_total{provider="ollama"}`
- `llm_provider_latency_seconds{provider="ollama"}`
- `llm_provider_errors_total{provider="ollama"}`
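These metrics can be queried through the Prometheus HTTP API as a starting point for dashboards or alerts; the `prometheus:9090` address and the histogram assumption for the latency metric are assumptions to adapt:

```bash
# Error rate for the Ollama provider over the last 5 minutes
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(llm_provider_errors_total{provider="ollama"}[5m])'

# 95th percentile latency (assumes the latency metric is a histogram)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(llm_provider_latency_seconds_bucket{provider="ollama"}[5m]))'
```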
Troubleshooting
Out of Memory
- Use a smaller model or quantized version
- Reduce context size
- Add more GPU memory or use CPU offloading
Slow Responses
- Verify GPU is being used: `nvidia-smi`
- Check the model is loaded: `ollama list`
- Consider a smaller model for faster inference
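To confirm where the model is actually running (Ollama falls back to CPU when it cannot use the GPU):

```bash
# The PROCESSOR column shows how the loaded model is split between GPU and CPU
docker exec ollama ollama ps

# Server logs report GPU detection and layer offloading at model load time
docker logs ollama 2>&1 | grep -iE "gpu|offload"
```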
Connection Refused
- Verify Ollama is running: `docker ps | grep ollama`
- Check port binding: `docker port ollama`
- Verify the firewall allows connections to port 11434
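If the container is up but connections are still refused, these checks (using the container name from the Quick Start) usually narrow it down:

```bash
# Is the API answering locally on the host?
curl -v http://localhost:11434/api/version

# Is the container still running and publishing the port?
docker ps --filter name=ollama
docker logs --tail 50 ollama
```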
Next Steps
- LLM Providers Overview - All supported providers
- AWS Bedrock Setup - Cloud alternative
- Custom Provider SDK - Build custom providers