Ollama Setup

Ollama lets you run LLMs on your own infrastructure, making it a good fit for air-gapped environments, strict data-sovereignty requirements, and workloads where per-token API costs matter.

Prerequisites

  • Docker or Kubernetes cluster
  • GPU infrastructure (recommended: NVIDIA A100, H100, or RTX 4090)
  • Sufficient RAM (minimum 32GB for 7B models, 128GB for 70B models)

Quick Start

Docker Deployment

# CPU-only (slower, for development); the ollama named volume persists pulled models
docker run -d -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama

# With NVIDIA GPU (recommended for production; requires the NVIDIA Container Toolkit on the host)
docker run -d -p 11434:11434 \
  --gpus all \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama
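
If the container started with GPU access, nvidia-smi should be visible inside it (the NVIDIA Container Toolkit mounts the driver utilities into the container):

# Confirm the container can see the GPU
docker exec ollama nvidia-smi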

Pull a Model

# Pull Llama 3.1 (70B parameters - best quality)
docker exec ollama ollama pull llama3.1:70b

# Or smaller models for lower resource requirements
docker exec ollama ollama pull llama3.1:8b # 8B parameters
docker exec ollama ollama pull mistral:7b # 7B parameters
docker exec ollama ollama pull mixtral:8x7b # Mixture of experts
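
Pulled models can be verified at any point:

# List downloaded models and their sizes
docker exec ollama ollama list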

Configure AxonFlow

export OLLAMA_ENDPOINT=http://localhost:11434
export OLLAMA_MODEL=llama3.1:70b

Or via YAML:

# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: ${OLLAMA_ENDPOINT:-http://localhost:11434}
      model: llama3.1:70b
      timeout: 120s # LLM inference can be slow
      priority: 5
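
Before starting AxonFlow, it is worth confirming that the configured endpoint is reachable and the model has actually been pulled; a quick check using the same environment variable:

# /api/tags lists all locally available models
curl -s ${OLLAMA_ENDPOINT:-http://localhost:11434}/api/tags | grep llama3.1:70b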

Kubernetes Deployment

Helm Chart

# values.yaml
replicaCount: 3

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 128Gi
  requests:
    nvidia.com/gpu: 1
    memory: 64Gi

persistence:
  enabled: true
  size: 100Gi

service:
  type: ClusterIP
  port: 11434

helm install ollama ollama/ollama -f values.yaml
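
With replicaCount: 3 and one GPU per pod, the cluster needs three schedulable GPUs. Once the chart is installed you can watch the rollout (the label selector below is an assumption and may differ between charts):

# Wait for all replicas to become Ready
kubectl get pods -w -l app.kubernetes.io/name=ollama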

Service Mesh Integration

For high availability with Istio:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ollama
spec:
  hosts:
    - ollama.internal
  http:
    - route:
        - destination:
            host: ollama
            port:
              number: 11434
      timeout: 120s
      retries:
        attempts: 3
        retryOn: 5xx,reset,connect-failure
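
A DestinationRule can complement the VirtualService by spreading long-running inference requests across replicas and ejecting pods that keep failing; a minimal sketch (the values are illustrative, not tuned):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: ollama
spec:
  host: ollama
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 30s
      baseEjectionTime: 60s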

Model Selection

Model           Parameters       VRAM Required   Use Case
llama3.1:8b     8B               8GB             Fast responses, development
llama3.1:70b    70B              40GB            Production quality
mistral:7b      7B               8GB             Efficient, good quality
mixtral:8x7b    47B (8x7B MoE)   32GB            Best quality/speed tradeoff
codellama:34b   34B              24GB            Code generation

Air-Gapped Deployment

For environments without internet access:

1. Download Models on Connected Machine

# On a machine with internet access
ollama pull llama3.1:70b

# Archive the local model store (Ollama keeps model blobs and manifests under ~/.ollama/models;
# this archives every model in the store)
tar -cf llama3.1-70b.tar -C ~/.ollama models

2. Transfer to Air-Gapped Environment

# Copy via approved transfer mechanism
scp llama3.1-70b.tar airgapped-host:/models/

3. Import on Air-Gapped Machine

# Extract into the Ollama data directory on the air-gapped host, then restart Ollama
tar -xf /models/llama3.1-70b.tar -C ~/.ollama
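
After the import, the model should show up in the local registry:

# Verify the model is available offline
ollama list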

Performance Tuning

GPU Memory Optimization

# Reduce the context window (num_ctx) to lower memory usage, e.g. per request via the API options field
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:70b", "prompt": "Hello", "options": {"num_ctx": 4096}, "stream": false}'

# Use more aggressively quantized model tags for lower memory (check the Ollama library for available tags)
docker exec ollama ollama pull llama3.1:70b-instruct-q4_0 # 4-bit quantization
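
To make a reduced context window the default instead of a per-request option, you can bake it into a derived model with a Modelfile (the name llama3.1-70b-ctx4k is just an example):

# Modelfile
FROM llama3.1:70b
PARAMETER num_ctx 4096

# Build the derived model (with Docker, copy the Modelfile into the container first via docker cp)
ollama create llama3.1-70b-ctx4k -f Modelfile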

Concurrent Requests

# axonflow.yaml
llm_providers:
  ollama:
    enabled: true
    config:
      endpoint: http://ollama:11434
      model: llama3.1:70b
      # Enable request queuing
      max_concurrent: 4
      queue_timeout: 30s
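
These settings only control queuing on the AxonFlow side. On the Ollama server, parallelism is governed by environment variables (OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS in recent versions); for example:

# Recreate the container with parallelism enabled (remove the old one first: docker rm -f ollama)
docker run -d -p 11434:11434 --gpus all \
  -e OLLAMA_NUM_PARALLEL=4 \
  -e OLLAMA_MAX_LOADED_MODELS=1 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama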

Health Monitoring

Ollama exposes health endpoints:

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Test model inference
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b",
  "prompt": "Hello, world!",
  "stream": false
}'
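
The same /api/tags endpoint works as a liveness/readiness check when Ollama runs on Kubernetes; a pod-spec sketch (whether your Helm chart exposes these fields directly depends on the chart):

livenessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  initialDelaySeconds: 30
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /api/tags
    port: 11434
  periodSeconds: 10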

Prometheus Metrics

AxonFlow exposes Ollama metrics:

# prometheus.yml
scrape_configs:
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['orchestrator:9090']
    metrics_path: /metrics

Key metrics (an example alert rule is shown below):

  • llm_provider_requests_total{provider="ollama"}
  • llm_provider_latency_seconds{provider="ollama"}
  • llm_provider_errors_total{provider="ollama"}
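
These metrics can feed standard Prometheus alerting; for example, an error-rate rule built on the metrics above (the 5% threshold is illustrative):

# ollama-alerts.yaml
groups:
  - name: ollama
    rules:
      - alert: OllamaHighErrorRate
        expr: |
          sum(rate(llm_provider_errors_total{provider="ollama"}[5m]))
            /
          sum(rate(llm_provider_requests_total{provider="ollama"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of Ollama requests are failing"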

Troubleshooting

Out of Memory

  1. Use a smaller model or quantized version
  2. Reduce context size
  3. Add more GPU memory or use CPU offloading

Slow Responses

  1. Verify the GPU is being used: nvidia-smi
  2. Check the model is downloaded (ollama list) and loaded in memory (ollama ps)
  3. Consider a smaller or more heavily quantized model for faster inference

Connection Refused

  1. Verify Ollama is running: docker ps | grep ollama
  2. Check port binding: docker port ollama
  3. Verify firewall allows connection

Next Steps