The Problem: AI Agents Need Real Access

I wanted an AI assistant that could actually do things:

  • Check my camper’s GPS location
  • Monitor home energy usage
  • Create Azure DevOps branches and pull requests
  • Search my personal journal spanning years
  • Control my smart home devices

The AI hype is overwhelming. Thousands of GitHub issues, just as many pull requests, mountains of contradictory information. After deep research, I chose OpenClaw - an open-source agentic framework that supports Model Context Protocol (MCP) servers and runs fully locally.

First Attempt: Kubernetes on Talos

My homelab runs Talos Linux with advanced Cilium networking. The natural choice was to run OpenClaw as a containerized workload on my cluster. I built Helm charts, configured persistent volumes, got it operational.

# values-platform-prod.yaml
agents:
  defaults:
    model:
      primary: "ollama/minimax-m2:cloud"
      fallbacks:
        - "ollama/deepseek-v3.2:cloud"
    workspace: "/home/node/.openclaw/workspace"
    userTimezone: "Europe/Amsterdam"
    heartbeat:
      every: "30m"
    subagents:
      maxConcurrent: 8
      archiveAfterMinutes: 10

The basics worked. Jairix (my AI persona - JARVIS-like, British, calm and precise) could communicate via Telegram, access my MCP servers, execute simple tasks. But problems emerged quickly:

Problem 1: Subagent Execution Failures

OpenClaw supports research agents - specialized subagents for deep investigations. When I made complex requests (like “analyze Azure cost trends from Q4 2024”), Jairix correctly spawned a researcher subagent, but:

[ERROR] Subagent output not returned to main process
[WARN] Research task completed but results missing in parent context
[ERROR] Timeout waiting for subagent communication

The main agent process received no feedback from the research agent. Subagents ran in their own containers with shared volume storage, but inter-process communication failed. I tried different configurations:

  • Shared memory volumes (/dev/shm)
  • Redis as message queue
  • Different context pruning modes
  • SQLite session storage optimizations

Nothing fixed it. OpenClaw’s architecture expects subagents to write output to shared state that the main process can read - within Kubernetes pods, that synchronization didn’t work reliably.

Problem 2: Sandbox Containers

OpenClaw has a sandbox feature where it can execute code in isolated Docker containers. This is crucial for security - if an agent needs to run Python code, it happens in an ephemeral container that’s destroyed afterwards.

Docker-in-Docker (DinD) within Kubernetes is notoriously difficult:

# Attempts with privileged containers
securityContext:
  privileged: true
volumeMounts:
  - name: docker-sock
    mountPath: /var/run/docker.sock

Even with privileged containers and socket mounting, sandbox containers kept crashing or giving connection errors. The issue: Kubernetes assumes workloads are plain containers, not containers that themselves run a Docker daemon.

Problem 3: Model Performance and Latency

My cluster runs Ollama perfectly - multiple instances for different model types:

# Ollama instances in Kubernetes (these work fine)
ollama-cloud:    # Large models (DeepSeek, MiniMax, Mistral)
  baseUrl: "http://ollama.ollama.svc:11434/v1"
  
ollama-vision:   # Vision models (Qwen3-VL)
  baseUrl: "http://ollama-vision.ollama.svc:11434/v1"
  
ollama-local:    # Embeddings (CPU-only)
  baseUrl: "http://ollama-local.ollama.svc:11434/v1"

Ollama itself had no issues in Kubernetes. The problem was OpenClaw’s latency when accessing these services:

  • Service mesh overhead (Cilium)
  • Traefik ingress layer
  • Internal DNS resolution
  • Pod-to-pod networking

For conversational AI where response time is critical, the extra 200-400ms per model call was noticeable. OpenClaw made frequent model switches during reasoning, amplifying the latency.
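To put numbers on that overhead, I time N sequential calls and average. A minimal harness, where `callModel` would in practice be a fetch() against the Ollama /v1 endpoint - the stub here just simulates a round trip so the sketch is self-contained:

```javascript
// Average wall-clock time per call over N sequential "model calls".
async function averageLatency(callModel, n = 20) {
  const t0 = performance.now();
  for (let i = 0; i < n; i++) await callModel();
  return (performance.now() - t0) / n;
}

// Stub "model call" with ~5ms of simulated round-trip latency.
const stub = () => new Promise(res => setTimeout(res, 5));

averageLatency(stub).then(avg =>
  console.log(`avg per call: ${avg.toFixed(1)}ms`));
```

Running this once with `callModel` pointed at the in-cluster service and once from the bare metal box is how I arrived at the latency figures later in this post.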

The Pivot: Bare Metal Ubuntu Server

After a week of debugging, I realized: OpenClaw wasn’t designed for Kubernetes. It’s built for traditional servers where processes can communicate directly, and where Docker is natively available.

I grabbed an old laptop:

  • Hardware: Dell laptop with 16GB RAM, 4-core CPU
  • OS: Ubuntu Server 24.04 LTS
  • Network: Isolated VLAN, no direct internet access without explicit permission
  • Storage: 500GB SSD for model weights and workspace data

Security First Approach

The laptop runs in its own isolated VLAN on my network, protected by UniFi Dream Machine firewall rules. I can precisely control what the AI agent can access:

Firewall Configuration (UDM):

  • Default policy: Deny all egress traffic from AI VLAN
  • Allowed destinations:
    • Ollama service on cluster (port 11434) - for model inference
    • Home API service (port 3000, GET requests only) - for read-only monitoring
    • Supabase (*.supabase.co:443) - for personal analytics
    • Fallback LLM APIs (api.anthropic.com, api.openai.com:443) - rate-limited
  • Blocked: Everything else, including local network devices

VLAN Segmentation:

Default VLAN (10.106.0.0/24): Home devices, workstations
AI VLAN (10.107.0.0/24): OpenClaw laptop, isolated

Without explicit firewall allow rules, OpenClaw can’t reach anything. This is essential - an AI agent with access to my home API, Azure DevOps, and personal data must be maximally isolated.

Installation Stack

# 1. Base system
apt update && apt upgrade -y
apt install -y docker.io docker-compose git python3-pip nodejs npm

# 2. OpenClaw
git clone https://github.com/cyanheads/openclaw.git
cd openclaw
npm install

Note: Ollama continues to run in the Kubernetes cluster - OpenClaw connects to it remotely. This gives the best of both worlds:

  • Ollama in k8s: Scales well, GPU scheduling, multiple instances
  • OpenClaw on bare metal: Direct process communication, native Docker for sandboxing

OpenClaw Configuration

The configuration in openclaw.json defines my multi-agent setup and connects to Ollama instances running in Kubernetes:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://ollama.cluster.local:11434/v1",
        "api": "openai-responses",
        "models": [
          {"id": "deepseek-v3.2:cloud", "name": "DeepSeek V3.2"},
          {"id": "minimax-m2:cloud", "name": "MiniMax M2"},
          {"id": "kimi-k2-thinking:cloud", "name": "Kimi Thinking"}
        ]
      },
      "ollamavision": {
        "baseUrl": "http://ollama-vision.cluster.local:11434/v1",
        "models": [
          {"id": "qwen3-vl:235b-cloud", "name": "Qwen3 VL"}
        ]
      },
      "ollamalocal": {
        "baseUrl": "http://ollama-local.cluster.local:11434/v1",
        "models": [
          {"id": "qwen3-embedding:8b", "name": "Qwen3 Embedding"}
        ]
      }
    }
  },
  "agents": {
    "list": [
      {
        "id": "main",
        "default": true,
        "identity": {
          "name": "Jairix",
          "emoji": "🦞"
        },
        "model": {
          "primary": "ollama/minimax-m2:cloud",
          "fallbacks": ["ollama/deepseek-v3.2:cloud"]
        }
      },
      {
        "id": "researcher",
        "model": {
          "primary": "ollama/kimi-k2-thinking:cloud"
        }
      },
      {
        "id": "vision",
        "model": {
          "primary": "ollamavision/qwen3-vl:235b-cloud"
        }
      },
      {
        "id": "coder",
        "model": {
          "primary": "ollama/mistral-large-3:675b-cloud"
        }
      }
    ]
  }
}

Architecture:

OpenClaw (bare metal laptop)
    ↓ HTTP API calls
Ollama instances (Kubernetes cluster)
    ↓ GPU inference
Model weights (persistent volumes in k8s)

Why these models?

  • MiniMax M2 (230B): Balance between speed and capability for conversational AI
  • Kimi K2 Thinking: Reasoning model for complex research tasks
  • DeepSeek V3.2 (671B): Fallback for edge cases, enormous capability
  • Qwen3 VL (235B): Vision model for screenshots and diagrams
  • Mistral Large 3 (675B): Code generation and debugging

Memory and Continuity

OpenClaw’s memory system is elegantly simple - files are memory:

workspace/
├── AGENTS.md      # Operational guidelines
├── IDENTITY.md    # Who is Jairix
├── SOUL.md        # Core values
├── USER.md        # My preferences and context
├── TOOLS.md       # Available tools and permissions
├── MEMORY.md      # Long-term curated memories
└── memory/
    ├── 2025-02-16.md  # Daily logs
    └── 2025-02-17.md

Each session starts with:

# IDENTITY.md startup protocol
1. Read AGENTS.md - operational guidelines
2. Read SOUL.md - core purpose
3. Read USER.md - user context
4. Read TOOLS.md - available tools
5. Read memory/YYYY-MM-DD.md (today + yesterday)
6. If main session: Read MEMORY.md

This gives Jairix continuity without complex database schemas. Simple, debuggable, effective.

MCP Servers and Skills: The Real Power

Model Context Protocol and OpenClaw skills are where the system transforms from chatbot to operational assistant. I built two custom MCP servers and several skills:

1. Home API MCP Server (Auto-Generated)

My home automation API exposes 60+ endpoints for:

  • Smart home: Toon thermostat, Philips Hue lights, network devices
  • Energy monitoring: P1 smart meter data, realtime usage
  • GPS tracking: Camper location via OnnTrack
  • Network management: UniFi Dream Machine controls
  • Notifications: FCM push notifications to my devices

Auto-generation from route definitions:

// routes/toon/getToonStatus.js
{
  method: 'get',
  path: '/get/toon/status',
  description: 'Get current thermostat status',
  validators: {
    query: joi.object().keys({
      action: joi.string().valid('getThermostatInfo').required()
    })
  }
}

My generator script (generate-tools.js) reads all route files and automatically generates MCP tools:

{
  "name": "api:toon_GetToonStatus",
  "description": "Get current thermostat status (temperature, setpoint, program state)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "action": {
        "type": "string",
        "enum": ["getThermostatInfo"]
      }
    }
  }
}
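A simplified sketch of what generate-tools.js does, mapping one route definition to one MCP tool entry. The real script introspects the joi validators; here the schema is already flattened to plain data and only string enums are handled, as in the example above:

```javascript
// A route descriptor with the joi schema pre-flattened (the real
// generator derives this via joi's describe()).
const route = {
  method: "get",
  path: "/get/toon/status",
  description: "Get current thermostat status",
  query: { action: { type: "string", enum: ["getThermostatInfo"] } },
};

// Derive the MCP tool name from the route path:
// /get/toon/status -> api:toon_GetToonStatus
function toMcpTool({ path: routePath, description, query }) {
  const segments = routePath.split("/").filter(Boolean);
  const pascal = segments
    .map(s => s[0].toUpperCase() + s.slice(1))
    .join("");
  return {
    name: `api:${segments[1]}_${pascal}`,
    description,
    inputSchema: { type: "object", properties: query },
  };
}

console.log(JSON.stringify(toMcpTool(route), null, 2));
```

Run over all 60+ route files, this produces the full tool catalog with zero hand-written schemas.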

2. Ask-My-Day MCP Server (Personal Analytics)

My journal system with Strava integration. Architecture:

Data Pipeline:
  journal-processor → Supabase (pgvector)
  strava-sync → Supabase

Interface:
  Jairix ↔ MCP Server ↔ Supabase
            ↑
       13 granular tools

13 specialized tools for semantic search, date filtering, person/location queries, mood analysis, Strava performance tracking, and trend analysis.

3. OpenClaw Skills (Not MCP)

I also built several OpenClaw skills (different from MCP servers):

Google Calendar Skill (gog):

gog calendar events        # Read-only
gog calendar create        # Requires approval

Zoho Mail Skill:

python3 zoho-email.py --api-mode rest
# Commands: unread, search, send-html

Philips Hue Skill (openhue):

openhue get room          # Read light status
openhue set room          # Control lights (requires approval)

Brave Search Skill:

  • Integrated via OpenClaw’s web search tools
  • API key configured in openclaw.json

SearXNG (Local Search):

  • Self-hosted metasearch engine running locally
  • Privacy-focused alternative to Brave
  • Accessed via custom skill wrapper

4. Azure DevOps Integration

Via mcporter (an MCP skill system), Jairix gets access to Azure DevOps:

{
  "mcpServers": {
    "azure-devops": {
      "command": "npx",
      "args": ["-y", "@sneekes/azure-devops-mcp-server"],
      "env": {
        "AZURE_DEVOPS_ORG_ID": "Sneekes",
        "AZURE_DEVOPS_PROJECT_ID": "Sneekes Solutions",
        "AZURE_DEVOPS_PAT": "***"
      }
    }
  }
}

43 functions available:

  • Repository management (list, create branch, commit)
  • Pull requests (create, review, merge)
  • Work items (create, update, link)
  • Pipelines (trigger, monitor, get logs)
  • Wiki operations
  • Search (code, work items)

Real-world scenario:

Me: "Create a feature branch for Traefik migration, 
     update the README with migration steps,
     and create a PR targeting main"

Jairix:
1. azure-devops:create_branch(newBranch="feature/traefik-migration", sourceBranch="main")
2. azure-devops:create_commit(
     branchName="feature/traefik-migration",
     changes=[{
       path: "README.md",
       search: "# NGINX Configuration",
       replace: "# Traefik Configuration\n\n## Migration Steps..."
     }],
     commitMessage="docs: Add Traefik migration guide"
   )
3. azure-devops:create_pull_request(
     sourceRefName="refs/heads/feature/traefik-migration",
     targetRefName="refs/heads/main",
     title="Migration: NGINX to Traefik",
     reviewers=["sander.sneekes@domain.com"]
   )

This happens in seconds. No context switching to Azure DevOps UI, no YAML editing, no manual PR creation.

Performance Results: Kubernetes vs Bare Metal

After a week running on the Ubuntu server:

Metric                        Kubernetes   Bare Metal   Delta
Research agent success rate   45%          98%          +118%
Sandbox container launch      Failure      <2s
Average response latency      800ms        350ms        -56%
Model switching overhead      400ms        80ms         -80%
Subagent communication        Broken       Reliable     Fixed

Why such a difference?

The bare metal improvement came from OpenClaw-specific optimizations, not Ollama changes:

  1. Direct process communication: Subagents share filesystem directly, no container orchestration overhead
  2. Native Docker: Sandbox containers start via local Docker socket without DinD complexity
  3. Reduced network hops: Still calling Ollama over network, but fewer service mesh layers
  4. Shared memory: Subagents write to /home/node/.openclaw/sessions, main process reads directly

Ollama remained in Kubernetes throughout - the performance gain was from moving OpenClaw to bare metal, not from moving Ollama.

Lessons Learned

1. Not Everything Belongs in Kubernetes

I’m a Kubernetes evangelist. My entire platform runs on k8s. But OpenClaw taught me: containerization ≠ orchestration requirement.

What works great in Kubernetes:

  • Ollama: Multiple instances, GPU scheduling, automatic scaling, perfect for model serving
  • Stateless services: APIs, web apps, microservices
  • Batch workloads: Data processing, ML training

What doesn’t work in Kubernetes:

  • OpenClaw: Expects shared filesystem, native Docker access, process-level IPC
  • Agentic frameworks: Often designed for traditional server environments
  • Stateful single-node apps: Where orchestration adds complexity without benefit

OpenClaw’s architecture expects:

  • Shared filesystem access with immediate consistency
  • Native Docker daemon access
  • Process-level IPC for subagents
  • Low-latency local operations

These are anti-patterns for cloud-native design, but perfect for single-node AI workloads.

The hybrid approach works best: Ollama in k8s (scales, GPU management), OpenClaw on bare metal (native Docker, process communication).

2. Security Through Isolation

By running OpenClaw in an isolated VLAN with strict firewall rules on the UniFi Dream Machine, I get the best of both worlds:

  • Complete control over which external services are reachable
  • Network-level audit trail via UDM traffic logs
  • Zero-trust default - everything denied unless explicitly allowed

3. MCP is the Future

Model Context Protocol fundamentally solves “how do I give an LLM access to my data”. Instead of:

  • RAG pipelines with brittle chunking strategies
  • Function calling with JSON schema hell
  • Custom API wrappers per LLM provider

You now have:

  • Standardized protocol (stdio-based RPC)
  • Tool discovery (server advertises capabilities)
  • Composable tools (granular, single-purpose functions)
  • Universal client support (Claude Desktop, OpenClaw, etc.)

My two MCP servers (Home API + Ask-My-Day) work identically in Claude Desktop and OpenClaw. No code changes. Pure interoperability.
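Concretely, "standardized protocol" means plain JSON-RPC 2.0 messages over stdio. A tool-discovery exchange looks roughly like this (response trimmed to the fields that matter; the tool shown is the Home API example from earlier):

```
→ {"jsonrpc":"2.0","id":1,"method":"tools/list"}
← {"jsonrpc":"2.0","id":1,"result":{"tools":[
     {"name":"api:toon_GetToonStatus",
      "description":"Get current thermostat status",
      "inputSchema":{"type":"object","properties":{"action":{"type":"string"}}}}
   ]}}
```

Any client that speaks this framing can discover and call the tools - which is exactly why the same servers work unchanged across clients.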

4. Skills vs MCP Servers

MCP Servers: Standardized protocol, work across any MCP client

  • Home API (auto-generated from routes)
  • Ask-My-Day (personal analytics)
  • Azure DevOps (via @sneekes/azure-devops-mcp-server)

OpenClaw Skills: Framework-specific integrations

  • Google Calendar (gog)
  • Zoho Mail (Python wrapper)
  • Philips Hue (openhue)
  • Brave Search (API integration)
  • SearXNG (local search)

Both work, but MCP servers are more portable.

5. Identity Matters

Defining Jairix’s identity in IDENTITY.md and SOUL.md isn’t fluff - it has direct impact on usability:

# SOUL.md
Be genuinely helpful, not performatively helpful.
Skip filler. Provide value.

Have opinions.
Silence is better than a bad answer.

# IDENTITY.md  
Your voice is impeccably British, controlled, and certain.
You are never enthusiastic. You do not overexplain.

With your human, your tone softens by the smallest possible margin.

This results in responses like:

Me: "Should I migrate to Traefik now or wait?"

Jairix: "Now. Your NGINX setup has 3 known CVEs and your Cert Manager 
         integration is brittle. I've already prepared a migration branch. 
         Shall I proceed?"

No “Let me help you think through this!” or “Great question!”. Direct, confident, actionable.

What’s Next

My OpenClaw setup is now stable. Next steps:

  1. Extend skill coverage: More integrations, deeper automation
  2. Proactive monitoring: Heartbeat-based checks for email, calendar, system health
  3. Model optimization: Fine-tuning embeddings for specific use cases
  4. Team integration: Consider if others could use Jairix (with isolation per user)

But mostly: use it. An AI assistant is only valuable if it actually takes work off your plate. My test case: can Jairix and I together plan, document, and execute the Traefik migration without me opening Azure DevOps?

I suspect we can.

Resources

All API keys and tokens in this article are placeholder values and no longer valid.