Hermes Agent: Parallel Delegation with Separate Models

One of the features I didn’t expect to work this well out of the box: Hermes Agent’s delegation system. Ask it to do something complex, tell it to use delegation, and it splits the task into parallel subagents — each running independently, using a different model than the orchestrator.

Here’s what that looks like in practice, and how to configure it.

What Delegation Actually Does

Hermes Agent has a built-in concept of subagents. When the orchestrator decides a task is complex enough — or when you explicitly ask for delegation — it spawns child agents to handle subtasks in parallel. Each subagent gets its own context, its own tool access, and runs to completion independently. The orchestrator then synthesises the results.

This isn’t just task decomposition. It’s actual parallel execution. The spawn tree viewer shows the timeline clearly: multiple agents running simultaneously, each consuming tokens and making tool calls on their own track.

In practice, the flow looks like this:

User: "Research a 30-day camper trip through Spain. Use delegation."

Orchestrator (kimi-k2.6:cloud):
  → Subagent 1: Route research — north to south, 30 days, campsites      [1m 59s]
  → Subagent 2: Camper providers — Indie Campers, PaulCamper, Goboony    [3m 53s]
  → Subagent 3: Attractions + public transport combinations               [parallel]
  → Synthesis: combine outputs into unified report

Three parallel research streams, each handled by a dedicated agent. Total wall time is dominated by the slowest subagent, not the sum of all three.

Two Models, One Setup

The more interesting part: the orchestrator and subagents run on different models.

Orchestrator: kimi-k2.6:cloud — the primary model, handles planning, decomposition, and synthesis. Fast, good at structured reasoning, runs via ollama.ictq.xyz.

Subagents: glm-5.1:cloud — a separate model dedicated to execution tasks, runs via ollama-agents.ictq.xyz. Handles the actual research, tool calls, and data gathering.

This split is deliberate. Orchestration is a different cognitive task from execution. The orchestrator needs to reason about task structure, dependencies, and how to combine outputs. Subagents need to be good at using tools, following instructions, and producing structured results. These don’t have to be the same model.

Both are Ollama Cloud-backed models running through self-hosted Ollama instances. Direct cloud access works too, but routing through local instances keeps things consistent with the rest of the homelab stack.

Configuration

The delegation block in config.yaml:

delegation:
  model: 'glm-5.1:cloud'
  provider: 'ollama-agents.ictq.xyz'
  base_url: 'https://ollama-agents.ictq.xyz/v1'
  api_key: 'ollama'
  api_mode: ''
  inherit_mcp_toolsets: true
  max_iterations: 50
  child_timeout_seconds: 600
  reasoning_effort: ''
  max_concurrent_children: 3
  max_spawn_depth: 1
  orchestrator_enabled: true
  subagent_auto_approve: true

Key settings worth explaining:

inherit_mcp_toolsets: true — subagents inherit the parent’s MCP connections. In this case that means Azure DevOps and n8n are available to every subagent without additional configuration. For a research task that’s probably fine; for production workflows you might want tighter scoping.

max_concurrent_children: 3 — caps parallel agents at three. Prevents runaway token consumption on tasks that could spawn dozens of subagents. Tune this based on your model provider’s rate limits and your appetite for parallel spend.

max_spawn_depth: 1 — subagents cannot themselves spawn further subagents. Keeps the hierarchy flat and predictable. Depth 2 is possible but adds significant complexity to debugging when something goes wrong.

child_timeout_seconds: 600 — ten minutes per subagent. Long-running research tasks with many tool calls need this headroom. The camper research example ran one subagent for 3m 53s, well within the limit, but web research tasks can easily run longer.

subagent_auto_approve: true — subagents don’t wait for human approval on tool calls. For research tasks this is fine; for anything that writes, modifies, or deploys you’d want this off.

The Spawn Tree

Hermes has a built-in spawn tree viewer (/agents in the terminal interface) that shows delegation in real time:

Spawn tree   d1 · 2 agents · 58 tools · 1m 59s · 230k tok · ⚡1

Timeline · 5m 53s
1  ████████████████████████░░░░░░░░░░░░░░░░░░░░░░░░   1m 59s
2  ░░░░░░░░░░░░░░░░░░░░░░░░████████████████████████   3m 53s

1 ✓ Route research — north to south, 30 days...     ·25t
2 ● Camper providers — Indie Campers, PaulCamper...  ·33t  Mcp Azure D...

The timeline makes the parallel execution visible. Agent 1 completed in under two minutes. Agent 2 ran for nearly four minutes and was still active in this view — it had made 33 tool calls and was using the Azure DevOps MCP connection.

230k tokens across the session, 58 total tool calls, and glm-5.1:cloud running twice in parallel via the ollama-agents.ictq.xyz instance.

Why This Matters for Self-Hosted Setups

The combination that makes this interesting: you can route different models to different roles without depending on a third-party orchestration service.

Most orchestration setups with multiple models require either a managed platform (LangSmith, LangFuse, etc.) or significant custom code. Here it’s four lines of config:

delegation:
  model: 'glm-5.1:cloud'
  provider: 'ollama-agents.ictq.xyz'
  base_url: 'https://ollama-agents.ictq.xyz/v1'
  api_key: 'ollama'

The orchestrator handles planning with the primary model. Execution goes to a dedicated model on a dedicated Ollama instance. Both instances are self-hosted, both back onto Ollama Cloud for the actual inference. The routing stays entirely within infrastructure you control.

For a homelab context: you can size your Ollama instances differently. A smaller, faster instance for the orchestrator where latency matters. A larger or more capable instance for subagents where throughput and quality matter more. Or route subagents to a completely different provider for tasks that benefit from specialisation — web research to one model, code generation to another.

Practical Notes

A few things worth knowing before relying on this in production workflows:

Token consumption scales linearly with subagents. Three parallel agents means three times the context overhead. The 230k token figure for a camper research task is significant. Monitor this if cost matters.

MCP inheritance can be a footgun. inherit_mcp_toolsets: true gives every subagent access to every connected MCP server. For research tasks that’s convenient. For subagents that might write to Azure DevOps or trigger n8n workflows, think carefully about scope.

Config changes require a pod restart. Hermes does not hot-reload config.yaml. After changing the delegation config, recycle the pod:

kubectl rollout restart deployment/hermes-agent -n hermes-agent

Depth 1 is the right default. It’s tempting to enable depth 2 for complex tasks, but the debugging surface grows quickly. Start flat, increase only if there’s a clear reason.

What’s Next

Delegation with MCP access opens up some genuinely useful workflows: parallel Azure DevOps queries, n8n workflow inspection across multiple workspaces, research tasks that combine web search with internal tooling. The infrastructure is in place.

Next step is integrating Hermes Workspace for a richer frontend — the terminal interface works but the conductor view in the web UI is considerably more useful for monitoring complex delegation trees in real time.

What Delegation Actually Does¶

Two Models, One Setup¶

Configuration¶

The Spawn Tree¶

Why This Matters for Self-Hosted Setups¶

Practical Notes¶

What’s Next¶