Chenyi's Blog

Hermes Agent: When Your Agent Has Too Many Skills

Chenyi Zhang — Sat, 06 Jun 2026 00:00:00 GMT

The Problem Nobody Talks About

When any agent has too many skills — and by "too many" I mean past some fuzzy threshold that depends on skill complexity and overlap — the agent will eventually go nuts.

Here's a concrete example. I was running Hermes and had both a Gmail skill and a Google Workspace skill. They overlapped. At some point, the Gmail skill's API went out of date. Every time the agent called it:

"Sorry, API failed. Let me directly fetch the web... oh, I don't have access. Let me rethink... actually, wait, you have another skill that might work..."

Burning tokens. Spinning in circles. Not working.

The obvious fix — manually review and clean up the skills — doesn't scale. Skills aren't static. They're more like repos: they need to be maintained. APIs change, tools break, better patterns emerge.

I needed AI to maintain the skills, not me.

The Architecture

Here's what I built (Codex and Claude Code wrote 100% of it, I just described the flows 😅):

[Diagram: the full flow]

[Diagram: the agent delegation model that makes it work]

1. Skill Manager with RAG-based Deduplication

Before installing any new skill, run it through a skill manager that checks for redundancy.

Match by keyword and semantic embedding
Also use tags: "google", "productivity", "stock trading" — not just embedding similarity
At ~5k skills, this runs fast enough to be practical

If a new skill is too similar to an existing one, reject it or merge the concepts.
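To make the redundancy check concrete, here's a minimal sketch. It is not the Hermes code: word overlap (Jaccard) stands in for real embedding similarity, and the `Skill` fields, the `find_duplicates` name, the 50/50 weighting, and the 0.5 threshold are all assumptions picked for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str
    tags: set[str] = field(default_factory=set)


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(new: Skill, old: Skill) -> float:
    # Stand-in for embedding similarity: word overlap of the descriptions,
    # averaged with tag overlap so tags count as their own signal.
    words_new = set(new.description.lower().split())
    words_old = set(old.description.lower().split())
    return 0.5 * jaccard(words_new, words_old) + 0.5 * jaccard(new.tags, old.tags)


def find_duplicates(new: Skill, installed: list[Skill], threshold: float = 0.5) -> list[str]:
    """Return names of installed skills that look redundant with `new`."""
    return [s.name for s in installed if similarity(new, s) >= threshold]


installed = [
    Skill("google-workspace", "read and send mail via google workspace", {"google", "productivity"}),
    Skill("stock-quotes", "fetch stock prices for a ticker", {"stock trading"}),
]
candidate = Skill("gmail", "read and send mail via gmail", {"google", "productivity"})
duplicates = find_duplicates(candidate, installed)  # ["google-workspace"]
```

A non-empty `duplicates` list is the signal to reject the install or merge the two skills. In a real system the description score would come from an embedding model, with the tag score kept as a separate, cheap check.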

2. Telemetry System

Every skill call gets logged:

Success or failure
Chain-of-thought trace
Token cost

Stored in local SQL + blob storage. This is the data layer that makes everything else possible.
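Here's a rough sketch of what the SQL side could look like, using Python's built-in `sqlite3`. The table layout, the column names, and the `trace_path` pointer into blob storage are my assumptions for illustration, not the actual Hermes schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.execute(
    """CREATE TABLE skill_calls (
           id INTEGER PRIMARY KEY,
           skill TEXT NOT NULL,
           success INTEGER NOT NULL,   -- 1 = success, 0 = failure
           tokens INTEGER NOT NULL,    -- token cost of the call
           trace_path TEXT             -- where the full trace lives in blob storage
       )"""
)


def log_call(skill: str, success: bool, tokens: int, trace_path: str) -> None:
    """Record one skill call."""
    conn.execute(
        "INSERT INTO skill_calls (skill, success, tokens, trace_path) VALUES (?, ?, ?, ?)",
        (skill, int(success), tokens, trace_path),
    )
    conn.commit()


def failure_rate(skill: str) -> float:
    """Fraction of logged calls that failed; 0.0 if the skill was never called."""
    row = conn.execute(
        "SELECT AVG(1 - success) FROM skill_calls WHERE skill = ?", (skill,)
    ).fetchone()
    return row[0] if row[0] is not None else 0.0


log_call("gmail", False, 1800, "traces/0001.json")
log_call("gmail", False, 2100, "traces/0002.json")
log_call("gmail", True, 400, "traces/0003.json")
```

Keeping the bulky trace outside the table and storing only a pointer keeps the SQL rows small, so aggregate queries like `failure_rate` stay cheap.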

3. Installation Filter

On every new extension/plugin/skill install, the skill manager runs first. The filter compares the new skill against existing ones and reduces overlap before it lands in the system.

4. Weekly Cron Audit

A cron job (I haven't tuned the trigger yet — tell me a better one) does a delta review of the telemetry logs:

Find skills with high failure rates or bloated chain-of-thought (CoT) traces
Decide: modify the skill to be more efficient, or delete it
If deleting, use web search to find or create a replacement

Critical: make sure your eval environment is stable before running this. You don't want the audit job to delete a working skill because it was measured during a bad network day.
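A sketch of the audit pass, under stated assumptions: telemetry arrives as `(skill, success, tokens)` tuples, and the thresholds (`max_failure_rate`, `max_avg_tokens`, `min_calls`) are made-up numbers you would tune. The `min_calls` guard is one cheap way to avoid judging a skill on thin evidence. It does not replace a stable eval environment.

```python
from collections import defaultdict


def audit(calls, max_failure_rate=0.5, max_avg_tokens=1500, min_calls=5):
    """Group (skill, success, tokens) records and flag skills to review.

    Skills with fewer than `min_calls` records are skipped, so a handful
    of unlucky calls does not condemn a skill.
    """
    by_skill = defaultdict(list)
    for skill, success, tokens in calls:
        by_skill[skill].append((success, tokens))

    flagged = {}
    for skill, records in by_skill.items():
        if len(records) < min_calls:
            continue
        failure_rate = sum(1 for ok, _ in records if not ok) / len(records)
        avg_tokens = sum(t for _, t in records) / len(records)
        if failure_rate > max_failure_rate:
            flagged[skill] = "high failure rate"
        elif avg_tokens > max_avg_tokens:
            flagged[skill] = "bloated trace"
    return flagged


week = (
    [("gmail", False, 2000)] * 4 + [("gmail", True, 500)]
    + [("calendar", True, 300)] * 6
    + [("weather", False, 900)] * 2  # too few calls to judge
)
report = audit(week)  # {"gmail": "high failure rate"}
```

The report only flags. Whether to modify, delete, or replace a flagged skill is a separate decision.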

5. Main Agent Restructure

The main agent no longer holds specific skills directly. Instead:

Main agent receives a task
Spawns a sub-agent
Sub-agent calls the skill manager to install what it needs
Sub-agent executes

The main agent's only job is managing other agents and planning. I'm still thinking about whether planning and execution should be split further — probably over-engineering at this stage.
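The delegation flow can be sketched in a few lines. All class and method names here (`SkillManager`, `install_for`, `SubAgent`, `MainAgent`) are hypothetical, and the "execution" is a placeholder string rather than a real model call.

```python
class SkillManager:
    """Hypothetical registry: maps a tag to the skills that serve it."""

    def __init__(self, catalog: dict[str, list[str]]):
        self.catalog = catalog

    def install_for(self, tags: list[str]) -> list[str]:
        skills: list[str] = []
        for tag in tags:
            for skill in self.catalog.get(tag, []):
                if skill not in skills:
                    skills.append(skill)
        return skills


class SubAgent:
    def __init__(self, task: str, tags: list[str], manager: SkillManager):
        # The sub-agent asks the skill manager for exactly what it needs.
        self.task = task
        self.skills = manager.install_for(tags)

    def execute(self) -> str:
        # Placeholder for the real work done with the installed skills.
        return f"done: {self.task} using {', '.join(self.skills) or 'no skills'}"


class MainAgent:
    """Holds no skills itself; it only plans and delegates."""

    def __init__(self, manager: SkillManager):
        self.manager = manager

    def handle(self, task: str, tags: list[str]) -> str:
        return SubAgent(task, tags, self.manager).execute()


manager = SkillManager({"google": ["google-workspace"], "stock trading": ["stock-quotes"]})
result = MainAgent(manager).handle("send the weekly report", ["google"])
```

Note that `MainAgent` has no `skills` attribute at all. Skills exist only inside the sub-agent that requested them.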

Why This Changes the Design Fundamentally

The skill management system forces a question I hadn't thought about clearly before: what should the main agent be good at?

My answer: not much, specifically. The main agent should be good at delegation and planning. Everything else — tool use, skill selection, domain expertise — gets handled by specialized sub-agents that spin up with exactly the skills they need.

This is closer to how real teams work. A good manager doesn't know how to do every job on the team. They know who to call and what to ask for.

What's Left (TODO)

Better trigger for the audit cron (weekly is arbitrary)
Web search integration for auto-replacing deleted skills
Eval environment stability before running automated cleanup
Split planning into a separate agent (maybe)

On the Birth Announcement

Yes, I buried the lede. My first baby Grace was just born. Between her and a TOP urgent work task, the Agent from Scratch series is delayed. But I'm not giving up on it. Q.Q

This is the Hermes architecture as of June 2026. The code is on GitHub — Claude Code wrote it, I just had the ideas.

Hermes Agent: Building Real Multi-Agent Support

Chenyi Zhang — Fri, 15 May 2026 00:00:00 GMT

The Problem with Hermes's Built-in Multi-Agent

HermesAgent ships with delegate_task — it spins up sub-agents in-process, fast and simple. But look at the source code:

DELEGATE_BLOCKED_TOOLS = frozenset({"delegate_task", "clarify", "memory", ...})
child = AIAgent(..., skip_memory=True, ...)

Every insight a sub-agent develops dies when the thread exits. The swarm does work, but never gets smarter.

That's the fundamental problem. Sub-agents are disposable compute, not collaborative intelligence. I wanted something different.

What I Built Instead

Each sub-agent is a complete Hermes instance — own OS process, own config, own state, full memory access.

The Lifecycle

Spawn → Execute → Handoff → Complete → Merge Learnings → Cleanup

Spawn: spawn-agent.sh snapshots the main agent's config into an isolated instance
Execute: The sub-agent runs with full autonomy — no restricted tools, real memory
Handoff: Sub-agent writes a structured handoff with findings, memory updates, and skill recommendations
Complete: complete-agent.sh validates the handoff, sends results via message queue, deletes the instance directory immediately
Merge: The main agent absorbs learnings through the native memory pipeline

Instances are ephemeral. Learnings are permanent.
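Here's a small sketch of the validate-then-merge step. I'm assuming the handoff is JSON with three sections named after the list above, and that memory is a plain list of notes. The real scripts are shell, and the actual handoff format may differ.

```python
import json

# Assumed section names, based on the handoff description above.
REQUIRED_KEYS = {"findings", "memory_updates", "skill_recommendations"}


def validate_handoff(raw: str) -> dict:
    """Parse a handoff document and check that every section is present."""
    handoff = json.loads(raw)
    missing = REQUIRED_KEYS - set(handoff)
    if missing:
        raise ValueError(f"handoff missing sections: {sorted(missing)}")
    return handoff


def merge_learnings(memory: list[str], handoff: dict) -> list[str]:
    """Append new memory updates, skipping ones already recorded."""
    merged = list(memory)
    for note in handoff["memory_updates"]:
        if note not in merged:
            merged.append(note)
    return merged


raw = json.dumps({
    "findings": ["gmail skill calls failed during this task"],
    "memory_updates": ["use google-workspace for mail"],
    "skill_recommendations": ["review the gmail skill"],
})
memory = merge_learnings(["user prefers short answers"], validate_handoff(raw))
```

The instance directory can be deleted right after `validate_handoff` succeeds, because everything worth keeping is already in the handoff.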

Mistakes I Made Along the Way

Zombie agents in the registry. Strict bash mode + missing handoff file = the cleanup script exits early, leaving dead entries behind. Fixed with graceful degradation — always clean up the registry, even on failure.
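The same idea in Python rather than the original shell, as a sketch: a `finally` block guarantees that the registry entry is removed whether or not the handoff exists. The in-memory `registry` dict and the `handoff.md` filename are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical registry: agent id -> instance directory.
registry: dict[str, Path] = {}


def complete_agent(agent_id: str) -> Optional[str]:
    """Return the handoff text if present; always deregister the agent."""
    instance_dir = registry[agent_id]
    try:
        return (instance_dir / "handoff.md").read_text()
    except FileNotFoundError:
        return None  # degrade gracefully: no handoff, but still clean up
    finally:
        registry.pop(agent_id, None)  # runs on success and on failure


with tempfile.TemporaryDirectory() as tmp:
    registry["agent-1"] = Path(tmp)  # instance dir with no handoff file
    handoff = complete_agent("agent-1")
```

After this runs, `handoff` is `None` and the registry is empty, so there is no zombie entry.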

Agent ignored my sub-agent skill. Given a choice between native delegate_task and my shell script approach, the LLM picked the simpler option every time. The model naturally gravitates to the path of least resistance. Fixed by adding a Decision Guide explaining when each approach is appropriate — now the agent knows when to use the lightweight in-process delegate vs. when to spin up a full isolated instance.

Wrong API keys. The spawn script was pulling from the global Hermes install instead of the project-local agent. Fixed to fork from the running instance so the sub-agent inherits the correct context.

Why This Matters

The core insight: learning shouldn't be scoped to a thread lifetime.

If you're building a multi-agent system and your sub-agents can't retain what they discover, you're running an expensive stateless compute cluster, not a system that gets smarter over time.

Process isolation costs more than in-process threads. But it buys you:

Real memory that persists across the agent's lifetime
No cross-contamination between concurrent agents
Clean handoff artifacts you can inspect and audit
Agents that actually accumulate knowledge

All experiments done with Qoder's expert mode — highly recommended for long-running agentic tasks where you want the agent to make mistakes, learn, and fix them autonomously.

GitHub

Full implementation: github.com/Czhang0727/agent-from-scratch

Next: how skill management keeps the main agent sane as the number of skills grows.

Agent from Scratch Part 3: Skills

Chenyi Zhang — Sun, 10 May 2026 00:00:00 GMT

What is a Skill?

A skill is a user manual for a tool — or a chain of tools.

If the model isn't powerful enough to figure out tool usage on its own, a skill also includes examples. Think of it like onboarding documentation: "here's what this tool does, here's when to use it, here's a concrete example."

Unlike a one-time prompt, skills are designed to be read repeatedly. Your agent will reach for them on every relevant task.

The Pile of Manuals Problem

Now imagine your agent has 50 user manuals in front of it. It needs to pick the right one before it can do anything.

Two problems emerge immediately:

1. Ambiguity kills accuracy. If two skills are too similar — say, two different ways to fetch weather data — the model has no reliable way to pick. It'll guess, and it'll guess wrong sometimes.

2. Context burns tokens. Loading every skill into the context window is wasteful and degrades focus. The more irrelevant content the model has to wade through, the noisier its reasoning becomes.

Modern agent design spends a lot of effort solving the skill selection problem before skill loading ever happens.

Skill Selection: Index Before Load

The right pattern is: select index, then load skill.

Think about driving a car. You don't need the manual for how to fix the engine just because you're making a left turn. If your agent is writing a document, it doesn't need the stock trading skill loaded into memory.

The goal is:

Fast — retrieval should not be the bottleneck
Accurate — wrong skill = wrong tool = failed task

In my implementation, I skip the naive "dump all skills into context" approach and instead use indexed selection — match the task to the right skill before injecting anything.
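A toy version of "select index, then load skill", assuming a keyword index. My actual selection logic lives in the repo. The names `SKILL_INDEX`, `SKILL_BODIES`, `select_skill`, and `build_context` are invented for this sketch, and plain keyword overlap stands in for a proper retriever.

```python
from typing import Optional

# Small index that is cheap to search: skill name -> trigger keywords.
SKILL_INDEX = {
    "weather": {"weather", "forecast", "temperature"},
    "stock-trading": {"stock", "ticker", "trade", "portfolio"},
    "doc-writer": {"write", "document", "draft", "report"},
}

# Stand-in for skill files on disk; a body is loaded only after selection.
SKILL_BODIES = {
    "weather": "Manual: how to call the weather tool.",
    "stock-trading": "Manual: how to place and check trades.",
    "doc-writer": "Manual: how to draft and format documents.",
}


def select_skill(task: str) -> Optional[str]:
    """Pick the skill whose keywords overlap the task the most."""
    words = set(task.lower().split())
    best, best_score = None, 0
    for name, keywords in SKILL_INDEX.items():
        score = len(words & keywords)
        if score > best_score:
            best, best_score = name, score
    return best


def build_context(task: str) -> str:
    """Inject only the selected skill's manual, or nothing at all."""
    name = select_skill(task)
    manual = SKILL_BODIES[name] if name else ""
    return f"{manual}\n\nTask: {task}".strip()


context = build_context("draft a report on quarterly sales")
```

Only the doc-writer manual ends up in `context`. The stock trading manual never touches the context window.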

Skill Selection as Reinforcement

Here's an interesting insight: learning skill selection from human behavior is essentially what Meta's "distill from human" approach does at scale.

When a human expert picks the right tool for a job, that decision carries signal. If you capture those decisions — which skill was chosen, what was the context, did it succeed — you can train a model to make better choices over time.

The data you accumulate from real agent runs becomes a natural fine-tuning dataset. Your agent literally gets better at picking the right skill the more it works.

What's in a Skill File?

In practice, a skill is a plain text file. It can include:

Tool definition — what the tool does, its parameters, return values
Usage instructions — when to call it, what to avoid
Chaining examples — how to combine it with other tools
Failure modes — common errors and how to recover

Images work too, as long as your processor model handles multimodal input.

Key Design Principles

One skill, one job. Overlapping skills cause ambiguity. Deduplicate aggressively.
Index before load. Never inject skills you don't need for the current task.
Skills are maintained, not set-and-forget. APIs change, tools break, better patterns emerge. Treat your skills like code.
Capture selection signal. Every time your agent picks (or fails to pick) the right skill, that's training data.

GitHub

The implementation is at github.com/Czhang0727 — skills, selection logic, and the full agent scaffold.

Part 4 covers memory — how agents extend context beyond what fits in the window.

Agent from Scratch Part 4: Memory

Chenyi Zhang — Sun, 10 May 2026 00:00:00 GMT

Memory = Context Window Expansion

In agent design, "memory" is a word that sounds complicated but maps to something concrete: getting information into the context window that wouldn't otherwise fit.

A 256k context window sounds large until you try to fit the world's knowledge into it. More importantly, you shouldn't try — attention spreads thin over large contexts. Finding a tiny detail inside a massive blob of text is like finding a footnote inside an encyclopedia. The model can do it, but not reliably.

The Early Problem

Early LLMs had tiny context windows: a few thousand tokens was typical, and 64k would have been generous. With the system prompt, guidance, and conversation history taking up space, you had almost nothing left for the actual knowledge that helps the model produce good answers.

This forced a key design decision: most knowledge lives outside the context window. You retrieve what you need, when you need it.

RAG (Retrieval-Augmented Generation) is the canonical solution. So is the knowledge graph. Both are just strategies for deciding what to pull in and when.

The File System Mental Model

Here's the simplest way to think about agent memory: it's a text file.

Think about what you can do with a text file in an OS:

Read — load it into context when relevant
Append — add new information without overwriting
Overwrite — replace when the old content is no longer valid
Concat — merge multiple memory sources

That's the complete set of memory operations you need. No magic required.
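The four operations map directly onto plain file calls. A minimal sketch using `pathlib`, where the function names and the one-entry-per-line layout are my choices for illustration:

```python
import tempfile
from pathlib import Path


def read(path: Path) -> str:
    """Load a memory file into context; a missing file is empty memory."""
    return path.read_text() if path.exists() else ""


def append(path: Path, entry: str) -> None:
    """Add new information without touching what is already there."""
    with path.open("a") as f:
        f.write(entry + "\n")


def overwrite(path: Path, content: str) -> None:
    """Replace the file when the old content is no longer valid."""
    path.write_text(content + "\n")


def concat(paths: list[Path]) -> str:
    """Merge several memory sources into one block of context."""
    return "".join(read(p) for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    prefs, facts = Path(tmp) / "prefs.txt", Path(tmp) / "facts.txt"
    append(prefs, "user prefers short answers")
    append(prefs, "user works in Python")
    overwrite(facts, "gmail skill API is outdated")
    context = concat([prefs, facts])
```

`context` now holds all three entries, one per line, ready to be placed into the prompt.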

Triggers and Lifecycle

Every memory system needs a trigger — some event that causes the agent to create, update, or read memory.

In my implementation, the trigger is simple: end of session. After each session, the agent:

Scans existing memories for relevance
Summarizes what happened: key actions taken, what worked, what failed
Writes a new memory entry or updates an existing one

This creates a persistent record of agent experience. Over multiple sessions, the agent builds up a structured history of its own performance.

Memory as Training Data

Here's where it gets interesting.

The summaries your agent writes are exactly the kind of data you'd want for fine-tuning. Key decisions made, correct choices, wrong turns, recovery patterns — this is behavioral signal in a clean format.

If you can afford to fine-tune (or when fine-tuning costs drop further), the memory log from a well-designed agent becomes a natural training dataset. Your model starts embodying the patterns of whoever built the agent.

Evaluation Closes the Loop

In production, you don't just write memories blindly. You run an evaluation pass after each session:

Did the agent achieve the goal?
Were the actions efficient?
Were any tools misused?

Only memories that pass evaluation get committed to long-term storage. Bad runs get flagged for review, not reinforced.
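As a sketch, the gate can be a single predicate over the three questions above. The `SessionSummary` fields and the `max_steps` cutoff are assumptions. Real evaluation is much fuzzier than three fields.

```python
from dataclasses import dataclass


@dataclass
class SessionSummary:
    text: str
    goal_achieved: bool   # did the agent achieve the goal?
    steps_taken: int      # rough proxy for efficiency
    tools_misused: int    # count of tool calls judged to be misuse


def passes_evaluation(s: SessionSummary, max_steps: int = 20) -> bool:
    return s.goal_achieved and s.steps_taken <= max_steps and s.tools_misused == 0


def commit(summaries, long_term, review_queue):
    """Good runs go to long-term memory; bad runs are flagged, not reinforced."""
    for s in summaries:
        (long_term if passes_evaluation(s) else review_queue).append(s.text)


summaries = [
    SessionSummary("used calendar skill to book the meeting", True, 4, 0),
    SessionSummary("looped on gmail skill failures", False, 30, 2),
]
long_term, review_queue = [], []
commit(summaries, long_term, review_queue)
```

The failed run still gets recorded, just in `review_queue` instead of long-term memory.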

This is the same loop that makes humans better at their jobs: do, reflect, evaluate, adjust.

The Full Picture

Session starts
  → Load relevant memories into context
  → Agent executes task using skills + context

Session ends
  → Summarize what happened
  → Evaluate quality
  → Write/update memory
  → (Optional) Flag data for fine-tuning

Simple. No exotic architecture required. The complexity is in the evaluation step — deciding what counts as a good run is the hardest part.

GitHub

Implementation is at github.com/Czhang0727. The memory system is the simplest module in the repo — a reminder that the best designs usually are.

The next post covers Hermes — a real-world agent hitting the limits of this design and what I built to fix it.

Agent from Scratch Part 2: Orchestration

Chenyi Zhang — Fri, 01 May 2026 00:00:00 GMT

The Problem with Raw LLM

Now that IO is hooked up, you'd think the agent should work. It doesn't.

A raw LLM is pretty much Q&A — there's no skill, no action. It just answers your input with predicted tokens. Impressive, but useless as an agent.

To resolve that, we need prompt engineering. This is probably the only truly unique part of an LLM-powered agent system. Everything else borrows from existing software patterns.

Two Types of Prompts

In my implementation I gave the agent two core prompts: emotional support and productivity.

The difference comes down to what we ask the agent to do:

Emotional support prompt: "Say something nice, be supportive." The prompt helps the agent recognize that its job is comfort, not tasks. From the model's point of view, we've provided context, so it can make a better prediction.
Productivity prompt: Way more complex. This is where the "harness system" lives.

The Harness System

Harness engineering = creating a bash-style execution environment where:

We have skills (bash commands)
We define how to trigger them (exact match vs. model-generated)

In my example, I created a set of skill schemas. It's a bit old-fashioned compared to plain-text skills I'll cover later, but they do the same thing at their core.
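To give a feel for what a skill schema with an exact-match trigger might look like, here's a minimal sketch. It is not the schema from my repo: `SkillSchema`, `SKILLS`, `dispatch`, and the two toy skills are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillSchema:
    name: str                 # the "bash command" the model can call
    params: tuple[str, ...]   # expected positional parameters
    run: Callable[..., str]   # the function behind the skill


SKILLS = {
    s.name: s
    for s in [
        SkillSchema("echo", ("text",), lambda text: text),
        SkillSchema("add", ("a", "b"), lambda a, b: str(int(a) + int(b))),
    ]
}


def dispatch(command: str) -> str:
    """Exact-match trigger: the first word names the skill, the rest are args."""
    parts = command.split()
    if not parts:
        return "empty command"
    name, args = parts[0], parts[1:]
    skill = SKILLS.get(name)
    if skill is None:
        return f"unknown skill: {name}"
    if len(args) != len(skill.params):
        # Return a usage line the model can read and retry against.
        return f"usage: {name} " + " ".join(f"<{p}>" for p in skill.params)
    return skill.run(*args)


outputs = [dispatch("add 2 3"), dispatch("add 2"), dispatch("fly away")]
```

Errors come back as text instead of exceptions on purpose: the model reads the usage line and can correct its next call.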

The full execution loop looks like this:

LLM reads local env
  → finds function it can use
  → understands the task
  → does it
  → validates and responds to user
  → user annotates (correct / incorrect)
  → agent learns from execution

The abstraction isn't that different from humans: fail more, learn more. And eventually there will be an "aha moment."

Distillation is the Real Secret

I almost forgot to mention the most important thing: "learn from other people's success or failure" is the best way to describe what good orchestration enables.

When you capture agent execution logs — what it tried, whether it worked, what the user annotated — you have a distillation dataset. That's exactly what powerful models are trained on: human-annotated traces of good decisions.

Keep It Simple

Data is king. Keep the flow simple but logical. Let the agent figure out the best way to do things — don't over-engineer the orchestration layer.

A complex orchestration system you built becomes a constraint the agent has to work around. A simple harness the agent can reason about is a tool the agent can use.

GitHub

Full code at github.com/Czhang0727/agent-from-scratch.

Part 3 covers skills — the user manuals that tell your agent which tools to use and when.

Agent from Scratch Part 1: IO

Chenyi Zhang — Fri, 10 Apr 2026 00:00:00 GMT

I lost my wisdom teeth today, so let's make it simple...

What is IO for an Agent?

IO defines how your agent system can explore or communicate with its external environment.

It's not a real human: it won't see, smell, or feel. Anything going into the agent, and anything coming out, is plain bits.

Here's the bare minimum IO you need for an agent:

Text input
Text output

That's it. You can build a lot with just that.

Processors: Not Just Neural Nets

Before machine learning took over, processors were rule-based. Believe it or not, these systems still run today: when you call your bank and hear "Press 1 for balance, Press 2 for transfers," that's a rule-based agent. I'll cover rule-based processors later. For now, let's focus on IO.

Multimodal: Making IO Cooler

Want to go beyond text? "Multimodal support" just means your IO bus handles more data types. Video, image, voice — these are already solved problems:

Image viewer
Video player
MP3 player
Microphone input drivers
Image transformer

None of these are new. They've been around for decades, and they perfectly meet agent needs. The trick is making your IO bus generalized — built to accept more input types via plugins over time.

Think about where this goes: agents will soon have physical bodies. IoT sensors will feed into the same IO bus. The abstraction that handles voice today will handle temperature sensors tomorrow.

Design Principle: Generalize Your IO Bus

Don't hardcode IO types. Build a plugin-friendly bus where new input/output channels can be added without touching core agent logic.

Your agent's intelligence lives in the middle. The IO bus is just plumbing — but design it well and you only build it once.
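One way to sketch a plugin-friendly bus: each input type registers a decoder that turns raw bytes into text for the agent. `IOBus`, `register`, `receive`, and the temperature encoding (tenths of a degree as a little-endian integer) are all hypothetical.

```python
from typing import Callable


class IOBus:
    """Routes raw input to a per-type decoder; core logic never changes."""

    def __init__(self) -> None:
        self._decoders: dict[str, Callable[[bytes], str]] = {}

    def register(self, kind: str, decoder: Callable[[bytes], str]) -> None:
        """Add a new input channel as a plugin."""
        self._decoders[kind] = decoder

    def receive(self, kind: str, payload: bytes) -> str:
        if kind not in self._decoders:
            raise KeyError(f"no plugin registered for input type {kind!r}")
        return self._decoders[kind](payload)


bus = IOBus()
bus.register("text", lambda b: b.decode("utf-8"))
# Added later, without touching IOBus: a made-up temperature sensor format.
bus.register("temperature", lambda b: f"{int.from_bytes(b, 'little') / 10:.1f} C")

greeting = bus.receive("text", b"hello")
reading = bus.receive("temperature", (215).to_bytes(2, "little"))
```

Adding the sensor took one `register` call. The agent in the middle still just sees text.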

GitHub

Full implementation at github.com/Czhang0727/agent-from-scratch.

Part 2 covers orchestration — once IO is hooked up, how do you get the model to actually do things?

Agent from Scratch Part 0: What Is an Agent?

Chenyi Zhang — Wed, 01 Apr 2026 00:00:00 GMT

I'm starting to build a general agent framework from scratch, sharing what I've learned over the past few years. Let's start from the very beginning.

What Is an Agent?

IMO, an agent is a workflow that can think like a human — do what a human can do. That concept existed even before LLMs, when we had stateful agents in backend system design.

The only reason "agents" are popular now is Large Models. We finally found a moment when agent design could be generalized — not hand-crafted for each narrow task.

The 10,000ft View: An Agent Is a PC

Back to old-fashioned computing: we have IO, a CPU, and storage.

An agent maps almost perfectly:

CPU → LLM
IO → connector to external devices (tools, APIs, sensors)
Storage → memory

Yep, it's that simple.

Over time, engineers added fancy stuff to make each component faster:

Better CPU → better models
Larger bandwidth → larger context windows
More applications → more skills / MCP servers

Nothing fundamentally changed.

The Agent Heartbeat

Here's the pseudocode for agent orchestration. If you know how OpenClaw works, this is pretty much the heartbeat:

import time

while True:
    time.sleep(1)  # poll once per second
    user_input = read_input(context)
    intent_and_plan = think(context, user_input)
    execution_result = do(context, intent_and_plan)
    # this phase can sometimes run asynchronously
    evaluation(context, execution_result)

Simple loop: read, think, do, evaluate. Repeat.

The Event-Driven Upgrade

There's a known problem with sleep — wasting resources waiting. The solution? Event-driven, just like JavaScript.

Claude Code's internals indicate they're doing the same thing. So the loop evolves:

User interaction side:

pub_sub_client = PubSubClient()

user_input = read_user_input()
pub_sub_client.send(topic="user_input", message=user_input)
result = pub_sub_client.subscribe(topic="task_result")

Consumer (agent) side:

user_input = pub_sub_client.subscribe(topic="user_input")
intent_and_plan = think(context, user_input)
execution_result = do(context, intent_and_plan)
pub_sub_client.send(topic="task_result", message=execution_result)
# this phase can sometimes run asynchronously
evaluation(context, execution_result)

Clean decoupling. The agent becomes a proper event consumer.
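If you want to run the event-driven version locally, here's a self-contained sketch using only the standard library. Two `queue.Queue` objects play the role of pub/sub topics, a thread plays the consumer, and `think` and `do` are placeholders for the model call and tool execution. A real system would use an actual message broker.

```python
import queue
import threading

# Two in-process queues standing in for pub/sub topics.
topics = {"user_input": queue.Queue(), "task_result": queue.Queue()}


def think(text: str) -> str:
    return f"plan for: {text}"  # placeholder for the LLM call


def do(plan: str) -> str:
    return f"executed {plan}"  # placeholder for tool execution


def agent_worker() -> None:
    while True:
        user_input = topics["user_input"].get()  # blocks until an event arrives
        if user_input is None:  # sentinel value: shut the worker down
            break
        topics["task_result"].put(do(think(user_input)))


worker = threading.Thread(target=agent_worker)
worker.start()

# User interaction side: publish the input, then wait for the result.
topics["user_input"].put("book a flight")
result = topics["task_result"].get(timeout=5)

topics["user_input"].put(None)
worker.join(timeout=5)
```

The worker spends its idle time blocked inside `get()`, so nothing is burned on a sleep-and-poll loop.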

What's Coming

In this series, I'll dig deeper into each component:

IO — how the agent talks to the world
Orchestration — prompt engineering and the harness system
Skills — user manuals for tools
Memory — expanding the context window
Multi-agent — when one agent isn't enough

Track progress and raise issues / PRs at github.com/Czhang0727/agent-from-scratch.