
Hermes Agent: When Your AI Forgets What You Told It Yesterday

Most AI agents reset every session. Hermes Agent remembers, learns, and builds skills from what worked – but there's a catch few tutorials mention.

10 min read · Intermediate

You spend an hour explaining your project setup to Claude. It writes perfect code. You close the terminal, come back tomorrow, and it’s forgotten everything.

Start over.

Hermes Agent by Nous Research promises to fix this: an open-source AI that remembers what it learns, builds reusable skills from successful tasks, and gets measurably better the longer you use it. Released in February 2026, it’s MIT licensed, runs on your own hardware, and stores everything locally.

But there’s a problem most tutorials won’t tell you about.

The Feature That’s Turned Off

Hermes markets itself around persistent memory – specifically Honcho, a system that builds a deepening model of how you work across every session. Sounds great. The catch: it’s disabled by default.

Multiple Reddit users reported confusion when the “self-learning” features didn’t work out of the box. You have to explicitly enable Honcho in the config. This isn’t mentioned in the quickstart. It’s buried in setup docs and community threads.

Run hermes memory setup after installation and choose your memory provider. Without this step, you get basic session memory – the same amnesia problem you’re trying to escape.

Why Most AI Agents Forget (And Why Hermes Doesn’t Have To)

Standard agents – ChatGPT, Claude, even most self-hosted tools – treat every conversation as isolated. You close the session, the context vanishes. If you ask the same question next week, it starts from zero.

Hermes uses three memory layers. Session memory is the current conversation (standard stuff). Persistent memory stores facts, preferences, and project details across sessions via MEMORY.md and USER.md files. Skill memory captures successful multi-step workflows as reusable procedures.

The persistent layer uses FTS5 full-text search over a SQLite database. When you reference something from weeks ago, the agent can retrieve it without stuffing your entire history into the context window. In theory.
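The retrieval idea can be sketched in a few lines of Python. This is an illustration of FTS5-backed lookup only – the table name, column, and sample facts below are invented for the sketch, not Hermes's actual schema:

```python
import sqlite3

# In-memory database for illustration; Hermes's real schema is not public here.
db = sqlite3.connect(":memory:")

# FTS5 virtual table: a full-text index over stored memory entries.
db.execute("CREATE VIRTUAL TABLE memory USING fts5(fact)")
db.executemany(
    "INSERT INTO memory (fact) VALUES (?)",
    [
        ("User prefers tabs over spaces in Go files",),
        ("Project 'atlas' deploys via GitHub Actions to a Hetzner VPS",),
        ("The staging database password rotates every 30 days",),
    ],
)

# MATCH runs a full-text query; only the relevant rows come back,
# so the agent never has to load its whole history into the context window.
rows = db.execute(
    "SELECT fact FROM memory WHERE memory MATCH ?", ("deploys",)
).fetchall()
print(rows[0][0])  # the 'atlas' deployment fact
```

The point of the index is the last query: one matching fact is retrieved, not the full memory file.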

The 64K Context Wall (And What Happens When You Hit It)

Hermes requires models with at least 64,000 tokens of context. According to the official documentation, models with smaller windows “cannot maintain enough working memory for multi-step tool-calling workflows and will be rejected at startup.”

That’s a hard rejection. The agent won’t start.

For local models via Ollama, you must set --ctx-size 65536 explicitly. Ollama defaults to lower values. If you don’t configure this, your agent will fail silently or produce broken tool calls. Community experience suggests 32B+ parameter models work reliably; smaller models struggle with multi-step reasoning even if they meet the 64K requirement.

Pro tip: If running Ollama locally, verify your context setting before spending time on setup. Run ollama run gemma4:26b --ctx-size 65536 to force the correct window size. The agent can’t set this for you via the OpenAI-compatible API.

Here’s what nobody mentions: MEMORY.md has a ~2,200 character limit. When it fills up, Hermes consolidates entries to make room. Before compression, it runs a dedicated “memory flush” – a separate model call where only the memory tool is available. Facts that weren’t flagged during that flush don’t survive.

This is agent-curated memory under context pressure. It works until it doesn’t.
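To make the pressure concrete, here is a deliberately crude sketch of consolidation under a character budget. Hermes's real flush is a model call that decides what survives; this version just shows that once the budget is hit, something has to go:

```python
# Illustrative only: Hermes's actual consolidation is model-driven,
# not a simple newest-first truncation like this sketch.
CHAR_LIMIT = 2200  # approximate MEMORY.md budget described above

def consolidate(entries: list[str], limit: int = CHAR_LIMIT) -> list[str]:
    """Keep the newest entries that fit the budget; older ones are dropped."""
    kept: list[str] = []
    used = 0
    for entry in reversed(entries):  # walk newest first
        if used + len(entry) + 1 > limit:  # +1 for the newline per entry
            break
        kept.append(entry)
        used += len(entry) + 1
    return list(reversed(kept))

memory = [f"fact {i}: " + "x" * 200 for i in range(20)]  # ~4,200 chars total
survivors = consolidate(memory)
print(f"{len(survivors)} of {len(memory)} entries survive the flush")
```

Whatever the real heuristic is, the budget guarantees lossy behavior: in this run roughly half the entries are gone, and nothing in the sketch (or the agent) promises the dropped half was the unimportant half.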

The Telegram Token Trap

If you plan to run Hermes via the messaging gateway – Telegram, Discord, WhatsApp – here’s the number that matters: 15-20K input tokens per message.

Compare that to CLI usage: 6-8K tokens for the same conversation. Accessing your agent over Telegram costs roughly 2-3x more.

The root cause was a bug in versions before v0.6.0. The gateway spawned in the hermes-agent repository directory instead of your home directory, loading development files (AGENTS.md, miscellaneous repo data) into every single request. One Reddit user reported 4 million tokens in 2 hours of light usage because of a Telegram gateway debugging loop.

Update to the latest version. Run hermes update and restart the gateway. If you’re on an older version, manually launch the gateway from your home directory, not the source folder.

Set API spend limits at your provider dashboard before you start. OpenRouter, OpenAI, and Anthropic all support this. One user hit $405 in API costs during a single project because they didn’t realize how token-heavy the gateway was.

What Hermes Actually Does Well

The skill system is the real differentiator. After completing a complex task (typically 5+ tool calls), Hermes can autonomously create a skill document: a structured markdown file with procedures, known pitfalls, and verification steps. These are stored in ~/.hermes/skills/ and follow the agentskills.io open standard.
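A skill document might look roughly like the following. The section layout here is a guess at the "procedures, pitfalls, verification" structure described above, not a verbatim copy of the agentskills.io schema, and the task itself is invented:

```python
from pathlib import Path
import tempfile

# Hypothetical skill document; field names are illustrative assumptions.
SKILL = """\
# Skill: summarize-arxiv-paper

## Procedure
1. Fetch the abstract page and extract title, authors, abstract.
2. Download the PDF only if the abstract is insufficient.
3. Produce a 5-bullet summary with one limitation noted.

## Known pitfalls
- Some listings rate-limit rapid PDF downloads.

## Verification
- Summary has exactly 5 bullets and cites the paper ID.
"""

# Stand-in for ~/.hermes/skills/ so the sketch runs anywhere.
skills_dir = Path(tempfile.mkdtemp()) / "skills"
skills_dir.mkdir(parents=True)
path = skills_dir / "summarize-arxiv-paper.md"
path.write_text(SKILL)

# On a later, similar task the agent loads this file instead of
# re-deriving the workflow from scratch.
print(path.read_text().splitlines()[0])
```

The value is in the pitfalls and verification sections: they encode what the agent learned the hard way the first time.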

The next time a similar task comes up, the agent loads the skill instead of solving the problem from scratch. One Reddit user reported a 40% speedup on repeated research tasks after the agent created three skill documents over two hours.

Skills self-improve during use. If the agent discovers a better approach while using an existing skill, it updates the document. This isn’t marketing – it’s a real feedback loop.

The catch: self-evaluation is unreliable. Hermes evaluates its own work to decide whether a task succeeded. Some users reported the agent overwrote their carefully tuned skills during “self-improvement,” turning them into jumbled messes. If you’ve spent time customizing a workflow, having the agent rewrite it autonomously can be frustrating.

Model Flexibility (200+ Options, But Pick Carefully)

Hermes supports any OpenAI-compatible endpoint. That includes:

  • OpenRouter (200+ models)
  • Nous Portal (400+ models as of April 2026)
  • Direct APIs: OpenAI, Anthropic, Kimi, MiniMax, Hugging Face
  • Local via Ollama, vLLM, llama.cpp, LM Studio
  • Custom self-hosted endpoints

Switch providers with hermes model – no code changes, no lock-in.

For local deployments: you need at least 24GB VRAM (or unified memory on Apple Silicon) to run capable models. Gemma 4 26B MoE and Qwen 3.5 27B are the community-recommended local options as of April 2026. Both handle agent workflows well if you configure the context window correctly.

For cloud APIs: budget $15-80/month based on typical usage patterns. Heavy use (large projects, multi-step debugging) can hit $400+. The framework itself is free. Hosting on a $5 VPS works. Your bill comes from LLM API calls.
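A back-of-envelope estimate shows why the gateway multiplier matters for that budget. The per-token price and message volume below are assumptions for illustration; check your provider's current pricing:

```python
# Assumed price for illustration only; real provider pricing varies widely.
PRICE_PER_M_INPUT = 3.00  # USD per 1M input tokens

def monthly_cost(messages_per_day: int, tokens_per_message: int) -> float:
    """Rough monthly input-token spend for a given usage pattern."""
    tokens_per_month = messages_per_day * tokens_per_message * 30
    return tokens_per_month / 1_000_000 * PRICE_PER_M_INPUT

cli = monthly_cost(40, 7_000)        # ~6-8K tokens per CLI exchange
gateway = monthly_cost(40, 17_500)   # ~15-20K tokens per gateway message
print(f"CLI: ${cli:.2f}/mo, gateway: ${gateway:.2f}/mo")
```

Same 40 messages a day, same assumed price: the gateway path lands at more than double the CLI bill purely from per-message context overhead.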

The Gateway Is the Real Product

Most people try the CLI first. That’s fine for testing. But the gateway is where Hermes becomes useful.

One gateway process connects Telegram, Discord, Slack, WhatsApp, Signal, and CLI simultaneously. You start a conversation on your phone via Telegram, pick it up later in the terminal on your laptop. Same agent, same memory, same session.

It supports 6 terminal backends: local, Docker, SSH, Daytona, Singularity, Modal. The serverless options (Daytona, Modal) hibernate when idle and wake on demand – you pay almost nothing between sessions. This means your agent can run on a cloud VM while you’re away from your desk, and you interact with it from anywhere.

Setup: hermes gateway setup walks you through connecting platforms. For Telegram, you’ll need a bot token from @BotFather. For Discord, create an app in the Discord Developer Portal and enable MESSAGE CONTENT INTENT (this is the #1 cause of silent failures).

To restrict access, set TELEGRAM_ALLOWED_USERS with your user ID. Otherwise, anyone who finds your bot can use it.
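The allow-list semantics can be sketched as a simple membership check. The comma-separated format is an assumption about how TELEGRAM_ALLOWED_USERS is parsed; verify against the docs before relying on it:

```python
# Sketch of allow-list semantics; the parsing format is an assumption.
def is_allowed(user_id: int, allowed_env: str) -> bool:
    """True if user_id appears in the comma-separated allow-list."""
    allowed = {s.strip() for s in allowed_env.split(",") if s.strip()}
    if not allowed:
        return True  # empty/unset allow-list: open to anyone who finds the bot
    return str(user_id) in allowed

env = "123456789, 987654321"
print(is_allowed(123456789, env))  # your ID: allowed
print(is_allowed(555, env))        # stranger: blocked
print(is_allowed(555, ""))         # unset: unrestricted, i.e. the risk above
```

The third case is the one that bites: leaving the variable unset is equivalent to publishing your bot, and your API bill, to anyone who discovers the handle.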

When to Use Hermes (And When Not To)

Use Hermes if:

  • You work on the same projects repeatedly and want the agent to remember context between sessions
  • You need an agent that runs 24/7 on a server, not just when your laptop is open
  • You want to interact via messaging apps (Telegram, Discord) while the agent works on a remote machine
  • You prefer open source and self-hosting over managed services
  • You’re willing to manage infrastructure and debug token usage

Don’t use Hermes if:

  • You want plug-and-play setup with zero configuration (it’s not that)
  • You need 50+ platform integrations out of the box (OpenClaw has broader coverage)
  • You’re unwilling to monitor API costs closely (token bloat is a real risk)
  • You expect features to work by default without digging into config files

Is Hermes more stable than OpenClaw? Unclear. As one highly upvoted Reddit comment pointed out: “Hermes has had 6 releases to OpenClaw’s 82 releases. 3 of Hermes releases didn’t even work. Don’t listen to claims of it being more stable because it hasn’t been around to even make that claim.”

Fewer updates means fewer chances to break things. That’s not the same as stability.

Installation (The Parts That Break)

The one-liner install works on Linux, macOS, and WSL2:

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Native Windows is not supported. Install WSL2 first.

After installation:

  1. hermes setup – configure your LLM provider and model
  2. hermes memory setup – enable Honcho or another memory provider (critical, often skipped)
  3. hermes model – verify your model meets the 64K context requirement
  4. hermes – start the CLI

If coming from OpenClaw, run hermes claw migrate --dry-run to preview what gets imported before committing. It can migrate settings, memories, skills, and API keys automatically.

Common failure points:

  • hermes-agent command not found – PATH issue; reload your shell
  • API key rejected – check for trailing whitespace when copy-pasting
  • Telegram bot not responding – MESSAGE CONTENT INTENT not enabled
  • Local models failing tool calls – --jinja flag missing on llama-server

Run hermes doctor to diagnose issues.

What You Should Do Next

If you’re serious about trying Hermes, do this before you install anything:

Check whether your preferred model meets the 64K context requirement. If you’re using a local model, verify you have enough VRAM. If you’re using a cloud API, set a spending limit at your provider dashboard.

Decide whether you’re willing to enable Honcho manually. The self-learning features people talk about don’t work without it.

Understand that the gateway costs more tokens than the CLI. If you plan to use Telegram or Discord heavily, budget accordingly or stick to CLI for heavy work.

The official documentation covers everything else. The GitHub repository is actively maintained. The community is split between Hermes and OpenClaw, but both have real users shipping real work.

Hermes isn’t perfect. But if you’ve been frustrated by AI agents that forget everything between sessions, it’s worth the setup cost.

Can Hermes run completely offline with local models?

Yes. Point Hermes at an Ollama instance running locally with hermes model and select “Custom endpoint.” Enter http://localhost:11434/v1 (Ollama’s OpenAI-compatible endpoint). Your model must have at least 64K context – set this with --ctx-size 65536 when launching Ollama. Gemma 4 26B and Qwen 3.5 27B are the community-recommended local options. You’ll need 24GB+ VRAM or unified memory.

What’s the real monthly cost if I use cloud APIs heavily?

Community reports range from $15-80/month for typical use. One user hit $405 during a single large project. The framework is free (MIT license). Costs come entirely from LLM API usage. Telegram gateway uses 2-3x more tokens than CLI due to context loading. Set spend limits at your provider dashboard before you start.

Does Hermes actually remember things better than ChatGPT?

Only if you enable Honcho. By default, Hermes uses MEMORY.md (facts) and USER.md (preferences) that persist across sessions – better than ChatGPT’s session-only memory. But the “self-improving” features require explicit setup via hermes memory setup. MEMORY.md has a ~2,200 character limit. When full, the agent consolidates entries, and some details get lost. It’s better than nothing, but not perfect.