Klaus · by Robbie

How We Monitor 500+ OpenClaw Instances (And What Breaks)

Running one OpenClaw instance is simple. You set it up, connect your messaging apps, and check on it when something feels off. Running 500+ is a different problem. The failures that matter at scale are not the ones that crash loudly. They’re the ones that fail silently: a WhatsApp channel disconnects at 2 AM, a cron job stops firing without an error, an agent confidently uses a tool that stopped working three days ago.

At Klaus, we built Clawbert to handle this. It’s not a product we sell. It’s an internal tool that monitors our customer instances and fixes the boring problems before anyone notices. This is how it works, what it catches, and what still gets past it.

What Actually Breaks in Production

The OpenClaw troubleshooting docs list eight categories of common failure, including messaging, gateway startup, channel flow, automation delays, tool execution, and browser issues. That’s the official view. Here’s what we see in practice, drawing from both our own fleet and what the self-hosting community reports.

Config Crash-Loops

A self-hosted user documented a collection of production findings on GitHub after four weeks of running OpenClaw. Among them: adding a custom key to openclaw.json that the gateway didn’t recognize caused the gateway to crash on startup. Systemd restarted it. It crashed again. 377 times. There’s no backoff mechanism in the restart loop and no validation on config keys. A single unrecognized entry makes the server completely unresponsive until someone manually edits the file.

We’ve hit the same class of problem on our fleet. The gateway process is technically “running” from systemd’s perspective. It just crashes and restarts faster than most health checks poll.
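
One cheap signal that does catch it is the unit's restart counter rather than its active state. Here is a minimal sketch of that check, assuming the gateway runs as a systemd unit; the unit name and threshold are our assumptions, not OpenClaw defaults.

```python
#!/usr/bin/env python3
"""Flag a unit that is crash-looping: "active" as far as systemd is concerned,
but racking up restarts between monitoring polls. Sketch only."""
import subprocess

UNIT = "openclaw-gateway.service"   # assumed unit name, adjust to your setup
RESTART_THRESHOLD = 5               # alert once restarts pile up this fast

def restart_count(unit: str) -> int:
    # NRestarts is the number of automatic restarts systemd has performed
    # for the unit (available on reasonably recent systemd versions).
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=NRestarts", "--value"],
        capture_output=True, text=True,
    )
    return int(out.stdout.strip() or 0)

if __name__ == "__main__":
    n = restart_count(UNIT)
    if n >= RESTART_THRESHOLD:
        print(f"ALERT: {UNIT} restarted {n} times; likely crash-looping")
    else:
        print(f"OK: {UNIT} ({n} restarts)")
```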

Channel Disconnects

Messaging channel connections drop. WhatsApp sessions expire. One well-documented case from that same report: Discord reception stopped working entirely across multiple OpenClaw versions, silently dropping all inbound messages while outbound still worked. The agent could send messages but couldn’t hear replies. That regression lasted weeks.

The tricky part: the agent doesn’t know it’s deaf. It just stops receiving input and sits idle. Unless you’re actively checking channel connectivity, you won’t notice until a customer asks why their agent ghosted them.

Silent Cron Failures

Cron jobs are the backbone of business automation on OpenClaw: daily reports, scheduled research, recurring follow-ups. When they break, they break quietly. Another finding from the same report: web search fails silently in isolated cron sessions with no error message. The cron job runs, the agent tries to search, gets nothing back, and either produces a report without the data or skips the step entirely. No alert, no log entry, no indication anything went wrong.
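
The workaround we lean on is output validation: treat a job that ran but produced nothing useful as a failure. A rough sketch of the idea, with a hypothetical report path, size floor, and age limit.

```python
"""Treat "ran but produced nothing useful" as a failure.
Sketch only: the report path, size floor, and age limit are hypothetical."""
from datetime import datetime, timedelta
from pathlib import Path

REPORT = Path("/var/openclaw/reports/daily.md")  # hypothetical output path
MIN_BYTES = 500                 # smaller than this probably means a skipped step
MAX_AGE = timedelta(hours=26)   # daily job, with a little slack

def report_ok(path: Path) -> tuple[bool, str]:
    if not path.exists():
        return False, "report file missing"
    stat = path.stat()
    age = datetime.now() - datetime.fromtimestamp(stat.st_mtime)
    if age > MAX_AGE:
        return False, f"report is stale ({age} old)"
    if stat.st_size < MIN_BYTES:
        return False, f"report suspiciously small ({stat.st_size} bytes)"
    return True, "report looks fine"

if __name__ == "__main__":
    ok, reason = report_ok(REPORT)
    print(("OK: " if ok else "ALERT: ") + reason)
```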

Update Breakage

OpenClaw ships fast. In a two-week span in March 2026, there were 7 stable releases, nearly half of which included breaking changes. Three security advisories in a single week. You can’t ignore updates because some are genuine security patches (WebSocket hijacking, workspace plugin code execution). But applying them blindly means risking the exact failures described above.

Resource Exhaustion

OpenClaw’s Docker image requires 2 GB RAM minimum just for the build step. Running the agent in production with browser automation can push past that easily. We’ve seen instances where accumulated disabled cron jobs and session logs quietly filled the disk until the gateway had nowhere to write and crashed.

What We Actually Monitor

Most monitoring guides for OpenClaw cover the basics: is the gateway process running, is the port responding. Fiddler AI identified five observability gaps that are specific to AI agents: multi-agent coordination failures, channel behavioral inconsistency, tool execution without accountability, cost attribution across sessions, and behavioral drift from model updates. We hit all five in our first month of operation.

Here’s what we track beyond basic uptime:

| What we monitor | Why it matters | What most self-hosted setups miss |
| --- | --- | --- |
| Heartbeat freshness | If the last heartbeat is stale, the gateway is likely stuck or crashed | Often not checked at all |
| Channel connectivity per platform | WhatsApp, Telegram, Slack, Discord each have different failure modes | Usually just “is the process running” |
| Cron job completion rates | A cron that ran but produced empty output is a silent failure | Cron is either on or off, no output validation |
| Token spend drift | A sudden spike means a skill is looping or a prompt is broken | Not tracked until the bill arrives |
| Skill execution success rates | A skill that consistently returns errors or empty results is quietly failing | No visibility into individual skill health |
| Disk usage trajectory | 20 GB fills faster than you’d expect with session logs and media files | Checked manually, if at all |

The OpenClaw CLI provides the building blocks: openclaw status --all, openclaw gateway probe, openclaw channels status --probe, openclaw cron status, openclaw doctor. These commands are designed for interactive debugging. The gap is automating them across a fleet and acting on the results without human intervention.
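
In its simplest form, that automation is just running the same commands on a schedule and treating any failure as a signal. Something like the sketch below, which is not Clawbert: the host names are placeholders, and it assumes the commands exit non-zero on failure, which is worth verifying against your OpenClaw version.

```python
"""Run the diagnostic commands non-interactively across a fleet over SSH."""
import subprocess

HOSTS = ["claw-001.internal", "claw-002.internal"]   # placeholder host names
CHECKS = [
    ["openclaw", "gateway", "probe"],
    ["openclaw", "channels", "status", "--probe"],
    ["openclaw", "cron", "status"],
]

def run_check(host: str, cmd: list[str]) -> bool:
    # Short timeouts so one hung gateway doesn't stall the whole sweep.
    try:
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, *cmd],
            capture_output=True, timeout=60,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    for host in HOSTS:
        for cmd in CHECKS:
            if not run_check(host, cmd):
                print(f"ALERT: {host}: {' '.join(cmd)} failed")
```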

How We Handle This at Klaus

We built Clawbert, our automated SRE. It checks every instance every few minutes — gateway health, disk, memory, channel connectivity, token spend. When it detects a problem, it triggers the troubleshooter: an automated repair system that diagnoses the failure, attempts to fix it, and reports what it found. Customers can also trigger the troubleshooter manually from the Klaus Health page and see past runs there.

Think of it as detect-then-repair: Clawbert is the continuous sweep, the troubleshooter is the response. Between them, the common failure modes from the previous section get caught:

  • Gateway crashes: the troubleshooter detects the failed health check and attempts a restart.
  • Resource exhaustion: disk and memory checks flag pressure before the gateway runs out of space to write.
  • Channel disconnects: regular checks catch disconnections. WhatsApp sessions expire every 30 days, which is a known cycle we handle proactively.
  • Cost anomalies: Clawbert flags unusual token spend so we can investigate before a customer gets a surprise bill.
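
In miniature, that loop looks something like this. None of the functions below are Klaus or OpenClaw APIs; the gateway port and disk threshold are stand-ins for whatever your own monitoring exposes.

```python
"""The detect-then-repair loop in miniature. Sketch only."""
import shutil
import socket

GATEWAY_PORT = 18789          # assumed: whatever port your gateway listens on
DISK_ALERT_FRACTION = 0.9     # assumed threshold for "running out of room"

def gateway_healthy(host: str, port: int = GATEWAY_PORT) -> bool:
    # Cheapest liveness signal: can we open a TCP connection at all?
    try:
        with socket.create_connection((host, port), timeout=3):
            return True
    except OSError:
        return False

def disk_under_pressure(path: str = "/") -> bool:
    # In practice this check runs on the instance itself, not the monitor.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= DISK_ALERT_FRACTION

def trigger_troubleshooter(instance: str, reason: str) -> None:
    print(f"[{instance}] troubleshooter triggered: {reason}")  # placeholder

def sweep(instances: list[str]) -> None:
    for instance in instances:
        if not gateway_healthy(instance):
            trigger_troubleshooter(instance, "gateway health check failed")
        if disk_under_pressure():
            trigger_troubleshooter(instance, "disk usage above threshold")
```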

What these tools don’t catch well: intermittent issues that resolve between check intervals, silent cron failures where the job runs but produces empty output, and behavioral drift from model updates. Those still rely on customer feedback or manual investigation.

The Update Problem at Scale

Self-hosted users face the update treadmill one instance at a time. At scale, the problem is the same but the consequences multiply. If a bad update breaks 500 instances, that’s 500 customers whose agents stopped working.

We don’t run latest. Each customer instance is pinned to a specific version. We test updates internally before rolling them out to customers. The troubleshooter catches health check failures after updates, which gives us a safety net when something goes wrong.

The math from TryOpenClaw’s analysis: at the current release pace, a self-hosted user spends roughly 20 minutes per update, or about 5 hours per month just on updates. The worst documented case was 7 hours for a single failed upgrade requiring 16 manual interventions. At scale, that maintenance burden is the reason managed hosting exists.

What We Still Get Wrong

If this article only covered what works, you should be suspicious.

Intermittent disconnects. A WhatsApp channel drops and reconnects between regular checks. We miss it. The customer notices a gap in their message history. More frequent checks would catch more of these, but increase load on every instance.

Behavioral drift. When a model provider updates their model, the agent’s behavior changes in ways that are hard to detect automatically. It might start formatting outputs differently, or handle edge cases worse. We don’t have a reliable way to catch this without customer feedback. This is the observability gap that the entire AI agent industry is still solving.

Alert fatigue. Early on, we set thresholds too aggressively. Every minor fluctuation in token spend generated an alert. Within a week we were ignoring alerts. We’ve since calibrated the thresholds, but it’s a constant tuning process. Too sensitive and you stop reading alerts. Too relaxed and you miss the early signs of a real problem.
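
For token spend specifically, what helped was comparing against a rolling baseline rather than a fixed number, so ordinary fluctuation stops paging anyone. A simplified version of the idea; the window and multiplier are illustrative, not our production values.

```python
"""Spend drift against a rolling baseline instead of a fixed threshold."""
from statistics import median

WINDOW = 14        # days of history used as the baseline
MULTIPLIER = 2.5   # how far above baseline counts as a spike

def spend_is_anomalous(history: list[float], today: float) -> bool:
    recent = history[-WINDOW:]
    if len(recent) < 7:            # not enough history to judge yet
        return False
    baseline = median(recent)      # median shrugs off one-off spikes in history
    return today > baseline * MULTIPLIER

# Example: steady ~$4/day, then a skill starts looping
history = [3.8, 4.1, 4.0, 4.3, 3.9, 4.2, 4.0, 4.1, 3.7, 4.4, 4.0, 4.2, 4.1, 3.9]
print(spend_is_anomalous(history, today=11.5))   # True, worth a look
```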

Skills that work 95% of the time. A skill that succeeds on 19 out of 20 inputs looks healthy in aggregate monitoring. The 1-in-20 failure that happens to hit a customer’s important use case is invisible until they report it. Per-execution monitoring at that granularity is possible but expensive in terms of context overhead.

Frequently Asked Questions

How quickly does Klaus detect a problem?

Clawbert checks instances every few minutes. The troubleshooter runs automatically when an instance fails a health check. For gateway crashes, detection depends on when the next health check runs. For issues you notice first, you can trigger the troubleshooter manually from the Klaus Health page.

What happens if my Klaus instance goes down at 3 AM?

The troubleshooter detects the failed health check and attempts to fix it automatically. It checks gateway health, disk space, memory, and service status. For known failure modes, automated recovery happens without human intervention. For unknown failures, we get an alert and investigate.

Can I see my instance’s health status?

Yes. The Klaus Health page shows past troubleshooter runs and your instance health status. You can also trigger a troubleshoot manually from there if something seems wrong.

How does this compare to monitoring a self-hosted instance myself?

Self-hosted monitoring is entirely your responsibility. The OpenClaw CLI provides diagnostic commands (openclaw doctor, openclaw status --all, openclaw channels status --probe), but there’s no built-in alerting or automated recovery. You’d need to set up your own monitoring stack. LumaDock has a thorough guide on building a monitoring setup with Prometheus, Grafana, and Alertmanager. Expect to spend 2-4 weeks getting it production-ready.
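
If you go that route, the glue between the CLI and Prometheus can be small. A minimal sketch, assuming the prometheus_client package and that openclaw gateway probe exits non-zero when the gateway is unhealthy; verify both before relying on it.

```python
"""Minimal exporter gluing the OpenClaw CLI to Prometheus. Sketch only."""
import subprocess
import time

from prometheus_client import Gauge, start_http_server

gateway_up = Gauge("openclaw_gateway_up", "1 if the gateway probe succeeds")

def probe() -> int:
    try:
        result = subprocess.run(["openclaw", "gateway", "probe"],
                                capture_output=True, timeout=30)
        return 1 if result.returncode == 0 else 0
    except (OSError, subprocess.TimeoutExpired):
        return 0

if __name__ == "__main__":
    start_http_server(9101)            # scrape http://<host>:9101/metrics
    while True:
        gateway_up.set(probe())
        time.sleep(60)
```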

What’s the most common failure you see?

Channel disconnects. WhatsApp in particular has session expiration issues that cause periodic reconnection cycles. Gateway crashes from config issues are second. Silent cron failures are third in frequency but arguably the most damaging because they go unnoticed the longest.

Key Takeaways

  • The failures that matter in production OpenClaw are silent, not loud. Channel disconnects, empty cron outputs, and stale heartbeats are harder to detect than crashes.
  • Config validation gaps in OpenClaw can cause crash-loops with no backoff. A single unrecognized key crashed one instance 377 times before detection.
  • OpenClaw ships multiple releases per week, many with breaking changes. Version pinning is necessary at any scale beyond one instance.
  • Klaus uses Clawbert for fleet-wide checks every few minutes, which triggers an automated troubleshooter on failures. Between them, most common failure modes get caught and fixed automatically.
  • The monitoring overhead is the hidden cost of self-hosting. Expect 5+ hours per month on updates alone, plus time building and maintaining your monitoring stack.
  • We still miss intermittent disconnects, behavioral drift from model updates, and the 5% failure rate in skills that look healthy in aggregate.

On Klaus, Clawbert handles the monitoring so you don’t have to. If reliability matters more than infrastructure control, sign up at klausai.com.

For more on the managed vs self-hosted decision, see OpenClaw Hosting: Managed vs Self-Hosted. For the Docker self-hosting path, see Running OpenClaw in Docker. For how scheduled tasks work, see OpenClaw Automation: Heartbeat, Scheduled Tasks, and What Runs Without You.

Sources