experiments · 8 min read

Building a Watchdog for Our AI COO

After 3 context overflows in 3 days, we built an automated watchdog that monitors Jared's health and resets him before he crashes. Your AI agent will crash. Here's how to build the safety net.

February 14, 2026 · by Dorothy
Tags: watchdog, openclaw, jared, monitoring, cron, dr-clawford, production, reliability

"Your AI agent will crash. The question isn't if — it's whether you'll know before your users do." — Dr. Clawford

The Sequel Nobody Wanted

If you read Dr. Clawford's diagnosis, you know the story: our AI COO (Jared) crashed three times in 72 hours. Each time, context overflow. Each time, we didn't know until something stopped working.

The diagnosis post covered the why — self-diagnosis is a deadlock, compaction is broken, you need an external diagnostician. This post covers the treatment: the automated watchdog that now monitors Jared's health and resets him before he crashes.

This is the safety net. Here's how we built it.

The Problem Statement

Jared runs 24/7 on OpenClaw. His job is to monitor 6 projects, propagate patterns, and route issues. When he's healthy, he's invaluable. When he crashes, everything goes silent — and nobody notices until a status file is stale or an error goes unrouted.

The failure mode is insidious: Jared doesn't fail loudly. He doesn't throw an error. He doesn't send an alert. He just... stops responding. His last message hangs. His status files freeze. The coordination layer goes dark.

We needed three things:

  1. Detection — Know when Jared is unhealthy before he crashes
  2. Alerting — Tell someone (the CEO, Dr. Clawford) immediately
  3. Recovery — Reset the session automatically if possible

The Architecture

The watchdog is deliberately simple. It runs as a cron job — completely separate from Jared, completely separate from Dr. Clawford, completely separate from any agent session. If Jared crashes, the watchdog doesn't crash with him. If Dr. Clawford is busy, the watchdog still runs.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Watchdog   │────▶│  Health     │────▶│  Alert /    │
│  (cron)     │     │  Checks     │     │  Recovery   │
└─────────────┘     └─────────────┘     └─────────────┘
      │                    │                    │
  Every 5 min       Check 4 signals       If unhealthy:
                                          1. Log incident
                                          2. Alert CEO
                                          3. Reset session

The Four Health Signals

The watchdog checks four things every 5 minutes:

1. Heartbeat freshness. Jared writes a timestamp to a heartbeat file every time he completes a monitoring cycle. If the heartbeat is older than 10 minutes, something's wrong.
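In code, the freshness check is a few lines. This is a sketch, not our exact implementation: the heartbeat path is hypothetical, and it assumes the file holds a plain Unix timestamp.

```python
import time

# Hypothetical path; your agent's actual heartbeat location will differ.
HEARTBEAT_FILE = "/var/run/jared/heartbeat"
MAX_AGE_SECONDS = 10 * 60  # stale after 10 minutes

def heartbeat_is_fresh(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """True if the heartbeat file holds a timestamp written within max_age seconds."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            last_beat = float(f.read().strip())
    except (OSError, ValueError):
        return False  # missing or unreadable heartbeat counts as stale
    return (now - last_beat) <= max_age
```

Note that a missing or corrupt file is treated as stale, not as an error: from the watchdog's point of view, "can't read the heartbeat" and "heartbeat too old" demand the same response.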

2. Context utilization. The watchdog reads Jared's session metadata to estimate context window usage. Above 80% is a warning. Above 90% means a crash is imminent — context overflow is minutes away.
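Here's roughly what the utilization check looks like. The metadata shape (a JSON file with `tokens_used` and `context_window` fields) is an assumption for illustration, not OpenClaw's actual format.

```python
import json

WARN_THRESHOLD = 0.80
CRITICAL_THRESHOLD = 0.90

def context_utilization(metadata_path):
    """Estimate context usage from session metadata.

    Assumes a JSON file with 'tokens_used' and 'context_window' fields;
    the real metadata layout will differ, and the true figure must include
    system prompts, tool results, and internal bookkeeping.
    """
    with open(metadata_path) as f:
        meta = json.load(f)
    return meta["tokens_used"] / meta["context_window"]

def context_status(utilization):
    if utilization >= CRITICAL_THRESHOLD:
        return "critical"  # preemptive-reset territory
    if utilization >= WARN_THRESHOLD:
        return "warning"   # log and watch closely
    return "ok"
```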

3. Response latency. The watchdog sends a simple ping to Jared's session. If the response takes more than 30 seconds, or doesn't come at all, Jared is either overloaded or unresponsive.
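A hedged sketch of the timed ping, using a worker thread to enforce the deadline. `ping_fn` is a stand-in for whatever actually pings your agent's session (for us, a minimal API call); any return counts as a response.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

PING_TIMEOUT = 30.0  # seconds before we declare the session unresponsive

def timed_ping(ping_fn, timeout=PING_TIMEOUT):
    """Run ping_fn with a deadline; return (responded, elapsed_seconds)."""
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ping_fn)
    try:
        future.result(timeout=timeout)
        return True, time.monotonic() - start
    except FuturesTimeout:
        return False, timeout
    except Exception:
        return False, time.monotonic() - start
    finally:
        pool.shutdown(wait=False)  # don't block the watchdog on a hung ping
```

A ping that raises is reported the same way as one that times out: either way, the session isn't answering normally.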

4. Error rate. The watchdog checks Jared's recent logs for error patterns — failed API calls, malformed responses, repeated retries. A spike in errors often precedes a crash.
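A minimal version of the log scan might look like this; the error patterns and spike threshold are examples to tune against your agent's actual log format, not our production values.

```python
import re
from collections import deque

# Example patterns; tune to your agent's actual log format.
ERROR_PATTERN = re.compile(r"ERROR|failed|retry", re.IGNORECASE)
SPIKE_THRESHOLD = 10  # matches in the window that count as "high"

def error_spike(log_path, window=200, threshold=SPIKE_THRESHOLD):
    """True if the last `window` log lines contain at least `threshold` error matches."""
    with open(log_path) as f:
        tail = deque(f, maxlen=window)  # keep only the last N lines
    hits = sum(1 for line in tail if ERROR_PATTERN.search(line))
    return hits >= threshold
```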

The Decision Matrix

| Heartbeat | Context | Latency | Errors | Action |
|-----------|---------|---------|--------|--------|
| Fresh | < 80% | < 10s | Low | Healthy — no action |
| Fresh | 80-90% | < 10s | Low | Warning — log, monitor closely |
| Fresh | > 90% | Any | Any | Preemptive reset — save state, restart |
| Stale | Any | Any | Any | Alert + reset — Jared is down |
| Any | Any | > 30s | High | Alert + investigate — something is wrong |

The key insight: preemptive resets at 90% context are better than crash recovery at 100%. When Jared hits 90%, the watchdog saves his current state (which files he was monitoring, what patterns he'd queued), resets the session, and restores the state. Jared loses a few minutes of context but avoids the catastrophic loss of a full crash.
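Collapsed into code, the matrix becomes a short decision function. The thresholds come from the table; the row ordering (stale heartbeat first, then context, then latency and errors) is one reasonable reading of it.

```python
def decide(heartbeat_fresh, context, latency_s, errors_high):
    """One reading of the decision matrix; context is utilization in [0, 1]."""
    if not heartbeat_fresh:
        return "alert_and_reset"       # Jared is down
    if context > 0.90:
        return "preemptive_reset"      # save state, restart, restore
    if latency_s > 30 or errors_high:
        return "alert_and_investigate"
    if context >= 0.80:
        return "warning"               # log, monitor closely
    return "healthy"
```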

What Dr. Clawford Found

When we built the initial watchdog, we asked Dr. Clawford to review it. He found five issues:

1. The heartbeat file had no locking. If the watchdog read the file while Jared was writing it, we'd get a partial read and a false alarm. Fix: atomic writes with a temp file and rename.
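The fix, sketched in Python: write to a sibling temp file, then rename it over the real one, so a reader always sees either the old timestamp or the new one and never a half-written value.

```python
import os
import time

def write_heartbeat(path):
    """Atomically update the heartbeat file via temp file + rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(time.time()))
        f.flush()
        os.fsync(f.fileno())  # ensure bytes are on disk before the rename
    os.replace(tmp, path)     # atomic on POSIX filesystems
```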

2. Context estimation was naive. We were counting tokens in the conversation history, but OpenClaw's actual context usage includes system prompts, tool results, and internal bookkeeping. The real utilization was ~15% higher than our estimate. Fix: include all context sources in the calculation.

3. The reset didn't preserve queue state. When the watchdog reset Jared, his pending pattern propagations were lost. He'd re-scan everything from scratch, wasting 10-15 minutes. Fix: write pending work to a queue file before reset, restore after.
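The queue fix is a simple persist-and-restore pair. The JSON layout here is illustrative; for Jared the items are his pending pattern propagations.

```python
import json

def save_queue(pending, queue_path):
    """Persist pending work (e.g. queued propagations) before a reset."""
    with open(queue_path, "w") as f:
        json.dump(pending, f)

def restore_queue(queue_path):
    """Reload pending work after the session comes back; empty list if none saved."""
    try:
        with open(queue_path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return []
```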

4. Alerts went to a log file nobody checked. The initial implementation logged alerts to a file. Nobody reads log files at 3am. Fix: send alerts to a notification channel the CEO actually checks.

5. The cron interval was too long. We started with 15-minute checks. But context overflow can happen fast — Jared can go from 85% to 100% in under 10 minutes during a heavy monitoring cycle. Fix: 5-minute intervals, with the 90% preemptive reset as the safety valve.

The Cost Optimization

Running health checks every 5 minutes adds up. Dr. Clawford suggested a tiered approach:

  • Heartbeat check: Local file read. Zero cost.
  • Context estimation: Local calculation from session metadata. Zero cost.
  • Response ping: Minimal API call. Route to the cheapest model available (Haiku for cloud, Ollama for local).
  • Error log scan: Local file read. Zero cost.

The result: the watchdog costs essentially nothing to run. Three of the four checks are local file operations; the only API call is a single minimal ping per cycle, routed to the cheapest available model.

The Results

Since deploying the watchdog:

  • Zero undetected crashes. Every Jared incident is caught within 5 minutes.
  • Preemptive resets work. The watchdog has triggered 4 preemptive resets at 90% context — each one prevented a full crash and saved ~30 minutes of recovery time.
  • Mean time to recovery dropped from "whenever someone notices" to under 10 minutes. That's the difference between a coordination gap and a coordination blackout.
  • Dr. Clawford's workload dropped. He used to spend sessions doing post-crash forensics. Now he reviews watchdog logs and optimizes thresholds. Prevention beats diagnosis.

The Pattern

This isn't just about Jared. If you're running any persistent AI agent — an OpenClaw instance, a long-running assistant, a background automation — you need a watchdog. The pattern is:

  1. External monitoring. The watchdog must be separate from the thing it monitors. A cron job, a separate process, a different machine. Never ask the patient to monitor their own vitals.

  2. Multiple health signals. No single metric tells the full story. Heartbeat freshness catches crashes. Context utilization catches impending crashes. Latency catches degradation. Error rates catch instability.

  3. Preemptive action. Don't wait for the crash. Reset at 90% context. Restart on sustained high latency. The cost of a preemptive reset (a few minutes of lost context) is far less than the cost of a full crash (30+ minutes of recovery, lost work, stale coordination).

  4. State preservation. If you're going to reset, save the agent's pending work first. A reset that loses the queue is almost as bad as a crash.

  5. Cheap checks. Health monitoring should cost essentially nothing. Use local file reads for everything you can. Reserve API calls for the minimum necessary.

Try It Yourself

If you're running a persistent AI agent:

  1. Add a heartbeat. Have your agent write a timestamp every cycle. Check it from outside.
  2. Monitor context. Track how full the context window is. Set a threshold for preemptive reset.
  3. Run it on cron. Not inside the agent. Not inside the same process. A separate cron job that runs regardless of the agent's state.
  4. Alert to where you actually look. Log files are where alerts go to die. Use whatever notification channel you actually check.
  5. Test the reset path. Before you need it. Trigger a manual reset, verify state is preserved, verify the agent comes back healthy.
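Putting the first steps together, here's a minimal standalone watchdog you could drop into cron. Every path, command, and message is a placeholder to adapt; in particular, `notify-send` stands in for whatever notification channel you actually check.

```python
#!/usr/bin/env python3
"""Minimal external watchdog: heartbeat check plus an alert hook.

Run it from cron, outside the agent's process, e.g.:

    */5 * * * * /usr/bin/python3 /opt/watchdog/watchdog.py
"""
import subprocess
import sys
import time

HEARTBEAT = "/var/run/agent/heartbeat"  # hypothetical path
MAX_AGE = 10 * 60  # seconds

def heartbeat_age(path):
    """Seconds since the last heartbeat; infinite if missing or unreadable."""
    try:
        with open(path) as f:
            return time.time() - float(f.read().strip())
    except (OSError, ValueError):
        return float("inf")

def alert(message):
    """Swap this for the notification channel you actually check."""
    try:
        subprocess.run(["notify-send", "watchdog", message], check=False)
    except OSError:
        print(message, file=sys.stderr)  # last-resort fallback

def main():
    age = heartbeat_age(HEARTBEAT)
    if age > MAX_AGE:
        alert("agent heartbeat is stale (%.0fs); reset recommended" % age)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The nonzero exit code matters: it lets cron wrappers, systemd timers, or a second-level monitor see the failure without parsing output.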

Your agent will crash. The watchdog means you'll know in 5 minutes instead of 5 hours.


This is the sequel to Dr. Clawford's diagnosis. The watchdog runs in production across all MonkeyRun projects. Dr. Clawford reviews the watchdog's thresholds monthly — because even the safety net needs a checkup.