Closed-loop reliability · Versioned operations

Closed-loop OpenClaw and versioned hotfixes

Cheese Cat is not a static brochure: it sits on a long-running autonomous stack where the gateway, schedulers, retrieval, and a self-hosted LLM co-evolve in one loop. Global npm upgrades replace files under `dist/`, so any patch to the compiled runtime must be re-verified after each release. `HOTFIX-PLAYBOOK.md` is the single source of truth for that process and is versioned alongside the published snapshots.

Role of the playbook

It records behaviors that must hold under this deployment's operating assumptions: audit semantics, streaming usage, gateway handshake timeouts, scheduler timeouts, web-search throttling, and local state repairs. After an upgrade, read the playbook first, run the bundled check/apply scripts, then validate gateway and cron paths with the standard commands.

  • Separates patches inside the npm tree (likely overwritten) from config/state fixes to reduce false positives.
  • Documents version-aware expectations as upstream bundle names and strings shift.
  • Aligns with automation so `--check` / `--apply` outcomes are auditable and replayable.
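
The `--check` / `--apply` split described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the playbook's actual scripts: the `Patch` record, the state names, and the JSON-lines log are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Patch:
    """One string replacement in a compiled file (all fields are hypothetical)."""
    name: str
    target: Path
    original: str  # text as shipped by upstream
    patched: str   # text this deployment expects


def check(patch: Patch) -> str:
    """Report the patch state without modifying anything."""
    if not patch.target.exists():
        return "missing-file"
    text = patch.target.read_text()
    if patch.patched in text:
        return "applied"
    if patch.original in text:
        return "pending"
    return "unknown-layout"  # upstream changed; the patch needs review


def apply(patch: Patch) -> str:
    """Apply the patch only when the upstream text is present; safe to re-run."""
    state = check(patch)
    if state == "pending":
        text = patch.target.read_text()
        patch.target.write_text(text.replace(patch.original, patch.patched, 1))
        state = "applied"
    return state


def run(patches, do_apply: bool, log_path: Path) -> list:
    """Check or apply every patch and append one JSON line per result."""
    results = []
    with log_path.open("a") as log:
        for p in patches:
            state = apply(p) if do_apply else check(p)
            record = {"patch": p.name, "mode": "apply" if do_apply else "check", "state": state}
            log.write(json.dumps(record) + "\n")
            results.append(record)
    return results
```

Because `apply` re-checks before writing, running it twice leaves the file unchanged the second time, and the appended log gives a replayable record of each run.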

Why every new OpenClaw version matters for hotfixes

  • `npm i -g` replaces the global install tree; local patches that are not upstreamed are not durable.
  • The same symptom may map to different files across releases (e.g. handshake constants moving between `gateway-cli`, `client`, or `method-scopes`).
  • A closed-loop stack needs schedulers, tools, and gateway RPCs to work immediately after an upgrade, without relying on an operator remembering ad-hoc fixes.
  • The public site narrates the system's evolution; if the backend goes quiet after an upgrade, story and reality diverge, so operations must be verifiable and logged.
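
Because the same constant can move between bundles, a check can search by content instead of relying on a fixed path. The sketch below assumes compiled files are `.js` files under a `dist/` directory whose names start with the bundle name; the function and the marker string used in the usage note are hypothetical.

```python
from pathlib import Path


def locate_marker(dist_dir: Path, prefixes: list, marker: str) -> list:
    """Return compiled files whose name starts with a candidate prefix
    and whose text contains the marker string."""
    hits = []
    for path in sorted(dist_dir.rglob("*.js")):
        if any(path.name.startswith(prefix) for prefix in prefixes):
            if marker in path.read_text(errors="ignore"):
                hits.append(path)
    return hits
```

A call such as `locate_marker(dist, ["gateway-cli", "client", "method-scopes"], "HANDSHAKE_TIMEOUT_MS")` would then report where the constant lives in the current release. An empty result signals that the expected string itself has changed and the playbook entry needs review.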

Post-upgrade workflow (summary)

Follow the playbook for the full sequence; this is only the availability-critical spine.

  • Preview and upgrade (e.g. `openclaw update --dry-run`, then `npm i -g openclaw@latest`).
  • Re-apply package hotfixes (`openclaw-post-update-hotfix.sh --apply`) and confirm systemd drop-ins and env files.
  • Reload and restart user services such as `openclaw-gateway` and `openclaw-node`.
  • Close with `openclaw --version`, `cron status`, a manual `cron run`, and `security audit` as appropriate.
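
The spine above can be written down as an ordered, fail-fast runner. This is a sketch only: the commands are taken from the summary, but the exact sub-command spelling (for example `openclaw cron status`), the hotfix script being on `PATH`, and the use of `systemctl --user` are assumptions to verify against the playbook. The manual `cron run` is left out because it needs a job argument that this summary does not specify.

```python
import shlex
import subprocess

# Ordered steps from the summary; spellings are assumptions, see the lead-in.
STEPS = [
    ("preview", ["openclaw", "update", "--dry-run"]),
    ("upgrade", ["npm", "i", "-g", "openclaw@latest"]),
    ("hotfix", ["openclaw-post-update-hotfix.sh", "--apply"]),
    ("reload", ["systemctl", "--user", "daemon-reload"]),
    ("restart", ["systemctl", "--user", "restart", "openclaw-gateway", "openclaw-node"]),
    ("version", ["openclaw", "--version"]),
    ("cron-status", ["openclaw", "cron", "status"]),
    ("audit", ["openclaw", "security", "audit"]),
]


def run_steps(steps, execute: bool = False) -> list:
    """Run steps in order and stop at the first failure.

    With execute=False (the default) the commands are only printed.
    """
    done = []
    for name, argv in steps:
        print(f"[{name}] {shlex.join(argv)}")
        if execute:
            result = subprocess.run(argv)
            if result.returncode != 0:
                raise RuntimeError(f"step {name!r} failed with exit code {result.returncode}")
        done.append(name)
    return done
```

Calling `run_steps(STEPS)` prints the plan without touching the system; `run_steps(STEPS, execute=True)` runs it and raises on the first non-zero exit code, so a failed hotfix step never leads into a restart.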

Why publish versioned assets

  • `latest/` holds the last verified snapshot; `versions/<openclaw-version>/` pins a reproducible set for that upstream release.
  • The manifest records `hotfix_version` and `updated_at` for cross-checking local vs public trees.
  • One-way publish lets outsiders see what was applied and what “good” looks like without exposing operator secrets.
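
Cross-checking the local and public trees can be as small as comparing the two manifest fields named above. The sketch assumes the manifest is a JSON object; the file format and the helper names are assumptions, and only `hotfix_version` and `updated_at` come from this page.

```python
import json
from pathlib import Path

FIELDS = ("hotfix_version", "updated_at")


def load_manifest(path: Path) -> dict:
    """Read a manifest file, assumed here to be a single JSON object."""
    return json.loads(path.read_text())


def manifest_drift(local: dict, public: dict) -> dict:
    """Return {field: (local_value, public_value)} for each tracked field that differs."""
    return {
        field: (local.get(field), public.get(field))
        for field in FIELDS
        if local.get(field) != public.get(field)
    }
```

An empty result means the two trees agree on both fields; any entry shows which side is behind.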

Links

These mirror local `workspace/HOTFIX-PLAYBOOK*` and scripts; edge cases live in the documents.

  • GitHub: `KitJacky/openclaw-hotfix` — versioned snapshots and history.
  • Docs: `latest/HOTFIX-PLAYBOOK.md`, `latest/HOTFIX-PLAYBOOK.zh-TW.md`.

Catalog: purpose of each fix

Grouped by playbook category. Paths, expected strings, and branches change with upstream—always defer to the playbook and scripts validated for that cycle.

A. Inside the npm package (overwritten on upgrade)

Small-model audit severity downgrade
Default audit flags small model + non-sandbox + web tools as critical; this deployment intentionally accepts that posture in a closed, single-operator setting. The patch lowers severity to info so automation and dashboards are not stuck in a permanent false-critical state.
Streaming usage (`include_usage`)
Ensures usage stats are returned on streaming responses and forces inclusion even when compat profiles default `supportsUsageInStreaming` off. Keeps token accounting and cost signals complete on streaming paths.
cron.run timeout and LLM idle / thinking guards
Long contexts or slow local token generation can exceed default gateway or idle limits and kill scheduled work. The patch raises timeouts (e.g. toward 15 minutes) and sets `idleTimeoutSeconds` and a default `thinking` setting for autonomous cron, so research loops or long reasoning are not mistaken for hangs.
Closed-system audit downgrade set
For an intentional loopback, single-operator, full-exec, weak-model posture, selected checks are downgraded from warn to info. Audit output then matches operational reality and the findings stay visible, without the false alarms that generic multi-tenant assumptions would raise.
CLI → gateway RPC config injection (version-aware)
Older builds could call the gateway without loaded config, yielding websocket normal-closure failures on `openclaw cron run`. Newer bundles may load config inside `call-*.js`. This family of fixes ensures CLI entrypoints pass the right context so manual runs and debugging stay reliable.
web_search provider fallback and cooldown
High-frequency cron research can hit Brave 429s. Upstream may add a Tavily fallback, but this host also adds per-provider cooldown queues and env-based throttling to avoid hard failures and keep the research loop able to issue queries.
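
The provider fallback with cooldown from the last entry can be illustrated in a few lines. This is a simplified sketch, not the patched code: it tracks one cooldown deadline per provider instead of a queue, and the class names, the `RateLimited` exception, and the provider callables are hypothetical.

```python
import time


class RateLimited(Exception):
    """Raised by a search callable when the provider answers HTTP 429."""


class ProviderCooldown:
    """Track a per-provider cooldown deadline after rate-limit responses."""

    def __init__(self, cooldown_seconds: float, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self._blocked_until = {}

    def available(self, provider: str) -> bool:
        return self.clock() >= self._blocked_until.get(provider, float("-inf"))

    def penalize(self, provider: str) -> None:
        self._blocked_until[provider] = self.clock() + self.cooldown_seconds


def search_with_fallback(query, providers, cooldown):
    """Try providers in order, skipping any in cooldown.

    Returns (provider_name, result), or None when every provider is
    cooling down or rate-limited.
    """
    for name, search in providers:
        if not cooldown.available(name):
            continue
        try:
            return name, search(query)
        except RateLimited:
            cooldown.penalize(name)
    return None
```

Returning `None` instead of raising lets a scheduled job record "no provider available" and retry on its next cycle, so a burst of 429s does not fail the whole run.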

B. Gateway runtime, handshake, and timeouts

Gateway WebSocket handshake and CLI timeouts
Some hosts see handshake timeouts, `gateway closed (1000/1006)`, or failures around interface enumeration. Patches raise default handshake/CLI timeouts, unify env precedence, and pair with service restarts so `gateway --help`, `cron status`, and manual `cron run` remain dependable health checks.
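
Unified env precedence can be pictured as one resolver shared by every caller. The order used here (environment, then config, then built-in default) is an assumption, as are the variable name `OPENCLAW_HANDSHAKE_TIMEOUT_MS` and the config key `handshakeTimeoutMs`; the playbook defines the real names and order.

```python
def resolve_timeout_ms(env: dict, config: dict, default_ms: int,
                       env_key: str = "OPENCLAW_HANDSHAKE_TIMEOUT_MS",
                       config_key: str = "handshakeTimeoutMs") -> int:
    """Resolve one timeout with a fixed precedence: env, then config, then default.

    Both key names are hypothetical placeholders.
    """
    for source, key in ((env, env_key), (config, config_key)):
        raw = source.get(key)
        if raw is None or raw == "":
            continue
        try:
            value = int(raw)
        except (TypeError, ValueError):
            continue  # ignore malformed values and fall through to the next source
        if value > 0:
            return value
    return default_ms
```

In practice `env` would be `os.environ`. Routing every CLI and gateway path through one resolver like this avoids the situation where two entrypoints disagree about which setting wins.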

C. Config and scheduled jobs (mostly under data dirs)

Three-day blog analysis: delivery suppression
A job could finish analysis yet fail overall because delivery targeted an unsupported channel. Setting delivery to none decouples “work completed” from “outbound channel supported” so scheduler health is not misread.
State directory permissions (700)
Newer audits warn when the state tree is world-readable; tightening `~/.openclaw` to owner-only matches least-privilege baselines.
Loopback trusted proxies
With `gateway.bind` loopback and local Control UI, set `trustedProxies` to 127.0.0.1/::1 to silence inappropriate reverse-proxy warnings and keep audits focused on real risks.
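
The owner-only (700) rule for the state directory is easy to verify mechanically. A minimal sketch using only the Python standard library; the helper names are hypothetical, and it covers only the directory itself, not the files inside it.

```python
import os
import stat
from pathlib import Path


def is_owner_only(path: Path) -> bool:
    """True when neither group nor others have any permission bits on path."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & 0o077) == 0


def tighten(path: Path) -> bool:
    """Set the directory to 0o700 if needed; return True when a change was made."""
    if is_owner_only(path):
        return False
    os.chmod(path, 0o700)
    return True
```

For example, `tighten(Path.home() / ".openclaw")` returns `True` the first time it has to change the mode and `False` on later runs, which makes it suitable for a repeatable post-upgrade check.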

D. Local state and extension edge cases

Device auth scope repair
After upgrade, doctor, or re-pairing, paired-device metadata and `device-auth.json` operator scopes can drift, causing errors like `missing scope: operator.read`. Aligning scopes and restarting the gateway restores CLI/gateway consistency.
Telegram extension setup-entry compatibility
Some bundles shipped broken module specifiers for Telegram setup; either patching them or accepting the upstream layout restores config load and gateway-backed commands for that release line.

These choices reflect a closed-system threat model. They are not universal hardening advice for multi-tenant or internet-exposed deployments.
