Product Update

Seaboss now watches itself: 13 self-monitoring signals

azmat · 2 min read

A managed platform that doesn't watch itself isn't really managed. As of RASF-718 Phase 2, the Seaboss management server runs an ops-watchdog daemon with 13 active monitoring signals, each one looking for a specific failure mode that's historically caused us pain.

What it watches

The signals split into three rough categories.

Tenant-fleet health:

  • Tenant containers in the unhealthy state for >10 minutes.
  • Sidecar sync staleness (no usage events from a tenant in >2 hours).
  • Disk pressure on the management server itself.
  • TLS cert expiry approaching (Let's Encrypt < 14 days).
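As a rough illustration, the per-tenant thresholds above can be expressed as plain time comparisons. This is a minimal sketch, not the daemon's actual code: the TenantSnapshot shape, the signal names, and the firedSignals function are all hypothetical, and only the three thresholds (10 minutes, 2 hours, 14 days) come from the list. Disk pressure is left out because no threshold is stated for it.

```typescript
// Hypothetical snapshot shape; the real daemon's data model is not shown in this post.
interface TenantSnapshot {
  tenantId: string;
  unhealthySince: Date | null; // null while the container is healthy
  lastUsageEventAt: Date;
  certExpiresAt: Date;
}

const MINUTE_MS = 60 * 1000;
const HOUR_MS = 60 * MINUTE_MS;
const DAY_MS = 24 * HOUR_MS;

// Thresholds taken from the list above.
const UNHEALTHY_LIMIT_MS = 10 * MINUTE_MS;
const SYNC_STALE_LIMIT_MS = 2 * HOUR_MS;
const CERT_WARN_WINDOW_MS = 14 * DAY_MS;

// Returns the (hypothetical) names of the signals that fire for one tenant.
function firedSignals(t: TenantSnapshot, now: Date): string[] {
  const fired: string[] = [];
  const nowMs = now.getTime();

  if (t.unhealthySince !== null && nowMs - t.unhealthySince.getTime() > UNHEALTHY_LIMIT_MS) {
    fired.push("container-unhealthy");
  }
  if (nowMs - t.lastUsageEventAt.getTime() > SYNC_STALE_LIMIT_MS) {
    fired.push("sidecar-sync-stale");
  }
  if (t.certExpiresAt.getTime() - nowMs < CERT_WARN_WINDOW_MS) {
    fired.push("tls-cert-expiring");
  }
  return fired;
}
```

Passing the current time in as an argument, rather than reading the clock inside the function, keeps checks like these easy to unit test.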

Billing + provider integrity:

  • Stripe webhook lag (events delayed >5 minutes).
  • Stripe failed payments not retried.
  • Hetzner orphan labels (VPSes labeled for tenants that no longer exist in our database).
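The orphan-label check is essentially a set difference between what the provider reports and what the database knows. A minimal sketch, assuming the server list and tenant IDs have already been fetched; LabeledServer and findOrphanServers are hypothetical names, and no real Hetzner API call is shown:

```typescript
// Hypothetical shape: one provider server plus the tenant label attached to it.
interface LabeledServer {
  serverId: number;
  tenantLabel: string | undefined; // undefined when the server carries no tenant label
}

// Returns servers whose tenant label does not match any tenant in the database.
function findOrphanServers(
  servers: LabeledServer[],
  knownTenantIds: string[]
): LabeledServer[] {
  const known = new Set<string>(knownTenantIds);
  return servers.filter(
    (s) => s.tenantLabel !== undefined && !known.has(s.tenantLabel)
  );
}
```

Unlabeled servers are deliberately ignored here, since the signal as described only concerns VPSes labeled for tenants that no longer exist.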

Release lifecycle:

  • Gold-release lifecycle stuck states (a release sitting in candidate or compat-passed longer than expected without progress to gold).
  • Tenant instance errors trending up.
  • OpenClaw upstream release watcher (new versions worth evaluating).

Plus a usage-rollup staleness check and a self-health monitor that fires if any other signal stops checking in.
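The self-health monitor follows the familiar dead man's switch pattern: every signal checks in after each run, and silence beyond a limit is itself an alert. A minimal sketch under that assumption; HeartbeatRegistry, its methods, and the silence limit are hypothetical and not taken from the daemon:

```typescript
// Tracks when each signal last checked in and reports the ones that went quiet.
class HeartbeatRegistry {
  private readonly lastSeen = new Map<string, number>();
  private readonly maxSilenceMs: number;

  // Every expected signal is registered up front, so one that never runs is still noticed.
  constructor(signalNames: string[], maxSilenceMs: number, startMs: number) {
    this.maxSilenceMs = maxSilenceMs;
    for (const name of signalNames) {
      this.lastSeen.set(name, startMs);
    }
  }

  // Called by a signal after each completed check.
  checkIn(signal: string, nowMs: number): void {
    this.lastSeen.set(signal, nowMs);
  }

  // Names of signals that have been silent for longer than the limit, sorted.
  silentSignals(nowMs: number): string[] {
    const silent: string[] = [];
    this.lastSeen.forEach((seenAt, signal) => {
      if (nowMs - seenAt > this.maxSilenceMs) {
        silent.push(signal);
      }
    });
    return silent.sort();
  }
}
```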

How it routes

When a signal fires, it goes through the platform-shared @riseandshinefutures/ops-watchdog library, which:

  1. Posts to a Discord webhook (#seaboss-ops).
  2. Files a Linear ticket via the auto-filer (dedup by signal fingerprint).
  3. Logs to the audit DB.
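The three steps above can be sketched as a small router with injected sinks. This is an illustration only: the ops-watchdog library's real API is not shown in this post, so SignalEvent, Sinks, fingerprintOf, and SignalRouter are hypothetical names. It also assumes that the fingerprint is the signal name plus its subject, and that only ticket filing is deduplicated.

```typescript
// Hypothetical event shape.
interface SignalEvent {
  signal: string;  // e.g. "disk-pressure"
  subject: string; // what the event is about, e.g. a tenant or release id
  message: string;
}

// Hypothetical sink interface standing in for Discord, Linear, and the audit DB.
interface Sinks {
  postToDiscord(event: SignalEvent): void;
  fileLinearTicket(fingerprint: string, event: SignalEvent): void;
  writeAuditLog(event: SignalEvent): void;
}

// Assumption: a fingerprint identifies "the same problem" by signal and subject.
function fingerprintOf(event: SignalEvent): string {
  return `${event.signal}:${event.subject}`;
}

class SignalRouter {
  private readonly openFingerprints = new Set<string>();
  private readonly sinks: Sinks;

  constructor(sinks: Sinks) {
    this.sinks = sinks;
  }

  route(event: SignalEvent): void {
    // 1. Always notify the ops channel.
    this.sinks.postToDiscord(event);

    // 2. File a ticket only the first time this fingerprint is seen.
    const fingerprint = fingerprintOf(event);
    if (!this.openFingerprints.has(fingerprint)) {
      this.openFingerprints.add(fingerprint);
      this.sinks.fileLinearTicket(fingerprint, event);
    }

    // 3. Always record the event for auditing.
    this.sinks.writeAuditLog(event);
  }
}
```

Keeping the sinks behind an interface means the routing and dedup logic can be exercised in tests without touching Discord, Linear, or a database. A production version would also need to clear a fingerprint once its ticket is resolved, which this sketch omits.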

The severity scale runs info → warning → critical, with mention scoping that escalates to @here or @everyone only for true fleet-affecting events.
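One way to model an ordered severity scale with scoped mentions is shown below. The three severity names come from the post; everything else is an assumption for illustration, in particular the mapping of warning to @here and critical to @everyone, and the fleetAffecting flag.

```typescript
// Ordered from least to most severe.
const SEVERITY_ORDER = ["info", "warning", "critical"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];

// True when severity `a` is at least as severe as `b`.
function atLeast(a: Severity, b: Severity): boolean {
  return SEVERITY_ORDER.indexOf(a) >= SEVERITY_ORDER.indexOf(b);
}

// Assumed policy, not the documented one: mentions are used only for
// fleet-affecting events, and the broadest mention is reserved for critical.
function mentionFor(severity: Severity, fleetAffecting: boolean): "" | "@here" | "@everyone" {
  if (!fleetAffecting) {
    return "";
  }
  if (severity === "critical") {
    return "@everyone";
  }
  if (severity === "warning") {
    return "@here";
  }
  return "";
}
```

Under this policy, a critical event scoped to a single tenant still posts to the channel but pings nobody.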

What we caught early

In the first 24 hours after Phase 2 deployed, the daemon caught two real issues that unit tests didn't:

  • A SQL bug in the usage-rollup-staleness query: a wrong column name. The alert was fingerprinted as a flap, but the underlying query itself was broken.
  • A genuine deadlock in the Gold-upgrade flow that would have hit the next Gold release.

Both were fixed in the same session. The daemon paid for its build cost on day one.

What's next

Phase 2.B will integrate the daemon with a chat interface for interactive investigation: operator-driven for now, autonomous after a soak period.