
Agent Watchdog

The Breeze Watchdog is a lightweight companion service that runs alongside the Breeze agent on every managed device. It continuously monitors the agent process, automatically recovers it when it becomes unhealthy, and maintains a fallback connection to the Breeze server when the agent is completely down.

The watchdog is installed automatically alongside the agent during normal installation — no separate setup is required.


The watchdog runs as a separate system service (breeze-watchdog) and monitors the agent through three independent health checks:

  1. Process liveness — Is the agent process still running? Detects crashes and unexpected exits.
  2. IPC connectivity — Can the watchdog reach the agent over its local IPC socket? Detects hangs and deadlocks.
  3. Heartbeat freshness — Has the agent sent a heartbeat to the server recently? Detects networking failures and stuck loops.

When a check fails, the watchdog transitions through escalating states — from monitoring to recovery to failover — to bring the agent back to a healthy state.


The watchdog operates as a state machine with five states:

| State | Description |
| --- | --- |
| Connecting | Starting up, attempting to establish IPC connection to the agent |
| Monitoring | Agent is healthy; running periodic health checks |
| Recovering | Agent is unhealthy; performing escalating restart attempts |
| Standby | Agent signaled a graceful shutdown (e.g., for an update); waiting for it to come back |
| Failover | Recovery attempts exhausted; watchdog is communicating directly with the Breeze server on behalf of the device |

CONNECTING ──ipc connected──→ MONITORING
CONNECTING ──agent not found─→ RECOVERING
MONITORING ──agent unhealthy─→ RECOVERING
MONITORING ──shutdown intent─→ STANDBY
RECOVERING ──agent recovered─→ MONITORING
RECOVERING ──recovery exhausted──→ FAILOVER
STANDBY ────agent recovered──→ MONITORING
STANDBY ────standby timeout──→ FAILOVER
STANDBY ────start agent──────→ RECOVERING
FAILOVER ───agent recovered──→ MONITORING
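The diagram above can be encoded as a simple transition table. This is a sketch of the same state machine, not the watchdog's actual code; state and event names mirror the diagram.

```python
# The watchdog state machine from the diagram, as a lookup table.
TRANSITIONS = {
    ("CONNECTING", "ipc_connected"): "MONITORING",
    ("CONNECTING", "agent_not_found"): "RECOVERING",
    ("MONITORING", "agent_unhealthy"): "RECOVERING",
    ("MONITORING", "shutdown_intent"): "STANDBY",
    ("RECOVERING", "agent_recovered"): "MONITORING",
    ("RECOVERING", "recovery_exhausted"): "FAILOVER",
    ("STANDBY", "agent_recovered"): "MONITORING",
    ("STANDBY", "standby_timeout"): "FAILOVER",
    ("STANDBY", "start_agent"): "RECOVERING",
    ("FAILOVER", "agent_recovered"): "MONITORING",
}


def next_state(state: str, event: str) -> str:
    # Events with no matching transition leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```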

When the agent is detected as unhealthy, the watchdog performs escalating recovery actions:

  1. Graceful restart — asks the system service manager to restart the agent service cleanly.

  2. Force-kill + restart — terminates the agent process, then starts the service. Used when the agent process is hung and not responding to service stop requests.

  3. Start only — assumes the previous process is already gone and attempts a fresh service start.

Recovery attempts are tracked within a cooldown window. If the window elapses without exhausting all attempts, the counter resets. If all attempts are exhausted within the window, the watchdog transitions to Failover.
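The cooldown-window logic can be sketched as a small counter. This is an assumption-level sketch of the behavior described above, using the defaults from the configuration table (3 attempts, 10-minute window):

```python
# Sketch of recovery-attempt tracking within a cooldown window.
# Defaults mirror the configuration table; class name is illustrative.
class RecoveryTracker:
    def __init__(self, max_attempts: int = 3, cooldown: float = 600.0):
        self.max_attempts = max_attempts
        self.cooldown = cooldown  # seconds
        self.attempts = 0
        self.window_start: float | None = None

    def record_attempt(self, now: float) -> bool:
        """Record one recovery attempt; return True when failover should begin."""
        if self.window_start is None or now - self.window_start > self.cooldown:
            # First attempt, or the window elapsed: reset the counter.
            self.window_start = now
            self.attempts = 0
        self.attempts += 1
        return self.attempts >= self.max_attempts
```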


When recovery is exhausted, the watchdog enters Failover mode and takes over communication with the Breeze server. In this state the watchdog:

  • Sends periodic heartbeats so the device remains visible in the dashboard
  • Polls for commands from the server (restart agent, collect diagnostics, update agent, update watchdog)
  • Ships watchdog health journal excerpts for remote diagnosis
  • Accepts and applies agent or watchdog binary updates pushed from the server

Failover keeps the device manageable even when the agent process is completely down. Once the agent is successfully restarted (either via a server command or manual intervention), the watchdog transitions back to Monitoring.
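The failover behavior amounts to a heartbeat-and-poll loop. The sketch below assumes a hypothetical `server` client object; the real wire protocol and command set are not specified here.

```python
# Minimal sketch of the failover loop: heartbeat, poll for commands,
# return to monitoring once the agent is healthy again.
# `server` and `agent_healthy` are stand-ins, not the real interfaces.
import time


def failover_loop(server, agent_healthy, poll_interval: float = 30.0) -> str:
    while True:
        server.send_heartbeat()             # keep the device visible in the dashboard
        for cmd in server.poll_commands():  # restart agent, collect diagnostics, ...
            server.execute(cmd)
        if agent_healthy():                 # agent recovered -> back to MONITORING
            return "MONITORING"
        time.sleep(poll_interval)
```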


The watchdog maintains a local rotating log file called the health journal. Each entry records a timestamped event (state transition, health check result, recovery action, IPC message) in structured JSON format.

Journal files rotate by size and count, keeping recent history available for diagnosis without consuming excessive disk space.
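A size-rotated journal with structured JSON entries can be sketched with Python's standard library. File name, size limits, and field names below are illustrative, not the watchdog's actual values:

```python
# Sketch of a size-rotated health journal with JSON entries, using the
# stdlib RotatingFileHandler. Limits and field names are assumptions.
import json
import logging
import time
from logging.handlers import RotatingFileHandler


def make_journal(path: str = "health-journal.log") -> logging.Logger:
    logger = logging.getLogger("breeze.watchdog.journal")
    # Rotate by size and keep a bounded number of backups.
    handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=5)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def journal_entry(event: str, **fields) -> str:
    """One timestamped, structured entry (state transition, check result, ...)."""
    return json.dumps({"ts": time.time(), "event": event, **fields})
```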

You can export the journal locally:

breeze-watchdog health-journal

The watchdog also ships journal entries to the Breeze server as diagnostic logs (component prefix watchdog.*), which you can query via the API — see Watchdog Logs below.


The watchdog reads its configuration from the same config directory as the agent. Default values are tuned for production use — most deployments do not need to change them.

| Parameter | Default | Description |
| --- | --- | --- |
| Process check interval | 10 s | How often to check if the agent process is alive |
| IPC probe interval | 15 s | How often to ping the agent over IPC |
| Heartbeat stale threshold | 5 min | How long since the last heartbeat before the agent is considered stale |
| Max recovery attempts | 3 | Number of escalating recovery tries before entering failover |
| Recovery cooldown | 10 min | Window in which recovery attempts are counted; resets after expiry |
| Standby timeout | 5 min | How long to wait in standby before escalating to failover |
| Failover poll interval | 30 s | How often to poll the server for commands during failover |
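As an illustration only, a watchdog section in the shared config directory might look like the following. The key names and file format here are hypothetical assumptions for readability, not the shipped schema:

```toml
# Hypothetical shape of the watchdog configuration; key names are
# illustrative, not the actual schema.
[watchdog]
process_check_interval = "10s"
ipc_probe_interval = "15s"
heartbeat_stale_threshold = "5m"
max_recovery_attempts = 3
recovery_cooldown = "10m"
standby_timeout = "5m"
failover_poll_interval = "30s"
```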

Watchdog diagnostic logs are queryable through a dedicated API endpoint that filters specifically for watchdog-component logs.

GET /devices/:id/watchdog-logs
| Parameter | Type | Description |
| --- | --- | --- |
| level | string | Comma-separated levels: debug, info, warn, error |
| component | string | Filter to a specific watchdog sub-component (e.g., watchdog.recovery) |
| since | ISO 8601 | Include only logs at or after this datetime |
| until | ISO 8601 | Include only logs at or before this datetime |
| search | string | Full-text search across message and structured fields |
| page | number | Page number (1-based) |
| limit | number | Results per page (max 500) |
# Get recent watchdog error logs for a device
GET /api/v1/devices/DEVICE_ID/watchdog-logs?level=warn,error&limit=50
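A query like the one above can be built programmatically. The base URL and any authentication are deployment-specific assumptions; the helper below only assembles the documented path and query parameters and works with any HTTP client:

```python
# Sketch of building a watchdog-logs query URL from the documented
# parameters. The base URL is a deployment-specific assumption.
from urllib.parse import urlencode


def watchdog_logs_url(base: str, device_id: str, **params) -> str:
    """Build a watchdog-logs query URL (level, component, since, until, ...)."""
    url = f"{base}/api/v1/devices/{device_id}/watchdog-logs"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `watchdog_logs_url("https://breeze.example.com", "DEVICE_ID", level="warn,error", limit=50)` reproduces the request shown above.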

The devices table includes three watchdog-specific fields:

| Field | Description |
| --- | --- |
| watchdog_status | Current watchdog state: connected (monitoring), failover, or offline |
| watchdog_last_seen | Timestamp of the last watchdog heartbeat or check-in |
| watchdog_version | Installed watchdog binary version |

These fields update as the watchdog reports its state through heartbeats and failover polling.


Watchdog status shows failover. The agent process crashed and could not be automatically recovered after the configured number of attempts. Use the Restart Agent action from the device detail page, or SSH into the device and manually restart the agent service. Check the watchdog logs for recovery failure details: GET /devices/:id/watchdog-logs?level=error.

Watchdog status shows offline. The watchdog service itself is not running. Check the system service manager on the device (systemctl status breeze-watchdog on Linux, launchctl list | grep breeze-watchdog on macOS, Get-Service breeze-watchdog on Windows).

Agent keeps restarting in a loop. The watchdog recovers the agent, but it immediately crashes again. This is usually caused by a bad configuration file or a corrupted binary. Check the agent’s own diagnostic logs for startup errors. The watchdog will eventually exhaust its recovery attempts and enter failover, preventing an infinite restart loop.

No watchdog logs appearing. The watchdog ships logs through the same diagnostic log pipeline as the agent. Verify the watchdog service is running and the device is online. Watchdog logs use the watchdog.* component prefix — they do not appear in the standard agent logs filter unless you query the dedicated endpoint.