Observability Stack

Breeze includes an optional observability stack as a separate Docker Compose overlay (docker-compose.monitoring.yml). Enable it alongside the core stack to get full metrics, dashboards, log aggregation, and infrastructure alerting.

```sh
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
```

| Component | Port | Purpose |
| --- | --- | --- |
| Prometheus | 9090 (localhost) | Time-series metrics collection and alerting rules |
| Grafana | 3000 (localhost) | Dashboards and visualization |
| Alertmanager | 9093 (localhost) | Alert routing and notifications |
| Loki | 3100 (localhost) | Log aggregation and querying |
| Promtail | 9080 (localhost) | Log shipping from Docker containers to Loki |
| Redis Exporter | 9121 (internal) | Exports Redis metrics for Prometheus |
| Postgres Exporter | 9187 (internal) | Exports PostgreSQL metrics for Prometheus |

Grafana is bound to localhost on the server, so access it through an SSH tunnel:

```sh
# Via SSH tunnel
ssh -L 3000:127.0.0.1:3000 user@your-server
# Then open http://localhost:3000
# Username: admin
# Password: (your GRAFANA_ADMIN_PASSWORD from .env.prod)
```

Breeze ships with a Grafana dashboard (monitoring/grafana/dashboards/breeze-overview.json) that is automatically provisioned. It includes these panels:

| Panel | What It Shows |
| --- | --- |
| Service Status | Up/down status of API, Redis, PostgreSQL, and other services |
| Request Rate | HTTP requests per second with breakdown by method |
| Response Times | P50, P95, and P99 latency over time |
| Error Rate | 4xx and 5xx response rates as percentages |
| HTTP Status Distribution | Breakdown of responses by status code |
| Top Endpoints | Most-used API endpoints by request volume |
| Active Devices | Count of agents with recent heartbeats |
| Organizations | Number of active tenants |
| Redis Memory | Memory usage, evictions, and hit rate |
| PostgreSQL Connections | Active connection count vs. max pool size |

To add your own dashboards:

  1. Create or import a dashboard in the Grafana UI.
  2. Export it as JSON from the Grafana dashboard settings.
  3. Save the JSON file to monitoring/grafana/dashboards/.
  4. The dashboard provisioner (monitoring/grafana/dashboards.yml) automatically picks up new files in that directory on the next Grafana restart.
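A file-based dashboard provider of the kind described above typically looks something like this. This is an illustrative sketch following Grafana's provisioning schema, not the exact contents of monitoring/grafana/dashboards.yml; the provider name and container path are assumptions.

```yaml
# Sketch of a Grafana file-based dashboard provider (values are illustrative).
apiVersion: 1

providers:
  - name: breeze-dashboards      # assumed provider name
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30    # how often Grafana rescans the directory
    options:
      # Assumed mount point for monitoring/grafana/dashboards/ inside the container
      path: /etc/grafana/provisioning/dashboards
```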

Grafana data sources are configured automatically via monitoring/grafana/datasources.yml:

| Source | Type | URL |
| --- | --- | --- |
| Prometheus | Time-series | http://prometheus:9090 |
| Loki | Logs | http://loki:3100 |
| PostgreSQL | SQL | postgres:5432 |
| Redis | Key-value | redis://redis:6379 |
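The Prometheus and Loki entries in that file would look roughly like this. A sketch using Grafana's datasource provisioning schema; field values beyond the URLs in the table above are assumptions.

```yaml
# Sketch of monitoring/grafana/datasources.yml (Prometheus and Loki entries only).
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy              # Grafana queries the source server-side
    url: http://prometheus:9090
    isDefault: true            # assumed; makes it the default for new panels
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
```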

The Prometheus configuration is located at monitoring/prometheus.yml. Key settings:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'breeze-rmm'
    environment: 'production'
```

| Job | Target | Interval | Auth |
| --- | --- | --- | --- |
| prometheus | localhost:9090 | 15s (default) | None |
| breeze-api | api:3001 at /metrics/scrape | 10s | Bearer token |
| redis | redis-exporter:9121 | 15s (default) | None |
| postgres | postgres-exporter:9187 | 15s (default) | None |
| node | node-exporter:9100 (optional) | 15s (default) | None |
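The unauthenticated exporter jobs in the table are plain static scrape targets. A sketch of what those entries look like in monitoring/prometheus.yml, assuming ordinary static_configs with no extra relabeling:

```yaml
# Sketch of the exporter scrape jobs; job names and targets match the table
# above, everything else falls back to the global defaults.
scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']
```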

The API metrics endpoint is protected by a bearer token. Store the token in monitoring/secrets/metrics_scrape_token and reference it in prometheus.yml:

```yaml
- job_name: 'breeze-api'
  metrics_path: /metrics/scrape
  authorization:
    type: Bearer
    credentials_file: /run/secrets/metrics_scrape_token
```
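For that credentials_file path to resolve, the token file must be mounted into the Prometheus container as a Docker secret. A sketch of the relevant compose fragment; the service and secret names are assumptions about docker-compose.monitoring.yml, not its verified contents:

```yaml
# Sketch: mount the scrape token so Prometheus sees it at
# /run/secrets/metrics_scrape_token (Docker's default secret mount point).
services:
  prometheus:
    secrets:
      - metrics_scrape_token

secrets:
  metrics_scrape_token:
    file: ./monitoring/secrets/metrics_scrape_token
```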

Prometheus loads all .yml files from monitoring/rules/. Breeze ships with breeze-rules.yml containing API, infrastructure, and business alert rules plus recording rules for common aggregations. See Infrastructure Alerts for the full list.

Rules are evaluated every 30 seconds (API and infrastructure groups) or every 60 seconds (business alerts). Recording rules pre-compute expensive queries:

| Recording Rule | Description |
| --- | --- |
| breeze:http_requests:rate5m | Request rate by status, method, and route |
| breeze:http_error_rate:ratio5m | 5xx error rate as a ratio |
| breeze:http_request_duration:avg5m | Average request duration by route |
| breeze:http_request_duration:p95_5m | 95th percentile request duration |
| breeze:http_request_duration:p99_5m | 99th percentile request duration |
| breeze:devices:active_count | Total active device count |
| breeze:http_requests:rate5m_by_org | Request rate per organization |
| breeze:redis_ops:rate5m | Redis operations per second |
| breeze:postgres_query_duration:avg5m | Average PostgreSQL query duration |
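As an illustration of the shape of these rules, breeze:http_error_rate:ratio5m would be defined in breeze-rules.yml roughly as follows. The expression is a plausible sketch built from the http_requests_total metric documented below; the shipped file's exact expression and group layout may differ.

```yaml
# Sketch of one recording rule; pre-computes the 5xx ratio so dashboards
# and alerts can query a single cheap series.
groups:
  - name: breeze-recording-rules
    interval: 30s
    rules:
      - record: breeze:http_error_rate:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```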

HTTP metrics exposed by the API:

| Metric | Type | Description |
| --- | --- | --- |
| http_requests_total | Counter | Total HTTP requests by method, path, status |
| http_request_duration_seconds | Histogram | Request latency distribution |
| http_requests_in_flight | Gauge | Currently processing requests |

Business metrics:

| Metric | Type | Description |
| --- | --- | --- |
| breeze_active_devices | Gauge | Devices with a recent heartbeat |
| breeze_active_organizations | Gauge | Organizations with active devices |
| breeze_commands_total | Counter | Commands executed, labeled by type |
| breeze_alerts_total | Counter | Alerts fired, labeled by severity |

Infrastructure metrics from the Redis and Postgres exporters:

| Metric | Type | Description |
| --- | --- | --- |
| redis_memory_used_bytes | Gauge | Redis memory consumption |
| redis_commands_processed_total | Counter | Total Redis commands processed |
| pg_stat_activity_count | Gauge | PostgreSQL active connections |
| pg_database_size_bytes | Gauge | Database size in bytes |
| pg_settings_max_connections | Gauge | PostgreSQL max allowed connections |

Promtail scrapes Docker container logs and ships them to Loki. Loki stores logs for 14 days by default (configurable via retention_period in monitoring/loki-config.yml).
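The retention_period setting lives under Loki's limits configuration. A sketch of the relevant fragment of monitoring/loki-config.yml, based on Loki's config schema; the surrounding keys are assumptions, not the file's verified contents:

```yaml
# Sketch: the 14-day default mentioned above, plus the compactor flag that
# actually enforces deletion of expired chunks.
limits_config:
  retention_period: 336h   # 14 days

compactor:
  retention_enabled: true
```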

Open the Explore page in Grafana, select the Loki data source, and enter LogQL queries.

```logql
# All API logs
{container="breeze-api"}

# API errors only
{container="breeze-api"} |= "error"

# Structured JSON logs: filter by level
{container="breeze-api"} | json | level = "error"

# Logs from a specific container
{container="breeze-web"}

# Search for a specific device ID
{container="breeze-api"} |= "device_id=abc123"

# Exclude health check noise
{container="breeze-api"} != "/health"

# Filter by HTTP status code in structured logs
{container="breeze-api"} | json | status >= 500

# Rate of errors over time (useful for dashboards)
rate({container="breeze-api"} |= "error" [5m])

# Search for "timeout" (limit the window with Grafana's time picker)
{container="breeze-api"} |= "timeout"

# Count log lines per minute
sum(rate({container="breeze-api"} [1m])) by (container)
```
To add your own alert rules:

  1. Create a new YAML file in monitoring/rules/ (e.g., monitoring/rules/custom-rules.yml).

  2. Define your alert rules following the Prometheus format:

    ```yaml
    groups:
      - name: custom-alerts
        rules:
          - alert: HighAgentChurn
            expr: increase(breeze_device_enrollments_total[1h]) > 10
            for: 30m
            labels:
              severity: warning
            annotations:
              summary: "High agent enrollment rate"
              description: "More than 10 new enrollments per hour for 30 minutes"
    ```
  3. Reload the Prometheus configuration (no restart required, provided Prometheus was started with the --web.enable-lifecycle flag):

    ```sh
    curl -X POST http://localhost:9090/-/reload
    ```
  4. Verify the rule loaded successfully by checking http://localhost:9090/rules in the Prometheus UI.

| Component | Default Retention | Configuration |
| --- | --- | --- |
| Prometheus | 15 days | --storage.tsdb.retention.time=15d in compose file |
| Loki | 14 days (336h) | retention_period in monitoring/loki-config.yml |
| Grafana | Unlimited (dashboards only) | N/A |
| Alertmanager | Silences and notification log only | --storage.path in compose file |

To change retention, edit the relevant configuration and restart the container.
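For example, shortening Prometheus retention to 7 days might look like this in docker-compose.monitoring.yml. A sketch: the other command flags shown are assumptions about the service definition.

```yaml
# Sketch of the Prometheus service with retention lowered to 7 days.
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'   # down from the 15d default
```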

Symptom: The breeze-api target shows as DOWN in http://localhost:9090/targets.

  1. Verify the API is running and healthy: curl http://localhost:3001/health
  2. Check the scrape token is correct. Compare monitoring/secrets/metrics_scrape_token with the METRICS_SCRAPE_TOKEN environment variable on the API container.
  3. Verify network connectivity. Both Prometheus and the API must be on the same Docker network (breeze).
  4. Check Prometheus logs: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs prometheus --tail 50

Symptom: Dashboard panels show “No data” instead of charts.

  1. Confirm Prometheus is running and scraping: visit http://localhost:9090/targets and verify all targets are UP.
  2. In Grafana, go to Configuration > Data Sources > Prometheus and click Test. It should say “Data source is working.”
  3. Check the time range selector in Grafana. If metrics collection just started, narrow the range to “Last 15 minutes.”
  4. If using a custom dashboard, verify the metric names match what Prometheus is collecting. Test a simple query like up in Grafana Explore.

Symptom: Log queries in Grafana take more than 10 seconds or time out.

  1. Narrow the time range. Loki performs best with shorter ranges (last 1 hour vs. last 7 days).
  2. Add label matchers. {container="breeze-api"} |= "error" is much faster than {job="varlogs"} |= "error" because the label narrows the search before the text filter runs.
  3. Check Loki’s compactor. If it has fallen behind, compaction can slow queries: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs loki --tail 50
  4. Increase Loki resources if needed. In the compose file, add memory limits and CPU limits that match your server capacity.

Symptom: Alerts fire in Prometheus but no notifications arrive.

  1. Confirm Alertmanager is receiving alerts: visit http://localhost:9093/#/alerts and check for active alerts.
  2. If no alerts appear, verify Prometheus is configured to send to Alertmanager. Check alerting.alertmanagers in monitoring/prometheus.yml.
  3. If alerts appear but notifications are not sent, check the receiver configuration in monitoring/alertmanager.yml. Look for commented-out sections that need to be enabled.
  4. Check Alertmanager logs for delivery errors: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs alertmanager --tail 50
  5. Verify webhook URLs, API keys, and SMTP credentials are correct. Test Slack webhooks with curl to rule out network issues.
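Once enabled, a minimal Slack receiver in monitoring/alertmanager.yml might look like the following. This is a sketch using Alertmanager's standard config schema; the webhook URL and channel are placeholders, and the shipped file's routing tree may differ.

```yaml
# Sketch: route every alert to a single Slack receiver.
route:
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder
        channel: '#alerts'
        send_resolved: true   # also notify when the alert clears
```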

Symptom: One or more monitoring containers fail to start or keep restarting.

  1. Check which containers are failing: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps
  2. Read the logs: docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs <container> --tail 100
  3. Common causes:
    • Grafana: GRAFANA_ADMIN_PASSWORD not set in .env.prod. The compose file requires this variable.
    • Postgres Exporter: POSTGRES_PASSWORD not set or incorrect. The exporter needs the same credentials as the database.
    • Prometheus: Invalid YAML in prometheus.yml or rule files. Validate with promtool check config monitoring/prometheus.yml.
    • Loki: Permissions issue on the data volume. Loki runs as a non-root user and needs write access to /loki.

Monitoring data can accumulate over time, especially on busy systems.

  1. Check volume sizes: docker system df -v | grep -E 'prometheus|grafana|loki'
  2. Reduce Prometheus retention: lower --storage.tsdb.retention.time from 15d to 7d in the compose file.
  3. Reduce Loki retention: lower retention_period in monitoring/loki-config.yml (e.g., from 336h to 168h).
  4. Prune old Docker volumes if containers were previously removed without cleaning up: docker volume prune
  5. Restart the affected containers after configuration changes.