Observability Stack
Breeze includes an optional observability stack as a separate Docker Compose overlay (docker-compose.monitoring.yml). Enable it alongside the core stack to get full metrics, dashboards, log aggregation, and infrastructure alerting.
```sh
docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d
```

Components

| Component | Port | Purpose |
|---|---|---|
| Prometheus | 9090 (localhost) | Time-series metrics collection and alerting rules |
| Grafana | 3000 (localhost) | Dashboards and visualization |
| Alertmanager | 9093 (localhost) | Alert routing and notifications |
| Loki | 3100 (localhost) | Log aggregation and querying |
| Promtail | 9080 (localhost) | Log shipping from Docker containers to Loki |
| Redis Exporter | 9121 (internal) | Exports Redis metrics for Prometheus |
| Postgres Exporter | 9187 (internal) | Exports PostgreSQL metrics for Prometheus |
Accessing Grafana
```sh
# Via SSH tunnel
ssh -L 3000:127.0.0.1:3000 user@your-server

# Then open http://localhost:3000
# Username: admin
# Password: (your GRAFANA_ADMIN_PASSWORD from .env.prod)
```

Pre-Built Dashboards

Breeze ships with a Grafana dashboard (`monitoring/grafana/dashboards/breeze-overview.json`) that is automatically provisioned. It includes these panels:
| Panel | What It Shows |
|---|---|
| Service Status | Up/down status of API, Redis, PostgreSQL, and other services |
| Request Rate | HTTP requests per second with breakdown by method |
| Response Times | P50, P95, and P99 latency over time |
| Error Rate | 4xx and 5xx response rates as percentages |
| HTTP Status Distribution | Breakdown of responses by status code |
| Top Endpoints | Most-used API endpoints by request volume |
| Active Devices | Count of agents with recent heartbeats |
| Organizations | Number of active tenants |
| Redis Memory | Memory usage, evictions, and hit rate |
| PostgreSQL Connections | Active connection count vs. max pool size |
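As an illustration, the Error Rate panel can be reproduced in Grafana Explore with a query along these lines. This is a sketch built from the `http_requests_total` counter documented under Key Metrics; the shipped dashboard JSON may phrase it differently:

```promql
# Percentage of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```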
Adding Custom Dashboards
To add your own dashboards:
- Create or import a dashboard in the Grafana UI.
- Export it as JSON from the Grafana dashboard settings.
- Save the JSON file to `monitoring/grafana/dashboards/`.
- The dashboard provisioner (`monitoring/grafana/dashboards.yml`) automatically picks up new files in that directory on the next Grafana restart.
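For reference, a file-based Grafana dashboard provisioner typically looks like the sketch below. The provider name and container path here are assumptions; the exact contents of `monitoring/grafana/dashboards.yml` may differ:

```yaml
apiVersion: 1
providers:
  - name: breeze-dashboards   # assumed provider name
    type: file
    options:
      # path inside the Grafana container where the dashboard JSON is mounted
      path: /etc/grafana/dashboards
```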
Data Sources
Configured automatically via `monitoring/grafana/datasources.yml`:
| Source | Type | URL |
|---|---|---|
| Prometheus | Time-series | http://prometheus:9090 |
| Loki | Logs | http://loki:3100 |
| PostgreSQL | SQL | postgres:5432 |
| Redis | Key-value | redis://redis:6379 |
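The table above maps onto Grafana's standard datasource provisioning format. A minimal sketch follows; the real `monitoring/grafana/datasources.yml` may set additional options such as credentials for the PostgreSQL and Redis sources:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki:3100
```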
Prometheus Configuration
Located at `monitoring/prometheus.yml`. Key settings:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    monitor: 'breeze-rmm'
    environment: 'production'
```

Scrape Targets

| Job | Target | Interval | Auth |
|---|---|---|---|
| prometheus | localhost:9090 | 15s (default) | None |
| breeze-api | api:3001 at /metrics/scrape | 10s | Bearer token |
| redis | redis-exporter:9121 | 15s (default) | None |
| postgres | postgres-exporter:9187 | 15s (default) | None |
| node | node-exporter:9100 (optional) | 15s (default) | None |
The API metrics endpoint is protected by a bearer token. Store the token in `monitoring/secrets/metrics_scrape_token` and reference it in `prometheus.yml`:

```yaml
- job_name: 'breeze-api'
  metrics_path: /metrics/scrape
  authorization:
    type: Bearer
    credentials_file: /run/secrets/metrics_scrape_token
```

Rule Files

Prometheus loads all `.yml` files from `monitoring/rules/`. Breeze ships with `breeze-rules.yml`, containing API, infrastructure, and business alert rules plus recording rules for common aggregations. See Infrastructure Alerts for the full list.
Rules are evaluated every 30 seconds (API and infrastructure groups) or every 60 seconds (business alerts). Recording rules pre-compute expensive queries:
| Recording Rule | Description |
|---|---|
| breeze:http_requests:rate5m | Request rate by status, method, and route |
| breeze:http_error_rate:ratio5m | 5xx error rate as a ratio |
| breeze:http_request_duration:avg5m | Average request duration by route |
| breeze:http_request_duration:p95_5m | 95th percentile request duration |
| breeze:http_request_duration:p99_5m | 99th percentile request duration |
| breeze:devices:active_count | Total active device count |
| breeze:http_requests:rate5m_by_org | Request rate per organization |
| breeze:redis_ops:rate5m | Redis operations per second |
| breeze:postgres_query_duration:avg5m | Average PostgreSQL query duration |
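To show the shape of these rules, here is a sketch of how `breeze:http_error_rate:ratio5m` might be defined in `breeze-rules.yml`. The group name and expression are assumptions derived from the `http_requests_total` metric, not the shipped rule verbatim:

```yaml
groups:
  - name: breeze-recording
    interval: 30s
    rules:
      - record: breeze:http_error_rate:ratio5m
        # Fraction of requests that returned 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
```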
Key Metrics
HTTP Metrics (from the API)

| Metric | Type | Description |
|---|---|---|
| http_requests_total | Counter | Total HTTP requests by method, path, status |
| http_request_duration_seconds | Histogram | Request latency distribution |
| http_requests_in_flight | Gauge | Currently processing requests |
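The histogram above is what powers the latency panels. For example, a P95 query over `http_request_duration_seconds` might look like the following; the `route` label is an assumption, so substitute whatever label the API actually attaches:

```promql
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)
```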
Business Metrics
| Metric | Type | Description |
|---|---|---|
| breeze_active_devices | Gauge | Devices with a recent heartbeat |
| breeze_active_organizations | Gauge | Organizations with active devices |
| breeze_commands_total | Counter | Commands executed, labeled by type |
| breeze_alerts_total | Counter | Alerts fired, labeled by severity |
Infrastructure Metrics
| Metric | Type | Description |
|---|---|---|
| redis_memory_used_bytes | Gauge | Redis memory consumption |
| redis_commands_processed_total | Counter | Total Redis commands processed |
| pg_stat_activity_count | Gauge | PostgreSQL active connections |
| pg_database_size_bytes | Gauge | Database size in bytes |
| pg_settings_max_connections | Gauge | PostgreSQL max allowed connections |
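These gauges combine naturally into saturation queries. For instance, connection usage as a percentage of the PostgreSQL limit could be sketched as below; exact label handling depends on how postgres-exporter groups `pg_stat_activity_count` (often by database and state):

```promql
# Active connections as a percentage of max_connections
sum(pg_stat_activity_count) / scalar(pg_settings_max_connections) * 100
```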
Log Aggregation with Loki
Promtail scrapes Docker container logs and ships them to Loki. Loki stores logs for 14 days by default (configurable via `retention_period` in `monitoring/loki-config.yml`).
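Retention in Loki is controlled by the limits and compactor sections. A minimal sketch of the relevant keys in `monitoring/loki-config.yml` follows; the shipped file will contain many more settings, and key placement can vary between Loki versions:

```yaml
limits_config:
  retention_period: 336h   # 14 days
compactor:
  retention_enabled: true  # the compactor must be enabled for deletion to run
```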
Querying Logs in Grafana
Open the Explore page in Grafana, select the Loki data source, and enter LogQL queries.
Basic Queries
Section titled “Basic Queries”# All API logs{container="breeze-api"}
# API errors only{container="breeze-api"} |= "error"
# Structured JSON logs — filter by level{container="breeze-api"} | json | level = "error"
# Logs from a specific container{container="breeze-web"}Filtering and Searching
Section titled “Filtering and Searching”# Search for a specific device ID{container="breeze-api"} |= "device_id=abc123"
# Exclude health check noise{container="breeze-api"} != "/health"
# Filter by HTTP status code in structured logs{container="breeze-api"} | json | status >= 500
# Rate of errors over time (useful for dashboards)rate({container="breeze-api"} |= "error" [5m])Time-Based Queries
Section titled “Time-Based Queries”# Logs from the last hour containing "timeout"{container="breeze-api"} |= "timeout"
# Count log lines per minutesum(rate({container="breeze-api"} [1m])) by (container)Adding Custom Prometheus Alert Rules
Section titled “Adding Custom Prometheus Alert Rules”-
Create a new YAML file in
monitoring/rules/(e.g.,monitoring/rules/custom-rules.yml). -
Define your alert rules following the Prometheus format:
groups:- name: custom-alertsrules:- alert: HighAgentChurnexpr: rate(breeze_device_enrollments_total[1h]) > 10for: 30mlabels:severity: warningannotations:summary: "High agent enrollment rate"description: "More than 10 new enrollments per hour for 30 minutes" -
Reload the Prometheus configuration (no restart required):
Terminal window curl -X POST http://localhost:9090/-/reload -
Verify the rule loaded successfully by checking
http://localhost:9090/rulesin the Prometheus UI.
Data Retention
| Component | Default Retention | Configuration |
|---|---|---|
| Prometheus | 15 days | --storage.tsdb.retention.time=15d in compose file |
| Loki | 14 days (336h) | retention_period in monitoring/loki-config.yml |
| Grafana | Unlimited (dashboards only) | N/A |
| Alertmanager | Silences and notification log only | --storage.path in compose file |
To change retention, edit the relevant configuration and restart the container.
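For example, the Prometheus retention flag lives in the service's `command` list in the compose overlay. The flag names below are real Prometheus flags, but the surrounding service definition is an illustrative sketch rather than the exact contents of `docker-compose.monitoring.yml`:

```yaml
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'   # lower to 7d to save disk
```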
Troubleshooting
Prometheus Is Not Scraping the API

Symptom: The `breeze-api` target shows as DOWN at http://localhost:9090/targets.
- Verify the API is running and healthy: `curl http://localhost:3001/health`
- Check that the scrape token is correct. Compare `monitoring/secrets/metrics_scrape_token` with the `METRICS_SCRAPE_TOKEN` environment variable on the API container.
- Verify network connectivity. Both Prometheus and the API must be on the same Docker network (`breeze`).
- Check Prometheus logs: `docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs prometheus --tail 50`
Grafana Shows “No Data”
Symptom: Dashboard panels show “No data” instead of charts.
- Confirm Prometheus is running and scraping: visit http://localhost:9090/targets and verify all targets are UP.
- In Grafana, go to Configuration > Data Sources > Prometheus and click Test. It should say “Data source is working.”
- Check the time range selector in Grafana. If metrics collection just started, narrow the range to “Last 15 minutes.”
- If using a custom dashboard, verify the metric names match what Prometheus is collecting. Test a simple query like `up` in Grafana Explore.
Loki Queries Are Slow
Symptom: Log queries in Grafana take more than 10 seconds or time out.
- Narrow the time range. Loki performs best with shorter ranges (last 1 hour vs. last 7 days).
- Add label matchers. `{container="breeze-api"} |= "error"` is much faster than `{job="varlogs"} |= "error"` because the label narrows the search before the text filter runs.
- Check Loki’s compactor. If it has fallen behind, compaction can slow queries: `docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs loki --tail 50`
- Increase Loki resources if needed. In the compose file, add memory and CPU limits that match your server capacity.
Alertmanager Is Not Sending Notifications
Symptom: Alerts fire in Prometheus but no notifications arrive.
- Confirm Alertmanager is receiving alerts: visit http://localhost:9093/#/alerts and check for active alerts.
- If no alerts appear, verify Prometheus is configured to send to Alertmanager. Check `alerting.alertmanagers` in `monitoring/prometheus.yml`.
- If alerts appear but notifications are not sent, check the receiver configuration in `monitoring/alertmanager.yml`. Look for commented-out sections that need to be enabled.
- Check Alertmanager logs for delivery errors: `docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs alertmanager --tail 50`
- Verify webhook URLs, API keys, and SMTP credentials are correct. Test Slack webhooks with `curl` to rule out network issues.
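A working receiver block in `monitoring/alertmanager.yml` looks roughly like the sketch below. The webhook URL is a placeholder and the receiver name is an assumption; substitute your own Slack incoming-webhook URL:

```yaml
route:
  receiver: slack-alerts
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder
        channel: '#alerts'
        send_resolved: true
```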
Containers Not Starting
Symptom: One or more monitoring containers fail to start or keep restarting.
- Check which containers are failing: `docker compose -f docker-compose.yml -f docker-compose.monitoring.yml ps`
- Read the logs: `docker compose -f docker-compose.yml -f docker-compose.monitoring.yml logs <container> --tail 100`
- Common causes:
  - Grafana: `GRAFANA_ADMIN_PASSWORD` not set in `.env.prod`. The compose file requires this variable.
  - Postgres Exporter: `POSTGRES_PASSWORD` not set or incorrect. The exporter needs the same credentials as the database.
  - Prometheus: invalid YAML in `prometheus.yml` or rule files. Validate with `promtool check config monitoring/prometheus.yml`.
  - Loki: permissions issue on the data volume. Loki runs as a non-root user and needs write access to `/loki`.
Disk Space Growing
Monitoring data can accumulate over time, especially on busy systems.
- Check volume sizes: `docker system df -v | grep -E 'prometheus|grafana|loki'`
- Reduce Prometheus retention: lower `--storage.tsdb.retention.time` from `15d` to `7d` in the compose file.
- Reduce Loki retention: lower `retention_period` in `monitoring/loki-config.yml` (e.g., from `336h` to `168h`).
- Prune old Docker volumes if containers were previously removed without cleaning up: `docker volume prune`
- Restart the affected containers after configuration changes.