
Device Reliability

Device Reliability tracks the operational health of every managed device in your fleet by collecting crash events, application hangs, service failures, hardware errors, and uptime data. The system computes a reliability score from 0 to 100 for each device using a weighted formula, identifies trend direction over time using linear regression, and surfaces the top issues affecting each device. Scores are recomputed automatically each time new telemetry arrives from the agent.

The Breeze agent collects reliability telemetry from platform-specific event sources: Windows Event Log, macOS system logs, and Linux journal/syslog. Each heartbeat cycle, the agent sends a snapshot of recent events to the API, which stores the raw history and then triggers an asynchronous score computation via BullMQ (with an inline fallback if the queue is unavailable).


  1. Agent collects telemetry from the OS event log system. On Windows, this includes Event Log entries for BSODs, service crashes, hardware WHEA errors, and application hangs. On macOS, the agent reads system logs for kernel panics, application crashes, and launchd service failures. On Linux, the agent reads journald/syslog for kernel panics, OOM kills, systemd failures, and hardware errors.

  2. Agent submits metrics to POST /agents/:id/reliability with the structured payload: uptime seconds, boot time, crash events, application hangs, service failures, and hardware errors.

  3. API stores raw history in the device_reliability_history table. Each submission creates a new row, preserving the full event timeline.

  4. Score computation is enqueued via BullMQ. If the queue is unavailable, the computation runs inline as a fallback.

  5. The scoring service reads up to 90 days of history, aggregates events into daily buckets, computes sub-scores for each reliability factor, applies weights, and persists the result to the device_reliability table.
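Steps 2 and 4 above can be sketched as follows. This is an illustrative sketch, not the actual Breeze implementation: the names `enqueueScoreComputation`, `ScoreJob`, and the minimal `Queue` interface are hypothetical stand-ins for the real BullMQ wiring.

```typescript
// Hypothetical sketch of step 4: enqueue the score computation,
// falling back to an inline computation if the queue is unavailable.
type ScoreJob = { deviceId: string };

interface Queue {
  add(name: string, data: ScoreJob): Promise<void>;
}

async function enqueueScoreComputation(
  queue: Queue | null,
  deviceId: string,
  computeInline: (deviceId: string) => Promise<void>,
): Promise<"queued" | "inline"> {
  if (queue) {
    try {
      // Preferred path: hand the job to the BullMQ worker.
      await queue.add("device-reliability-computation", { deviceId });
      return "queued";
    } catch {
      // Queue unavailable; fall through to the inline path below.
    }
  }
  // Fallback: compute in-process so new telemetry is never left unscored.
  await computeInline(deviceId);
  return "inline";
}
```

The fallback keeps score freshness decoupled from queue availability: a down worker degrades latency, not correctness.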

| Band | Score Range | Meaning |
| --- | --- | --- |
| Critical | 0-50 | Device has significant stability problems requiring immediate attention |
| Poor | 51-70 | Device is experiencing frequent issues that affect usability |
| Fair | 71-85 | Device is generally stable but has notable issues |
| Good | 86-100 | Device is operating reliably with minimal issues |
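A minimal helper mapping a score to its band, per the table above (the function name `scoreBand` is illustrative):

```typescript
// Map a 0-100 reliability score to its band.
type Band = "critical" | "poor" | "fair" | "good";

function scoreBand(score: number): Band {
  if (score <= 50) return "critical";
  if (score <= 70) return "poor";
  if (score <= 85) return "fair";
  return "good";
}
```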

The reliability score is a weighted composite of five sub-scores, each calculated from the device’s event history over rolling time windows (7, 30, and 90 days).

| Factor | Weight | Description |
| --- | --- | --- |
| Uptime | 30% | Based on the 90-day uptime percentage: score 100 at 100% uptime, declining linearly to 0 at 90% or below |
| Crashes | 25% | Penalizes crash events; recent (7-day) crashes are weighted more heavily than 30-day crashes |
| Hangs | 15% | Penalizes application hangs, with an extra penalty for unresolved hangs |
| Service Failures | 15% | Penalizes service failures, with partial credit for auto-recovered services |
| Hardware Errors | 15% | Penalizes hardware errors by severity: critical (-30), error (-15), warning (-5) per event |

The overall score is computed as:

```
reliabilityScore = clamp(0, 100,
    uptimeScore * 0.30
  + crashScore * 0.25
  + hangScore * 0.15
  + serviceFailureScore * 0.15
  + hardwareErrorScore * 0.15
)
```

Sub-scores are computed as follows. Each event-based sub-score starts at 100 and is reduced by per-event penalty multipliers; the uptime score is instead interpolated from the 90-day uptime percentage:

  • Uptime Score: 100 at 100% uptime, 0 at 90% or below, with linear interpolation between 90% and 100%.
  • Crash Score: 100 - (crashCount30d + crashCount7d * 0.5) * 20
  • Hang Score: 100 - hangCount30d * 10 - unresolvedHangCount30d * 20
  • Service Failure Score: 100 - serviceFailureCount30d * 15 + recoveredServiceCount30d * 5
  • Hardware Error Score: 100 - criticalCount30d * 30 - errorCount30d * 15 - warningCount30d * 5
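The formulas above can be sketched in one function. This is a sketch under assumptions: the input field names are illustrative, and clamping each sub-score to 0-100 before weighting is an assumption not stated in the formulas (only the final composite is explicitly clamped).

```typescript
// Sketch of the composite scoring math described above.
// Field names on EventCounts are illustrative, not the real schema.
interface EventCounts {
  uptimePct90d: number; // e.g. 99.5
  crashCount30d: number;
  crashCount7d: number;
  hangCount30d: number;
  unresolvedHangCount30d: number;
  serviceFailureCount30d: number;
  recoveredServiceCount30d: number;
  hwCritical30d: number;
  hwError30d: number;
  hwWarning30d: number;
}

const clamp = (lo: number, hi: number, v: number) =>
  Math.min(hi, Math.max(lo, v));

function reliabilityScore(c: EventCounts): number {
  // Uptime: 100 at 100% uptime, 0 at <= 90%, linear in between.
  const uptimeScore = clamp(0, 100, ((c.uptimePct90d - 90) / 10) * 100);
  const crashScore = clamp(0, 100,
    100 - (c.crashCount30d + c.crashCount7d * 0.5) * 20);
  const hangScore = clamp(0, 100,
    100 - c.hangCount30d * 10 - c.unresolvedHangCount30d * 20);
  const serviceFailureScore = clamp(0, 100,
    100 - c.serviceFailureCount30d * 15 + c.recoveredServiceCount30d * 5);
  const hardwareErrorScore = clamp(0, 100,
    100 - c.hwCritical30d * 30 - c.hwError30d * 15 - c.hwWarning30d * 5);

  // Weighted composite, clamped to the 0-100 range.
  return Math.round(clamp(0, 100,
    uptimeScore * 0.30
    + crashScore * 0.25
    + hangScore * 0.15
    + serviceFailureScore * 0.15
    + hardwareErrorScore * 0.15));
}
```

For example, a device with perfect 90-day uptime and no events scores 100, while the same device at 95% uptime drops to 85 (uptime sub-score 50 at weight 30%).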

Trend direction is computed using linear regression over 30 days of daily reliability estimates. Each day’s events are scored independently, and a regression line is fitted to the daily scores.

| Trend | Slope Threshold | Meaning |
| --- | --- | --- |
| improving | slope > 2 | Reliability is getting better over time |
| stable | -2 ≤ slope ≤ 2 | Reliability is holding steady |
| degrading | slope < -2 | Reliability is getting worse over time |

The trendConfidence field (0.0 to 1.0) indicates how well the linear model fits the data, factoring in both R-squared and data coverage (at least 14 days of data for full confidence).
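The trend computation can be sketched as a least-squares fit over the daily scores. The slope thresholds follow the table above; the exact confidence formula is an assumption here (R-squared scaled by data coverage up to 14 days), and the function name is illustrative.

```typescript
// Sketch: classify a trend from a series of daily reliability scores.
type Trend = "improving" | "stable" | "degrading";

function trendFromDailyScores(scores: number[]): {
  trend: Trend; slope: number; confidence: number;
} {
  const n = scores.length;
  // Too little data: default to stable with zero confidence.
  if (n < 3) return { trend: "stable", slope: 0, confidence: 0 };

  // Least-squares fit y = a + b*x, where x is the day index.
  const xMean = (n - 1) / 2;
  const yMean = scores.reduce((s, y) => s + y, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  scores.forEach((y, x) => {
    sxy += (x - xMean) * (y - yMean);
    sxx += (x - xMean) ** 2;
    syy += (y - yMean) ** 2;
  });
  const slope = sxy / sxx;
  // R-squared; a perfectly flat series fits its own mean exactly.
  const r2 = syy === 0 ? 1 : (sxy * sxy) / (sxx * syy);

  // Assumed blend: full confidence needs 14+ days AND a good fit.
  const confidence = Math.min(1, n / 14) * r2;

  const trend: Trend =
    slope > 2 ? "improving" : slope < -2 ? "degrading" : "stable";
  return { trend, slope, confidence };
}
```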

MTBF is calculated from the 90-day window as:

```
mtbfHours = operatingHours / totalFailureCount
```

Where total failures include crashes, hangs, service failures, and hardware errors over the 90-day window. MTBF is null when there are zero failures or zero operating hours.
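A direct translation of that rule (function name illustrative):

```typescript
// MTBF per the formula above: null when there are no failures
// or no operating hours in the 90-day window.
function mtbfHours(
  operatingHours: number,
  totalFailureCount: number,
): number | null {
  if (totalFailureCount <= 0 || operatingHours <= 0) return null;
  return operatingHours / totalFailureCount;
}
```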


System-level crashes that indicate an unexpected shutdown or critical failure.

| Crash Type | Description | Platforms |
| --- | --- | --- |
| bsod | Blue Screen of Death / bugcheck | Windows |
| kernel_panic | Kernel panic or oops | Windows, macOS, Linux |
| system_crash | General system or application crash | Windows, macOS |
| oom_kill | Out-of-memory kill | Linux |
| unknown | Unclassified crash event | All |

Detected when a process is reported as “not responding” or “hang” in system event logs.

| Field | Type | Description |
| --- | --- | --- |
| processName | string | Name of the hanging process |
| timestamp | ISO 8601 | When the hang was detected |
| duration | integer | Duration of the hang in seconds (0 if unknown) |
| resolved | boolean | Whether the hang resolved without intervention |

Detected when system services terminate unexpectedly or fail to start.

| Field | Type | Description |
| --- | --- | --- |
| serviceName | string | Name of the failed service |
| timestamp | ISO 8601 | When the failure occurred |
| errorCode | string | OS-specific error code or event ID |
| recovered | boolean | Whether the service auto-recovered |

Hardware-level errors from WHEA (Windows), MCE, disk I/O, and memory subsystems.

| Hardware Type | Classification Criteria |
| --- | --- |
| mce | Machine Check Exception: WHEA source, "machine check", or "mce" keywords |
| memory | Memory errors: EDAC, Event ID 13/50/51, or "memory" keyword |
| disk | Disk errors: I/O errors, Event ID 7/11/15, or "disk"/"blk_update_request" keywords |
| unknown | Hardware error that does not match known patterns |
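The keyword side of that classification can be sketched as below. This covers only the text-matching criteria; the real collector also checks structured fields such as event IDs, which are omitted here.

```typescript
// Sketch: classify a hardware error from its source and message text,
// per the keyword criteria in the table above. Event-ID matching is
// intentionally omitted in this simplified version.
type HardwareType = "mce" | "memory" | "disk" | "unknown";

function classifyHardwareError(source: string, message: string): HardwareType {
  const text = `${source} ${message}`.toLowerCase();
  if (text.includes("whea") || text.includes("machine check") || /\bmce\b/.test(text))
    return "mce";
  if (text.includes("edac") || text.includes("memory")) return "memory";
  if (text.includes("disk") || text.includes("blk_update_request") || text.includes("i/o error"))
    return "disk";
  return "unknown";
}
```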

Hardware errors are further classified by severity:

| Severity | Weight in Score |
| --- | --- |
| critical | -30 per event |
| error | -15 per event |
| warning | -5 per event |

The Windows collector reads from the Windows Event Log via the EventLogCollector. Detected signals include:

  • BSOD/Bugcheck: Event IDs 1001, 6008; messages containing “bugcheck”, “blue screen”, or “unexpected shutdown”
  • Service failures: Event ID 7034; messages with “service terminated” or “service failed”
  • Application hangs: Messages containing “hang” or “not responding”
  • Hardware errors: WHEA events, disk errors, memory errors
  • System crashes: Critical-level system events containing “crash”

Windows provides the richest reliability telemetry due to the structured Event Log system.
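The BSOD/bugcheck rule above can be sketched as a simple predicate. This is an assumption-level sketch of the matching logic, not the actual EventLogCollector code:

```typescript
// Sketch of the BSOD/bugcheck match described above:
// Event IDs 1001 and 6008, or telltale keywords in the message.
function isBugcheckEvent(eventId: number, message: string): boolean {
  if (eventId === 1001 || eventId === 6008) return true;
  const text = message.toLowerCase();
  return (
    text.includes("bugcheck") ||
    text.includes("blue screen") ||
    text.includes("unexpected shutdown")
  );
}
```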


List reliability scores for all devices in your organization, sorted worst-first by default:

```
GET /reliability?orgId=uuid&scoreRange=critical&trendDirection=degrading&page=1&limit=25
```

| Parameter | Type | Description |
| --- | --- | --- |
| orgId | UUID | Filter by organization |
| siteId | UUID | Filter by site |
| scoreRange | string | Filter by band: critical, poor, fair, good (also accepts legacy 0-50, 51-70, 71-85, 86-100 format) |
| trendDirection | string | Filter by trend: improving, stable, degrading |
| issueType | string | Filter by issue type: crashes, hangs, hardware, services, uptime |
| minScore | integer | Minimum reliability score (0-100) |
| maxScore | integer | Maximum reliability score (0-100) |
| page | integer | Page number (default 1) |
| limit | integer | Results per page (1-100, default 25) |

The response includes a summary section with the average score, count of critical devices (score ≤ 50), and count of degrading devices:

```json
{
  "data": [...],
  "pagination": { "total": 150, "page": 1, "limit": 25, "totalPages": 6 },
  "summary": {
    "averageScore": 78,
    "criticalDevices": 5,
    "degradingDevices": 12
  }
}
```
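A client-side helper for building the query string from these filters might look like the following. The `ReliabilityListFilters` type and `reliabilityListUrl` name are illustrative, not part of the documented API:

```typescript
// Illustrative client helper: build the list-endpoint URL from the
// documented filter parameters, skipping any that are unset.
interface ReliabilityListFilters {
  orgId?: string;
  siteId?: string;
  scoreRange?: "critical" | "poor" | "fair" | "good";
  trendDirection?: "improving" | "stable" | "degrading";
  issueType?: "crashes" | "hangs" | "hardware" | "services" | "uptime";
  minScore?: number;
  maxScore?: number;
  page?: number;
  limit?: number;
}

function reliabilityListUrl(filters: ReliabilityListFilters): string {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    if (value !== undefined) params.set(key, String(value));
  }
  return `/reliability?${params.toString()}`;
}
```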

Get a high-level reliability overview for an organization, including the 10 worst devices:

```
GET /reliability/org/:orgId/summary
```

The response includes:

| Field | Description |
| --- | --- |
| devices | Total device count with reliability data |
| averageScore | Organization-wide average reliability score |
| criticalDevices | Devices with score 0-50 |
| poorDevices | Devices with score 51-70 |
| fairDevices | Devices with score 71-85 |
| goodDevices | Devices with score 86-100 |
| degradingDevices | Devices with a degrading trend |
| topIssues | Ranked list of most common issue types across the org |
| worstDevices | The 10 lowest-scoring devices with full reliability details |

Get the full reliability snapshot and 30-day history for a specific device:

```
GET /reliability/:deviceId
```

The response contains two sections:

  • snapshot — The current computed reliability state: overall score, all sub-scores, uptime percentages (7d/30d/90d), event counts, MTBF, trend direction and confidence, and top issues.
  • history — An array of daily data points for the last 30 days, each containing sample count, max uptime seconds, crash/hang/service failure/hardware error counts, and a daily reliability estimate.

Retrieve daily reliability history for a configurable lookback window:

```
GET /reliability/:deviceId/history?days=90
```

| Parameter | Type | Description |
| --- | --- | --- |
| days | integer | Lookback window in days (1-365, default 90) |

Each data point in the response represents one day and includes:

| Field | Type | Description |
| --- | --- | --- |
| date | string | Day in YYYY-MM-DD format |
| sampleCount | integer | Number of telemetry submissions that day |
| uptimeSecondsMax | integer | Highest reported uptime that day |
| crashCount | integer | Total crash events |
| hangCount | integer | Total application hangs |
| serviceFailureCount | integer | Total service failures |
| hardwareErrorCount | integer | Total hardware errors |
| reliabilityEstimate | integer | Estimated reliability score for that day (0-100) |

The Breeze AI assistant can query device reliability data through its built-in tool system. The query_device_reliability tool allows natural language questions about fleet reliability to be answered with real data.

The AI tool supports the same filters as the list API: organization, score range, trend direction, issue type, and score bounds. When invoked, it returns the same paginated results with a summary section, allowing the AI to answer questions like:

  • “Which devices have the worst reliability scores?”
  • “How many devices are in a degrading trend?”
  • “Show me all devices with hardware errors in the last 30 days”
  • “What is the average reliability score for Contoso?”

| Method | Path | Description |
| --- | --- | --- |
| GET | /reliability | List device reliability scores with filtering and pagination |
| GET | /reliability/org/:orgId/summary | Organization-level reliability summary with worst devices |
| GET | /reliability/:deviceId | Full reliability snapshot and 30-day history for a device |
| GET | /reliability/:deviceId/history | Daily reliability history with configurable lookback (?days=) |
| Method | Path | Description |
| --- | --- | --- |
| POST | /agents/:id/reliability | Submit reliability metrics from the agent (agent auth required) |

No reliability data for a device. Reliability data appears after the agent has submitted at least one telemetry payload via POST /agents/:id/reliability. Confirm the agent is online and the heartbeat cycle is running. The agent includes a 24-hour initial lookback on first collection, so the first submission should include recent events. If the device exists but has no reliability snapshot, the scoring computation may not have run yet — check BullMQ worker status.

Reliability score seems too low despite no visible issues. The score is a composite of five factors with different weights. A device can have a low score due to a single factor being severely penalized. Use GET /reliability/:deviceId to inspect the individual sub-scores (uptimeScore, crashScore, hangScore, serviceFailureScore, hardwareErrorScore) and identify which factor is dragging the score down. For example, a 90% uptime over 90 days produces an uptime sub-score of 0, which alone would reduce the overall score by up to 30 points.

Trend direction shows stable with low confidence. Trend computation requires at least 3 days of data and achieves full confidence at 14+ days. If the device was recently enrolled or has sparse telemetry, the trend will default to stable with trendConfidence: 0. Allow the device to accumulate more history before relying on trend data.

Agent event collection failing on specific platform. On all platforms, if the event log collector encounters an error, the reliability collector gracefully falls back to base metrics (uptime and boot time only). Check agent logs for warnings like “reliability event log collection failed, returning base metrics only”. Common causes include insufficient permissions to read system event logs, missing log sources, or the event log service being stopped.

MTBF showing null. MTBF is only computed when there is at least one failure event (crash, hang, service failure, or hardware error) in the 90-day window AND the device has positive operating hours. A device with zero failures has no meaningful MTBF — this is the ideal state. A device with zero uptime data also produces null MTBF.

Score not updating after new events arrive. Score computation is enqueued via BullMQ after each telemetry submission. If the queue worker is down, the system falls back to inline computation, but this fallback may fail silently if the database is under load. Check BullMQ dashboard for failed or stalled device-reliability-computation jobs. The computedAt timestamp on the reliability snapshot indicates when the score was last calculated.

Organization summary showing stale data. The org summary endpoint computes results in real time from the device_reliability table. If individual device scores have not been recomputed recently (check computedAt), the summary reflects outdated data. Trigger a fleet-wide recomputation by ensuring all agents are submitting telemetry and the reliability worker is processing jobs.