Monitoring — Metrics, Alerting & Dashboards

Comprehensive observability layer for SCADA‑LTS and supporting tools (Watchdog, Cleanup, Rate Watcher). Includes metrics collection, alert rules, Grafana dashboards and incident routing best practices.

Core components

Metrics: Prometheus collectors for JVM, Host, Application, Probe, and Cleanup metrics.
Traces: Optional OpenTelemetry instrumentation for request traces and slow paths.
Logs: Centralized ingestion (Loki/ELK) with structured JSON and correlation IDs.
Dashboards: Grafana panels for system health, capacity, probe success rates and SLA windows.
Alerting & Routing: Alertmanager with receiver chains (email, Slack, Ops SMS, PagerDuty, ServiceNow).

Quick facts

Stack: Prometheus + Alertmanager + Grafana + Loki/OTel

Retention: Metrics 15d (hot), 90d (cold) — configurable.

Dashboards: Export / Import JSON

Recommended metrics & labels

Design metric schema with consistent labels for service, instance, region, environment, and team. Example key metrics:

service_up{service,instance,env}
jvm_memory_bytes_used{area,service,instance}
http_request_duration_seconds_bucket{le,handler,service}
probe_success_ratio{probe,service,instance}
cleanup_runs_total{status,rule_set}
watchdog_restarts_total{service,reason}
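As an illustration of keeping labels consistent, the following minimal sketch renders a sample in the Prometheus text exposition format and rejects samples that lack the labels listed above. The helper names (REQUIRED_LABELS, render_sample) and the example label values are hypothetical and not part of SCADA‑LTS or any Prometheus client library; a real exporter would normally use an official client instead.

```python
# Hypothetical helper: renders one sample in the Prometheus text exposition
# format and rejects samples that miss the labels the schema above expects.
REQUIRED_LABELS = {
    "service_up": {"service", "instance", "env"},
    "probe_success_ratio": {"probe", "service", "instance"},
    "cleanup_runs_total": {"status", "rule_set"},
    "watchdog_restarts_total": {"service", "reason"},
}


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value) -> str:
    """Return one exposition line, with labels sorted by name."""
    missing = REQUIRED_LABELS.get(name, set()) - set(labels)
    if missing:
        raise ValueError(f"{name} is missing labels: {sorted(missing)}")
    body = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{body}}} {value}"
```

For example, render_sample("service_up", {"service": "scada-lts", "instance": "host1:8080", "env": "prod"}, 1) yields service_up{env="prod",instance="host1:8080",service="scada-lts"} 1, while a cleanup_runs_total sample without rule_set raises ValueError.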

Alerting strategy & example rules

Use multi‑tier alerting: P0 (page immediately), P1 (on‑call), P2 (email/Slack). Suppress noisy signals with grouping and inhibition rules.

# Example Prometheus alert (YAML snippet)
- alert: ServiceDown
  expr: up{job="scada-service"} == 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} is down on {{ $labels.instance }}"
    description: "No healthy targets for service {{ $labels.service }} for >3m."

Inhibition and routing examples: suppress non‑critical CPU alerts during maintenance windows; route P0 to PagerDuty and SMS, and P1 to Slack and email.
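The tiered routing and maintenance suppression can be modelled in a few lines. This is only a sketch of the decision logic, not Alertmanager configuration: the receiver names, the Alert type and the HighCPU alert name are assumptions made for illustration.

```python
from dataclasses import dataclass

# Tier-to-receiver mapping taken from the text; receiver names are illustrative.
ROUTES = {
    "P0": ["pagerduty", "sms"],
    "P1": ["slack", "email"],
    "P2": ["email", "slack"],
}


@dataclass(frozen=True)
class Alert:
    name: str      # e.g. "ServiceDown"; "HighCPU" below is a hypothetical name
    tier: str      # "P0", "P1" or "P2"
    instance: str


def is_inhibited(alert: Alert, maintenance_instances) -> bool:
    """Suppress non-critical CPU alerts for instances under maintenance."""
    return (
        alert.tier != "P0"
        and alert.name.startswith("HighCPU")
        and alert.instance in maintenance_instances
    )


def receivers_for(alert: Alert, maintenance_instances=frozenset()) -> list:
    """Return the receivers to notify, or an empty list if suppressed."""
    if is_inhibited(alert, maintenance_instances):
        return []
    return list(ROUTES[alert.tier])
```

With this model a P2 HighCPU alert on an instance in maintenance notifies nobody, while a P0 ServiceDown on the same instance still reaches PagerDuty and SMS.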

Grafana dashboards — essential panels

Cluster Overview — service_up, instance counts, alert state summary.
JVM Health — heap/non‑heap, GC pause histogram, thread states.
Probe Health — probe success rate by probe and instance, recent failures.
Latency & Error Budget — p99/p95 request latency, error rates, SLA burn rate.
Cleanup & Archival — last run time, items archived, verification failures.
Capacity & Storage — DB size growth, archive storage usage, reclaimable space.
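For the SLA burn rate panel, one common definition is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below assumes that definition and a 30-day SLO window; the function names and the window length are illustrative and not taken from the dashboards themselves.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is consumed exactly over the SLO window;
    larger values mean it runs out sooner.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    error_ratio = errors / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def days_until_budget_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until the budget is spent if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

As a worked case, 50 failed requests out of 10,000 against a 99.9% target gives a burn rate of roughly 5, which would exhaust a 30-day budget in about 6 days.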

Logging & Tracing

Correlate logs with metrics using a correlation_id propagated in headers. Use Loki for logs and OTel for traces to connect slow traces to alerting signals.

Structured logs: include service, instance, request_id, correlation_id, user (if applicable).
Trace sample rate: start low (0.1%) and increase for error paths or performance investigations.
Retention & access: logs 30–90 days depending on compliance; archive longer to cold storage.
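A minimal sketch of structured JSON logging with a propagated correlation ID, using only the Python standard library. The header name X-Correlation-ID, the JsonFormatter class and the helper function are assumptions for illustration; the services in this stack may use a different header or logging framework.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Formats each record as one JSON object with the fields listed above."""

    def __init__(self, service: str, instance: str):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "instance": self.instance,
            # request_id and user arrive via logger.info(..., extra={...}).
            "request_id": getattr(record, "request_id", None),
            "correlation_id": correlation_id_var.get(),
        }
        user = getattr(record, "user", None)
        if user is not None:
            payload["user"] = user
        return json.dumps(payload)


def correlation_id_from_headers(headers: dict) -> str:
    """Reuse the caller's ID if present (assumed header name), else create one."""
    return headers.get("X-Correlation-ID") or uuid.uuid4().hex
```

A request handler would call correlation_id_var.set(correlation_id_from_headers(headers)) on entry and forward the same value in outgoing requests, so that log lines and traces for one request share an ID.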

Operational playbooks

For each alert, create a one‑page runbook containing immediate checks, mitigation steps, rollbacks and post‑mortem triggers.

ServiceDown: check instance logs → run probe commands → verify network → scale or restart with Watchdog hooks.
HighGC: identify memory leak candidates → increase heap temporarily → enable allocation sampling → schedule heap dump.
ProbeFailuresHigh: check upstream dependencies → verify DNS & certificates → escalate to team owning dependency.

Scaling & High‑availability

Run Prometheus in HA (replica pairs, with federation or Thanos/Cortex for aggregation and long‑term retention), deploy Alertmanager in clustered mode, and run redundant Grafana instances backed by a shared DB for dashboards.

Long‑term metrics: Thanos or Cortex for multi‑year retention and cross‑cluster queries.
Disaster recovery: backup Alertmanager configs, Grafana dashboards and Prometheus rules to repo with CI/CD deployment.
Access control: use Grafana RBAC and datasource permissions; restrict edit rights to dashboards that affect on‑call behavior.

Integrations & Notifications

Suggested receivers: Email (team buckets), Slack channels per team, SMS for critical P0, PagerDuty for escalation, ServiceNow for traceable incidents.

Templates & Resources

Prometheus alert rules repo (example):
Grafana dashboard JSON exports:
Runbook templates and incident report forms: