Monitoring — Metrics, Alerting & Dashboards
Comprehensive observability layer for SCADA‑LTS and supporting tools (Watchdog, Cleanup, Rate Watcher). Includes metrics collection, alert rules, Grafana dashboards and incident routing best practices.
Core components
- Metrics: Prometheus collectors for JVM, Host, Application, Probe, and Cleanup metrics.
- Traces: Optional OpenTelemetry instrumentation for request traces and slow paths.
- Logs: Centralized ingestion (Loki/ELK) with structured JSON and correlation IDs.
- Dashboards: Grafana panels for system health, capacity, probe success rates and SLA windows.
- Alerting & Routing: Alertmanager with receiver chains (email, Slack, Ops SMS, PagerDuty, ServiceNow).
Quick facts
Stack: Prometheus + Alertmanager + Grafana + Loki/OTel
Retention: Metrics 15d (hot), 90d (cold) — configurable.
Dashboards: Export / Import JSON
Recommended metrics & labels
Design metric schema with consistent labels for service, instance, region, environment, and team. Example key metrics:
- service_up{service,instance,env}
- jvm_memory_bytes_used{area,service,instance}
- http_request_duration_seconds_bucket{le,handler,service}
- probe_success_ratio{probe,service,instance}
- cleanup_runs_total{status,rule_set}
- watchdog_restarts_total{service,reason}
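One way to keep the label schema consistent is to validate every sample against a declared label set before it is exposed. The sketch below is illustrative only: the label sets are copied from the list above, while `render_sample`, the service name `scada-lts` and the reason value `oom` are hypothetical. It renders a line in the Prometheus text exposition format using the standard library alone, not an official client.

```python
# Label sets copied from the key-metrics list; the helper names are illustrative.
METRIC_LABELS = {
    "service_up": ("service", "instance", "env"),
    "jvm_memory_bytes_used": ("area", "service", "instance"),
    "http_request_duration_seconds_bucket": ("le", "handler", "service"),
    "probe_success_ratio": ("probe", "service", "instance"),
    "cleanup_runs_total": ("status", "rule_set"),
    "watchdog_restarts_total": ("service", "reason"),
}


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Validate labels against the schema and render one exposition line."""
    expected = METRIC_LABELS.get(name)
    if expected is None:
        raise ValueError(f"unknown metric: {name}")
    if set(labels) != set(expected):
        raise ValueError(
            f"{name} expects labels {sorted(expected)}, got {sorted(labels)}"
        )
    # Sort label names so the same series always renders identically.
    pairs = ",".join(f'{key}="{_escape(labels[key])}"' for key in sorted(labels))
    return f"{name}{{{pairs}}} {value}"


# Hypothetical sample: three Watchdog restarts caused by out-of-memory kills.
line = render_sample(
    "watchdog_restarts_total", {"service": "scada-lts", "reason": "oom"}, 3
)
print(line)  # watchdog_restarts_total{reason="oom",service="scada-lts"} 3
```

Rejecting unexpected labels at this point keeps ad hoc dimensions from creeping into the schema and inflating series cardinality.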
Alerting strategy & example rules
Use multi‑tier alerting: P0 (page immediately), P1 (notify on‑call), P2 (email/Slack). Suppress noisy signals with grouping and inhibition rules.
# Example Prometheus alert (YAML snippet)
- alert: ServiceDown
  expr: up{job="scada-service"} == 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.service }} is down on {{ $labels.instance }}"
    description: "No healthy targets for service {{ $labels.service }} for >3m."
Inhibition and routing examples: suppress non‑critical CPU alerts during maintenance windows; route P0 to PagerDuty and SMS, and P1 to Slack and email.
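The tiering and inhibition logic can be prototyped before it is written as Alertmanager configuration. The following is a minimal sketch, not Alertmanager's actual matching engine. The P0 and P1 receivers follow the routing described above; the P2 receiver, the `tier` field, the alert name `HighCPU` and the fallback for unknown tiers are assumptions made for illustration.

```python
# Receivers per tier. P0 and P1 follow the routing described in the text;
# the P2 entry is an assumption for this sketch.
ROUTES = {
    "P0": ["pagerduty", "sms"],
    "P1": ["slack", "email"],
    "P2": ["email"],
}

# Hypothetical alert names treated as non-critical CPU noise.
MAINTENANCE_INHIBITED = {"HighCPU"}


def receivers_for(alert, in_maintenance=False):
    """Return the receivers for an alert, or [] if it is inhibited."""
    tier = alert["tier"]
    inhibited = (
        in_maintenance
        and tier != "P0"  # never suppress a page-worthy alert
        and alert["alertname"] in MAINTENANCE_INHIBITED
    )
    if inhibited:
        return []
    # Assumption: unknown tiers fall back to the least intrusive route.
    return list(ROUTES.get(tier, ROUTES["P2"]))


print(receivers_for({"alertname": "ServiceDown", "tier": "P0"}, in_maintenance=True))
print(receivers_for({"alertname": "HighCPU", "tier": "P2"}, in_maintenance=True))
```

In a real deployment the same decisions would live in Alertmanager's routing tree and inhibit rules; a table like this is mainly useful for reviewing the intended behavior with the on-call team.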
Grafana dashboards — essential panels
- Cluster Overview — service_up, instance counts, alert state summary.
- JVM Health — heap/non‑heap, GC pause histogram, thread states.
- Probe Health — probe success rate by probe and instance, recent failures.
- Latency & Error Budget — p99/p95 request latency, error rates, SLA burn rate.
- Cleanup & Archival — last run time, items archived, verification failures.
- Capacity & Storage — DB size growth, archive storage usage, reclaimable space.
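The SLA burn rate shown on the Latency & Error Budget panel is commonly defined as the observed error ratio divided by the error budget (1 minus the SLO target). Here is a small sketch of that arithmetic, assuming request counts taken over a single window; the function name and the 99.9% target are illustrative and do not come from the dashboards themselves.

```python
def burn_rate(failed, total, slo_target):
    """Return how fast the error budget is being consumed (1.0 = on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    error_ratio = failed / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Hypothetical window: 50 failed requests out of 10,000 against a 99.9% target.
# The error ratio is 0.5% and the budget is 0.1%, so the rate is about 5.
rate = burn_rate(failed=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.2f}")
```

A burn rate above 1 means the budget will run out before the SLA window ends if the current error ratio persists.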
Logging & Tracing
Correlate logs with metrics using a correlation_id propagated in headers. Use Loki for logs and OTel for traces to connect slow traces to alerting signals.
- Structured logs: include service, instance, request_id, correlation_id, user (if applicable).
- Trace sample rate: start low (0.1%) and increase for error paths or performance investigations.
- Retention & access: logs 30–90 days depending on compliance; archive longer to cold storage.
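The structured-log fields listed above can be produced with the standard `logging` module alone. The sketch below is one possible shape, not a required format: the header name `X-Correlation-ID`, the logger name, the `ts` and `level` fields and the service and instance values are assumptions.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Format log records as one JSON object per line."""

    def __init__(self, service, instance):
        super().__init__()
        self.service = service
        self.instance = instance

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "instance": self.instance,
            # These attributes are set per call through logging's `extra=`.
            "request_id": getattr(record, "request_id", None),
            "correlation_id": getattr(record, "correlation_id", None),
            "user": getattr(record, "user", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def correlation_id_from(headers):
    """Reuse the incoming correlation ID, or mint a new one if absent."""
    # Assumption: the ID travels in an X-Correlation-ID header.
    return headers.get("X-Correlation-ID") or uuid.uuid4().hex


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="scada-lts", instance="node-1"))
log = logging.getLogger("scada.monitoring")
log.addHandler(handler)
log.setLevel(logging.INFO)

cid = correlation_id_from({"X-Correlation-ID": "abc-123"})
log.info("probe completed", extra={"correlation_id": cid, "request_id": "req-1"})
```

Forwarding the same `correlation_id` on outgoing requests lets a log query jump from one slow trace to every log line that belongs to it.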
Operational playbooks
For each alert, create a one‑page runbook containing: immediate checks, mitigation steps, rollbacks and post‑mortem triggers.
- ServiceDown: check instance logs → run probe commands → verify network → scale or restart with Watchdog hooks.
- HighGC: identify memory leak candidates → increase heap temporarily → enable allocation sampling → schedule heap dump.
- ProbeFailuresHigh: check upstream dependencies → verify DNS & certificates → escalate to team owning dependency.
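Keeping runbooks as structured records makes it possible to lint them for missing sections, for example in CI. The sketch below models the four sections named above as a dataclass. The class and field names are hypothetical, and the split of the ServiceDown steps into checks and mitigation is one reading of that playbook.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """One-page runbook for a single alert."""

    alert: str
    immediate_checks: list = field(default_factory=list)
    mitigation_steps: list = field(default_factory=list)
    rollbacks: list = field(default_factory=list)
    postmortem_triggers: list = field(default_factory=list)

    def missing_sections(self):
        """Return the names of required sections that are still empty."""
        sections = (
            "immediate_checks",
            "mitigation_steps",
            "rollbacks",
            "postmortem_triggers",
        )
        return [name for name in sections if not getattr(self, name)]


service_down = Runbook(
    alert="ServiceDown",
    immediate_checks=["check instance logs", "run probe commands", "verify network"],
    mitigation_steps=["scale or restart with Watchdog hooks"],
)

# Rollbacks and post-mortem triggers have not been written yet.
print(service_down.missing_sections())  # ['rollbacks', 'postmortem_triggers']
```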
Scaling & High‑availability
Run Prometheus in HA (identical replicas, with federation or Thanos/Cortex for aggregation and long‑term retention), deploy Alertmanager in clustered mode, and use redundant Grafana instances backed by a shared DB for dashboards.
- Long‑term metrics: Thanos or Cortex for multi‑year retention and cross‑cluster queries.
- Disaster recovery: backup Alertmanager configs, Grafana dashboards and Prometheus rules to repo with CI/CD deployment.
- Access control: use Grafana RBAC and datasource permissions; restrict edit rights to dashboards that affect on‑call behavior.
Integrations & Notifications
Suggested receivers: Email (team buckets), Slack channels per team, SMS for critical P0, PagerDuty for escalation, ServiceNow for traceable incidents.
Templates & Resources
- Prometheus alert rules repo (example):
- Grafana dashboard JSON exports:
- Runbook templates and incident report forms:
