Datapoint Cleaner — Data Hygiene, Normalization & Retention
Service that cleans, normalizes and retires time‑series datapoints before they reach long‑term storage or analytics. Designed to reduce noise, storage costs and false alerts while preserving auditability and compliance.
Purpose & scope
- Filter out obvious noise (zero‑spikes, duplicate writes, format errors) and tag doubtful datapoints for review.
- Normalize timestamps, units and labels to canonical schemas before ingestion (Prometheus, Influx, OpenTelemetry metrics).
- Apply retention and TTL rules (hot storage vs. cold archive), plus aggregation rollups to reduce long‑term footprint.
- Respect audit and compliance needs: keep tamper‑evident logs of cleaning decisions and enable replay to restore raw data if required.
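The normalization step listed above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: the alias table, the function names, the choice of Celsius as the canonical unit and the 10-second window are all assumptions made for the example.

```python
# Hypothetical alias table: lowercase label keys mapped to canonical keys.
LABEL_ALIASES = {"hostname": "host", "dc": "datacenter"}


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to Celsius (assumed canonical unit)."""
    unit = unit.strip().upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unsupported unit: {unit}")


def round_timestamp(ts: float, window_s: int) -> int:
    """Floor a Unix timestamp (seconds) to the start of its window."""
    return int(ts // window_s) * window_s


def canonical_labels(labels: dict) -> dict:
    """Lowercase keys, apply aliases, strip values, sort by key."""
    renamed = {}
    for key, value in labels.items():
        key = key.strip().lower()
        renamed[LABEL_ALIASES.get(key, key)] = str(value).strip()
    return dict(sorted(renamed.items()))


def normalize(point: dict, window_s: int = 10) -> dict:
    """Return a copy of the datapoint in the assumed canonical schema."""
    return {
        "metric": point["metric"],
        "ts": round_timestamp(point["ts"], window_s),
        "value": round(to_celsius(point["value"], point["unit"]), 3),
        "unit": "C",
        "labels": canonical_labels(point.get("labels", {})),
    }
```

Flooring rather than rounding to nearest keeps every sample inside the window it was observed in, which tends to make later rollups easier to reason about.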
Quick facts
Package: Datapoint Cleaner v
Modes: realtime stream, batch cleanup, review queue
Download / Repo:
Core features
- Validation rules engine: JSON/YAML rule sets for range checks, allowed value lists, unit checks and heartbeat detection.
- Adaptive noise filters: automatic detection and suppression of single‑sample spikes, short‑lived flaps and duplicated series.
- Schema normalization: label canonicalization, unit conversion, timestamp rounding (configurable windows).
- Review queue & human-in-the-loop: flag borderline datapoints to a dashboard for operator review with contextual logs and sample history.
- Retention & rollup: configure hot window (raw), aggregated medium window (1m/5m rollups), and long‑term cold store (hourly/daily summaries).
- Auditability: append immutable cleaning decisions to an append‑only log (WORM or signed ledger) for compliance and forensics.
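As an illustration of the retention and rollup tiers, the sketch below aggregates raw samples into fixed windows and picks a storage tier by age. The function names and the summary fields (min, max, avg, count) are assumptions for the example. The default tier boundaries reuse the 7d/90d/365d values from the sample rule in this document.

```python
from collections import defaultdict


def rollup(points, window_s=60):
    """Aggregate (ts, value) pairs into fixed windows of window_s seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // window_s) * window_s].append(value)
    return [
        {
            "ts": start,
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    ]


def tier_for_age(age_days, hot=7, aggregated=90, cold=365):
    """Pick the storage tier for a datapoint of the given age in days."""
    if age_days <= hot:
        return "hot"
    if age_days <= aggregated:
        return "aggregated"
    if age_days <= cold:
        return "cold"
    return "expired"
```

Keeping `count` alongside `avg` lets a later hourly or daily rollup compute a correctly weighted average from the 1m/5m summaries.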
Recommended rules & examples
    # Example rule: temperature sensors
    - metric: env.temperature
      unit: C
      min: -40
      max: 85
      spike_threshold: 20   # suppress changes > 20°C within 1 sample unless repeated
      duplicate_window: 10s
      retention:
        hot: 7d
        aggregated: 90d
        cold: 365d
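One possible reading of how such a rule is evaluated per series is sketched below. The class and result strings are hypothetical, and two interpretations are assumptions: a "duplicate" is taken to mean the same value arriving again within `duplicate_window` of the last accepted sample, and "unless repeated" is taken to mean that a jump is accepted once the following sample lands near the same new level.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    min: float = -40.0
    max: float = 85.0
    spike_threshold: float = 20.0
    duplicate_window_s: float = 10.0


class SeriesCleaner:
    """Stateful checker for a single series (hypothetical helper)."""

    def __init__(self, rule: Rule):
        self.rule = rule
        self.last_ts = None        # timestamp of the last accepted sample
        self.last_value = None     # value of the last accepted sample
        self.pending_spike = None  # suppressed jump awaiting confirmation

    def check(self, ts: float, value: float) -> str:
        r = self.rule
        if not (r.min <= value <= r.max):
            return "drop:out_of_range"
        if (
            self.last_ts is not None
            and value == self.last_value
            and ts - self.last_ts < r.duplicate_window_s
        ):
            return "drop:duplicate"
        if (
            self.last_value is not None
            and abs(value - self.last_value) > r.spike_threshold
        ):
            # Accept the jump only if the previous sample already jumped
            # to a similar level; otherwise flag it and wait.
            confirmed = (
                self.pending_spike is not None
                and abs(value - self.pending_spike) <= r.spike_threshold
            )
            if not confirmed:
                self.pending_spike = value
                return "flag:spike"
        self.pending_spike = None
        self.last_ts, self.last_value = ts, value
        return "keep"
```

Because the result depends only on the rule and the ordered input, replaying the same raw series through a fresh `SeriesCleaner` yields the same decisions.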
Example: probe heartbeat
    metric: probe.heartbeat
    expect_interval: 60s
    alert_on_missing: 180s
    action: mark_probe_stale
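A heartbeat rule like this could be evaluated with a check along the following lines. The function name and the intermediate "late" state are assumptions for illustration. Only the two thresholds and the stale action come from the rule itself.

```python
def probe_state(last_seen_s, now_s, expect_interval_s=60, alert_on_missing_s=180):
    """Classify a probe from the time elapsed since its last heartbeat."""
    gap = now_s - last_seen_s
    if gap >= alert_on_missing_s:
        return "stale"  # would trigger action: mark_probe_stale
    if gap > expect_interval_s:
        return "late"   # missed at least one beat, not yet alerting
    return "ok"
```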
Operational patterns
- Run Cleaner at the gateway edge (low latency) for early filtering; perform a second pass centrally for enrichment and rollups.
- Use deterministic idempotent cleaning operations so replay of raw logs yields the same cleaned output (important for audits).
- Store decisions with correlation IDs so incidents can trace back how a datapoint was transformed or dropped.
- Provide a “restore raw” path: keep raw payloads for the hot window with cryptographic checksums to enable recovery if cleaning rules were too aggressive.
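The replay and traceability patterns above could look roughly like this. The record layout and helper names are hypothetical. The sketch assumes the correlation ID is derived from the raw payload checksum and a rules version string, so that replaying the same raw log under the same rules reproduces the same IDs.

```python
import hashlib


def checksum(raw: bytes) -> str:
    """SHA-256 checksum of a raw payload, as a hex string."""
    return hashlib.sha256(raw).hexdigest()


def decision_record(raw: bytes, rules_version: str, action: str) -> dict:
    """Build a deterministic record of one cleaning decision."""
    raw_sha = checksum(raw)
    # Derived, not random: identical input and rules give an identical ID.
    corr = hashlib.sha256(f"{rules_version}:{raw_sha}".encode()).hexdigest()[:16]
    return {
        "correlation_id": corr,
        "raw_sha256": raw_sha,
        "rules_version": rules_version,
        "action": action,
    }


def verify_raw(raw: bytes, record: dict) -> bool:
    """Check that a retained raw payload still matches its decision record."""
    return checksum(raw) == record["raw_sha256"]
```

A restore path would look up the retained raw payload by `raw_sha256`, confirm it with `verify_raw`, and re-run it through the corrected rules.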
Security, privacy & compliance
- Mask or remove PII fields before logs leave the local network. Avoid storing raw PII in long‑term archives.
- Use signing and checksums on logs; maintain WORM storage or write‑once append‑only logs for audit trails where required.
- Define retention policies aligned with GDPR: specify how long raw payloads and cleaned outputs are kept and who can request deletion.
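One common way to make an append-only log tamper-evident is to chain each entry to the previous one with a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library. The entry layout and function names are assumptions, and a production setup would also need key management and WORM-backed storage, which are out of scope here.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def append_entry(log: list, decision: dict, key: bytes) -> None:
    """Append a decision, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(decision, sort_keys=True, separators=(",", ":"))
    digest = hmac.new(key, (prev + payload).encode(), hashlib.sha256).hexdigest()
    log.append({"prev": prev, "payload": payload, "hash": digest})


def verify_log(log: list, key: bytes) -> bool:
    """Recompute the chain; any edited, removed or reordered entry fails."""
    prev = GENESIS
    for entry in log:
        expected = hmac.new(
            key, (prev + entry["payload"]).encode(), hashlib.sha256
        ).hexdigest()
        if entry["prev"] != prev or not hmac.compare_digest(entry["hash"], expected):
            return False
        prev = entry["hash"]
    return True
```

PII should be masked before a decision is passed to `append_entry`, since entries in such a log are not meant to be rewritten later.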
Deployment & scaling
- Edge deployments: lightweight agents written in Go/Rust for minimal overhead and predictable memory footprint.
- Central processors: horizontally scalable stream processors (Kafka / Pulsar + Flink/Beam) for enrichment, rollups and audit logging.
- Backpressure handling: when downstream is slow, buffer with bounded queues and a prioritized review queue for flagged datapoints.
- Monitoring: expose metrics (cleaned_count, dropped_count, flagged_count, avg_processing_ms) to Prometheus and include Grafana dashboards.
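The backpressure bullet can be illustrated with a small bounded buffer that favors flagged datapoints. The class is hypothetical, and the shedding policy is an assumption: when the buffer is full, the oldest unflagged datapoint is evicted, and a push is rejected only when every slot holds a flagged datapoint. The `dropped_count` attribute mirrors the metric name listed above.

```python
from collections import deque


class BoundedBuffer:
    """Bounded buffer that sheds unflagged datapoints first when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.normal = deque()
        self.flagged = deque()
        self.dropped_count = 0

    def __len__(self) -> int:
        return len(self.normal) + len(self.flagged)

    def push(self, point, flagged: bool = False) -> bool:
        """Add a datapoint; return False if it had to be rejected."""
        if len(self) >= self.capacity:
            self.dropped_count += 1
            if not self.normal:
                return False       # full of flagged points: reject new one
            self.normal.popleft()  # evict the oldest unflagged point
        (self.flagged if flagged else self.normal).append(point)
        return True

    def pop(self):
        """Return the next datapoint, flagged first, or None if empty."""
        if self.flagged:
            return self.flagged.popleft()
        if self.normal:
            return self.normal.popleft()
        return None
```

A `False` return from `push` is the signal an upstream reader could use to slow down or pause consumption.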
Runbook & CLI examples
    # Run a dry‑run batch cleanup
    datacleanerctl run --mode dryrun --input /data/raw/2026-01-01.log --rules /etc/datacleaner/rules.yaml

    # Push changes to ruleset and reload (graceful)
    datacleanerctl rules push --file rules.yaml
    datacleanerctl service reload

    # Inspect flagged datapoints
    datacleanerctl review list --state flagged --limit 50
Abil’I.T. — Datapoint Cleaner
Contact: ops@abilit.eu
