Rate Watcher — Rate‑Limit, Throttling & Anomaly Detection
A focused service for detecting abnormal request/operation rates, enforcing rate limits and triggering automated or human responses. Designed to protect APIs, probes and critical endpoints from bursts, floods and slow‑burn anomalies.
Core capabilities
- Real‑time rate aggregation (per API key, IP, user, route, or service) with sliding windows and configurable buckets.
- Adaptive thresholds using baseline learning and seasonality-aware profiles to reduce false positives.
- Multiple enforcement modes: monitor, soft‑throttle (429), hard‑block, and graceful backoff signalling (Retry‑After header).
- Integration with API gateways, load‑balancers and WAFs (NGINX, Envoy, HAProxy, Traefik).
- Alerting & automated remediation: escalate to Watchdog, trigger temporary IP bans, or open incident tickets.
- Audit logs and metrics exported to Prometheus/Grafana for historical analysis and compliance.
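The seasonality-aware baseline mentioned above can be pictured as one learned mean rate per hour-of-week slot. The following is only a minimal sketch under that assumption; the class name `SeasonalBaseline` and its methods are hypothetical and do not describe Rate Watcher's actual model.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


class SeasonalBaseline:
    """Hypothetical baseline: one mean rate per hour-of-week slot (0..167)."""

    def __init__(self) -> None:
        self._samples = defaultdict(list)  # slot -> observed rates

    @staticmethod
    def slot(weekday: int, hour: int) -> int:
        # weekday: 0 = Monday .. 6 = Sunday, hour: 0..23
        return weekday * 24 + hour

    def observe(self, weekday: int, hour: int, rate: float) -> None:
        """Add one observed rate to the slot's learning data."""
        self._samples[self.slot(weekday, hour)].append(rate)

    def deviation(self, weekday: int, hour: int, rate: float) -> Optional[float]:
        """Ratio of `rate` to the learned mean for the slot, or None without data."""
        samples = self._samples.get(self.slot(weekday, hour))
        if not samples:
            return None
        base = mean(samples)
        return rate / base if base > 0 else None


baseline = SeasonalBaseline()
for observed in (100, 120, 80):      # Monday 09:00 over three weeks
    baseline.observe(0, 9, observed)
baseline.observe(6, 3, 10)           # Sunday 03:00 is normally quiet
```

With this data, the same 300 requests/s reads as 3 times normal on Monday morning but 30 times normal on Sunday night, which is the kind of distinction a seasonal profile is meant to capture.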
How Rate Watcher works
Rate Watcher ingests request events or aggregated counters (from gateway logs, sidecars or Prometheus) and evaluates them against configured rules and learned baselines. It supports:
- Multiple keying strategies (API key, client IP, user ID, route).
- Sliding window and fixed bucket algorithms for short and long windows (e.g., 1s, 10s, 1m, 1h).
- Exemptions and whitelists for internal systems and critical clients.
- Rate‑limit policies with tiered actions and automatic cooldown timers.
- Behavioral anomaly detection that compares current rate to historic baseline and seasonality profile.
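As an illustration of the sliding-window aggregation listed above, here is a minimal in-memory sketch that keeps one timestamp queue per key. The class `SlidingWindowCounter` and the limit of 20 are assumptions made for the example, not Rate Watcher internals.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def record(self, key: str, now: float) -> int:
        """Record one event for `key` at time `now`; return the count in the window."""
        events = self._events[key]
        events.append(now)
        return self._evict(events, now)

    def count(self, key: str, now: float) -> int:
        """Return how many events for `key` are still inside the window."""
        return self._evict(self._events[key], now)

    def _evict(self, events: deque, now: float) -> int:
        # Drop timestamps at or before the cutoff, i.e. outside (now - window, now].
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events)


counter = SlidingWindowCounter(window_seconds=1.0)
limit = 20  # illustrative per-second limit
for i in range(25):
    seen = counter.record("api-key-123", now=10.0 + i * 0.01)
over_limit = seen > limit
```

Storing every timestamp is exact but costs memory proportional to the request rate; fixed buckets trade some precision for constant memory, which is presumably why both algorithms are offered.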
Example policies & configuration
Use policy templates to apply standard protections quickly, then tune them per service.
    # Example policy (YAML)
    policies:
      - id: api_public_default
        key: api_key
        windows:
          - window: 1s
            limit: 20
            action: soft-throttle
          - window: 1m
            limit: 1000
            action: monitor
        baseline_learning: 14d
        exempt_clients: ["internal-service-1"]
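To make the semantics of the example policy concrete, the sketch below evaluates observed counts against its windows. It is a hedged illustration: the `evaluate` function is hypothetical, the policy is written as a plain Python dict rather than parsed from YAML, and treating a limit as exceeded only when the count is strictly greater is an assumption.

```python
# The example policy, expressed as a plain dict for illustration.
POLICY = {
    "id": "api_public_default",
    "key": "api_key",
    "windows": [
        {"window": "1s", "limit": 20, "action": "soft-throttle"},
        {"window": "1m", "limit": 1000, "action": "monitor"},
    ],
    "exempt_clients": ["internal-service-1"],
}


def evaluate(policy: dict, client: str, counts: dict) -> list:
    """Return (window, action) pairs triggered for `client`.

    `counts` maps a window name (e.g. "1m") to the observed request count.
    Assumption: a window triggers only when count > limit.
    """
    if client in policy["exempt_clients"]:
        return []  # exempt clients are never throttled or flagged
    return [
        (w["window"], w["action"])
        for w in policy["windows"]
        if counts.get(w["window"], 0) > w["limit"]
    ]


triggered = evaluate(POLICY, "api-key-123", {"1s": 5, "1m": 1500})
```

Here 1500 requests in a minute exceed the 1m limit of 1000, so only the monitor action fires; the 1s window stays under its limit of 20.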
Adaptive policy example: start in monitor mode for 7 days to learn the baseline, then enforce soft‑throttle whenever the burst rate exceeds 3× the learned baseline.
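The adaptive policy described above boils down to a two-phase decision. A minimal sketch follows, assuming the caller already knows how long the policy has been observing and what the learned baseline rate is; the `AdaptivePolicy` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptivePolicy:
    learning_days: int = 7          # monitor-only period, as in the example
    burst_multiplier: float = 3.0   # enforce above 3x the learned baseline

    def decide(self, days_observed: float, current_rate: float,
               baseline_rate: float) -> str:
        """Return 'monitor', 'allow' or 'soft-throttle' for one evaluation."""
        if days_observed < self.learning_days:
            return "monitor"  # still learning: observe only, never enforce
        if current_rate > self.burst_multiplier * baseline_rate:
            return "soft-throttle"
        return "allow"


policy = AdaptivePolicy()
during_learning = policy.decide(days_observed=2, current_rate=1000, baseline_rate=10)
after_learning = policy.decide(days_observed=8, current_rate=31, baseline_rate=10)
```

During the learning phase even a large burst is only monitored; once the 7 days have passed, a rate above 3 times the baseline is throttled.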
Integration & enforcement
- API Gateway: use Rate Watcher as an external policy engine (via RPC calls) or as a sidecar for local enforcement.
- WAF & LB: push decisions (block/throttle) to WAF rules or LB ACLs for immediate action.
- Watchdog: when abnormal sustained rates are detected, trigger Watchdog maintenance mode or an automated scaling playbook.
- Incident systems: open tickets in ServiceNow / PagerDuty when P0 thresholds are hit.
Metrics & dashboards
Expose these metrics to Prometheus for dashboards and alerting:
- ratewatcher_requests_total{policy,key,action}
- ratewatcher_throttled_total{policy,key}
- ratewatcher_blocked_total{policy,key}
- ratewatcher_baseline_deviation{policy,window}
- ratewatcher_policy_eval_duration_seconds{policy}
Suggested Grafana panels: top throttled clients, policy hit heatmap, baseline vs actual rate overlays, throttling impact on latency.
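For reference, this is roughly what one of the metrics above looks like on the wire in the Prometheus text exposition format. The sketch uses only the standard library; the helper `format_sample` is hypothetical, and a real exporter would normally rely on a Prometheus client library instead.

```python
def _escape(value: str) -> str:
    # Label values escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value: float) -> str:
    """Render one sample line in the Prometheus text exposition format."""
    label_str = ",".join(
        f'{key}="{_escape(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}"


line = format_sample(
    "ratewatcher_throttled_total",
    {"policy": "api_public_default", "key": "api-key-123"},
    42,
)
```

Because `key` is a label, every distinct client produces its own time series, so it is worth watching label cardinality when keying by IP or API key.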
Alerting & runbooks
Define alert severities for burst vs sustained anomalies. Example rules:
- P0 — sustained blocked rate > X% of total traffic for > 5m → page on‑call, trigger Watchdog.
- P1 — burst above 10× baseline for specific key → open ticket to investigate client misbehaviour.
- P2 — repeated soft‑throttle events for non‑exempt client → notify team via Slack/email.
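The example rules above can be read as an ordered classification, checked from most to least severe. The sketch below is one possible encoding: the function `classify` is hypothetical, the P0 traffic share (the X% in the rule) is deliberately left as a required argument, and interpreting "repeated" as at least three events is an assumption.

```python
from typing import Optional


def classify(*, blocked_share: float, sustained_minutes: float,
             burst_ratio: float, soft_throttle_events: int, exempt: bool,
             p0_blocked_share: float, repeat_threshold: int = 3) -> Optional[str]:
    """Return 'P0', 'P1', 'P2' or None for one evaluation.

    blocked_share        fraction of total traffic currently blocked (0..1)
    sustained_minutes    how long that share has persisted
    burst_ratio          current rate divided by baseline for a specific key
    p0_blocked_share     the X% threshold from the P0 rule, chosen by the operator
    repeat_threshold     assumed meaning of "repeated" soft-throttle events
    """
    if blocked_share > p0_blocked_share and sustained_minutes > 5:
        return "P0"  # page on-call, trigger Watchdog
    if burst_ratio > 10:
        return "P1"  # open a ticket to investigate the client
    if soft_throttle_events >= repeat_threshold and not exempt:
        return "P2"  # notify the team via Slack/email
    return None


severity = classify(blocked_share=0.30, sustained_minutes=6, burst_ratio=1.0,
                    soft_throttle_events=0, exempt=False, p0_blocked_share=0.25)
```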
Operational best practices
- Start in monitor mode to collect 7–14 days of baseline data before enforcing hard limits.
- Maintain owner whitelists and emergency bypass tokens for critical traffic.
- Provide clear client feedback headers (Retry‑After, X‑RateLimit‑Reset) and API docs about limits.
- Automate cooldown and auto‑unblock after verification windows to reduce manual toil.
- Periodically review and tune policies based on seasonal patterns (business hours, campaigns).
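The client feedback headers recommended above could be built as follows for a 429 response. This is a sketch under stated assumptions: `throttle_headers` is a hypothetical helper, and since X-RateLimit-Reset is not standardized, expressing it as epoch seconds is a convention chosen for the example.

```python
import math


def throttle_headers(now: float, window_reset_at: float) -> dict:
    """Build feedback headers for a throttled (HTTP 429) response.

    `now` and `window_reset_at` are Unix timestamps in seconds.
    """
    retry_after = max(0, math.ceil(window_reset_at - now))
    return {
        # Retry-After as delta-seconds (an HTTP-date is also allowed by HTTP).
        "Retry-After": str(retry_after),
        # Assumption: reset time as epoch seconds; some APIs use delta-seconds.
        "X-RateLimit-Reset": str(math.ceil(window_reset_at)),
    }


headers = throttle_headers(now=100.2, window_reset_at=130.0)
```

Rounding up avoids telling a client to retry a fraction of a second before the window has actually reset.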
Security & compliance
Keep audit trails of enforcement decisions and store them in tamper‑evident logs for investigations and compliance. Mask PII in logs and follow retention rules.
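One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that idea together with simple IP masking; the functions `mask_ip`, `append_entry` and `verify` are hypothetical, and nothing here is claimed to be Rate Watcher's actual log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def mask_ip(ip: str) -> str:
    """Mask the last octet of an IPv4 address (illustrative PII masking)."""
    parts = ip.split(".")
    return ".".join(parts[:3] + ["xxx"]) if len(parts) == 4 else "masked"


def _digest(prev_hash: str, decision: dict) -> str:
    payload = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, decision: dict) -> dict:
    """Append a decision, chaining it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"decision": decision, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, decision)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["decision"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"key": mask_ip("203.0.113.7"),
                         "policy": "api_public_default", "action": "hard-block"})
append_entry(audit_log, {"key": "api-key-123",
                         "policy": "api_public_default", "action": "soft-throttle"})
```

A chain like this detects modification but does not prevent it; anchoring the latest hash somewhere outside the log is what makes silent rewriting of the whole chain detectable.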
Deployment & scaling
- Scale horizontally: shard keyspace by hash (e.g., key → shard) and use consistent hashing for sticky routing.
- Use local caching for ultra‑low‑latency enforcement, and accept eventual consistency with the central policy store.
- Provide high‑availability policy store (etcd/consul) and replicate learning models across nodes.
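The key-to-shard routing described above can be sketched with a small consistent-hash ring. The `HashRing` class, the node names and the choice of 64 virtual nodes per shard are illustrative assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64) -> None:
        # Place `vnodes` points per node on the ring, sorted by hash value.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the node owning `key`: the first ring point after its hash."""
        idx = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


shards = ["rw-0", "rw-1", "rw-2"]  # hypothetical shard names
ring = HashRing(shards)
owner = ring.node_for("api-key-123")
```

The same key always lands on the same shard, and adding a shard only moves the keys that the new shard takes over, which keeps per-key counters local when the cluster is resized.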
CLI examples
    # Scan incoming logs and push counters to Rate Watcher
    ratewatcherctl ingest --source /var/log/gateway/access.log --format nginx --policy-map /etc/ratewatcher/policies.yaml

    # Evaluate a specific key against policies (dry-run)
    ratewatcherctl eval --policy api_public_default --key "api-key-123" --window 1m --count 1500

    # Apply a temporary block on a client
    ratewatcherctl action block --key "bad-client-ip" --duration 3600 --reason "sustained flood"
Abil’I.T. — Rate Watcher
Contact: ops@abilit.eu
