engineering · rust · observability

How OxiPulse survives network outages

By SecuryBlack

A monitoring agent that stops recording data when the network goes down is not very useful. OxiPulse is designed to keep collecting and to flush everything once connectivity returns, with no gaps in the time series. Here's how the offline buffer works.

The ring buffer

When the ingestor is unreachable, metric snapshots are stored in an in-memory ring buffer. The default capacity is 8,640 snapshots — exactly 24 hours of data at the default 10-second interval. When the buffer is full, the oldest snapshot is dropped to make room for the newest. The agent always has the most recent 24 hours, never more.

# config.toml
buffer_max_size = 8640   # 24 h at 10 s interval (default)

You can raise this for longer outage tolerance or lower it on memory-constrained devices. Each snapshot is roughly 64 bytes of metric data, so 8,640 snapshots consume under 1 MB of RAM.
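The eviction policy is simple enough to sketch. Here is a minimal ring buffer over `VecDeque` that drops the oldest snapshot when full — the type and field names are illustrative, not the agent's actual internals:

```rust
use std::collections::VecDeque;

/// Hypothetical snapshot type — the post does not show the real struct.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    timestamp: u64,
    cpu_pct: f32,
}

/// Fixed-capacity ring buffer: when full, the oldest snapshot is evicted.
struct RingBuffer {
    buf: VecDeque<Snapshot>,
    max: usize,
}

impl RingBuffer {
    fn new(max: usize) -> Self {
        Self { buf: VecDeque::with_capacity(max), max }
    }

    fn push(&mut self, s: Snapshot) {
        if self.buf.len() == self.max {
            self.buf.pop_front(); // drop the oldest to make room
        }
        self.buf.push_back(s);
    }

    /// Drain oldest-first, as the reconnect flush requires.
    fn drain(&mut self) -> impl Iterator<Item = Snapshot> + '_ {
        self.buf.drain(..)
    }
}
```

`VecDeque` gives O(1) push and pop at both ends, which is all an evict-oldest buffer needs.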

Exponential backoff for connectivity checks

Checking reachability on every collection tick would generate a lot of noise and waste resources during a prolonged outage. OxiPulse uses exponential backoff: after the first failure it waits 1 tick before the next check, then 2, then 4, doubling up to a ceiling of approximately 30 seconds.

tick 1: unreachable → mark offline, backoff = 1 tick
tick 2: skip check (countdown = 1)
tick 3: check → still unreachable → backoff = 2 ticks
tick 4-5: skip
tick 6: check → still unreachable → backoff = 4 ticks
...ceiling at ~30 s
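The trace above can be expressed as a small state machine. This is a sketch under assumed names (`Backoff`, `should_check`, etc.), with the ceiling expressed in ticks rather than seconds:

```rust
/// Backoff state for connectivity checks (names are illustrative).
struct Backoff {
    ticks: u32,     // wait to apply after the next failure
    countdown: u32, // ticks remaining until the next check
    ceiling: u32,   // cap on the wait, in ticks (~30 s at a 10 s tick)
}

impl Backoff {
    fn new(ceiling: u32) -> Self {
        Self { ticks: 1, countdown: 0, ceiling }
    }

    /// Returns true when this tick should perform a reachability check.
    fn should_check(&mut self) -> bool {
        if self.countdown == 0 {
            true
        } else {
            self.countdown -= 1;
            false
        }
    }

    /// After a failed check: wait the current backoff, then double it.
    fn on_failure(&mut self) {
        self.countdown = self.ticks;
        self.ticks = (self.ticks * 2).min(self.ceiling);
    }

    /// On success: reset immediately, as the post describes.
    fn on_success(&mut self) {
        self.ticks = 1;
        self.countdown = 0;
    }
}
```

Driving this with repeated failures reproduces the check-at-tick-1, 3, 6 pattern shown in the trace.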

When the ingestor comes back, the backoff resets immediately and the agent flushes the buffer in order — oldest snapshot first — before recording the current tick.

The reachability check

Before attempting an OTLP export, OxiPulse does a lightweight TCP handshake to the ingestor's host and port. This avoids the overhead of a full gRPC connection attempt when the endpoint is clearly unreachable. The check prefers IPv4 addresses over IPv6 to avoid stalls on hosts where the ingestor has no IPv6 listener.

Flush on reconnect

When connectivity is restored the agent drains the entire buffer before sending the current metric snapshot. The ingestor receives a burst of historical data with the correct timestamps, so dashboards show a continuous time series rather than a gap followed by a sudden jump.

offline for 2 hours → reconnect
  → flush 720 buffered snapshots (in timestamp order)
  → send current snapshot
  → resume normal 10 s cadence
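The flush itself is just an ordered drain. In this sketch, `send` stands in for the OTLP export call (which the post does not show); the point is that each snapshot carries its original timestamp, so the ingestor backfills the series at the right points:

```rust
/// Drain the buffer oldest-first on reconnect (illustrative names).
fn flush_buffer<F: FnMut(u64, f64)>(buffer: &mut Vec<(u64, f64)>, mut send: F) {
    // `drain` preserves insertion order, so snapshots go out oldest first.
    for (ts, value) in buffer.drain(..) {
        send(ts, value);
    }
}
```

After the drain returns empty, the agent sends the current snapshot and resumes its normal cadence.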

Practical implications

  • A VPS that loses connectivity overnight will have a full, uninterrupted time series in the morning.
  • A Raspberry Pi on a flaky home connection will never show gaps, provided each outage is shorter than 24 hours.
  • The buffer is in-memory, so a crash or power cut loses any un-flushed data. Persistence to disk is on the roadmap for a future release.