OTP Supervision Trees Are the Best Distributed Systems Primitive Nobody Uses
Every microservice framework reinvents OTP badly. We just use the real thing, in Gleam.
Every distributed systems team eventually builds supervision. They call it "health checks" or "circuit breakers" or "pod restart policies," but the underlying need is identical: detect failure, isolate it, recover automatically, escalate if recovery fails. Erlang solved this decades ago: the language dates from 1986, and OTP supervision trees were formalized in the mid-1990s. Nearly thirty years later, most teams are still reimplementing them poorly on top of Kubernetes.
We run our entire ad-serving pipeline on Gleam targeting the BEAM, and OTP supervision trees are the single most important architectural decision we have made. Not the language. Not the VM. The supervision model.
What OTP Supervision Trees Actually Provide
A supervision tree is a hierarchical structure where every process has a parent responsible for its lifecycle. When a child process crashes, the supervisor decides what to do based on a restart strategy:
- one_for_one: Restart only the crashed child. Other siblings continue unaffected.
- one_for_all: If one child crashes, restart all children. Used when children are interdependent.
- rest_for_one: Restart the crashed child and all children started after it. Used for ordered dependency chains.
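The three strategies differ only in which siblings restart. Here is a minimal, language-neutral sketch of that decision; the function name and representation are illustrative, not the OTP API:

```python
def children_to_restart(strategy: str, children: list, crashed: int) -> list:
    """Given an ordered child list and the index of the crashed child,
    return the indices of the children the supervisor restarts."""
    if strategy == "one_for_one":
        # Only the crashed child; siblings are untouched.
        return [crashed]
    if strategy == "one_for_all":
        # Every child, because they are interdependent.
        return list(range(len(children)))
    if strategy == "rest_for_one":
        # The crashed child plus every sibling started after it.
        return list(range(crashed, len(children)))
    raise ValueError(f"unknown strategy: {strategy}")
```

With children `[connection_pool, bid_engine, response_serializer]` under rest_for_one, a crash of the connection pool restarts all three, while a crash of the serializer restarts only the serializer.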
Each supervisor has configurable restart intensity: "allow N restarts within T seconds, then escalate to my parent supervisor." This creates automatic backpressure. A process crashing once gets restarted instantly. A process crashing in a loop causes its supervisor to die, which causes its supervisor to make a broader decision. Failure propagates up the tree exactly as far as it needs to, and no further.
This is hierarchical fault isolation with automatic recovery baked into the runtime. No sidecar process. No external orchestrator. No YAML.
Our Ad-Serving Supervision Architecture
We serve programmatic ads across Southeast Asia through real-time bidding exchanges. Each exchange connection, each geographic region, each campaign has its own supervision subtree. The hierarchy looks roughly like this:
root_supervisor
├── exchange_supervisor (one_for_one)
│   ├── google_adx_supervisor (rest_for_one)
│   │   ├── connection_pool
│   │   ├── bid_engine
│   │   └── response_serializer
│   ├── xandr_supervisor (rest_for_one)
│   │   ├── connection_pool
│   │   ├── bid_engine
│   │   └── response_serializer
│   └── ...per exchange
├── region_supervisor (one_for_one)
│   ├── sg_region (one_for_all)
│   ├── id_region (one_for_all)
│   └── th_region (one_for_all)
└── telemetry_supervisor (one_for_one)
    ├── metrics_aggregator
    └── trace_exporter
Each exchange supervisor uses rest_for_one because the bid engine depends on the connection pool, and the response serializer depends on both. If the connection pool crashes, everything downstream restarts in order. But a crash in the Google AdX subtree does not affect Xandr. The exchange_supervisor uses one_for_one because exchanges are independent.
Region supervisors use one_for_all because within a region, the bidding state, budget pacing, and frequency capping are tightly coupled. If any component in the Singapore region fails, we restart the entire regional subtree to guarantee consistency. But a failure in Singapore does not touch Indonesia.
Restart intensity is calibrated to our SLAs. Exchange supervisors allow 5 restarts in 10 seconds before escalating. This matches our 99.95% availability target: we can absorb transient failures (DNS blips, brief exchange outages) without human intervention, but sustained failures escalate fast enough that we never silently degrade for more than a few seconds.
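One underrated property of this design is that the entire tree is plain data: strategy, restart budget, and ordered children, declared before any process starts. A hypothetical, simplified spec for the exchange subtree, just to show the shape of the information each supervisor carries (this is not the OTP or gleam_otp API):

```python
from dataclasses import dataclass, field

@dataclass
class Sup:
    name: str
    strategy: str          # "one_for_one" | "one_for_all" | "rest_for_one"
    max_restarts: int = 5  # restarts allowed...
    period: float = 10.0   # ...within this many seconds, before escalating
    children: list = field(default_factory=list)  # Sups or worker names, in start order

google_adx = Sup("google_adx_supervisor", "rest_for_one",
                 children=["connection_pool", "bid_engine", "response_serializer"])
exchange = Sup("exchange_supervisor", "one_for_one",
               max_restarts=5, period=10.0,
               children=[google_adx])
```

Because start order is explicit, rest_for_one has a well-defined meaning: the serializer, listed last, depends on everything before it.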
Why Kubernetes Restarts Are Not the Same Thing
"We have liveness probes and restart policies in K8s" is the most common response we hear. It misses the point entirely.
Kubernetes restarts operate at the container level. A container restart takes seconds at minimum: kill the process, maybe pull an image layer, schedule on a node, run init containers, pass health checks. Our BEAM process restarts take microseconds. We measured an average of 11 microseconds from crash to new process accepting messages.
More importantly, Kubernetes has no concept of hierarchical isolation. A pod either runs or it does not. There is no "restart just the bid engine within the AdX service while keeping the connection pool alive." You would need to decompose that into separate deployments, adding network hops and operational complexity for something the BEAM gives you inside a single OS process.
Go circuit breakers (like gobreaker or hystrix-go) are closer but still operate at the wrong abstraction layer. A circuit breaker wraps a function call. A supervision tree manages a running process with state. Circuit breakers answer "should I call this thing?" Supervisors answer "this thing crashed, how do I recover the system?"
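The difference in abstraction is visible in code. A circuit breaker only counts failed calls; it knows nothing about the callee's state. A toy sketch (illustrative, not the gobreaker or hystrix-go API):

```python
class CircuitBreaker:
    """Wraps a function call: after `threshold` consecutive failures,
    stop calling and fail fast. Knows nothing about the callee's state."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args):
        if self.failures >= self.threshold:
            # Answers "should I call this thing?" with no.
            raise RuntimeError("circuit open")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result
```

A supervisor, by contrast, owns the callee: it holds the child's spec, restarts it with fresh state, and tracks restart intensity. None of that is expressible in a wrapper around a function call.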
Tokio task groups in Rust are perhaps the closest analogy outside the BEAM. JoinSets let you spawn and manage related tasks. But Tokio has no built-in restart strategies, no intensity tracking, no hierarchical escalation. You build those yourself, and you always build them worse than what OTP provides, because you are doing it ad hoc for your specific use case rather than using a general primitive that has been battle-tested for nearly three decades.
Gleam-Specific Patterns
We chose Gleam over Erlang or Elixir for one reason: types. OTP's message-passing model in Erlang is dynamically typed. You send a tuple to a process and hope the receive block matches it. In production systems processing 2.3 million bid requests per second, "hope" is not an engineering strategy.
Gleam gives us typed actors. Every message a process can receive is defined as a Gleam type, and the compiler rejects messages that do not conform. This eliminates an entire class of runtime errors that plague Erlang codebases: the mismatched message that sits in a process mailbox forever, slowly leaking memory.
We also combine Gleam's Result type with the let-it-crash philosophy in a specific way. Business logic errors (budget exhausted, bid below floor price) return Result types and are handled explicitly. Infrastructure errors (exchange connection dropped, disk full, corrupted message) crash the process. The supervision tree handles infrastructure failures. Application code handles business logic failures. The boundary is clean and enforced by types.
import gleam/result

pub fn handle_bid_request(req: BidRequest) -> Result(BidResponse, BidError) {
  use campaign <- result.try(lookup_campaign(req.campaign_id))
  // Bound only to confirm the budget check passed, hence the underscore.
  use _budget <- result.try(check_budget(campaign))
  use bid <- result.try(calculate_bid(campaign, req))
  Ok(BidResponse(bid:, campaign_id: campaign.id))
}
If lookup_campaign returns an error because the campaign does not exist, that is a business logic path handled by the caller. If lookup_campaign crashes because the ETS table is corrupted, the process dies and the supervisor restarts it with a fresh table. Two different failure modes, two different recovery mechanisms, zero ambiguity about which is which.
Incident Post-Mortem: 47ms Recovery
On January 14, 2026, one of the exchanges we connect to started sending malformed OpenRTB payloads. Specifically, their bid responses contained negative price values that violated their own schema. Our deserialization code did not anticipate negative prices (because the spec forbids them), and the parsing process crashed.
Here is what happened in the 47 milliseconds that followed:
- T+0ms: Parser process crashes on malformed payload.
- T+0.011ms: The failing exchange's supervisor detects the crash and restarts the parser process.
- T+2ms: New parser process initializes, begins processing queued messages.
- T+3ms: Second malformed payload arrives, parser crashes again.
- T+3.011ms: Supervisor restarts parser again. Restart count: 2.
- T+8ms through T+31ms: Pattern continues. Restart count hits intensity limit (5 restarts in 10 seconds).
- T+31ms: The failing exchange's supervisor itself crashes because its restart intensity is exceeded.
- T+31.008ms: The parent exchange_supervisor catches the crash, triggers its one_for_one strategy, and restarts the failing exchange's supervisor in a degraded mode that drops incoming messages from that exchange and logs an alert.
- T+47ms: System stable. All other exchanges unaffected. Bid requests from the misbehaving exchange are being dropped with appropriate no-bid responses.
Total impact: 47 milliseconds of degraded service for one exchange. Zero impact on other exchanges. Zero human intervention required for immediate recovery. An engineer reviewed the alert, added negative price handling to the parser, deployed, and the exchange supervisor was returned to normal mode.
Compare this to what would have happened with a traditional microservice: the service crashes, Kubernetes detects it after the liveness probe interval (typically 10-30 seconds), schedules a restart (seconds), the new pod starts (seconds), passes readiness checks (seconds). Minimum recovery time: 30-60 seconds, during which all exchanges served by that pod are affected.
The Argument for OTP as a Distributed Systems Primitive
OTP supervision trees give you hierarchical fault isolation, automatic recovery with configurable strategies, backpressure through restart intensity limits, and clean escalation paths. These are not features of a framework. They are properties of the runtime.
Every team building distributed systems needs these properties. Most teams build them piecemeal out of health checks, circuit breakers, retry libraries, and Kubernetes restart policies, spread across multiple layers of the stack with no unified model. The result is brittle, hard to reason about, and slow to recover.
We use the real thing. In Gleam, with types. It is the best distributed systems primitive nobody uses, and our 99.99% uptime over the last twelve months is the proof.