Introduction

The saga pattern is one of those architectural decisions that looks straightforward in diagrams and becomes genuinely complex the moment you operate it in production. The premise is simple: in a distributed system where you cannot use a single ACID transaction across service boundaries, you decompose a long-running business transaction into a sequence of local transactions, each of which publishes an event or message to trigger the next step. If any step fails, you execute compensating transactions to undo the work already done.

What the premise omits is the question of who drives the sequence. That question — orchestration or choreography — is the central design decision in saga implementation, and it carries consequences that ripple through your codebase, your operational posture, and your team’s cognitive load for years after the initial commit.

This post takes a production-grounded look at both approaches: how they work, where each breaks down, what failure modes they introduce, and how to reason about which one belongs in your system. This is not a “use orchestration for complex sagas, choreography for simple ones” post — that heuristic is too coarse to be useful. The real trade-offs are more specific and more interesting.

Table of Contents


The Shared Foundation: What Both Approaches Must Solve

Before comparing the two, it’s worth being explicit about what any saga implementation — regardless of approach — must handle correctly. These are not optional concerns; they are the baseline requirements that separate a working saga from a reliable one.

Atomicity at step boundaries. Each local transaction must commit atomically with the publication of its outgoing event or message. If you commit the database write and then the message broker goes down before the event is published, the saga stalls silently with no way to resume. This is the dual-write problem, and it affects both orchestration and choreography equally. The standard solutions are the transactional outbox pattern or change data capture — both ensure the event is durably recorded in the same transaction as the state change, and a separate relay process handles publication.

Exactly-once or at-least-once with idempotent consumers. Message brokers almost universally guarantee at-least-once delivery. Your saga steps will be invoked more than once. Every step — whether it’s a choreography listener or an orchestrator-issued command — must be idempotent. This is where the previous article’s content applies directly: each step needs an idempotency key, a deduplication store, and atomic reservation logic. Sagas without idempotent steps are not sagas; they are time bombs.

Compensation correctness. Compensating transactions are not rollbacks. A rollback undoes changes as if they never happened. Compensation undoes changes in a world that has moved on — emails may have been sent, external APIs called, time has passed. Compensating transactions must be designed for the state the system is actually in, not the state it was in before the saga started. This means they can fail too, and you need a strategy for that.

Observability. A saga that spans five services across three teams, with no unified view of its current state, is an operational nightmare. Both approaches need correlation IDs that propagate through every step, structured logging at every transition, and ideally a queryable representation of saga state that a human can inspect without joining log streams across five services.

With this baseline established, the choice between orchestration and choreography becomes a question of where you locate control, coupling, and visibility — not whether you need to solve the above.


Orchestration: Central Control, Explicit State

In the orchestration model, a dedicated component — the orchestrator — is responsible for the entire saga. It knows the sequence of steps, issues commands to participating services, waits for responses, handles failures, and drives compensation. Participating services do not know they are part of a saga; they receive a command, execute it, and respond with success or failure.

How It Works

The orchestrator is typically implemented as a stateful workflow. Each saga instance has a record in a persistent store tracking its current step, the outcomes of completed steps, and any data it needs to pass forward. On each transition — a response received, a timeout fired, a failure acknowledged — the orchestrator updates its state and issues the next command.

Modern implementations use workflow engines (Temporal, Azure Durable Functions, AWS Step Functions) that handle the durability and replay mechanics transparently. You express the saga as code — sequential, conditional, parallel — and the engine ensures that even if the process crashes mid-execution, it resumes from exactly where it left off.

In systems without a dedicated workflow engine, orchestrators are often implemented as state machines persisted in a relational database, driven by a polling loop or event consumer. This is more work to build correctly but gives you full control over the execution model.

Where Orchestration Shines

Complex, conditional workflows. When the saga has branching logic — “if the inventory check returns low stock, trigger a backorder flow instead of the standard fulfillment path” — expressing this in an orchestrator is natural. The control flow is code. In choreography, conditional branching requires services to publish different event types based on state, and the routing logic is spread across multiple consumers and topic configurations. Orchestration keeps it in one place.

Long-running sagas with human steps. Orchestrators handle wait states elegantly. “Wait up to 48 hours for manual approval, then either proceed or cancel” is a single timer in Temporal or a scheduled check in a state machine. In choreography, this requires a service that holds the waiting state, listens for an approval event, and publishes a continuation event — an entire service built to coordinate what would be a few lines in an orchestrator.

Debugging and operational support. When a saga fails or stalls, the orchestrator’s state record tells you exactly which step it failed on, what input it had, and what it was waiting for. Support engineers can query saga state directly. In choreography, reconstructing the current state of a failed saga requires correlating events from multiple services’ event logs, which is slow and error-prone under production pressure.

Explicit timeout and retry policies. Orchestrators apply retry and timeout logic per step, with configurable backoff and max attempts. These policies are visible in one place. In choreography, retry behavior is governed by the message broker’s redelivery configuration and individual consumer implementations — distributed, inconsistent, and hard to reason about holistically.

Where Orchestration Breaks Down

The orchestrator becomes a bottleneck. Every step in the saga flows through the orchestrator. Under high throughput, this is a scaling constraint. Orchestrators handling thousands of concurrent sagas need to be designed with horizontal scalability in mind — partitioned state, careful lock management, and queuing of inbound responses. Workflow engines handle much of this, but naive implementations hit this wall quickly.

Tight temporal coupling. The orchestrator issues a command and waits for a response. If a downstream service is slow or unavailable, the orchestrator instance is blocked (or handling the timeout). This is different from the decoupling that event-driven architectures typically aim for. Orchestrators are more tolerant of service outages than synchronous calls, but they are less tolerant than fully choreographed flows where services process events on their own schedule.

Orchestrator god-object risk. The convenience of centralizing logic can become a gravity well for business logic that should live in the participating services. Over time, orchestrators accrete domain logic, validation, and data transformation that erodes the encapsulation of the services they coordinate. Disciplined teams avoid this; less disciplined teams end up with orchestrators that know too much about the internals of every service they touch.


Choreography: Decentralized Reactions, Implicit Coordination

In the choreography model, there is no central coordinator. Each service listens for events it cares about, performs its local transaction, and publishes events that other services react to. The saga emerges from the composition of these reactions. No single component has a view of the whole.

How It Works

A saga begins when an initiating event is published — say, OrderPlaced. The inventory service listens for OrderPlaced, reserves stock, and publishes InventoryReserved. The payment service listens for InventoryReserved, charges the customer, and publishes PaymentCaptured. The fulfillment service listens for PaymentCaptured and initiates shipment. If the payment fails, it publishes PaymentFailed, the inventory service listens for that and releases the reservation, and so on.

No service is aware of the saga as a whole. Each service is aware only of the events that are relevant to its domain and the events it needs to publish in response. The saga is an emergent property of the system.

Where Choreography Shines

True decoupling and independent deployability. Services are coupled only through event contracts — the schema of the events they publish and consume. Adding a new step to a choreographed saga can mean deploying a new service that subscribes to an existing event, without modifying any existing service. This is as close to the open/closed principle as distributed systems get. Orchestration, by contrast, requires modifying the orchestrator for every workflow change.

Resilience through autonomy. Because each service processes events independently and at its own pace, choreography is highly resilient to partial failures and slow services. If the fulfillment service is down, PaymentCaptured events accumulate in the queue. When it recovers, it processes them. The rest of the system continues unaffected. In orchestration, a slow or failing downstream service directly impacts saga throughput and may block orchestrator instances.

Natural fit for streaming and high-throughput systems. When you’re processing hundreds of thousands of events per second, choreography maps naturally onto Kafka partitions and consumer groups. Each service scales its consumers independently based on its own processing requirements. Orchestrators at this scale require significant engineering investment to avoid becoming a single-threaded bottleneck.

Low operational surface area for simple, stable flows. For a well-understood, rarely-changing flow with no branching, choreography requires no additional infrastructure. The event bus is already there. No workflow engine to deploy, monitor, or upgrade.

Where Choreography Breaks Down

The implicit saga problem. The most significant operational weakness of choreography is that the overall saga state is implicit. It exists only as a correlation across events spread across multiple service logs and message broker topics. When a saga stalls — because an event was dropped, a consumer had a bug, or a compensating transaction failed — diagnosing the problem requires reconstructing the event sequence from multiple sources. This is time-consuming and error-prone.

Many teams address this by building a dedicated event tracker — a service that listens to all saga-related events, correlates them by ID, and maintains a queryable read model of saga state. This is the right solution, but note what you’ve built: a component that aggregates the global view of saga state. It’s not an orchestrator, but it’s doing one of the things orchestrators do natively. If you’re building this anyway, the question of whether you should have just used an orchestrator is worth asking.

Distributed business logic and hidden coupling. In choreography, business logic about the saga workflow is distributed across every service that participates. The inventory service knows that after reserving stock, it should publish InventoryReserved — which is only meaningful in the context of the saga. If the saga workflow changes — say, a new compliance check step is added between inventory and payment — you need to coordinate changes across multiple services and their event schemas. This hidden coupling makes choreography harder to evolve than its initial simplicity suggests.

Compensation is harder to guarantee. In orchestration, if a step fails, the orchestrator explicitly issues compensation commands in reverse order, tracks which compensations have completed, and retries failures. In choreography, compensation happens through reactive event listeners. If the PaymentFailed event is published but the inventory service’s PaymentFailed consumer has a bug and doesn’t release the reservation, there is no component that knows a compensation was missed. The inventory remains reserved indefinitely, and no one knows until a customer or monitoring alert surfaces it.

Testing the full flow. Testing a choreographed saga end-to-end requires either a running instance of every participating service or a complex mock/stub setup that replicates the event-driven interactions. Integration tests for orchestrated sagas can test the orchestrator in isolation by mocking command responses, and test each service independently. The boundary is clearer.


Failure Modes That Apply to Both

Beyond the approach-specific weaknesses, several failure patterns affect sagas regardless of implementation style.

Lost Compensation Events

A compensating transaction fails — the service throws an exception, the message is nacked, and after maximum retries it lands in a dead-letter queue. In the DLQ, it sits until someone investigates. Meanwhile, the system is in an inconsistent state: some steps have been compensated, others haven’t. If the DLQ is not monitored with appropriate urgency and clear ownership, these inconsistencies persist indefinitely.

Every saga implementation needs explicit DLQ monitoring, alerting on messages older than a threshold, and a runbook for manual intervention. The runbook needs to answer: given this specific dead-lettered compensation command, what is the expected system state, and what manual steps restore consistency?

Pivot Transaction Ambiguity

In a saga, the “pivot transaction” is the last step that can succeed before compensation begins — the point of no return. Steps before the pivot can be compensated; steps after it cannot (or compensation is prohibitively expensive). Misidentifying the pivot creates sagas where compensation is attempted on steps that should not or cannot be reversed.

For example, in an order saga: reserving inventory can be compensated by releasing it. Charging a payment card can be compensated by issuing a refund. But sending a dispatch instruction to a third-party logistics provider may not be cancellable once sent. If your saga attempts to compensate the dispatch step, you may end up issuing a cancellation that the 3PL ignores, leading to the order being shipped despite the saga entering a compensating state.

Explicitly identify and document your pivot transaction. Steps after the pivot should be designed for forward recovery — retry until success — not compensation.

Saga Timeouts and Orphaned State

Sagas that time out without completing compensation leave orphaned state. An orchestrator that marks a saga as TIMED_OUT and stops processing has abandoned any in-progress local transactions across participating services. Choreography has no concept of a saga-level timeout by default — steps just stop reacting.

Saga timeout handling requires explicit design: what is the expected system state when a timeout fires, and what cleanup is needed? For each step that may have completed before the timeout, a compensation must be issued. Timeout handlers are effectively compensation initiators, and they need to be as robust as the happy path.

Event Schema Evolution Breaking Consumers

Choreography is particularly vulnerable to this. A service evolves the schema of an event it publishes. Consumers of that event have not yet been updated. Depending on your serialization format and schema registry configuration, this either causes consumers to fail silently (skipping fields), fail loudly (deserialization exceptions), or — most dangerously — process with incorrect data without raising an error.

Schema registries (Confluent Schema Registry, AWS Glue Schema Registry) with compatibility enforcement are not optional for production choreography systems. Forward and backward compatibility rules should be enforced at publish time, not discovered at consume time.


Making the Decision

The choice between orchestration and choreography is not binary and is not permanent. Most large systems use both — choreography for high-throughput, loosely-coupled flows where services genuinely don’t need to know about each other; orchestration for complex, stateful, business-critical workflows where operational visibility and explicit control are more valuable than decoupling.

A useful decision heuristic:

Lean toward orchestration when: the workflow has complex conditional branching, involves human steps or long wait states, needs tight operational visibility, involves compensation that must be guaranteed and sequenced, or is in a domain (finance, healthcare, compliance) where inconsistency has severe consequences.

Lean toward choreography when: services are genuinely independent and should remain so, throughput is high and latency requirements are strict, the workflow is stable and simple, or the team values independent deployability over centralized visibility.

Be explicit about what you’re giving up. Choosing choreography means accepting that saga state is implicit and debugging is harder. Mitigate it with an event tracker. Choosing orchestration means accepting a central dependency and potential throughput bottleneck. Mitigate it with a scalable workflow engine and disciplined separation of workflow logic from domain logic.

Neither approach eliminates the need for idempotent consumers, transactional outboxes, schema governance, DLQ monitoring, or explicit compensation design. These are table stakes. The approach determines where control and visibility live, not whether your system is correct.


Summary

Saga orchestration and choreography are both valid, production-proven approaches to managing distributed transactions. Orchestration gives you explicit control, centralized visibility, and easier debugging at the cost of a central dependency and tighter coupling. Choreography gives you true decoupling, independent deployability, and high-throughput scalability at the cost of implicit state, distributed business logic, and harder compensation guarantees.

The real work in either case is the baseline: dual-write safety through transactional outboxes, idempotent consumers at every step, explicit compensation design with DLQ handling, and observability that makes saga state queryable without forensic log analysis.

Get the baseline right. Then choose the approach that fits your operational context, your team’s cognitive preferences, and the specific characteristics of the workflow you’re building — not the one that looked better in the last conference talk you attended.