A field guide to event-driven orchestration
Retiring the orchestration server — replacing Airflow and commercial schedulers with Step Functions, EventBridge, and Lambda that only exist while a pipeline is running.
Most data platforms still keep a server whose only job is to start other work.
It runs Airflow, or a commercial scheduler, and it is usually the least reliable machine in the stack: patched on weekends, sized for the worst hour of the month, running at three in the morning whether there is work or not. When it goes down, nothing else is broken — and nothing else runs. The team ends up operating the thing that operates the pipelines.
The fix most teams reach for is a better scheduler: managed Airflow, a newer engine, more replicas. There is a more direct option: no scheduler. Pipelines are started by the events that make them necessary (a file landing, an upstream run completing) and executed as state machines that declare their own retries and leave a full execution history, on compute that exists only while the run does. We have retired the scheduler this way three times, for a research university's student data warehouse, for a platform inside Salesforce, and for Taco Bell's first AWS analytics environment, and the pattern has held each time. This brief covers what makes it work, and what it costs.
One event in. A pipeline out.
The scheduler's job survives — something still has to know what runs after what. The server does not. The dependency graph moves out of a Python file on a box and into declared event subscriptions that assemble the pipeline at runtime.
The DAG is emergent
No file defines the graph. Each pipeline declares what it listens for; the DAG is what those subscriptions add up to, assembled fresh on every run.
Zero idle infrastructure
Between runs there is nothing to patch, size, or babysit. Cost tracks work done, not hours elapsed — the always-on line item disappears.
Failure handling is declared
Retries, backoff, timeouts, and catch paths are properties of the state machine, not scripts around it. Every run leaves a complete, inspectable history.
Four moves that retire the server.
Deleting the scheduler is not the work. The work is replacing each thing it quietly did — dependency ordering, retry logic, operational memory — with something that does it better. Four moves cover it.
Make events the dependency graph.
In a scheduler, "B runs after A" is a line of DAG code. In this model it is a subscription: pipeline A emits a completion event, and an EventBridge rule declares that this event starts B. That inversion matters more than it looks. Dependencies stop being one team's Python file and become declared, inspectable infrastructure — any pipeline can subscribe to any event without asking the owner of the graph, because there is no owner of the graph. New consumers attach without touching what they consume.
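A minimal sketch of that inversion, in plain Python. The source name `example.pipelines`, the detail-type `PipelineCompleted`, and the pipeline names are hypothetical. The pattern shape follows EventBridge's event-pattern syntax (each field maps to a list of accepted values), while `matches` is only a simplified local stand-in for the service's matching, good for exact values and nothing more.

```python
# Hypothetical names for illustration; not taken from any real deployment.
SOURCE = "example.pipelines"
DETAIL_TYPE = "PipelineCompleted"


def completion_pattern(upstream: str) -> dict:
    """EventBridge-style pattern matching the completion event of one pipeline."""
    return {
        "source": [SOURCE],
        "detail-type": [DETAIL_TYPE],
        "detail": {"pipeline": [upstream]},
    }


def matches(pattern: dict, event: dict) -> bool:
    """Simplified exact-value matcher; the real service supports far more."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching nested object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Each pipeline declares only what it listens for. No file holds the graph.
SUBSCRIPTIONS = {
    "B": completion_pattern("A"),
    "C": completion_pattern("A"),
    "D": completion_pattern("B"),
}


def completion_event(pipeline: str) -> dict:
    """The event a pipeline emits when it finishes."""
    return {
        "source": SOURCE,
        "detail-type": DETAIL_TYPE,
        "detail": {"pipeline": pipeline},
    }


def started_by(event: dict) -> list[str]:
    """Names of the pipelines whose subscription matches this event."""
    return sorted(
        name for name, pattern in SUBSCRIPTIONS.items() if matches(pattern, event)
    )
```

Calling `started_by(completion_event("A"))` returns B and C. Attaching a fifth pipeline to A means adding one entry to `SUBSCRIPTIONS`, with no change to A itself.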
Let the state machine own the run.
Every run is a Step Functions execution: each step's retries, backoff, timeouts, and failure paths are declared in the machine's definition, and every execution leaves a step-by-step history you can open and read. This is where the operational muscle a scheduler never had shows up — the Taco Bell framework was built around automatic restartability, so a pipeline that failed at step six resumed at step six, not from the beginning. Recovery became a property of the machine rather than a 6 a.m. human ritual.
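What "declared in the machine's definition" can look like, as a hedged sketch: a one-task Amazon States Language definition built as a Python dictionary. The Lambda ARN, state names, and specific numbers are assumptions made for illustration. The `Retry`, `Catch`, and `TimeoutSeconds` fields are standard Amazon States Language, and `retry_delays` only does the backoff arithmetic locally so the schedule is visible.

```python
import json

# Hypothetical Lambda ARN; substitute a real function in practice.
LOAD_FN = "arn:aws:lambda:us-east-1:123456789012:function:load-table"

DEFINITION = {
    "StartAt": "LoadTable",
    "States": {
        "LoadTable": {
            "Type": "Task",
            "Resource": LOAD_FN,
            "TimeoutSeconds": 300,
            # Retriers are tried first; the catcher applies once they are exhausted.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RecordFailure"}],
            "End": True,
        },
        "RecordFailure": {
            "Type": "Fail",
            "Error": "LoadFailed",
            "Cause": "Retries exhausted",
        },
    },
}


def retry_delays(retrier: dict) -> list[float]:
    """Seconds waited before each retry attempt under this retrier."""
    interval = retrier.get("IntervalSeconds", 1)
    rate = retrier.get("BackoffRate", 2.0)
    attempts = retrier.get("MaxAttempts", 3)
    return [interval * rate**n for n in range(attempts)]


def definition_json() -> str:
    """Serialized form, as a state machine definition would be supplied."""
    return json.dumps(DEFINITION, indent=2)
```

With these numbers the task waits 5, 10, and then 20 seconds between attempts before the catch path takes over. The point of the sketch is that none of this lives in a wrapper script.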
Generate the pipeline from metadata.
The state machines themselves should be boring and few. What varies per source — tables, schemas, validation rules, destinations — belongs in metadata, with the machine expanding it at runtime into parallel branches: dynamic DAG generation. At the university, one generated pattern carried the entire student data warehouse; at Salesforce, metadata-driven onboarding took a new source from weeks to days, because adding a source meant adding configuration, not writing orchestration code.
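One way this can be sketched, assuming the fan-out is done with a Step Functions Map state: the machine stays generic, and the per-source variation arrives as execution input built from metadata. The metadata fields, table names, and Lambda ARN below are hypothetical, and this is not a reconstruction of either client's implementation.

```python
import json

# Hypothetical source metadata; in practice it might live in a table or config files.
SOURCES = [
    {"table": "enrollment", "format": "xml", "destination": "curated/enrollment"},
    {"table": "courses", "format": "cdc", "destination": "curated/courses"},
    {"table": "paused_feed", "format": "cdc", "destination": "curated/paused",
     "enabled": False},
]

# Hypothetical Lambda ARN for a per-table processor.
PROCESS_FN = "arn:aws:lambda:us-east-1:123456789012:function:process-table"

# One generic machine: the Map state fans out over whatever the input lists.
DEFINITION = {
    "StartAt": "PerTable",
    "States": {
        "PerTable": {
            "Type": "Map",
            "ItemsPath": "$.tables",
            "MaxConcurrency": 10,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "ProcessTable",
                "States": {
                    "ProcessTable": {
                        "Type": "Task",
                        "Resource": PROCESS_FN,
                        "End": True,
                    }
                },
            },
            "End": True,
        }
    },
}


def execution_input(sources: list[dict]) -> str:
    """Build the execution input: one Map item per enabled source."""
    tables = [
        {key: value for key, value in source.items() if key != "enabled"}
        for source in sources
        if source.get("enabled", True)
    ]
    return json.dumps({"tables": tables})
```

Onboarding a new source in this sketch is one more entry in `SOURCES`. The definition does not change, which is what keeps the machines boring and few.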
Keep every piece in Terraform.
An event-driven platform is many small parts (the university build ran to 50+ Lambda functions and 20+ Step Functions workflows), and the only way that stays operable is if all of it, rules and machines and functions and permissions, is declared in code. This is also the honest answer to "where did the scheduler UI go": the system's definition lives in the repository, reviewable in pull requests, reproducible in any account. A scheduler configured by hand drifts silently; a platform declared in Terraform surfaces drift in the next plan, provided every change goes through the repository.
Three machines, one bus, no server.
The four moves compose into a single topology. Producers put events on the bus. Rules match events to state machines. Each machine runs its steps, emits a completion event, and vanishes, and that completion event is what starts the next machine. The resulting chain is a working pipeline with nothing running between runs.
Time-based runs do not disappear — some pipelines genuinely are "every morning at six." They become one more producer: a schedule rule that emits an event onto the same bus. The schedule keeps its job. It just stops being the architecture.
The pattern at enterprise scale.
This is the same topology rolled up from a system we designed and run in production — the enterprise data platform behind a research university's student data warehouse. Workday XML and Oracle change-data-capture land in an hour-partitioned raw bucket; per-table processors relationalize any schema into Parquet and merge it forward; Lake Formation governs it; Athena and Redshift serve it. Hundreds of tables move through this chain daily, and there is no orchestration server anywhere in the picture.
Two details worth noticing. The event ledger is the rebuilt pane of glass — every file landing, every job status, every failure is recorded as an event and queryable in Athena, which is how "did last night run?" gets its answer without a scheduler UI. And the handful of time-based feeds that remain — CDC pulls, warehouse refreshes — enter the picture as event producers on the same bus, exactly as the model prescribes.
What the scheduler was giving you for free.
A scheduler is a bad server but a real product, and deleting it deletes its conveniences too. Three of them have to be rebuilt deliberately — teams that skip this step end up missing the server they hated:
The single pane of glass.
Airflow's one genuine gift is a page listing every DAG, every run, every status. An event-driven platform has no such page until you build it — executions are scattered across state machines, and "did last night run?" has no default answer. Treat observability as a first-class deliverable: centralized structured logging, a run manifest per pipeline, and alerting on the absence of an expected completion event, not just on failures. At Taco Bell this was a dedicated build — CloudWatch, Elasticsearch, Kibana — and it shipped with the platform, not after it.
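Alerting on absence is the least familiar of those three, so here is a minimal sketch of the check itself. The manifest format (pipeline name mapped to a latest acceptable completion time in UTC), the pipeline names, and the grace period are all assumptions. In a real platform the set of completed pipelines would come from the event ledger rather than a literal.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical run manifest: pipeline -> latest acceptable completion time (UTC, HH:MM).
EXPECTED = {"enrollment": "06:30", "courses": "07:00", "finance": "09:00"}


def overdue(
    expected: dict[str, str],
    completed: set[str],
    now: datetime,
    grace: timedelta = timedelta(minutes=15),
) -> list[str]:
    """Pipelines whose completion event should have arrived by now but has not."""
    late = []
    for pipeline, hhmm in expected.items():
        hour, minute = map(int, hhmm.split(":"))
        deadline = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if now > deadline + grace and pipeline not in completed:
            late.append(pipeline)
    return sorted(late)


def check(completed: set[str]) -> list[str]:
    """Convenience wrapper that evaluates the manifest against the current time."""
    return overdue(EXPECTED, completed, datetime.now(timezone.utc))
```

Note what triggers the alert: no failure event ever fires for a pipeline that was never started, so the check compares the clock against the manifest instead of waiting for an error.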
Exactly-once is now your job.
Event delivery is at-least-once on a good day and silently absent on a bad one — a misconfigured rule drops events without an error, and a duplicate delivery runs your pipeline twice. The discipline is to make both harmless: every step idempotent, so a replayed event converges instead of double-writing, and a periodic reconciliation sweep that compares what landed against what ran and re-emits anything missed. Build replay in from day one — it is the mechanism for both failure recovery and backfill, and retrofitting it is miserable.
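Both halves of that discipline can be sketched in a few lines. Everything named here is an assumption: the event fields `pipeline` and `object`, the `Step` class, and the in-memory set standing in for a durable store. A real step would record its key with an atomic conditional write so that two concurrent deliveries cannot both pass the check.

```python
import hashlib
import json


def idempotency_key(event: dict) -> str:
    """Stable key built from what identifies the work, not the delivery."""
    identity = {"pipeline": event["pipeline"], "object": event["object"]}
    return hashlib.sha256(json.dumps(identity, sort_keys=True).encode()).hexdigest()


class Step:
    """In-memory stand-in for a step backed by a durable processed-keys store."""

    def __init__(self) -> None:
        self.processed: set[str] = set()
        self.writes: list[str] = []

    def handle(self, event: dict) -> bool:
        """Process the event once; return False for a duplicate delivery."""
        key = idempotency_key(event)
        if key in self.processed:
            return False  # replayed or duplicated event: converge, do not double-write
        self.writes.append(event["object"])
        self.processed.add(key)
        return True


def reconcile(landed: list[dict], step: Step) -> list[dict]:
    """Events for objects that landed but never ran: candidates to re-emit."""
    return [event for event in landed if idempotency_key(event) not in step.processed]
```

The two functions cover opposite failure modes. `handle` makes a duplicate delivery harmless, and `reconcile` finds the deliveries that never happened. Because the step is idempotent, re-emitting everything `reconcile` returns is always safe.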
Backfills lose their button.
"Re-run March" is a first-class Airflow concept and does not exist here by default — there is no calendar of runs to click. The replacement is the replay mechanism from the previous point: because every pipeline is started by an event, a backfill is just emitting the historical events again and letting the same machines process them. That is genuinely cleaner than scheduler catchup — the backfill path and the production path are the same code — but only if you designed events to carry enough context to be re-emitted. Decide that on day one, not the day finance asks for a restatement.
A platform that explains its own failures.
An event-driven platform turns out to be the natural substrate for AI operations, for a simple reason: an agent is just another subscriber. Everything that happens is already an event with structured context attached — no scraping a scheduler UI to find out what went wrong. Four applications that have earned their keep:
Failure triage, before a human arrives
A failure event carries the state machine, the step, the input, and the error. An LLM-based triage system subscribed to those events identifies the likely root cause and recommends remediation before an engineer opens the console — at Salesforce this meaningfully cut time-to-diagnosis for the operations team, because the 2 a.m. question changed from "what happened?" to "do we agree with the diagnosis?"
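A hedged sketch of the subscriber side. The event shape is hypothetical (a custom failure event with `stateMachine`, `step`, `input`, and `error` in its detail), and the model call is left as a plain callable so that no particular LLM client or API is implied. This is not the Salesforce implementation, only an illustration of how little glue the structured event leaves to write.

```python
import json
from typing import Callable


def triage_context(event: dict, max_input_chars: int = 500) -> str:
    """Turn a failure event into the text handed to the model."""
    detail = event["detail"]
    payload = json.dumps(detail.get("input", {}), sort_keys=True)
    if len(payload) > max_input_chars:
        # Keep the prompt bounded; large inputs are cut, and marked as cut.
        payload = payload[:max_input_chars] + "...(truncated)"
    return "\n".join(
        [
            "A pipeline step failed. Suggest a likely root cause and a remediation.",
            f"State machine: {detail['stateMachine']}",
            f"Step: {detail['step']}",
            f"Error: {detail['error']}",
            f"Input: {payload}",
        ]
    )


def triage(event: dict, ask_model: Callable[[str], str]) -> dict:
    """Attach a draft diagnosis to the failure; ask_model is any str -> str callable."""
    return {
        "stateMachine": event["detail"]["stateMachine"],
        "step": event["detail"]["step"],
        "diagnosis": ask_model(triage_context(event)),
    }
```

The output is a draft for the on-call engineer to confirm or reject, which matches the shift described above: the first question becomes whether the diagnosis is right.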
Metrics analysis, autonomous and cheap
At the university, a Bedrock-powered Lambda analyzes CloudWatch metrics on its own, scoping queries per business unit — a design that cut API call volume by 98%. The lesson generalizes: give the agent the platform's own events and metrics as its context, and scope its queries the way you would scope a human analyst's.
Missing-run detection with judgment
The absence-of-event alerting from the observability build is a natural agent job: something that knows the historical rhythm of each pipeline, notices "the Tuesday file is four hours late," checks the producer's side, and either waits, escalates, or re-emits — the decision a human on-call makes, made earlier.
Onboarding as a conversation
When adding a source is adding metadata (the third move), an agent can draft that metadata: read a sample of the new feed, propose the schema, validation rules, and subscriptions, and open the pull request. The human reviews a diff instead of authoring one, and the Terraform discipline (the fourth move) means the review is the deployment gate.
A caution to close on: this is not a mandate to delete Airflow by Friday. A team deep in Airflow's ecosystem, leaning on its operators and its backfill semantics, with pipelines that fit it well, has no emergency. The pattern earns its move when the scheduler is the thing you babysit — when the box itself is the incident. Then the answer is not a better box.
If the box that starts everything keeps stopping —
If the orchestration server is the machine your team patches, restarts, and worries about, the pipelines are not the problem — the architecture underneath them is, and it is a fixable one. An hour is usually enough to know whether event-driven orchestration fits your platform.