A field guide to auto-relationalization
Turning nested API exports — Workday, PeopleSoft, Salesforce — into a proper relational model, without rebuilding it by hand every release.
Most data teams treat nested-to-relational as a design problem. Someone draws the target schema, someone else writes the mapping code, and the work begins.
It is actually a path problem, in two parts. The relational model is already in the export — encoded in the path structure of the source itself — and Stage 1 is reading it, not designing it. Stage 2 is its inverse: walk each leaf path back through every intermediate parent and fold the columns along the way into one wide, query-ready table per leaf. Same primitive, both directions. The reframe changes the cost of the initial build and, more importantly, what happens when the source API ships v2.
The schema is in the path. Both directions.
The relational graph in the middle is not a target the team designed — it is what was already encoded in the nested source on the left. Each highlighted path becomes a table, each level of nesting becomes a parent-child relationship, each natural key on a parent travels down to every descendant. The leaf-flattened tables on the right are not a separate denormalization exercise — they are the same paths walked backward, from each leaf through every intermediate parent to the root, columns folded along the way, the natural key kept for re-join. Two stages, one primitive: paths.
Pulled apart by reading paths. Recombined along the same paths. Both stages deterministic; both regenerable on schema change. The pattern earns its keep on the day the source ships v2.
Extract: read, don't design.
The relational model is implicit in how the nested structure was authored. Most of Stage 1 is making it explicit — walking every path, propagating natural keys downward, preserving source-namespace prefixes as lineage.
Fold: walk back, don't denormalize.
One wide table per leaf path is not the same as one wide table for everything. Stage 2 is the surgical inverse of Stage 1 — for each leaf, walk the path back through every intermediate parent and join deterministically via the keys Stage 1 already propagated.
Four moves that make the extraction reliable.
Every team that has tried this and stopped has tripped over one of these four. The combination of all of them is what separates a one-time data load from a relational model that holds up across API releases.
Iterative array extraction, not recursive.
Recursive descent works for two levels and breaks at three. Arrays inside structs inside arrays — common in Workday compensation data, in any nested benefit-election structure, in most modern enterprise APIs — is where naive parsing breaks. A queue-based iterative approach instead: push the root onto a queue, pop it, find array columns at the top level, extract each one to a child table, push the children back on, repeat until the queue is empty. Mundane mechanically. The difference is correctness on the deeply-nested cases the source actually produces.
Natural-key propagation.
The relational structure is only useful if it can be reassembled. Every child table must carry the parent's natural keys — employee number, account ID, whatever the source treats as a stable identifier — plus a system-generated key tying it to its immediate parent in the extraction graph. Most flattening approaches lose this. You end up with rows in tables, but the relationships back to the source's notion of identity are gone, and reassembling them is a manual exercise.
Column-name normalization without losing lineage.
Workday's wd: namespace prefix, PeopleSoft's PSFT_ conventions, Salesforce's custom-field suffix __c — these all need to be normalized into something a SQL engine can parse cleanly. But they should not be flattened away. The source prefix should survive transformation through the early layers, so when someone asks "where did this column come from?", the answer is in the column name itself, not buried in catalog metadata three systems away.
System-generated join keys in their own namespace.
Source data has its own id columns. The system-generated keys for relational reassembly need to live in a different namespace — a deliberate prefix like _sys_id or similar — so there is never any ambiguity between "the source's notion of identity" and "the system's notion of identity." Reusing id is one of the most common quiet mistakes. It works until a source brings its own id field with semantically different meaning — and then join queries return wrong answers without errors.
Two moves that keep the fold lean.
Stage 2 is structurally simpler than Stage 1 — there is no algorithm to invent, only a hierarchy to walk. The hierarchy is already known; it was produced as a first-class artifact in Stage 1. What makes the fold lean rather than messy is the discipline that decides what to fold and how.
One wide table per leaf path, not one for everything.
The cheap denormalization is to join everything to everything and produce a single fact-shaped artifact bristling with nulls. It works for a quarter, until the second array branch appears at a different cardinality and quietly multiplies the row count. The discipline is to produce one wide table per leaf path — every array branch in the hierarchy gets its own folded output, joined back to the root via the natural key when a question crosses branches. The output set is exactly the leaf set the source itself produced. No more, no fewer.
Join via the keys Stage 1 already propagated.
The fold does not need to discover join keys, infer relationships, or build a semantic model. Stage 1 already propagated every ancestor's natural key to every descendant. The fold walks the path back to the root, inner-joins each level via those propagated keys, and keeps a deterministic prefix (L0_, L1_, L2_) so each column's provenance — which level of the path it came from — is visible in its name. No semantic interpretation. No hand-curated mappings. Same input, same output, every time.
The leaves are not a separate denormalization project. They are the same algorithm in reverse — using the same keys, the same hierarchy, the same source-namespace prefixes. Regenerate them when the source ships v2; you will not need to touch the joins.
Three details most teams underweight.
The four moves are about what to do at every level. These three are about the lower-level mechanics — the difference between a relational model that holds and one that quietly erodes:
posexplode_outer, never plain explode.
When an array becomes rows, the positional order of the array is information. Time-series-flavored data depends on it. Operational reconciliation depends on it. Plain explode loses both the position and (critically) the empty-array case — rows with no children are silently dropped. posexplode_outer preserves the index column and keeps the parent row even when the array is empty. The cost is one extra column. The benefit is a model that does not silently drop edge cases.
Hierarchy as queryable metadata, not private code.
The parent-child map of extracted tables should not live only inside the pipeline code. It should be a catalog artifact — discoverable, queryable, usable by downstream tools and agents to navigate the relational graph without re-reading the source schema. The pipeline knows what extracted what; the metadata layer should know it too. Once it is queryable, every downstream consumer benefits from it for free.
One Spark application per list of tables.
This is the same lesson the lake patterns generally teach — but it bites harder here. Each level of nesting produces another table. A six-level Workday export with a naive "one job per table" approach is six rounds of JVM startup before any actual work happens. The fix is the coordinator pattern — chunk the tables, single Spark driver iterates the list, and the JVM cost amortizes across the run instead of repeating per table.
Skip any of these and the model still works — for a while. The cost arrives on the day a source produces an edge case that hits the gap. Most of the time, the gap is one of these three.
Agents are more useful when the model is legible.
A relational graph extracted cleanly from the source — with hierarchy preserved as metadata, natural keys propagated, source lineage in the column names — is exactly the substrate an AI layer can work with. Three high-leverage applications:
Semantic column classification
Once the relational graph exists, an AI layer is good at classifying what each column actually represents — date, identifier, name, currency, free text, PII. The hierarchy makes the work easier: nested structure encodes meaning that flat-and-mapped data has already lost. The output is a column catalog downstream tools — BI, agents, governance — can use without hand-coding.
Self-documenting datasets
Hierarchy plus a sample of recent rows is enough for an agent to write — and keep current — a plain-English narrative of every table: what's in it, how it relates to other tables, what changed between API releases. Refreshed automatically when the schema evolves. Your catalog becomes self-documenting.
Schema-drift characterization, not just detection
When the source API ships a new version, the question is not 'what broke' — it is 'what changed semantically.' If the source used to return compensation.salary and now returns compensation.base_salary.amount, that is not a break, it is a refinement. An AI layer over the relational graph can characterize the change — distinguishing renames, refinements, structural rearrangements, and genuine semantic shifts — and recommend the migration response per case.
Cross-source semantic mapping
Two systems that both call something 'customer' rarely mean exactly the same thing. With the relational graphs from both sides as the input, an AI layer can propose a mapping between them — a hypothesis the team validates, not generates from scratch. The hierarchy is what makes the hypothesis tractable.
Auto-relationalization is not a translation between business semantics. It extracts what the source encoded; it does not reconcile what two different systems mean by "customer" or "employee." That conflict is semantic, not structural — and the right place to address it is one layer up, against the relational graph this pattern produces, not against the nested source itself. The brief above describes the substrate. What teams build on top of it is its own problem.
The infrastructure that hosts these tables — S3, Apache Iceberg, partitioning, the medallion architecture — is a separate subject. It is covered in our brief on building a lean data lake.
Read · A field guide to the lean data lakeIf the relational model is hand-maintained —
If the relational model your team is producing is hand-maintained, breaks on every API release, and is still missing the deeper-nested arrays — the pattern above is what most production-grade extractions look like. An hour is usually enough to see whether it fits how your sources behave.