Nikhil Verma

Why nested JSON will eventually bite you

I was staring at a memory spike in our Temporal worker pod. We were running this document processing pipeline. It did AI-assisted cleanup, statement rewrites and more — and everything was melting down. Every time we touched a single clause in a document, we were loading the entire document into memory just to find it.

I guess when we started it made sense to store documents in the way our brain works, or reads them visually. Documents are indented right? Let’s indent the JSON too! (nest it). A statements column where each statement had a statements array inside it, and those had more statements arrays… section → subsection → clause → sub-clause, all the way down. Seems intuitive at that time.

Why Nested JSON Falls Apart

Nested JSON for hierarchical stuff is tempting. When you’re building v1, you’re always loading the whole document anyway — rendering a page, populating an editor, exporting a report. So why not just store it as a tree?

But the second you start running async jobs on individual nodes, it all falls apart.

Here’s what we needed: process statements in chunks of ten, track where we left off, resume after failures. Standard stuff for Temporal workflows.

But with nested JSON there’s no cursor. A statement doesn’t have an address. Want statement #47? Cool, deserialize the entire blob, walk the tree in memory until you find it. There’s no row you can just SELECT ... WHERE id = $1. No ORDER BY that makes sense. The blob is completely opaque to Postgres.

So our chunk processing looked like this: load entire document → deserialize JSON → flatten it in code → slice the chunk we want → process it → re-serialize everything → save it back. Every. Single. Chunk.

For a 2,000-statement document processed in chunks of 10, that’s 200 full document loads. Two hundred!

The same nightmare hit our AI jobs. To give the model context for a single clause — like its parent headings, siblings, immediate children — we had to load the full document, rebuild the tree in memory, find the clause, then extract the context window. The I/O cost was bigger than the actual LLM call!

The Fix: Two Integer Columns

The migration was seemingly simple, we need to stop storing the tree. Just store a flat list of rows. Each row gets two integers: order and indent.

CREATE TABLE statements (
  id         uuid PRIMARY KEY DEFAULT gen_random_uuid(), -- requires PG 13+
  document_id uuid NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  "order"    integer NOT NULL,
  indent     integer NOT NULL,
  title      text NOT NULL,
  -- ... other columns
);

-- Composite index: backs both the cursor scan and the subtree range queries below
CREATE INDEX ON statements (document_id, "order");

order is just a sequence number within the document. Start at zero, increment by one. indent is the nesting depth, also zero-based. Top-level section? indent: 0. Subsections? indent: 1. Clauses? indent: 2.

No parent ID column. No parentId. No recursive CTEs. You reconstruct parent-child relationships at read time from order and indent together. If you have rows in order, you can rebuild the full tree structure in a single forward pass.

In Drizzle it looks like this:

export const statementsTable = pgTable("statements", {
  id: uuid().defaultRandom().primaryKey(),
  documentId: uuid().notNull().references(() => documentsTable.id, { onDelete: "cascade" }),
  order: integer().notNull(),
  indent: integer().notNull(),
  title: text().notNull(),
  // Note: text({ enum }) only gives you TypeScript type hints — there's NO database constraint.
  // If you need actual DB-level validation, use pgEnum instead.
  // ...
});

Now every statement is its own row. Independently addressable. The chunk processing query becomes clean:

const chunk = await db
  .select()
  .from(statementsTable)
  .where(
    and(
      eq(statementsTable.documentId, documentId),
      gt(statementsTable.order, lastProcessedOrder)
    )
  )
  .orderBy(asc(statementsTable.order))
  .limit(10);

No more document load, JSON deserialization, in-memory tree building. The cursor is just the last processed statement’s ID. Failure at statement #47? Next run picks up at order 47, grabs the next chunk, continues. The whole document never touches memory.

Rebuilding the Tree

Okay, so you’ve flattened everything. But sometimes you do need the hierarchy back — which statement is a child of which heading, what are a clause’s ancestors, and so on. Here’s the bit I got slightly muddled the first time, so let me be precise: there are two situations, and they want completely different answers.

Situation one: you’ve already got the whole document in memory

You’re rendering the editor or exporting a report — you’ve loaded every row anyway. Don’t fire off a query per node. One O(n) forward scan with a stack rebuilds every parent-child link in a single pass:

function buildTree(statements: Array<{ id: string; indent: number; order: number }>) {
  const stack: Array<{ id: string; indent: number; childrenIds: string[] }> = [];
  const result = statements.map(stmt => ({ ...stmt, childrenIds: [] as string[] }));

  for (const statement of result) {
    // Pop everything off the stack that's at the same or deeper indent
    while (stack.length > 0 && stack[stack.length - 1].indent >= statement.indent) {
      stack.pop();
    }

    // Whatever's on top now? That's the parent
    if (stack.length > 0) {
      stack[stack.length - 1].childrenIds.push(statement.id);
    }

    stack.push(statement);
  }

  return result;
}

This runs in O(n). Each statement gets pushed once, popped at most once. The stack at any point is the “spine” of ancestors — the deepest open heading at each indent level. Simple. Use it when the rows are already in your hands.

Situation two: you want one node’s subtree without loading the document

This is the case that started this whole mess — the per-clause LLM context, the async worker job. Rebuilding the tree in JS here means loading the entire document to look at one node, which throws away the whole reason we flattened it. So don’t scan in app code. Let Postgres do it.

In a flattened pre-order list, a subtree is a contiguous run of rows — everything after your node until indent climbs back up to the node’s level or above. That’s just a range query with a computed boundary.

-- All descendants (the entire subtree) of a node
WITH boundary AS (
  SELECT MIN("order") AS o
  FROM statements
  WHERE document_id = $1 AND "order" > $2 AND indent <= $3   -- next row at/above target's level
)
SELECT s.*
FROM statements s, boundary b
WHERE s.document_id = $1
  AND s."order" > $2
  AND s."order" < COALESCE(b.o, 2147483647)   -- no boundary ⇒ subtree runs to end of doc
ORDER BY s."order";

In Drizzle:

import { and, eq, gt, lte, asc, sql } from "drizzle-orm";

async function findDescendants(
  documentId: string,
  target: { order: number; indent: number },
) {
  // The boundary: the first row after `target` that climbs back to its level (or above)
  const boundary = db
    .select({ o: sql<number>`min(${statementsTable.order})` })
    .from(statementsTable)
    .where(and(
      eq(statementsTable.documentId, documentId),
      gt(statementsTable.order, target.order),
      lte(statementsTable.indent, target.indent),
    ));

  return db
    .select()
    .from(statementsTable)
    .where(and(
      eq(statementsTable.documentId, documentId),
      gt(statementsTable.order, target.order),
      sql`${statementsTable.order} < coalesce((${boundary}), 2147483647)`,
    ))
    .orderBy(asc(statementsTable.order));
}

Want direct children only, not the whole subtree? Add one more condition to the outer query: indent = target.indent + 1. Same range, one level deep.

Ancestors are the backwards version. No boundary trick needed — a correlated NOT EXISTS says “this earlier row is still an open ancestor at the target’s position” (i.e. nothing between it and the target closed it off):

SELECT a.*
FROM statements a
WHERE a.document_id = $1 AND a."order" < $2 AND a.indent < $3
  AND NOT EXISTS (
    SELECT 1 FROM statements m
    WHERE m.document_id = $1
      AND m."order" > a."order" AND m."order" < $2
      AND m.indent <= a.indent          -- something closed this ancestor before we reached the target
  )
ORDER BY a."order";

That (document_id, "order") index from earlier is exactly what turns these into cheap range scans — and crucially, you get back only the rows you asked for, not a 10MB blob.

The one mistake to avoid is using the situation-one scan for a situation-two problem: loading the whole document into memory to pluck out a single node’s subtree. That’s the exact thing we set out to kill — don’t sneak it back in through the side door.

What You Actually Get From This

The big win is addressability. Every node is now a real row with a real ID. You can:

  • Run a Temporal activity on a specific statement without loading anything else
  • Resume batch jobs from an exact cursor with ORDER BY "order" and WHERE "order" > $cursor
  • Pull a single node’s subtree (or just its direct children, or its ancestors) with a couple of range queries — no full-document load
  • Use actual database constraints — foreign keys, indexes, NOT NULL — on fields that were buried in opaque JSON

The second win is that Postgres can actually help. Indexes on (document_id, "order") are useful in ways that JSONB GIN indexes can’t be. Range queries, partial loads, paginated processing — all straightforward SQL.

The tradeoff? You give up the comforting illusion that the document “is” a tree. It isn’t anymore — it’s an ordered list, and you reconstruct hierarchy on demand. For the whole-document case that’s a tiny O(n) scan; for the per-node case it’s a range query. Both are simple and easy to test. The single thing to get right in either form is the subtree boundary — stopping when indent returns to the target’s level — whether you express it as a break in the scan or a < boundary in SQL.

Should You Do This?

If you’re modeling any hierarchical content — legal docs, org charts, product taxonomies, config frameworks, knowledge bases — and you think you’ll ever need to process individual nodes async or stream partial subtrees to workers, this pays off fast.

Nested JSON feels natural at first. Then your documents get big. Your workers get concurrent. And suddenly you’re loading entire 10MB blobs just to touch one row.

That’s when you realize: they’re not trees. They’re just ordered lists with indentation. Store them that way.