Rate limiting LLM APIs across distributed workers
So picture this. The system is throwing Anthropic 429s every two or three minutes during big processing runs. Fine — expected, honestly. We were processing thousands of documents in parallel across a fleet of Temporal workers, so of course we were leaning on the rate limit. What was not expected was that the 429s got worse after we added retry logic. You add the thing that’s supposed to help, and it makes everything angrier. Cool stuff.
We’d implemented exponential backoff with a max of 120 seconds. By the time most of those long backoffs fired, the quota they were waiting on had already come back. We were manufacturing dead air on top of the rate-limit problem and then patting ourselves on the back for it.
Let me walk you through the three-layer pattern that actually fixed it. But first I have to clear up a misconception I had, because it’s the whole reason the backoff was wrong.
The thing about TPM limits that I got wrong
I used to picture token-per-minute limits like a parking meter: the clock ticks over at 60 seconds and bang, the whole allowance resets to full. So my mental model was “if I get a 429, just wait until the next reset and I’m golden.” Backing off past that reset point is obviously pointless, right?
Turns out that’s not how it works. A rate limit doesn’t snap back all at once; it replenishes gradually. Whatever a given provider uses under the hood, a token bucket that refills at a steady rate or a sliding window that lets old usage age out, the effect is the same: your headroom climbs back up continuously across the window rather than in a single tick at the top of the minute. There’s no parking-meter moment to wait for. So when you’ve completely drained your quota, getting back to full takes roughly the length of the window (call it a minute), and it happens a little at a time the whole way there, not in one lump at the end. (Check your provider’s docs for the specifics; Anthropic, for instance, describes its limit as a token bucket whose capacity is continuously replenished.)
So why does that change anything? Because the practical upshot is almost the same but the reason is completely different. If you’ve fully drained the bucket, it refills back to maximum over roughly a minute. So the useful ceiling on a backoff is about 60 seconds — not because a window “resets,” but because after ~60s of trickle the bucket has recovered all the headroom it’s ever going to give you. Wait longer than that and you’re just standing there while a full bucket waits for you to do something with it.
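To make the refill behaviour concrete, here is a minimal sketch of a continuously refilling bucket. The 100,000-token capacity, the 60-second window, and the names `TokenBucket`, `available`, and `tryConsume` are all illustrative assumptions; this is a toy model, not any provider’s actual implementation or limits.

```typescript
// Toy token bucket: capacity refills linearly over one window.
// All numbers and names here are illustrative assumptions.
class TokenBucket {
  private readonly capacity: number;
  private readonly windowMs: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, windowMs: number, startMs: number) {
    this.capacity = capacity;
    this.windowMs = windowMs;
    this.tokens = capacity; // start full
    this.lastRefillMs = startMs;
  }

  // Headroom at nowMs, after crediting the refill earned since the last update.
  available(nowMs: number): number {
    const elapsed = Math.max(0, nowMs - this.lastRefillMs);
    const refill = (elapsed / this.windowMs) * this.capacity;
    this.tokens = Math.min(this.capacity, this.tokens + refill);
    this.lastRefillMs = nowMs;
    return this.tokens;
  }

  // Try to spend tokens; returns false when the request would be throttled.
  tryConsume(amount: number, nowMs: number): boolean {
    if (this.available(nowMs) < amount) return false;
    this.tokens -= amount;
    return true;
  }
}

const bucket = new TokenBucket(100_000, 60_000, 0);
bucket.tryConsume(100_000, 0); // drain the bucket completely at t=0

// Headroom at 15s, 30s, 60s, and 120s after the drain.
const headroom = [15_000, 30_000, 60_000, 120_000].map((t) => bucket.available(t));
console.log(headroom);
```

In this model the headroom is a quarter back at 15 seconds, half back at 30, and full at 60. The reading at 120 seconds is identical to the one at 60, which is the whole argument for capping the wait.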
Standard exponential backoff doesn’t know any of this. It just keeps doubling:
wait = base * 2^attempt + jitter
With a base of 10 seconds, attempt 4 (counting from zero, as the formula does) lands you at ~160 seconds of waiting before jitter. But a fully drained bucket finished refilling about 100 seconds before that wait ends. You’re backing off from a problem that no longer exists. Every second past 60 is pure waste.
So cap it at the window’s worth of refill — 60 seconds — and stop there. Not 2 minutes, not 5.
const RATE_LIMIT_BASE_BACKOFF_MS = 10_000;
const RATE_LIMIT_MAX_BACKOFF_MS = 60_000; // ~one bucket-refill's worth

function calculateBackoffMs(attempt: number): number {
  const exponential = RATE_LIMIT_BASE_BACKOFF_MS * 2 ** (attempt - 1);
  const capped = Math.min(exponential, RATE_LIMIT_MAX_BACKOFF_MS);
  const jitterRange = Math.max(0, capped - RATE_LIMIT_BASE_BACKOFF_MS);
  return RATE_LIMIT_BASE_BACKOFF_MS + Math.floor(Math.random() * (jitterRange + 1));
}
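For a feel of what the cap changes, this sketch lists the pre-jitter delay for the first six attempts, with and without the 60-second ceiling. It uses the same base and cap as the constants above and numbers attempts from 1, as `calculateBackoffMs` does. The helper names `uncappedMs` and `cappedMs` are mine, for illustration only.

```typescript
const BASE_MS = 10_000; // same value as RATE_LIMIT_BASE_BACKOFF_MS
const MAX_MS = 60_000;  // same value as RATE_LIMIT_MAX_BACKOFF_MS

// Plain doubling with no ceiling (attempts numbered from 1).
function uncappedMs(attempt: number): number {
  return BASE_MS * 2 ** (attempt - 1);
}

// The same doubling, clamped to one bucket-refill's worth.
function cappedMs(attempt: number): number {
  return Math.min(uncappedMs(attempt), MAX_MS);
}

const attempts = [1, 2, 3, 4, 5, 6];
const schedule = attempts.map((attempt) => ({
  attempt,
  uncappedSeconds: uncappedMs(attempt) / 1000,
  cappedSeconds: cappedMs(attempt) / 1000,
}));
console.table(schedule);
```

Uncapped, the delays run 10, 20, 40, 80, 160, 320 seconds. Capped, everything from the fourth attempt onward sits at 60.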
Once that clicked, the rest of the architecture kind of fell out on its own.
Layer 1: tell everyone about the 429, immediately (Redis)
Here’s where it gets fun with multiple workers. Per-worker backoff is useless on its own when you’ve got a whole pool hammering the same endpoint. One worker gets a 429 and starts politely backing off — meanwhile the other nine are still firing away, blissfully ignorant. They each catch their own 429, each start their own little timer, and the whole pool spends the next few minutes flailing before it settles. It’s like everyone in the pub trying to get to the bar at once and nobody noticing it’s already three-deep.
The fix is to make a 429 everybody’s problem the instant it happens. When any worker hits a rate limit, it writes a pause state to Redis. And every worker checks that state before every single AI call.
// When we get a 429 (waitMs is the capped backoff for this attempt)
const now = Date.now();
await redis.set(
  `rate_limit:pause:${modelKey}`,
  JSON.stringify({ pauseUntil: now + waitMs, lastThrottleTime: now }),
  "EX",
  300 // 5-minute TTL; auto-expires if workers die
);

// Before every AI call
const pauseState = await redis.get(`rate_limit:pause:${modelKey}`);
const parsedState = pauseState ? JSON.parse(pauseState) : null;
if (parsedState && parsedState.pauseUntil > Date.now()) {
  // jitter is a random per-worker offset; see Layer 2
  await sleep(parsedState.pauseUntil - Date.now() + jitter);
}
One detail that matters more than it looks: namespace the key by modelKey (provider:model). A 429 on provider-a:model-x has no business pausing work on provider-b:model-y — those are completely separate buckets. Pause everything every time one model throttles and you’ve just tanked your throughput for no reason.
And that 5-minute TTL is the safety net. If a worker keels over mid-flight with a pause state still written, the TTL makes sure the rest of the pool isn’t held hostage forever by a ghost.
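The pause gate boils down to two small pieces of logic: building the state a throttled worker writes, and working out how long any worker should still wait given whatever it reads back. Here is a minimal sketch of both as pure functions, with a `Map` standing in for Redis so it runs on its own. The helper names (`pauseKey`, `buildPauseState`, `remainingPauseMs`) and the choice to treat malformed state as "no pause" are my assumptions, not something the original system is known to do.

```typescript
// Shape of the value stored under rate_limit:pause:<modelKey>.
interface PauseState {
  pauseUntil: number;       // epoch ms when calls may resume
  lastThrottleTime: number; // epoch ms of the 429 that set the pause
}

// Namespaced per provider:model so one throttle never pauses another model.
function pauseKey(modelKey: string): string {
  return `rate_limit:pause:${modelKey}`;
}

// Serialize the state a worker would write after a 429.
function buildPauseState(nowMs: number, waitMs: number): string {
  const state: PauseState = { pauseUntil: nowMs + waitMs, lastThrottleTime: nowMs };
  return JSON.stringify(state);
}

// How long a worker should still wait, given the raw value read from the store.
// Missing, malformed, or expired state all mean "no pause".
function remainingPauseMs(raw: string | null, nowMs: number): number {
  if (raw === null) return 0;
  try {
    const parsed = JSON.parse(raw) as Partial<PauseState>;
    if (typeof parsed.pauseUntil !== "number") return 0;
    return Math.max(0, parsed.pauseUntil - nowMs);
  } catch {
    return 0;
  }
}

// Stand-in for Redis: one model was throttled at t=1s with a 20s pause.
const store = new Map<string, string>();
store.set(pauseKey("provider-a:model-x"), buildPauseState(1_000, 20_000));

// Five seconds later, two workers check the gate for different models.
const pausedModel = remainingPauseMs(store.get(pauseKey("provider-a:model-x")) ?? null, 6_000);
const otherModel = remainingPauseMs(store.get(pauseKey("provider-b:model-y")) ?? null, 6_000);
console.log(pausedModel, otherModel);
```

The worker on the throttled model still has 15 seconds to wait, while the worker on the other model sees no pause at all, which is the point of namespacing the key.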
Layer 2: jitter on the way back, or you’ll do it all again
Right, so here’s the sneaky one that survives even after you’ve added the cross-worker pause. The pause expires at the exact same instant for everybody. So all the workers wake up together, all check the gate together, all see it’s clear together, and all fire together — perfectly recreating the original spike that caused the 429 in the first place. You’ve built a thundering herd and handed it a starting pistol.
The fix is to add a bit of random jitter on each worker’s wake-up:
const remainingPause = parsedState.pauseUntil - Date.now();
const jitter = Math.floor(Math.random() * 30_000); // 0-30s
await sleep(remainingPause + jitter);
Now the wake-ups smear out across a 30-second window. Worker A pops up at T+0 and fires one request, worker B at T+7s, and so on down the line. By the time the whole pool is back in business, that initial burst has been staggered into something the provider can actually stomach. Bog-standard thundering-herd mitigation — but so easy to forget, because you add the cross-worker pause, feel clever, and assume the problem’s solved.
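If you want to see the smear for yourself, here is a small sketch that computes the wake-up delay for ten workers that all observe the same pause at the same moment. The function name `wakeDelayMs` and the injectable random source are assumptions I’ve added so the behaviour can be checked deterministically; the 30-second jitter range matches the snippet above.

```typescript
// Delay before a worker resumes: the remaining pause plus a random offset.
// The random source is injectable so tests can pin it to a fixed value.
function wakeDelayMs(
  pauseUntilMs: number,
  nowMs: number,
  maxJitterMs: number = 30_000,
  random: () => number = Math.random,
): number {
  const remaining = Math.max(0, pauseUntilMs - nowMs);
  const jitter = Math.floor(random() * maxJitterMs);
  return remaining + jitter;
}

// Ten hypothetical workers, all seeing a pause that ends 5 seconds from now.
const pauseUntil = 50_000;
const checkedAt = 45_000;
const delays = Array.from({ length: 10 }, () => wakeDelayMs(pauseUntil, checkedAt));

// Each delay lands somewhere between 5s and just under 35s.
console.log(delays.map((ms) => (ms / 1000).toFixed(1) + "s"));
```

Without the jitter term every entry would be exactly 5.0s, and all ten workers would fire in the same instant.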
Layer 3: kill the client library’s own retry (this was the villain)
This was the bug. The one that made everything worse the moment we added rate limiting.
Almost every LLM client library has retry logic baked in. By default, when a call gets a 429, the client quietly retries on your behalf with its own exponential backoff — often two or three extra attempts you might not even realise are happening. Now stack that on top of your Temporal activity retries and your own withRateLimitRetry wrapper, and congratulations: you’ve got three independent retry loops running at once, none of them aware of the others.
The interaction is genuinely nasty. Temporal fires an activity. Inside it, the client library eats a 429 and starts its own retry, sleeping a while. Meanwhile your withRateLimitRetry wrapper has also clocked the 429 and started its own sleep. When the client’s retry eventually fires, it goes back through withRateLimitRetry, which records another 429 and sets another pause. It cascades. And the worst part is the logs make it look like the backoff is working a treat, when really it’s just multiplying your wait times behind your back.
So turn the client’s built-in retry off and let a single layer own the whole thing. Most libraries expose this as a max-retries option — set it to zero:
// Whatever your LLM client is, find its retry knob and switch it off,
// so the outer layer (Temporal + withRateLimitRetry) is the only thing retrying.
const result = await llm.generate({
model,
prompt,
maxRetries: 0,
});
One retry layer that actually understands how the quota refills will always beat three that are each guessing in the dark.
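For a rough idea of what a single retry owner could look like, here is a hypothetical sketch. It is not the actual `withRateLimitRetry` wrapper, which isn’t shown in this post; the name `retryOnRateLimit`, the assumption that a throttled call throws an error with a numeric `status` of 429, and the injected `sleep` are all mine. The fake client and the recording sleep exist only so the example runs instantly.

```typescript
// Hypothetical single retry owner. The error shape (a numeric `status`
// field equal to 429) is an assumption for illustration.
type Sleep = (ms: number) => Promise<void>;

function isRateLimitError(err: unknown): boolean {
  return typeof err === "object" && err !== null && (err as { status?: unknown }).status === 429;
}

async function retryOnRateLimit<T>(
  call: () => Promise<T>, // must not retry internally (client retries set to zero)
  backoffMs: (attempt: number) => number,
  sleep: Sleep,
  maxAttempts: number = 5,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Non-429 errors, or running out of attempts, go straight to the caller
      // (for example, the workflow engine's own activity retry).
      if (!isRateLimitError(err) || attempt >= maxAttempts) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}

// Fake client that throttles twice, then succeeds.
let calls = 0;
const fakeCall = async (): Promise<string> => {
  calls += 1;
  if (calls <= 2) throw Object.assign(new Error("rate limited"), { status: 429 });
  return "ok";
};

const sleeps: number[] = [];
const recordSleep: Sleep = async (ms) => { sleeps.push(ms); }; // records instead of waiting
const cappedBackoff = (attempt: number): number => Math.min(10_000 * 2 ** (attempt - 1), 60_000);

const outcome = retryOnRateLimit(fakeCall, cappedBackoff, recordSleep);
outcome.then((value) => console.log(value, calls, sleeps));
```

Because the inner call never retries by itself, every 429 passes through exactly one place, so the number of sleeps always equals the number of throttles you actually hit.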
Going proactive: token accounting with a Redis ZSET
Everything above is reactive — it stops the bleeding after a 429. But by the time you’ve got a 429, you’ve already paid for it: a failed network call, wasted tokens, and a provider that’s now mildly cross with you. The proactive layer’s job is to stop the 429 ever happening.
The trick is a sorted set in Redis, keyed by provider:model. Every completed call drops its token count in as a member, scored by the timestamp:
const key = `token_usage:${modelKey}`;
await redis.zadd(key, Date.now(), `${reservationId}:${tokens}`);
// Evict entries older than 60 seconds
await redis.zremrangebyscore(key, 0, Date.now() - 60_000);
Before each call, you read the set, sum the tokens still in the last 60 seconds, and check against your TPM ceiling. Getting close to the edge? Wait for the oldest entries to age out instead of lobbing in a call that’s just going to 429. (This proactive accounting earns its keep doubly with Anthropic, by the way, because the bucket also enforces short bursts — a “60 requests per minute” limit can show up as “one request per second” — and it doesn’t love sharp usage ramps. Smoothing your own flow keeps you on its good side.)
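The "sum the window, then wait for the oldest entries to age out" step can be sketched without Redis at all. Below, a plain array stands in for the sorted set. The names `UsageEntry`, `tokensInWindow`, and `waitForHeadroomMs`, along with the sample numbers, are assumptions for illustration rather than the production code.

```typescript
// In-memory stand-in for the ZSET: one entry per call, with its timestamp.
interface UsageEntry {
  timestampMs: number;
  tokens: number;
}

const WINDOW_MS = 60_000;

// Sum of tokens still inside the trailing window.
function tokensInWindow(entries: UsageEntry[], nowMs: number): number {
  return entries
    .filter((entry) => entry.timestampMs > nowMs - WINDOW_MS)
    .reduce((sum, entry) => sum + entry.tokens, 0);
}

// How long to wait until `needed` tokens fit under `tpmLimit`, letting the
// oldest entries age out one by one. Returns Infinity if the request can never fit.
function waitForHeadroomMs(
  entries: UsageEntry[],
  nowMs: number,
  needed: number,
  tpmLimit: number,
): number {
  if (needed > tpmLimit) return Infinity;
  const live = entries
    .filter((entry) => entry.timestampMs > nowMs - WINDOW_MS)
    .sort((a, b) => a.timestampMs - b.timestampMs);
  let used = live.reduce((sum, entry) => sum + entry.tokens, 0);
  let waitMs = 0;
  for (const entry of live) {
    if (used + needed <= tpmLimit) break;
    used -= entry.tokens;
    waitMs = entry.timestampMs + WINDOW_MS - nowMs; // moment this entry expires
  }
  return waitMs;
}

const usage: UsageEntry[] = [
  { timestampMs: 5_000, tokens: 40_000 },
  { timestampMs: 20_000, tokens: 30_000 },
  { timestampMs: 50_000, tokens: 25_000 },
];

const usedNow = tokensInWindow(usage, 55_000);
const waitMs = waitForHeadroomMs(usage, 55_000, 20_000, 100_000);
console.log(usedNow, waitMs);
```

With 95,000 tokens in the window and a 100,000 ceiling, a 20,000-token call doesn’t fit yet. It does fit once the oldest 40,000-token entry ages out, 10 seconds later, so the sketch returns a 10-second wait instead of letting the call go out and collect a 429.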
Now, the detail that’ll bite you if you skip it: reserve upfront, commit afterward. When a bunch of workers are calling concurrently, they each read the current usage and each see lovely headroom. If they all fire at once, the real usage is the sum of everyone’s reservations — not the number any one of them read. So the read-then-write has to be atomic. A tiny Lua script does it in a single round-trip (and Redis runs Lua atomically — nothing else gets to sneak in mid-script):
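-- Atomic: inspect the bucket, then reserve in one shot or reject.
-- Sketch only: sumBucketEntries and the bare names (windowStart, reserveTokens,
-- tpmLimit, now, reservationId, retryAfterSeconds, deficit) are elided here.
local currentTpm = sumBucketEntries(KEYS[1], windowStart)
if currentTpm + reserveTokens > tpmLimit then
  return {0, 'tpm', retryAfterSeconds, deficit}
end
redis.call('ZADD', KEYS[1], now, reservationId .. ':' .. reserveTokens)
return {1}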
After the call finishes, commit the actual usage to replace the reservation. Models almost always generate fewer tokens than max_tokens, so committing the real number hands the unused budget back to the shared pool. Be greedy on the reservation, honest on the commit.
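Here is the reserve-then-commit cycle as a small in-memory sketch. A `Map` stands in for the sorted set, and since it runs in a single thread, the atomicity that the Lua script provides in Redis is simply assumed here. The class name `TokenLedger`, its methods, and the token numbers are all hypothetical.

```typescript
// In-memory sketch of reserve-then-commit. In a real deployment the reserve
// step would run atomically inside Redis; here a Map stands in for it.
interface Reservation {
  timestampMs: number;
  tokens: number;
}

class TokenLedger {
  private readonly entries = new Map<string, Reservation>();
  private readonly tpmLimit: number;
  private readonly windowMs: number;

  constructor(tpmLimit: number, windowMs: number = 60_000) {
    this.tpmLimit = tpmLimit;
    this.windowMs = windowMs;
  }

  // Tokens currently counted against the window (reserved or committed).
  used(nowMs: number): number {
    let total = 0;
    this.entries.forEach((entry, id) => {
      if (entry.timestampMs <= nowMs - this.windowMs) this.entries.delete(id);
      else total += entry.tokens;
    });
    return total;
  }

  // Reserve the worst case before calling; reject if it would exceed the limit.
  reserve(id: string, tokens: number, nowMs: number): boolean {
    if (this.used(nowMs) + tokens > this.tpmLimit) return false;
    this.entries.set(id, { timestampMs: nowMs, tokens });
    return true;
  }

  // Replace the reservation with what the call actually consumed.
  commit(id: string, actualTokens: number): void {
    const entry = this.entries.get(id);
    if (entry) entry.tokens = actualTokens;
  }
}

const ledger = new TokenLedger(100_000);
const first = ledger.reserve("req-1", 60_000, 0);   // fits under the limit
const second = ledger.reserve("req-2", 60_000, 10); // rejected: would total 120k
ledger.commit("req-1", 35_000);                     // the call used less than reserved
const third = ledger.reserve("req-2", 60_000, 20);  // fits now: 35k + 60k = 95k
console.log(first, second, third, ledger.used(30));
```

The second reservation only succeeds after the first call commits its real usage, which shows the unused budget flowing back to the shared pool.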
This is the layer that keeps the system from ever really needing the reactive pause under normal load. The 429 handler becomes the fire alarm, not the day-to-day traffic cop.
The whole picture
Three layers, working together:
- Redis ZSET sliding window: proactive token accounting with atomic reservation. Workers see each other’s consumption before they fire. Stops most 429s ever happening.
- Cross-worker pause state: the reactive 429 handler that writes to Redis the instant it’s hit. The whole pool backs off together, and the backoff caps at one bucket-refill’s worth (~60s) instead of some open-ended exponential fantasy.
- One retry layer: SDK retry switched off. Temporal’s activity retry plus the withRateLimitRetry wrapper own the semantics. No more three independent loops tripping over each other.
End result: 429s went from a regular feature of every big batch run to a rare event the system shrugs off gracefully when it does show up.
And the general lesson travels well beyond Temporal — BullMQ, Celery, your own hand-rolled queue, doesn’t matter. The key fact is that TPM limits are time-windowed, and time-windowed limits have a natural ceiling on how long backing off is even worth it. Build your retry ceiling around how long the bucket takes to refill, and make that state shared across all your workers, and you’ll get further than any amount of lovingly hand-tuned exponential backoff coefficients ever will.