# Retry, Backoff, and Timeout Patterns
## The Retry Loop That Took Down a Server
A frontend team added retry logic to their API client:
```javascript
async function fetchWithRetry(url, retries = 5) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fetch(url);
    } catch {
      // retry immediately
    }
  }
  throw new Error('All retries failed');
}
```
The API server went down for 10 seconds. Every client immediately started hammering it with 5 retries, zero delay between attempts. The server came back up, got obliterated by a thundering herd of retry storms, and went down again. Repeat. For 45 minutes.
The fix wasn't removing retries — it was adding backoff and jitter. Two concepts that are the difference between graceful recovery and cascading failure.
## Exponential Backoff
Exponential backoff is like knocking on someone's door. You knock, wait a second, knock again, wait two seconds, knock again, wait four seconds. Each time, you wait longer — giving them more time to answer. If you just hammered the door nonstop, you'd be annoying at best and break it down at worst.
The formula is simple: wait baseDelay * 2^attempt milliseconds before each retry.
```javascript
function getBackoffDelay(attempt, baseDelay = 1000) {
  return baseDelay * Math.pow(2, attempt);
}

// Attempt 0: 1000ms (1s)
// Attempt 1: 2000ms (2s)
// Attempt 2: 4000ms (4s)
// Attempt 3: 8000ms (8s)
// Attempt 4: 16000ms (16s)
```
## Why You Need Jitter
Pure exponential backoff has a problem: if 1000 clients all start retrying at the same time, they'll all wait the same durations and retry at the same time. You've just turned one spike into a series of synchronized spikes.
Jitter adds randomness so retries spread out:
```javascript
function getBackoffWithJitter(attempt, baseDelay = 1000) {
  const exponentialDelay = baseDelay * Math.pow(2, attempt);
  const jitter = Math.random() * exponentialDelay;
  return jitter; // "full jitter" — random between 0 and exponentialDelay
}
```
There are three common jitter strategies:
```javascript
// Full jitter: random(0, exponentialDelay) — best for thundering herd
const fullJitter = Math.random() * baseDelay * Math.pow(2, attempt);

// Equal jitter: exponentialDelay/2 + random(0, exponentialDelay/2)
const half = (baseDelay * Math.pow(2, attempt)) / 2;
const equalJitter = half + Math.random() * half;

// Decorrelated jitter: random(baseDelay, previousDelay * 3)
const decorrelated = Math.min(
  maxDelay,
  baseDelay + Math.random() * (previousDelay * 3 - baseDelay)
);
```
AWS recommends full jitter for most use cases. It provides the widest spread and prevents synchronized retry waves.
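The effect is easy to see in a small simulation (the helper below is illustrative, not a library API): 1000 clients on their third retry either all land on exactly the same instant, or spread across the whole backoff window.

```javascript
// Sketch: compare the retry-time spread for 1000 clients on attempt 3,
// with pure exponential backoff vs. full jitter.
function simulateRetryTimes(clients, attempt, baseDelay, useJitter) {
  const times = [];
  for (let i = 0; i < clients; i++) {
    const exponentialDelay = baseDelay * Math.pow(2, attempt);
    times.push(useJitter ? Math.random() * exponentialDelay : exponentialDelay);
  }
  return times;
}

const noJitter = simulateRetryTimes(1000, 3, 1000, false);
const withJitter = simulateRetryTimes(1000, 3, 1000, true);

// Without jitter, every client retries at exactly 8000ms — one spike:
const distinctNoJitter = new Set(noJitter).size; // 1 distinct value
// With full jitter, retries spread across the 0–8000ms window:
const spread = Math.max(...withJitter) - Math.min(...withJitter);
```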
## Production Retry Implementation
Here's a complete retry utility with all the pieces:
```javascript
async function retry(fn, options = {}) {
  const {
    maxAttempts = 3,
    baseDelay = 1000,
    maxDelay = 30000,
    shouldRetry = () => true,
    onRetry = () => {},
    signal,
  } = options;

  let lastError;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn(attempt, signal);
    } catch (err) {
      lastError = err;
      if (signal?.aborted) throw err;

      const isLastAttempt = attempt === maxAttempts - 1;
      if (isLastAttempt || !shouldRetry(err, attempt)) {
        throw err;
      }

      const delay = Math.min(
        maxDelay,
        Math.random() * baseDelay * Math.pow(2, attempt)
      );
      onRetry(err, attempt, delay);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```
Usage:
```javascript
const data = await retry(
  (attempt, signal) => fetch('/api/data', { signal }).then(r => r.json()),
  {
    maxAttempts: 3,
    baseDelay: 500,
    shouldRetry: (err) => {
      if (err.name === 'AbortError') return false;
      if (err.status === 404) return false;
      if (err.status === 429) return true;
      return err.status >= 500;
    },
    onRetry: (err, attempt, delay) => {
      console.log(`Retry ${attempt + 1} in ${delay}ms: ${err.message}`);
    },
  }
);
```
1. Only retry transient failures (5xx, network errors, timeouts) — never retry 4xx client errors (except 429)
2. Always add jitter to prevent thundering herd
3. Set a maximum delay cap to prevent absurdly long waits
4. Support cancellation via AbortSignal so retries can be stopped
5. Log each retry for observability
## AbortController and Timeouts
AbortController is the standard cancellation mechanism for async operations in JavaScript. Here's the thing most developers miss: it's not just for fetch.
### AbortSignal.timeout()
The cleanest way to add a timeout to fetch:
```javascript
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000),
});
```
When the timeout fires, the fetch promise rejects with a TimeoutError (a DOMException with name: 'TimeoutError'). No manual setTimeout + AbortController wiring needed.
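Because the two failure modes carry different error names, retry logic can treat deadlines and manual cancellation differently. A standalone sketch (using a hand-rolled abortable delay instead of fetch so it runs without a network):

```javascript
// Sketch: AbortSignal.timeout() aborts with a DOMException named
// 'TimeoutError', while a manual controller.abort() yields 'AbortError'.
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal.aborted) return reject(signal.reason);
    const timer = setTimeout(resolve, ms);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(signal.reason);
    }, { once: true });
  });
}

async function classify(ms, signal) {
  try {
    await abortableDelay(ms, signal);
    return 'completed';
  } catch (err) {
    return err.name; // 'TimeoutError' or 'AbortError'
  }
}
```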
### AbortSignal.any()
What if you need both a timeout AND manual cancellation? AbortSignal.any() combines multiple signals:
```javascript
const controller = new AbortController();

const response = await fetch('/api/data', {
  signal: AbortSignal.any([
    controller.signal,          // manual cancellation
    AbortSignal.timeout(5000),  // automatic timeout
  ]),
});

// User navigates away:
controller.abort();
```
The fetch aborts if EITHER signal fires — whichever comes first.
### Making Any Async Operation Abortable
You can make your own functions respect AbortSignal:
```javascript
function delay(ms, { signal } = {}) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(signal.reason);
      return;
    }
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(signal.reason);
    }, { once: true });
  });
}

async function pollForResult(taskId, { signal } = {}) {
  while (true) {
    signal?.throwIfAborted();
    const result = await fetch(`/api/tasks/${taskId}`, { signal });
    const data = await result.json();
    if (data.status === 'complete') return data;
    await delay(2000, { signal });
  }
}
```
Always check signal.aborted at the START of your function, not just on the event listener. If the signal was already aborted before your function was called, the abort event will never fire — it already happened. The throwIfAborted() method is the cleanest way to check.
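A quick demonstration of the already-aborted case. The first function below is deliberately broken: it only registers a listener, so a pre-aborted signal leaves it hanging forever. The second checks up front with throwIfAborted().

```javascript
// BROKEN for pre-aborted signals: the 'abort' event already fired,
// so this promise never settles.
function waitForAbortEventOnly(signal) {
  return new Promise((_, reject) => {
    signal.addEventListener('abort', () => reject(signal.reason), { once: true });
  });
}

// Correct: check up front. throwIfAborted() throws signal.reason
// (an 'AbortError' DOMException by default) if already aborted.
function safeCheck(signal) {
  signal.throwIfAborted();
  return 'not aborted';
}
```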
## The Circuit Breaker Pattern
Retries handle transient failures. But what about persistent failures? If a service is down for 5 minutes, retrying every request wastes resources and adds latency for users who are all going to get errors anyway.
The circuit breaker pattern borrows from electrical engineering: when too many failures occur, the circuit "opens" and immediately rejects requests without even trying.
```javascript
class CircuitBreaker {
  #state = 'closed';
  #failureCount = 0;
  #lastFailureTime = 0;
  #options;

  constructor(options = {}) {
    this.#options = {
      failureThreshold: 5,
      cooldownMs: 30000,
      ...options,
    };
  }

  async execute(fn) {
    if (this.#state === 'open') {
      const elapsed = Date.now() - this.#lastFailureTime;
      if (elapsed < this.#options.cooldownMs) {
        throw new Error('Circuit breaker is open');
      }
      this.#state = 'half-open';
    }

    try {
      const result = await fn();
      this.#onSuccess();
      return result;
    } catch (err) {
      this.#onFailure();
      throw err;
    }
  }

  #onSuccess() {
    this.#failureCount = 0;
    this.#state = 'closed';
  }

  #onFailure() {
    this.#failureCount++;
    this.#lastFailureTime = Date.now();
    if (this.#failureCount >= this.#options.failureThreshold) {
      this.#state = 'open';
    }
  }
}
```
Usage:
```javascript
const apiBreaker = new CircuitBreaker({
  failureThreshold: 3,
  cooldownMs: 10000,
});

async function fetchData() {
  return apiBreaker.execute(() =>
    fetch('/api/data').then(r => r.json())
  );
}
```
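To see the lifecycle end to end, here is the same logic re-declared as a standalone factory function (so the snippet runs on its own), driven through closed → open → half-open → closed:

```javascript
// Minimal standalone breaker with the same transitions as the class above,
// plus a readable `state` getter for inspection.
function createBreaker({ failureThreshold = 3, cooldownMs = 100 } = {}) {
  let state = 'closed';
  let failures = 0;
  let lastFailure = 0;

  return {
    get state() { return state; },
    async execute(fn) {
      if (state === 'open') {
        if (Date.now() - lastFailure < cooldownMs) {
          // Still cooling down: reject without calling fn at all.
          throw new Error('Circuit breaker is open');
        }
        state = 'half-open'; // allow a trial request through
      }
      try {
        const result = await fn();
        failures = 0;
        state = 'closed';
        return result;
      } catch (err) {
        failures++;
        lastFailure = Date.now();
        if (failures >= failureThreshold) state = 'open';
        throw err;
      }
    },
  };
}
```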
## Idempotency: The Prerequisite for Safe Retries
Here's a question that separates junior from senior engineers: is this operation safe to retry?
```javascript
await fetch('/api/payments', {
  method: 'POST',
  body: JSON.stringify({ amount: 100, to: 'user-42' }),
});
```
If this request succeeds but the response is lost (network hiccup), should you retry? If the server isn't idempotent, you just charged the user twice.
The solution: idempotency keys. Send a unique key with each logical operation. The server checks if it's already processed that key:
```javascript
const idempotencyKey = crypto.randomUUID();

await retry(
  () => fetch('/api/payments', {
    method: 'POST',
    headers: { 'Idempotency-Key': idempotencyKey },
    body: JSON.stringify({ amount: 100, to: 'user-42' }),
  }),
  { maxAttempts: 3 }
);
```
The key is generated ONCE, outside the retry loop. Every retry sends the same key. The server sees the duplicate key and returns the original response instead of processing again.
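For completeness, the server side can be sketched with an in-memory store (a real implementation would persist keys with a TTL in a database; the names here are illustrative, not a real framework API):

```javascript
// Sketch of server-side idempotency-key handling.
const processed = new Map(); // idempotencyKey -> stored response

function handlePayment(idempotencyKey, processFn) {
  if (processed.has(idempotencyKey)) {
    // Duplicate retry: replay the original response, do not charge again.
    return processed.get(idempotencyKey);
  }
  const response = processFn(); // e.g. actually charge the card
  processed.set(idempotencyKey, response);
  return response;
}
```

The first call with a given key performs the charge; every retry with that key gets the stored response back.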
| What developers do | Why it fails | What they should do |
|---|---|---|
| Retrying immediately without delay | Immediate retries create thundering herd effects that can overwhelm recovering servers | Use exponential backoff with jitter |
| Retrying all errors, including 400 Bad Request | Client errors (4xx) won't succeed on retry — the request itself is wrong, so retrying wastes time and resources | Only retry transient errors (5xx, network, timeout) |
| Generating a new idempotency key on each retry | The whole point of idempotency keys is to let the server detect duplicate requests; new keys on each retry defeat this | Generate the key once, before the retry loop |
| Wiring up setTimeout + AbortController by hand | AbortSignal.timeout() handles cleanup automatically and throws a distinguishable TimeoutError instead of a generic AbortError | Use AbortSignal.timeout() for timeouts |
## Combining Everything: Production API Client
Here's how these patterns compose into a robust API client:
```javascript
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  cooldownMs: 30000,
});

async function apiRequest(url, options = {}) {
  const {
    method = 'GET',
    body,
    timeout = 10000,
    maxRetries = 3,
    idempotencyKey,
  } = options;

  const headers = { 'Content-Type': 'application/json' };
  if (idempotencyKey) {
    headers['Idempotency-Key'] = idempotencyKey;
  }

  return retry(
    (attempt, signal) =>
      circuitBreaker.execute(() =>
        fetch(url, {
          method,
          headers,
          body: body ? JSON.stringify(body) : undefined,
          signal: AbortSignal.any([
            signal,
            AbortSignal.timeout(timeout),
          ].filter(Boolean)),
        }).then(async (res) => {
          if (!res.ok) {
            const err = new Error(`HTTP ${res.status}`);
            err.status = res.status;
            throw err;
          }
          return res.json();
        })
      ),
    {
      maxAttempts: maxRetries,
      shouldRetry: (err) => {
        if (err.name === 'TimeoutError') return true;
        if (err.message === 'Circuit breaker is open') return false;
        return err.status >= 500 || err.status === 429;
      },
    }
  );
}
```
Design a retry strategy for a payment processing system. The system processes credit card charges. You need to handle: network timeouts, server errors (5xx), rate limiting (429 with Retry-After header), and ensure no double charges. What patterns would you combine, and how would you handle the Retry-After header?
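One building block for that design, sketched here as a hint: Retry-After may carry either delta-seconds or an HTTP-date (per RFC 9110), so a parser has to handle both before the backoff logic can honor it.

```javascript
// Sketch: convert a Retry-After header value into a delay in milliseconds.
// Returns null if the header is absent or unparseable (fall back to
// exponential backoff with jitter in that case).
function parseRetryAfter(headerValue, now = Date.now()) {
  if (!headerValue) return null;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(headerValue); // HTTP-date form
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now);
}
```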