# Retry, Backoff, and Timeout Patterns
## The Retry Loop That Took Down a Server
A frontend team added retry logic to their API client:
```javascript
async function fetchWithRetry(url, retries = 5) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fetch(url);
    } catch {
      // retry immediately
    }
  }
  throw new Error('All retries failed');
}
```
The API server went down for 10 seconds. Every client immediately started hammering it with 5 retries, zero delay between attempts. The server came back up, got obliterated by a thundering herd of retry storms, and went down again. Repeat. For 45 minutes.
The fix wasn't removing retries — it was adding backoff and jitter. Two concepts that are the difference between graceful recovery and cascading failure.
## Exponential Backoff
Exponential backoff is like knocking on someone's door. You knock, wait a second, knock again, wait two seconds, knock again, wait four seconds. Each time, you wait longer — giving them more time to answer. If you just hammered the door nonstop, you'd be annoying at best and break it down at worst.
The formula is simple: wait baseDelay * 2^attempt milliseconds before each retry.
```javascript
function getBackoffDelay(attempt, baseDelay = 1000) {
  return baseDelay * Math.pow(2, attempt);
}

// Attempt 0: 1000ms (1s)
// Attempt 1: 2000ms (2s)
// Attempt 2: 4000ms (4s)
// Attempt 3: 8000ms (8s)
// Attempt 4: 16000ms (16s)
```
## Why You Need Jitter
Pure exponential backoff has a problem: if 1000 clients all start retrying at the same time, they'll all wait the same durations and retry at the same time. You've just turned one spike into a series of synchronized spikes.
Jitter adds randomness so retries spread out:
```javascript
function getBackoffWithJitter(attempt, baseDelay = 1000) {
  const exponentialDelay = baseDelay * Math.pow(2, attempt);
  const jitter = Math.random() * exponentialDelay;
  return jitter; // "full jitter" — random between 0 and exponentialDelay
}
```
There are three common jitter strategies:
```javascript
// Full jitter: random(0, exponentialDelay) — best for thundering herd
const fullJitter = Math.random() * baseDelay * Math.pow(2, attempt);

// Equal jitter: exponentialDelay/2 + random(0, exponentialDelay/2)
const half = (baseDelay * Math.pow(2, attempt)) / 2;
const equalJitter = half + Math.random() * half;

// Decorrelated jitter: random(baseDelay, previousDelay * 3)
const decorrelated = Math.min(
  maxDelay,
  baseDelay + Math.random() * (previousDelay * 3 - baseDelay)
);
```
AWS recommends full jitter for most use cases. It provides the widest spread and prevents synchronized retry waves.
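The effect is easy to see in a small simulation (the helper below is illustrative, not a library API): 1000 clients on their third retry either all land on exactly the same instant, or spread across the whole backoff window.

```javascript
// Sketch: compare the retry-time spread for 1000 clients on attempt 3,
// with pure exponential backoff vs. full jitter.
function simulateRetryTimes(clients, attempt, baseDelay, useJitter) {
  const times = [];
  for (let i = 0; i < clients; i++) {
    const exponentialDelay = baseDelay * Math.pow(2, attempt);
    times.push(useJitter ? Math.random() * exponentialDelay : exponentialDelay);
  }
  return times;
}

const noJitter = simulateRetryTimes(1000, 3, 1000, false);
const withJitter = simulateRetryTimes(1000, 3, 1000, true);

// Without jitter, every client retries at exactly 8000ms — one spike:
const distinctNoJitter = new Set(noJitter).size; // 1 distinct value
// With full jitter, retries spread across the 0–8000ms window:
const spread = Math.max(...withJitter) - Math.min(...withJitter);
```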
## Production Retry Implementation
Here's a complete retry utility with all the pieces:
```javascript
async function retry(fn, options = {}) {
  const {
    maxAttempts = 3,
    baseDelay = 1000,
    maxDelay = 30000,
    shouldRetry = () => true,
    onRetry = () => {},
    signal,
  } = options;

  let lastError;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn(attempt, signal);
    } catch (err) {
      lastError = err;
      if (signal?.aborted) throw err;

      const isLastAttempt = attempt === maxAttempts - 1;
      if (isLastAttempt || !shouldRetry(err, attempt)) {
        throw err;
      }

      const delay = Math.min(
        maxDelay,
        Math.random() * baseDelay * Math.pow(2, attempt)
      );
      onRetry(err, attempt, delay);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}
```
Usage:
```javascript
const data = await retry(
  (attempt, signal) => fetch('/api/data', { signal }).then(r => r.json()),
  {
    maxAttempts: 3,
    baseDelay: 500,
    shouldRetry: (err) => {
      if (err.name === 'AbortError') return false;
      if (err.status === 404) return false;
      if (err.status === 429) return true;
      return err.status >= 500;
    },
    onRetry: (err, attempt, delay) => {
      console.log(`Retry ${attempt + 1} in ${delay}ms: ${err.message}`);
    },
  }
);
```
1. Only retry transient failures (5xx, network errors, timeouts) — never retry 4xx client errors (except 429)
2. Always add jitter to prevent thundering herd
3. Set a maximum delay cap to prevent absurdly long waits
4. Support cancellation via AbortSignal so retries can be stopped
5. Log each retry for observability
## AbortController and Timeouts
AbortController is the standard cancellation mechanism for async operations in JavaScript. Here's the thing most developers miss: it's not just for fetch.
### AbortSignal.timeout()
The cleanest way to add a timeout to fetch:
```javascript
const response = await fetch('/api/data', {
  signal: AbortSignal.timeout(5000),
});
```
When the timeout fires, the fetch promise rejects with a TimeoutError (a DOMException with name: 'TimeoutError'). No manual setTimeout + AbortController wiring needed.
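Because the two failure modes carry different error names, retry logic can treat deadlines and manual cancellation differently. A standalone sketch (using a hand-rolled abortable delay instead of fetch so it runs without a network):

```javascript
// Sketch: AbortSignal.timeout() aborts with a DOMException named
// 'TimeoutError', while a manual controller.abort() yields 'AbortError'.
function abortableDelay(ms, signal) {
  return new Promise((resolve, reject) => {
    if (signal.aborted) return reject(signal.reason);
    const timer = setTimeout(resolve, ms);
    signal.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(signal.reason);
    }, { once: true });
  });
}

async function classify(ms, signal) {
  try {
    await abortableDelay(ms, signal);
    return 'completed';
  } catch (err) {
    return err.name; // 'TimeoutError' or 'AbortError'
  }
}
```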
### AbortSignal.any()
What if you need both a timeout AND manual cancellation? AbortSignal.any() combines multiple signals:
```javascript
const controller = new AbortController();

const response = await fetch('/api/data', {
  signal: AbortSignal.any([
    controller.signal,          // manual cancellation
    AbortSignal.timeout(5000),  // automatic timeout
  ]),
});

// User navigates away:
controller.abort();
```
The fetch aborts if EITHER signal fires — whichever comes first.
### Making Any Async Operation Abortable
You can make your own functions respect AbortSignal:
```javascript
function delay(ms, { signal } = {}) {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(signal.reason);
      return;
    }
    const timer = setTimeout(resolve, ms);
    signal?.addEventListener('abort', () => {
      clearTimeout(timer);
      reject(signal.reason);
    }, { once: true });
  });
}

async function pollForResult(taskId, { signal } = {}) {
  while (true) {
    signal?.throwIfAborted();
    const result = await fetch(`/api/tasks/${taskId}`, { signal });
    const data = await result.json();
    if (data.status === 'complete') return data;
    await delay(2000, { signal });
  }
}
```
Always check signal.aborted at the START of your function, not just on the event listener. If the signal was already aborted before your function was called, the abort event will never fire — it already happened. The throwIfAborted() method is the cleanest way to check.
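A quick demonstration of the already-aborted case. The first function below is deliberately broken: it only registers a listener, so a pre-aborted signal leaves it hanging forever. The second checks up front with throwIfAborted().

```javascript
// BROKEN for pre-aborted signals: the 'abort' event already fired,
// so this promise never settles.
function waitForAbortEventOnly(signal) {
  return new Promise((_, reject) => {
    signal.addEventListener('abort', () => reject(signal.reason), { once: true });
  });
}

// Correct: check up front. throwIfAborted() throws signal.reason
// (an 'AbortError' DOMException by default) if already aborted.
function safeCheck(signal) {
  signal.throwIfAborted();
  return 'not aborted';
}
```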
## The Circuit Breaker Pattern
Retries handle transient failures. But what about persistent failures? If a service is down for 5 minutes, retrying every request wastes resources and adds latency for users who are all going to get errors anyway.
The circuit breaker pattern borrows from electrical engineering: when too many failures occur, the circuit "opens" and immediately rejects requests without even trying.
```javascript
class CircuitBreaker {
  #state = 'closed';
  #failureCount = 0;
  #lastFailureTime = 0;
  #options;

  constructor(options = {}) {
    this.#options = {
      failureThreshold: 5,
      cooldownMs: 30000,
      ...options,
    };
  }

  async execute(fn) {
    if (this.#state === 'open') {
      const elapsed = Date.now() - this.#lastFailureTime;
      if (elapsed < this.#options.cooldownMs) {
        throw new Error('Circuit breaker is open');
      }
      this.#state = 'half-open';
    }

    try {
      const result = await fn();
      this.#onSuccess();
      return result;
    } catch (err) {
      this.#onFailure();
      throw err;
    }
  }

  #onSuccess() {
    this.#failureCount = 0;
    this.#state = 'closed';
  }

  #onFailure() {
    this.#failureCount++;
    this.#lastFailureTime = Date.now();
    if (this.#failureCount >= this.#options.failureThreshold) {
      this.#state = 'open';
    }
  }
}
```
Usage:
```javascript
const apiBreaker = new CircuitBreaker({
  failureThreshold: 3,
  cooldownMs: 10000,
});

async function fetchData() {
  return apiBreaker.execute(() =>
    fetch('/api/data').then(r => r.json())
  );
}
```
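To see the lifecycle end to end, here is the same logic re-declared as a standalone factory function (so the snippet runs on its own), driven through closed → open → half-open → closed:

```javascript
// Minimal standalone breaker with the same transitions as the class above,
// plus a readable `state` getter for inspection.
function createBreaker({ failureThreshold = 3, cooldownMs = 100 } = {}) {
  let state = 'closed';
  let failures = 0;
  let lastFailure = 0;

  return {
    get state() { return state; },
    async execute(fn) {
      if (state === 'open') {
        if (Date.now() - lastFailure < cooldownMs) {
          // Still cooling down: reject without calling fn at all.
          throw new Error('Circuit breaker is open');
        }
        state = 'half-open'; // allow a trial request through
      }
      try {
        const result = await fn();
        failures = 0;
        state = 'closed';
        return result;
      } catch (err) {
        failures++;
        lastFailure = Date.now();
        if (failures >= failureThreshold) state = 'open';
        throw err;
      }
    },
  };
}
```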
## Idempotency: The Prerequisite for Safe Retries
Here's a question that separates junior from senior engineers: is this operation safe to retry?
```javascript
await fetch('/api/payments', {
  method: 'POST',
  body: JSON.stringify({ amount: 100, to: 'user-42' }),
});
```
If this request succeeds but the response is lost (network hiccup), should you retry? If the server isn't idempotent, you just charged the user twice.
The solution: idempotency keys. Send a unique key with each logical operation. The server checks if it's already processed that key:
```javascript
const idempotencyKey = crypto.randomUUID();

await retry(
  () => fetch('/api/payments', {
    method: 'POST',
    headers: { 'Idempotency-Key': idempotencyKey },
    body: JSON.stringify({ amount: 100, to: 'user-42' }),
  }),
  { maxAttempts: 3 }
);
```
The key is generated ONCE, outside the retry loop. Every retry sends the same key. The server sees the duplicate key and returns the original response instead of processing again.
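For completeness, the server side can be sketched with an in-memory store (a real implementation would persist keys with a TTL in a database; the names here are illustrative, not a real framework API):

```javascript
// Sketch of server-side idempotency-key handling.
const processed = new Map(); // idempotencyKey -> stored response

function handlePayment(idempotencyKey, processFn) {
  if (processed.has(idempotencyKey)) {
    // Duplicate retry: replay the original response, do not charge again.
    return processed.get(idempotencyKey);
  }
  const response = processFn(); // e.g. actually charge the card
  processed.set(idempotencyKey, response);
  return response;
}
```

The first call with a given key performs the charge; every retry with that key gets the stored response back.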
| What developers do | Why it fails | What they should do |
|---|---|---|
| Retrying immediately without delay | Immediate retries create thundering herd effects that can overwhelm recovering servers | Use exponential backoff with jitter |
| Retrying all errors, including 400 Bad Request | Client errors (4xx) won't succeed on retry — the request itself is wrong, so retrying wastes time and resources | Only retry transient errors (5xx, network, timeout) |
| Generating a new idempotency key on each retry | The whole point of idempotency keys is to let the server detect duplicate requests; new keys on each retry defeat this | Generate the key once, before the retry loop |
| Wiring up setTimeout + AbortController by hand | AbortSignal.timeout() handles cleanup automatically and throws a distinguishable TimeoutError instead of a generic AbortError | Use AbortSignal.timeout() for timeouts |
## Combining Everything: Production API Client
Here's how these patterns compose into a robust API client:
```javascript
const circuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  cooldownMs: 30000,
});

async function apiRequest(url, options = {}) {
  const {
    method = 'GET',
    body,
    timeout = 10000,
    maxRetries = 3,
    idempotencyKey,
  } = options;

  const headers = { 'Content-Type': 'application/json' };
  if (idempotencyKey) {
    headers['Idempotency-Key'] = idempotencyKey;
  }

  return retry(
    (attempt, signal) =>
      circuitBreaker.execute(() =>
        fetch(url, {
          method,
          headers,
          body: body ? JSON.stringify(body) : undefined,
          signal: AbortSignal.any([
            signal,
            AbortSignal.timeout(timeout),
          ].filter(Boolean)),
        }).then(async (res) => {
          if (!res.ok) {
            const err = new Error(`HTTP ${res.status}`);
            err.status = res.status;
            throw err;
          }
          return res.json();
        })
      ),
    {
      maxAttempts: maxRetries,
      shouldRetry: (err) => {
        if (err.name === 'TimeoutError') return true;
        if (err.message === 'Circuit breaker is open') return false;
        return err.status >= 500 || err.status === 429;
      },
    }
  );
}
```
Design a retry strategy for a payment processing system. The system processes credit card charges. You need to handle: network timeouts, server errors (5xx), rate limiting (429 with Retry-After header), and ensure no double charges. What patterns would you combine, and how would you handle the Retry-After header?
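One building block for that design, sketched here as a hint: Retry-After may carry either delta-seconds or an HTTP-date (per RFC 9110), so a parser has to handle both before the backoff logic can honor it.

```javascript
// Sketch: convert a Retry-After header value into a delay in milliseconds.
// Returns null if the header is absent or unparseable (fall back to
// exponential backoff with jitter in that case).
function parseRetryAfter(headerValue, now = Date.now()) {
  if (!headerValue) return null;
  const seconds = Number(headerValue);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);
  const date = Date.parse(headerValue); // HTTP-date form
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now);
}
```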