
The idempotency key: a small string that prevents catastrophic bugs

Why your API needs idempotency keys, what they actually buy you, and how to implement one without creating a new class of footguns.

Every senior backend engineer has a story. Mine involves a payments endpoint, a flaky mobile network, and a customer who was charged eleven times for a single coffee.

The fix was a 36-character string the client had to send. The architecture that grew around that string is what I want to talk about.

This is the long version of why idempotency keys exist, what they really protect against, and the design decisions that determine whether your implementation is a safety net or a new outage waiting to happen.

What idempotency actually means

In the strict mathematical sense, an operation is idempotent if applying it twice produces the same result as applying it once. x = 5 is idempotent. x = x + 1 is not.
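The definition can be checked mechanically. Here is a minimal sketch; the names `setTo5`, `increment`, and `isIdempotentAt` are made up for illustration:

```typescript
// An operation f is idempotent when f(f(x)) equals f(x).
const setTo5 = (_x: number): number => 5;        // models `x = 5`
const increment = (x: number): number => x + 1;  // models `x = x + 1`

// Checks the property at a single starting value (not a proof for all x).
function isIdempotentAt(f: (x: number) => number, x: number): boolean {
  return f(f(x)) === f(x);
}
```

Applying `setTo5` twice leaves you where one application did. Applying `increment` twice does not.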

For HTTP APIs, the standard goes a little further. An idempotent endpoint is one where a client can safely retry the same request any number of times and observe at most one effect on server state. The response might be different on the second call — the server might say “you already did this” — but the world has only been changed once.

This distinction matters. Idempotency is not about the response. It’s about side effects.

POST /payments
Idempotency-Key: 8a1f-4b2c-9d3e-7f6a

→ 201 Created
   { "id": "pay_42", "amount": 450, "status": "captured" }

# Retry (network blip on the way back)
POST /payments
Idempotency-Key: 8a1f-4b2c-9d3e-7f6a

→ 200 OK
   { "id": "pay_42", "amount": 450, "status": "captured" }

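Same effect on the world. Same response body. Different status code, deliberately, though replaying the original 201 verbatim is just as defensible and is what the middleware sketch later in this post does.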

The failure mode that forces your hand

You will, at some point, build an endpoint that does work the client cannot afford to do twice. Transfer money. Send an email. Create a subscription. Issue a refund. Provision a phone number.

Then the network will betray you. The TCP connection will close before the client reads the response. The mobile user will walk into a tunnel mid-request. Your load balancer will 502 exactly when the database commits. The retry will fire, and you will have a duplicate.

I have seen a SaaS company lose a customer over a $14 duplicate charge. Not the money. The trust. The customer could not get anyone on the phone fast enough to confirm it was a glitch, assumed they were being scammed, and cancelled a six-figure contract the following week.

The lesson was not “process refunds faster.” The lesson was never create the duplicate in the first place.

Why naive retries make it worse

Most teams reach for a retry layer first. Exponential backoff, jitter, max attempts. It is good hygiene. It is also insufficient.

Retries only help with the case where the server did not do the work. The dangerous case is the one where the server did the work, sent the response, and the response was lost on its way back. The client has no way to distinguish “I never reached you” from “you did it but I never heard back.” It retries either way.

If your endpoint is POST /transfers { from, to, amount }, a retry will create a second transfer. Your carefully tuned backoff is irrelevant. The atomicity boundary is in the wrong place.

Idempotency keys move that boundary. They give the client a way to say: “I’m about to ask you to do something. If you and I have already agreed on this exact request, give me back the answer we already negotiated.”

The client controls the key. The server uses it to recognize repeats.
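On the client side, the important detail is that the key is generated once per logical operation and reused on every attempt. A minimal sketch, assuming a hypothetical `send` callback that performs the actual HTTP call with the given key (backoff and jitter are left out to keep it short):

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical transport: sends the request with the given Idempotency-Key.
type SendFn<T> = (idempotencyKey: string) => Promise<T>;

// Retries a request while keeping the idempotency key stable across attempts.
async function withIdempotentRetry<T>(
  send: SendFn<T>,
  maxAttempts = 3,
): Promise<T> {
  const key = randomUUID(); // one key per operation, NOT one per attempt
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send(key);
    } catch (err) {
      // For example a timeout: we cannot tell whether the server did the work,
      // so we retry with the same key and let the server sort it out.
      lastError = err;
    }
  }
  throw lastError;
}
```

Generating a fresh key inside the loop would defeat the purpose: every retry would look like a brand-new request.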

The contract

There are four rules I have come to treat as non-negotiable. They are not in the spec — they are in the scars.

1. The key is scoped to the resource being created

Stripe’s documentation is explicit about this: an idempotency key is tied to a single account and a single endpoint. Reusing a key for POST /payments and POST /refunds is undefined behavior.

This is not a politeness. It is a correctness property. If keys share one global namespace, your collision surface includes every operation any client has ever performed. If a key is locally scoped, the worst case is a collision within one account and one endpoint, which is far easier to reason about.

type IdempotencyScope = {
  accountId: string;
  endpoint: string;     // "POST /v1/payments"
  key: string;          // the client's key
};

The composite scope + key is what you actually store and look up.
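One way to turn the scope into a single lookup value is to serialize the tuple. This is a sketch, not the only option; the `storageKey` helper is hypothetical, and a composite unique constraint in the database works just as well:

```typescript
type IdempotencyScope = {
  accountId: string;
  endpoint: string; // "POST /v1/payments"
  key: string;      // the client's key
};

// JSON-encode the tuple instead of joining with a delimiter, so that a
// delimiter character inside one field cannot make two scopes collide.
function storageKey(scope: IdempotencyScope): string {
  return JSON.stringify([scope.accountId, scope.endpoint, scope.key]);
}
```

The same client key used against two endpoints produces two different storage keys, so the records never overlap.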

2. The server holds the response

This is the part most homegrown implementations get wrong. The first time the request arrives, you do the work and you store the response. The second time the same scope + key arrives, you do not re-execute the handler. You return the stored response.

Without this, the idempotency key only protects you from partial failures during the first call. It does not protect you from the second call running the work again. Which is the whole point.

The storage needs to be strongly consistent. A read-after-write in a single region is fine. A read-after-write across a leader-follower setup, where the follower might lag, is a footgun. Use the primary. Or use a dedicated idempotency store that has its own durability guarantees.

3. Concurrent requests with the same key are serialized

Two requests with the same key arrive within 50ms of each other. What happens?

If you check for an existing key, find none, and let both run, you have just created the race you were trying to prevent. The classic fix is to use the idempotency record itself as a lock:

BEGIN;
  -- Insert a placeholder row. The UNIQUE (scope, key) constraint
  -- guarantees that only one request can win.
  -- request_hash is NOT NULL in the table definition, so it is set here.
  INSERT INTO idempotency_keys (scope, key, status, request_hash, locked_until)
  VALUES ($1, $2, 'in_progress', $3, now() + interval '30 seconds')
  ON CONFLICT (scope, key) DO NOTHING
  RETURNING id;

  -- If RETURNING is empty, the key already exists.
  -- Either it's complete (return stored response) or
  -- in_progress (poll or 409 Conflict with Retry-After).
COMMIT;

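What the second request experiences depends on when the first one commits. If the first request keeps its transaction open while the handler runs, the second INSERT blocks on the uncommitted row. When the first request writes the response and commits, the second wakes up with an empty RETURNING, sees the completed row, and returns the stored response. If the placeholder is committed right away, as in the snippet above, the second request does not block: it sees in_progress and polls or gets a 409 until the first request has written its response. The locked_until lease is what lets a later retry reclaim the key if the first request dies.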

I have seen teams skip this. The argument is usually “we don’t get that much traffic.” The argument is wrong. The two requests are a result of a client retrying after a timeout. They will land on different workers. They will both see “no key exists.” Both will run. You will have the bug.
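The claim-then-run shape is easier to see without a database in the way. Here is a single-process sketch in which a `Map` stands in for the table; `InMemoryKeys` and `handle` are illustrative names, and the atomicity here comes from JavaScript's single thread rather than from a unique constraint:

```typescript
type Entry =
  | { status: "in_progress" }
  | { status: "complete"; response: string };

class InMemoryKeys {
  private entries = new Map<string, Entry>();

  // Check-and-set in one synchronous step: nothing can interleave
  // between the has() and the set() in single-threaded JavaScript.
  claim(id: string): boolean {
    if (this.entries.has(id)) return false;
    this.entries.set(id, { status: "in_progress" });
    return true;
  }

  complete(id: string, response: string): void {
    this.entries.set(id, { status: "complete", response });
  }

  get(id: string): Entry | undefined {
    return this.entries.get(id);
  }
}

async function handle(
  keys: InMemoryKeys,
  id: string,
  work: () => Promise<string>,
): Promise<string> {
  if (!keys.claim(id)) {
    // Someone else owns the key: replay if done, otherwise back off.
    const entry = keys.get(id);
    if (entry !== undefined && entry.status === "complete") return entry.response;
    return "409 in_progress";
  }
  const response = await work(); // only the winner of the claim runs the work
  keys.complete(id, response);
  return response;
}
```

Two overlapping calls with the same id run the work once. The loser gets the in-progress answer, and any later call gets the stored response.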

4. Keys expire, and you say so

A key with no TTL is a memory leak you can never clean up. A key with a TTL is a policy decision about how long you promise to remember.

Stripe’s default is 24 hours. That is a reasonable starting point. Long enough to cover retries across any reasonable outage. Short enough that the table does not become an archaeological record.

When a key expires, be explicit about what happens next. A request that arrives after expiry with an old key runs as a new request. This is correct behavior, as long as it is documented. The client had a 24-hour window. If they are still retrying the same key after that, something else is wrong on their side, and quietly extending the window would only hide it.

CREATE TABLE idempotency_keys (
  id            uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  scope         text NOT NULL,
  key           text NOT NULL,
  status        text NOT NULL CHECK (status IN ('in_progress', 'complete')),
  request_hash  text NOT NULL,           -- see below
  response_code integer,
  response_body jsonb,
  locked_until  timestamptz,
  created_at    timestamptz NOT NULL DEFAULT now(),
  completed_at  timestamptz,
  UNIQUE (scope, key)
);

CREATE INDEX idempotency_keys_expires_idx
  ON idempotency_keys (completed_at)
  WHERE status = 'complete';
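The expiry rule itself is small enough to write down. A sketch, assuming the TTL is measured from `completedAt` (matching the partial index above) and that in-progress rows are handled separately through their lock; the `KeyRecord` shape and function names are illustrative:

```typescript
type KeyRecord = {
  status: "in_progress" | "complete";
  completedAt: Date | null;
};

// A record expires once it has been complete for at least ttlSeconds.
function isExpired(record: KeyRecord, now: Date, ttlSeconds: number): boolean {
  if (record.status !== "complete" || record.completedAt === null) return false;
  return now.getTime() - record.completedAt.getTime() >= ttlSeconds * 1000;
}

// Returns the ids a periodic sweep would delete.
function expiredIds(
  records: Map<string, KeyRecord>,
  now: Date,
  ttlSeconds: number,
): string[] {
  const ids: string[] = [];
  records.forEach((record, id) => {
    if (isExpired(record, now, ttlSeconds)) ids.push(id);
  });
  return ids;
}
```

Passing `now` in as a parameter keeps the rule testable and makes the retention window an explicit input rather than something buried in a query.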

The parameter mismatch problem

A subtle question: what if the client sends the same key twice but with different request bodies? POST /payments with key abc-123 and amount 450, then again with key abc-123 and amount 9999.

If you just return the stored response, you have silently lied. The client thinks they paid 9999, but the system says 450. They will not notice until reconciliation.

The fix is to hash the request body and store the hash alongside the response.

const requestHash = sha256(canonicalize(requestBody));

// On replay: reject a reused key whose parameters differ from the original.
if (existing.requestHash !== requestHash) {
  return res.status(422).json({
    error: "idempotency_key_reused_with_different_parameters",
  });
}

canonicalize is important here. JSON object key order is not stable across clients. Sort the keys before hashing. The hash is over your canonical form, not the raw bytes the client sent.
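A minimal sketch of what `canonicalize` and `sha256` could look like, assuming the body has already been parsed from JSON. It only normalizes object key order; it is not a full canonicalization scheme and does nothing about number formatting or Unicode normalization:

```typescript
import { createHash } from "node:crypto";

// Recursively rebuild the value with object keys in sorted order, so that
// {"a":1,"b":2} and {"b":2,"a":1} serialize to the same string.
function sortKeys(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => sortKeys(item)); // array order is meaningful
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const sorted: Record<string, unknown> = {};
    Object.keys(source).sort().forEach((k) => {
      sorted[k] = sortKeys(source[k]);
    });
    return sorted;
  }
  return value; // strings, numbers, booleans, null
}

function canonicalize(value: unknown): string {
  // JSON.stringify yields undefined for an undefined body; treat that as null.
  return JSON.stringify(sortKeys(value)) ?? "null";
}

function sha256(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}
```

With this in place, two clients that send the same fields in a different order produce the same hash, while a changed amount produces a different one.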

This single check has caught more bugs in production than I can count. It is the kind of defensive engineering that pays for itself the first time someone fat-fingers a parameter and the retry layer silently masks it.

When not to use idempotency keys

They are not free. There is storage, there is a database round-trip on every write, there is the operational cost of explaining yet another concept to your team. Use them deliberately.

  • Read endpoints. GET /users/42 is already idempotent. Do not add a key.
  • Truly safe writes. PUT /users/42 with the full user state replacing whatever is there is idempotent. The key would be ceremony.
  • Bulk operations where the cost of the check exceeds the cost of the work. If you are inserting 50,000 rows in a batch, the idempotency check is noise. Use a different mechanism, like a job ID and a “have I processed this job” table.
  • Endpoints whose only side effects are incidental. Logs, metrics, audit events. A duplicate costs you a redundant entry, not a duplicate charge.

The right question is: “what is the cost of this work firing twice?” If the answer is anything more than “wasted compute,” use a key.

A working sketch

Here is a version of the middleware I have written more than once. It is a sketch, not a library. The shape is what matters.

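import { createHash } from 'node:crypto';
import type { Request, Response, NextFunction } from 'express';

type StoredResponse = {
  status: number;
  body: unknown;
  headers: Record<string, string>;
};

type StoredRecord = { requestHash: string; response?: StoredResponse };

// The store interface this sketch assumes. claim() must be atomic, for
// example the INSERT ... ON CONFLICT DO NOTHING shown earlier: it writes
// an in_progress row and resolves true only for the request that won.
interface IdempotencyStore {
  claim(scope: string, key: string, requestHash: string, ttlSeconds: number): Promise<boolean>;
  lookup(scope: string, key: string): Promise<StoredRecord | null>;
  complete(scope: string, key: string, requestHash: string, response: StoredResponse): Promise<void>;
}

// Helpers assumed to exist elsewhere in the codebase.
declare function canonicalize(value: unknown): string; // sorted-key JSON, as described above
declare function isValidKey(key: string): boolean;     // format and length check

const hash = (input: string) => createHash('sha256').update(input).digest('hex');

export function idempotent(opts: {
  store: IdempotencyStore;
  scope: (req: Request) => string;
  ttlSeconds: number;
}) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const key = req.header('Idempotency-Key');
    if (!key) return next();           // opt-in
    if (!isValidKey(key)) {
      return res.status(400).json({
        error: 'invalid_idempotency_key',
      });
    }

    try {
      const scope = opts.scope(req);
      const requestHash = hash(canonicalize(req.body));

      // Atomically claim the key. Only one concurrent request wins.
      const claimed = await opts.store.claim(scope, key, requestHash, opts.ttlSeconds);
      if (!claimed) {
        const stored = await opts.store.lookup(scope, key);

        if (stored && stored.requestHash !== requestHash) {
          return res.status(422).json({
            error: 'idempotency_key_reused_with_different_parameters',
          });
        }

        if (stored?.response) {
          // Replay the original response exactly.
          return res.status(stored.response.status)
            .set(stored.response.headers)
            .json(stored.response.body);
        }

        // Same key, first request still running: ask the client to come back.
        return res.status(409).set('Retry-After', '1').json({
          error: 'idempotency_key_in_progress',
        });
      }

      // Capture the response so we can store it.
      const originalJson = res.json.bind(res);
      res.json = (body: unknown) => {
        const captured: StoredResponse = {
          status: res.statusCode,
          body,
          headers: { 'content-type': 'application/json' },
        };
        // Best-effort storage. If it fails, a retry after the lock expires will run again.
        opts.store.complete(scope, key, requestHash, captured)
          .catch((err) => console.error('idempotency write failed', { err, scope, key }));
        return originalJson(body);
      };

      next();
    } catch (err) {
      // A failing store must not turn into an unhandled promise rejection.
      next(err);
    }
  };
}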

A few things worth pointing out.

The capture happens by monkey-patching res.json. This is ugly. It is also the simplest way to handle every Express handler that uses the standard response methods, which is most of them. If you control the response layer, do this. If not, you will spend a week teaching the team to remember a wrapper.

The storage failure is logged and ignored. This is a judgment call. I would rather risk a duplicate than a 500. The alternative — failing the request because you could not remember its key — is usually worse. The next retry will re-run. You will see the duplicate in your metrics. You can refund the $14 and move on.

The key validation is hand-rolled and minimal. UUIDs are good. A short random token like 8a1f-4b2c-9d3e-7f6a is good. A 1MB string is a denial-of-service vector. Cap it, even if it feels pedantic.
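For concreteness, here is one shape the validation could take. The 255-character cap and the allowed character set are assumptions; pick your own limits and document them:

```typescript
// Assumed limit, not a standard. The point is that a limit exists.
const MAX_KEY_LENGTH = 255;

// Accepts UUIDs and similar URL-safe tokens; rejects empty, oversized,
// or oddly encoded keys before they reach the database.
function isValidKey(key: string): boolean {
  return (
    key.length > 0 &&
    key.length <= MAX_KEY_LENGTH &&
    /^[A-Za-z0-9_-]+$/.test(key)
  );
}
```

The length check runs before the regular expression, so an oversized key is rejected without scanning it.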

What the operations team actually sees

The interesting failure modes are not in the code. They are in the runbook.

“Why is the same payment being returned twice from the lookup?” The request hashes are identical. The response is identical. The client is fine. That is a replay doing its job, not a bug.

“Why did this payment succeed but the client says it failed?” The handler ran. The handler stored the response. The response storage succeeded. The client then disconnected before reading the body. The retry came in. The idempotency key was recognized. The stored response was returned. The client received it. This is correct.

“Why is the response storage failing?” Usually, the table is full. Usually, the table is full because someone set the TTL to “forever” and a test suite is generating fresh keys per request without expiry. The TTL is not a setting. It is a contract. Audit it.

“The key was stored as in_progress and never completed.” The handler crashed. The lock expired. The next retry saw the lock was free, claimed it, and ran. The original request is gone. This is correct, but you should alert on the rate of in_progress records that expire without completing. A spike is a signal of a class of handler bug, not a single incident.
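The number to alert on can be computed from the rows themselves. A rough sketch, assuming each row exposes its status and `lockedUntil`; the `abandonedRate` name and the idea of expressing it as a fraction are illustrative choices:

```typescript
type KeyRow = {
  status: "in_progress" | "complete";
  lockedUntil: Date;
};

// Fraction of rows that are still in_progress although their lock has
// expired, i.e. the handler most likely crashed before completing.
function abandonedRate(rows: KeyRow[], now: Date): number {
  if (rows.length === 0) return 0;
  const abandoned = rows.filter(
    (row) => row.status === "in_progress" && row.lockedUntil.getTime() < now.getTime(),
  ).length;
  return abandoned / rows.length;
}
```

Tracked over a fixed window, a jump in this value points at a handler that started failing, rather than at one unlucky request.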

The thing nobody tells you

Idempotency keys are not a backend technique. They are a contract between you and every client you will ever have, including future you, including a frontend engineer who will be at this company in two years, including a script that someone writes at 2am to migrate data and forgets to add the key.

The work of getting them right is mostly the work of writing things down. The endpoint. The key format. The TTL. The error code for parameter mismatches. The header name. The scope rules. The retention window.

If those things are not in your public API documentation, your idempotency implementation is an internal folklore trick. The day someone on the client team reads the source code to figure out what is happening is the day you have a multi-hour debugging session that could have been avoided by a paragraph.

I have made the mistake. More than once. I have opened a Jira ticket labeled “Document the idempotency contract” and closed it as “won’t fix, internal pattern.” I have been wrong every time. The contract is the system. The code is just the mechanism that enforces it.

Write it down. Put it in the docs. Make it a public, boring, well-worded page. Future you, debugging at 3am, will send a thank-you note.

A closing thought

The most underrated property of a good idempotency implementation is that it makes the system feel honest. Clients can retry without fear. Operators can see exactly what the server did. Failed requests are recoverable, not mysterious. The state of the world is what the documentation says it is, not whatever the last network packet happened to contain.

It is a small piece of infrastructure. It will not show up in any product demo. It will, eventually, be the reason a customer trusts you with the next ten years of their business.

That is the whole job, really.
