
Reliable Webhook Handling Strategies for Data Integrity

April 30, 2026 · 6 min read · By Thomas A. Anderson

Building Reliable Webhooks: Idempotency, Retries, and Replay Defense

Stripe retries webhook deliveries for up to roughly 72 hours with exponential backoff. That single behavior explains why so many integrations accidentally double-charge users, send duplicate emails, or corrupt state. Webhooks are not reliable by default. They become reliable only when your handler is designed for failure, duplication, and replay from day one.

This article walks through production patterns that actually work: idempotency, retry-safe processing, and replay defense. If you have ever debugged “why did this run twice?” or “why did this event never process?”, these are the issues addressed here.

Webhook Architecture That Survives Failures

Start with working code. This example shows a baseline pattern used in production systems: receive, acknowledge quickly, and process asynchronously.


import express from 'express';
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

const app = express();
app.use(express.json());

// BullMQ workers need maxRetriesPerRequest: null on the ioredis connection
const connection = new IORedis({ maxRetriesPerRequest: null });
const queue = new Queue('webhooks', { connection });

// Acknowledge fast: enqueue the event and return 2xx immediately
app.post('/webhook/payment', async (req, res) => {
  const event = req.body;

  await queue.add('process-event', { event });

  res.status(200).json({ received: true });
});

// Do the real work asynchronously in a worker process
new Worker('webhooks', async (job) => {
  const { event } = job.data;

  if (event.type === 'payment_intent.succeeded') {
    console.log('Processing payment:', event.id);
  }
}, { connection });

app.listen(3000);

This pattern matters because webhook providers expect a quick 2xx response. According to this webhook best practices guide, providers typically time out within 5 to 30 seconds and retry if they do not receive a success response.

If you do real work inside the HTTP handler, you will eventually hit timeouts. When that happens:

  • The provider retries
  • Your logic runs again
  • You get duplicate side effects

This is the root cause of most webhook bugs.


This architecture connects directly to reliability lessons from platform reliability failures in 2026. When any single dependency becomes unreliable, tightly coupled systems break first. Webhooks present the same problem at a smaller scale. If your handler blocks or fails, retries amplify the damage.

Idempotency Patterns That Actually Work

Idempotency means running the same event twice produces the same result as running it once. Without it, retries corrupt your system.

Start with the simplest working implementation: store processed event IDs.

async function handleEvent(db, event) {
  // Check whether this event has already been processed
  const existing = await db.query(
    'SELECT event_id FROM processed_events WHERE event_id = $1',
    [event.id]
  );

  if (existing.rows.length > 0) {
    console.log('Duplicate event skipped:', event.id);
    return;
  }

  await db.query(
    'UPDATE orders SET status = $1 WHERE payment_id = $2',
    ['paid', event.data.object.id]
  );

  await db.query(
    'INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())',
    [event.id]
  );
}

This works, but it has a race condition under concurrent retries. The safer pattern is an atomic insert with conflict handling.

async function handleEventSafe(db, event) {
  // Atomically claim the event: only one caller wins the insert
  const result = await db.query(`
    INSERT INTO processed_events (event_id, processed_at)
    VALUES ($1, NOW())
    ON CONFLICT (event_id) DO NOTHING
    RETURNING event_id
  `, [event.id]);

  if (result.rows.length === 0) {
    return; // already processed by a concurrent retry
  }

  await db.query(
    'UPDATE orders SET status = $1 WHERE payment_id = $2',
    ['paid', event.data.object.id]
  );
}

This approach avoids double execution even if two retries arrive at the same time.

Another key technique is using upserts instead of inserts.

-- Instead of this:
INSERT INTO subscriptions (stripe_id, status)
VALUES ($1, $2);

-- Use this:
INSERT INTO subscriptions (stripe_id, status)
VALUES ($1, $2)
ON CONFLICT (stripe_id)
DO UPDATE SET status = EXCLUDED.status;

Benefits of these techniques:

  • Retries become harmless
  • Out-of-order events can be handled
  • State remains consistent

This connects directly to event-driven systems like PostgreSQL LISTEN/NOTIFY patterns. In both cases, you must assume events can arrive late, duplicated, or out of order.
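One way to tolerate out-of-order delivery is to track the newest event applied per object and drop anything older. Here is a minimal in-memory sketch of that guard; the names (`lastApplied`, `shouldApply`) are illustrative, and in production this state would live in the database, for example as an `updated_at` check inside the upsert itself.

```javascript
// Sketch: drop events older than the last one applied for the same
// object. In-memory only for illustration; a real system would check
// a timestamp column inside the upsert instead.
const lastApplied = new Map();

function shouldApply(objectId, eventTimestamp) {
  const last = lastApplied.get(objectId);
  if (last !== undefined && eventTimestamp <= last) {
    return false; // stale duplicate or out-of-order event
  }
  lastApplied.set(objectId, eventTimestamp);
  return true;
}
```

With this guard in front of the upsert, an "updated" event that arrives before "created" simply wins by timestamp, and the late-arriving older event is ignored.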

Retry Strategies and HTTP Semantics

Webhook retries are controlled by HTTP status codes. Returning the wrong code either loses events or creates retry storms.

Here is an effective handler that uses correct semantics:

app.post('/webhook', async (req, res) => {
  let event;

  try {
    event = verifySignature(req.body, req.headers);
  } catch (err) {
    // Bad signature is permanent: retrying will never succeed
    return res.status(400).json({ error: 'Invalid signature' });
  }

  if (!event.type) {
    return res.status(200).json({ skipped: true });
  }

  try {
    await queue.add('event', { event });
    return res.status(200).json({ received: true });
  } catch (err) {
    // Queue outage is transient: ask the provider to retry
    return res.status(503).json({ error: 'Queue unavailable' });
  }
});

Behavior breakdown:

  • 200: success, no retry
  • 400: permanent failure, no retry
  • 503: temporary failure, retry later

As Hookdeck describes for real-world webhook systems, retries often use exponential backoff spread over hours or days. That means a single failure can re-trigger processing many times, and transient outages create burst traffic later.
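If you implement retries on your own side (for example when re-enqueueing failed jobs), the same backoff idea applies. A small sketch of exponential backoff with full jitter; `baseMs` and `maxMs` are illustrative defaults, not values any particular provider uses:

```javascript
// Sketch of exponential backoff with full jitter. The random source is
// injectable so the schedule can be tested deterministically.
function retryDelayMs(attempt, baseMs = 1000, maxMs = 3600000, random = Math.random) {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter spreads retries out so a transient outage does not
  // produce a synchronized burst when deliveries resume.
  return Math.floor(random() * cap);
}
```

The jitter is the important part: without it, every consumer that failed during the same outage retries at the same instant, recreating the overload.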

This is similar to queue backpressure problems seen in async systems like Tokio. If your worker cannot keep up, retries increase load instead of helping.

Replay Attacks and Defense Strategies

Retries are expected. Replays from attackers are not.

A replay attack occurs when someone captures a valid webhook request and sends it again later. If your system blindly trusts it, actions may be triggered multiple times.

Basic replay defense has three layers:

  • Signature verification
  • Timestamp validation
  • Idempotency tracking

Here is a working example:

import { timingSafeEqual } from 'crypto';

function verifyWebhook(payload, signature, timestamp) {
  const toleranceSeconds = 300; // 5 minutes

  const now = Math.floor(Date.now() / 1000);

  // Reject stale requests before doing any crypto work
  if (Math.abs(now - timestamp) > toleranceSeconds) {
    throw new Error('Replay detected: timestamp too old');
  }

  // generateHmac computes the provider-specific HMAC; real schemes
  // (e.g. Stripe's) include the timestamp in the signed payload
  const expectedSignature = generateHmac(payload);

  // Constant-time comparison avoids leaking signature bytes via timing
  const received = Buffer.from(signature);
  const expected = Buffer.from(expectedSignature);
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    throw new Error('Invalid signature');
  }

  return true;
}

Why timestamps matter:

  • They prevent replay of old valid requests
  • They limit the attack window
  • They complement idempotency

Idempotency alone does not fully protect you. If an attacker sends a modified payload with a new ID, your system will treat it as new. Signature verification closes that gap.

Replay defense is an area where many teams cut corners. It works fine until someone probes your endpoint. Then it becomes a security problem, not just a reliability issue.

Approach Comparison and Tradeoffs

  • Store event IDs — prevents duplicate processing; fails via race conditions without an atomic insert
  • Queue-based processing — handles retries and async work; fails when a queue outage blocks processing
  • HTTP status control — controls retry behavior; the wrong code causes data loss or retry storms
Key tradeoff: simplicity versus safety.

  • Simple handlers are easy to write but fail under retries
  • Reliable handlers require queues, storage, and careful design

There is no shortcut here. Reliability is a matter of architecture, not syntax.

Production Pitfalls You Will Hit

These issues appear in almost every real system.

1. Out-of-order events

You may receive “updated” before “created.” Always design database writes to tolerate this. Upserts are helpful.

2. Silent failures

If your handler returns 200 but fails internally, the provider assumes success. Logging and monitoring are required.

3. Retry amplification

A temporary outage leads to many retries later. Your system must absorb bursts.
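Absorbing bursts usually means rate-limiting the worker rather than the endpoint. The idea can be sketched with a token bucket; `capacity` and `refillPerSec` here are tuning knobs for illustration, not recommended values:

```javascript
// Illustrative token-bucket limiter for absorbing retry bursts.
// The current time is injectable so behavior can be tested.
function makeBucket(capacity, refillPerSec) {
  let tokens = capacity;
  let last = null;
  return function tryConsume(now = Date.now()) {
    if (last !== null) {
      tokens = Math.min(capacity, tokens + ((now - last) / 1000) * refillPerSec);
    }
    last = now;
    if (tokens >= 1) {
      tokens -= 1;
      return true; // process this event now
    }
    return false; // defer: leave the job queued and let the worker pick it up later
  };
}
```

Queue libraries offer this built in (BullMQ workers accept a rate limiter option, for example), but the mechanism is the same: excess events wait in the queue instead of hammering your database.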

4. Missing observability

You need logs for:

  • Event received
  • Event queued
  • Event processed
  • Event failed

5. Dead-letter handling

Events that fail repeatedly should not disappear. Move them to a dead-letter queue for inspection.
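The pattern can be sketched without any queue library: retry the handler a bounded number of times, then park the event for inspection once the budget is exhausted. The names here (`MAX_ATTEMPTS`, `deadLetter`, `processWithDlq`) are illustrative; with BullMQ you would instead set `attempts` on the job and handle jobs that end up failed.

```javascript
// Sketch: bounded retries with a dead-letter store. In-memory only for
// illustration; real systems persist dead-letter entries durably.
const MAX_ATTEMPTS = 5;
const deadLetter = [];

async function processWithDlq(event, handler) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await handler(event);
      return true; // processed successfully
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        // Budget exhausted: park the event instead of losing it
        deadLetter.push({ event, error: String(err), attempts: attempt });
      }
    }
  }
  return false;
}
```

The point is that failure is a terminal state you can see, not a silent drop: someone can inspect the dead-letter entries, fix the bug, and replay them.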

Key Takeaways:

  • Webhook reliability depends more on your handler than the provider
  • Idempotency is mandatory, not optional
  • Return correct HTTP codes to control retries
  • Process asynchronously to avoid timeouts
  • Replay defense requires signatures and timestamps, not just IDs

Reliable webhook handling is one of those areas where small mistakes compound quickly. A missing idempotency check turns a single failure into ten duplicate actions. A slow handler leads to hours of retries instead of one missed delivery.

The solution is straightforward, but it takes discipline: treat every event as if it might arrive twice, late, or maliciously. When you build for that reality, your system stops breaking under normal conditions.

Thomas A. Anderson

Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...