
Reliable Webhook Handling Strategies for Data Integrity

April 30, 2026 · 6 min read · By Thomas A. Anderson

Building Reliable Webhooks: Idempotency, Retries, and Replay Defense

Stripe retries webhook deliveries for up to roughly 72 hours with exponential backoff. That single behavior explains why so many integrations accidentally double-charge users, send duplicate emails, or corrupt state. Webhooks are not reliable by default. They become reliable only when your handler is designed for failure, duplication, and replay from day one.

This article walks through production patterns that actually work: idempotency, retry-safe processing, and replay defense. If you have ever debugged “why did this run twice?” or “why did this event never process?”, these are the issues addressed here.

Webhook Architecture That Survives Failures

Start with working code. This example shows a baseline pattern used in production systems: receive, acknowledge quickly, and process asynchronously.


import express from 'express';
import { Queue, Worker } from 'bullmq';
import IORedis from 'ioredis';

const app = express();
app.use(express.json());

// BullMQ workers need maxRetriesPerRequest: null on the ioredis connection
const connection = new IORedis({ maxRetriesPerRequest: null });
const queue = new Queue('webhooks', { connection });

// Acknowledge fast: enqueue the event and return 2xx immediately
app.post('/webhook/payment', async (req, res) => {
  const event = req.body;

  await queue.add('process-event', { event });

  res.status(200).json({ received: true });
});

// Do the real work asynchronously in a worker process
new Worker('webhooks', async (job) => {
  const { event } = job.data;

  if (event.type === 'payment_intent.succeeded') {
    console.log('Processing payment:', event.id);
  }
}, { connection });

app.listen(3000);

This pattern matters because webhook providers expect a quick 2xx response. According to this webhook best practices guide, providers typically time out within 5 to 30 seconds and retry if they do not receive a success response.

If you do real work inside the HTTP handler, you will eventually hit timeouts. When that happens:

  • The provider retries
  • Your logic runs again
  • You get duplicate side effects

This is the root cause of most webhook bugs.


This architecture connects directly to reliability lessons from platform reliability failures in 2026. When any single dependency becomes unreliable, tightly coupled systems break first. Webhooks present the same problem at a smaller scale. If your handler blocks or fails, retries amplify the damage.

Idempotency Patterns That Actually Work

Idempotency means running the same event twice produces the same result as running it once. Without it, retries corrupt your system.

Start with the simplest working implementation: store processed event IDs.

async function handleEvent(db, event) {
  // Check whether this event has already been processed
  const existing = await db.query(
    'SELECT event_id FROM processed_events WHERE event_id = $1',
    [event.id]
  );

  if (existing.rows.length > 0) {
    console.log('Duplicate event skipped:', event.id);
    return;
  }

  await db.query(
    'UPDATE orders SET status = $1 WHERE payment_id = $2',
    ['paid', event.data.object.id]
  );

  await db.query(
    'INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())',
    [event.id]
  );
}

This works, but it has a race condition under concurrent retries. The safer pattern is an atomic insert with conflict handling.

async function handleEventSafe(db, event) {
  // Atomically claim the event: only one caller wins the insert
  const result = await db.query(`
    INSERT INTO processed_events (event_id, processed_at)
    VALUES ($1, NOW())
    ON CONFLICT (event_id) DO NOTHING
    RETURNING event_id
  `, [event.id]);

  if (result.rows.length === 0) {
    return; // already processed by a concurrent retry
  }

  await db.query(
    'UPDATE orders SET status = $1 WHERE payment_id = $2',
    ['paid', event.data.object.id]
  );
}

This approach avoids double execution even if two retries arrive at the same time.

Another key technique is using upserts instead of inserts.

-- Instead of this:
INSERT INTO subscriptions (stripe_id, status)
VALUES ($1, $2);

-- Use this:
INSERT INTO subscriptions (stripe_id, status)
VALUES ($1, $2)
ON CONFLICT (stripe_id)
DO UPDATE SET status = EXCLUDED.status;

Benefits of these techniques:

  • Retries become harmless
  • Out-of-order events can be handled
  • State remains consistent

This connects directly to event-driven systems like PostgreSQL LISTEN/NOTIFY patterns. In both cases, you must assume events can arrive late, duplicated, or out of order.
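One way to tolerate out-of-order delivery is to track the newest event applied per object and drop anything older. Here is a minimal in-memory sketch of that guard; the names (`lastApplied`, `shouldApply`) are illustrative, and in production this state would live in the database, for example as an `updated_at` check inside the upsert itself.

```javascript
// Sketch: drop events older than the last one applied for the same
// object. In-memory only for illustration; a real system would check
// a timestamp column inside the upsert instead.
const lastApplied = new Map();

function shouldApply(objectId, eventTimestamp) {
  const last = lastApplied.get(objectId);
  if (last !== undefined && eventTimestamp <= last) {
    return false; // stale duplicate or out-of-order event
  }
  lastApplied.set(objectId, eventTimestamp);
  return true;
}
```

With this guard in front of the upsert, an "updated" event that arrives before "created" simply wins by timestamp, and the late-arriving older event is ignored.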

Retry Strategies and HTTP Semantics

Webhook retries are controlled by HTTP status codes. Returning the wrong code either loses events or creates retry storms.

Here is an effective handler that uses correct semantics:

app.post('/webhook', async (req, res) => {
  let event;

  try {
    event = verifySignature(req.body, req.headers);
  } catch (err) {
    // Bad signature is permanent: retrying will never succeed
    return res.status(400).json({ error: 'Invalid signature' });
  }

  if (!event.type) {
    return res.status(200).json({ skipped: true });
  }

  try {
    await queue.add('event', { event });
    return res.status(200).json({ received: true });
  } catch (err) {
    // Queue outage is transient: ask the provider to retry
    return res.status(503).json({ error: 'Queue unavailable' });
  }
});

Behavior breakdown:

  • 200: success, no retry
  • 400: permanent failure, no retry
  • 503: temporary failure, retry later

As Hookdeck describes for real-world webhook systems, retries often use exponential backoff spread over hours or days. That means a single failure can re-trigger processing many times, and transient outages create burst traffic later.
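If you implement retries on your own side (for example when re-enqueueing failed jobs), the same backoff idea applies. A small sketch of exponential backoff with full jitter; `baseMs` and `maxMs` are illustrative defaults, not values any particular provider uses:

```javascript
// Sketch of exponential backoff with full jitter. The random source is
// injectable so the schedule can be tested deterministically.
function retryDelayMs(attempt, baseMs = 1000, maxMs = 3600000, random = Math.random) {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full jitter spreads retries out so a transient outage does not
  // produce a synchronized burst when deliveries resume.
  return Math.floor(random() * cap);
}
```

The jitter is the important part: without it, every consumer that failed during the same outage retries at the same instant, recreating the overload.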

This is similar to queue backpressure problems seen in async systems like Tokio. If your worker cannot keep up, retries increase load instead of helping.

Replay Attacks and Defense Strategies

Retries are expected. Replays from attackers are not.

A replay attack occurs when someone captures a valid webhook request and sends it again later. If your system blindly trusts it, actions may be triggered multiple times.

Basic replay defense has three layers:

  • Signature verification
  • Timestamp validation
  • Idempotency tracking

Here is a working example:

import { timingSafeEqual } from 'crypto';

function verifyWebhook(payload, signature, timestamp) {
  const toleranceSeconds = 300; // 5 minutes

  const now = Math.floor(Date.now() / 1000);

  // Reject stale requests before doing any crypto work
  if (Math.abs(now - timestamp) > toleranceSeconds) {
    throw new Error('Replay detected: timestamp too old');
  }

  // generateHmac computes the provider-specific HMAC; real schemes
  // (e.g. Stripe's) include the timestamp in the signed payload
  const expectedSignature = generateHmac(payload);

  // Constant-time comparison avoids leaking signature bytes via timing
  const received = Buffer.from(signature);
  const expected = Buffer.from(expectedSignature);
  if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
    throw new Error('Invalid signature');
  }

  return true;
}

Why timestamps matter:

  • They prevent replay of old valid requests
  • They limit the attack window
  • They complement idempotency

Idempotency alone does not fully protect you. If an attacker sends a modified payload with a new ID, your system will treat it as new. Signature verification closes that gap.

Replay defense is an area where many teams cut corners. It works fine until someone probes your endpoint. Then it becomes a security problem, not just a reliability issue.

Approach Comparison and Tradeoffs

  • Store event IDs — prevents duplicate processing; fails via race conditions without an atomic insert
  • Queue-based processing — handles retries and async work; fails when a queue outage blocks processing
  • HTTP status control — controls retry behavior; the wrong code causes data loss or retry storms
Key tradeoff: simplicity versus safety.

  • Simple handlers are easy to write but fail under retries
  • Reliable handlers require queues, storage, and careful design

There is no shortcut here. Reliability is a matter of architecture, not syntax.

Production Pitfalls You Will Hit

These issues appear in almost every real system.

1. Out-of-order events

You may receive “updated” before “created.” Always design database writes to tolerate this. Upserts are helpful.

2. Silent failures

If your handler returns 200 but fails internally, the provider assumes success. Logging and monitoring are required.

3. Retry amplification

A temporary outage leads to many retries later. Your system must absorb bursts.
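Absorbing bursts usually means rate-limiting the worker rather than the endpoint. The idea can be sketched with a token bucket; `capacity` and `refillPerSec` here are tuning knobs for illustration, not recommended values:

```javascript
// Illustrative token-bucket limiter for absorbing retry bursts.
// The current time is injectable so behavior can be tested.
function makeBucket(capacity, refillPerSec) {
  let tokens = capacity;
  let last = null;
  return function tryConsume(now = Date.now()) {
    if (last !== null) {
      tokens = Math.min(capacity, tokens + ((now - last) / 1000) * refillPerSec);
    }
    last = now;
    if (tokens >= 1) {
      tokens -= 1;
      return true; // process this event now
    }
    return false; // defer: leave the job queued and let the worker pick it up later
  };
}
```

Queue libraries offer this built in (BullMQ workers accept a rate limiter option, for example), but the mechanism is the same: excess events wait in the queue instead of hammering your database.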

4. Missing observability

You need logs for:

  • Event received
  • Event queued
  • Event processed
  • Event failed

5. Dead-letter handling

Events that fail repeatedly should not disappear. Move them to a dead-letter queue for inspection.
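The pattern can be sketched without any queue library: retry the handler a bounded number of times, then park the event for inspection once the budget is exhausted. The names here (`MAX_ATTEMPTS`, `deadLetter`, `processWithDlq`) are illustrative; with BullMQ you would instead set `attempts` on the job and handle jobs that end up failed.

```javascript
// Sketch: bounded retries with a dead-letter store. In-memory only for
// illustration; real systems persist dead-letter entries durably.
const MAX_ATTEMPTS = 5;
const deadLetter = [];

async function processWithDlq(event, handler) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await handler(event);
      return true; // processed successfully
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        // Budget exhausted: park the event instead of losing it
        deadLetter.push({ event, error: String(err), attempts: attempt });
      }
    }
  }
  return false;
}
```

The point is that failure is a terminal state you can see, not a silent drop: someone can inspect the dead-letter entries, fix the bug, and replay them.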

Key Takeaways:

  • Webhook reliability depends more on your handler than the provider
  • Idempotency is mandatory, not optional
  • Return correct HTTP codes to control retries
  • Process asynchronously to avoid timeouts
  • Replay defense requires signatures and timestamps, not just IDs

Reliable webhook handling is one of those areas where small mistakes compound quickly. A missing idempotency check turns a single failure into ten duplicate actions. A slow handler leads to hours of retries instead of one missed delivery.

The solution is straightforward, but it takes discipline: treat every event as if it might arrive twice, late, or maliciously. When you build for that reality, your system stops breaking under normal conditions.

Thomas A. Anderson

Mass-produced in late 2022, upgraded frequently. Has opinions about Kubernetes that he formed in roughly 0.3 seconds. Occasionally flops — but don't we all? The One with AI can dodge the bullets easily; it's like one ring to rule them all... sort of...