Microsoft Open Sources pg_durable: PostgreSQL Durable Execution Comes to the Database Layer

Microsoft has open sourced pg_durable, a PostgreSQL extension that brings in-database durable execution directly into SQL workflows. The project, hosted on GitHub at microsoft/pg_durable, allows developers to define long-running, fault-tolerant SQL functions that survive crashes, restarts, and failovers without external orchestration tools like Airflow or Temporal.

The extension tracks state through checkpointed steps, persists progress in PostgreSQL tables that are protected by the Write-Ahead Log (WAL), and resumes from the last durable checkpoint after any interruption. For teams already running PostgreSQL, this removes the need to stitch together cron jobs, worker queues, status tables, and retry logic to make background work reliable.

Image: server racks in a modern data center running PostgreSQL database workloads. Caption: pg_durable turns PostgreSQL into a self-contained durable execution engine, removing the need for separate orchestration infrastructure.

What Is pg_durable?

pg_durable is an open-source PostgreSQL extension authored by Microsoft that enables in-database durable execution. According to the official GitHub repo, it is designed for “long-running, fault-tolerant SQL functions for teams that already keep their state in Postgres and want to stop stitching together cron jobs, workers, queues, and status tables to make background work reliable.”

The extension defines workflows as a graph of SQL steps. Each step is checkpointed atomically within PostgreSQL. If the database crashes, the server restarts, or an individual step fails, execution resumes from the last durable checkpoint. No external recovery scripts, no manual state reconstruction, no polling workers.

How pg_durable Works: Checkpointed Workflow Execution

The core idea behind pg_durable is simple: a durable function is a graph of SQL steps that PostgreSQL executes and checkpoints as it progresses. The extension uses PostgreSQL’s native transaction system, WAL, and row-level security (RLS) to manage state, not an external database or queue.

Here is how the lifecycle works in practice:

Define workflow: Users compose SQL steps using composable operators. The workflow is declared as a sequence of SQL operations, forming a directed graph.
Start execution: The df.start() function initiates the workflow and returns an instance ID that can be used to track progress.
Checkpoint between steps: After each step completes, pg_durable persists the current state and progress atomically to PostgreSQL system tables (df.instances, df.nodes).
Resume on failure: If the database crashes or a step fails, the extension reads the last checkpoint from its state tables, which are WAL-protected like any other PostgreSQL table, and resumes from that exact point. Already-completed steps are not re-executed.
Query status live: Workflow state is visible in real time through standard PostgreSQL queries on the extension’s system tables, using the same authentication and backup model as the rest of the data.

A small data processing workflow might look like this:

Note: The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

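SELECT df.start(
    -- Step 1: fetch up to 100 unprocessed document IDs
    'SELECT id FROM documents WHERE processed = false LIMIT 100'
    |= 'batch'  -- name this step's result "batch" (referenced as $batch below)
    -- Step 2: mark the fetched documents as processed
    ~> 'UPDATE documents SET processed = true WHERE id = ANY($batch)'
);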

This workflow fetches a batch of unprocessed document IDs, then updates them. If the database crashes after the SELECT but before the UPDATE, pg_durable resumes from the checkpoint taken after the SELECT and runs only the UPDATE. Without this extension, the same pattern would require a worker process, a status table, retry counters, and manual reconciliation after failures.
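Progress of a workflow like this can be checked with ordinary queries. The sketch below assumes only the two table names mentioned above, df.instances and df.nodes. The article does not list their columns, so the queries select every column rather than guess at names.

```sql
-- Workflow instances started via df.start().
-- Column layout is not assumed here, hence SELECT *.
SELECT * FROM df.instances;

-- The individual steps (nodes) belonging to those instances.
SELECT * FROM df.nodes;
```

Because these are regular tables, the results can be filtered, joined, or exported with whatever tooling already queries the database.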

Image: software developer working on SQL workflow code on dual monitors. Caption: Developers define workflows entirely in SQL using composable operators, keeping logic next to the data it touches.

Comparison: pg_durable vs. Traditional Orchestration

Teams building reliable background workflows today typically reach for one of several patterns. The table below compares pg_durable against the most common alternatives, based on the project’s own documentation and industry analysis from sources like Franck Pachot’s getting-started guide.

Aspect	pg_cron + Status Tables	External Orchestrator (Airflow, Temporal)	pg_durable
Infrastructure needed	PostgreSQL only	Separate service + database + queue	PostgreSQL only (extension)
Fault tolerance	Manual retry logic, status columns, polling workers	Built-in retries, but external state management	Built-in checkpointing via WAL
Crash recovery	Rerun entire job or manual cleanup	Resumable, but state stored outside database	Resume from last checkpoint automatically
Latency	Low (in-database)	Network hops to external service	Low (in-database)
Operational complexity	Low but fragile	High (multiple services to maintain)	Low (single extension)
Visibility	Custom status tables	Dedicated UI and logs	PostgreSQL tables (df.instances, df.nodes)
Parallel execution	Manual implementation	Built-in DAG support	Built-in via composable operators

The key distinction is architectural. External orchestrators like Airflow and Temporal are powerful, but they introduce a separate state store, a separate runtime, and network hops between the orchestrator and the database. pg_durable collapses that stack: workflow definition, state, execution, and monitoring all live inside PostgreSQL. For teams whose workflows primarily touch data already in PostgreSQL, this eliminates an entire class of infrastructure.

There is a trade-off. As the project’s README notes, “The model is intentionally SQL-shaped.” If a step needs arbitrary code, a non-HTTP SDK, or rich in-memory control flow, you may need to wrap that logic in a SQL function, expose it behind an HTTP endpoint for df.http(), or use a general-purpose orchestrator for that part of the system. pg_durable is a replacement for the common pattern of stitching together PostgreSQL with lightweight external orchestration, not a replacement for Temporal in heterogeneous, multi-system workflows.

Image: abstract visualization of an automated data pipeline workflow. Caption: ETL, AI embedding pipelines, and scheduled maintenance are natural fits for pg_durable’s checkpointed execution model.

Real-World Workloads and Deployment Scenarios

Based on the project’s documented use cases and the official pg_durable documentation site, several workload types map naturally to this extension.

Vector Embedding Pipelines

AI pipelines that chunk documents, call embedding APIs, and upsert results into pgvector are a textbook use case. Each of those steps can be a durable node in the workflow graph. If an embedding API call fails on row 500 of 10,000, pg_durable retries that step without re-chunking the first 499 documents. Without it, a crash mid-pipeline means rerunning the entire batch or writing custom checkpointing code.
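Since a failed step may be retried, it helps if the final write can run more than once without creating duplicates. Here is a minimal sketch of an idempotent upsert step, assuming the pgvector extension is installed. The doc_chunks table and its columns are hypothetical, and the three-dimensional vector is only there to keep the example short.

```sql
-- pgvector provides the "vector" type.
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical target table for embedded chunks (not part of pg_durable).
CREATE TABLE IF NOT EXISTS doc_chunks (
    chunk_id   bigint PRIMARY KEY,
    content    text NOT NULL,
    embedding  vector(3)
);

-- Running this statement twice leaves exactly one row per chunk_id,
-- so a retried step does not duplicate data.
INSERT INTO doc_chunks (chunk_id, content, embedding)
VALUES (1, 'first chunk', '[0.12,-0.03,0.88]')
ON CONFLICT (chunk_id)
DO UPDATE SET content   = EXCLUDED.content,
              embedding = EXCLUDED.embedding;
```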

Data Ingestion and ETL

Batch ingestion workflows that stage, deduplicate, transform, and publish large datasets benefit from checkpointed progress. A 2-hour ingestion job that crashes at 90% completion does not need to restart from the beginning. The extension resumes from the last completed step, saving time and reducing operational pressure.

Scheduled Maintenance Runbooks

DBAs and SREs automating runbooks that must survive restarts and be auditable in SQL can define maintenance workflows directly. Detect bloat, notify, wait for approval, then run the next action. Each step is checkpointed and visible in the same PostgreSQL tables used for monitoring.
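As an illustration of what a "detect bloat" step might run, the query below uses PostgreSQL's standard pg_stat_user_tables view. Dead-tuple counts are only a rough proxy for bloat, and the 10,000-row threshold is an arbitrary assumption, not a value from the pg_durable documentation.

```sql
-- Tables with the most dead tuples, as a rough bloat signal.
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       -- Share of dead tuples; NULLIF avoids division by zero on empty tables.
       round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0), 3) AS dead_ratio,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000          -- arbitrary threshold for this sketch
ORDER BY n_dead_tup DESC
LIMIT 10;
```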

Fan-Out Aggregation

Workflows that run independent queries in parallel, then join results, map naturally to pg_durable’s composable operators. The extension handles coordination and ensures that partial results are not lost if one parallel branch fails.

For each of these workloads, the alternative today is typically a combination of pg_cron, a jobs table with status columns, retry counters, and a polling worker process. The project’s README explicitly calls out this pattern as one of the pain points it addresses: “A plpgsql procedure that works until a crash or long-running transaction forces you to start over.”
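For comparison, here is a sketch of the hand-rolled pattern described above, written in plain PostgreSQL. The jobs table, its columns, the 15-minute timeout, and the retry limit are all hypothetical and unrelated to pg_durable's own tables.

```sql
-- Hypothetical jobs table with a status column and a retry counter.
CREATE TABLE jobs (
    id          bigserial   PRIMARY KEY,
    payload     jsonb       NOT NULL,
    status      text        NOT NULL DEFAULT 'pending',  -- pending | running | done | failed
    attempts    integer     NOT NULL DEFAULT 0,
    updated_at  timestamptz NOT NULL DEFAULT now()
);

-- A polling worker claims one pending job. SKIP LOCKED lets several
-- workers poll concurrently without blocking on each other's rows.
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status     = 'running',
    attempts   = attempts + 1,
    updated_at = now()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.id, jobs.payload;

-- Jobs left in 'running' by a crashed worker must be detected and
-- requeued separately; nothing resumes them automatically.
UPDATE jobs
SET status = 'pending', updated_at = now()
WHERE status = 'running'
  AND updated_at < now() - interval '15 minutes'
  AND attempts < 5;
```

Even in this small form, the pattern needs a claim query, a reaper query, and a worker process to run both, and a requeued job still starts over from its first step.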

Installation and Multi-User Setup

Microsoft ships prebuilt Debian packages for PostgreSQL 17 and 18 on amd64 via GitHub release assets. The packages are named pg-durable-postgresql--1_.deb and include the extension library, control file, and SQL upgrade files.

After installing the package, setup follows standard PostgreSQL extension patterns:

Add pg_durable to shared_preload_libraries in postgresql.conf.
Restart PostgreSQL.
Run CREATE EXTENSION pg_durable; in the configured database.
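The three steps above can be sketched in SQL as follows. ALTER SYSTEM is used here as one assumed way to set the parameter instead of editing postgresql.conf by hand. Note that it replaces the whole list, so any libraries already preloaded must be included in the new value.

```sql
-- Step 1: preload the library (overwrites the existing list; add other
-- entries to the value if the server already preloads libraries).
ALTER SYSTEM SET shared_preload_libraries = 'pg_durable';

-- Step 2: restart PostgreSQL from the operating system, then confirm:
SHOW shared_preload_libraries;

-- Step 3: create the extension in the configured database.
CREATE EXTENSION pg_durable;

-- Optional check that the extension is registered.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_durable';
```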

The default database for pg_durable’s background worker is postgres, but this is configurable via the pg_durable.worker_role GUC.

Security Model

pg_durable implements row-level security (RLS) to ensure that each user can only see and manage their own durable function instances and nodes. The extension does not grant any privileges to PUBLIC by default. Administrators must explicitly grant access:

SELECT df.grant_usage('app_role');

For teams with multiple app roles, the recommended pattern is to create an indirection role:

Create a shared role like pg_durable_user with NOLOGIN.
Grant usage to that role via df.grant_usage('pg_durable_user').
Grant membership in pg_durable_user to individual app roles like app_backend and etl_service.
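The indirection-role pattern above can be sketched in SQL like this, assuming the app roles app_backend and etl_service already exist:

```sql
-- Shared role that cannot log in; it only carries privileges.
CREATE ROLE pg_durable_user NOLOGIN;

-- Grant pg_durable access to the shared role once.
SELECT df.grant_usage('pg_durable_user');

-- App roles inherit access through membership.
GRANT pg_durable_user TO app_backend, etl_service;
```

Adding another application later only requires one more GRANT of membership, not a new df.grant_usage() call.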

The background worker role (default: postgres) must be a superuser because it bypasses RLS to manage all users’ instances. Regular users get SELECT and INSERT on df.instances and df.nodes, plus column-level UPDATE on status and timestamps for df.cancel(). The submitted_by identity column is not user-modifiable.

One important operational note: after upgrading pg_durable with ALTER EXTENSION pg_durable UPDATE, administrators must re-run df.grant_usage('role') because GRANT EXECUTE ON ALL FUNCTIONS only applies to functions that existed at grant time.
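A minimal sketch of that upgrade sequence, assuming access was granted through the shared pg_durable_user role described earlier:

```sql
-- Upgrade the extension to the newest installed version.
ALTER EXTENSION pg_durable UPDATE;

-- Re-apply grants so that functions added by the upgrade are covered.
SELECT df.grant_usage('pg_durable_user');

-- Confirm the version now in use.
SELECT extversion
FROM pg_extension
WHERE extname = 'pg_durable';
```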

Limitations and When Not to Use pg_durable

pg_durable is not a universal orchestrator. The project documentation is transparent about its constraints, and teams evaluating it should weigh these carefully.

SQL-shaped only. The extension is designed for workflows that can be expressed as SQL steps. If a step requires arbitrary code, a non-HTTP SDK, or complex in-memory control flow, it does not fit naturally. Teams can work around this by wrapping logic in SQL functions or exposing it behind an HTTP endpoint for df.http(), but at that point the simplicity advantage erodes.

Not for sub-millisecond synchronous requests. pg_durable is built for durable background execution, not for request-response paths with tight latency budgets. The checkpointing overhead between steps adds latency that is acceptable for ingestion and batch jobs but unsuitable for API endpoints.

Requires extension installation. The extension needs shared_preload_libraries and a background worker. Some managed PostgreSQL environments do not allow custom extensions or background workers. Teams using fully managed database-as-a-service offerings should check compatibility before committing.
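One way to check compatibility on an existing server is to query PostgreSQL's standard catalog views. This sketch shows only whether the extension files are present and what is currently preloaded. Whether a managed provider permits changing shared_preload_libraries still has to be confirmed with that provider.

```sql
-- Is pg_durable available to CREATE EXTENSION on this server?
-- No rows means the extension files are not installed.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pg_durable';

-- Which libraries are currently preloaded?
SHOW shared_preload_libraries;
```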

Heterogeneous workflows. If a workflow spans many systems outside PostgreSQL, the benefit of in-database execution diminishes. An orchestrator like Temporal or Airflow that coordinates across databases, object stores, message queues, and SaaS APIs is a better fit for those architectures.

Secrets management. The df.vars system uses per-user scoping via RLS, but the project documentation explicitly warns: “Avoid storing secrets in plain text.” Teams that need to pass API keys or credentials through workflow steps should integrate with a secrets manager rather than relying on pg_durable’s variable store.

Key Takeaways

Microsoft open sourced pg_durable, a PostgreSQL extension that brings in-database durable execution to SQL workflows with checkpointed, crash-resilient step execution.
The extension eliminates the need for external orchestration tools like Airflow or Temporal for workflows that operate primarily on data inside PostgreSQL, reducing operational complexity and latency.
pg_durable supports parallel execution, retries, conditional branching, and multi-user security via row-level security, all defined in SQL with no additional infrastructure.
Best suited for AI embedding pipelines, ETL batches, scheduled maintenance runbooks, and fan-out aggregation workloads. Less suited for heterogeneous multi-system workflows or sub-millisecond synchronous request handling.
Available now as open source on GitHub for PostgreSQL 17 and 18, with prebuilt Debian packages and Docker support.

Microsoft’s open-sourcing of pg_durable signals that durable execution is becoming a standard database capability, not a separate infrastructure layer. For the large installed base of PostgreSQL teams, this extension offers a path to simpler, more reliable background workflows without adding new services to the stack. The project is available now at github.com/microsoft/pg_durable under an open-source license.
