Why the Outbox Pattern Exists
Reliable event pipelines often fail in a very specific place: the boundary between a PostgreSQL transaction and an external side effect such as publishing to Kafka, enqueueing to a queue, or calling a webhook. If you update the database and then publish an event as a separate step, you can end up with a committed state change but no event (publish failed), or an event without a committed state change (publish succeeded but the transaction rolled back). The outbox pattern removes this gap by making “the event to publish” part of the same PostgreSQL transaction as the business change.
The core idea is to write an outbox record (an event row) into PostgreSQL in the same transaction as the domain update. A separate relay process later reads those outbox rows and publishes them to the external system. Because the outbox row is committed atomically with the domain change, the relay can crash, restart, and still publish everything that was committed.
Logical replication (logical decoding) complements the outbox pattern by providing a streaming mechanism from PostgreSQL’s WAL (Write-Ahead Log). Instead of polling the outbox table, you can stream committed inserts from the outbox via a replication slot. This typically reduces load, improves latency, and gives you a durable resume position (the slot’s confirmed LSN) for recovery.
Two Implementation Styles: Polling vs Logical Replication
Outbox + Polling Relay
The relay periodically queries the outbox table for unpublished events, claims a batch, publishes them, then marks them as published. This approach is simple and works without replication privileges. It does, however, create extra database work: repeated queries, row locking, and updates to mark published rows.
Outbox + Logical Replication (CDC-style)
The application inserts outbox rows (ideally append-only). A connector consumes those inserts from WAL using a logical replication slot and publishes them to the broker. The connector advances the slot only after successful delivery, so restarts resume from the last confirmed position. This approach is efficient and low-latency, but requires PostgreSQL logical replication configuration and careful monitoring of slot lag and WAL retention.
Outbox Table Design as an Integration Contract
Think of the outbox table as a contract between your application and the pipeline. It should be stable, easy to insert into, and easy to consume from. A practical schema:
```sql
CREATE TABLE outbox_events (
    id               bigserial PRIMARY KEY,
    aggregate_type   text NOT NULL,
    aggregate_id     text NOT NULL,
    event_type       text NOT NULL,
    payload          jsonb NOT NULL,
    headers          jsonb NOT NULL DEFAULT '{}'::jsonb,
    occurred_at      timestamptz NOT NULL DEFAULT now(),
    trace_id         text,
    tenant_id        text,
    -- polling-oriented fields (optional if using logical replication only)
    published_at     timestamptz,
    publish_attempts integer NOT NULL DEFAULT 0,
    next_attempt_at  timestamptz
);
```

Practical guidance for these columns:
- aggregate_type and aggregate_id help downstream routing and partitioning decisions (for example, message key = aggregate_id).
- event_type is a stable name consumers can subscribe to (for example, OrderCreated).
- payload is the event body; keep it self-contained enough for consumers to act.
- headers is where you put schema version, content type, correlation identifiers, or routing hints.
- occurred_at is the event creation timestamp inside the transaction (useful for observability and debugging).
- published_at and the retry fields are needed for polling; with logical replication you can often avoid updating rows entirely.
Step-by-Step: Writing Domain Changes and Outbox Events Atomically
The application must insert into the outbox in the same database transaction as the domain write. Example: create an order and emit OrderCreated.
```sql
BEGIN;

INSERT INTO orders (order_id, customer_id, status, total_amount, created_at)
VALUES ($1, $2, 'created', $3, now());

INSERT INTO outbox_events
    (aggregate_type, aggregate_id, event_type, payload, headers, trace_id, tenant_id)
VALUES (
    'order',
    $1,
    'OrderCreated',
    jsonb_build_object(
        'order_id', $1,
        'customer_id', $2,
        'status', 'created',
        'total_amount', $3,
        'created_at', now()
    ),
    jsonb_build_object(
        'schema_version', 1,
        'content_type', 'application/json'
    ),
    $4,
    $5
);

COMMIT;
```

What this buys you operationally:
- If the transaction commits, the outbox row exists and will eventually be published.
- If the transaction rolls back, there is no outbox row to publish.
- The relay can be restarted without losing committed events.
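The commit/rollback guarantee can be demonstrated end to end in a few lines of application code. The sketch below uses an in-memory SQLite database purely as a stand-in for PostgreSQL (TEXT columns instead of jsonb/timestamptz, simplified schema); create_order and the simulated failure are illustrative, not part of the schema above.

```python
import json
import sqlite3

# In-memory SQLite stand-in for the PostgreSQL schema (TEXT instead of
# jsonb/timestamptz); the point is the atomicity, not the column types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer_id TEXT, status TEXT);
CREATE TABLE outbox_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_type TEXT, aggregate_id TEXT, event_type TEXT, payload TEXT);
""")

def create_order(conn, order_id, customer_id, fail=False):
    """Insert the domain row and the outbox row in ONE transaction."""
    try:
        conn.execute("INSERT INTO orders VALUES (?, ?, 'created')",
                     (order_id, customer_id))
        conn.execute(
            "INSERT INTO outbox_events "
            "(aggregate_type, aggregate_id, event_type, payload) "
            "VALUES ('order', ?, 'OrderCreated', ?)",
            (order_id, json.dumps({"order_id": order_id,
                                   "customer_id": customer_id})))
        if fail:
            raise RuntimeError("simulated failure before commit")
        conn.commit()
    except Exception:
        conn.rollback()
        raise

create_order(conn, "ORD-1", "C-9")
try:
    create_order(conn, "ORD-2", "C-9", fail=True)  # rolls back both inserts
except RuntimeError:
    pass

# Committed order has its event; the rolled-back order left nothing behind.
rows = conn.execute("SELECT aggregate_id FROM outbox_events").fetchall()
```

Either both rows exist or neither does; there is no code path that leaves a domain change without its event.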
Polling Relay: A Practical, Concurrent-Safe Pattern
A polling relay should do three things repeatedly: claim a batch, publish it, then mark it as published (or schedule a retry). The key is to claim rows in a way that supports multiple workers without double-processing.
Indexes for Polling
For a table that is frequently queried for unpublished rows, add a partial index:
```sql
CREATE INDEX outbox_unpublished_idx
ON outbox_events (id)
WHERE published_at IS NULL;
```

If you use backoff via next_attempt_at:

```sql
CREATE INDEX outbox_retry_idx
ON outbox_events (next_attempt_at, id)
WHERE published_at IS NULL;
```

Claiming Rows with FOR UPDATE SKIP LOCKED
This pattern allows multiple relay instances to share the work safely:
```sql
WITH batch AS (
    SELECT id
    FROM outbox_events
    WHERE published_at IS NULL
      AND (next_attempt_at IS NULL OR next_attempt_at <= now())
    ORDER BY id
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
UPDATE outbox_events o
SET publish_attempts = publish_attempts + 1
FROM batch
WHERE o.id = batch.id
RETURNING o.id, o.aggregate_type, o.aggregate_id, o.event_type,
          o.payload, o.headers, o.trace_id, o.tenant_id, o.occurred_at;
```

Important operational detail: keep this transaction short. The relay should not hold locks while performing network calls. A common approach is:
- Transaction A: claim rows and return them.
- Publish outside the transaction.
- Transaction B: mark published or schedule retry.
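The three-phase cycle can be sketched as a small function with the database and broker interactions injected as callables. All four hooks (claim, publish, mark_published, schedule_retry) are hypothetical names you would back with the SQL in this section and your broker client.

```python
def relay_cycle(claim, publish, mark_published, schedule_retry):
    """One polling cycle: claim a batch (short transaction), publish
    outside any transaction, then record outcomes (second short
    transaction). All four callables are injected hooks."""
    batch = claim()                      # Transaction A: claim + return rows
    succeeded, failed = [], []
    for event in batch:
        try:
            publish(event)               # network call: no DB locks held
            succeeded.append(event["id"])
        except Exception:
            failed.append(event["id"])
    if succeeded:
        mark_published(succeeded)        # Transaction B: set published_at
    if failed:
        schedule_retry(failed)           # Transaction B: set next_attempt_at
    return len(succeeded)
```

Because locks are released when Transaction A commits, a slow broker delays publishing but never blocks other relay workers or application writers.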
Marking Published
```sql
UPDATE outbox_events
SET published_at = now(),
    next_attempt_at = NULL
WHERE id = ANY($1::bigint[]);
```

Scheduling Retries with Backoff
If publishing fails, do not set published_at. Instead, schedule a retry:
```sql
UPDATE outbox_events
SET next_attempt_at = now() + interval '30 seconds'
WHERE id = ANY($1::bigint[]);
```

Practical retry policy:
- Use exponential backoff to avoid hammering a broken broker.
- After N attempts, move the event to a dead-letter table and alert.
- Record the last error message in a separate column if you need debugging (but be mindful of unbounded text growth).
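A minimal backoff helper the relay could use to compute next_attempt_at; the base delay, cap, and jitter values here are illustrative defaults, not recommendations.

```python
import random

def backoff_delay(attempts, base=30.0, cap=3600.0, jitter=0.1):
    """Delay in seconds before the next publish attempt.
    attempts = number of failed attempts so far (>= 1); the delay
    doubles each time up to cap, with +/- jitter so that many failed
    events do not all retry at the same instant."""
    delay = min(cap, base * (2 ** (attempts - 1)))
    spread = delay * jitter
    return delay + random.uniform(-spread, spread)
```

The jitter matters when a broker outage fails thousands of events at once: without it, they all become eligible again in the same second.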
Reducing Table Bloat in Polling Mode
Polling mode typically updates rows (attempt counters, published timestamps), which can create bloat. Common mitigations:
- Delete or archive published rows on a schedule (for example, keep 7–30 days online).
- Partition the outbox by time and drop old partitions (if you already operate partitioning safely).
- Keep the outbox minimal: avoid large payloads if you can store a reference and fetch details elsewhere (but note that fetching details later can reintroduce coupling and failure modes).
Logical Replication: Streaming the Outbox from WAL
Logical replication reads committed changes from PostgreSQL’s WAL and emits them as a change stream. For outbox usage, the cleanest approach is to make the outbox table append-only and stream only INSERT operations. Each insert corresponds to one event to publish.
This approach avoids the “mark published” update entirely. The connector’s progress is tracked by the replication slot position, not by mutating outbox rows.
PostgreSQL Configuration Essentials
Logical decoding requires settings like:
```
-- postgresql.conf (values depend on your environment)
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10
```

You also need a replication-capable role (availability depends on hosting):
```sql
CREATE ROLE outbox_replicator WITH LOGIN REPLICATION PASSWORD '...';
GRANT SELECT ON outbox_events TO outbox_replicator;
```

Step-by-Step: Publication and Slot
1) Create a publication that includes only the outbox table:
```sql
CREATE PUBLICATION outbox_pub FOR TABLE outbox_events;
```

2) Create a logical replication slot. Some connectors create slots automatically; creating it manually helps with visibility:
```sql
SELECT pg_create_logical_replication_slot('outbox_slot', 'pgoutput');
```

3) Run a connector/relay that:
- Connects via the replication protocol.
- Subscribes to the publication.
- Receives inserted outbox rows.
- Publishes them to the broker.
- Acknowledges WAL positions (LSNs) only after successful publish.
The acknowledgment rule is the reliability core: if the connector reads changes but crashes before publishing, it must not acknowledge the LSN. On restart, PostgreSQL will resend from the last confirmed position.
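The acknowledgment rule can be demonstrated with a toy simulation: a list stands in for the WAL stream, a dict for the slot's confirmed position, and a broker that fails once mid-stream. All names here are illustrative; a real connector speaks the replication protocol rather than iterating a list.

```python
def run_connector(wal, slot, publish):
    """Stream changes after the slot's confirmed position; advance the
    slot (acknowledge the LSN) only after a successful publish."""
    for lsn, event in wal:
        if lsn <= slot["confirmed"]:
            continue                 # already acknowledged: skip on replay
        publish(event)               # may raise: then we stop without acking
        slot["confirmed"] = lsn

# Toy WAL stream and a broker that fails once while publishing e3.
wal = [(1, "e1"), (2, "e2"), (3, "e3")]
published = []
failures = {"e3": 1}

def publish(event):
    if failures.get(event, 0) > 0:
        failures[event] -= 1
        raise ConnectionError("broker unavailable")
    published.append(event)

slot = {"confirmed": 0}
try:
    run_connector(wal, slot, publish)    # "crashes" while publishing e3
except ConnectionError:
    pass                                 # slot still points before e3
run_connector(wal, slot, publish)        # restart: e3 is redelivered
```

The crash between publish and acknowledgment yields at-least-once delivery: nothing is lost, but duplicates are possible, which is why consumers need stable identifiers.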
Monitoring Replication Slots to Avoid WAL Accumulation
Replication slots retain WAL until consumers confirm progress. Monitor slot state:
```sql
SELECT slot_name, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'outbox_slot';
```

Operational practices:
- Alert on increasing lag (the difference between the current WAL LSN and confirmed_flush_lsn).
- Alert when a slot is inactive for too long.
- Have a documented procedure for decommissioning consumers: dropping a slot discards the resume position and can cause data loss in the stream if not handled carefully.
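Lag can be computed from the LSN values themselves: an LSN such as 16/B374D848 is a 64-bit position written as two hex halves, so it converts directly to an absolute byte offset. A small helper (function names are ours; the current WAL position comes from pg_current_wal_lsn()):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' (two hex halves of a 64-bit
    WAL position) into an absolute byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def slot_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """How many bytes of WAL the slot's consumer is behind."""
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(confirmed_flush_lsn)
```

Feeding this number into a time-series alert gives you early warning before retained WAL fills the disk.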
Append-Only Outbox Enforcement (Optional)
If you commit to logical replication and want a clean stream, you can enforce that the outbox table is append-only (no updates/deletes). This prevents accidental “mark published” updates that would appear in the change stream.
```sql
CREATE OR REPLACE FUNCTION prevent_outbox_update_delete()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION 'outbox_events is append-only';
END;
$$;

CREATE TRIGGER outbox_no_update_delete
BEFORE UPDATE OR DELETE ON outbox_events
FOR EACH ROW EXECUTE FUNCTION prevent_outbox_update_delete();
```

Use this only if you have an alternative retention strategy (for example, partition dropping) and you are confident you will not need to mutate outbox rows for operational reasons.
Event Shaping: Turning Outbox Rows into Broker Messages
Whether you poll or stream via logical replication, you still need a consistent message format. A practical approach is to publish a clean envelope that is independent of the CDC tool’s internal representation. For example:
```json
{
  "event_type": "OrderCreated",
  "aggregate_type": "order",
  "aggregate_id": "ORD-123",
  "occurred_at": "2026-01-11T10:15:00Z",
  "trace_id": "c9f1...",
  "tenant_id": "t-42",
  "schema_version": 1,
  "payload": {
    "order_id": "ORD-123",
    "customer_id": "C-9",
    "status": "created",
    "total_amount": "49.90"
  }
}
```

Mapping rules that work well in practice:
- Topic selection: route by aggregate_type (for example, order.events) or by event_type (for example, an OrderCreated topic). Prefer fewer topics with clear filtering rules unless you have strong operational reasons to split.
- Message key: use aggregate_id so all events for an aggregate land in the same partition in partitioned brokers.
- Headers: copy trace_id, tenant_id, and schema_version for observability and compatibility.
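These mapping rules can be collected into one function in the relay. The sketch below assumes the input is a claimed outbox row as a dict; to_message and the aggregate_type-based topic naming are illustrative choices, not a fixed convention.

```python
import json

def to_message(row):
    """Map one claimed outbox row (as a dict) to a broker message."""
    headers = dict(row.get("headers") or {})
    headers.update({
        "trace_id": row.get("trace_id"),
        "tenant_id": row.get("tenant_id"),
        "outbox_id": row["id"],          # stable id for consumer-side dedup
    })
    return {
        "topic": f"{row['aggregate_type']}.events",  # one topic per aggregate type
        "key": row["aggregate_id"],      # same aggregate -> same partition
        "headers": headers,
        "value": json.dumps({
            "event_type": row["event_type"],
            "aggregate_type": row["aggregate_type"],
            "aggregate_id": row["aggregate_id"],
            "occurred_at": row["occurred_at"],
            "schema_version": headers.get("schema_version"),
            "payload": row["payload"],
        }),
    }
```

Keeping this mapping in one place is what decouples your published contract from the CDC tool's internal change format.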
Handling Multi-Event Transactions and Atomic Visibility
Applications sometimes write multiple outbox rows in one transaction (for example, OrderCreated plus InventoryReserved). Both polling and logical replication can handle this, but there are differences in how you should think about it:
- Polling: if you claim rows by id, you may publish them in the same order they were inserted, but you are not inherently aware of transaction boundaries unless you store a transaction identifier in the outbox row. If atomic “publish as a unit” matters, add a tx_id column (application-generated) and publish grouped by it.
- Logical replication: changes are emitted in commit order and typically grouped by transaction. A connector can preserve transaction boundaries or flatten them depending on configuration. If you need “all or nothing” publication for a group, ensure your connector supports transaction metadata or that your downstream processing can tolerate partial visibility.
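With a tx_id column in place, grouping a claimed batch in the polling relay is a small step. One caution motivates the shape of this sketch: bigserial ids from concurrent transactions can interleave, so keying on tx_id is safer than grouping consecutive rows.

```python
def group_by_transaction(rows):
    """Group claimed outbox rows into per-transaction batches.
    Keyed on tx_id because ids from concurrent transactions can
    interleave; insertion order within each group is preserved."""
    groups = {}
    for row in rows:
        groups.setdefault(row["tx_id"], []).append(row)
    return list(groups.values())
```

Each resulting batch can then be published as a unit, or at least with a shared transaction marker in the message headers.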
Operational Tradeoffs: When to Prefer Each Approach
Prefer Polling When
- You cannot enable logical replication (permissions, hosting restrictions).
- You want the simplest operational model and can accept slightly higher database load.
- You need fine-grained retry scheduling and dead-letter handling directly in SQL.
Prefer Logical Replication When
- You need low-latency streaming with minimal database overhead.
- You want a durable resume cursor without updating outbox rows.
- You already operate CDC connectors and have monitoring for replication slots and lag.
Common Failure Modes and How to Design for Them
Relay Publishes but Crashes Before Marking Published (Polling)
This creates a duplicate publish risk because the row remains unpublished. The standard mitigation is to make the publish operation safe to retry at the broker/consumer side and to include stable identifiers in the message (for example, outbox id as a message header). Even if consumers handle duplicates, you should still aim to minimize them by marking published promptly and using small batches.
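On the consumer side, that stable identifier enables idempotent handling. A sketch, using an in-memory set as a stand-in for a durable processed-ids store (in production this would be a table keyed by the outbox id, checked and written in the same transaction as the consumer's own state changes):

```python
def make_handler(process):
    """Wrap a consumer callback so each outbox id is processed once,
    no matter how many times the broker delivers the message."""
    seen = set()  # stand-in for a durable processed_ids table

    def handle(message):
        key = message["headers"]["outbox_id"]
        if key in seen:
            return False                 # duplicate delivery: skip
        process(message)
        seen.add(key)
        return True

    return handle
```

The in-memory set illustrates the idea but does not survive consumer restarts; only a durable store gives exactly-once processing on top of at-least-once delivery.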
Connector Stops Consuming a Replication Slot (Logical Replication)
WAL accumulates. Mitigate with monitoring and alerting on lag, and ensure connectors are deployed with high availability. If you must rebuild a connector, plan how to recover the stream position (for example, by keeping the slot and reusing it, or by replaying from a known point if you intentionally drop it).
Schema Changes to the Outbox Table
Because the outbox is an integration contract, treat schema changes carefully:
- Additive changes (new nullable columns) are usually safe.
- Renames and type changes can break connectors or downstream transforms.
- If using logical replication, ensure the publication and connector handle the new columns as expected.
Practical Checklist for Production Readiness
- Outbox inserts are in the same transaction as domain changes.
- Outbox rows contain routing metadata (aggregate_type/id, event_type) and observability metadata (trace_id, tenant_id).
- Polling mode: FOR UPDATE SKIP LOCKED claiming, retry backoff, a dead-letter strategy, and a cleanup/retention plan.
- Logical replication mode: wal_level = logical, a publication limited to the outbox table, replication slot monitoring, and a plan for connector restarts and slot lifecycle.
- Message format is stable and decoupled from CDC envelopes; include schema version and stable identifiers.