Re: Assertion failure in SnapBuildInitialSnapshot() - Mailing list pgsql-hackers

From Pradeep Kumar
Subject Re: Assertion failure in SnapBuildInitialSnapshot()
Date
Msg-id CAJ4xhP=6h4RrwWpWSaJY4KkrJFCMUZTTGuM53=3wCMqMTBjqKQ@mail.gmail.com
In response to Re: Assertion failure in SnapBuildInitialSnapshot()  (Alexander Lakhin <exclusion@gmail.com>)
List pgsql-hackers
Hi All,
The fix_concurrent_slot_xmin_update.patch proposed in this thread is meant to resolve this assertion failure. With that patch applied, executing pg_sync_replication_slots() (which calls SyncReplicationSlots → synchronize_slots() → synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in ReplicationSlotsComputeRequiredXmin(), because ReplicationSlotControlLock is not held in that code path. By default sync_replication_slots is off, so the background slot-sync worker is not spawned; invoking the UDF directly exercises the path without the lock. I have a small patch that acquires ReplicationSlotControlLock in the manual sync path; that stops the assert.

Call stack:
TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.c", Line: 1061, PID: 67056
0   postgres                            0x000000010104aad4 ExceptionalCondition + 216
1   postgres                            0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
2   postgres                            0x0000000100d6fba8 synchronize_one_slot + 1488
3   postgres                            0x0000000100d6e8cc synchronize_slots + 1480
4   postgres                            0x0000000100d6efe4 SyncReplicationSlots + 164
5   postgres                            0x0000000100d8da84 pg_sync_replication_slots + 476
6   postgres                            0x0000000100b34c58 ExecInterpExpr + 2388
7   postgres                            0x0000000100b33ee8 ExecInterpExprStillValid + 76
8   postgres                            0x00000001008acd5c ExecEvalExprSwitchContext + 64
9   postgres                            0x0000000100b54d48 ExecProject + 76
10  postgres                            0x0000000100b925d4 ExecResult + 312
11  postgres                            0x0000000100b5083c ExecProcNodeFirst + 92
12  postgres                            0x0000000100b48b88 ExecProcNode + 60
13  postgres                            0x0000000100b44410 ExecutePlan + 184
14  postgres                            0x0000000100b442dc standard_ExecutorRun + 644
15  postgres                            0x0000000100b44048 ExecutorRun + 104
16  postgres                            0x0000000100e3053c PortalRunSelect + 308
17  postgres                            0x0000000100e2ff40 PortalRun + 736
18  postgres                            0x0000000100e2b21c exec_simple_query + 1368
19  postgres                            0x0000000100e2a42c PostgresMain + 2508
20  postgres                            0x0000000100e22ce4 BackendInitialize + 0
21  postgres                            0x0000000100d1fd4c postmaster_child_launch + 304
22  postgres                            0x0000000100d26d9c BackendStartup + 448
23  postgres                            0x0000000100d23f18 ServerLoop + 372
24  postgres                            0x0000000100d22f18 PostmasterMain + 6396
25  postgres                            0x0000000100bcffd4 init_locale + 0
26  dyld                                0x0000000186d82b98 start + 6076

The assert is raised inside ReplicationSlotsComputeRequiredXmin() because that function expects either that already_locked is false (and it will acquire what it needs), or that callers already hold both ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the manual-sync path called by the UDF, neither lock is held, so the assertion trips.
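
The precondition can be restated as a small standalone C model. This is only a sketch: the function and parameter names are hypothetical, and it does not use PostgreSQL's LWLock API. It mirrors just the boolean shape of the Assert quoted in the call stack above.

```c
#include <stdbool.h>

/*
 * Toy model (not PostgreSQL code): mirrors the boolean shape of the Assert
 * in ReplicationSlotsComputeRequiredXmin(). All names here are hypothetical.
 */
bool
xmin_precondition_holds(bool already_locked,
                        bool holds_slot_control_exclusive,
                        bool holds_proc_array_exclusive)
{
    /* already_locked == false: the callee acquires what it needs itself. */
    if (!already_locked)
        return true;

    /* already_locked == true: the caller must hold both locks exclusively. */
    return holds_slot_control_exclusive && holds_proc_array_exclusive;
}
```

In this model, a caller that passes already_locked = true while holding neither lock corresponds to xmin_precondition_holds(true, false, false), which evaluates to false.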

Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on) acquires the necessary locks before calling the routines that update/compute slot xmins, so the worker path is safe. The manual path through the SQL-callable UDF does not take the same locks before calling synchronize_slots()/synchronize_one_slot(). As a result, the invariant assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the assert.

Proposed fix:
In synchronize_slots() (the code path used by SyncReplicationSlots()/pg_sync_replication_slots()), acquire ReplicationSlotControlLock before any call that can end up calling ReplicationSlotsComputeRequiredXmin(true).
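
The intended lock discipline can be sketched as a standalone toy model; this is not the actual patch. The ToyLocks struct, the helper names and the take_locks flag are all hypothetical. Because the quoted Assert names both ReplicationSlotControlLock and ProcArrayLock, the model simply takes both before the already_locked = true call; which of them the real code path already holds at that point is not modeled here.

```c
#include <stdbool.h>

/* Toy lock state for one backend (hypothetical; not PostgreSQL's LWLock API). */
typedef struct ToyLocks
{
    bool slot_control_exclusive;   /* stands in for ReplicationSlotControlLock */
    bool proc_array_exclusive;     /* stands in for ProcArrayLock */
} ToyLocks;

/* Same boolean shape as the Assert quoted in the call stack. */
bool
toy_compute_required_xmin_ok(const ToyLocks *held, bool already_locked)
{
    return !already_locked ||
        (held->slot_control_exclusive && held->proc_array_exclusive);
}

/*
 * Models the manual sync path. With take_locks == false nothing is acquired
 * before the already_locked = true call; with take_locks == true both locks
 * are taken first and released afterwards in reverse order.
 */
bool
toy_manual_sync(bool take_locks)
{
    ToyLocks held = {false, false};
    bool ok;

    if (take_locks)
    {
        held.slot_control_exclusive = true;
        held.proc_array_exclusive = true;
    }

    ok = toy_compute_required_xmin_ok(&held, true);

    if (take_locks)
    {
        held.proc_array_exclusive = false;
        held.slot_control_exclusive = false;
    }
    return ok;
}
```

Here toy_manual_sync(false) reports a violated precondition, while toy_manual_sync(true) satisfies it. Holding only one of the two locks is still not enough in this model.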

Thanks and Regards
Pradeep


On Mon, Oct 27, 2025 at 3:09 PM Alexander Lakhin <exclusion@gmail.com> wrote:
Hello,

01.02.2024 21:20, vignesh C wrote:
> The patch which you submitted has been awaiting your attention for
> quite some time now.  As such, we have moved it to "Returned with
> Feedback" and removed it from the reviewing queue. Depending on
> timing, this may be reversible.  Kindly address the feedback you have
> received, and resubmit the patch to the next CommitFest.

While analyzing buildfarm failures, I found [1], which demonstrates the
assertion failure discussed here:
---
031_column_list_publisher.log
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File:
"/home/bf/bf-build/skink/REL_15_STABLE/pgsql.build/../pgsql/src/backend/replication/logical/snapbuild.c", Line: 614,
PID: 1882382)
---

I've managed to reproduce the assertion failure on REL_15_STABLE with the
following modification:
@@ -3928,6 +3928,7 @@ ProcArraySetReplicationSlotXmin(TransactionId xmin, TransactionId catalog_xmin,
  {
      Assert(!already_locked || LWLockHeldByMe(ProcArrayLock));

+pg_usleep(1000);
      if (!already_locked)
          LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

using the script:
numjobs=100
createdb db
export PGDATABASE=db

for ((i=1;i<=100;i++)); do
  echo "iteration $i"

  for ((j=1;j<=numjobs;j++)); do
    echo "
SELECT pg_create_logical_replication_slot('s$j', 'test_decoding');
SELECT txid_current();
" | psql >>/dev/null 2>&1 &

    echo "
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
CREATE_REPLICATION_SLOT slot$j LOGICAL test_decoding USE_SNAPSHOT;
" | psql -d "dbname=db replication=database" >>/dev/null 2>&1 &
  done
  wait

  for ((j=1;j<=numjobs;j++)); do
    echo "
DROP_REPLICATION_SLOT slot$j;
" | psql -d "dbname=db replication=database" >/dev/null

    echo "SELECT pg_drop_replication_slot('s$j');" | psql >/dev/null
  done

  grep 'TRAP' server.log && break;
done

(with
wal_level = logical
max_replication_slots = 200
max_wal_senders = 200
in postgresql.conf)

iteration 18
ERROR:  replication slot "slot13" is active for PID 538431
TRAP: FailedAssertion("TransactionIdPrecedesOrEquals(safeXid, snap->xmin)", File: "snapbuild.c", Line: 614, PID: 538431)


I've also confirmed that fix_concurrent_slot_xmin_update.patch fixes the
issue.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-05-15%2020%3A55%3A17

Best regards,
Alexander



