Hi,
Commit f41d846 [1] introduced a race condition that allows a process
to acquire and continue using an invalidated replication slot, leading
to unexpected behavior.
Example Scenario:
Suppose there is an orphaned logical slot that will be invalidated on
the next checkpoint. Consider the following sequence of events during
ReplicationSlotAcquire() -
- A process (P1) attempts to acquire the slot. It checks the
invalidated flag and sees that the slot is still valid, so it proceeds
to set the s->active_pid to its own PID.
- Before P1 can acquire the spinlock and update s->active_pid =
MyProcPid, the checkpointer process invalidates the slot.
- Now, P1 proceeds to set s->active_pid = MyProcPid and starts using
an already invalidated slot, which should not happen.
Using an invalidated slot should be prevented in the first place. In
my test, this scenario triggered an assertion failure in
CreateDecodingContext() -
```
TRAP: failed Assert("slot->data.invalidated == RS_INVAL_NONE"), File:
"logical.c", Line: 546, PID: 1042884
postgres: nisha postgres [local]
SELECT(ExceptionalCondition+0xbb)[0x649ff7696c82]
postgres: nisha postgres [local]
SELECT(CreateDecodingContext+0x23b)[0x649ff73bf9d3]
postgres: nisha postgres [local] SELECT(+0x61fb71)[0x649ff73c3b71]
postgres: nisha postgres [local]
SELECT(pg_logical_slot_get_changes+0x26)[0x649ff73c3ed6]
...
```
The expected behavior was for ReplicationSlotAcquire() to detect the
invalid slot and return an error instead of allowing its usage.
Attached the patch v1 fixing this issue.
Issue reported by: Hou Zhijie.
[1] https://github.com/postgres/postgres/commit/f41d8468ddea34170fe19fdc17b5a247e7d3ac78
--
Thanks,
Nisha