BUG #18961: Race scenario where max_standby_streaming_delay is not honored - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #18961: Race scenario where max_standby_streaming_delay is not honored
Date
Msg-id 18961-455d3d9b0225b24b@postgresql.org
Whole thread Raw
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      18961
Logged by:          Anthony Hsu
Email address:      erwaman@gmail.com
PostgreSQL version: 17.5
Operating system:   Linux
Description:

In the current ResolveRecoveryConflictWithBufferPin implementation in
standby.c, I think there's a race scenario where a backend holding a
conflicting buffer pin won't receive a PROCSIG_RECOVERY_CONFLICT_BUFFERPIN
message promptly:
1. Assume max_standby_streaming_delay has expired when the startup process
enters ResolveRecoveryConflictWithBufferPin
2. Assume backend 1 holds a conflicting buffer pin while backend 2 does not
3. Since we are past the standby limit time, the startup process broadcasts
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN here [1] without enabling any timeouts
4. Then the startup process waits to be woken up via
ProcWaitForSignal(WAIT_EVENT_BUFFER_PIN) here [2]
5. Suppose backend 2 receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN first,
sees it does not hold a conflicting buffer pin, and *then* proceeds to pin
the buffer
6. Suppose then backend 1 receives PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
processes interrupts, and cancels itself. During cleanup, in UnpinBuffer(),
it will see the pin count is still > 1 (startup process + backend 2 have it
pinned), so it will NOT wake up the startup process.
7. The startup process will only get woken up once backend 2 unpins the
buffer and the pin count reaches 1 (or some other signal causes the startup
process's latch to be set). Only then will it try to acquire the cleanup
lock again and broadcast another PROCSIG_RECOVERY_CONFLICT_BUFFERPIN message
if it fails to acquire the cleanup lock.
The unexpected behavior is in step (7), where it could be arbitrarily long
until the startup process is woken up again. The expected behavior is that
since max_standby_streaming_delay has already expired, the startup process
should wake up quickly and broadcast another
PROCSIG_RECOVERY_CONFLICT_BUFFERPIN message if there are still conflicting
backends.
I was able to reproduce this scenario with some custom code to control the
execution sequence.
One way to fix this scenario is to just remove the `if` block here [3]
entirely so that we always enable the STANDBY_TIMEOUT and
STANDBY_DEADLOCK_TIMEOUT timeouts.
[1]

https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L805
[2]

https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L842
[3]

https://github.com/postgres/postgres/blob/45c357e0e85d2dffe7af5440806150124a725a01/src/backend/storage/ipc/standby.c#L800-L806


pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #18960: Mistake in test test_simple_pipeline (libpq_pipeline.c)
Next
From: Ashutosh Bapat
Date:
Subject: Re: "unexpected duplicate for tablespace" problem in logical replication