On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
> If the WAL segment for the specified restart_lsn (STOP_LSN of the backup)
> exists, then the function will create a physical replication slot and will
> keep all the WAL segments required by the replica to catch up with the
> primary. Otherwise, it returns error, which means that the required WAL
> segments have been already utilised, so we do need to take a new backup.
> Without passing this newly added parameter
> pg_create_physical_replication_slot() works as before.
>
> What do you think about this?
I think that this was discussed in the past (perhaps one of the
threads related to WAL advancing actually?), and this stuff is full of
holes when it comes to think about error handling with checkpoints
running in parallel, potentially doing recycling of segments you would
expect to be around based on your input value for restart_lsn *while*
pg_create_physical_replication_slot() is still running and
manipulating the on-disk slot information. I suspect that this also
breaks a couple of assumptions behind concurrent calls of the minimum
LSN calculated across slots when a caller sees fit to recompute the
thresholds (WAL senders mainly here, depending on the replication
activity).
--
Michael