On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> wrasse has just failed with what looks like a timing error with a
> replication slot drop:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
>
> Here is the error:
> error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication
> slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is
> active for PID 1641'
>
> It seems to me that this just lacks a poll_query_until() doing some
> slot monitoring?
>
I think it is showing a race condition in the code. In
DropSubscription, we first stop the worker that is receiving the WAL,
and then, over a separate connection to the publisher, we try to drop
the slot, which leads to this error. The reason is that the walsender
is still active, as we only wait for the wal receiver (or apply
worker) to stop. Normally, as soon as the apply worker is stopped, the
walsender detects it and exits, but in this case it took some time to
exit, and in the meantime we tried to drop the slot, which was still
in use by the walsender.

If we want to fix this, we might want to wait till the slot is no
longer active on the publisher before trying to drop it, but I am not
sure that is a good idea. In the worst case, if the user retries the
operation (DROP SUBSCRIPTION), it will succeed.
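
For illustration only, waiting for the walsender to release the slot
could look roughly like the sketch below. It is Python rather than the
C in DropSubscription, and the function names, timeout, and polling
interval are all made up here. In a real setup, slot_is_active would
presumably run something like
SELECT active FROM pg_replication_slots WHERE slot_name = 'tap_sub'
on the publisher; the sketch fakes that with a counter.

```python
import time
from typing import Callable


def wait_until_slot_inactive(
    slot_is_active: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.1,
) -> bool:
    """Poll until slot_is_active() returns False or the timeout expires.

    Returns True if the slot became inactive in time, False otherwise.
    Hypothetical helper, not an existing PostgreSQL function.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if not slot_is_active():
            return True  # walsender has exited; safe to drop the slot
        if time.monotonic() >= deadline:
            return False  # still in use; caller can report the error
        time.sleep(poll_interval_s)


def make_fake_slot(polls_until_exit: int) -> Callable[[], bool]:
    """Simulate a walsender that releases the slot after a few polls."""
    remaining = polls_until_exit

    def slot_is_active() -> bool:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return True
        return False

    return slot_is_active


if __name__ == "__main__":
    # The fake walsender lingers for three polls, then releases the slot.
    ok = wait_until_slot_inactive(make_fake_slot(3), timeout_s=1.0,
                                  poll_interval_s=0.01)
    print("slot released in time:", ok)
```

The timeout keeps the caller from blocking indefinitely on a walsender
that never exits; on timeout the caller would still get the existing
"is active for PID" error and could retry.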
--
With Regards,
Amit Kapila.